The Multiple Faces of ‘Feature Importance’ in XGBoost

Amjad Abu-Rmileh
February 8, 2019

(Photo courtesy: www.timeshighereducation.com)

Be careful when interpreting feature importance in XGBoost: the ‘feature importance’ results can be misleading!

This post gives a quick example of why it is important to understand your data and not to use feature importance results blindly: the default ‘feature importance’ produced by XGBoost might not be what you are looking for.

The figure shows the significant difference between the importance values assigned to the same features by different importance metrics.

Assuming that you’re fitting an XGBoost model for a classification problem, an importance matrix will be produced. The importance matrix is a table whose first column lists the names of all the features actually used in the boosted trees; the other columns hold the ‘importance’ values calculated with different importance metrics [3]:

“The Gain implies the relative contribution of the corresponding feature to the model calculated by taking each feature’s contribution for each tree in the model. A higher value of this metric when compared to another feature implies it is more important for generating a prediction.

The Coverage metric means the relative number of observations related to this feature. For example, if you have 100 observations, 4 features and 3 trees, and suppose feature1 is used to decide the leaf node for 10, 5, and 2 observations in tree1, tree2 and tree3 respectively; then the metric will count cover for this feature as 10+5+2 = 17 observations. This will be calculated for all the 4 features and the cover will be 17 expressed as a percentage for all features’ cover metrics.

The Frequency (R)/Weight (python) is the percentage representing the relative number of times a particular feature occurs in the trees of the model. In the above example, if feature1 occurred in 2 splits, 1 split and 3 splits in each of tree1, tree2 and tree3; then the weight for feature1 will be 2+1+3 = 6. The frequency for feature1 is calculated as its percentage weight over weights of all features.

The Gain is the most relevant attribute to interpret the relative importance of each feature.

‘Gain’ is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on a feature X to the branch there was some wrongly classified elements, after adding the split on this feature, there are two new branches, and each of these branch is more accurate (one branch saying if your observation is on this branch then it should be classified as 1, and the other branch saying the exact opposite).

‘Coverage’ measures the relative quantity of observations concerned by a feature.”[3]
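To make the arithmetic in these definitions concrete, here is a minimal sketch in plain Python. The split records and their numbers are made up for illustration (they do not come from a real model), and the helper names `importance` and `normalize` are hypothetical. It follows the Python naming, where 'gain' and 'cover' are averages per split and 'total_gain' and 'total_cover' are sums; the relative (percentage) form described in the quote corresponds to normalizing a metric so that it sums to 1 over all features.

```python
from collections import defaultdict

# Hypothetical split records collected over all trees: (feature, gain, cover).
# The values are invented for illustration only.
splits = [
    ("feature1", 120.0, 100.0),
    ("feature2", 10.0, 60.0),
    ("feature2", 6.0, 40.0),
    ("feature1", 90.0, 100.0),
    ("feature2", 8.0, 55.0),
    ("feature2", 4.0, 45.0),
    ("feature2", 2.0, 20.0),
]


def importance(split_records):
    """Aggregate per-split records into the five importance flavours."""
    weight = defaultdict(int)
    total_gain = defaultdict(float)
    total_cover = defaultdict(float)
    for feature, gain, cover in split_records:
        weight[feature] += 1           # number of splits using the feature
        total_gain[feature] += gain    # summed gain over those splits
        total_cover[feature] += cover  # summed cover over those splits
    return {
        "weight": dict(weight),
        "total_gain": dict(total_gain),
        "total_cover": dict(total_cover),
        "gain": {f: total_gain[f] / weight[f] for f in weight},
        "cover": {f: total_cover[f] / weight[f] for f in weight},
    }


def normalize(scores):
    """Express each feature's score as a fraction of the total."""
    total = sum(scores.values())
    return {f: v / total for f, v in scores.items()}


imp = importance(splits)
for name, scores in imp.items():
    print(name, scores)
print("relative weight:", normalize(imp["weight"]))
print("relative total gain:", normalize(imp["total_gain"]))
```

In this toy data, feature2 is used in more splits, while feature1 accounts for most of the summed gain, so the two metrics rank the features in opposite order.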

Why is it important to understand your feature importance results?

Suppose that you have a binary feature, say gender, which is highly correlated with your target variable. Furthermore, you observed that the inclusion or removal of this feature from your training set strongly affects the final results. If you investigate the importance given to such a feature by different metrics, you might see some contradictions:

Most likely, the variable gender has a much smaller number of possible values (often only two: male/female) than other predictors in your data. So this binary feature can be used at most once along any root-to-leaf path of a tree (after a split on gender, every observation in a child node has the same value), while, say, age (with a higher number of possible values) might appear much more often at different levels of the trees. Therefore, such a binary feature will tend to get a very low importance based on the frequency/weight metric, but a very high importance based on both the gain and coverage metrics!

A comparison between feature importance calculation in scikit-learn Random Forest (or GradientBoosting) and XGBoost is provided in [1]. Looking into the documentation of the scikit-learn ensembles, the weight/frequency feature importance is not implemented. This might indicate that this type of feature importance is less indicative of the predictive contribution of a feature for the whole model.

So, before using the results from the default feature importance function, which uses the weight/frequency metric, take a few minutes to think about them and make sure they make sense. If they don’t, consider exploring the other available metrics.
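The reason a binary feature gets so few splits can be sketched in a few lines of plain Python. This is a simplified picture of split-candidate enumeration, not XGBoost's actual algorithm; the toy rows and the helper `candidate_thresholds` are hypothetical and exist only for this illustration.

```python
def candidate_thresholds(values):
    """Midpoints between consecutive distinct values: the only places
    where a split on this feature can separate the observations."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


# Hypothetical toy rows: (gender encoded as 0/1, age in years).
rows = [(0, 23), (1, 35), (0, 41), (1, 52), (0, 29), (1, 64)]

# At the root, gender offers one possible split and age offers several.
gender_root = candidate_thresholds([g for g, _ in rows])
age_root = candidate_thresholds([a for _, a in rows])

# After splitting the root on gender < 0.5, gender is constant in the child
# node, so it offers no further split there, while age still does.
left = [r for r in rows if r[0] < 0.5]
gender_left = candidate_thresholds([g for g, _ in left])
age_left = candidate_thresholds([a for _, a in left])

print("root :", len(gender_root), "gender split(s),", len(age_root), "age splits")
print("child:", len(gender_left), "gender split(s),", len(age_left), "age splits")
```

Once gender has been used on a path, it has nothing left to offer further down that path, whereas age can be split again and again. That alone pushes the split count of age up, regardless of how much each split actually improves the model.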

Note: if you are using Python, you can access the different available metrics with a line of code:

# Available importance types: 'weight', 'gain', 'cover', 'total_gain', 'total_cover'
# `model` is assumed to be an already fitted XGBClassifier instance
importance_type = 'gain'
model.get_booster().get_score(importance_type=importance_type)

References:

[1] Feature Importance of Random Forest vs XGBoost. forums.fast.ai

[2] Why is the default value for feature_importance 'weight' in python but R uses 'gain'? · Issue #2706… github.com

[3] How do i interpret the output of XGBoost importance? datascience.stackexchange.com

[4] Explaining Feature Importance by example of a Random Forest. towardsdatascience.com