Feature importance between numerical and categorical features

Let’s assume that we use the default “weight” for the feature importance type. It seems that the plot_importance function is biased against categorical features. My understanding is that XGBoost requires categorical features to go through one-hot encoding. Consequently, each categorical feature is transformed into N sub-categorical features, where N is the number of possible outcomes of that categorical feature.

Then each sub-categorical feature would compete with the rest of the sub-categorical features and with all numerical features. Since the split counts for a categorical feature are spread across its dummies, it is much easier for a numerical feature to get a higher importance ranking, isn’t it?
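To make the expansion concrete, here is a small illustration using pandas.get_dummies as a stand-in for whatever encoder is actually used; the toy data is purely hypothetical:

```python
import pandas as pd

# One categorical feature with 3 possible outcomes...
df = pd.DataFrame({"color": ["red", "green", "blue", "red"],
                   "price": [1.0, 2.5, 3.2, 1.1]})

# ...becomes 3 sub-categorical features after one-hot encoding,
# each of which competes for splits on its own.
encoded = pd.get_dummies(df, columns=["color"])
print(encoded.columns.tolist())
# ['price', 'color_blue', 'color_green', 'color_red']
```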

Can someone please comment on this issue? LightGBM does not require one-hot encoding. Would it be better suited for comparing numerical vs. categorical features?
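For comparison, a minimal sketch of LightGBM’s native categorical handling, assuming the column uses the pandas category dtype; the toy data and parameters are only illustrative:

```python
import lightgbm as lgb
import pandas as pd

df = pd.DataFrame({"color": pd.Categorical(["red", "green", "blue", "red"] * 10),
                   "price": [1.0, 2.5, 3.2, 1.1] * 10})
y = [0, 1, 1, 0] * 10

# "color" is split on directly, so importance is reported for the
# original column rather than for one-hot dummies.
train = lgb.Dataset(df, label=y, categorical_feature=["color"])
model = lgb.train({"objective": "binary", "min_data_in_leaf": 5, "verbose": -1},
                  train, num_boost_round=5)
print(dict(zip(model.feature_name(), model.feature_importance("split"))))
```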

I think you are right. What we can do is set importance_type to “weight” and then add up the frequencies of the sub-categorical features to obtain the frequency of each categorical feature.
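A minimal sketch of that aggregation, assuming the dummy columns carry their parent feature’s name as a prefix (the get_dummies convention, e.g. color_red); the helper name and the prefix matching are just illustrative:

```python
from collections import defaultdict

def aggregated_weight_importance(booster, categorical_features):
    """Sum per-dummy 'weight' scores back into per-feature scores.

    booster: a trained xgboost.Booster (e.g. model.get_booster()).
    categorical_features: original categorical feature names, e.g. ["color"];
    everything else is treated as numerical and kept as-is.
    """
    # 'weight' counts how many times a feature is used to split.
    raw = booster.get_score(importance_type="weight")
    totals = defaultdict(int)
    for name, count in raw.items():
        # Map a dummy column like "color_red" back to "color".
        parent = next((f for f in categorical_features
                       if name.startswith(f + "_")), name)
        totals[parent] += count
    return dict(totals)
```

You can then plot the aggregated dict yourself (e.g. as a bar chart) instead of calling plot_importance on the raw dummy columns.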

I won’t comment on the effectiveness of one-hot encoding here. As for direct handling of categorical data, the XGBoost community has explicitly decided against special handling for categorical features: https://github.com/dmlc/xgboost/issues/1721#issuecomment-311395865. The reason is that handling categorical features without one-hot encoding would require a major code overhaul.