Highly correlated features with different feature importance

hi,

I just found that two features in my dataset have very high correlation (0.99). However, they have very different importance in the final trained model. I’m not sure how this happened.

Thanks.

I’m not an expert in xgboost, but if two features are nearly identical, then using just one of them is sufficient.

For a traditional statistical model like OLS, highly correlated features lead to high variance in the coefficient estimates. One way to deal with this is to add a regularization term, such as an L1 or L2 penalty; with an L1 penalty in particular, often only one of the two correlated features ends up being selected. This does not fully explain, and may even be irrelevant to, how xgboost (a tree-based model) is implemented and produces your observation. My point is that it makes sense for only one of the two features to be “selected” by the model, since that keeps the model simpler.
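Here is a minimal sketch (assuming xgboost and numpy are installed, and using synthetic data since we don’t have yours) that reproduces the effect. One plausible mechanism, which is my assumption rather than something I can confirm from xgboost internals: once a tree has split on one of two near-duplicate features, the other offers almost no additional gain, so split-based importance concentrates on whichever feature the greedy search happened to pick first.

```python
# Sketch: two nearly identical features, very different importances.
# Assumes xgboost and numpy are installed; data and parameters are made up.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
n = 5000

x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # near-duplicate of x1, corr ~ 0.99
y = 3.0 * x1 + rng.normal(size=n)         # target driven by the shared signal

X = np.column_stack([x1, x2])
print("corr(x1, x2) =", np.corrcoef(x1, x2)[0, 1])

model = xgb.XGBRegressor(n_estimators=100, max_depth=3, random_state=0)
model.fit(X, y)

# Importance typically piles up on one of the two twins rather than
# being split evenly between them.
print("importances:", model.feature_importances_)
```

If you run something like this a few times with different seeds, which of the two features "wins" can flip, even though the model’s predictions barely change; the predictions are stable, but the attribution between near-duplicates is not.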