Binary vs Continuous Features for feature splitting

Hi Team,

I have a few binary features which are very important for my learning task, along with >20 continuous features. When I look at the feature importances, the binary features are not making it into the top-k. On debugging, I found that XGBoost splits on the continuous features much more than on the binary ones.


In this video, the presenter claims that the treeExtra repo (from Amazon) addresses this by normalising the split gain by entropy when computing the split.


I would like to know whether this is already part of XGBoost. If not, where in the XGBoost repo should I make these changes to help the binary features rank higher? Thank you.

Does your model subsample features? Have you looked at the feature_weights parameter?

Yes, I am able to get the model splits to work based on the feature_weights param.

But is there a way to tune these weights optimally for tree construction? Any heuristics on how to set these feature_weights?

I suggest you give more weight to your categorical features. Perhaps start by making the total weight equal for the set of categorical features and the set of continuous features. Do this only if you believe your model fails to fit the data because it misses these very important categorical features.
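
As a sketch of that heuristic (the 4-binary / 21-continuous layout is a made-up example, and the xgboost call in the comment assumes the Python API's per-column feature_weights on the DMatrix):

```python
import numpy as np

# Hypothetical layout: the first 4 columns are binary, the next 21 continuous.
n_bin, n_cont = 4, 21
weights = np.ones(n_bin + n_cont)

# Give the binary block the same total sampling mass as the continuous block,
# so colsample_* draws from the binary set about as often as from the
# continuous set overall.
weights[:n_bin] = n_cont / n_bin

# Then hand the weights to XGBoost, e.g. (Python API):
#   dtrain = xgboost.DMatrix(X, label=y, feature_weights=weights)

print(weights[:n_bin].sum(), weights[n_bin:].sum())  # both blocks sum to 21.0
```

The weights only matter when some colsample_* parameter is below 1; with no column subsampling they have no effect.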

Also, be aware that trees tend to split on a variable in proportion to its complexity: discrete features roughly in proportion to their cardinality, and continuous features even more so.

Thank you - I will try this out.

Are you working on ranking problems? If you have few product variables relative to the total number of variables, you may want to turn column subsampling off, or reduce it by a lot.

Yes, I am working on a ranking task. Thanks for the insight; I will tune the column subsampling. Any intuition as to why this needs to be done?

Also, for this ranking task I find rank:pairwise works better in terms of NDCG than rank:ndcg. Any thoughts on why rank:ndcg is not performing as well as pairwise?

Your boosting machine needs product features to be present in a tree for its score to be relevant for sorting. Otherwise the tree uses only context or user variables, and its output is constant within each search context.

Your model also needs two variables to co-occur in a tree for it to learn the relationship between them. So for the boosting machine to learn the interaction between product and user, it needs many trees that contain both product and user variables. Aggressive column subsampling (a low colsample value) prevents this from happening.
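
To make that concrete, here is a back-of-the-envelope calculation (pure Python, with made-up counts: 3 product features out of 25 total) of the chance that a single uniform column sample contains at least one product and at least one non-product feature:

```python
from math import comb

def p_cooccur(n_total, n_product, n_sampled):
    """P(a uniform sample of n_sampled columns contains at least one product
    column and at least one non-product column)."""
    n_other = n_total - n_product
    total = comb(n_total, n_sampled)
    p_no_product = comb(n_other, n_sampled) / total
    p_no_other = comb(n_product, n_sampled) / total
    # For n_sampled >= 1, "no product" and "no other" cannot happen together.
    return 1.0 - p_no_product - p_no_other

# 3 product features among 25; compare no subsampling vs heavy subsampling.
print(p_cooccur(25, 3, 25))  # colsample = 1.0 -> 1.0, every tree sees both
print(p_cooccur(25, 3, 5))   # colsample = 0.2 -> ~0.50, half the trees miss one
```

This ignores feature_weights and treats sampling as per tree; it is only meant to show how quickly co-occurrence drops as the sampled fraction shrinks.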

XGBoost, through feature_weights, allows you to tell colsample how to draw from the distribution of features, letting you balance the probabilities.

That being said, this path robs you of a regularization method. If your data has many irrelevant features and you cannot afford feature engineering, you will be in trouble. Be sure to experiment on your data with a hold-out set.

Thank you for the very clear response! Also, for the ranking task I read that XGBoost only supports relevance labels up to ~30 or 31. Can you refer me to the code where this is computed? 2^rel - 1? Thanks

I’m not familiar with the issue. If it’s not clearly documented, you can look at the C++ source code.
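
For what it’s worth, a cap around 31 is consistent with the usual exponential NDCG gain, 2^rel - 1, hitting the 32-bit integer limit; that this is the actual reason in the implementation is an assumption worth confirming against the source:

```python
def ndcg_gain(rel):
    # Exponential relevance gain commonly used in (N)DCG.
    return 2 ** rel - 1

INT32_MAX = 2 ** 31 - 1
print(ndcg_gain(31) == INT32_MAX)   # rel = 31 lands exactly on int32 max
print(ndcg_gain(32) > INT32_MAX)    # rel = 32 would overflow a 32-bit int
```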

Revisiting this discussion: if I have two sets of features, one all continuous and the other all discrete, would you suggest a cascade model, where the first model uses only the continuous features and the second uses the output of the first plus all the discrete features? Any thoughts on this? The idea is that the second model then has more chance to split on the discrete features.
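
A minimal sketch of that cascade idea, with ordinary least squares standing in for the two XGBoost models (the synthetic data and the 5-continuous / 3-discrete split are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X_cont = rng.normal(size=(n, 5))          # continuous block
X_disc = rng.integers(0, 2, size=(n, 3))  # discrete (binary) block
y = X_cont[:, 0] + 2.0 * X_disc[:, 0] + rng.normal(scale=0.1, size=n)

def fit(X, y):
    # Stand-in for training an XGBoost model: least squares with a bias term.
    Xb = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return coef

def predict(coef, X):
    Xb = np.column_stack([X, np.ones(len(X))])
    return Xb @ coef

# Stage 1: continuous features only.
coef1 = fit(X_cont, y)
score1 = predict(coef1, X_cont)

# Stage 2: stage-1 score plus the discrete features, so the discrete block
# no longer competes with all the continuous columns.
X_stage2 = np.column_stack([score1, X_disc])
coef2 = fit(X_stage2, y)
score2 = predict(coef2, X_stage2)

print(np.mean((y - score1) ** 2))  # stage-1 error: misses the discrete signal
print(np.mean((y - score2) ** 2))  # stage-2 error: smaller once X_disc is used
```

Whether a cascade actually beats reweighting the features inside a single model is an empirical question; the earlier caveat about validating on a hold-out set applies here too.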