Binary vs Continuous Features for feature splitting

Hi Team,

I have a few binary features which are very important for my learning task, along with more than 20 continuous features. When I look at the feature importance, the binary features do not make it into the top-k. On debugging, I found that XGBoost splits on the continuous features far more often than on the binary ones.

ref: https://www.youtube.com/watch?v=NLrhmn-EZ88&t=633s

In this video, the presenter claims that the TreeExtra repo (from Amazon) corrects for this by normalising by entropy while computing the split.

code: https://github.com/dariasor/TreeExtra/commit/2be1601657d01ebe4017f43ac957d84dbf901f20

I would like to know whether this is already part of XGBoost. If not, where in the XGBoost repo should I make these changes to help the binary features rank higher? Thank you.
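For reference, this is the kind of check I am running (shown here on toy stand-in data rather than my real dataset; importance_type="weight" counts splits, while "gain" and "total_gain" weigh splits by their loss reduction, so a feature with few but important splits can look very different across the types):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)

# Toy stand-in data: 3 binary features followed by 20 continuous ones.
n = 2000
X = np.hstack([rng.integers(0, 2, size=(n, 3)),
               rng.normal(size=(n, 20))])
y = rng.normal(size=n)

booster = xgb.train({"tree_method": "hist"},
                    xgb.DMatrix(X, label=y),
                    num_boost_round=50)

# "weight" counts splits; "gain"/"total_gain" weigh splits by their
# loss reduction. A feature with few but important splits can rank low
# on "weight" and still rank high on the gain-based types.
for imp_type in ("weight", "gain", "total_gain"):
    print(imp_type, booster.get_score(importance_type=imp_type))
```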

Does your model subsample features? Have you looked at the feature_weights parameter?

Yes, I am able to influence the model's splits using the feature_weights param.

But is there a way to tune these weights optimally for tree construction? Any heuristics on how to set these feature_weights?

I suggest you give more weight to your categorical features. Perhaps start by giving the set of categorical features and the set of continuous features equal total weight. Do this only if you believe your model fails to fit the data because it misses these very important categorical features.

Also, be aware that trees tend to split on a variable in proportion to its complexity: discrete features roughly in proportion to their cardinality, and continuous features even more often.
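As a concrete starting point for the equal-weight idea, something like this (a rough sketch on toy data standing in for yours; as far as I know, feature_weights is set on the DMatrix via set_info and only has an effect when one of the colsample parameters is below 1):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)

# Toy stand-in: 3 binary + 20 continuous features, as in the question.
n, n_bin, n_cont = 2000, 3, 20
X = np.hstack([rng.integers(0, 2, size=(n, n_bin)),
               rng.normal(size=(n, n_cont))])
y = rng.normal(size=n)

binary_cols = list(range(n_bin))
continuous_cols = list(range(n_bin, n_bin + n_cont))

# Equal total weight per group: each binary column gets 1/3, each
# continuous column gets 1/20, so both groups are equally likely to be
# drawn when columns are subsampled.
fw = np.empty(n_bin + n_cont)
fw[binary_cols] = 1.0 / len(binary_cols)
fw[continuous_cols] = 1.0 / len(continuous_cols)

dtrain = xgb.DMatrix(X, label=y)
dtrain.set_info(feature_weights=fw)

params = {
    "tree_method": "hist",
    # feature_weights only takes effect when columns are subsampled
    "colsample_bynode": 0.5,
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```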

Thank you - I will try this out.

Are you working on ranking problems? If you have few product variables relative to the total number of variables, you might consider turning column subsampling off, or reducing it by a lot.

Yes, I am working on a ranking task. Thanks for the insight, I will tune the column subsampling. Any intuition as to why this needs to be done?

Also, for this ranking task I find that rank:pairwise performs better in terms of NDCG than rank:ndcg. Any thoughts on why rank:ndcg is not performing as well as pairwise?
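For reference, this is roughly the comparison I am running (a sketch on synthetic queries; evaluating on the training queries here only to keep it short):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)

# Synthetic query structure: 200 queries x 10 documents, graded labels 0-4.
n_queries, docs_per_query, n_feat = 200, 10, 23
n = n_queries * docs_per_query
X = rng.normal(size=(n, n_feat))
y = rng.integers(0, 5, size=n)
qid = np.repeat(np.arange(n_queries), docs_per_query)

for objective in ("rank:pairwise", "rank:ndcg"):
    ranker = xgb.XGBRanker(objective=objective, eval_metric="ndcg@10",
                           n_estimators=100, tree_method="hist")
    # Evaluate on the training queries only to keep the sketch short.
    ranker.fit(X, y, qid=qid,
               eval_set=[(X, y)], eval_qid=[qid], verbose=False)
    print(objective, ranker.evals_result()["validation_0"]["ndcg@10"][-1])
```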

Your boosting machine needs product features to be present in a tree for its score to be relevant for sorting. Otherwise the tree only uses context or user variables, and its score is constant within each search context.

The model also needs two variables to co-occur in a tree for it to learn the relationship between them. So for the boosting machine to learn the relationship between product and user, it needs many trees that contain both product and user variables. Aggressive column subsampling (a low colsample rate) prevents this from happening.

XGBoost, through feature_weights, lets you tell colsample how to draw from the distribution of features, so you can balance the probabilities.

That being said, this path robs you of a regularization method. If your data has many irrelevant features and you cannot afford feature engineering, you will be in trouble. Be sure to experiment on your data with a hold-out set.
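As a rough sketch of that experiment (the product/context split and the 3x weight are placeholders; the point is to compare a baseline against an upweighted, subsampled run on held-out queries, using the feature_weights argument of the scikit-learn fit API):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)

# Toy ranking data: the first 5 columns play the role of "product"
# features, the rest are "context/user" features. Replace with your groups.
n_queries, docs_per_query, n_prod, n_ctx = 300, 10, 5, 18
n, n_feat = n_queries * docs_per_query, n_prod + n_ctx
X = rng.normal(size=(n, n_feat))
y = rng.integers(0, 5, size=n)
qid = np.repeat(np.arange(n_queries), docs_per_query)

# Hold out the last 20% of queries for evaluation.
train = qid < int(0.8 * n_queries)
valid = ~train

# Upweight product features so column subsampling draws them more often.
fw = np.ones(n_feat)
fw[:n_prod] = 3.0  # hypothetical ratio, tune it on the hold-out set

results = {}
for name, weights, colsample in [("baseline", None, 1.0),
                                 ("weighted", fw, 0.5)]:
    ranker = xgb.XGBRanker(objective="rank:pairwise", eval_metric="ndcg@10",
                           n_estimators=200, colsample_bynode=colsample,
                           tree_method="hist")
    ranker.fit(X[train], y[train], qid=qid[train],
               eval_set=[(X[valid], y[valid])], eval_qid=[qid[valid]],
               feature_weights=weights, verbose=False)
    results[name] = ranker.evals_result()["validation_0"]["ndcg@10"][-1]

print(results)
```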

Thank you for the very clear response! Also, for the ranking task I read that XGBoost only supports relevance labels up to ~30 or 31. Can you point me to the code where this is computed? Is it 2^rel - 1? Thanks
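In case it helps, my guess at why the cap would sit around 31, assuming the gain really is 2^rel - 1 computed or cached in 32-bit arithmetic (I have not confirmed this in the source, hence the question):

```python
# If the NDCG gain is 2**rel - 1 (the usual LambdaMART convention) and it
# is computed or cached in 32-bit integers, then rel = 31 is the last
# label whose gain still fits: 2**31 - 1 == INT32_MAX.
INT32_MAX = 2**31 - 1
for rel in (1, 5, 30, 31, 32):
    gain = 2**rel - 1
    print(rel, gain, "fits in int32" if gain <= INT32_MAX else "overflows int32")
```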

I’m not familiar with the issue. If it’s not clearly documented, you can look at the C++ source code: https://github.com/dmlc/xgboost

Revisiting this discussion: if I have two sets of features, one all continuous and the other all discrete, would you suggest a cascade model where the first model uses only the continuous features and the second uses the output of the first plus all the discrete features? Any thoughts on this? The idea is that the second model then has a better chance of splitting on the discrete features.
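Something like this is what I have in mind (a sketch with regressors rather than rankers to keep it short; on real data the stage-1 scores fed into stage 2's training set should come from out-of-fold predictions to avoid leakage):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)

# Toy data: 20 continuous + 3 discrete (binary) features, with a signal
# on the first column of each group.
n, n_cont, n_disc = 2000, 20, 3
X_cont = rng.normal(size=(n, n_cont))
X_disc = rng.integers(0, 2, size=(n, n_disc)).astype(float)
y = X_cont[:, 0] + 2.0 * X_disc[:, 0] + rng.normal(scale=0.1, size=n)

split = int(0.75 * n)

# Stage 1: continuous features only.
stage1 = xgb.XGBRegressor(n_estimators=200, tree_method="hist")
stage1.fit(X_cont[:split], y[:split])

# Stage 2: stage-1 score plus the discrete features. With only 4 inputs,
# the discrete features get a much better shot at the splits.
Z_train = np.column_stack([stage1.predict(X_cont[:split]), X_disc[:split]])
Z_test = np.column_stack([stage1.predict(X_cont[split:]), X_disc[split:]])
stage2 = xgb.XGBRegressor(n_estimators=200, tree_method="hist")
stage2.fit(Z_train, y[:split])

print("stage-2 R^2 on held-out data:", stage2.score(Z_test, y[split:]))
```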