Dealing with multiple missing values in XGBoost

Hello Community,

I am using xgboost for a learning-to-rank task where each example has a blend of product and user features (15 of each, to be precise). The user features are missing for the long tail of search requests, and they are all-or-none: either all 15 are available or all 15 are missing.

With the LTR model built on this 30-dimensional feature vector, the ideal expectation is that the model relies on the user features when they are available and falls back on the product features when they are not. However, after training I see that this is not the case.

I have done the following for training (a rough sketch of the setup follows the list):

  • Set the missing value in the xgb DMatrix API
  • Monotonic constraints on a subset of the user features
  • Hyper-parameter search over colsample_by{node, tree, level}
  • feature_weights to bias column sampling towards the product features
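Roughly, the setup looks like the sketch below (simplified, with synthetic data standing in for the real features; the actual weights, constraint signs and hyper-parameter values differ):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
n_queries, docs_per_query, n_product, n_user = 200, 10, 15, 15
n_rows = n_queries * docs_per_query

# Synthetic stand-in: product features are binary, user features continuous,
# and the user block is entirely missing for the "long-tail" queries.
X = np.hstack([
    rng.integers(0, 2, size=(n_rows, n_product)).astype(float),
    rng.random((n_rows, n_user)),
])
X[: n_rows // 2, n_product:] = np.nan
y = rng.integers(0, 4, size=n_rows)                    # graded relevance labels
qid = np.repeat(np.arange(n_queries), docs_per_query)  # query-group ids

feature_names = [f"prod_{i}" for i in range(n_product)] + [f"user_{i}" for i in range(n_user)]

# (1) missing-value marker set on the DMatrix
dtrain = xgb.DMatrix(X, label=y, qid=qid, missing=np.nan, feature_names=feature_names)
# (4) feature_weights biases the colsample_by* sampling towards the product features
dtrain.set_info(feature_weights=np.array([2.0] * n_product + [1.0] * n_user))

params = {
    "objective": "rank:ndcg",
    # (2) monotonic constraints on the user features (signs are illustrative)
    "monotone_constraints": "(" + ",".join(["0"] * n_product + ["1"] * n_user) + ")",
    # (3) column-sampling knobs that went through hyper-parameter search
    "colsample_bytree": 0.8,
    "colsample_bylevel": 0.8,
    "colsample_bynode": 0.8,
}
booster = xgb.train(params, dtrain, num_boost_round=200)
```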

Is there a way to optimise this in XGBoost? In the cases where the user features are missing, the ranking behaviour seems to be quite random and very suboptimal.

Appreciate your help in this regard, thank you for your time.

"the ranking behaviour seems to be quite random and very suboptimal."

Which version of XGBoost are you using? XGBoost 2.0 has a revamped implementation of learning-to-rank, which should improve the ranking accuracy. See the “Learning to Rank” section of the release notes.
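For reference, a minimal sketch of the 2.0-style setup with the new lambdarank_* parameters (synthetic data, illustrative values; please check the 2.0 documentation for the full list of options):

```python
import numpy as np
import xgboost as xgb  # requires xgboost >= 2.0

rng = np.random.default_rng(0)
X = rng.random((2000, 30))
y = rng.integers(0, 4, size=2000)          # graded relevance labels
qid = np.repeat(np.arange(200), 10)        # rows must be grouped by query id

ranker = xgb.XGBRanker(
    objective="rank:ndcg",
    lambdarank_pair_method="topk",         # construct pairs around the top of each list
    lambdarank_num_pair_per_sample=8,      # pairs sampled per document
    tree_method="hist",
)
ranker.fit(X, y, qid=qid)
scores = ranker.predict(X)
```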

I am currently using xgboost 1.7.6. Yes, I will upgrade to the next version.

I am also interested in understanding how xgboost behaves in such cases, when there are many missing values. Ideally you would expect the model to rely on the other features to do the ranking, but the splits are heavily dominated by one category of features (the user features in this case).

Not quite. XGBoost “imputes” missing values by choosing a “default” direction for each internal test node. So either the missing value gets mapped to the left child node or to the right child node. The default direction is chosen to maximize the reduction of the loss function of the test node.

So we should not assume that features with many missing values will be chosen less often than other features. To de-prioritize features with many missing values, you should explicitly assign smaller weights to such features.
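You can also verify per split where the missing values are routed, and how often each feature is used, with something like the following (assuming `booster` is a trained ranking booster such as the one in your sketch above):

```python
# Each internal node in the dump has Feature, Split, Yes, No and Missing columns;
# the Missing column holds the id of the child that missing values default to
# (when it equals the "Yes" id, missing values follow that branch).
tree_df = booster.trees_to_dataframe()
splits = tree_df[tree_df["Feature"] != "Leaf"]
print(splits[["Tree", "Feature", "Split", "Yes", "No", "Missing"]].head(20))

# Split counts per feature: a quick way to see which features dominate the trees.
print(booster.get_score(importance_type="weight"))
```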

Thanks for your quick response.

I tried doing an ablation study on the features. The product features are all binary, whereas the user features are all continuous. When I visualise the tree splits, I see that the top splits are all based on the continuous features. Experimenting with feature_weights helped rebalance some of the splits but did not improve the overall ranking in needle-in-a-haystack scenarios (for example, a focused search query for a specific item). In those cases the user features are all missing, and we need to rely on the binary product features, such as title match, to do the ranking.

However, for the queries where we do have user features, the XGBoost ranker performs well. It is the long tail of search requests that needs some kind of special tuning.

Thanks for the information shared above - appreciate your comments!

I have posted on this forum previously flagging this as well.


Yes, I did notice that XGBoost tends to choose continuous features over binary features. (There is a similar issue with high-cardinality categorical features being favoured when direct support for categorical features is used.) You may want to consider special handling logic, such as model stacking: a first model determines whether a certain set of features should be used, and the second model (XGBoost) is trained on the subset of features chosen by the first. Note: I am not a data science practitioner and have not tried this suggestion myself.

Thank you for your suggestions.

Yes, I tried building two models - one with user + product features, and another with only product features - and we switch between them based on the features available at inference. This seems to bypass some of these issues, but due to internal constraints we try to ship only one model. This is where I feel linear rankers (such as logistic regression or RankSVM) have a graceful failure mechanism: a missing feature (set to 0) has no impact on the output.
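For reference, the switching approach looks roughly like this (simplified sketch with synthetic data; the routing rule and model settings are illustrative):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
n_product, n_user = 15, 15
X = rng.random((2000, n_product + n_user))
X[:1000, n_product:] = np.nan              # long-tail rows: all user features missing
y = rng.integers(0, 4, size=2000)
qid = np.repeat(np.arange(200), 10)

# Model A: trained on the full 30-dim vector (user features may be missing).
full_model = xgb.XGBRanker(objective="rank:ndcg")
full_model.fit(X, y, qid=qid)

# Model B: trained on the product features only.
product_only = xgb.XGBRanker(objective="rank:ndcg")
product_only.fit(X[:, :n_product], y, qid=qid)

def score(candidates: np.ndarray) -> np.ndarray:
    """Route a (n_docs, 30) candidate matrix for one query to the right ranker."""
    has_user_features = not np.isnan(candidates[:, n_product:]).all()
    if has_user_features:
        return full_model.predict(candidates)
    return product_only.predict(candidates[:, :n_product])
```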

So we are still experimenting, and we are hopeful that upgrading to 2.0 can help fix some of these issues.

Really appreciate you taking the time to respond - thanks, and have a great day ahead!