Hello Community,
I am using XGBoost for a learning-to-rank task where each example has a blend of product and user features (15 of each, to be precise). The user features are missing for the long tail of search requests, and they are all-or-none: either all 15 are available or all 15 are missing.
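For clarity, here is a minimal sketch of how a single example is encoded (the array names are just for illustration):

```python
import numpy as np

# 30-dim vector = 15 product features followed by 15 user features.
product_feats = np.random.rand(15)       # always present
user_feats = np.full(15, np.nan)         # all 15 NaN for long-tail requests
x = np.concatenate([product_feats, user_feats])  # shape (30,)
```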
With the LTR model built on this 30-dim feature vector, my ideal expectation is that the model relies on the user-related features when they are available, and falls back to the product features for ranking when they are not. However, after training I see that this is not the case.
I have done the following for training (a sketch of this setup follows the list):
- Set the `missing` value in the `xgb.DMatrix` API
- Applied monotone constraints on a subset of the user features
- Ran a hyper-parameter search over `colsample_by{tree, level, node}`
- Set `feature_weights` to bias column sampling toward the product features
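Roughly, my training setup looks like this (toy data; the specific weights and sampling rates are placeholders, the real values come from the hyper-parameter search):

```python
import numpy as np
import xgboost as xgb

# Toy data: 1000 examples, 30 features (cols 0-14 product, 15-29 user),
# grouped into 100 queries of 10 documents each.
X = np.random.rand(1000, 30)
X[500:, 15:] = np.nan                    # long-tail rows: user block missing
y = np.random.randint(0, 5, size=1000)   # relevance labels

dtrain = xgb.DMatrix(X, label=y, missing=np.nan)  # (1) missing value
dtrain.set_group([10] * 100)                      # query group sizes

# (4) feature_weights: bias column sampling toward the product features
dtrain.set_info(feature_weights=np.array([2.0] * 15 + [1.0] * 15))

params = {
    "objective": "rank:ndcg",
    # (2) monotone constraints on a subset (here 5) of the user features
    "monotone_constraints": "(" + ",".join(["0"] * 15 + ["1"] * 5 + ["0"] * 10) + ")",
    # (3) column-sampling knobs from the hyper-parameter search
    # (feature_weights only take effect when these are below 1.0)
    "colsample_bytree": 0.8,
    "colsample_bylevel": 0.8,
    "colsample_bynode": 0.8,
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```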
Is there a way to optimise this in XGBoost? In the case where the user features are missing, the ranking behaviour seems quite random and very suboptimal.
Appreciate your help in this regard, thank you for your time.