Is XGBoost a good solution for high-dimensional sparse features?

Hello guys, I am working on a recommender system that has high-dimensional sparse features. Which model would be better for such features, LR or XGBoost?
As far as I know, high-dimensional features can lead to very deep trees, and sparse features can lead to overfitting, but I am not sure, so I was wondering which model I should adopt.
Thanks for your replies.

If the high-dimensional sparse features originate from feature interaction terms, then you should prefer logistic regression (LR), since LR will be much faster to fit and serve.
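For a concrete picture, here is a minimal sketch (the dimensions, density, and solver choice below are made up for illustration, not from this thread) of fitting LR on a sparse matrix of interaction features; the training and serving cost scales with the number of nonzero entries rather than the full dimensionality:

```python
# Minimal sketch: LR on high-dimensional sparse features (e.g. interaction
# terms stored as a CSR matrix). Shapes and values are made up for illustration.
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LogisticRegression

n_samples, n_features = 10_000, 1_000_000
rng = np.random.default_rng(0)

# Build a random sparse design matrix with ~10 nonzeros per row.
X = sp.random(n_samples, n_features, density=1e-5, format="csr", random_state=0)
y = rng.integers(0, 2, size=n_samples)

# Training cost scales with the number of nonzeros,
# not with n_samples * n_features.
clf = LogisticRegression(solver="liblinear").fit(X, y)

# Serving is a single sparse dot product per sample.
scores = clf.decision_function(X[:5])
```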

If the features are the result of one-hot encoding of categorical features: we are currently in the process of implementing direct categorical splits in XGBoost, so that one-hot encoding can be avoided. Stay tuned.
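Until then, the usual workaround is to keep the one-hot encoding in sparse form, which XGBoost accepts directly as input. A minimal sketch (the toy data and parameter values are assumptions for illustration only):

```python
# Sketch of the one-hot workaround: encode categoricals into a sparse matrix
# and hand it to XGBoost as a DMatrix. The toy columns here are hypothetical.
import numpy as np
import xgboost as xgb
from sklearn.preprocessing import OneHotEncoder

categories = np.array([["red", "small"], ["blue", "large"], ["red", "large"]])
y = np.array([1, 0, 1])

# OneHotEncoder returns a scipy.sparse matrix by default, so the one-hot
# representation is never densified.
enc = OneHotEncoder()
X = enc.fit_transform(categories)

dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train(
    {"objective": "binary:logistic", "max_depth": 3}, dtrain, num_boost_round=10
)
```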

@hcho3 could you explain a bit why LR is good for features that originate from feature interaction terms?

Primarily for scalability and performance reasons. XGBoost consumes memory in an amount proportional to the number of data points × the number of features × the number of tree nodes, so high-dimensional data leads to high memory consumption and can trigger out-of-memory (OOM) errors.
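As a rough back-of-envelope illustration of that growth rate (the dimensions and the per-entry constant below are made-up assumptions, not measurements of XGBoost):

```python
# Back-of-envelope arithmetic for the proportionality described above.
# Every number here is a made-up assumption for illustration only.
n_rows = 10_000_000      # data points
n_features = 100_000     # high-dimensional input
n_nodes = 255            # nodes in a depth-8 tree (2**8 - 1)
bytes_per_entry = 4      # hypothetical constant factor

total_bytes = n_rows * n_features * n_nodes * bytes_per_entry
print(f"{total_bytes / 1e12:.1f} TB")  # ~1020 TB: easily an OOM at this scale
```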

When it comes to LR, there are techniques that make LR scale to large, high-dimensional data, such as feature hashing and sparse representation. See https://courses.cs.washington.edu/courses/cse547/16sp/slides/hashing-sketching-annotated-2.pdf. There, the feature interaction terms consist of interactions between global features and personal (per-user) features.
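For example, here is a minimal sketch of the hashing trick combined with sparse LR (the token names and the hashed dimension are hypothetical, not from the slides):

```python
# Minimal sketch of feature hashing: global and per-user feature tokens (and
# their interaction terms) are hashed into a fixed-width sparse vector, so the
# model size stays bounded no matter how many raw tokens exist.
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

samples = [
    ["country=US", "user=42", "country=US^user=42"],  # interaction term
    ["country=DE", "user=7", "country=DE^user=7"],
]
labels = [1, 0]

# 2**20 hashed dimensions regardless of the raw vocabulary size.
hasher = FeatureHasher(n_features=2**20, input_type="string")
X = hasher.transform(samples)  # sparse CSR matrix

# SGD-based logistic regression trains on sparse input natively.
clf = SGDClassifier(loss="log_loss").fit(X, labels)
```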

@hcho3 thanks a lot.