Parameter configuration for XGBoost with RFECV on a huge dataset

I have 640 features and one target column in 53,460 rows.

I am trying to select features using sklearn's RFECV. Since I want to speed things up, I have configured both as shown below, but I am fairly sure these parameters are not set properly. I would appreciate suggestions from the community on how to get this to finish quicker.

from sklearn.feature_selection import RFECV
from xgboost import XGBClassifier
.......
xgb_rfe = XGBClassifier(objective='multi:softmax', num_class=3, eval_metric='logloss', use_label_encoder=False,
                        random_state=100, n_estimators=10_000, verbosity=0, early_stopping_rounds=3_000,
                        # my CPU has 8 cores / 16 threads
                        n_jobs=7)

rfe = RFECV(estimator=xgb_rfe, min_features_to_select=2, verbose=2, n_jobs=2, cv=3)
rfe.fit(X=X_train, y=y_train)

Thanks.


Hello Naveen, a dataset of that size should take less than a second to train on XGBoost with reasonable parameters. You're using 10,000 trees, which is bound to lead to overfitting. I suggest you start with 100-500 trees (n_estimators) and focus on other parts of your modelling.
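As a rough sketch (the specific numbers are placeholders to tune, not recommendations), the estimator could be slimmed down along these lines. Note that early stopping is dropped here, because RFECV's internal cross-validation fits never receive a validation set:

from sklearn.feature_selection import RFECV
from xgboost import XGBClassifier

# Leaner estimator for feature selection; 'mlogloss' is the
# multiclass counterpart of 'logloss'.
xgb_rfe = XGBClassifier(objective='multi:softmax', num_class=3,
                        eval_metric='mlogloss',
                        n_estimators=300,    # placeholder in the 100-500 range
                        tree_method='hist',  # histogram-based training, much faster on CPU
                        random_state=100, verbosity=0,
                        n_jobs=-1)           # let XGBoost use all cores...

# ...and keep RFECV single-threaded so the two levels of parallelism
# do not oversubscribe the CPU. step=10 removes 10 features per
# iteration; with 640 features, the default step=1 would mean
# roughly 640 elimination rounds per fold.
rfe = RFECV(estimator=xgb_rfe, step=10, min_features_to_select=2,
            cv=3, verbose=2, n_jobs=1)
rfe.fit(X_train, y_train)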

I suggest shap-hypetune: a Python package for simultaneous hyperparameter tuning and feature selection for gradient-boosting models. It also supports RFE with SHAP feature ranking.
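A minimal sketch of how that might look, based on the shap-hypetune README (BoostRFE and its arguments come from that package, so double-check them against the version you install; this also assumes an xgboost version whose fit() still accepts early_stopping_rounds):

from shaphypetune import BoostRFE
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Hold out a validation set so early stopping and the SHAP
# importances are computed on unseen data.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=100, stratify=y_train)

xgb = XGBClassifier(objective='multi:softmax', num_class=3,
                    n_estimators=300, random_state=100, n_jobs=-1)

# Recursive feature elimination ranked by SHAP values rather than
# the default gain-based importances.
rfe = BoostRFE(xgb,
               min_features_to_select=2,
               step=10,                            # drop 10 features per round
               importance_type='shap_importances',
               train_importance=False)             # compute SHAP on the eval_set

rfe.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], early_stopping_rounds=30)
X_selected = rfe.transform(X_train)                # keep only the selected features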