XGBoost over-fitting despite no indication in cross-validation test scores?

We are currently working on a project in which we aim to identify a set of predictors that may influence the risk of a relatively rare outcome. We are using a semi-large clinical dataset, with data on nearly 200,000 patients. The outcome of interest is binary (i.e. yes/no) and quite rare (~5% of the patients). We have a large set of nearly 1,200 mostly dichotomized possible predictors. Of note, our objective is not to create a prediction model, but rather to use the boosted trees algorithm as a tool for variable selection and for examining high-order interactions (i.e. to identify which variables, or combinations of variables, may have some influence on the outcome), so we can target these predictors more specifically in subsequent studies. Given the paucity of etiological information on the outcome, it is possible that none of the candidate predictors we are considering actually has any influence on the risk of developing the condition, so if we were aiming to develop a prediction model, it would likely be a rather bad one. For this work, we use the R implementation of XGBoost.

We have been having some difficulty tuning the models. Specifically, when running cross-validation to choose the optimal number of iterations (nrounds), the CV test score continues to improve even at very high values (for example, see the figure below for nrounds = 600,000). This is observed even when increasing the learning rate (eta), or when adding regularization parameters (e.g. max_delta_step, lambda, alpha, gamma, even at high values for these). As expected, the CV test score is always worse than the train score, but it continues to improve without ever showing a clear sign of overfitting. This is true regardless of the evaluation metric that is used (the example below is for logloss, but the same is observed for auc, aucpr, error rate, etc.). Relatedly, the same phenomenon is observed when using a grid search to find the optimal value of tree depth (max_depth): CV test scores continue to improve regardless of the number of iterations, even at depth values exceeding 100, without showing any sign of overfitting.

Note that, owing to the rarity of the outcome, we use a stratified CV approach. Moreover, the same pattern is observed when a train/test split is used instead of CV.
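
For concreteness, here is a minimal sketch of the kind of grid search we run over max_depth with xgboost's built-in CV function (the object names X and y and the parameter values are illustrative placeholders, not our actual settings):

    library(xgboost)

    # X: matrix of ~1,200 mostly binary predictors; y: 0/1 outcome (~5% positive)
    dtrain <- xgb.DMatrix(data = X, label = y)

    depth_grid <- c(2, 4, 6, 10, 20)
    cv_results <- lapply(depth_grid, function(d) {
      xgb.cv(
        params = list(
          objective   = "binary:logistic",
          eval_metric = "logloss",
          eta         = 0.1,
          max_depth   = d
        ),
        data       = dtrain,
        nrounds    = 5000,      # deliberately large number of boosting rounds
        nfold      = 5,
        stratified = TRUE,      # preserve the ~5% event rate in each fold
        verbose    = FALSE
      )
    })

    # best (minimum) CV test logloss reached at each depth
    sapply(cv_results, function(cv) min(cv$evaluation_log$test_logloss_mean))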

Are there situations in which overfitting happens despite continuous improvements in the CV-test (or test split) scores? If so, why is that and how would one choose the optimal values for the hyperparameters?

Relatedly, again, the idea here is not to create a prediction model (since it would be a rather bad one, given that we don't know much about the outcome), but to look for a signal in the data that may help identify a set of predictors for further exploration. If boosted trees are not the optimal method for this, are there others that come to mind? Again, part of the reason we chose boosted trees was to enable the identification of higher-order (i.e. more than two-way) interactions, which cannot be easily assessed using more conventional methods (including lasso/elastic net, etc.). Thank you!

You should consider using nested cross-validation, since you are trying to estimate the generalization performance of the model as well as the hyperparameter search procedure. See https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html.

Thank you Philip for the suggestion! But let's assume for a second that I am only running a cross-validation to determine the optimal nrounds value (with early_stopping_rounds set). Wouldn't you expect that, at some point, at very high nrounds values, the CV test score would start diverging from the CV train score, indicating poorer fit and overfitting? And once you set some regularization parameters (again, for the sake of this discussion, let's assume without a grid search), wouldn't you expect the same thing to happen, possibly at lower nrounds values? The figure I included was obtained from xgboost's built-in CV function, without pre-processing of the data.

Thanks again

Suggestion: set a huge nrounds and activate early stopping with CV. (Use xgboost.cv(), i.e. xgb.cv() in the R package, with the early_stopping_rounds option.) That will let you discover when the CV metric starts to worsen (CV metric = average of the validation metric over the CV folds). Early stopping terminates training when the monitored metric has not improved for M consecutive rounds, where M is early_stopping_rounds.
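
In the R package, something like the following should work (a rough sketch; X, y, and the parameter values are placeholders for your own data and settings):

    library(xgboost)

    dtrain <- xgb.DMatrix(data = X, label = y)

    cv <- xgb.cv(
      params = list(
        objective   = "binary:logistic",
        eval_metric = "logloss",
        eta         = 0.1,
        max_depth   = 6
      ),
      data                  = dtrain,
      nrounds               = 100000,  # "huge" upper bound; early stopping cuts it short
      nfold                 = 5,
      stratified            = TRUE,
      early_stopping_rounds = 50,      # M: stop once the CV metric has not improved for 50 rounds
      verbose               = FALSE
    )

    cv$best_iteration                     # round at which the CV test metric was best
    cv$evaluation_log[cv$best_iteration]  # train/test logloss (mean and sd) at that round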

Yes, I would expect the CV metric to worsen at some point, eventually.

I also highly suggest that you consider using nested CV. The issue is that the validation sets in the CV folds get touched repeatedly during tuning, so the CV metric (mean validation metric over the CV folds) tends to overestimate how well the model would perform on a truly held-out test set.
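
A rough sketch of what nested CV could look like in R for your setup (outer folds are held out purely for evaluation, the inner xgb.cv is used only for tuning; the tiny depth grid and the names X and y are illustrative assumptions, not a prescription):

    library(xgboost)

    set.seed(1)
    k_outer     <- 5
    outer_folds <- sample(rep(1:k_outer, length.out = length(y)))  # ideally stratify these too, given the rare outcome
    depth_grid  <- c(2, 4, 6)

    outer_logloss <- sapply(1:k_outer, function(k) {
      tr  <- which(outer_folds != k)
      te  <- which(outer_folds == k)
      dtr <- xgb.DMatrix(X[tr, ], label = y[tr])
      dte <- xgb.DMatrix(X[te, ], label = y[te])

      # inner loop: tune max_depth and nrounds on the outer-training data only
      inner <- lapply(depth_grid, function(d) {
        cv <- xgb.cv(
          params = list(objective = "binary:logistic", eval_metric = "logloss",
                        eta = 0.1, max_depth = d),
          data = dtr, nrounds = 5000, nfold = 5, stratified = TRUE,
          early_stopping_rounds = 50, verbose = FALSE
        )
        list(depth = d, nrounds = cv$best_iteration,
             score = min(cv$evaluation_log$test_logloss_mean))
      })
      best <- inner[[which.min(sapply(inner, `[[`, "score"))]]

      # refit with the chosen hyperparameters and score on the untouched outer fold
      fit <- xgb.train(
        params = list(objective = "binary:logistic", eval_metric = "logloss",
                      eta = 0.1, max_depth = best$depth),
        data = dtr, nrounds = best$nrounds, verbose = 0
      )
      p <- predict(fit, dte)
      -mean(y[te] * log(p) + (1 - y[te]) * log(1 - p))  # held-out logloss
    })

    mean(outer_logloss)  # estimate of how well the whole tuned procedure generalizes

The mean of the outer-fold scores estimates the performance of the entire procedure (tuning included) on data it never touched, which is exactly the quantity the repeatedly reused inner CV folds tend to flatter.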