We are currently working on a project that aims to identify a set of predictors that may influence the risk of a relatively rare outcome. We are using a moderately large clinical dataset with data on nearly 200,000 patients. The outcome of interest is binary (i.e. yes/no) and quite rare (~5% of the patients). We have a large set of nearly 1,200 candidate predictors, most of them dichotomized. Of note, our objective is not to create a prediction model, but rather to use the boosted trees algorithm as a tool for variable selection and for examining high-order interactions (i.e. to identify which variables, or combinations of variables, may have some influence on the outcome), so that we can target these predictors more specifically in subsequent studies. Given the paucity of etiological information on the outcome, it is possible that none of the candidate predictors we are considering actually has any influence on the risk of developing the condition, so if we were aiming to develop a prediction model, it would likely have been a rather bad one. For this work, we use the R implementation of XGBoost.

We have been having some difficulty tuning the models. Specifically, when running cross-validation to choose the optimal number of iterations (nrounds), the CV test score continues to improve even at very high values (for example, see figure below for nrounds=600,000). This is observed even when increasing the learning rate (eta) or when adding regularization parameters (e.g. max_delta_step, lambda, alpha, gamma), even at high values for these. As expected, the CV test score is always worse than the train score, but it continues to improve without ever showing a clear sign of overfitting. This is true regardless of the evaluation metric used (the example below is for logloss, but the same is observed for auc, aucpr, error rate, etc.). Relatedly, the same phenomenon is observed when using a grid search to find the optimal tree depth (max_depth): CV test scores continue to improve regardless of the number of iterations, even at depth values exceeding 100, without showing any sign of overfitting.
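To put the logloss values in context, here is a minimal sketch (Python, standard library only, purely for illustration) of the no-information baseline: the logloss of a constant prediction equal to the outcome prevalence. The helper name `null_logloss` is hypothetical, and 0.05 is the approximate prevalence quoted above, not an exact figure from our data.

```python
import math


def null_logloss(prevalence: float) -> float:
    """Logloss of predicting the same probability (the prevalence) for every patient."""
    p = prevalence
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


# Assumption: ~5% prevalence, as described in the post.
baseline = null_logloss(0.05)
print(round(baseline, 4))  # about 0.1985
```

A CV test logloss that keeps decreasing but stays close to this baseline would suggest that the improvements, while steady, are small in absolute terms.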

Note that owing to the rare outcome, we use a stratified CV approach. Moreover, the same is observed when a train/test split is used instead of CV.
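For readers unfamiliar with stratification, the idea can be sketched as follows (Python, standard library only). This is a generic illustration under stated assumptions, not the internals of any library: the function `stratified_folds` is a hypothetical helper, and the labels are synthetic with a 5% positive rate, not real patient data.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign each index to one of k folds, keeping class proportions similar."""
    rng = random.Random(seed)
    # Group row indices by outcome class.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    fold_of = [None] * len(labels)
    # Shuffle within each class, then deal indices to folds round-robin.
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, i in enumerate(indices):
            fold_of[i] = position % k
    return fold_of


# Synthetic example: 500 positives among 10,000 rows (5%).
labels = [1] * 500 + [0] * 9500
folds = stratified_folds(labels, k=5)
positives_per_fold = [
    sum(1 for i, f in enumerate(folds) if f == j and labels[i] == 1)
    for j in range(5)
]
print(positives_per_fold)  # [100, 100, 100, 100, 100]
```

Each fold ends up with the same share of positive cases, so no fold is left with too few events to evaluate on.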

Are there situations in which overfitting happens despite continuous improvements in the CV-test (or test split) scores? If so, why is that and how would one choose the optimal values for the hyperparameters?

Relatedly, again, the idea here is not to create a prediction model (since it would be a rather bad one, given that we don’t know much about the outcome), but to look for a signal in the data that may help identify a set of predictors for further exploration. If boosted trees are not the optimal method for this, are there others that come to mind? Again, part of the reason we chose boosted trees was to enable the identification of higher-order (i.e. more than second-order) interactions, which cannot be easily assessed using more conventional methods (including lasso, elastic net, etc.). Thank you!