Hi all –
I was wondering whether you can help me sort out some confusion I have about using nested cross-validation (i.e. inner + outer CV loops) for model derivation. I am happy to share some code, but I think my question is more conceptual, so I’ll try to keep it short.
I have a large data set with ~600 binary predictors and a binary outcome. My goal is not to build a prediction model (we are likely missing many important predictors, which are currently unknown), but rather to use the model to identify the variables in the data that are most strongly associated with the outcome. The data set has ~7000 unique rows, and the prevalence of the outcome is about 9%.
What I’ve done so far is the following:
- Shuffle the data and create an 80/20 split. Use the 80% for training and the 20% for testing.
- Run a 5-fold cross validation on the training to tune the hyperparameters using a large grid search.
- Choose the set of hyperparameters with the minimal average loss (log loss) across the CV folds.
- Train the model on 80% of the data (i.e. all the training data from the initial split) using the chosen hyperparameters.
- Assess model performance on the remaining 20% (again, the focus of our analysis is really on finding the most important predictors, not on developing a prediction model; that said, it’d be nice to get some sense of model performance).
- Finally, refit the model on all available data (i.e. 100%, both initial splits combined), and use some metric to assess variable importance.
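For concreteness, here is roughly what I mean by the steps above, sketched with scikit-learn and a random forest — both are assumptions on my part for illustration (the actual model and library don’t matter for the question), and the data here is a smaller synthetic stand-in so the snippet runs quickly:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real data (the real set is ~7000 rows x
# ~600 binary predictors with ~9% outcome prevalence).
X, y = make_classification(n_samples=2000, n_features=100,
                           n_informative=10, weights=[0.91],
                           random_state=0)
X = (X > 0).astype(int)  # binarize features to mimic binary predictors

# Step 1: shuffled 80/20 split (stratified to preserve the prevalence)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, stratify=y, shuffle=True, random_state=0)

# Steps 2-3: 5-fold CV grid search on the training data, choosing
# hyperparameters by minimal average log loss
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, 10]},
    scoring="neg_log_loss", cv=5)
grid.fit(X_tr, y_tr)

# Step 4: refitting on the full 80% happens automatically (refit=True)
best = grid.best_estimator_

# Step 5: assess on the untouched 20%
test_loss = log_loss(y_te, best.predict_proba(X_te))

# Step 6: refit on 100% of the data and extract variable importances
final = RandomForestClassifier(random_state=0, **grid.best_params_)
final.fit(X, y)
importances = final.feature_importances_
```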
So here are my specific questions:
I understand that when the same cross-validation process is used both to select the hyperparameters and to assess model performance, the performance score can be overly optimistic, which is why nested CV is warranted. However, is this still true when some of the data is left completely untouched during model building (as in my case above, where 20% were held out as a test set, never seen by the model beforehand)?
When using nested CV, the hyperparameters and the choice of variables can differ somewhat across the outer loops. So, as opposed to the non-nested approach above, where after the (inner) CV you’re left with a single set of hyperparameters that is then used to train one model, nested CV repeats this process for each outer fold. My guess is that it is reassuring if the same set of variables is selected each time, since that indicates the algorithm gives stable results. But is there a formal way to combine the variable-importance results from the different outer folds into one unified ‘importance’ score, similar to what one can obtain from a single run of the model? Or is nested CV really meant for evaluating model performance, and should it not be used when the primary goal is to identify important predictors?
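To make the second question concrete, here is the kind of aggregation I have in mind — averaging per-fold importances and using the fold-to-fold spread as a stability check. This is purely my own informal idea, not an established procedure, and the random forest / scikit-learn choices are again just illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Small synthetic stand-in for the real data
X, y = make_classification(n_samples=1000, n_features=50,
                           n_informative=10, weights=[0.91],
                           random_state=0)

outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_importances, fold_losses = [], []

for train_idx, test_idx in outer.split(X, y):
    # Inner loop: hyperparameter tuning on the outer-training data only
    inner = GridSearchCV(
        RandomForestClassifier(n_estimators=50, random_state=0),
        param_grid={"max_depth": [5, 10]},
        scoring="neg_log_loss", cv=3)
    inner.fit(X[train_idx], y[train_idx])

    # Outer loop: performance estimate on the held-out fold,
    # plus this fold's variable importances
    proba = inner.best_estimator_.predict_proba(X[test_idx])
    fold_losses.append(log_loss(y[test_idx], proba))
    fold_importances.append(inner.best_estimator_.feature_importances_)

fold_importances = np.array(fold_importances)
mean_importance = fold_importances.mean(axis=0)  # one "unified" score?
importance_sd = fold_importances.std(axis=0)     # fold-to-fold stability
```

Is averaging like this defensible, or does it mix models that should be kept separate?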
Thanks so much for your help!