Cross-validation question


I don’t have a very large dataset, and I want to use 10-fold cross-validation to see how well my model performs on unseen data.

I see a lot of code using cross-validation to establish the best hyper-parameters (number of iterations, learning rate, etc.). Then, once that's done, the model is run on all the data with those optimal hyper-parameters.

Does that now give me an idea of how well the model performs on unseen data, or do I need to embed that whole story into another level of cross-validation?

Many thanks.

Using a model to predict its own training set does not show how well the model performs on new data; it only shows how well the model fits the data it has already seen. The 10-fold cross-validation performance is what you need.
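As a minimal sketch of what that looks like in practice (using scikit-learn; the dataset and model here are stand-ins, not anything from your setup):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder dataset and model -- substitute your own.
X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Each of the 10 scores is accuracy on a held-out fold
# the model never trained on.
scores = cross_val_score(model, X, y, cv=10)
print(f"10-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The mean of those 10 held-out-fold scores is the generalization estimate, as opposed to the (optimistic) accuracy you would get by scoring the model on its own training data.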

This is a general Machine Learning practice. Here is a standard recommendation:

  1. Split your data into a training and a test set. Hold the test set back. Do not use it.
  2. Perform cross-validation on the training set only, and use it to optimize your hyper-parameters. Each fold's held-out portion acts as a validation set for choosing the best values. Once you have found them, tuning is finished.
  3. Finally, retrain the model with those chosen hyper-parameters and evaluate it once on the held-back test set, which it has never seen before.
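The three steps above can be sketched like this (a scikit-learn example; the dataset, model, and the `C` grid are illustrative assumptions, not part of the original recipe):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Placeholder dataset -- substitute your own.
X, y = load_breast_cancer(return_X_y=True)

# Step 1: hold back a test set and do not touch it during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 2: cross-validate on the training set only to pick hyper-parameters.
search = GridSearchCV(
    LogisticRegression(max_iter=5000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=10)
search.fit(X_train, y_train)  # refits best model on all of X_train

# Step 3: one final evaluation on the untouched test set.
test_score = search.score(X_test, y_test)
print(f"Best C: {search.best_params_['C']}, test accuracy: {test_score:.3f}")
```

Note that `GridSearchCV` refits the best configuration on the whole training set by default, so the final `score` call is exactly the "optimized model on unseen data" check from step 3.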

Please note that this extra step is only necessary if you want to double-check your final model on truly unseen data, which is what the held-back test set provides. I rarely do this myself because I work on the education side, where cross-validation alone is usually sufficient for my purposes. But if you want to put extra care into your model, or if it's going into production, holding back the extra test set is essential.
