How to choose `best_ntree_limit` using early stopping when doing CV manually?


I’m using a manual CV loop to tune booster parameters (this is at the same time as tuning vectoriser parameters, so I can’t use xgboost’s cv function).

I’m using an eval set for each CV fold to try and choose a good number of estimators for the model using the best_ntree_limit attribute.

These vary a lot between folds, though: for 5-fold CV I’m sometimes seeing a wide range of best_ntree_limit values, e.g. 7, 29, 13, 72, 14.

Is there a recommended way to choose a single value for my final model? I could take the mean or the max, but I’m wondering whether there is a better approach (or whether this high variance indicates that there are other changes I should be making).
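For reference, here is a quick summary of the five per-fold values above, using only the Python standard library. The variable names are just for illustration; this is a sketch of the candidate aggregates, not a recommendation for any one of them.

```python
import statistics

# Per-fold best_ntree_limit values from the 5-fold run quoted above.
fold_best = [7, 29, 13, 72, 14]

summary = {
    "mean": statistics.mean(fold_best),      # 27
    "median": statistics.median(fold_best),  # 14
    "max": max(fold_best),                   # 72
    "stdev": statistics.stdev(fold_best),    # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

The single fold at 72 pulls the mean well above the median, which is part of why I’m unsure which aggregate to trust.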


What is your early_stopping_rounds setting? Try increasing it, i.e. wait longer before stopping early, so that you are not jumping to conclusions too quickly based on noise.
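To illustrate why a short patience can latch onto noise, here is a toy sketch in plain Python. It is a simplified emulation of early stopping, not xgboost’s actual implementation; the `best_iteration` helper and the validation curve are made up for the example.

```python
def best_iteration(val_scores, patience):
    """Return the index of the best (lowest) score seen before stopping.

    Stops once `patience` consecutive rounds pass without improvement.
    """
    best_idx = 0
    best_score = float("inf")
    for i, score in enumerate(val_scores):
        if score < best_score:
            best_score, best_idx = score, i
        elif i - best_idx >= patience:
            break  # no improvement for `patience` rounds
    return best_idx


# Hypothetical noisy validation loss: it improves overall, with plateaus.
curve = [0.70, 0.65, 0.66, 0.67, 0.60, 0.61, 0.62, 0.63,
         0.55, 0.56, 0.57, 0.58, 0.59]

for patience in (2, 3, 5):
    print(patience, best_iteration(curve, patience))
# 2 1
# 3 4
# 5 8
```

On this curve, each increase in patience carries the search past a plateau to a later and better minimum, which is the effect you would hope to see per fold.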


I’ve got early stopping rounds set to 30, but am training on quite a small dataset (~800 items), so perhaps that is also contributing to the noise.


You may be overfitting. Try adding regularization, e.g. by setting a small learning rate, a large min_child_weight, a small max_depth, etc.
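As a rough sketch, a search space biased toward regularization might look like the following. The specific values are illustrative assumptions, not tuned recommendations; the parameter names are the xgboost ones mentioned above (`learning_rate` is an alias for `eta`).

```python
from itertools import product

# Hypothetical grid restricted to conservative settings.
grid = {
    "learning_rate": [0.01, 0.03, 0.1],  # small step size
    "max_depth": [2, 3, 4],              # shallow trees
    "min_child_weight": [5, 10, 20],     # require more weight per leaf
}

# Expand the grid into one parameter dict per combination.
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]

print(len(candidates))  # 27
print(candidates[0])
```

Keep in mind that a smaller learning rate usually needs more boosting rounds, so the early-stopped tree counts will likely shift upward; what matters is whether they agree more closely across folds.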


Yup, I suspect that’s the case. I’ll try a more tightly constrained hyperparameter search that focuses on reducing overfitting, to see if it reduces the variance in the early-stopped tree counts.