How to choose `best_ntree_limit` using early stopping when doing CV manually?


#1

I’m using a manual CV loop to tune booster parameters (this is at the same time as tuning vectoriser parameters, so I can’t use xgboost’s cv function).

I’m using an eval set for each CV fold to try to choose a good number of estimators for the model via the best_ntree_limit attribute.

These vary a lot between folds though. For 5-fold CV I sometimes see a wide range of best_ntree_limit values, e.g. 7, 29, 13, 72, 14.

Is there a recommended way to choose a single value for the final model? I could take the mean or the max, but I’m wondering if there’s a better approach (or whether this high variance indicates some other change I should be making).


#2

What is your early_stopping_rounds set to? You should try increasing early_stopping_rounds, i.e. wait longer before stopping, so that you’re not jumping to conclusions based on noise in the eval metric.


#3

I’ve got early_stopping_rounds set to 30, but I’m training on quite a small dataset (~800 items), so perhaps that is also contributing to the noise.


#4

You may be overfitting. Try adding regularization, e.g. a smaller learning rate, a larger min_child_weight, a smaller max_depth, etc.
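For example, a more conservative parameter set might look like this (the specific values are illustrative starting points, not tuned recommendations):

```python
# Illustrative regularized settings for a small (~800 row) dataset
params = {
    "objective": "binary:logistic",
    "eta": 0.03,             # smaller learning rate -> more, smaller steps
    "max_depth": 3,          # shallower trees
    "min_child_weight": 5,   # require more weight per leaf
    "subsample": 0.8,        # row subsampling per tree
    "colsample_bytree": 0.8, # feature subsampling per tree
    "reg_lambda": 2.0,       # L2 regularization on leaf weights
    "eval_metric": "logloss",
}
```

With a smaller eta you'd typically also raise num_boost_round and rely on early stopping to pick the cutoff.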


#5

Yup, I suspect that’s the case. I’ll try a more tightly constrained hyperparam search which focuses more on reducing overfitting to see if it reduces the variance of early stopping trees.