Nrounds vs max_depth

Hello all –
There seems to be a trade-off between the maximum depth of the trees (max_depth) and the number of iterations (nrounds). Specifically, as one goes up the other tends to go down. In my analysis (70,000 rows, about 1000 predictors), I am attempting to use a grid search to find the optimal values for these parameters (as well as a few others). However, the cross-validation scores suggest that the optimal model is obtained when max_depth exceeds 100, with a modest nrounds value (~200).

It seems to me that the opposite (i.e. high nrounds and lower max_depth) makes more sense for generalizing the results to other databases, but it feels weird to just go with this feeling arbitrarily. Am I missing something? Is there a better criterion (other than CV, that is) for setting max_depth?

You should use nested cross-validation, since you are trying to estimate the generalization performance of the model as well as the hyperparameter search procedure. See

You might be overestimating the generalization capacity of the model. See the linked article.
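For readers unfamiliar with the term, here is a minimal sketch of nested cross-validation in plain Python. It is only an illustration of the idea, not a recommended implementation: `score_fn` is a hypothetical callable (not part of any library) that would fit a model on the given training rows with the given parameters and return a higher-is-better score on the given test rows, and the fold counts are arbitrary.

```python
import random
from itertools import product


def kfold_indices(indices, k, seed=0):
    """Shuffle `indices` and split them into k (train, test) pairs."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    pairs = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        pairs.append((train, test))
    return pairs


def nested_cv(n_rows, grid, score_fn, outer_k=5, inner_k=3):
    """Return one (best_params, outer_score) pair per outer fold.

    score_fn(train_idx, test_idx, params) is a hypothetical callable that
    fits on train_idx and returns a higher-is-better score on test_idx.
    """
    candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
    results = []
    for outer_train, outer_test in kfold_indices(range(n_rows), outer_k):

        def inner_mean(params):
            # Inner loop: tune using only the outer-training rows.
            scores = [score_fn(tr, te, params)
                      for tr, te in kfold_indices(outer_train, inner_k, seed=1)]
            return sum(scores) / len(scores)

        best = max(candidates, key=inner_mean)
        # Outer loop: score the tuned setting on rows the search never saw.
        results.append((best, score_fn(outer_train, outer_test, best)))
    return results
```

The outer scores estimate how well the whole procedure (search plus fit) generalizes. The per-fold `best` values also show whether the search picks the same region of the grid, e.g. very deep trees, consistently across folds.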

Thanks Philip - so I completely understand that you shouldn’t evaluate the model performance using the same dataset that was also used for tuning the parameters. Still, my question here is not about the absolute performance, but rather about the choice of the hyper-parameters to use. The CV process should help guide the choice of parameters.

There is a trade-off between the number of iterations and tree depth - the shallower the trees are, the more iterations are needed. In general, my understanding is that shallower trees are better for generalizability. Intuitively that also makes sense - i.e. you are better off with a model that has nrounds of 1,000 and tree depth of 3 than a model that has nrounds of 150 and tree depth of 100. Yet, if CV scores suggest that deeper models are better, should one go with this? I have never seen any publications using xgboost that reported such deep trees, so the choice seems odd to me, but then making another decision arbitrarily also seems unfair.

In any case, since the databases we are handling are quite big, our go-to approach is usually to initially split the data into a training and a test set, do the CV for tuning on the training data, then use the full training data for training the model, and finally report the performance on the never-before-seen test set. This, to my understanding, should give unbiased results similar to the nested CV. Am I missing something?
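To make the workflow described above concrete, here is a rough sketch in plain Python. It amounts to nested CV with a single outer split. `fit_fn(rows, params)` and `eval_fn(model, rows)` are hypothetical callables standing in for model training and scoring (higher is better); the 20% test fraction and 5 folds are arbitrary choices for illustration.

```python
import random
from itertools import product


def tune_then_test(rows, grid, fit_fn, eval_fn, test_frac=0.2, k=5, seed=0):
    """Hold out a test set, tune by k-fold CV on the rest, refit, test once.

    fit_fn(rows, params) -> model and eval_fn(model, rows) -> score are
    hypothetical callables supplied by the caller.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    test, train = rows[:n_test], rows[n_test:]  # test is set aside until the end

    folds = [train[i::k] for i in range(k)]

    def cv_score(params):
        scores = []
        for i, valid in enumerate(folds):
            fit_rows = [r for j, f in enumerate(folds) if j != i for r in f]
            scores.append(eval_fn(fit_fn(fit_rows, params), valid))
        return sum(scores) / k

    candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
    best = max(candidates, key=cv_score)

    final_model = fit_fn(train, best)        # refit on all training rows
    return best, eval_fn(final_model, test)  # single look at the test set
```

The test rows never enter `cv_score`, so the returned score is not influenced by the parameter search, provided the test set is used only this once.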

Thanks again

Keep in mind that setting max_depth=100 does not necessarily produce a tree with depth 100, due to the presence of other regularizing hyperparameters, e.g. min_child_weight and gamma.
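One way to check this on a fitted model is to measure the depth each tree actually reached. The sketch below assumes the trees are available as JSON strings in which every internal node carries a `children` list and leaves do not, which is my understanding of what `Booster.get_dump(dump_format="json")` returns. Treat that layout as an assumption and verify it against your version.

```python
import json


def tree_depth(node):
    """Depth of a tree given as nested dicts; a lone leaf has depth 0."""
    children = node.get("children", [])
    if not children:
        return 0
    return 1 + max(tree_depth(child) for child in children)


def realized_depths(json_dumps):
    """Map each JSON-encoded tree to the depth it actually reached."""
    return [tree_depth(json.loads(text)) for text in json_dumps]
```

With a trained booster (here a hypothetical variable `booster`), something like `max(realized_depths(booster.get_dump(dump_format="json")))` would then show how close the trees actually get to the max_depth limit.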