Does validation data influence the trained model?

Hi,

I have a very basic question.
Do I get a different model if I fit an xgboost model on the same training data but with different validation data (val = xgb_val)?

I have split my data 80-10-10.

First I trained an xgboost model with fixed hyperparameters using the code below, with the first 10% part of the data as val = xgb_val:

set.seed(20)
model_n <- xgb.train(data = xgb_trainval,
                     tree_method = "gpu_hist",
                     booster = "gbtree",
                     objective = "binary:logistic",
                     max_depth = 1,
                     eta = 0.17,
                     subsample = 0.5,
                     colsample_bytree = 0.5,
                     min_child_weight = 10,
                     nrounds = 1000,
                     eval_metric = "auc",
                     early_stopping_rounds = 30,
                     print_every_n = 1000,
                     watchlist = list(train = xgb_trainval, val = xgb_val)
)

Next I did exactly the same, but with the second 10% part of the data as val = xgb_val.

I noticed that the number of iterations is different and that the two models give different predictions for identical new data.
I suppose the difference is caused by the xgboost fitting process, which uses the validation data to determine the number of iterations?
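To make this concrete, this is roughly how I compared the two fits (just a sketch; model_1, model_2 and xgb_new are placeholder names for my two fitted models and the new data as an xgb.DMatrix):

model_1$best_iteration   # boosting rounds chosen by early stopping with the first val split
model_2$best_iteration   # a different number with the second val split

pred_1 <- predict(model_1, xgb_new)   # predictions on the same new data
pred_2 <- predict(model_2, xgb_new)
summary(pred_1 - pred_2)              # the differences are not zero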

If the validation data influences the trained model, should I use a considerably larger part of my data for validation? Larger than in my current 80-10-10 split?

Thanks a lot for any advice!

You are exactly right: the validation data is used in training the model, in that it determines the optimal number of iterations via early_stopping_rounds.
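For example (a rough sketch, reusing the object names from your code), you can see where the validation data enters the fit, since the fitted booster keeps the metrics that early stopping monitored:

model_n$best_iteration    # the round with the best val AUC, as chosen by early stopping
model_n$evaluation_log    # per-round train/val AUC that early stopping watched
# Training stops once val AUC has not improved for early_stopping_rounds (30) rounds,
# so a different xgb_val changes both when training stops and which round is flagged as best.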
Whether 10% is a large enough sample is fairly subjective. I think it depends on several factors, but mainly on how many records fall into that 10%. Do you think that 10% has enough records that the variation in the data is representative of the variation in your entire population? Or, from another perspective, enough that the performance on that dataset is representative of the performance on your entire population?
I think that consideration applies both to the validation dataset used in training and to your true untouched holdout dataset (the additional 10% of your data).

Thanks a lot for your answer!

I suppose the 10% does have enough records to be representative of performance, judging by the results for every fold in 5-fold cross-validation: these are reasonably stable (although this stability depends on the quality of the model as well). The variation in the data itself is hard to measure, since the target variable is a binary classification (true/false).

Besides that, I noticed that in 5-fold cross-validation the optimal number of iterations differs for only one fold after changing the validation data. I suppose this underscores that choosing 10% for validation is stable enough.
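For anyone who wants to check the same thing, something along these lines (a sketch reusing the hyperparameters from my first post; the exact call I ran may have differed slightly) shows how stable the early-stopped number of rounds is across folds:

set.seed(20)
cv <- xgb.cv(data = xgb_trainval,
             nfold = 5,
             tree_method = "gpu_hist",
             booster = "gbtree",
             objective = "binary:logistic",
             max_depth = 1,
             eta = 0.17,
             subsample = 0.5,
             colsample_bytree = 0.5,
             min_child_weight = 10,
             nrounds = 1000,
             eval_metric = "auc",
             early_stopping_rounds = 30,
             verbose = FALSE)
cv$best_iteration    # number of rounds where the mean validation AUC peaks
cv$evaluation_log    # mean and sd of train/test AUC per round across the 5 folds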