I have a very basic question.
Do I get a different model if I fit an xgboost model on the same training data but with different validation data (val = xgb_val)?
I have split my data 80-10-10.
First I trained an xgboost model with fixed hyperparameters using the code below, with the first 10% part of the data as val = xgb_val:
model_n <- xgb.train(data = xgb_trainval,
                     tree_method = "gpu_hist",
                     booster = "gbtree",
                     objective = "binary:logistic",
                     max_depth = 1,
                     eta = 0.17,
                     subsample = 0.5,
                     colsample_bytree = 0.5,
                     min_child_weight = 10,
                     nrounds = 1000,
                     eval_metric = "auc",
                     early_stopping_rounds = 30,
                     print_every_n = 1000,
                     watchlist = list(train = xgb_trainval, val = xgb_val))
Next I did exactly the same, but using the second 10% part of the data as val = xgb_val.
I noticed that the number of iterations is different and that the two models give different predictions for identical new data.
I suppose the difference is caused by the model-fitting process, which uses the validation data to determine the number of iterations via early stopping?
If the validation data influences the trained model in this way, should I use a considerably larger part of my data for validation, i.e. larger than in my current 80-10-10 split?
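For context, here is a sketch of what I imagine an alternative could look like: using k-fold cross-validation on the training portion to pick the number of rounds, so the chosen iteration count does not depend on one particular validation split. This assumes the xgboost R package; xgb_trainval is my existing training DMatrix, and the fold count is just an example value.

```r
library(xgboost)

# same fixed hyperparameters as in my xgb.train call above
params <- list(booster = "gbtree",
               objective = "binary:logistic",
               eval_metric = "auc",
               max_depth = 1,
               eta = 0.17,
               subsample = 0.5,
               colsample_bytree = 0.5,
               min_child_weight = 10)

# 5-fold CV on the training data: every row is used for validation
# exactly once, so best_iteration is averaged over splits
cv <- xgb.cv(params = params,
             data = xgb_trainval,
             nrounds = 1000,
             nfold = 5,
             early_stopping_rounds = 30,
             verbose = FALSE)

best_n <- cv$best_iteration

# refit on the full training data with the CV-chosen round count
final_model <- xgb.train(params = params,
                         data = xgb_trainval,
                         nrounds = best_n)
```

With this approach the held-out 10% slices could be kept purely for final evaluation rather than for early stopping.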
Thanks a lot for any advice!