Can averaging multiple xgboost models improve predictions?

I have a method which seems to improve the AUC on the test set.
I wonder whether it makes sense or whether it is just caused by randomness.

I have split my data 80-10-10 into three sets (A, B, C).

First I run 5-fold CV on A.
From a 300-point grid search I obtain the best hyperparameter settings.
With these settings I fit 5 models, so I get 5 predictions for every row in C.
I take the average prediction for every row and calculate the AUC, which is 0.7189.
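
To make the averaging step concrete, here is a minimal sketch in base R. It is not my actual pipeline: the labels and the prediction matrix are simulated, and the names `labels_c`, `pred_mat`, `models` and `xgb_test` are hypothetical. It only shows how the per-row average and the AUC are computed.

```r
# AUC via the rank-sum (Mann-Whitney) formula; labels must be 0/1.
auc <- function(labels, scores) {
  r <- rank(scores)                      # ties get average ranks
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Simulated stand-ins (NOT the real data): 1000 rows of C, 5 models.
set.seed(42)
n_rows <- 1000
labels_c <- rbinom(n_rows, 1, 0.3)

# Each column plays the role of one model's predictions on C.
# With real models this would be something like
#   pred_mat <- sapply(models, predict, newdata = xgb_test)
# where `models` and `xgb_test` are hypothetical names.
pred_mat <- sapply(1:5, function(i) plogis(labels_c + rnorm(n_rows, sd = 2)))

# One averaged prediction per row of C.
avg_pred <- rowMeans(pred_mat)

# AUC of each individual model and of the average.
auc_single <- apply(pred_mat, 2, auc, labels = labels_c)
auc_avg <- auc(labels_c, avg_pred)
print(round(auc_single, 4))
print(round(auc_avg, 4))
```

Note that the five simulated columns have independent noise, which is friendlier to averaging than five models trained on overlapping folds of the same data.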

Alternatively, I fit a single model on A with B as validation data, using the hyperparameter settings obtained above.
I make predictions with this model on C and calculate the AUC, which is 0.7176.

So it seems the first method, the average of 5 models, scores better than a single model on the same data. I wonder whether this is plausible. I have read some articles about stacking different models (a neural net plus xgboost, for example), although those setups seem much more sophisticated than mine.
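
One way to check whether a gap this small (0.7189 vs 0.7176) is more than noise on C would be a paired bootstrap of the AUC difference. The sketch below is only an illustration under assumptions: the data are simulated, and `pred_avg`, `pred_single` and `paired_bootstrap_auc_diff` are hypothetical names, not part of xgboost.

```r
# AUC via the rank-sum (Mann-Whitney) formula; labels must be 0/1.
auc <- function(labels, scores) {
  r <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Resample the rows of the test set with replacement and recompute both
# AUCs on the SAME resampled rows, so the comparison stays paired.
# Assumes every resample contains both classes (true for large test sets).
paired_bootstrap_auc_diff <- function(labels, pred_a, pred_b, n_boot = 1000) {
  n <- length(labels)
  diffs <- replicate(n_boot, {
    idx <- sample.int(n, n, replace = TRUE)
    auc(labels[idx], pred_a[idx]) - auc(labels[idx], pred_b[idx])
  })
  quantile(diffs, c(0.025, 0.975), names = FALSE)
}

# Simulated stand-ins (NOT the real data): two correlated score vectors.
set.seed(1)
n <- 2000
labels <- rbinom(n, 1, 0.3)
shared <- labels + rnorm(n, sd = 2)            # noise common to both
pred_avg <- shared + rnorm(n, sd = 0.3)        # plays the averaged model
pred_single <- shared + rnorm(n, sd = 0.6)     # plays the single model

ci <- paired_bootstrap_auc_diff(labels, pred_avg, pred_single)
print(ci)  # 95% percentile interval for AUC(avg) - AUC(single)
```

If the interval comfortably contains 0, the observed difference is compatible with sampling noise on the test set; if it lies entirely above 0, the averaged model is more likely to be genuinely better.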

Is it realistic that the average of 5 models leads to better results? The cost is that I need to run 5 predictions instead of 1 every time I make a prediction.

Could it be that my individual models are not optimally tuned, which leaves room for improvement by averaging?

Thanks a lot!

This is the code I used for fitting the models:

    model_n <- xgb.train(data = xgb_trainval,
                         tree_method = "gpu_hist",
                         booster = "gbtree",
                         objective = "binary:logistic",
                         max_depth = 1,               # depth-1 trees (stumps)
                         eta = 0.17,
                         subsample = 0.5,
                         colsample_bytree = 0.5,
                         min_child_weight = 10,
                         nrounds = 1000,              # upper bound; early stopping may end sooner
                         eval_metric = "auc",
                         early_stopping_rounds = 30,  # stop after 30 rounds without improvement on val
                         print_every_n = 1000,
                         watchlist = list(train = xgb_trainval, val = xgb_val))