Watchlist example: why is the test set used for performance monitoring AND prediction?

I am wondering about the proper use of the validation set and the watchlist feature for early stopping and performance monitoring. Doesn’t the example below involve leakage?

# from guide-python/basic_walkthrough.py
# specify validation set to watch performance
watchlist = [(dtest, 'eval'), (dtrain, 'train')]
num_round = 2
bst = xgb.train(param, dtrain, num_round, watchlist)

# this is prediction
preds = bst.predict(dtest)

Should it be, instead:

# specify validation set to watch performance
watchlist = [(dtrain, 'train'), (dvalid, 'val')]
num_round = 2
bst = xgb.train(param, dtrain, num_round, watchlist)

# this is prediction
preds = bst.predict(dtest) # use test set here only
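To make the suggestion concrete, here is a minimal sketch of the three-way split I have in mind, where the validation set drives early stopping and the test set is scored only once at the end. The toy data, split sizes, and parameter values are assumptions made up for illustration; they are not taken from the XGBoost demos.

import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# toy data; array names and sizes are illustrative only
X = np.random.rand(1000, 10)
y = (X[:, 0] + 0.1 * np.random.randn(1000) > 0.5).astype(int)

# 60% train, 20% validation, 20% test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)
dtest = xgb.DMatrix(X_test, label=y_test)

param = {'objective': 'binary:logistic', 'eval_metric': 'logloss'}

# monitor train and validation; with early_stopping_rounds set,
# xgb.train uses the last entry of evals (here dvalid) for early stopping
watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
bst = xgb.train(param, dtrain, num_boost_round=100, evals=watchlist,
                early_stopping_rounds=10)

# the test set is touched exactly once, for the final score
preds = bst.predict(dtest)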

Separately, the positions of the two sets are reversed in two examples located on GitHub, for instance:

# from /tests/python/test_eval_metrics.py
watchlist = [(dtrain, 'train'), (dvalid, 'val')]

Which is correct? The predictions are not used further in that example, but it seems to follow the “right” intuition of monitoring performance on a dev/validation set while leaving the test set for scoring only.

The last line, preds = bst.predict(dtest), is there only to demonstrate the use of predict(). We are not performing model selection here.

The demo shows a minimal example of how to use predict() and train(). Yes, if you are performing model selection with different hyperparameter combinations, then you’d want to use a validation set (or cross-validation).
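For reference, here is a minimal sketch of the cross-validation route using xgb.cv; the toy data, fold count, and parameter values are placeholder assumptions chosen only for illustration, not a recommendation.

import numpy as np
import xgboost as xgb

# toy data; names and sizes are made up for this sketch
X = np.random.rand(500, 10)
y = (X[:, 0] > 0.5).astype(int)
dtrain = xgb.DMatrix(X, label=y)

param = {'objective': 'binary:logistic', 'eval_metric': 'logloss'}

# 5-fold cross-validation with early stopping on the held-out folds;
# returns per-round train/test metric means and standard deviations
# (as a pandas DataFrame when pandas is available)
cv_results = xgb.cv(param, dtrain, num_boost_round=100, nfold=5,
                    early_stopping_rounds=10, seed=0)
print(cv_results)

The test set stays out of this loop entirely; it is only scored once the final model has been chosen.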

Thanks so much, I really appreciate your dedication to the board and your responses!