Custom callback with early stopping: number of executions

I have noticed that the number of times a custom callback executes seems to be random when early stopping is turned on.

Here is a minimal example to reproduce the behavior I am talking about:

import xgboost as xgb

# read in data
dtrain = xgb.DMatrix('agaricus.txt.train')
dtest = xgb.DMatrix('agaricus.txt.test')

# specify parameters
# (the literal values were lost when posting; these are plausible
# placeholders consistent with the numbers reported below)
params_xgb = dict(
    objective='binary:logistic',
    eval_metric='logloss',
)

params_train = dict(
    params=params_xgb,
    dtrain=dtrain,
    num_boost_round=100,
    early_stopping_rounds=3,  # best_iteration + 3 = 43 in the runs below
    verbose_eval=False,
)

# specify callback
class ExampleCallback(xgb.callback.TrainingCallback):
    def __init__(self, callback_results):
        self.callback_results = callback_results

    def after_iteration(self, model, epoch, evals_log):
        self.callback_results.append(epoch)  # record that the callback ran
        return False  # continue training

callback_results = []
params_train['callbacks'] = [ExampleCallback(callback_results), ]

bst = xgb.train(
    evals=[(dtest, 'dtest')],
    **params_train,
)

num_callback_results = len(callback_results)
best_score = bst.best_score
best_iteration = bst.best_iteration
num_trees = len(bst.get_dump())
line = f'{num_callback_results};{best_score};{best_iteration};{num_trees}\n'

with open('callback.csv', mode='a') as f:
    f.write(line)

When I run this code, say, 100 times, in most cases the length of the list callback_results is equal to bst.best_iteration + early_stopping_rounds (43 in my case), but in some cases (about 14%) it is equal to bst.best_iteration + early_stopping_rounds + 1 (44 in my case). The best score, the best iteration and the number of trees in the model are always the same (in my case 0.0004998939329742, 40 and 44 respectively).

Moreover, I have noticed that when I simply run xgboost.train() with the early-stopping parameter, the number of times the model evaluation score is printed is also random: it can be 43 or 44.

Is this the expected behavior of callbacks? Is there a way to make my custom callback always execute the same number of times?

My actual need is more complex: I want to run a cross-validation and store the score on the out-of-fold set at each boosting round. I can use a callback for this, but the length of the list of evaluation scores randomly varies by 1, hence my question.

I am using xgboost ver 1.6.1.

I’m having the same issue with a custom callback class.

It looks like after_iteration() is intermittently called one less time than the number of rounds run during training. Or maybe the right way to look at it is that there is one additional training round that shouldn’t occur?

This is with python 3.12.3 and xgboost 2.0.3

Edited to add: working on a simple reproducible script to share

import xgboost as xgb
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

class SimpleCallback(xgb.callback.TrainingCallback):
    """A simplified custom callback to demonstrate an intermittent issue with early stopping."""

    def __init__(self):
        self.feature_stats = []

    def after_iteration(self, model, epoch, evals_log):
        """Executed after each boosting round; calculates and records a simple statistic."""
        print(f'Running epoch {epoch}')
        if model is None:
            return False  # Continue training

        # Simple collection of the number of features
        num_features = len(model.get_score())
        print(f'Number of features: {num_features}')
        self.feature_stats.append(num_features)

        return False  # Continue training

if __name__ == '__main__':
    # Load example data
    data = load_breast_cancer()
    X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

    for run_number in range(1, 101):  # Run 100 times
        dtrain = xgb.DMatrix(X_train, label=y_train)
        dtest = xgb.DMatrix(X_test, label=y_test)

        # Training parameters
        params = {'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error']}
        num_rounds = 50
        early_stopping_rounds = 10
        evals_result = {}

        # Callback instance
        callback = SimpleCallback()

        # Training with early stopping and the custom callback
        model = xgb.train(
            params,
            dtrain,
            num_boost_round=num_rounds,
            evals=[(dtrain, 'train'), (dtest, 'test')],
            early_stopping_rounds=early_stopping_rounds,
            evals_result=evals_result,
            callbacks=[callback],
            verbose_eval=False,
        )

        # Compare the lengths to spot any discrepancies
        if len(callback.feature_stats) != len(evals_result['test']['logloss']) or len(callback.feature_stats) != len(evals_result['test']['error']):
            print(f"Mismatch found on run {run_number}:")
            print("Feature stats collected during training:", callback.feature_stats)
            print("Length of feature stats:", len(callback.feature_stats))
            print("Evals_result logloss length:", len(evals_result['test']['logloss']))
            print("Evals_result error length:", len(evals_result['test']['error']))
        else:
            print(f"No mismatch on run {run_number}, all lengths match.")

This simplified script seems to frequently (more often than not) reproduce the error for me.
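In the meantime, the workaround I'm leaning towards is to normalize the list length after training rather than fight the callback count: whether after_iteration ran 43 or 44 times, only the rounds up to and including the best iteration matter. A hypothetical helper (not from the library) would look like this:

```python
def trim_to_best(scores, best_iteration):
    """Keep only the per-round scores up to and including the best iteration,
    dropping any trailing early-stopping overshoot rounds."""
    return scores[:best_iteration + 1]

# Regardless of whether 4 or 5 rounds were recorded, the trimmed
# list always has best_iteration + 1 entries.
print(trim_to_best([0.9, 0.7, 0.6, 0.65, 0.66], 2))  # → [0.9, 0.7, 0.6]
```

Applied to the script above, this would be trim_to_best(callback.feature_stats, model.best_iteration) after xgb.train returns, which makes the collected list deterministic even if the off-by-one call remains.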