Inconsistent number of samples in Sklearn with XGBoost

I am trying to train an XGBRegressor using this code:

import xgboost as xgb
from sklearn.metrics import mean_squared_error

def xgboost(): 
    model = xgb.XGBRegressor(n_estimators=200,
                             max_depth=4,
                             subsample=1,
                             min_child_weight=1,
                             objective='reg:squarederror',
                             tree_method='hist',
                             eval_metric=mean_squared_error,  # sklearn callable as custom metric
                             early_stopping_rounds=50)
    return model

# Hold out the last 20% of the training data as a validation set
num_amostras = x_train.shape[0]
val_size = 0.2
num_amostras_train = int(num_amostras * (1 - val_size))
x_train_xgb = x_train[:num_amostras_train]
y_train_xgb = y_train[:num_amostras_train]
x_val_xgb = x_train[num_amostras_train:]
y_val_xgb = y_train[num_amostras_train:]

model_xgb = xgboost()
model_xgb.fit(x_train_xgb, y_train_xgb,
              eval_set=[(x_train_xgb, y_train_xgb), (x_val_xgb, y_val_xgb)])
resultados = model_xgb.evals_result()

x_train has shape (1458, 55)
x_train_xgb has shape (1166, 55)
y_train_xgb has shape (1166, 24)
x_val_xgb has shape (292, 55)
y_val_xgb has shape (292, 24)

But I am getting this error:

Traceback (most recent call last):

  File ~\PeDFurnas\lib\site-packages\spyder_kernels\py3compat.py:356 in compat_exec
    exec(code, globals, locals)

  File c:\users\ldsp_\sipredvs\scripts\treinamento_demanda.py:201
    model_xgb.fit(x_train_xgb, y_train_xgb, eval_set=[(x_train_xgb, y_train_xgb),(x_val_xgb, y_val_xgb)])

  File ~\PeDFurnas\lib\site-packages\xgboost\core.py:729 in inner_f
    return func(**kwargs)

  File ~\PeDFurnas\lib\site-packages\xgboost\sklearn.py:1086 in fit
    self._Booster = train(

  File ~\PeDFurnas\lib\site-packages\xgboost\core.py:729 in inner_f
    return func(**kwargs)

  File ~\PeDFurnas\lib\site-packages\xgboost\training.py:182 in train
    if cb_container.after_iteration(bst, i, dtrain, evals):

  File ~\PeDFurnas\lib\site-packages\xgboost\callback.py:238 in after_iteration
    score: str = model.eval_set(evals, epoch, self.metric, self._output_margin)

  File ~\PeDFurnas\lib\site-packages\xgboost\core.py:2138 in eval_set
    feval_ret = feval(

  File ~\PeDFurnas\lib\site-packages\xgboost\sklearn.py:139 in inner
    return func.__name__, func(y_true, y_score)

  File ~\PeDFurnas\lib\site-packages\sklearn\metrics\_regression.py:442 in mean_squared_error
    y_type, y_true, y_pred, multioutput = _check_reg_targets(

  File ~\PeDFurnas\lib\site-packages\sklearn\metrics\_regression.py:100 in _check_reg_targets
    check_consistent_length(y_true, y_pred)

  File ~\PeDFurnas\lib\site-packages\sklearn\utils\validation.py:397 in check_consistent_length
    raise ValueError(

ValueError: Found input variables with inconsistent numbers of samples: [27984, 1166]

So 27984 = 1166 * 24, the product of the dimensions of y_train_xgb's shape.

1166 is the number of samples in both x_train_xgb and y_train_xgb.
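
Since y_train_xgb is 2-D (24 targets per sample), the two arrays handed to the custom metric no longer match: judging by the order of the numbers in the error message, the labels arrive flattened to 1-D (27984 values) while the predictions keep 1166 rows, and sklearn's length check rejects the pair. A minimal reproduction of that failing check, using dummy arrays of the same shapes:

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.zeros(1166 * 24)     # labels flattened to 1-D: 27984 values
y_pred = np.zeros((1166, 24))    # multi-output predictions: 1166 samples

# Raises: ValueError: Found input variables with inconsistent
# numbers of samples: [27984, 1166]
mean_squared_error(y_true, y_pred)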

If I don't use an sklearn metric (mean_squared_error in this case) and instead use XGBRegressor's default metric ('rmse'), the code runs just fine.

So, what is the cause of this problem, and how can I fix it so that mean_squared_error works as the eval_metric?

This use case is currently not supported; see https://github.com/dmlc/xgboost/issues/9730#issuecomment-1783400188. For now, please use the built-in metrics.
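
As a sketch of that workaround (same configuration as in the question, otherwise unchanged), passing the built-in metric by name avoids the custom-callable path entirely. Note that 'rmse' is the square root of MSE, so it is monotonically equivalent and early stopping picks the same rounds it would with mean_squared_error:

import xgboost as xgb

def xgboost():
    model = xgb.XGBRegressor(n_estimators=200,
                             max_depth=4,
                             subsample=1,
                             min_child_weight=1,
                             objective='reg:squarederror',
                             tree_method='hist',
                             eval_metric='rmse',  # built-in metric name, not a callable
                             early_stopping_rounds=50)
    return model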
