Hi, I have exactly the same problem, and I've started to doubt whether this calculation is actually performed.
The XGBoost documentation states that the so-called base_score
"is automatically estimated for selected objectives before training. To disable the estimation, specify a real number argument."
This suggests that, for a regression task, XGBoost calculates the average implied by the particular loss function, e.g. the mean for the RMSE objective. However, as you wrote, the extracted value of base_score is always equal to 0.5. I've checked it with the current version, 1.7.6.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
import json
import xgboost as xgb
print(xgb.__version__)
# '1.7.6'
# From: https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/testing/updater.py
def get_basescore(model: xgb.XGBModel) -> float:
    """Get base score from an XGBoost sklearn estimator."""
    base_score = float(
        json.loads(model.get_booster().save_config())["learner"]["learner_model_param"][
            "base_score"
        ]
    )
    return base_score
# Preparing data
X, y = make_regression(n_samples=200)
X_train, X_test, y_train, y_test = train_test_split(X, y)
# Training a model
xgb_reg = xgb.XGBRegressor()
xgb_reg.fit(X_train, y_train)
# It seems it always returns... 0.5
# At least as long as we don't set a custom value manually
# with xgb.XGBRegressor(base_score=y_train.mean())
get_basescore(xgb_reg)
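For context on what the documented behaviour would look like: the estimation the docs describe amounts to fitting the best constant prediction for the chosen loss before any trees are built. A minimal sketch in plain NumPy (none of this is XGBoost's actual code) shows that for squared error this loss-minimizing constant is the training mean, which is why one would expect base_score ≈ mean(y_train) rather than 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=10.0, scale=3.0, size=200)

# Squared-error loss of a constant prediction c over the training targets.
def squared_error(c: float, y: np.ndarray) -> float:
    return float(np.mean((y - c) ** 2))

# Scan candidate constants; the minimizer should coincide with mean(y).
candidates = np.linspace(y.min(), y.max(), 10_001)
losses = [squared_error(c, y) for c in candidates]
best = candidates[int(np.argmin(losses))]

print(best, y.mean())  # the two values should be very close
```

So if the estimation were active, extracting base_score right after fitting with the default squared-error objective should give something close to y_train.mean(), not 0.5.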
How can we explain this (apparent?) discrepancy between the docs and the way it actually works?