I work with very noisy data and observed that my models perform much worse when trained with version 1.7.4 than with version 1.5.2 when using the approx tree method. I have a reproducible experiment that demonstrates the issue:
This is the data that is used across the two versions:
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
x, y = make_classification(
    n_samples=10000,
    n_features=4,
    n_informative=2,
    n_redundant=0,
    weights=[0.1],
    n_clusters_per_class=2,
    flip_y=0.05,
    class_sep=0.01,
    hypercube=True,
    random_state=42,
)
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, stratify=y, random_state=216)
pd.DataFrame(x_train).to_csv("./xtrain", index=False)
pd.DataFrame(y_train).to_csv("./ytrain", index=False)
pd.DataFrame(x_test).to_csv("./xtest", index=False)
pd.DataFrame(y_test).to_csv("./ytest", index=False)
Now, in two different Python kernels with the two different XGBoost versions, run the same code:
import pandas as pd
import numpy as np
import xgboost
x_train = pd.read_csv("./xtrain")
y_train = pd.read_csv("./ytrain").values
x_test = pd.read_csv("./xtest")
y_test = pd.read_csv("./ytest").values
dtrain = xgboost.DMatrix(
    data=x_train,
    label=y_train,
)
deval = xgboost.DMatrix(
    data=x_test,
    label=y_test,
)
params = {
    'eta': 0.03,
    'max_depth': 3,
    'min_child_weight': 0,
    'max_delta_step': 0,
    'subsample': 1,
    'base_score': 0.5,
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'tree_method': 'approx',
    'gamma': 4,
}
model = xgboost.train(
    params=params,
    dtrain=dtrain,
    num_boost_round=200,
    evals=[(dtrain, 'dtrain'), (deval, 'eval')],
    early_stopping_rounds=10,
)
The fitting result for version 1.5.2 is:
[193] dtrain-logloss:0.34347 eval-logloss:0.35179
and the fitting result for version 1.7.4 is:
[199] dtrain-logloss:0.36625 eval-logloss:0.36653
We can see that the fit is noticeably worse in the new version (train logloss 0.34347 vs. 0.36625, eval logloss 0.35179 vs. 0.36653). In the real-world example I was working on, 1.7.4 also tends to stop earlier and build shallower trees with worse performance:
v1.5.2:
[79] dtrain-logloss:0.13350 eval-logloss:0.13540
v1.7.4:
[32] dtrain-logloss:0.14292 eval-logloss:0.14281