We trained two models via the Python API: one trained normally (single machine) and one trained in distributed mode. For the distributed run, the first 25 lines and the second 25 lines of the dataset were given to the two workers as the two partitions; for the normal run, the first 50 lines were used as the dataset.
In my view, we should get the same model if the histogram bins are the same and every parameter related to randomization is disabled. Since the dataset is small, the histogram bins should be identical. But we got two different models. Can anyone explain how this could happen?
Parameters:
params_xgb = {
    "booster": "gbtree",
    "objective": "binary:logistic",
    "eval_metric": "rmse",
    "max_depth": 5,
    "lambda": 0,
    "subsample": 1.0,
    "colsample_bytree": 1.0,
    "seed": 123,
    "tree_method": "hist",
    "grow_policy": "depthwise",
    "gamma": 0,
    "min_child_weight": 0,
}
num_boost_round = 1
Datasets: FE_pima-indians-diabetes
xgboost version: 1.2.1