Why does row order matter when using xgboost (exact)?

Jovan · November 8, 2018, 12:11pm

Hello,

I just noticed that I get different results with xgboost depending on the order in which I feed in the data (the order of the rows, the column order is unchanged). To illustrate this, I’ve created the following script with the iris dataset:

```python
import os
import pylab as p
import numpy as np
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'

# Load data
X, y = load_iris(return_X_y=True)

# "Fix" the labels - to get a binary case for simplicity here
y[y == 2] = 1

# Train test split
XX, YY, yx, yy = train_test_split(X, y, test_size=0.01, stratify=y)

# Set up xgb
params = {
    'tree_method': 'exact',    # default is auto
    'max_depth': 7,
    'learning_rate': 0.1,
    'min_child_weight': 1,
    'subsample': 1.0,
    'colsample_bytree': 1.0,   # subsample ratio of columns when constructing each tree
    'colsample_bylevel': 1.0,  # subsample ratio of columns for each split, in each level
    'reg_lambda': 0.0,         # L2 norm
    'reg_alpha': 0.0,          # L1 norm
    'objective': 'binary:logistic',
    'random_state': 22,
    'silent': 1,
}

# Shuffle the data - this shuffles the order of the rows
ind = np.random.randint(low=0, high=len(XX), size=len(XX))
XX = XX[ind]
yx = yx[ind]

# Train the model
dtrain = xgb.DMatrix(data=XX, label=yx)
dtest = xgb.DMatrix(data=YY, label=yy)
booster = xgb.train(params=params, dtrain=dtrain, num_boost_round=1000)

# Print the predictions
print(booster.predict(dtest))
```
So every time I run the above script, I get a different probability for dtest. I am probably misunderstanding something; can someone please explain why this happens? Surely it is not a bug? I run this on Python 3.6, in a conda environment where everything is installed from conda (conda-forge).

Many thanks!
Jovan.

Jovan · November 8, 2018, 12:44pm

Hi,

I found my mistake:
ind = np.random.randint(low=0, high=len(XX), size=len(XX) )
XX = XX[ind]
yx = yx[ind]

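The randomization was wrong: np.random.randint draws the indices with replacement, so instead of reordering the rows I was training on a different resampled dataset on every run, hence a different result.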

Otherwise, the results are indeed independent of the row order (when subsample is 1, of course).
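For anyone who lands here with the same problem, here is a minimal sketch of the difference between the two ways of drawing indices. It uses a toy array instead of the iris data, and the seed and array sizes are arbitrary choices for illustration, not part of my original script. A permutation visits every row exactly once, which is what a shuffle should do:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only to make the illustration repeatable
n_rows = 10
X = np.arange(n_rows * 2).reshape(n_rows, 2)  # toy feature matrix: row i is [2*i, 2*i + 1]
y = np.arange(n_rows)                         # toy labels aligned with the rows of X

# What the script above effectively did: draw n_rows indices WITH replacement.
# Some rows can appear several times and others not at all.
ind_resampled = rng.integers(low=0, high=n_rows, size=n_rows)

# A true shuffle: every index appears exactly once, only the order changes.
ind_shuffled = rng.permutation(n_rows)

# Apply the same index array to features and labels so they stay aligned.
X_shuffled, y_shuffled = X[ind_shuffled], y[ind_shuffled]

print("distinct rows after randint-style draw:", len(np.unique(ind_resampled)))
print("distinct rows after permutation:", len(np.unique(ind_shuffled)))
```

In the script above, the equivalent change would be to build `ind` with `np.random.permutation(len(XX))` instead of `np.random.randint(...)`.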

Cheers,
Jovan.