colsample_bytree leads to non-reproducible models across machines (macOS, Windows)

Hello,

I have come across this issue in the following code. I am working with a friend who uses a Windows machine, while I use macOS. We are unable to reproduce the same results, which is a big problem for collaborative work on this model.
By experimenting with the parameters, we found that the difference comes from subsample and colsample_bytree.

I’ve noticed a few conversations on the subject, as well as PR https://github.com/dmlc/xgboost/pull/735, which was supposed to fix this, but I still see the problem. We both have version 1.1.1.1 of the xgboost package on our machines. Could you help us with this issue?

library(xgboost)

data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)

watchlist <- list(train = dtrain, eval = dtest)

param <- list(max_depth = 2, eta = 1, silent = 1, nthread = 1,
              subsample = 0.5, colsample_bytree = 0.1,
              objective = "binary:logistic", eval_metric = "auc")

set.seed(2020)

bst <- xgb.train(param, dtrain, nrounds = 5, watchlist)

Thank you

It looks like you are using R. R fortunately provides a way to customize the random number generation process. See https://stackoverflow.com/questions/48626086/same-seed-different-os-different-random-numbers-in-r and https://stackoverflow.com/questions/47199415/is-set-seed-consistent-over-different-versions-of-r-and-ubuntu.
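As a minimal sketch of what the linked answers suggest, the RNG kinds can be pinned explicitly on every machine instead of relying on R's defaults. This assumes all collaborators run R 3.6.0 or later, where the `sample.kind` argument exists; the name `rng_check` is only illustrative. It makes R's own random stream comparable across machines, but does not by itself guarantee that XGBoost's sampling will match.

```r
# Pin every component of R's RNG explicitly instead of relying on defaults,
# which changed between R versions (the default sample.kind became
# "Rejection" in R 3.6.0).
# Assumption: every machine runs R >= 3.6.0, so `sample.kind` is available.
RNGkind(kind = "Mersenne-Twister",
        normal.kind = "Inversion",
        sample.kind = "Rejection")
set.seed(2020)

# A small fingerprint of the RNG state to compare by eye across machines.
# `rng_check` is a hypothetical name used only for this illustration.
rng_check <- list(
  kinds   = RNGkind(),
  uniform = runif(3),
  sampled = sample(10, 3)
)
print(rng_check)
```

If the printed values agree on both machines but the trained models still differ, the discrepancy is more likely inside the package than in R's RNG configuration.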

Yes sorry, I am working in R!

This reproducibility problem seems to occur only in xgboost.
I have run

set.seed(2020)
rnorm(1) 

and get the same result across Mac OS and Windows.

I have tried both
RNGkind(sample.kind = "Rounding")
and
set.seed(10, kind = "Mersenne-Twister", normal.kind = "Inversion")
and neither fixes the issue.

Thank you for your help

Yeah, it looks like a bug. XGBoost is supposed to use the random number generator from R, and yet the results are not reproducible.

It may take a while for us (XGBoost developers) to get around fixing this bug. For now, you should consider some alternatives:

  • Save the model so that you don’t have to re-train
  • Use the same OS among all collaborators
  • Package XGBoost in a Docker container or a virtual machine (VM)
  • Host XGBoost in a cloud service or an R notebook server
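The first alternative can be sketched as follows: one collaborator trains the model, writes it to a file, and everyone else loads that file instead of re-training. This is only an illustration that reuses the setup from the original report; the temporary path and the `.json` extension are assumptions, and in practice the file would live somewhere shared.

```r
library(xgboost)

data(agaricus.train, package = "xgboost")
data(agaricus.test, package = "xgboost")
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest  <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)

param <- list(max_depth = 2, eta = 1, nthread = 1,
              subsample = 0.5, colsample_bytree = 0.1,
              objective = "binary:logistic", eval_metric = "auc")

# Train once, on one machine only.
set.seed(2020)
bst <- xgb.train(params = param, data = dtrain, nrounds = 5, verbose = 0)

# Write the trained model to disk. The path is a placeholder (assumption):
# replace it with a location that all collaborators can reach.
model_path <- tempfile(fileext = ".json")
xgb.save(bst, model_path)

# On any other machine: load the saved model instead of re-training.
bst_shared <- xgb.load(model_path)

# Predictions from the loaded model should agree with the original ones.
pred_original <- predict(bst, dtest)
pred_shared   <- predict(bst_shared, dtest)
print(all.equal(pred_original, pred_shared, tolerance = 1e-6))
```

Sharing the file sidesteps the sampling difference entirely, because the random draws happen only once, on the machine that trains.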

Also note that if you use the Python or JVM package, random number generation is generally not reproducible across operating systems.

Right now, our priority is to guarantee reproducibility on a single machine first: if you run the same training script twice on the same machine, the output should be the same. There are some cases where we have yet to attain this goal (for example, distributed training and learning to rank with the GPU algorithm). Reproducibility across multiple machines with different OSes is a more ambitious goal and will take a while for us to address.
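A rough way to check the single-machine guarantee for the script in this thread is to train twice with the same seed and compare the text dumps of the trees. The helper `train_and_dump` is a hypothetical name introduced only for this sketch; single-threaded training (`nthread = 1`) is assumed, as in the original report.

```r
library(xgboost)

data(agaricus.train, package = "xgboost")
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)

param <- list(max_depth = 2, eta = 1, nthread = 1,
              subsample = 0.5, colsample_bytree = 0.1,
              objective = "binary:logistic", eval_metric = "auc")

# Hypothetical helper: reset R's seed, train, and return the trees as text.
train_and_dump <- function(seed) {
  set.seed(seed)
  bst <- xgb.train(params = param, data = dtrain, nrounds = 5, verbose = 0)
  xgb.dump(bst, with_stats = TRUE)
}

dump_a <- train_and_dump(2020)
dump_b <- train_and_dump(2020)

# TRUE means both runs produced exactly the same trees on this machine.
same_on_this_machine <- identical(dump_a, dump_b)
print(same_on_this_machine)
```

The same dump can be written to a text file on each machine and compared with a diff tool, which shows the first tree and split where the Windows and macOS models diverge.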

@mathdvg And if you manage to fix the issue yourself, feel free to submit a pull request. Contribution is absolutely welcome!