colsample_bytree leads to non-reproducible models across machines (macOS, Windows)

mathdvg · June 29, 2020, 1:01pm

Hello,

I have come across this issue with the code below. I am working with a friend who uses a Windows machine, while I am on macOS. We are unable to reproduce the same results, which is a big problem for collaborative work on this model.
By playing with the parameters, we found that the difference comes from subsample and colsample_bytree.

I’ve noticed a few conversations on the subject, and also this PR https://github.com/dmlc/xgboost/pull/735 that was supposed to fix this, but I still see the problem. We both have version 1.1.1.1 of the xgboost package on our machines. Could you help us with this issue?

data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)

watchlist <- list(train = dtrain, eval = dtest)

param <- list(max_depth = 2, eta = 1, silent = 1, nthread = 1,
              subsample = 0.5, colsample_bytree = 0.1,
              objective = "binary:logistic", eval_metric = "auc")

set.seed(2020)

bst <- xgb.train(param, dtrain, nrounds = 5, watchlist)

Thank you

hcho3 · June 29, 2020, 10:20pm

It looks like you are using R. R fortunately provides a way to customize the random number generation process. See https://stackoverflow.com/questions/48626086/same-seed-different-os-different-random-numbers-in-r and https://stackoverflow.com/questions/47199415/is-set-seed-consistent-over-different-versions-of-r-and-ubuntu.

mathdvg · June 30, 2020, 7:22am

Yes sorry, I am working in R!

This reproducibility problem seems to occur only in xgboost.
I have run

set.seed(2020)
rnorm(1)

and get the same result across Mac OS and Windows.

I have tried both
RNGkind(sample.kind = "Rounding")
and
set.seed(10, kind = "Mersenne-Twister", normal.kind = "Inversion")
and neither fixes the issue.

Thank you for your help

hcho3 · June 30, 2020, 7:47am

Yeah, it looks like a bug. XGBoost is supposed to use the random number generator from R, and yet the number is not reproducible.

It may take a while for us (XGBoost developers) to get around to fixing this bug. For now, you should consider some alternatives:

Save the model so that you don’t have to re-train
Use the same OS among all collaborators
Package XGBoost in a Docker container or a virtual machine (VM)
Host XGBoost in a cloud service or an R notebook server
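
As a rough illustration of the first option, here is a minimal sketch that trains once, saves the model with xgb.save, and reloads it with xgb.load so that collaborators can share the file instead of re-training. The file name and the use of tempdir() are placeholders chosen for this example, and the parameters are copied from the question (minus the watchlist). I have not verified cross-OS behaviour of the loaded model myself, so treat this as a starting point rather than a guarantee.

```r
library(xgboost)

data(agaricus.train, package = "xgboost")
data(agaricus.test, package = "xgboost")
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)

# Same sampling parameters as in the question.
param <- list(max_depth = 2, eta = 1, nthread = 1,
              subsample = 0.5, colsample_bytree = 0.1,
              objective = "binary:logistic")

set.seed(2020)
bst <- xgb.train(params = param, data = dtrain, nrounds = 5)

# Write the trained model to disk (placeholder path; in practice,
# put the file somewhere all collaborators can reach).
model_path <- file.path(tempdir(), "agaricus_xgb.json")
xgb.save(bst, model_path)

# A collaborator loads the file instead of re-training.
bst_loaded <- xgb.load(model_path)

pred_original <- predict(bst, dtest)
pred_loaded <- predict(bst_loaded, dtest)

# Compare predictions from the in-memory and reloaded models.
print(all.equal(pred_original, pred_loaded))
```

The point of the design is that the random sampling happens only once, on one machine, so differences in random number generation between operating systems no longer affect what collaborators see.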

hcho3 · June 30, 2020, 7:41am

And if you happen to use the Python or JVM package, the random number generation is generally not reproducible if you switch the OS.

Right now, our priority is to guarantee reproducibility on a single machine first: if you run the same training script twice on the same machine, the output should be the same. There are some cases where we have yet to attain this goal (examples: distributed training, learning to rank with the GPU algorithm). Reproducibility across multiple machines with different OSes is a more ambitious goal and will take a while for us to address.
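
To check the single-machine case on your own setup, something like the following sketch may help. It trains twice with the same seed and compares the predictions. The helper train_once is a hypothetical name introduced for this example, and the parameters are taken from the question; the comparison result is printed rather than assumed.

```r
library(xgboost)

data(agaricus.train, package = "xgboost")
data(agaricus.test, package = "xgboost")
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)

param <- list(max_depth = 2, eta = 1, nthread = 1,
              subsample = 0.5, colsample_bytree = 0.1,
              objective = "binary:logistic")

# Hypothetical helper: reset R's RNG, train, and return test predictions.
train_once <- function(seed) {
  set.seed(seed)
  bst <- xgb.train(params = param, data = dtrain, nrounds = 5)
  predict(bst, dtest)
}

pred_a <- train_once(2020)
pred_b <- train_once(2020)

# TRUE means the two runs produced bit-identical predictions.
same_on_this_machine <- identical(pred_a, pred_b)
print(same_on_this_machine)
```

If this prints FALSE on one machine, the problem is not specific to switching OSes and would be worth reporting separately.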

hcho3 · June 30, 2020, 7:48am

@mathdvg And if you manage to fix the issue yourself, feel free to submit a pull request. Contributions are absolutely welcome!

OlivelliAri · February 14, 2024, 3:14pm

Hi, is there any update on this? I seem to have the same problem between macOS and Ubuntu when running the same code in Python. I am using XGBoost 2.0.1.

Thank you very much!