[jvm-packages] Is scala spark xgboost 0.81 model training reproducible

In my use case, the model training result must be exactly reproducible. However, I cannot get identical models across multiple training runs that use exactly the same data and random seed. Here are the details:

  1. I am using the pre-compiled Scala Spark XGBoost version 0.81-criteo-20180821 on CDH 2.3.1.
  2. I set the same random seed.
  3. The training data is identical in row order, column order, and partitioning.

model.summary always shows differences in trainObjectiveHistory starting from the 6th digit after the decimal point.
Am I missing something? I assumed I should be able to replicate the numbers exactly.

To Community members and developers,

A further test enforcing nthread=1 still does not replicate exactly, although the difference in logloss stays within the 6th digit after the decimal point.

I hope someone can clarify this for me.

Thanks
Yao

In distributed Spark, this is not necessarily true, since we don’t know how the data gets coalesced, reduced, collected, etc. Spark will manage a lot of such data movements, and we cannot control them. Please keep in mind that addition is not associative in floating-point arithmetic: a + (b+c) is not necessarily equal to (a + b) + c.
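To make the non-associativity point concrete, here is a minimal self-contained Python example. Regrouping the same three numbers changes the intermediate rounding and therefore the result:

```python
# Floating-point addition is not associative: the two groupings below
# round differently at the intermediate step.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # 0.6000000000000001
right = a + (b + c)  # 0.6
assert left != right
```

In a distributed job, the grouping of additions depends on how the data is partitioned and reduced, so the effective parenthesization can change from run to run.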

@CodingCat What do you think?

Agreed, you may get a different partitioning of the data in different runs.

Thanks, this is what I suspected for a distributed computation environment. We will argue to our model review team that this level of reproducibility is acceptable.

Hi @hcho3 and @CodingCat,
In XGBoost on Spark, after the change made here that enables deterministic partitioning when checkpointing is enabled, should the model be deterministic across several runs with the same input?
I am using the same input data, coalescing the data in my Python script (df = df.coalesce(…)), setting `spark.task.cpus` to `1` in the Spark config, fixing the random seed, setting nthread=1 in XGBoostClassifier, and setting the new needDeterministicRepartitioning flag to TRUE.
After all of this, I still get different models when using more than 2 workers. Is this expected? Why?
Thanks!
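For context, here is a conceptual Python sketch of what deterministic repartitioning does (this is not the actual xgboost4j-spark implementation; the hash function and partition count are illustrative). Each row is assigned to a partition by a stable hash of its content, so the assignment no longer depends on the order in which Spark happens to deliver the rows:

```python
import hashlib

def partition_of(row, num_partitions):
    """Assign a row to a partition using a stable hash of its content."""
    digest = hashlib.md5(repr(row).encode()).hexdigest()
    return int(digest, 16) % num_partitions

rows = [(i, 0.1 * i) for i in range(20)]

# Two "runs" that see the same rows in different arrival orders.
assign_a = {row: partition_of(row, 4) for row in rows}
assign_b = {row: partition_of(row, 4) for row in reversed(rows)}

# Every row lands in the same partition in both runs.
assert assign_a == assign_b
```

Note that this only fixes which rows end up together; the reduction order across partitions during training can still vary with the number of workers.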

@merleyc I assume that if we set the number of executors to 1 and spark.task.cpus = number of cores per executor, you would get the same result, right?

Yes, @chenqin. I got the same models when spark.task.cpus is set to 1 or to the number of cores per executor, and the number of executors is <= 2.
Here are my experiments:

| # executors (nworkers in XGBoostClassifier) | # cores per executor | # threads in XGBoostClassifier | spark.task.cpus in SparkConf | Same model? |
| --- | --- | --- | --- | --- |
| 1 | 28 | 1 | 1 | TRUE |
| 1 | 28 | 1 | 28 | TRUE |
| 2 | 14 | 1 | 1 | TRUE |
| 4 | 7 | 1 | 1 | Got 2 identical models and 1 different |
| 2 | 28 | 1 | 1 | TRUE |
| 28 | 1 | 1 | 1 | FALSE |

Hi @chenqin, @hcho3 and @CodingCat,
Any idea why this is happening? Thanks a lot!