[jvm-packages] Is Scala Spark XGBoost 0.81 model training reproducible?


#1

In my use case, I need model training to be exactly reproducible. However, I cannot get an identical model across multiple training runs that use exactly the same data and random seed. Here are the details:

  1. I am using the pre-compiled Scala Spark XGBoost version 0.81-criteo-20180821 on CDH 2.3.1.
  2. I set the same random seed.
  3. The training data is identical in row order, column order, and partitioning.

model.summary always shows differences in trainObjectiveHistory starting from the 6th digit after the decimal point.
Am I missing something? I expected to be able to replicate the numbers exactly.


#2

To Community members and developers,

A further test enforcing nthread=1 still cannot replicate the results exactly; however, the difference in logloss only appears from the 6th digit after the decimal point.

I hope someone can clarify this for me.

Thanks
Yao


#3

In distributed Spark, exact reproducibility is not guaranteed, since we don’t know how the data gets coalesced, reduced, collected, etc. Spark manages many such data movements, and we cannot control them. Please keep in mind that addition is not associative in floating-point arithmetic: a + (b + c) is not necessarily equal to (a + b) + c.
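
To make the non-associativity point concrete, here is a minimal plain-Scala sketch. It is not XGBoost's or Spark's actual aggregation code; the object name `FloatSumOrder` and the helper `sumInChunks` are hypothetical names introduced only for illustration. The helper mimics summing per-partition partial results before combining them, so changing the chunk size stands in for data being grouped differently between runs.

```scala
object FloatSumOrder {
  // Hypothetical helper: sum each chunk first, then add the partial sums,
  // loosely imitating per-partition aggregation followed by a reduce.
  def sumInChunks(xs: Seq[Double], chunkSize: Int): Double =
    xs.grouped(chunkSize).map(_.sum).sum

  def main(args: Array[String]): Unit = {
    val a = 0.1
    val b = 0.2
    val c = 0.3

    // Same three numbers, two groupings.
    println((a + b) + c) // 0.6000000000000001
    println(a + (b + c)) // 0.6

    // Same data, different chunking: the totals can differ in the last bits.
    val xs = Seq(a, b, c)
    println(sumInChunks(xs, 2)) // (a + b) + c
    println(sumInChunks(xs, 1)) // accumulates one value at a time
  }
}
```

A difference of this size in each partial sum is tiny, but it can accumulate over many rows and boosting rounds, which would be consistent with a training metric that agrees only to a limited number of digits.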

@CodingCat What do you think?


#4

Agreed, you may have a different partitioning of the data in different runs.


#5

Thanks, this is what I suspected in a distributed computation environment. We will argue to our model review team that this level of reproducibility is acceptable.