[jvm-packages] AUC gap between spark and python


I am using XGBoost for binary classification with the objective set to “binary:logistic”. I use the same parameters in both the Spark version (0.80) and the Python version (0.82), but there is a large AUC gap between them.
With Spark I can only reach an AUC of 0.75, while the Python version reaches 0.83.
The training data is large, which Spark handles easily. When training with the Python version, I append “#dtrain.cache” to the path when loading the training data.
@CodingCat Have you seen a significant drop in the AUC metric due to how Spark manages distributed data? I know that XGBoost does not control data movement when using Spark.


No. Please check whether the feature vectors in the training dataset are 0-based for Spark, whether the missing value is set properly, etc.
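
One quick way to check the index base is sketched below. It assumes the training data is in LibSVM text format (a label followed by `index:value` pairs, with no `qid` field), which this thread does not confirm. `min_feature_index` and `shift_indices` are hypothetical helper names, not part of XGBoost.

```python
def min_feature_index(lines):
    """Return the smallest feature index found in LibSVM-formatted lines."""
    smallest = None
    for line in lines:
        # Drop trailing comments and skip blank lines.
        body = line.split("#", 1)[0].strip()
        if not body:
            continue
        # The first token is the label; the rest are index:value pairs.
        for token in body.split()[1:]:
            idx = int(token.split(":", 1)[0])
            if smallest is None or idx < smallest:
                smallest = idx
    return smallest


def shift_indices(line, offset):
    """Return a LibSVM line with every feature index shifted by `offset`."""
    label, *pairs = line.split()
    shifted = []
    for pair in pairs:
        idx, value = pair.split(":", 1)
        shifted.append(f"{int(idx) + offset}:{value}")
    return " ".join([label] + shifted)


# Made-up sample rows, for illustration only.
sample = ["1 15:0.5 20:1.0", "0 16:2.0 31:0.25"]
print(min_feature_index(sample))
print(shift_indices(sample[0], -1))
```

If the smallest index is 1 across the whole file, the data is probably 1-based and may need shifting before it is read as 0-based. A smallest index well above 1 does not by itself say which base was intended.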


Thanks for your reply.
The first feature index in my data is 15, so even after the Spark version subtracts one from it, both the Spark version and the Python version handle the indices fine.
I don’t set the missing value on either side.
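
For reference, leaving the missing value unset on both sides does not by itself guarantee that absent entries are read the same way in both pipelines. The toy sketch below is plain Python, not the actual XGBoost or Spark code; `to_dense` and `count_missing` are hypothetical helpers. It shows how one sparse row turns into two different inputs depending on whether absent entries are filled with 0 or with NaN.

```python
import math


def to_dense(pairs, num_features, fill):
    """Expand (index, value) pairs into a dense row, using `fill` for absent entries."""
    row = [fill] * num_features
    for idx, value in pairs:
        row[idx] = value
    return row


def count_missing(row, missing):
    """Count entries that a learner would treat as missing for a given marker."""
    if math.isnan(missing):
        return sum(1 for v in row if math.isnan(v))
    return sum(1 for v in row if v == missing)


# Made-up sparse row: index 3 holds an explicit zero.
sparse_row = [(0, 1.5), (3, 0.0)]

as_zero = to_dense(sparse_row, 5, 0.0)              # absent entries read as 0
as_missing = to_dense(sparse_row, 5, float("nan"))  # absent entries read as NaN

print(count_missing(as_zero, float("nan")))     # missing marker NaN, zero-filled row
print(count_missing(as_missing, float("nan")))  # missing marker NaN, NaN-filled row
print(count_missing(as_zero, 0.0))              # missing marker 0, zero-filled row
```

The counts differ across the three cases, so it is worth confirming how each side fills absent entries and which value it treats as missing before comparing AUC.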