[jvm-packages] AUC gap between spark and python

Hi,
I use xgboost do binary classification, and set object as “binary:logistic”. I use the same parameters in both spark version 0.80 and python version 0.82, but get big AUC gap.
In spark, I can only get AUC 0.75 while 0.83 from python version.
Since the training data is large, spark can handle easily. When training with python version, I add “#dtrain.cache” when loading traing data.
Anybody can help me figure out why? Thanks


@CodingCat Have you seen a significant drop in the AUC metric due to how Spark manages distributed data? I know that XGBoost does not control data movement when using Spark.

No. Please check whether the feature indices in the training dataset are 0-based for Spark, whether the missing value is set properly, etc.
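A quick way to sanity-check the indexing convention of a LibSVM-format file is to look at the smallest feature index that appears, and to shift indices down by one if they turn out to be 1-based. This is a plain-Python sketch; the helper names (`min_feature_index`, `to_zero_based`) are hypothetical, not part of XGBoost:

```python
# Sketch: inspect LibSVM-format lines ("label idx:val idx:val ...")
# to see which indexing base they use, and shift 1-based indices to
# 0-based. Helper names are made up for illustration.

def min_feature_index(lines):
    """Return the smallest feature index seen across all lines."""
    smallest = None
    for line in lines:
        for token in line.split()[1:]:  # skip the label
            idx = int(token.split(":")[0])
            if smallest is None or idx < smallest:
                smallest = idx
    return smallest

def to_zero_based(line):
    """Shift every feature index in one LibSVM line down by one."""
    label, *feats = line.split()
    shifted = [f"{int(i) - 1}:{v}" for i, v in (f.split(":") for f in feats)]
    return " ".join([label] + shifted)

sample = ["1 15:0.5 20:1.0", "0 16:2.0 31:0.25"]
print(min_feature_index(sample))  # 15 -> indices start well above 0
print(to_zero_based(sample[0]))   # "1 14:0.5 19:1.0"
```

If the minimum index is 0 the file is already 0-based; if it is 1 (or higher), loading it with a 1-based assumption on one side and a 0-based assumption on the other silently shifts every feature.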

Thanks for your reply.
My first feature index is 15, so even after the Spark version subtracts one from each index, both the Spark and Python versions can handle the data fine.
I don’t set a missing value on either side.

I see the same problem, in this case between the Spark and Scala APIs. The parameters are the same, and missing is set to NaN in Spark. The only difference is that in local mode the LabeledPoint is made of floats, whereas in Spark we use doubles.
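The float-versus-double difference is easy to quantify with a stdlib round-trip through IEEE-754 float32; per value it is on the order of 1e-9 relative error, so on its own it seems unlikely to explain an AUC gap of this size, but it does mean the two paths never feed bit-identical values to the booster:

```python
import struct

# Round-trip a Python float (a 64-bit double) through 32-bit float32,
# mimicking a LabeledPoint stored as floats locally but doubles in Spark.

def to_float32(x):
    """Return x rounded to the nearest IEEE-754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

x = 0.1
print(to_float32(x))           # 0.10000000149011612
print(abs(to_float32(x) - x))  # ~1.5e-09 absolute error
```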