[jvm-packages] AUC gap between spark and python

Hi,
I use xgboost do binary classification, and set object as “binary:logistic”. I use the same parameters in both spark version 0.80 and python version 0.82, but get big AUC gap.
In spark, I can only get AUC 0.75 while 0.83 from python version.
Since the training data is large, spark can handle easily. When training with python version, I add “#dtrain.cache” when loading traing data.
Anybody can help me figure out why? Thanks


@CodingCat Have you seen a significant drop in the AUC metric due to how Spark manages distributed data? I know that XGBoost does not control data movement when using Spark.

No. Please check whether the feature indices in the training dataset are 0-based for Spark, whether the missing value is set properly, etc.
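A quick way to sanity-check the indexing convention of a LibSVM-format file is to look at the smallest feature index that appears, and to shift indices down by one if they turn out to be 1-based. This is a plain-Python sketch; the helper names (`min_feature_index`, `to_zero_based`) are hypothetical, not part of XGBoost:

```python
# Sketch: inspect LibSVM-format lines ("label idx:val idx:val ...")
# to see which indexing base they use, and shift 1-based indices to
# 0-based. Helper names are made up for illustration.

def min_feature_index(lines):
    """Return the smallest feature index seen across all lines."""
    smallest = None
    for line in lines:
        for token in line.split()[1:]:  # skip the label
            idx = int(token.split(":")[0])
            if smallest is None or idx < smallest:
                smallest = idx
    return smallest

def to_zero_based(line):
    """Shift every feature index in one LibSVM line down by one."""
    label, *feats = line.split()
    shifted = [f"{int(i) - 1}:{v}" for i, v in (f.split(":") for f in feats)]
    return " ".join([label] + shifted)

sample = ["1 15:0.5 20:1.0", "0 16:2.0 31:0.25"]
print(min_feature_index(sample))  # 15 -> indices start well above 0
print(to_zero_based(sample[0]))   # "1 14:0.5 19:1.0"
```

If the minimum index is 0 the file is already 0-based; if it is 1 (or higher), loading it with a 1-based assumption on one side and a 0-based assumption on the other silently shifts every feature.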

Thanks for your reply.
My first feature index is 15, so even after the Spark version subtracts one from each index, both the Spark and Python versions can handle the data fine.
I don’t set a missing value on either side.

I see the same problem, in this case between the Spark and Scala APIs. The parameters are the same, and missing is set to NaN in Spark. The only difference is that in local mode the LabeledPoint is made of floats, whereas in Spark we use doubles.
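The float-versus-double difference is easy to quantify with a stdlib round-trip through IEEE-754 float32; per value it is on the order of 1e-9 relative error, so on its own it seems unlikely to explain an AUC gap of this size, but it does mean the two paths never feed bit-identical values to the booster:

```python
import struct

# Round-trip a Python float (a 64-bit double) through 32-bit float32,
# mimicking a LabeledPoint stored as floats locally but doubles in Spark.

def to_float32(x):
    """Return x rounded to the nearest IEEE-754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

x = 0.1
print(to_float32(x))           # 0.10000000149011612
print(abs(to_float32(x) - x))  # ~1.5e-09 absolute error
```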