I have a question about a performance difference between XGBoost for Scala and XGBoost for Spark.
I am training a fraud detection model on data from two banks, A and B, where bank B is somewhat smaller in terms of the number of credit card transactions. This is a classification problem with highly imbalanced classes (roughly 1:1000).
I train one model with Spark and one with Scala on the same dataset (data from both banks) with the same parameters. Then I compute model performance:
- the precision
- the number of true positives (TP)
- and the number of false positives (FP)
at the same threshold for the two models (Scala and Spark) on separate test sets for bank A and bank B. I find that:
- For bank A, the Spark and Scala models give nearly identical precision, TP and FP
- For bank B, the Scala model is worse than the Spark model. For example, Spark's precision is 41% while Scala's is 29%, which is a significant degradation in the context of my research.
- For bank B, the Scala model produces 50% more FP than the Spark model at the same threshold
- For bank B, the Scala model's precision is lower than the Spark model's at every threshold: it never exceeds 30%, while the Spark model's precision reaches as high as 50%.
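For concreteness, this is roughly how I compute TP, FP and precision at a fixed threshold. This is a self-contained sketch with made-up scores and labels, not my real evaluation pipeline:

```scala
// Sketch: confusion counts and precision at a fixed score threshold.
// The scores/labels in main are illustrative values, not real data.
object PrecisionAtThreshold {

  // Returns (TP, FP): among examples scored at or above the threshold,
  // count those with a positive label (TP) and a negative label (FP).
  def confusion(scores: Seq[Double], labels: Seq[Int], threshold: Double): (Int, Int) = {
    val predictedPositive = scores.zip(labels).filter { case (s, _) => s >= threshold }
    val tp = predictedPositive.count { case (_, y) => y == 1 }
    val fp = predictedPositive.count { case (_, y) => y == 0 }
    (tp, fp)
  }

  // Precision = TP / (TP + FP); defined as 0 when nothing is flagged.
  def precision(tp: Int, fp: Int): Double =
    if (tp + fp == 0) 0.0 else tp.toDouble / (tp + fp)

  def main(args: Array[String]): Unit = {
    val scores = Seq(0.9, 0.8, 0.6, 0.4, 0.2)
    val labels = Seq(1, 0, 1, 0, 0)
    val (tp, fp) = confusion(scores, labels, threshold = 0.5)
    println(s"TP=$tp FP=$fp precision=${precision(tp, fp)}")
  }
}
```

I sweep the threshold over the model scores and compare the two models at each threshold value in the same way.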
What could be the problem here? Why is performance on bank A nearly identical between the two models, while on bank B the Scala model is clearly worse than the Spark model?
I repeated training and testing several times, and on bank B the Scala model is always worse than the Spark model.
I am running XGBoost 0.7 on Scala 2.11.11 and Spark 2.2. Unfortunately, I can't install a newer version of Spark or XGBoost.
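For reference, the two training paths look roughly like this. This is only a sketch: parameter values, data loading, and the feature pipeline are placeholders, and the method names assume the 0.7-era XGBoost4J / XGBoost4J-Spark interfaces:

```scala
// Sketch of the two training paths (placeholder values throughout;
// assumes XGBoost4J / XGBoost4J-Spark 0.7 APIs).
import ml.dmlc.xgboost4j.scala.{DMatrix, XGBoost => XGBoostLocal}
import ml.dmlc.xgboost4j.scala.spark.XGBoost

// The same parameter map is passed to both trainers.
val params = Map(
  "objective"        -> "binary:logistic",
  "eta"              -> 0.1,      // placeholder
  "max_depth"        -> 6,        // placeholder
  "scale_pos_weight" -> 1000.0    // classes are roughly 1:1000
)
val numRounds = 100               // placeholder

// Spark model: trained on a DataFrame of (features, label).
val sparkModel = XGBoost.trainWithDataFrame(trainingDF, params, numRounds, nWorkers = 8)

// Pure-Scala model: same parameters, data loaded into a DMatrix.
val dtrain     = new DMatrix("train.libsvm")  // placeholder path
val scalaModel = XGBoostLocal.train(dtrain, params, numRounds)
```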
For context: our production model must be in pure Scala, but we first explore and cross-validate models in Spark; the final best model is then re-trained in Scala and put into production. Training the Scala model on our cluster takes more than 3 hours, while training the Spark model takes about 25 minutes.