I have a question about a performance difference between XGBoost for Scala and XGBoost for Spark.
I am training a fraud detection model on data from two banks, A and B, where bank B is somewhat smaller in terms of the number of credit card transactions. This is a classification problem with highly imbalanced classes (roughly 1:1000).
I train one model with Spark and one with Scala on the same dataset (data from both banks) with the same parameters. Then I compute model performance:
- the precision
- the number of true positives (TP)
- and the number of false positives (FP)
at the same threshold for the two models (Scala and Spark) on separate test sets for bank A and bank B. I find that:
- For bank A, the Spark and Scala models give nearly identical precision, TP and FP
- For bank B, the Scala model is worse than the Spark model. For example, Spark's precision is 41% while Scala's is 29%, which is a significant degradation in the context of my research.
- For bank B, the Scala model produces 50% more FP than the Spark model at the same threshold
- For bank B, the Scala model's precision is lower than the Spark model's at every threshold: it never exceeds 30%, while the Spark model's precision reaches as high as 50%.
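For concreteness, this is roughly how I compute TP, FP and precision at a fixed threshold. This is a self-contained sketch with made-up scores and labels, not my real evaluation pipeline:

```scala
// Sketch: confusion counts and precision at a fixed score threshold.
// The scores/labels in main are illustrative values, not real data.
object PrecisionAtThreshold {

  // Returns (TP, FP): among examples scored at or above the threshold,
  // count those with a positive label (TP) and a negative label (FP).
  def confusion(scores: Seq[Double], labels: Seq[Int], threshold: Double): (Int, Int) = {
    val predictedPositive = scores.zip(labels).filter { case (s, _) => s >= threshold }
    val tp = predictedPositive.count { case (_, y) => y == 1 }
    val fp = predictedPositive.count { case (_, y) => y == 0 }
    (tp, fp)
  }

  // Precision = TP / (TP + FP); defined as 0 when nothing is flagged.
  def precision(tp: Int, fp: Int): Double =
    if (tp + fp == 0) 0.0 else tp.toDouble / (tp + fp)

  def main(args: Array[String]): Unit = {
    val scores = Seq(0.9, 0.8, 0.6, 0.4, 0.2)
    val labels = Seq(1, 0, 1, 0, 0)
    val (tp, fp) = confusion(scores, labels, threshold = 0.5)
    println(s"TP=$tp FP=$fp precision=${precision(tp, fp)}")
  }
}
```

I sweep the threshold over the model scores and compare the two models at each threshold value in the same way.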
What could be the problem here? Why is performance on bank A nearly identical between the two models, while on bank B the Scala model is clearly worse than the Spark model?
I repeated training and testing several times, and on bank B the Scala model is always worse than the Spark model.
I am running XGBoost 0.7 on Scala 2.11.11 and Spark 2.2. Unfortunately, I can't install a newer version of Spark or XGBoost.
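For reference, the two training paths look roughly like this. This is only a sketch: parameter values, data loading, and the feature pipeline are placeholders, and the method names assume the 0.7-era XGBoost4J / XGBoost4J-Spark interfaces:

```scala
// Sketch of the two training paths (placeholder values throughout;
// assumes XGBoost4J / XGBoost4J-Spark 0.7 APIs).
import ml.dmlc.xgboost4j.scala.{DMatrix, XGBoost => XGBoostLocal}
import ml.dmlc.xgboost4j.scala.spark.XGBoost

// The same parameter map is passed to both trainers.
val params = Map(
  "objective"        -> "binary:logistic",
  "eta"              -> 0.1,      // placeholder
  "max_depth"        -> 6,        // placeholder
  "scale_pos_weight" -> 1000.0    // classes are roughly 1:1000
)
val numRounds = 100               // placeholder

// Spark model: trained on a DataFrame of (features, label).
val sparkModel = XGBoost.trainWithDataFrame(trainingDF, params, numRounds, nWorkers = 8)

// Pure-Scala model: same parameters, data loaded into a DMatrix.
val dtrain     = new DMatrix("train.libsvm")  // placeholder path
val scalaModel = XGBoostLocal.train(dtrain, params, numRounds)
```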
For context: our production model must be in pure Scala, but we first explore and cross-validate models in Spark; the final best model is then re-trained in Scala and put into production. Training the Scala model on our cluster takes more than 3 hours, while training the Spark model takes about 25 minutes.