Huge AUC drop when upgrading from XGBoost4J-Spark 0.90

Hello,

I’m upgrading XGBoost4J-Spark from 0.90 to 1.0.0 (we need to stay on Spark 2.4.x for a while). When testing with an identical dataset and the same XGBoost training parameters, there is a huge AUC drop: the model built with 0.90 gives train AUC 0.944 and test AUC 0.857, while the model built with 1.0.0 gives train AUC 0.614 and test AUC 0.613.

In both tests the dataset is the same, the models are built with Scala 2.11, and the training data points are DenseVectors of type Double. The missing value is configured as NaN, and all parameters are identical. The model is a binary classifier created with the XGBoostClassifier class, and the AUC is obtained by running BinaryClassificationEvaluator on the fitted models.
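
For context, the training and evaluation code is essentially the following (simplified sketch; `xgbParams`, `trainDF` and `testDF` stand in for our real parameter map and datasets):

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Sketch: trainDF/testDF have a "features" column of DenseVector[Double]
// and a numeric "label" column; xgbParams is the parameter Map shown below.
val classifier = new XGBoostClassifier(xgbParams)
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setMissing(Float.NaN)

val model = classifier.fit(trainDF)

// AUC from the raw prediction column, as in both of our tests
val evaluator = new BinaryClassificationEvaluator()
  .setRawPredictionCol("rawPrediction")
  .setMetricName("areaUnderROC")

val trainAuc = evaluator.evaluate(model.transform(trainDF))
val testAuc  = evaluator.evaluate(model.transform(testDF))
```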

The training params taken from the generated models with both versions are:

0.90:

Map(alpha -> 0.0, numEarlyStoppingRounds -> 0, lambdaBias -> 0.0, trainTestRatio -> 1.0, rateDrop -> 0.0, cacheTrainingSet -> false, silent -> 0, seed -> 88, batchSize -> 32768, useExternalMemory -> false, normalizeType -> tree, scalePosWeight -> 10.816430275105986, colsampleBylevel -> 1.0, subsample -> 0.8, timeoutRequestWorkers -> 1800000, growPolicy -> depthwise, lambda -> 1.0, featuresCol -> features, maxDepth -> 5, sketchEps -> 0.03, sampleType -> uniform, numRound -> 1000, checkpointPath -> , objective -> binary:logistic, customEval -> null, checkpointInterval -> -1, evalMetric -> auc, labelCol -> label, baseScore -> 0.5, predictionCol -> prediction, missing -> NaN, customObj -> null, colsampleBytree -> 0.8, treeMethod -> hist, eta -> 0.1, verbosity -> 1, numWorkers -> 32, rawPredictionCol -> rawPrediction, maxBin -> 256, probabilityCol -> probability, gamma -> 0.0, treeLimit -> 0, trackerConf -> TrackerConf(0,python), nthread -> 1, minChildWeight -> 1.0, maxDeltaStep -> 0.0, skipDrop -> 0.0)

1.0.0:

Map(alpha -> 0.0, numEarlyStoppingRounds -> 0, lambdaBias -> 0.0, trainTestRatio -> 1.0, rateDrop -> 0.0, cacheTrainingSet -> false, silent -> 0, seed -> 88, batchSize -> 32768, useExternalMemory -> false, allowNonZeroForMissing -> false, normalizeType -> tree, scalePosWeight -> 10.816430275105986, rabitRingReduceThreshold -> 32768, colsampleBylevel -> 1.0, subsample -> 0.8, timeoutRequestWorkers -> 1800000, growPolicy -> depthwise, lambda -> 1.0, featuresCol -> features, maxDepth -> 5, sketchEps -> 0.03, sampleType -> uniform, numRound -> 1000, checkpointPath -> , objective -> binary:logistic, customEval -> null, checkpointInterval -> -1, evalMetric -> auc, labelCol -> label, baseScore -> 0.5, predictionCol -> prediction, missing -> NaN, customObj -> null, colsampleBytree -> 0.8, treeMethod -> hist, eta -> 0.1, verbosity -> 1, numWorkers -> 32, rawPredictionCol -> rawPrediction, dmlcWorkerConnectRetry -> 5, maxBin -> 256, probabilityCol -> probability, gamma -> 0.0, treeLimit -> 0, trackerConf -> TrackerConf(0,python), rabitTimeout -> -1, nthread -> 1, minChildWeight -> 1.0, maxDeltaStep -> 0.0, skipDrop -> 0.0)

All parameters are identical except for the ones newly introduced in 1.0.0 (allowNonZeroForMissing, rabitRingReduceThreshold, dmlcWorkerConnectRetry and rabitTimeout). For what it’s worth, I ran another test with XGBoost4J-Spark 1.1.1 and saw a similar AUC drop.

I’m a bit lost on how to debug this issue, since everything is the same in both tests except for the XGBoost4J-Spark artifact version. Could it be caused by the new config parameters introduced in 1.0.0? I’m currently testing with allowNonZeroForMissing set to true (although I understand this is only needed with sparse vectors; see the sketch below), but still no luck. Are you aware of any fundamental behaviour changes in 1.0.0 that could explain this, or something that needs to be done differently than with 0.90? Any help is appreciated.
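
Concretely, what I’m testing is something like this (sketch; assuming the setAllowNonZeroForMissing setter from the 1.0.0 API):

```scala
// Sketch: same classifier as above, but with the new 1.0.0 flag enabled.
// As I understand it, this only relaxes the check that sparse-vector zeros
// must coincide with the configured missing value.
val classifierWithFlag = new XGBoostClassifier(xgbParams)
  .setMissing(Float.NaN)
  .setAllowNonZeroForMissing(true) // no effect on the AUC drop in my tests
```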

Thanks!

Hi, could you please open an issue with a reproducible example that we can run?

Hi @jiamingy,

Thanks for your reply. After further investigation, I’ve identified the cause: this particular dataset contained infinite values in three of the columns. I’m not sure how they got there, but I’ve removed them from the data-point generation process and used NaN instead.

Both releases, 0.90 and 1.0.0, accept infinite values during model fitting without raising an error, but due to some change in how they are handled in 1.0.0, the AUC of the resulting model is much worse. Once these infinities are replaced by NaNs, the test-dataset AUC of both models is similar and greater than 0.8.
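
In case it helps anyone else hitting this, the cleanup amounts to something like the following (sketch; `rawDF` and the column name are from our pipeline):

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, udf}

// Replace +/-Infinity with NaN so XGBoost treats those entries as missing
// (missing is configured as NaN in our params).
val sanitizeFeatures = udf { v: Vector =>
  Vectors.dense(v.toArray.map(x => if (x.isInfinite) Double.NaN else x))
}

val cleanedDF = rawDF.withColumn("features", sanitizeFeatures(col("features")))
```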

Regards