Scala-trained booster loaded into Python gives different predictions

kongwei9901 · May 1, 2019, 9:16pm

We trained our XGBoost model with the Scala API and saved it. We then loaded the model with the Python API and scored the exact same record, but the two predicted probabilities differ widely: Scala gives 0.005228, while Python gives 0.01544636. I thought it was caused by the index difference (Python starts from 0 while Scala starts from 1), so I inserted an empty column at the beginning of my input data and scored again, but the results still do not match. Has anyone else had this issue? Can someone help take a look? Thanks in advance, and I appreciate your response!

I saw a similar thread on GitHub:

Best,
Wei

hcho3 · May 2, 2019, 5:37am

Does your data have missing values? If so, how are they represented?

kongwei9901 · May 2, 2019, 1:49pm

No missing values. Missing numerical values were already imputed as -999999. Missing categorical values were first imputed as “999999”, then passed through StringIndexer and OneHotEncoder. The data preparation pipeline is in Scala; only the final prepared data was converted to a Python DMatrix.
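
One way the prepared data can still diverge between the two APIs, offered here as an assumption rather than a diagnosis of this particular case: Spark's OneHotEncoder produces sparse vectors, so what the absent entries mean depends on how each row is expanded before scoring. The sketch below is plain Python; `densify` is a hypothetical helper written for illustration and is not part of XGBoost or Spark.

```python
def densify(size, indices, values, fill):
    """Expand a sparse row (indices/values) into a dense list.

    `fill` is the value given to every position absent from `indices`.
    Hypothetical helper, for illustration only.
    """
    row = [fill] * size
    for i, v in zip(indices, values):
        row[i] = v
    return row

# One imputed numeric feature followed by a 3-slot one-hot block
# in which only the middle slot is active (made-up record).
size, indices, values = 4, [0, 2], [-999999.0, 1.0]

# Interpretation 1: absent entries are real zeros.
as_zeros = densify(size, indices, values, fill=0.0)
# Interpretation 2: absent entries are missing values.
as_missing = densify(size, indices, values, fill=float("nan"))

print(as_zeros)
print(as_missing)
```

The two lists describe the same stored record, yet a tree model can route them differently wherever a split touches one of the one-hot columns.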

kongwei9901 · May 3, 2019, 5:38pm

Thanks Philip! You are right about the missing value representation. I manually scored a record, and the difference is mainly caused by the Scala API and the Python API interpreting missing values differently. Still investigating this…

Best,
Wei

hcho3 · May 3, 2019, 5:48pm

Yes, missing values can cause a lot of headaches. There is a proposed tutorial to clarify how to handle missing values: https://github.com/dmlc/xgboost/pull/4425. Can you look at it and see if it helps? Feel free to leave feedback on the proposed tutorial.
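
To make the effect of the missing-value interpretation concrete, here is a minimal sketch of scoring one record by hand against a single toy split. The node, threshold, leaf weights, and the `score` function are all made up for illustration. The only behavior it assumes is that a tree node sends a missing value down a learned default branch instead of comparing it to the threshold.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy split on feature 1: left if value < threshold, otherwise right.
# A missing value follows the "default" branch. All numbers are invented.
NODE = {"feature": 1, "threshold": 0.5, "default": "right",
        "left": -2.0, "right": -1.0}

def score(row, missing):
    """Return the predicted probability for `row` under a given missing marker.

    Toy rule: a value counts as missing if it is NaN or equals `missing`.
    """
    value = row[NODE["feature"]]
    is_missing = math.isnan(value) or value == missing
    if is_missing:
        branch = NODE["default"]
    else:
        branch = "left" if value < NODE["threshold"] else "right"
    return sigmoid(NODE[branch])

row = [-999999.0, 0.0, 1.0, 0.0]

# Zero is a real value: 0.0 < 0.5, so the record goes left.
p_zero_is_value = score(row, missing=float("nan"))
# Zero is the missing marker: the record follows the default branch (right).
p_zero_is_missing = score(row, missing=0.0)

print(p_zero_is_value, p_zero_is_missing)
```

The same model and the same record yield two different probabilities, which is why it helps to check that both APIs are given the same missing-value marker and the same dense or sparse representation.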