While debugging an issue where XGBoost4j returned completely different predictions from a model trained and saved in Python, it turned out that the culprit is the default value of the `missing` parameter in the `DMatrix` constructor. In both the Python and R interfaces the default value is `NaN`, but in Java it is `0.0f`. As a result, zero-valued features are treated as missing unless `missing` is explicitly specified as `Float.NaN`. This has no effect on models trained on non-negative feature values, where all splits have non-negative thresholds, but it leads to wrong predictions when the model contains splits with negative thresholds and no value for `missing` is specified.
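To make the failure mode concrete, here is a minimal, self-contained sketch of a single split with a negative threshold. It does not use the XGBoost4j API: the class `StumpDemo`, its constants, and the assumption that missing values follow a default "left" branch are all hypothetical, chosen only to illustrate how the choice of missing marker can change which leaf a zero-valued feature reaches.

```java
// Hypothetical single-split tree for illustration; not part of XGBoost4j.
final class StumpDemo {
    static final float THRESHOLD = -0.5f;      // a split with a negative threshold
    static final boolean DEFAULT_LEFT = true;  // assumed branch for missing values
    static final float LEFT_LEAF = -1.0f;
    static final float RIGHT_LEAF = 1.0f;

    // A value counts as missing if it equals the marker; NaN needs isNaN
    // because NaN == NaN is false.
    static boolean isMissing(float x, float missing) {
        return Float.isNaN(missing) ? Float.isNaN(x) : x == missing;
    }

    // Missing values take the default branch; all others are compared
    // against the threshold.
    static float predict(float x, float missing) {
        if (isMissing(x, missing)) {
            return DEFAULT_LEFT ? LEFT_LEAF : RIGHT_LEAF;
        }
        return x < THRESHOLD ? LEFT_LEAF : RIGHT_LEAF;
    }

    public static void main(String[] args) {
        float x = 0.0f;
        System.out.println(predict(x, Float.NaN)); // 1.0: 0.0 >= -0.5, goes right
        System.out.println(predict(x, 0.0f));      // -1.0: 0.0 treated as missing, goes left
    }
}
```

In this sketch the same input, `0.0f`, lands in different leaves depending only on the missing marker, which mirrors the mismatch described above when the Java default of `0.0f` is used with a model trained under the `NaN` default.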
Is there any particular reason for this choice? It seems to break the symmetry between the different interfaces.