While debugging an issue where XGBoost4j returned completely different predictions for a model trained and saved in Python, it turned out that the culprit was the default value of the `missing` parameter in the `DMatrix` constructor. In both the Python and R interfaces the default value is `NaN`, but in Java it is `0.0f`. As a result, zero-valued features are treated as missing unless `missing` is explicitly set to `Float.NaN`. This has no effect on models trained on non-negative feature values, where all splits have non-negative thresholds, but it leads to wrong predictions when the model contains splits with negative thresholds and no value for `missing` is specified.
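To make the effect concrete, here is a minimal, self-contained sketch of a single tree split. It does not use XGBoost itself: `MissingValueDemo`, `predict`, the leaf values, and the thresholds are all hypothetical, and it assumes the split's default direction for missing values is "left". It only illustrates how routing a zero-valued feature as missing can change the outcome when the threshold is negative.

```java
// Toy model of one decision-tree split (NOT XGBoost's implementation).
// Assumption: values below the threshold go left, and missing values
// follow the split's default direction.
final class MissingValueDemo {

    // Returns the leaf value reached by feature value x.
    // `missing` is the sentinel treated as "no value" (NaN is always missing).
    static float predict(float x, float missing, float threshold,
                         boolean defaultLeft, float leftLeaf, float rightLeaf) {
        boolean isMissing = Float.isNaN(x)
                || (!Float.isNaN(missing) && x == missing);
        boolean goLeft = isMissing ? defaultLeft : (x < threshold);
        return goLeft ? leftLeaf : rightLeaf;
    }

    public static void main(String[] args) {
        float x = 0.0f;              // a genuine zero-valued feature
        boolean defaultLeft = true;  // assumed default direction
        float left = -1.0f, right = 1.0f;

        // Negative threshold: 0.0 is not below -0.5, so it should go right.
        // With missing = 0.0f it is routed to the default (left) branch instead.
        System.out.println(predict(x, Float.NaN, -0.5f, defaultLeft, left, right));
        System.out.println(predict(x, 0.0f, -0.5f, defaultLeft, left, right));

        // Positive threshold: both settings send 0.0 to the left branch.
        System.out.println(predict(x, Float.NaN, 0.5f, defaultLeft, left, right));
        System.out.println(predict(x, 0.0f, 0.5f, defaultLeft, left, right));
    }
}
```

Under these assumptions, the two calls with the negative threshold disagree, while the two calls with the positive threshold agree, which matches the behaviour described above.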
Is there any particular reason for this choice? It seems to break the symmetry between the different interfaces.