XGBoost4j default missing value in DMatrix constructor

While debugging an issue where XGBoost4j returned completely different predictions from a model trained and saved in Python, it turned out that the culprit was the default value of the missing parameter in the DMatrix constructor. In both the Python and R interfaces the default is NaN, but in Java it is 0.0f. As a result, zero-valued features are treated as missing unless missing is explicitly set to Float.NaN. This has no effect on models trained on non-negative feature values, where all splits have non-negative thresholds, but it leads to wrong predictions when the model contains splits with negative thresholds and no value for missing is specified.
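To make the failure mode concrete, here is a minimal sketch in plain Java. It is a toy model of a single split, not XGBoost's actual code: the predict helper, the leaf values, and the assumption that missing values default to the left branch are all hypothetical and chosen only for illustration.

```java
// Toy model of one tree split; NOT XGBoost's implementation.
// All names and values here are hypothetical.
class MissingValueDemo {
    // Go left when x < threshold; values equal to `missing` follow defaultLeft instead.
    static float predict(float x, float missing, float threshold,
                         boolean defaultLeft, float leftLeaf, float rightLeaf) {
        // NaN never compares equal to itself, so it needs an explicit check.
        boolean isMissing = Float.isNaN(missing) ? Float.isNaN(x) : x == missing;
        boolean goLeft = isMissing ? defaultLeft : x < threshold;
        return goLeft ? leftLeaf : rightLeaf;
    }

    public static void main(String[] args) {
        float threshold = -0.5f; // a split with a negative threshold
        float x = 0.0f;          // a genuine zero-valued feature

        // missing = NaN: 0.0 is a real value and 0.0 < -0.5 is false, so it goes right.
        System.out.println(predict(x, Float.NaN, threshold, true, -1.0f, 1.0f)); // prints 1.0

        // missing = 0.0f: the same 0.0 is now "missing" and takes the default (left) branch.
        System.out.println(predict(x, 0.0f, threshold, true, -1.0f, 1.0f));      // prints -1.0
    }
}
```

The same input lands in a different leaf depending only on the missing sentinel, which matches the Python vs. Java discrepancy described above.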

Is there any particular reason for this choice? It seems to break the symmetry between the different interfaces.

In many Spark applications, it is common to treat 0 as a missing value. For example, when using a VectorAssembler, Spark will automatically choose between DenseVector and SparseVector, depending on how many zeros each column has. See https://stackoverflow.com/questions/35844330/vectorassembler-output-only-to-densevector

Hm, I thought Spark uses null and NaN to represent missing values, and those are nowhere near the same as the zero elements that are simply not stored in a SparseVector.

XGBoost treats the zeros that are not stored in a SparseVector as missing.
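A rough illustration of why that happens, again in plain Java rather than Spark or XGBoost code: once a row is stored sparsely, a reader can no longer tell "zero" from "absent". The toSparse and read helpers below are hypothetical, and returning NaN for an absent entry is an assumption made to mimic the behaviour described above.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy sparse storage; NOT Spark's SparseVector or XGBoost's DMatrix.
class SparseZeroDemo {
    // Keep only the non-zero entries, keyed by their index.
    static Map<Integer, Float> toSparse(float[] dense) {
        Map<Integer, Float> stored = new TreeMap<>();
        for (int i = 0; i < dense.length; i++) {
            if (dense[i] != 0.0f) {
                stored.put(i, dense[i]);
            }
        }
        return stored;
    }

    // An index that was never stored is reported as missing (NaN here, by assumption).
    static float read(Map<Integer, Float> sparse, int index) {
        Float value = sparse.get(index);
        return value == null ? Float.NaN : value;
    }

    public static void main(String[] args) {
        float[] row = {0.0f, 1.5f, 0.0f, -2.0f};
        Map<Integer, Float> sparse = toSparse(row);

        System.out.println(sparse);          // prints {1=1.5, 3=-2.0}
        System.out.println(read(sparse, 0)); // prints NaN: the original 0.0 is gone
        System.out.println(read(sparse, 1)); // prints 1.5
    }
}
```

So whether a zero ends up as a value or as missing can depend on which representation the row happened to get.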

I see. Thank you for the clarification.