How does XGBoost handle missing data?


I understand that XGBoost can handle missing data, but I cannot find a clear description of how it does so. Could you please explain exactly how XGBoost handles missing data? Or, if it has been explained before, could you point me in the right direction?



You can take a look at Section 3.4 of the XGBoost paper [1]:

“In many real-world problems, it is quite common for the input x to be sparse. There are multiple possible causes for sparsity: 1) presence of missing values in the data; 2) frequent zero entries in the statistics; and, 3) artifacts of feature engineering such as one-hot encoding. It is important to make the algorithm aware of the sparsity pattern in the data. In order to do so, we propose to add a default direction in each tree node, which is shown in Fig. 4. When a value is missing in the sparse matrix x, the instance is classified into the default direction. There are two choices of default direction in each branch. The optimal default directions are learnt from the data. The algorithm is shown in Alg. 3. The key improvement is to only visit the non-missing entries I_k. The presented algorithm treats the non-presence as a missing value and learns the best direction to handle missing values. The same algorithm can also be applied when the non-presence corresponds to a user specified value by limiting the enumeration only to consistent solutions.”
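
To make the quoted passage concrete, here is a minimal Python sketch of the two-pass idea behind Alg. 3 for a single feature. It is an illustration under stated assumptions, not XGBoost's actual code. The function name `best_split_with_default`, the use of NaN as the missing marker, the tuple it returns, and the rule of skipping tied values are all my own choices. The score follows the form used in Alg. 3 and leaves out the 1/2 factor and the gamma penalty. It assumes `lam > 0` and non-negative `h`.

```python
import math


def best_split_with_default(x, g, h, lam=1.0):
    """Sketch of sparsity-aware split finding for one feature.

    x   : feature values, float('nan') marks a missing entry (assumption)
    g, h: per-instance first- and second-order gradient statistics
    lam : L2 regularization (lambda in the paper), assumed > 0

    Returns (gain, threshold, default_direction). Non-missing instances
    with x <= threshold go left; missing instances follow the default.
    """
    G, H = sum(g), sum(h)
    parent_score = G * G / (H + lam)

    # Only the non-missing entries are visited (I_k in the paper).
    present = sorted(
        (i for i in range(len(x)) if not math.isnan(x[i])),
        key=lambda i: x[i],
    )
    best = (0.0, None, None)

    def score(GL, HL):
        GR, HR = G - GL, H - HL
        return GL * GL / (HL + lam) + GR * GR / (HR + lam) - parent_score

    # Pass 1: missing values go right. Scan non-missing values ascending,
    # accumulating the left-hand statistics.
    GL = HL = 0.0
    for pos, i in enumerate(present):
        GL += g[i]
        HL += h[i]
        is_last = pos + 1 == len(present)
        # Tied values cannot be separated by a threshold, so skip them.
        if not is_last and x[present[pos + 1]] == x[i]:
            continue
        gain = score(GL, HL)
        if gain > best[0]:
            best = (gain, x[i], "right")

    # Pass 2: missing values go left. Scan non-missing values descending,
    # accumulating the right-hand statistics.
    GR = HR = 0.0
    order = present[::-1]
    for pos, i in enumerate(order):
        GR += g[i]
        HR += h[i]
        is_last = pos + 1 == len(order)
        if not is_last and x[order[pos + 1]] == x[i]:
            continue
        # Everything strictly below x[i] stays left; if nothing is below,
        # only the missing instances end up on the left.
        threshold = -math.inf if is_last else x[order[pos + 1]]
        gain = score(G - GR, H - HR)
        if gain > best[0]:
            best = (gain, threshold, "left")

    return best


# Hypothetical toy data: the missing instance resembles the x = 1.0 instance.
print(best_split_with_default([1.0, 2.0, float("nan")], [-1, 1, -1], [1, 1, 1]))
```

In both passes the missing instances are never enumerated. Their statistics enter only through the totals `G` and `H`, so whichever side is not being accumulated receives them implicitly. In the library itself, the value treated as missing is set through the `missing` argument of `xgboost.DMatrix`, which is NaN by default.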



I’m just wondering how XGBoost finds the default direction when a given feature has only a single non-missing value, which is the case for one-hot encoded features. Can anyone answer this question? Maybe it is a very simple question, but I can’t see the answer here.