How does XGBoost handle fully missing value columns (libsvm)

vosilov · October 29, 2020, 2:46pm

Let me first lay out the context to better explain the situation/problem:
We use the libsvm file format to feed data into XGBoost (Java API). The libsvm file omits some column indices entirely, so those columns are effectively treated as containing only missing values when the file is loaded into XGBoost. How does XGBoost treat these columns? Does it ignore them? It would make sense for these constant-value columns to be ignored completely, as they carry no useful information.

In our experiments, it appears that they are not ignored and somehow affect the model. We tried an extreme case: building a model with only one feature. We created two libsvm files. The first contains only column index 0 (zero-based indexing). The second contains the same information, except that the index is offset: it starts at 4, so XGBoost sees 4 columns with only missing values plus a 5th column that holds the non-null values. If XGBoost ignored the all-NaN columns, these two libsvm files should give the same model and the same model performance. But this isn't happening.
I was wondering if anyone has run into the same issue, or if there is an explanation for it?
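
For concreteness, the two-file setup described above might look like the following minimal sketch. The labels and values are made up for illustration, and the helper names (`parse_libsvm`, `inferred_num_columns`) are hypothetical. It assumes the column count is inferred as the largest index plus one, which is what the behaviour described in this thread suggests, not something confirmed here.

```python
def parse_libsvm(text):
    """Parse libsvm text into (labels, rows); each row maps column index -> value."""
    labels, rows = [], []
    for line in text.strip().splitlines():
        parts = line.split()
        labels.append(float(parts[0]))
        rows.append({int(k): float(v) for k, v in (p.split(":") for p in parts[1:])})
    return labels, rows


def inferred_num_columns(rows):
    # Assumption: with zero-based indexing, the column count is max index + 1.
    return max((max(r) for r in rows if r), default=-1) + 1


# Hypothetical data: same labels and values, only the column index differs.
file_a = "1 0:0.5\n0 0:1.5\n1 0:2.5\n"
file_b = "1 4:0.5\n0 4:1.5\n1 4:2.5\n"

labels_a, rows_a = parse_libsvm(file_a)
labels_b, rows_b = parse_libsvm(file_b)

cols_a = inferred_num_columns(rows_a)
cols_b = inferred_num_columns(rows_b)

# Columns that never appear in file_b, i.e. the "fully missing" ones.
empty_b = [c for c in range(cols_b) if all(c not in r for r in rows_b)]

print(cols_a, cols_b, empty_b)
```

Under that assumption, the second file is seen as a 5-column matrix in which columns 0 to 3 are entirely missing.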

hcho3 · October 29, 2020, 6:16pm

XGBoost treats these columns as if they were columns filled with a single constant.

vosilov · October 29, 2020, 8:32pm

So, they should have no impact on the model performance, am I right?

hcho3 · October 30, 2020, 3:34am

Maybe, maybe not. Since the column is not discriminative, the split-finding algorithm should always choose other columns unless it has no other options. You can verify this by creating a text dump of the model.
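
One way to check a text dump is to count which features appear in split conditions. The sketch below is only illustrative: `sample_dump` is hand-written in the usual `index:[feature<threshold]` dump layout, not output from a real model, and `features_used` is a hypothetical helper. The feature names `f0` to `f4` assume the default naming for five columns.

```python
import re
from collections import Counter

# Matches the feature name inside a split condition such as "[f4<1]".
SPLIT_RE = re.compile(r"\[([^<\]]+)<")


def features_used(dump_text):
    """Count how often each feature appears as a split in a text dump."""
    return Counter(SPLIT_RE.findall(dump_text))


# Hypothetical dump, written by hand for illustration only.
sample_dump = """booster[0]:
0:[f4<1] yes=1,no=2,missing=1
    1:leaf=0.2
    2:leaf=-0.2
booster[1]:
0:[f4<2] yes=1,no=2,missing=1
    1:leaf=0.1
    2:leaf=-0.1
"""

used = features_used(sample_dump)

# Assumed names of the fully missing columns in the offset file.
constant_columns = {"f0", "f1", "f2", "f3"}
chosen_constant = sorted(constant_columns & set(used))

print(dict(used), chosen_constant)
```

If any of the fully missing columns showed up in `chosen_constant` for a real dump, that would confirm the split finder picked them.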

hcho3 · October 30, 2020, 3:57am

You can set the hyperparameter min_split_loss to a value greater than 0 to ensure that every new split results in a net loss reduction. This way, the split-finding algorithm will not choose the constant column(s), as they are not discriminative and choosing them would lead to zero loss reduction.
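
To see why, here is a rough sketch of the split gain as given in the XGBoost paper, with `min_split_loss` playing the role of gamma. This is not the library's actual code, which applies further checks. The function `split_gain` and the gradient/hessian sums are hypothetical. For a fully missing column, every row follows the default direction, so one child is empty.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, min_split_loss=0.0):
    """Gain of a candidate split from gradient (g) and hessian (h) sums per child."""

    def score(g, h):
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    children = score(g_left, h_left) + score(g_right, h_right)
    return 0.5 * (children - parent) - min_split_loss


# Hypothetical node statistics: sums over all rows in the node.
G, H = -1.5, 3.0

# Fully missing column: all rows go to one child, the other child is empty.
gain_default = split_gain(0.0, 0.0, G, H, min_split_loss=0.0)
gain_penalized = split_gain(0.0, 0.0, G, H, min_split_loss=0.1)

print(gain_default, gain_penalized)
```

With `min_split_loss` at 0 the gain is exactly zero, so such a split ties with making no split at all. With any positive value the gain becomes negative and the split is not worth making.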