Let me first lay out the context to explain the problem:
We use the libsvm file format to feed data into XGBoost (Java API). Some column indexes are completely omitted from the libsvm file, so when the file is loaded into XGBoost they are effectively treated as columns containing only missing values. How does XGBoost treat these columns? Does it ignore them? It would make sense for such all-missing columns to be ignored completely, since they carry no useful information.
In our experiments, it appears that they are not ignored and somehow affect the model. We tried an extreme case: building a model with only one feature. We created two libsvm files. The first contains only the first column index (index 0, with zero-based indexing). The second contains the same information, except that the index is offset: the feature is at index 4, so XGBoost sees four columns (indexes 0 to 3) with only missing values and a fifth column with non-null values. If XGBoost ignored columns that are entirely NaN, these two libsvm files should produce the same model and the same model performance. But that isn't what happens.
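To make the setup concrete, here is a minimal sketch of how the two files could be generated with plain Java (no XGBoost calls). The class name, file names, labels, and feature values are all made up for illustration and are not our actual data; the only point is that both files hold identical rows, differing only in the feature index (0 versus 4).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: writes the same single-feature dataset twice,
// once with the feature at index 0 and once at index 4.
public class LibsvmOffsetDemo {

    // Formats one libsvm row of the form "label index:value".
    static String toLibsvmLine(int label, int featureIndex, double value) {
        return label + " " + featureIndex + ":" + value;
    }

    // Builds all rows for a single-feature dataset at the given index.
    static List<String> buildLines(int[] labels, double[] values, int featureIndex) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < labels.length; i++) {
            lines.add(toLibsvmLine(labels[i], featureIndex, values[i]));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Toy data, invented for this example only.
        int[] labels = {1, 0, 1, 0};
        double[] values = {0.9, 0.1, 0.8, 0.2};

        // File 1: the feature sits at column index 0.
        Files.write(Paths.get("feature_at_0.libsvm"), buildLines(labels, values, 0));
        // File 2: same rows, but indexes 0..3 never appear, so the feature is at index 4.
        Files.write(Paths.get("feature_at_4.libsvm"), buildLines(labels, values, 4));
    }
}
```

With these toy values, the first row is `1 0:0.9` in the first file and `1 4:0.9` in the second; everything else is identical.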
Has anyone run into the same issue, or is there an explanation for it?