XGBoost-JVM Funny Results

sipal · October 14, 2019, 6:10am

Dear XGBoosters,

I have a dataset with about 35 columns. Some are categorical: for example, the Gender column has the values Male and Female, with missing values denoted by NA, and the MaritalStatus column has the values withSpouse and noSpouse, with missing values also denoted by NA. For the continuous variables, missing data is represented by NaN.

I converted this file (35 columns, both categorical and continuous) into the LibSVM file format. The categorical columns were one-hot encoded. For the continuous columns, missing values were written as NaN. The DMatrix class threw an error when reading this LibSVM file because of the NaN entries. To get around the error, I regenerated the LibSVM file with NaN replaced by zero; since zero-valued entries are omitted from the file, the missing values are simply skipped in the final output. DMatrix can now parse this file and load it into memory. I then split the data into a trainSet and a testSet, 80/20.
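For reference, here is a minimal sketch of the row-to-LibSVM step described above, with one difference: a NaN entry is left out of the line while a true zero is still written explicitly, so the two cases stay distinguishable in the file. The class and method names (`LibSvmLine`, `format`) are hypothetical, and the 1-based feature indices are an assumption. As far as I understand the XGBoost documentation, entries absent from a LibSVM file are treated as missing, which would also apply to any one-hot zeros that are skipped.

```java
// Hypothetical helper, not part of XGBoost4J: formats one row as a LibSVM line.
final class LibSvmLine {
    private LibSvmLine() {}

    /**
     * Returns "label index:value ..." for one row.
     * Assumption: feature indices are 1-based.
     * NaN features are omitted from the line; zeros are written explicitly.
     */
    static String format(double label, double[] features) {
        StringBuilder sb = new StringBuilder();
        sb.append(label);
        for (int i = 0; i < features.length; i++) {
            double value = features[i];
            if (Double.isNaN(value)) {
                continue; // missing value: leave the entry out of the line
            }
            sb.append(' ').append(i + 1).append(':').append(value);
        }
        return sb.toString();
    }
}
```

For example, `LibSvmLine.format(1.5, new double[] {0.0, Double.NaN, 2.0})` yields `1.5 1:0.0 3:2.0`: the second feature is absent and the zero in the first feature is kept.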

I simply followed the example in the XGBoost documentation, linked below, but using my own data. I haven't changed any parameters. My target is a continuous variable, so I'm doing regression.

https://xgboost.readthedocs.io/en/latest/jvm/java_intro.html

I then dropped selected variables from the original 35 and re-ran the model. I eliminated, one at a time, variables whose exclusion I expected to increase the NRMSE. However, the NRMSE stayed the same whether I used 35, 34, 33, 32, ..., or as few as 12 variables: around 86.22% in every run.
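To make the metric concrete, here is a minimal sketch of one common NRMSE definition: the RMSE divided by the range (max minus min) of the observed targets. The post does not say which normalization was used, so the range-based choice is an assumption (dividing by the mean or standard deviation is also common), and the names `Nrmse` and `compute` are hypothetical.

```java
// Hypothetical helper: NRMSE = RMSE / (max(observed) - min(observed)).
final class Nrmse {
    private Nrmse() {}

    /** Returns the range-normalized RMSE as a fraction (multiply by 100 for percent). */
    static double compute(double[] observed, double[] predicted) {
        if (observed.length == 0 || observed.length != predicted.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sumSquaredError = 0.0;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < observed.length; i++) {
            double error = predicted[i] - observed[i];
            sumSquaredError += error * error;
            min = Math.min(min, observed[i]);
            max = Math.max(max, observed[i]);
        }
        double range = max - min;
        if (range == 0.0) {
            throw new IllegalArgumentException("observed values have zero range");
        }
        double rmse = Math.sqrt(sumSquaredError / observed.length);
        return rmse / range;
    }
}
```

With observed values `{0, 10}` and predictions `{1, 9}`, the RMSE is 1 and the range is 10, so the result is 0.1 (10%).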

Does anyone have a tip on what is going on here? I expected the NRMSE to change as the number of variables changes, but that is not the case.

Any tips would be much appreciated.

Cheers.