Display Original Encoding when writing to CSV or Running Model


Hi, I hope someone has an idea about this one, because I cannot figure out why my code runs fine on my laptop but produces different output in Hadoop.

I have an odd issue running XGBoost in R in a Hadoop environment: the Feature column of the feature importance table shows a number instead of the original string value. The same code works perfectly fine in RStudio Desktop, but in Hadoop I cannot get the output to display the original feature names. Can anyone help?

Below is what it should look like:

   Feature       Gain       Cover   Frequency Importance
1:  insect 0.20066341 0.101087586 0.084706960 0.20066341
2:    fish 0.13148500 0.060271103 0.120421245 0.13148500

Here is what I get in Hadoop:

   Feature       Gain       Cover   Frequency Importance
1:      14 0.20066341 0.101087586 0.084706960 0.20066341
2:      25 0.13148500 0.060271103 0.120421245 0.13148500

What I have tried:

  1. Dimname: that did not work



I didn’t know this was officially supported.


Hi, do you mean ‘Dimname’? Or the R running on Hadoop with XGBoost package?


I mean running R on Hadoop


It is R Server and it seems to work fine with Hadoop. Just odd that string variables get encoded as numeric.


Yeah, I personally don’t have experience running R in a Hadoop cluster. Any help in fixing this issue would be great.


In case anyone else runs into this: in Hadoop you need to take the feature names from the matrix's Dimnames slot rather than from colnames().

## The call below works fine in RStudio
importance <- xgb.importance(feature_names = colnames(train$data), model = bst)

## Use the Dimnames slot of the sparse matrix if you are in Hadoop
importance <- xgb.importance(feature_names = sparse_matrix@Dimnames[[2]], model = bst)
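If you already have an importance table with numbers in the Feature column, here is a rough sketch of mapping them back to names. It uses only the Matrix package; the toy data frame and the helper restore_feature_names are hypothetical and not part of xgboost. It also assumes the numbers are 0-based column indices, which you should check against your own matrix before relying on it.

```r
library(Matrix)  # provides sparse.model.matrix and the Dimnames slot

# Toy data; the column names are hypothetical stand-ins for the real features.
df <- data.frame(
  insect = c(1, 0, 1, 0),
  fish   = c(0, 1, 1, 0),
  label  = c(1, 0, 1, 0)
)

# Build a sparse design matrix without an intercept column.
sparse_matrix <- sparse.model.matrix(label ~ . - 1, data = df)

# Read the column names straight from the S4 slot, as in the fix above.
feature_names <- sparse_matrix@Dimnames[[2]]

# Hypothetical helper: translate numeric feature ids back to names.
# ASSUMPTION: the ids are 0-based column indices, hence the "+ 1L".
# Entries that are not numeric are returned unchanged.
restore_feature_names <- function(feature_ids, feature_names) {
  idx <- suppressWarnings(as.integer(feature_ids))
  ifelse(is.na(idx), feature_ids, feature_names[idx + 1L])
}

print(restore_feature_names(c("1", "0"), feature_names))
```

With a real model you would pass the Feature column of the importance table as feature_ids. Passing feature_names to xgb.importance up front, as shown above, is still the simpler route.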