Display Original Encoding when writing to CSV or Running Model


#1

Hi, I hope someone has an idea about this one, because I cannot figure out why it runs fine on my laptop but outputs differently in Hadoop.

I have an odd issue running XGBoost in R in a Hadoop environment: the Feature column of the importance table shows a number instead of the original string value. The same code works perfectly fine in RStudio Desktop. I cannot get the feature output to display the original value. Can anyone help?

Below is what it should look like:

   Feature       Gain       Cover   Frequency Importance
1:  insect 0.20066341 0.101087586 0.084706960 0.20066341
2:    fish 0.13148500 0.060271103 0.120421245 0.13148500

Here is what I get in Hadoop:

   Feature       Gain       Cover   Frequency Importance
1:      14 0.20066341 0.101087586 0.084706960 0.20066341
2:      25 0.13148500 0.060271103 0.120421245 0.13148500

What I have tried:

  1. Dimname, that did not work



#2

I didn’t know this was officially supported.


#3

Hi, do you mean ‘Dimname’? Or do you mean running R on Hadoop with the XGBoost package?


#4

I mean running R on Hadoop


#5

It is R Server and it seems to work fine with Hadoop. Just odd that string variables get encoded as numeric.


#6

Yeah, I personally don’t have experience running R in a Hadoop cluster. Any help in fixing this issue would be great.


#7

In case anyone else runs into this: you need to use Dimnames instead of colnames.

## The below works fine in RStudio
importance <- xgb.importance(feature_names = colnames(train$data), model = bst)

## Use the Dimnames if you are in Hadoop
importance <- xgb.importance(feature_names = sparse_matrix@Dimnames[[2]], model = bst)
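If it helps anyone debugging the same thing, here is a rough sketch of a check you could run before calling xgb.importance, to see whether the names you are about to pass are actually there. The helper check_feature_names is hypothetical (it is not part of xgboost), and the two toy matrices are only stand-ins for the real training data. I am assuming the numeric Feature values show up when the names vector is missing or does not match the columns; I have not confirmed that this is what happens on Hadoop.

```r
# Hypothetical helper (not part of xgboost): fail early if the names are unusable.
check_feature_names <- function(feature_names, n_features) {
  if (is.null(feature_names)) {
    stop("feature names are NULL; the matrix carries no column names")
  }
  if (length(feature_names) != n_features) {
    stop("got ", length(feature_names), " names for ", n_features, " columns")
  }
  feature_names
}

# Toy stand-ins for the real training matrix (assumed data, for illustration only).
named   <- matrix(0, nrow = 2, ncol = 2,
                  dimnames = list(NULL, c("insect", "fish")))
unnamed <- matrix(0, nrow = 2, ncol = 2)

# A matrix with column names passes the check and returns the names unchanged.
ok <- check_feature_names(colnames(named), ncol(named))

# A matrix without column names is caught before it reaches the model call.
bad <- tryCatch(
  check_feature_names(colnames(unnamed), ncol(unnamed)),
  error = function(e) conditionMessage(e)
)

print(ok)
print(bad)
```

In the real script you would run the same check on whichever names you pass as feature_names, for example colnames(train$data) or sparse_matrix@Dimnames[[2]], with the column count of the matrix the model was trained on.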