Attribute-Importance XGBoost-JVM

sipal · October 16, 2019, 7:04am

I have the following problem, if anyone has a hint. I have built a map of the attributes in my dataset, as shown below. The dataset has 300+ columns, i.e., the size of the map “myAttrs” is 300+.

Map<String, String> myAttrs = new HashMap<String, String>();
myAttrs.put("Attr-1", "Col-1");
myAttrs.put("Attr-2", "Col-2");
myAttrs.put("Attr-3", "Col-3");
…
…
…
myAttrs.put("Attr-m", "Col-m");

Booster booster = XGBoost.train(train, params, nround, watches, null, null);

booster.setAttrs(myAttrs);

Set<String> myNameSet = myAttrs.keySet();

String[] featureNamesArr = myNameSet.toArray(new String[myNameSet.size()]);

String importanceType = Booster.FeatureImportanceType.GAIN;

Map<String,Double> myScore = booster.getScore(featureNamesArr, importanceType);

The size of the map “myScore” is not 300+. I saw only 5 output scores.

Why does it have 5 rather than 300+ importance scores?

Any tip would be much appreciated.

Thanks.

hcho3 · October 16, 2019, 10:24am

It means that only 5 features were used in the splits. Try generating a text dump of your model and see which features are being used.
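As a rough sketch of how to read such a dump programmatically: the snippet below scans dump text for split conditions and collects the distinct feature names. It assumes the dump arrives as one string per tree and that splits are printed as `[name<threshold]`; the class name, method name, and sample dump are hypothetical, made up for illustration.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SplitFeatureScan {
    // Matches the feature name in a numeric split node such as "[f29<-9.5e-07]".
    // Assumption: splits are printed as [name<threshold].
    private static final Pattern SPLIT = Pattern.compile("\\[([^<\\]]+)<");

    /** Returns the distinct feature names that appear in split conditions. */
    static Set<String> featuresUsed(String[] treeDumps) {
        Set<String> used = new TreeSet<>();
        for (String tree : treeDumps) {
            Matcher m = SPLIT.matcher(tree);
            while (m.find()) {
                used.add(m.group(1));
            }
        }
        return used;
    }

    public static void main(String[] args) {
        // Hypothetical two-tree dump, one string per tree.
        String[] dump = {
            "0:[f2<0.5] yes=1,no=2,missing=1\n"
                + "\t1:leaf=0.1\n"
                + "\t2:[f7<3] yes=3,no=4,missing=3\n"
                + "\t\t3:leaf=-0.2\n"
                + "\t\t4:leaf=0.3\n",
            "0:[f2<1.5] yes=1,no=2,missing=1\n"
                + "\t1:leaf=0.05\n"
                + "\t2:leaf=-0.05\n"
        };
        System.out.println(featuresUsed(dump)); // prints [f2, f7]
    }
}
```

If the resulting set has 5 entries, that matches the 5 scores reported: a feature that never appears in a split has no importance entry.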

sipal · October 16, 2019, 12:26pm

Thanks for the tip. I did look at the dump, but I was surprised to see that all 5 were columns from a single categorical variable that was encoded with one-hot encoding. Does this mean that all the other categorical and continuous variables are less important? Perhaps I should remove all of those and re-run XGBoost with that single categorical variable?

My main reason for trying to find the most important variables is so that I can remove the less important ones (according to a threshold that I set manually) and then re-run the XGBoost training only on the selected variables that meet that threshold. Am I interpreting the use of feature importance correctly here?
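The selection step described above might look roughly like the sketch below. It works on a plain `Map<String, Double>` of scores; the class name, method name, scores, and threshold are all hypothetical, chosen only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ImportanceFilter {
    /** Keeps only the features whose score is at least the threshold. */
    static Map<String, Double> keepAtLeast(Map<String, Double> scores, double threshold) {
        Map<String, Double> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() >= threshold) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical gain scores, not real results.
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("Attr-1", 12.5);
        scores.put("Attr-2", 0.3);
        scores.put("Attr-3", 4.0);
        System.out.println(keepAtLeast(scores, 1.0).keySet()); // prints [Attr-1, Attr-3]
    }
}
```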

hcho3 · October 16, 2019, 11:41pm

It may be that those 5 features produce slightly more gain (loss reduction) than the other features do. If you are concerned about the lack of diversity among the split features, try setting colsample_bytree, colsample_bylevel, and colsample_bynode to values less than 1. (See https://xgboost.readthedocs.io/en/latest/parameter.html) This randomly restricts the set of candidate features and increases the diversity of split features.
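For illustration, here is a minimal sketch of putting those three parameters into the kind of params map that gets passed to training. The helper class, method name, and the value 0.8 are assumptions for the example, not recommended settings.

```java
import java.util.HashMap;
import java.util.Map;

final class ColumnSamplingParams {
    /** Returns a copy of the base parameters with column subsampling added. */
    static Map<String, Object> withColumnSampling(Map<String, Object> base, double fraction) {
        if (fraction <= 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in (0, 1]");
        }
        Map<String, Object> params = new HashMap<>(base);
        params.put("colsample_bytree", fraction);
        params.put("colsample_bylevel", fraction);
        params.put("colsample_bynode", fraction);
        return params;
    }

    public static void main(String[] args) {
        Map<String, Object> base = new HashMap<>();
        base.put("max_depth", 6); // illustrative value only
        Map<String, Object> params = withColumnSampling(base, 0.8);
        System.out.println(params.get("colsample_bynode")); // prints 0.8
    }
}
```

Note that the three settings compound: the candidate set at each node is sampled from the level's set, which is sampled from the tree's set, so using the same fraction for all three is more aggressive than it looks.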

jrinne · October 19, 2019, 5:53pm

Obviously good advice.

You might also set colsample_bynode to 1/300, forcing XGBoost to select features randomly, and then look at the gain for each feature. This will give you the best idea of a feature’s importance.
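One practical note if the fraction is computed in Java rather than typed in: `1 / 300` with int operands is integer division and evaluates to 0, which falls outside the (0, 1] range the colsample parameters accept. A small sketch, with a hypothetical helper name:

```java
final class OneFeaturePerNode {
    /** Fraction corresponding to one candidate feature out of numFeatures. */
    static double fraction(int numFeatures) {
        if (numFeatures <= 0) {
            throw new IllegalArgumentException("numFeatures must be positive");
        }
        return 1.0 / numFeatures; // 1.0, not 1: integer division would yield 0
    }

    public static void main(String[] args) {
        int numFeatures = 300; // column count mentioned in the question
        System.out.println(1 / numFeatures);           // prints 0 (integer division)
        System.out.println(fraction(numFeatures) > 0); // prints true
    }
}
```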

I would then use Dr. Cho’s advice after removing the features that produce little or no gain.

One advantage of removing features that produce no gain is that you can do as Dr. Cho suggests in the extreme, limiting colsample_bynode to even a single randomly selected feature for each node, without forcing the algorithm to use a feature that is just noise (produces no gain).

XGBoost can be set as a pure Random Forest—with subsampling instead of bootstrapping.
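A hedged sketch of what such a configuration could look like as a params map. It follows the random-forest recipe in the XGBoost documentation as I understand it: learning rate 1, many parallel trees, row and column subsampling below 1, and a single boosting round. The class name, method name, forest size, and fractions are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;

final class RandomForestParams {
    /** Builds parameters for a random-forest-style run; train with a single boosting round. */
    static Map<String, Object> randomForest(int numTrees, double rowFraction, double colFraction) {
        Map<String, Object> params = new HashMap<>();
        params.put("booster", "gbtree");
        params.put("eta", 1.0);                     // no shrinkage: trees are averaged, not boosted
        params.put("num_parallel_tree", numTrees);  // size of the forest
        params.put("subsample", rowFraction);       // row subsampling instead of bootstrapping
        params.put("colsample_bynode", colFraction);
        return params;
    }

    public static void main(String[] args) {
        Map<String, Object> params = randomForest(100, 0.8, 0.8); // illustrative values
        int nround = 1; // one boosting round, so the forest is the whole model
        System.out.println(params.get("num_parallel_tree") + " trees, " + nround + " round");
    }
}
```

Raising the number of rounds above 1 while keeping num_parallel_tree above 1 gives a boosted sequence of small forests, which is one way to mix the two methods.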

In my limited experience, one should try to combine the methods (Random Forest and gradient boosting). Not that I know the exact best way to do this (if there is a best way for every situation).

Dr. Cho offers a way to start to combine them. You might consider removing the random noise first, however.

Or, more simply, a little feature selection can still be useful, IMHO, even for XGBoost, especially in real-world situations where you cannot be certain that your sample is perfectly i.i.d. The diversity that Dr. Cho suggests will make EACH TREE more i.i.d.