Specify Feature Selection Frequency

benrfitzpatrick · August 24, 2018, 9:22am

Is it possible to specify the probability with which explanatory variables (features) are drawn for use in each tree?
This is of interest when groups of unequal sizes exist among the potential explanatory variables.

hcho3 · August 24, 2018, 5:01pm

How is this different from feature importances?

thvasilo · August 26, 2018, 10:28pm

What I think @benrfitzpatrick wants is to specify the probability of each feature being selected when using the subsample parameter.

AFAIK this is not possible in the current codebase.

Perhaps one way to hack this is to duplicate the variables of interest. For example, if my dataset has 10 features and I set subsample to 0.1, each variable has a 10% chance of being selected. If I duplicate f1 in my dataset, it will have roughly a 20% chance of being selected.

Since this is a decision tree, collinearity should not be an issue, but I’m not 100% sure about that.
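
A rough way to check the duplication idea is to simulate it. The sketch below assumes columns are drawn uniformly without replacement and that exactly one column is drawn per tree; `selection_rate` is a hypothetical helper written for this illustration, not part of XGBoost.

```python
import random

def selection_rate(n_features, n_copies, n_sampled, trials=20000, seed=0):
    """Estimate how often feature f1 (or one of its copies) is in a uniform column sample."""
    rng = random.Random(seed)
    # Columns 0..n_copies-1 all hold f1; the remaining columns are the other features.
    n_columns = n_features - 1 + n_copies
    hits = 0
    for _ in range(trials):
        sampled = rng.sample(range(n_columns), n_sampled)
        if any(col < n_copies for col in sampled):
            hits += 1
    return hits / trials

# Assumption: 10 original features, one column drawn per tree.
baseline = selection_rate(n_features=10, n_copies=1, n_sampled=1)
duplicated = selection_rate(n_features=10, n_copies=2, n_sampled=1)
print(f"f1 alone:      {baseline:.3f}")
print(f"f1 duplicated: {duplicated:.3f}")
```

Under these assumptions the duplicated case comes out near 2/11 rather than exactly 20%, because the extra copy also enlarges the pool of columns being sampled from.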

benrfitzpatrick · August 27, 2018, 11:19am

Hello, thank you both for your replies. Yes, I am interested in being able to specify the probability that a feature will be available to define a partition.

With respect to the subsample parameter, my reading of the documentation is that it controls the amount of bagging (row subsampling) being performed, rather than the number of features randomly selected to be available to define each partition.

I see that there is another parameter, colsample_bytree, which controls the proportion of features (columns) that are available to define partitions in each tree. There is also colsample_bylevel, which controls the same proportion when features are sampled anew each time a partition in a tree is defined.
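
For reference, here is how the three parameters mentioned so far might sit together in a parameter dictionary. The values are arbitrary placeholders chosen for illustration, and the comments reflect my reading of the documentation rather than a definitive description.

```python
# Illustrative values only; each must lie in the interval (0, 1].
params = {
    "subsample": 0.8,          # fraction of rows (training instances) drawn for each tree
    "colsample_bytree": 0.5,   # fraction of columns drawn once per tree
    "colsample_bylevel": 0.5,  # fraction of columns re-drawn as the tree grows
}

for name, value in params.items():
    print(f"{name} = {value}")
```

A dictionary like this would be passed as the first argument to `xgboost.train`. All three draws are uniform: none of these parameters accepts per-feature weights.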

What I was wondering is whether XGBoost includes something like the split.select.weights argument of the random forest package ranger.
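
To make the idea concrete, below is a minimal sketch of weighted column sampling done outside any library. `weighted_column_sample` and the weights are hypothetical and written only for this illustration; this is neither ranger's implementation nor an XGBoost feature. It uses the Efraimidis-Spirakis key trick to draw distinct columns with probability tied to their weights.

```python
import random

def weighted_column_sample(weights, k, rng):
    """Draw k distinct column indices, favouring columns with larger weights."""
    # Efraimidis-Spirakis: give each column the key u ** (1 / w) and keep the k largest keys.
    keys = [(rng.random() ** (1.0 / w), idx) for idx, w in enumerate(weights) if w > 0]
    top = sorted(keys, reverse=True)[:k]
    return sorted(idx for _, idx in top)

rng = random.Random(0)
weights = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]  # hypothetical weights: column 0 is favoured
counts = [0] * len(weights)
for _ in range(5000):
    for idx in weighted_column_sample(weights, k=2, rng=rng):
        counts[idx] += 1

print(counts)  # column 0 should be drawn far more often than the others
```

Each draw could, for example, define the column subset used to fit one model in a hand-rolled ensemble, though that is a different procedure from weighting features inside the tree-building algorithm itself.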

thvasilo · September 28, 2018, 8:49pm

Hello @benrfitzpatrick, as of 0.80 there is no such parameter in XGBoost.

benrfitzpatrick · October 1, 2018, 9:18am

Thank you @thvasilo, this is helpful information. I might try my luck with a feature request on GitHub. Cheers.