Is it possible to specify the probability with which explanatory variables (features) are drawn for use in each tree?
This is of interest when groups of unequal sizes exist among the potential explanatory variables.
Specify Feature Selection Frequency
What I think @benrfitzpatrick wants is to specify the probability of each feature being selected when using the subsample parameter.
AFAIK this is not possible in the current codebase.
Perhaps one way to hack this is to duplicate the variables that are of interest. E.g., if my dataset has 10 features and I set subsample to 0.1, each variable has a 10% chance of being selected. If I duplicate f1 in my dataset, it will have roughly a 20% chance of being selected (2 of the 11 columns, so about 18% if a single column is drawn).
Since this is a decision tree, collinearity should not be an issue, but I’m not 100% sure about that.
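To make the arithmetic behind the duplication hack concrete, here is a minimal sketch. It assumes columns are drawn uniformly without replacement; the helper name `prob_feature_available` is made up for illustration and is not part of XGBoost.

```python
from math import comb


def prob_feature_available(n_columns, n_copies, n_drawn):
    """Probability that at least one of `n_copies` identical columns is among
    `n_drawn` columns drawn uniformly without replacement from `n_columns`."""
    return 1 - comb(n_columns - n_copies, n_drawn) / comb(n_columns, n_drawn)


# 10 original features, one column drawn per tree.
p_original = prob_feature_available(n_columns=10, n_copies=1, n_drawn=1)

# f1 duplicated once: 11 columns in total, 2 of which carry f1.
p_duplicated = prob_feature_available(n_columns=11, n_copies=2, n_drawn=1)

print(f"original: {p_original:.3f}, duplicated: {p_duplicated:.3f}")
```

Under these assumptions the duplicated feature ends up at 2/11 rather than exactly 2/10, because the duplicate also enlarges the pool of columns being drawn from.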
Hello, thank you both for your replies. Yes, I am interested in being able to specify the probability that a feature will be available to define a partition.
With respect to the subsample parameter, my reading of the documentation was that it controls the amount of bagging being performed (subsampling of rows) rather than the number of features randomly selected to be available to define each partition.
I see that there is another parameter, colsample_bytree, which controls the proportion of features (columns) that are available to define partitions in each tree. There is also colsample_bylevel, which controls the same proportion when features are sampled anew at each new depth level of a tree.
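As a rough illustration of how these three parameters differ, here is a parameter dictionary of the kind that could be passed to `xgboost.train`. The values are arbitrary and chosen only for the example.

```python
# Arbitrary illustrative values, not recommendations.
params = {
    "subsample": 0.8,          # fraction of rows (observations) drawn for each tree
    "colsample_bytree": 0.5,   # fraction of columns drawn once per tree
    "colsample_bylevel": 0.5,  # fraction of columns drawn again at each new depth level
}

# The column fractions compound: each level samples from the columns
# already kept for the tree.
effective_fraction = params["colsample_bytree"] * params["colsample_bylevel"]
n_features = 64
print(int(n_features * effective_fraction), "columns available at each level")
```

Note that every column still has the same chance of being kept; none of these parameters assigns a different probability to individual features.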
What I was wondering was whether XGBoost includes a feature like the split.select.weights argument from the random forest package ranger.
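To spell out what per-feature selection weights would mean, here is a small sketch of weighted column sampling without replacement, using random keys of the form u^(1/w) (the Efraimidis-Spirakis scheme). It illustrates the idea only and is not how ranger or XGBoost implement it; the function name and the weights are hypothetical.

```python
import random


def sample_features_weighted(weights, n_drawn, rng):
    """Draw `n_drawn` distinct feature indices, favouring larger weights.

    Each feature with weight w > 0 gets the key u ** (1 / w), with u uniform
    on [0, 1); the features with the largest keys are kept.
    """
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights) if w > 0]
    keys.sort(reverse=True)
    return [i for _, i in keys[:n_drawn]]


rng = random.Random(0)
weights = [5.0] + [1.0] * 9  # hypothetical: feature 0 is five times as likely

# Count how often each feature is the single column drawn for a tree.
counts = [0] * len(weights)
for _ in range(10_000):
    for i in sample_features_weighted(weights, n_drawn=1, rng=rng):
        counts[i] += 1

print(counts)
```

With a single column drawn, feature 0 should be picked with probability 5/14, versus 1/14 for each of the others, which is the kind of control that duplicating columns only approximates.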
Thank you @thvasilo, this is helpful information. I might try my luck with a feature request on GitHub. Cheers.