Feature-Level Random Sampling


#1

Hi -

Would the community find it useful to be able to specify the probability that each feature gets selected as a candidate splitting feature for a tree/level/split?

For example, suppose you have 10 features in your dataset and you’ve decided that randomly sampling 50% of them to search for the best split works well, but you’ve also determined that 1 of the 10 features is causing overfitting. The feature is useful, but you don’t want to give the model access to it for splitting purposes 50% of the time like the other 9 features, which you don’t think are leading to overfitting. Instead, you only want to give the model access to it 10% of the time when determining the best split point.

It would be great in this situation to be able to either (1) pass a list/array/vector/tensor of feature-level sample probabilities of the same length as the number of features in the DMatrix, or (2) pass a dictionary of feature-specific overrides of the selected colsample_... arguments. For instance, if the feature you want to give the model access to 10% of the time has index 4, the dictionary might look like this: {4: 0.10}. This would indicate that the fifth feature (index 4) should only be considered 10% of the time for purposes of selecting the best split.
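To make the proposed semantics concrete, here is a minimal Python sketch of how the two interfaces could resolve to a single per-feature probability vector. The function name and parameters are hypothetical, not existing XGBoost API:

```python
def resolve_feature_probs(n_features, base_rate, overrides=None, probs=None):
    """Return one selection probability per feature.

    probs     -- option (1): explicit per-feature probabilities,
                 same length as the number of features in the DMatrix
    overrides -- option (2): {feature_index: probability} overrides
                 applied on top of the uniform base colsample rate
    """
    if probs is not None:
        if len(probs) != n_features:
            raise ValueError("probs must have one entry per feature")
        return list(probs)
    result = [base_rate] * n_features
    for idx, p in (overrides or {}).items():
        result[idx] = p
    return result

# The example above: 10 features, 50% column sampling overall,
# but feature index 4 only considered 10% of the time.
print(resolve_feature_probs(10, 0.5, overrides={4: 0.10}))
# [0.5, 0.5, 0.5, 0.5, 0.1, 0.5, 0.5, 0.5, 0.5, 0.5]
```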

In terms of a potential pull request for this functionality, I see that the ColumnSampler class is defined in random.h in the src/ folder. Is this the main (only?) place where the sampling logic would need to change?
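As a rough illustration (in Python, not the actual C++ ColumnSampler code), the core change would be replacing uniform "pick k of n columns" sampling with an independent Bernoulli draw per feature:

```python
import random

def sample_features(probs, rng=random):
    """Return the indices of features available for this tree/level/split,
    keeping feature i with probability probs[i]."""
    selected = [i for i, p in enumerate(probs) if rng.random() < p]
    # Guard against an empty draw: a split needs at least one
    # candidate feature, so fall back to the most probable one.
    if not selected:
        selected = [max(range(len(probs)), key=lambda i: probs[i])]
    return selected
```

One design consequence worth noting: with independent per-feature draws, the number of sampled columns varies around its expected value rather than being a fixed fraction, so this is not a drop-in behavioral match for the current colsample_... parameters.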

Thanks!