Sampsize by strata in subsample?


#1

I’d like to be able to group features and then define subsample probabilities (or direct numbers) based on these groups. This would be equivalent to the ‘sampsize’ argument in R’s randomForest package (https://github.com/cran/randomForest/blob/master/man/randomForest.Rd#L64).

This question seems to be similar (equivalent?) to these discussions:

Is this possible in the current R implementation? If not, does anyone know if it would be difficult to implement?

Thanks in advance,
Tim


#2

No, this is not possible in the current implementation. colsample_bytree samples uniformly among features, so every feature has the same chance of being selected. A pull request to implement this would be welcome.
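
To make "uniform sampling among features" concrete, here is a minimal Python sketch. It is not xgboost's actual code: `sample_columns` is a hypothetical helper, and it assumes a fixed fraction of columns is drawn without replacement for each tree. Under that assumption every feature ends up with roughly the same inclusion frequency, whatever group it belongs to.

```python
import random


def sample_columns(n_features, colsample_bytree, rng):
    """Hypothetical helper: draw a uniform subset of column indices.

    Assumption: the subset size is the fraction rounded down, with at
    least one column kept. xgboost's internal rounding may differ.
    """
    k = max(1, int(n_features * colsample_bytree))
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(0)
n_features = 10
n_trees = 20000
counts = [0] * n_features

# Simulate one column draw per tree and count how often each feature appears.
for _ in range(n_trees):
    for j in sample_columns(n_features, 0.5, rng):
        counts[j] += 1

# Empirical inclusion frequency per feature; each should be close to 0.5.
freqs = [c / n_trees for c in counts]
print(freqs)
```

There is no per-feature or per-group weight in this scheme, which is the gap the original question is about.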


#3

I am still very interested in this functionality being added.

@thvasilo suggested adding duplicate feature columns to the data to produce an equivalent effect:

Would this cause any problems for xgboost’s tree fitting procedures?


#4

@benrfitzpatrick the change I’m suggesting should be done manually at the data file level. Therefore it shouldn’t affect XGB training in any way, provided that my co-linearity assumption is correct.
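
A rough Python sketch of the data-level duplication idea, for illustration only. The helpers `duplicate_columns` and `inclusion_probability` are hypothetical, and the calculation assumes columns are drawn uniformly without replacement with a fixed subset size. It shows only how duplication changes the chance that a feature is represented in a draw. It does not say anything about how the duplicates affect split finding.

```python
from math import comb


def duplicate_columns(rows, col, copies):
    """Return new rows in which column `col` appears `copies` times in total.

    The extra copies are appended at the end of each row.
    """
    return [row + [row[col]] * (copies - 1) for row in rows]


def inclusion_probability(n_columns, copies, k):
    """Chance that at least one of `copies` identical columns is among
    k columns drawn uniformly without replacement from n_columns."""
    return 1 - comb(n_columns - copies, k) / comb(n_columns, k)


rows = [[1.0, 2.0, 3.0, 4.0],
        [5.0, 6.0, 7.0, 8.0]]

# Feature 0 now appears three times, giving 6 columns in total.
dup = duplicate_columns(rows, col=0, copies=3)

# With a sampling fraction of 0.5: 2 of 4 columns before, 3 of 6 after.
p_before = inclusion_probability(4, 1, 2)
p_after = inclusion_probability(6, 3, 3)
p_other_after = inclusion_probability(6, 1, 3)

print(p_before, p_after, p_other_after)
```

Under these assumptions the duplicated feature is much more likely to be available to a given tree, while each non-duplicated feature keeps the same inclusion probability as before because the sampling fraction is unchanged.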