Colsample_bylevel works well in Python but not R


#1

First post and I want to say thank you for XGBoost. I use it every day.

I used to use it every day in R but moved to Python, in part to cross-check my methods. While doing this, I found that colsample_bylevel did not seem to work properly in R, and I could not get it to behave correctly. Is it just me?

In any case, Python works better for me overall, since multithreading seems to work out of the box with the simple Anaconda download. So I am happy with the Python implementation and do not really need a fix.

Still, I am new and would like to learn and/or alert the community.

As an aside, I get the best results with subsample = 0.5, colsample_bylevel = 1/6 (there are 6 features in my model), max_depth = 6, and a small eta of 0.001.
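
For concreteness, these settings could be written for the Python package roughly as below. This is only a sketch, assuming the native `xgb.train` interface; the `train_model` helper and the number of boosting rounds are illustrative assumptions, not part of the setup described here.

```python
# Hyperparameters described above, written as a parameter dict for the
# native XGBoost Python interface.
N_FEATURES = 6  # the model has 6 features

params = {
    "subsample": 0.5,                       # each tree is fit on half of the rows
    "colsample_bylevel": 1.0 / N_FEATURES,  # roughly one feature per depth level
    "max_depth": 6,
    "eta": 0.001,                           # small learning rate
}


def train_model(X, y, num_boost_round=1000):
    """Hypothetical helper: train a booster with the parameters above.

    num_boost_round=1000 is an assumed placeholder value.
    """
    # Imported lazily so the parameter dict can be inspected without xgboost.
    import xgboost as xgb

    dtrain = xgb.DMatrix(X, label=y)
    return xgb.train(params, dtrain, num_boost_round=num_boost_round)
```

Passing the same dict to the R package would be the natural way to compare the two implementations on identical data.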

I chose these hyperparameters largely based on cross-validation results. But theoretically, I think this method makes the trees as close to i.i.d. as possible, like a Random Forest. For example, is it a coincidence that each factor should occur in about half of the trees, which increases the independence of the trees? Also, colsample_bylevel = 1/6 should increase the independence of the trees, I think. Otherwise, some factors would occur in most of the trees (and decrease the independence).
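
A back-of-the-envelope calculation can make the column-sampling part of this concrete. The sketch below assumes that exactly one of the 6 features is drawn uniformly and independently at each depth level, and that every tree grows to all 6 levels; both are simplifying assumptions, and being drawn as a candidate does not mean the feature is actually used in a split.

```python
from fractions import Fraction

n_features = 6
n_levels = 6  # assumes the tree actually grows to max_depth = 6

# Chance that a particular feature is NOT the one drawn at a given level.
p_miss_one_level = Fraction(n_features - 1, n_features)

# Chance that the feature is drawn at least once somewhere in the tree,
# assuming an independent uniform draw of one feature at every level.
p_available = 1 - p_miss_one_level ** n_levels

print(p_available, float(p_available))
```

Under these assumptions, any given feature is a split candidate in roughly two thirds of the trees (about 0.665), and no single feature can dominate every level of every tree.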

Anyway, I have looked online and read everything I could find. I have found little specific information about how I might be making XGBoost simulate a Random Forest with what I am doing, other than the original stochastic gradient boosting paper and general comments that subsampling makes XGBoost behave like a Random Forest. Any thoughts expanding on this would be greatly appreciated.

Again, much appreciated.

-Jim