Possibility to return which variables/rows are used in each "subsample" and "colsample_bytree" call

brebbles · February 11, 2019, 12:45am

FWIW I am running in R.

I have run a grid-search cross validation on my data, including different subsample and colsample_bytree values as hyper-parameters in the grid. I call set.seed() once before the run, and I can replicate the results if I run the grid search in its entirety again.

After finding the combination of hyper-parameters that gives the best eval_metric on the test sample, I am now trying to replicate the results of that best combination. If I set up my grid so that it contains only the one row with the best combination of hyper-parameters, and run it through the same CV code (with the same seed as the brute-force CV above), I get a different (worse) eval_metric on the test data.
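To make the symptom concrete, here is a minimal base-R sketch of what I think is going on. It does not call xgboost at all: `draw_rows()` is a hypothetical stand-in for the row sampling that happens when subsample < 1, and the grid values are made up. It assumes the sampling is driven by R's random number generator, which is consistent with set.seed() making the full grid run repeatable.

```r
# Hypothetical stand-in for the row sampling done when subsample < 1.
draw_rows <- function(n_rows, frac) {
  sample.int(n_rows, size = round(frac * n_rows))
}

# Made-up grid with three subsample values.
grid <- data.frame(subsample = c(0.5, 0.7, 0.9))

# Full grid search: the seed is set once, before the loop.
set.seed(123)
full_run <- lapply(grid$subsample, function(f) draw_rows(1000, f))

# Running the full grid again with the same seed reproduces every draw.
set.seed(123)
repeat_run <- lapply(grid$subsample, function(f) draw_rows(1000, f))

# Running only the third row with the same seed starts from a fresh RNG state,
# whereas in the full run the first two rows had already consumed random numbers.
set.seed(123)
single_run <- draw_rows(1000, grid$subsample[3])

same_full   <- identical(full_run, repeat_run)
same_single <- identical(full_run[[3]], single_run)
```

Here `same_full` is TRUE, while `same_single` should come out FALSE: the third row sees a different stream of random numbers depending on how many rows ran before it.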

I understand that this is likely because the random split of rows/columns changes with each pass through the grid-search loop, even for the same seed. My question, though: is it possible to work out which rows/columns were used in my best model? Or, even better, is it possible to return the actual tree from the best model on the test set? Again, I understand this isn’t exactly the point of tuning hyper-parameters, but in my case at least it seems futile to sample by row and column, as I am unable to replicate the results when I go to build the final model.
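One workaround I can think of, sketched below in base R under the same assumption (that the sampling follows R's random number generator), is to give every grid row its own recorded seed and reset it immediately before that row runs. `run_cv()` is a hypothetical stand-in that only records which rows and columns were sampled; in real code it would be replaced by the actual CV call. The grid values, the seed numbers, and the choice of `best` are made up for illustration.

```r
# Hypothetical stand-in for one CV run: records the sampled rows and columns.
run_cv <- function(subsample, colsample, n_rows = 1000, n_cols = 50) {
  list(
    rows = sample.int(n_rows, size = round(subsample * n_rows)),
    cols = sample.int(n_cols, size = round(colsample * n_cols))
  )
}

# Made-up grid, with one recorded seed per row.
grid <- expand.grid(subsample = c(0.5, 0.8), colsample_bytree = c(0.5, 0.8))
grid$seed <- 1000 + seq_len(nrow(grid))

# Reset the seed right before each row, so the result no longer depends
# on how many other rows ran earlier in the loop.
run_row <- function(i) {
  set.seed(grid$seed[i])
  run_cv(grid$subsample[i], grid$colsample_bytree[i])
}

full_run <- lapply(seq_len(nrow(grid)), run_row)

# Pretend row 3 gave the best eval_metric, then re-run it on its own.
best  <- 3
rerun <- run_row(best)

reproduced <- identical(full_run[[best]], rerun)
```

With this layout `reproduced` is TRUE, because the best row can be re-run in isolation from its stored seed. Whether the same holds for the real model depends on the library taking all of its randomness from R's generator, which I have not verified.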