My understanding of colsample_bytree is that it randomly samples from the features (columns) for each tree, so it would be a way to limit dimensionality prior to constructing trees. However, when I use values < 1, it seems to always take the first columns (up to the given fraction) rather than a random subset. Extending the sample R code for xgb.train:
library(xgboost)
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
watchlist <- list(train = dtrain, eval = dtest)
## A simple xgb.train example:
param <- list(max_depth = 2, eta = 1, silent = 1, nthread = 2, colsample_bytree = 1,
              objective = "binary:logistic", eval_metric = "auc")
bst <- xgb.train(param, dtrain, nrounds = 100, watchlist, verbose = 0)
xgb.importance(model = bst)
param <- list(max_depth = 2, eta = 1, silent = 1, nthread = 2, colsample_bytree = 0.01,
              objective = "binary:logistic", eval_metric = "auc")
bst <- xgb.train(param, dtrain, nrounds = 100, watchlist, verbose = 0)
xgb.importance(model = bst)
When using 0.01 (which should limit each tree to 1-2 columns at most), the importance matrix always shows cap-shape=bell, the first column in the data; I wouldn't expect that if the column sampling were random.

I see the same behaviour with my own data: setting colsample_bytree to X always yields an importance matrix whose columns come from the first X * 100% of the columns in my data matrix. So someone using 0.8 would effectively be throwing away the last 20% of their columns, and I don't think that's what users want or expect to be doing. Unless I'm misunderstanding the meaning and purpose of this parameter, this seems like a serious issue. Hoping someone knows better.
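For what it's worth, here is a minimal sketch for checking this across seeds; the set.seed() call is an assumption on my part, based on the parameter docs saying the R package draws its seed from R's RNG rather than from a seed parameter. If the column sampling were genuinely random, the top feature should vary from run to run, whereas with the behaviour above it stays pinned to the first column:

library(xgboost)
data(agaricus.train, package = 'xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
## Train with colsample_bytree = 0.01 under five different seeds and
## record the top feature from each model's importance matrix.
picked <- sapply(1:5, function(s) {
  set.seed(s)
  param <- list(max_depth = 2, eta = 1, silent = 1, nthread = 2,
                colsample_bytree = 0.01,
                objective = "binary:logistic", eval_metric = "auc")
  bst <- xgb.train(param, dtrain, nrounds = 10, verbose = 0)
  xgb.importance(model = bst)$Feature[1]
})
print(picked)  # with random sampling this should not be the same feature five times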