Colsample_bytree seemingly not random

My understanding of colsample_bytree is that it randomly samples from the features (columns) for each tree, so it would be a way to limit dimensionality prior to constructing trees. However, when I use values < 1, it seems to always take from the first columns (based on whatever percentage) rather than sampling randomly. Extending the sample R code for xgb.train:

data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')

dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
watchlist <- list(train = dtrain, eval = dtest)

## A simple xgb.train example:
param <- list(max_depth = 2, eta = 1, silent = 1, nthread = 2, colsample_bytree = 1,
              objective = "binary:logistic", eval_metric = "auc")
bst <- xgb.train(param, dtrain, nrounds = 100, watchlist, verbose = 0)
xgb.importance(model = bst)

param <- list(max_depth = 2, eta = 1, silent = 1, nthread = 2, colsample_bytree = 0.01,
              objective = "binary:logistic", eval_metric = "auc")
bst <- xgb.train(param, dtrain, nrounds = 100, watchlist, verbose = 0)
xgb.importance(model = bst)

When using 0.01 (which should limit it to 1-2 columns at most), the importance matrix always shows cap-shape=bell, the first column in the data. I wouldn’t expect this if the column sampling were random. I see this with my own data too: setting colsample_bytree to X always results in the importance matrix containing only columns from the first X * 100% of columns in my data matrix. So if someone uses 0.8, they would essentially be throwing away the last 20% of their columns, and I don’t think that’s what users want or expect. Unless I’m misunderstanding the meaning and purpose of this parameter, this seems like a serious issue. Hoping someone knows better.

Please note that the column sample is done for every tree, and no feature will be intentionally thrown away. One thing to check quickly is whether column_sample=0.5 affects the performance (it usually should not).
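For reference, the expected behaviour can be sketched in a few lines (Python here, purely for illustration; the function names and the ceil rounding rule are my assumptions, not xgboost internals). A correct sampler draws a fresh random subset of columns for each tree, whereas the behaviour reported above looks like it always keeps the leading columns:

```python
import math
import random

def sample_columns_random(n_cols, colsample, rng):
    # Expected behaviour: a fresh random subset of columns for each tree.
    k = max(1, math.ceil(n_cols * colsample))
    return sorted(rng.sample(range(n_cols), k))

def sample_columns_first_k(n_cols, colsample):
    # Behaviour reported above: always the first k columns, every tree.
    k = max(1, math.ceil(n_cols * colsample))
    return list(range(k))

rng = random.Random(42)
n_cols = 126  # agaricus.train$data has 126 one-hot feature columns

# With random sampling, columns beyond the first k show up across trees.
seen = set()
for _ in range(100):
    seen.update(sample_columns_random(n_cols, 0.05, rng))

# With the reported behaviour, features past index k - 1 can never be used,
# which is exactly what the importance matrices below suggest.
fixed = sample_columns_first_k(n_cols, 0.05)
```

With the deterministic variant, shuffling the input columns changes which features are reachable at all, which matches the shuffle experiment further down the thread.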

If that’s how it’s supposed to work, then why is no tree showing importance for any variable except the ones that just so happen to be in the first colsample_bytree % of the matrix columns? I’ve reworked the example to make this more clear:

data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')

dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
watchlist <- list(train = dtrain, eval = dtest)

## A simple xgb.train example:
param <- list(max_depth = 2, eta = 1, silent = 1, nthread = 2, colsample_bytree = 1,
              objective = "binary:logistic", eval_metric = "auc")
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist, verbose = 0)
xgb.importance(model = bst)

# Feature       Gain     Cover Frequency
# 1:               odor=none 0.67615472 0.4978746       0.4
# 2:         stalk-root=club 0.17135373 0.1920543       0.2
# 3:       stalk-root=rooted 0.12317237 0.1638750       0.2
# 4: spore-print-color=green 0.02931918 0.1461960       0.2

param <- list(max_depth = 2, eta = 1, silent = 1, nthread = 2, colsample_bytree = 0.05,
              objective = "binary:logistic", eval_metric = "auc")
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist, verbose = 0)
xgb.importance(model = bst)

# Feature        Gain     Cover Frequency
# 1:    cap-shape=bell 0.613303696 0.5116335      0.50
# 2: cap-shape=knobbed 0.383851088 0.2467807      0.25
# 3:    cap-shape=flat 0.002845216 0.2415857      0.25

colnames(dtrain)[1:7]
# [1] "cap-shape=bell"      "cap-shape=conical"   "cap-shape=convex"    "cap-shape=flat"      "cap-shape=knobbed"   "cap-shape=sunken"    "cap-surface=fibrous"

# shuffle columns first and demonstrate that it changes the outcome to the new lead columns
shuffle_cols <- sample(1:ncol(agaricus.train$data))
dtrain <- xgb.DMatrix(agaricus.train$data[, shuffle_cols], label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data[, shuffle_cols], label = agaricus.test$label)
watchlist <- list(train = dtrain, eval = dtest)

param <- list(max_depth = 2, eta = 1, silent = 1, nthread = 2, colsample_bytree = 0.05,
              objective = "binary:logistic", eval_metric = "auc")
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist, verbose = 0)
xgb.importance(model = bst)

# Feature       Gain     Cover Frequency
# 1: stalk-color-above-ring=gray 0.34586308 0.2699291      0.25
# 2:             ring-number=two 0.27906377 0.2492748      0.25
# 3:         population=abundant 0.27766361 0.2510303      0.25
# 4: stalk-color-above-ring=pink 0.09740955 0.2297658      0.25

colnames(dtrain)[1:7]
# [1] "stalk-color-above-ring=gray"    "ring-number=two"                "population=abundant"            "stalk-color-above-ring=pink"    "gill-color=gray"                "spore-print-color=chocolate"    "stalk-surface-above-ring=scaly"

So for colsample_bytree = 0.05, it’s going to sample 7 columns, and the important variables are always from the first 7 columns in the matrix. Unless I misunderstand something (or perhaps I’m being misled by the importance matrix), this is not expected if it’s randomly sampling 5% of columns.

And no, column_sample (undocumented?) didn’t change the outcome.

To further drive this home: if you set colsample_bytree to 0.86 or higher, you get the same outcome as setting it to 1, as that’s high enough to include 109 features, and spore-print-color=green just so happens to be 109th in the matrix. If you drop to 0.85, the model becomes (note the change in the 4th variable):

                          Feature       Gain     Cover Frequency
1:                      odor=none 0.68684104 0.4978845       0.4
2:                stalk-root=club 0.17484853 0.1911557       0.2
3:              stalk-root=rooted 0.12595220 0.1654478       0.2
4: stalk-surface-below-ring=scaly 0.01235824 0.1455120       0.2
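Those thresholds line up with the sampler keeping the first ceil(p * n) columns of the feature matrix. A quick sanity check of the arithmetic (the ceil rounding rule is an assumption inferred from the numbers in this thread; agaricus has 126 feature columns):

```python
import math

def leading_columns_kept(n_cols, colsample):
    # Assumed rounding rule: the (buggy) sampler keeps the first
    # ceil(colsample * n_cols) columns of the feature matrix.
    return math.ceil(n_cols * colsample)

n = 126
print(leading_columns_kept(n, 0.86))  # 109: the 109th column makes the cut
print(leading_columns_kept(n, 0.85))  # 108: the 109th column is dropped
print(leading_columns_kept(n, 0.05))  # 7: matches the "first 7 columns" behaviour
```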

eh…yes, we observed the same thing…

Thanks for the confirmation. That seems to be enough to raise the alarm but I’m curious if anyone knows if this is simply user error. If not, seems like a serious issue.

There’s also this issue: https://github.com/dmlc/xgboost/issues/3230. Not sure if it’s related.

@hcho3 can you dig a bit into the code and print out the columns, to see if the columns used in each iteration are fixed?

Let us also confirm whether it is specific to R, as R has a different way to get random numbers.

@tqchen I’ll take a look.

From what I can tell, this doesn’t happen in Python. I can change the parameter and the model still selects features throughout the full range of columns.

Did you observe this in Java/Scala?

I too have observed the same issue [language = R].

Is there a way to raise this as a bug so it can be fixed in a subsequent release?

Bugs can be reported as issues in the main repository.

I already reported it before coming here (which was suggested to me in the thread there).

Thanks @neverfox I had missed that issue!

Chiming in to say that I’m also fighting with the same issue. Does anyone have a temporary fix? Right now I’m forced to set colsample_bytree = 1. Thanks.

Will try to fix this before the next planned release (October 1, 2018).

I published a fix at https://github.com/dmlc/xgboost/pull/3781.