Colsample_bytree seemingly not random

My understanding of colsample_bytree is that it randomly samples from the features (columns) for each tree, so it would be a way to limit dimensionality prior to constructing trees. However, when I use values < 1, it seems to always take from the first columns (based on whatever percentage) rather than sampling randomly. Extending the sample R code for xgb.train:

data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')

dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
watchlist <- list(train = dtrain, eval = dtest)

## A simple xgb.train example:
param <- list(max_depth = 2, eta = 1, silent = 1, nthread = 2, colsample_bytree = 1,
              objective = "binary:logistic", eval_metric = "auc")
bst <- xgb.train(param, dtrain, nrounds = 100, watchlist, verbose = 0)
xgb.importance(model = bst)

param <- list(max_depth = 2, eta = 1, silent = 1, nthread = 2, colsample_bytree = 0.01,
              objective = "binary:logistic", eval_metric = "auc")
bst <- xgb.train(param, dtrain, nrounds = 100, watchlist, verbose = 0)
xgb.importance(model = bst)

When using 0.01 (which should limit it to 1-2 columns at most), the importance matrix always shows cap-shape=bell, the first column in the data. I wouldn’t expect this if the column sampling were random. I see this with my own data too: setting colsample_bytree to X always results in the importance matrix containing only columns from the first X * 100% of columns in my data matrix. So if someone uses 0.8, they would essentially be throwing away the last 20% of their columns, and I don’t think that’s what users want or expect. Unless I’m misunderstanding the meaning and purpose of this parameter, this seems like a serious issue. Hoping someone knows better.

Please note that the column sample is done for every tree, and no feature will be intentionally thrown away. One thing to check quickly is whether column_sample=0.5 affects the performance (it usually should not).
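For reference, the expected behaviour can be sketched in a few lines (Python here, purely for illustration; the function names and the ceil rounding rule are my assumptions, not xgboost internals). A correct sampler draws a fresh random subset of columns for each tree, whereas the behaviour reported above looks like it always keeps the leading columns:

```python
import math
import random

def sample_columns_random(n_cols, colsample, rng):
    # Expected behaviour: a fresh random subset of columns for each tree.
    k = max(1, math.ceil(n_cols * colsample))
    return sorted(rng.sample(range(n_cols), k))

def sample_columns_first_k(n_cols, colsample):
    # Behaviour reported above: always the first k columns, every tree.
    k = max(1, math.ceil(n_cols * colsample))
    return list(range(k))

rng = random.Random(42)
n_cols = 126  # agaricus.train$data has 126 one-hot feature columns

# With random sampling, columns beyond the first k show up across trees.
seen = set()
for _ in range(100):
    seen.update(sample_columns_random(n_cols, 0.05, rng))

# With the reported behaviour, features past index k - 1 can never be used,
# which is exactly what the importance matrices below suggest.
fixed = sample_columns_first_k(n_cols, 0.05)
```

With the deterministic variant, shuffling the input columns changes which features are reachable at all, which matches the shuffle experiment further down the thread.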

If that’s how it’s supposed to work, then why is no tree showing importance for any variable except the ones that just so happen to be in the first colsample_bytree % of the matrix columns? I’ve reworked the example to make this more clear:

data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')

dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
watchlist <- list(train = dtrain, eval = dtest)

## A simple xgb.train example:
param <- list(max_depth = 2, eta = 1, silent = 1, nthread = 2, colsample_bytree = 1,
              objective = "binary:logistic", eval_metric = "auc")
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist, verbose = 0)
xgb.importance(model = bst)

# Feature       Gain     Cover Frequency
# 1:               odor=none 0.67615472 0.4978746       0.4
# 2:         stalk-root=club 0.17135373 0.1920543       0.2
# 3:       stalk-root=rooted 0.12317237 0.1638750       0.2
# 4: spore-print-color=green 0.02931918 0.1461960       0.2

param <- list(max_depth = 2, eta = 1, silent = 1, nthread = 2, colsample_bytree = 0.05,
              objective = "binary:logistic", eval_metric = "auc")
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist, verbose = 0)
xgb.importance(model = bst)

# Feature        Gain     Cover Frequency
# 1:    cap-shape=bell 0.613303696 0.5116335      0.50
# 2: cap-shape=knobbed 0.383851088 0.2467807      0.25
# 3:    cap-shape=flat 0.002845216 0.2415857      0.25

colnames(dtrain)[1:7]
# [1] "cap-shape=bell"      "cap-shape=conical"   "cap-shape=convex"    "cap-shape=flat"      "cap-shape=knobbed"   "cap-shape=sunken"    "cap-surface=fibrous"

# shuffle columns first and demonstrate that it changes the outcome to the new lead columns
shuffle_cols <- sample(1:ncol(agaricus.train$data))
dtrain <- xgb.DMatrix(agaricus.train$data[, shuffle_cols], label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data[, shuffle_cols], label = agaricus.test$label)
watchlist <- list(train = dtrain, eval = dtest)

param <- list(max_depth = 2, eta = 1, silent = 1, nthread = 2, colsample_bytree = 0.05,
              objective = "binary:logistic", eval_metric = "auc")
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist, verbose = 0)
xgb.importance(model = bst)

# Feature       Gain     Cover Frequency
# 1: stalk-color-above-ring=gray 0.34586308 0.2699291      0.25
# 2:             ring-number=two 0.27906377 0.2492748      0.25
# 3:         population=abundant 0.27766361 0.2510303      0.25
# 4: stalk-color-above-ring=pink 0.09740955 0.2297658      0.25

colnames(dtrain)[1:7]
# [1] "stalk-color-above-ring=gray"    "ring-number=two"                "population=abundant"            "stalk-color-above-ring=pink"    "gill-color=gray"                "spore-print-color=chocolate"    "stalk-surface-above-ring=scaly"

So for colsample_bytree = 0.05, it’s going to sample 7 columns, and the important variables are always from the first 7 columns in the matrix. Unless I misunderstand something (or perhaps I’m being misled by the importance matrix), this is not expected if it’s randomly sampling 5% of columns.

And no, column_sample (undocumented?) didn’t change the outcome.

To further drive this home: if you set colsample_bytree to 0.86 or higher, you get the same outcome as setting it to 1, as that’s high enough to include 109 features, and spore-print-color=green just so happens to be 109th in the matrix. If you drop to 0.85, the model becomes (note the change in the 4th variable):

                          Feature       Gain     Cover Frequency
1:                      odor=none 0.68684104 0.4978845       0.4
2:                stalk-root=club 0.17484853 0.1911557       0.2
3:              stalk-root=rooted 0.12595220 0.1654478       0.2
4: stalk-surface-below-ring=scaly 0.01235824 0.1455120       0.2
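Those thresholds line up with the sampler keeping the first ceil(p * n) columns of the feature matrix. A quick sanity check of the arithmetic (the ceil rounding rule is an assumption inferred from the numbers in this thread; agaricus has 126 feature columns):

```python
import math

def leading_columns_kept(n_cols, colsample):
    # Assumed rounding rule: the (buggy) sampler keeps the first
    # ceil(colsample * n_cols) columns of the feature matrix.
    return math.ceil(n_cols * colsample)

n = 126
print(leading_columns_kept(n, 0.86))  # 109: the 109th column makes the cut
print(leading_columns_kept(n, 0.85))  # 108: the 109th column is dropped
print(leading_columns_kept(n, 0.05))  # 7: matches the "first 7 columns" behaviour
```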

eh…yes, we observed the same thing…

Thanks for the confirmation. That seems to be enough to raise the alarm but I’m curious if anyone knows if this is simply user error. If not, seems like a serious issue.

There’s also this issue: https://github.com/dmlc/xgboost/issues/3230. Not sure if it’s related.

@hcho3 can you dig a bit into the code and print out the columns, to see if the columns used in each iteration are fixed?

Let us also confirm whether it is specific to R, as R has a different way to get random numbers.

@tqchen I’ll take a look.

From what I can tell, this doesn’t happen in Python. I can change the parameter and the model still selects features throughout the full range of columns.

Did you observe this in Java/Scala?

I too have observed the same issue [language = R].

Is there a way to raise this as a bug so it can be fixed in a subsequent release?

Bugs can be reported as issues in the main repository.

I already reported it before coming here (which was suggested to me in the thread there).

Thanks @neverfox I had missed that issue!

Chiming in to say that I’m also fighting with the same issue. Does anyone have a temporary fix? Right now I’m forced to set colsample_bytree = 1. Thanks.

Will try to fix this before the next planned release (October 1, 2018).

I published a fix at https://github.com/dmlc/xgboost/pull/3781.