Bias towards cutting on features with large numbers of categories

Hi,

I’ve noticed that xgboost does not seem to take the complexity of a cut into account when comparing a categorical subset split criterion against a cut on a numerical feature.

For example, in the code below, even though the target depends only on feature a, xgboost chooses to cut almost entirely on feature b.

import pandas as pd
import numpy as np
import xgboost as xgb

# 'a' is a numerical feature that actually drives the target; 'b' is a 5000-category feature that is pure noise with respect to y.
X = pd.DataFrame({'a': np.random.uniform(-1, 1, 10000), 'b': (list(range(5000)) * 2)})
y = X['a'] + np.random.normal(0, 5.0, 10000)

# Mark 'a' as numerical ('q') and 'b' as categorical ('c').
ft = ['q', 'c']
Xy = xgb.DMatrix(X, y, feature_types=ft, enable_categorical=True)

params = {
    "tree_method": "hist",
    "max_cat_to_onehot": 5,      # use partition-based splits for 'b' (5000 > 5 categories)
    "base_score": 0.0,
    "lambda": 5.0,
    "max_cat_threshold": 5000,   # allow up to 5000 categories in one subset split
}
model = xgb.train(params, Xy, num_boost_round=1)

model.trees_to_dataframe().head(20)
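
A quick way to summarise which feature the splits land on (exact counts will vary with the random draw) is to tabulate the Feature column of the tree dump:

# 'Leaf' rows are leaf nodes; the remaining rows show which feature each split uses.
print(model.trees_to_dataframe()['Feature'].value_counts())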


Is it possible to penalize complex cuts like these in order to reduce the likelihood of xgboost cutting on this feature?

Yeah, decision trees can make these kinds of mistakes. One way to mitigate the issue is to use column sampling with feature weights.
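
If I understand the suggestion correctly, that would look roughly like the sketch below: feature_weights biases the per-node column draw towards a, and colsample_bynode has to be below 1.0 for the weights to have any effect. The 10:1 weight ratio and the 0.5 sampling rate are arbitrary, purely for illustration.

import numpy as np
import xgboost as xgb

# X, y as in the repro above.
Xy = xgb.DMatrix(X, y, feature_types=['q', 'c'], enable_categorical=True)
Xy.set_info(feature_weights=np.array([10.0, 1.0]))  # down-weight 'b' relative to 'a'

model = xgb.train(
    {
        "tree_method": "hist",
        "colsample_bynode": 0.5,  # weighted column sampling happens at each node
        "lambda": 5.0,
    },
    Xy,
    num_boost_round=1,
)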

It does seem like xgboost should make some sort of numerical adjustment to the maximum gain for these subset splits to compensate for the much larger number of hypotheses being tested: in this case, on the order of 2^5000 possible category subsets at each split.
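
To put rough numbers on that comparison (the 256 below is an assumption, the default max_bin for the hist method; and as far as I understand xgboost finds the partition by sorting categories rather than enumerating subsets, so this is the size of the implicit search space rather than the work actually done):

import math

k = 5000                                  # categories in feature b
numeric_candidates = 256                  # assumed default max_bin for tree_method="hist"
subset_split_digits = int((k - 1) * math.log10(2))  # 2**(k-1) - 1 non-trivial partitions
print(f"numeric feature: ~{numeric_candidates} candidate thresholds")
print(f"categorical feature: ~10**{subset_split_digits} possible subset splits")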