Non-binary Categorical Support

apakbin94 · December 3, 2020, 9:24pm

I have a question that I could not find answered anywhere (though I saw a reference to it, for example, here). I have been reading the XGBoost code for a while and have noticed some dramatic changes, especially in “src/common/quantile.cu”, where the cuts are extracted in the current code. The snapshot I was reading previously was from March.

To be more specific, the function responsible for extracting cuts is now “PruneImpl”, but back then it was “ExtractCuts” (for the unweighted case). “PruneImpl” seems to check whether the column is categorical and, if it is, puts the “SketchEntry” directly into “out_cuts”.

Now my question is: is it still the case that categorical columns need to be one-hot-encoded? Or do we now have this “is_cat” boolean that takes care of categorical columns?
If the latter is the case, how would pruning be done if we need to reduce the number of cuts to 256 (in case we have more than 256 categories)? I ask because, if categorical features have no order, it is unclear how 257+ categories would be distilled down to 256.

Also, as a side note: suppose categorical features are to be treated differently, rather than being one-hot-encoded and then handled like any other feature. Currently each node has a (feature, value) pair that decides whether a data point goes to the left node (if x[feature] < value) or to the right node (if x[feature] >= value). Does this mean we will instead have categorical tests where, for example, a point goes left if x[categ_feature] == value and goes right otherwise?

hcho3 · December 4, 2020, 2:07am

is it still the case that categorical columns need to be one-hot-encoded?

Direct support for categorical splits is currently in active development, so one-hot encoding will not be required in the near future.
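For readers unfamiliar with the workaround being discussed, here is a minimal pure-Python sketch of one-hot encoding. The helper `one_hot` and the sample data are hypothetical and do not come from XGBoost. The sketch also checks that a threshold test on a single one-hot column selects the same rows as an equality test on the original category.

```python
def one_hot(values):
    """Hypothetical helper: one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot(colors)

# Threshold test on the one-hot column for "green" (0.5 is an arbitrary
# cut point between 0 and 1, chosen for illustration).
col = categories.index("green")
by_threshold = [row[col] >= 0.5 for row in encoded]

# Equality test on the raw categorical value.
by_equality = [v == "green" for v in colors]

# Both tests partition the rows identically.
assert by_threshold == by_equality
```

In this sketch, each one-hot column can only express a test against a single category, at the cost of adding one column per category.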

how would pruning be done if we need to reduce the number of cuts to 256

All splits in XGBoost are two-way splits. Categorical splits will take the form of a set membership test, i.e. feature ∈ set.

Currently, XGBoost generates splits of the form feature ∈ {value}, i.e. the feature value is tested against a single categorical value. In the future, XGBoost will also generate splits where the test set contains more than one value.
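The two kinds of two-way test can be sketched as follows. This is an illustration only, not XGBoost's internal representation: the class names `NumericSplit` and `CategoricalSplit` and their fields are hypothetical, and which child a matching row is routed to is deliberately left unspecified.

```python
from dataclasses import dataclass
from typing import FrozenSet, Sequence


@dataclass(frozen=True)
class NumericSplit:
    """Threshold test on an ordered feature: x[feature] < threshold."""
    feature: int
    threshold: float

    def matches(self, row: Sequence[float]) -> bool:
        return row[self.feature] < self.threshold


@dataclass(frozen=True)
class CategoricalSplit:
    """Set membership test: x[feature] in categories."""
    feature: int
    categories: FrozenSet[int]

    def matches(self, row: Sequence[float]) -> bool:
        return row[self.feature] in self.categories


row = [3.2, 7]  # feature 0 is numeric, feature 1 is a category code

numeric = NumericSplit(feature=0, threshold=5.0)
single = CategoricalSplit(feature=1, categories=frozenset({7}))
multi = CategoricalSplit(feature=1, categories=frozenset({2, 7, 300}))

print(numeric.matches(row))       # 3.2 < 5.0
print(single.matches(row))        # 7 in {7}
print(multi.matches(row))         # 7 in {2, 7, 300}
print(multi.matches([3.2, 4]))    # 4 is not in the set
```

A single-value set reproduces the feature ∈ {value} form, and a larger set gives the multi-value form. The membership test does not rely on any ordering of the category codes.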

hcho3 · December 16, 2020, 2:17am

@apakbin94 You can subscribe to https://github.com/dmlc/xgboost/issues/6503 to stay updated on the progress of categorical data support.