What is the best way to use categorical predictors in xgboost

Sorry to trouble you again.

I have quite a few categorical predictors. Some can take on 3 values, some 300 and some 3000. I have about a million rows of data.
What would be the best way to put those categorical predictors into xgboost?

At the moment, I am converting each of those predictors to a continuous variable, with its value set to the proportion of the minority class over the training set. For example, one predictor might be a code. In the training set, when this code is ‘A3’, 3% of the observations are of class 1. So for all records with code A3, the continuous version of this code variable is set to 0.03. And so on for all the other values of the code (A4, A5, etc.).
For the test set, I would also set records with code A3 to 0.03, i.e. the proportion of 1s computed on the training set.
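For concreteness, here is a minimal pandas sketch of the scheme described above (the toy data and column names are made up for illustration):

```python
import pandas as pd

# Toy data; "code" is the categorical predictor, "y" is the binary target.
train = pd.DataFrame({"code": ["A3", "A3", "A4", "A4", "A4", "A5"],
                      "y":    [1,    0,    0,    0,    1,    0]})
test = pd.DataFrame({"code": ["A3", "A4", "A5", "A6"]})

# Mean of the target per code, computed over the training set only.
means = train.groupby("code")["y"].mean()
global_mean = train["y"].mean()

train["code_enc"] = train["code"].map(means)
# Codes unseen in training (here "A6") fall back to the global training mean.
test["code_enc"] = test["code"].map(means).fillna(global_mean)
```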
Is that the right thing to do, and am I right to compute this over the training set?

Many thanks for your help, it is much appreciated.

That’s an odd way of handling categorical variables. Wouldn’t you usually convert categorical variables into dummy (one-hot-encoded) variables?

The current version of XGBoost cannot handle non-binary categorical features directly; the expectation is that you’d convert them into binary dummies first. We are actively developing a new feature in XGBoost that would let you feed in non-binary categorical features directly (#5949, #6137, #6140).
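For reference, the usual conversion into binary dummies can be sketched with pandas like this (frame and column names are made up); the resulting all-numeric frame is what you would then pass to XGBoost:

```python
import pandas as pd

df = pd.DataFrame({"code": ["A3", "A4", "A3", "A5"],
                   "x":    [1.0,  2.0,  3.0,  4.0]})

# One binary dummy column per level of "code"; non-encoded columns are kept.
dummies = pd.get_dummies(df, columns=["code"])
# dummies now has the columns: x, code_A3, code_A4, code_A5
```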

Yes, it does seem odd. However, it is suggested in ‘The Elements of Statistical Learning’, page 310, section 9.2.4 in the 2nd edition. I have also seen it done under the name Ordered Target Statistics, and it is what CatBoost does.

Thanks for the reference. I looked it up in my ESL book. The idea seems to be similar to the practice of target encoding, for which many packages exist.

Target encoding is often used as an alternative to one-hot encoding, as one-hot encoding tends to slow down training.
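As an illustration of the kind of thing those packages compute, here is a hand-rolled sketch of smoothed target encoding (the smoothing formula, parameter `m`, and all names here are my own for illustration, not any particular package’s API):

```python
import pandas as pd

def target_encode(train, col, target, m=10.0):
    # Blend each category's mean target with the global mean, weighted by the
    # category's count; m acts as a prior that shrinks rare codes to the mean.
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    return (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

train = pd.DataFrame({"code": ["A3"] * 4 + ["A4"] * 2,
                      "y":    [1, 1, 1, 0, 0, 0]})
enc = target_encode(train, "code", "y", m=2.0)
# enc["A3"] is pulled from its raw mean 0.75 toward the global mean 0.5,
# and enc["A4"] from 0.0 toward 0.5; rarer codes are shrunk harder.
```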

This link is also useful: https://www.kaggle.com/vprokopev/mean-likelihood-encodings-a-comprehensive-study

Thank you, that’s very useful.

According to ESL, the problem with one-hot encoding is not so much that it slows down training, but rather that there are too many possible ways of splitting the categories.

Would target encoding be something incorporated into a future version of XGBoost, similar to CatBoost perhaps?

Thank you again,

Michel

I am not entirely convinced that target leakage is a problem with decision trees: aren’t we always using the label to determine the optimal split?

Yes, we’d like to add a TargetEncoder to XGBoost at some point in the future.

ESL book page 310 doesn’t say anything about one-hot encoding:

When splitting a predictor having Q possible unordered values, there are 2^(Q-1) - 1 possible partitions of the Q values into two groups, and the computations become prohibitive for large Q.

which suggests creating a binary split directly on the categorical variable, without one-hot encoding. If one-hot encoding is used, you end up with Q binary dummy variables, so you don’t have 2^(Q-1) - 1 possible partitions (*). Instead, you get slower training, because the training matrix now has many more features, and XGBoost consumes memory proportional to the number of features in the data matrix.
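The 2^(Q-1) - 1 count can be checked by brute force for a small Q: each two-group split is fixed by choosing which levels share a side with the first level, excluding the trivial case where that side holds every level (a small self-contained sketch, level names made up):

```python
from itertools import combinations

levels = ["A", "B", "C", "D"]  # Q = 4 unordered levels
# Enumerate the group containing levels[0]; leaving out the case where it
# contains all the levels yields exactly 2**(Q - 1) - 1 distinct splits.
splits = [frozenset((levels[0],) + rest)
          for r in range(len(levels) - 1)
          for rest in combinations(levels[1:], r)]
assert len(splits) == 2 ** (len(levels) - 1) - 1  # 7 splits for Q = 4
```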

It is an issue because the target is now implicitly included in your input matrix. You can avoid the leakage by dividing the training data into multiple folds. See https://medium.com/rapids-ai/target-encoding-with-rapids-cuml-do-more-with-your-categorical-data-8c762c79e784
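A minimal sketch of that fold-based idea, sometimes called out-of-fold target encoding (the function and fold logic here are my own illustration, not the linked post’s exact code):

```python
import numpy as np
import pandas as pd

def oof_target_encode(df, col, target, n_splits=5, seed=0):
    # Each row's encoding is computed from the *other* folds only,
    # so no row's own label leaks into its encoded value.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(df))
    enc = pd.Series(np.nan, index=df.index, dtype=float)
    global_mean = df[target].mean()
    for fold_pos in np.array_split(order, n_splits):
        rest = df.drop(df.index[fold_pos])      # all rows outside this fold
        means = rest.groupby(col)[target].mean()
        encoded = df.iloc[fold_pos][col].map(means).fillna(global_mean)
        enc.iloc[fold_pos] = encoded.to_numpy()
    return enc
```

For the test set there is no leakage concern, so you would encode it with per-category means computed over the full training set, as usual.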

(*) In practice, decision trees are fit in a greedy, stepwise fashion, adding one binary split at a time. So if you split a single categorical variable into Q dummies, the tree does not consider all 2^(Q-1) - 1 combinations.

Yes, I can see the argument about leakage, but I am not yet convinced that it will be a problem, except perhaps for rare codes. I think what I am going to do is use the H2O package in R to do the encoding.

I agree that not all combinations are considered, but because the tree is only so deep, you can only explore splits that pit a few individual codes against all the others lumped together.