Categorical Features -> One Hot Encoding vs (Spurious) Ordinal


#1

Hi all,

From the reading I’ve done, it seems to me that the preferred way to deal with categorical features is to do a full-rank one-hot encoding.

That is, if I have a feature which is, say, T-Shirt Colour, which can take on values of 1, 2 or 3, corresponding to Red, Orange and Green, respectively, then it is best to create 2 binary features: colour.red and colour.orange, with the final category, Green, being implied when both are 0.
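A minimal sketch of that drop-one encoding in plain Python (the column names colour.red / colour.orange are just the ones from the example above):

```python
# Colour codes from the example: 1 = Red, 2 = Orange, 3 = Green.
def one_hot_drop_last(value):
    """Map a colour code to (colour.red, colour.orange);
    Green is the implied reference level, encoded as (0, 0)."""
    return (1 if value == 1 else 0, 1 if value == 2 else 0)

rows = [1, 2, 3]                                  # Red, Orange, Green
encoded = [one_hot_drop_last(v) for v in rows]
# Red -> (1, 0), Orange -> (0, 1), Green -> (0, 0)
```

In practice you'd use something like pandas' `get_dummies(..., drop_first=True)` rather than hand-rolling this, but the mapping is the same.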

My question is then: since XGBoost is an ensemble of CARTs, would it not be fine to leave the category as a numeric feature? The algorithm should be able to split in such a way that the outcome is the same.

For example, a split where T-Shirt Colour is less than 1.5 would correspond to T-Shirt Colour == Red. It could then split again on T-Shirt Colour less than 2.5 to differentiate between Orange and Green?
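The two threshold splits you describe, written out as a tiny decision function:

```python
def classify(colour_code):
    """Emulate the two proposed splits on the raw numeric code
    (1 = Red, 2 = Orange, 3 = Green)."""
    if colour_code < 1.5:       # first split isolates Red
        return "red"
    elif colour_code < 2.5:     # second split separates Orange from Green
        return "orange"
    else:
        return "green"
```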

Thanks!


#2

You’d introduce an artifact in the split conditions: in your example, no single split can group Red (1) and Green (3) on the left side and Orange (2) on the right side.
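You can check this by enumerating the only two meaningful thresholds on the numeric code; neither one isolates Orange:

```python
codes = {"red": 1, "orange": 2, "green": 3}
target_left = {"red", "green"}   # the grouping we'd like from one split

def split_left(threshold):
    """Colours sent left by a single 'code < threshold' split."""
    return {name for name, code in codes.items() if code < threshold}

# The only thresholds that distinguish the codes lie between 1-2 and 2-3.
possible_left_groups = [split_left(t) for t in (1.5, 2.5)]
# -> [{'red'}, {'red', 'orange'}]; {'red', 'green'} is not achievable
can_group = target_left in possible_left_groups   # False
```

With one-hot columns, a single split on colour.orange gets exactly that grouping, which is the artifact being pointed out.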


#3

Thanks for the reply.

But, theoretically, the tree would still be able to split all three, and still classify correctly? The tree would be less parsimonious though…


#4

Also, you’d need to specify a lower level of regularization.


#5

That being the gamma parameter?


#6

See https://xgboost.readthedocs.io/en/latest/tutorials/param_tuning.html#control-overfitting
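For reference, the regularization-related parameters that page discusses, as an illustrative params dict (the values here are XGBoost's defaults, not tuning recommendations):

```python
# Illustrative XGBoost parameter dict; values are defaults, not advice.
params = {
    "gamma": 0.0,            # min loss reduction required to make a split
    "min_child_weight": 1,   # min sum of instance weight needed in a child
    "max_depth": 6,          # deeper trees may be needed with ordinal codes
    "lambda": 1.0,           # L2 regularization on leaf weights
    "alpha": 0.0,            # L1 regularization on leaf weights
}
```

So gamma is one of the knobs, but lowering regularization could also mean relaxing min_child_weight, lambda or alpha, or allowing a larger max_depth.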