Categorical Features -> One Hot Encoding vs (Spurious) Ordinal


#1

Hi all,

From the reading I’ve done, it seems to me that the preferred way to deal with categorical features is to do a full-rank one-hot encoding.

That is, if I have a feature which is, say, T-Shirt Colour, which can take on values of 1, 2 or 3, corresponding to Red, Orange and Green, respectively, then it is best to create 2 binary features: colour.red and colour.orange, with the final category, Green, being implied when both are 0.
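A minimal sketch of that drop-one encoding in plain Python (the column names colour.red / colour.orange are just the ones from the example above):

```python
# Colour codes from the example: 1 = Red, 2 = Orange, 3 = Green.
def one_hot_drop_last(value):
    """Map a colour code to (colour.red, colour.orange);
    Green is the implied reference level, encoded as (0, 0)."""
    return (1 if value == 1 else 0, 1 if value == 2 else 0)

rows = [1, 2, 3]                                  # Red, Orange, Green
encoded = [one_hot_drop_last(v) for v in rows]
# Red -> (1, 0), Orange -> (0, 1), Green -> (0, 0)
```

In practice you'd use something like pandas' `get_dummies(..., drop_first=True)` rather than hand-rolling this, but the mapping is the same.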

My question is then: since XGBoost is an ensemble of CARTs, would it not be fine to leave the category as a numeric feature? The algorithm should be able to split in such a way that the outcome is the same.

For example, a split where T-Shirt Colour is less than 1.5 would correspond to T-Shirt Colour == Red. It could then split again on T-Shirt Colour less than 2.5 to differentiate between Orange and Green?
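The two threshold splits you describe, written out as a tiny decision function:

```python
def classify(colour_code):
    """Emulate the two proposed splits on the raw numeric code
    (1 = Red, 2 = Orange, 3 = Green)."""
    if colour_code < 1.5:       # first split isolates Red
        return "red"
    elif colour_code < 2.5:     # second split separates Orange from Green
        return "orange"
    else:
        return "green"
```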

Thanks!


#2

You’d introduce an artifact in the split conditions: in your example, no single split can group Red (1) and Green (3) on the left side and Orange (2) on the right side.
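You can check this by enumerating the only two meaningful thresholds on the numeric code; neither one isolates Orange:

```python
codes = {"red": 1, "orange": 2, "green": 3}
target_left = {"red", "green"}   # the grouping we'd like from one split

def split_left(threshold):
    """Colours sent left by a single 'code < threshold' split."""
    return {name for name, code in codes.items() if code < threshold}

# The only thresholds that distinguish the codes lie between 1-2 and 2-3.
possible_left_groups = [split_left(t) for t in (1.5, 2.5)]
# -> [{'red'}, {'red', 'orange'}]; {'red', 'green'} is not achievable
can_group = target_left in possible_left_groups   # False
```

With one-hot columns, a single split on colour.orange gets exactly that grouping, which is the artifact being pointed out.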


#3

Thanks for the reply.

But, theoretically, the tree would still be able to split all three, and still classify correctly? The tree would be less parsimonious though…


#4

Also, you’d need to specify a lower level of regularization.


#5

That being the gamma parameter?


#6

See https://xgboost.readthedocs.io/en/latest/tutorials/param_tuning.html#control-overfitting
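For reference, the regularization-related parameters that page discusses, as an illustrative params dict (the values here are XGBoost's defaults, not tuning recommendations):

```python
# Illustrative XGBoost parameter dict; values are defaults, not advice.
params = {
    "gamma": 0.0,            # min loss reduction required to make a split
    "min_child_weight": 1,   # min sum of instance weight needed in a child
    "max_depth": 6,          # deeper trees may be needed with ordinal codes
    "lambda": 1.0,           # L2 regularization on leaf weights
    "alpha": 0.0,            # L1 regularization on leaf weights
}
```

So gamma is one of the knobs, but lowering regularization could also mean relaxing min_child_weight, lambda or alpha, or allowing a larger max_depth.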