Hi all,
From the reading I’ve done, it seems to me that the preferred way to deal with categorical features is to do a full-rank one-hot encoding.
That is, if I have a feature such as T-Shirt Colour, which takes the values 1, 2 or 3 (corresponding to Red, Orange and Green, respectively), then it is best to create two binary features, colour.red and colour.orange, with the final category, Green, being implied.
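To make the encoding concrete, here is a minimal sketch of what I mean, in plain Python. The name `encode_full_rank` and the `COLOUR_CODES` mapping are my own, made up for this example; they are not part of XGBoost or any other library.

```python
# Hypothetical mapping taken from the example: 1 = Red, 2 = Orange, 3 = Green.
COLOUR_CODES = {1: "red", 2: "orange", 3: "green"}


def encode_full_rank(code):
    """Return the two indicator features; Green is the implied reference level."""
    name = COLOUR_CODES[code]
    return {
        "colour.red": int(name == "red"),
        "colour.orange": int(name == "orange"),
    }


rows = [1, 2, 3, 1]  # made-up sample of T-Shirt Colour codes
encoded = [encode_full_rank(code) for code in rows]

for code, features in zip(rows, encoded):
    print(code, features)
```

A Green row ends up with both indicators set to 0, which is what I mean by the final category being implied.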
My question is then: since XGBoost is an ensemble of CARTs, would it not be fine to leave the category as a single numeric feature? The algorithm should be able to split in such a way that the outcome is the same.
For example, a split on T-Shirt Colour < 1.5 would correspond to T-Shirt Colour == Red. It could then split again on T-Shirt Colour < 2.5 to differentiate between Orange and Green.
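Here is a rough sketch of the two sets of splits I have in mind, written as hand-coded Python rather than anything produced by XGBoost. The function names `leaf_from_numeric` and `leaf_from_one_hot` are hypothetical, and the 0.5 threshold on the binary features is my assumption about how a tree would split an indicator column.

```python
def leaf_from_numeric(colour_code):
    """Two threshold splits on the raw integer code (1, 2 or 3)."""
    if colour_code < 1.5:
        return "Red"
    if colour_code < 2.5:
        return "Orange"
    return "Green"


def leaf_from_one_hot(colour_red, colour_orange):
    """Two splits on the binary indicator features, assuming a 0.5 threshold."""
    if colour_red > 0.5:
        return "Red"
    if colour_orange > 0.5:
        return "Orange"
    return "Green"


# Check that both sets of splits put every colour in the same leaf.
for code in (1, 2, 3):
    red, orange = int(code == 1), int(code == 2)
    assert leaf_from_numeric(code) == leaf_from_one_hot(red, orange)
    print(code, leaf_from_numeric(code))
```

This only checks that the same partition is reachable with both representations; it says nothing about whether a trained model would actually choose these splits.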
Thanks!