Sorry to trouble you again.
I have quite a few categorical predictors. Some can take on 3 values, some 300 and some 3000. I have about a million rows of data.
What would be the best way to put those categorical predictors into xgboost?
At the moment, I am converting each of those predictors to a continuous variable, with its value set to the proportion of the minority class over the training set. For example, one predictor might be a code. In the training set, when this code is ‘A3’, 3% of the observations are of class 1. So for all records with code A3, the continuous version of this code variable is set to 0.03. And so on for all the other values of the code (A4, A5, etc.).
For the test set, I would also set those records with A3 to 0.03, i.e. the proportion of 1’s observed in the training set.
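To make the procedure concrete, here is a minimal Python sketch of the encoding described above. The toy data and all variable names are made up for illustration. The fallback for a code that never appears in the training set (here, the overall training rate) is an assumption, not something specified above.

```python
from collections import defaultdict

# Toy data; the code values and labels are made up for illustration.
train_codes = ["A3", "A3", "A3", "A3", "A4", "A4", "A5"]
train_labels = [1, 0, 0, 0, 1, 1, 0]
test_codes = ["A3", "A4", "A9"]  # "A9" never appears in training

# Count rows and class-1 rows per code, using the training set only.
counts = defaultdict(int)
positives = defaultdict(int)
for code, label in zip(train_codes, train_labels):
    counts[code] += 1
    positives[code] += label

# Proportion of class 1 for each code seen in training.
rate_by_code = {code: positives[code] / counts[code] for code in counts}

# Assumed fallback for codes unseen in training: the overall training rate.
overall_rate = sum(train_labels) / len(train_labels)

# Replace each code by its training-set proportion, in both train and test.
train_encoded = [rate_by_code[c] for c in train_codes]
test_encoded = [rate_by_code.get(c, overall_rate) for c in test_codes]
```

Note that the test set never contributes to `rate_by_code`; its rows are only looked up against the proportions learned from the training set.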
Is that the right thing to do, and am I right to compute this over the training set?
Many thanks for your help, it is much appreciated.