Description
I have set the enable_categorical argument in XGBClassifier using the SKLearn API, which lets me avoid having to one-hot encode the categorical features. The documentation here:
(https://xgboost.readthedocs.io/en/stable/tutorials/categorical.html#training-with-scikit-learn-interface) suggests that:
- the categorical columns should be set as pandas dtype ‘category’,
- these columns need to be encoded as integers starting at 0, as clarified here: (https://xgboost.readthedocs.io/en/stable/tutorials/categorical.html#miscellaneous). I use the OrdinalEncoder to achieve this.
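The two requirements above can be sketched in a few lines. This is a toy example, not my actual data: the column name ‘cat’ and the string values are hypothetical stand-ins.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame; 'cat' stands in for one of the real categorical columns.
df = pd.DataFrame({"cat": ["red", "blue", "green", "blue"]})

# OrdinalEncoder maps each distinct category to an integer code starting at 0
# (categories are ordered lexicographically: blue=0, green=1, red=2).
enc = OrdinalEncoder()
codes = enc.fit_transform(df[["cat"]])

# Cast the encoded column to pandas 'category' dtype, as the XGBoost docs require.
df["cat"] = pd.Series(codes.ravel()).astype("category")

print(df["cat"].dtype)  # category
```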
My XGBClassifier instance is wrapped within a CalibratedClassifierCV (step ‘clf’) and sits at the end of a set of transformations: a ColumnTransformer (step ‘enc’) followed by dimension reduction (step ‘dim’). The whole pipeline is in turn wrapped within RandomizedSearchCV for hyperparameter tuning, as in the diagram below:
So, after reading the input data via pd.read_csv(), I changed the desired columns, grouped as ‘target’ and ‘cat’ above, to pandas dtype ‘category’ and fitted the estimator above on it.
I also tried a second scenario where I left the ‘target’ and ‘cat’ columns of the input df with the default dtypes as read by pd.read_csv() and fitted the estimator on that.
I expected that in the first case (where I changed the pandas columns to dtype ‘category’), the categories of the features would be used to split nodes and give me different results from the second case. However, in both instances, I got identical results.
Finally, I also repeated the above two scenarios with the enable_categorical argument set to False in the XGBClassifier instance and yet again got the exact same results.
This suggests to me that the features (which carry the same ordinal values in all cases) are not being treated any differently, whether the pandas columns are of dtype ‘category’ or not, and whether the enable_categorical flag is set or not.
Am I missing something to get XGBClassifier to use the categorical features to split the nodes?
Note that the output of the ColumnTransformer step is a numpy array, which gets propagated through the ‘dim’ step to the ‘clf’ step. So, arguably, the data fed to XGBClassifier at the end of the pipeline is not in a dataframe at all, and the relevant columns therefore won’t have the pandas dtype ‘category’ by the time they reach the XGBClassifier.
Is this the reason for the identical results? If so, how can I utilise the categorical features functionality of XGBClassifier via the SKLearn API?
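The dtype loss described above is easy to verify in isolation. A minimal check, using hypothetical column names:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical frame with one 'category' column and one numeric column.
df = pd.DataFrame(
    {"cat": pd.Series(["a", "b", "a"]).astype("category"), "num": [1.0, 2.0, 3.0]}
)

ct = ColumnTransformer(
    [("ord", OrdinalEncoder(), ["cat"])], remainder="passthrough"
)
out = ct.fit_transform(df)

# The result is a plain ndarray of floats: the pandas 'category' dtype does
# not survive the transform, so a downstream XGBClassifier sees ordinary
# numeric columns regardless of enable_categorical.
print(type(out))   # <class 'numpy.ndarray'>
print(out.dtype)   # float64
```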
Reproducible example
I hope the above information is sufficient to answer my question. If you need a more tangible estimator, please let me know.
Environment info
xgboost version: 1.6.2
SKLearn: 1.1.2
Pandas: 1.5.1
numpy: 1.23.4