Enable_categorical support for XGBClassifier in pipelines and *SearchCV estimators

Description

I have set the enable_categorical argument in XGBClassifier via the scikit-learn API, so that I can avoid one-hot encoding the categorical features. The documentation here (https://xgboost.readthedocs.io/en/stable/tutorials/categorical.html#training-with-scikit-learn-interface) describes training with categorical data through the scikit-learn interface.

My XGBClassifier instance is wrapped within a CalibratedClassifierCV (step 'clf') and sits at the end of a set of transformations applied via a ColumnTransformer (step 'enc') and a dimension-reduction step (step 'dim'); the whole pipeline is in turn wrapped for hyperparameter tuning within RandomizedSearchCV, as in the diagram below:

So, after reading the input data via pd.read_csv(), I changed the desired columns (grouped as 'target' and 'cat' above) to pandas dtype 'category' and fitted the data to the estimator above.
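For context, a minimal sketch of that dtype-conversion step (the column names here are hypothetical stand-ins for the actual 'target' and 'cat' groupings in the original data):

```python
import pandas as pd

# Hypothetical stand-in for the real input data read via pd.read_csv();
# the column names are assumptions made for illustration.
df = pd.DataFrame({
    "cat_feature": ["a", "b", "a", "c"],
    "num_feature": [1.0, 2.5, 3.1, 0.7],
    "target": [0, 1, 0, 1],
})

# Mark the desired columns with pandas dtype 'category' so that
# XGBClassifier(enable_categorical=True) could recognise them.
for col in ["cat_feature", "target"]:
    df[col] = df[col].astype("category")

print(df.dtypes)  # cat_feature and target now show as 'category'
```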

I also tried a second scenario where I left the 'target' and 'cat' columns of the input df at the default dtypes inferred by pd.read_csv() and fitted the data to the estimator.

I expected that in the first case (where I changed the pandas columns to dtype ‘category’), the categories of the features would be used to split nodes and give me different results from the second case. However, in both instances, I got identical results.

Finally, I repeated the above two scenarios with the enable_categorical argument set to False in the XGBClassifier instance and got exactly the same results yet again.

This suggests to me that the features (with the same ordinal values in the columns in all cases) are not being treated any differently whether the pandas columns are dtype ‘category’ or otherwise and whether the enable_categorical flag is set or not.

Am I missing something to get XGBClassifier to use the categorical features to split the nodes?

Note that the output of the ColumnTransformer step is a numpy array, which gets propagated through the 'dim' step to the 'clf' step. So, arguably, the data fed to XGBClassifier at the end of the pipeline is not in a dataframe at all, and the relevant columns therefore won't have the pandas dtype 'category' when they reach the XGBClassifier.
Is this the reason for the identical results? If so, how can I utilise the categorical-features functionality of XGBClassifier via the SKLearn API?
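The dtype loss described above can be demonstrated without any pipeline at all: converting a DataFrame containing a 'category' column to a numpy array (roughly what a ColumnTransformer does internally) discards the categorical dtype entirely.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "cat": pd.Categorical(["a", "b", "a"]),
    "num": [1.0, 2.0, 3.0],
})

# Roughly what a ColumnTransformer emits to downstream steps
arr = df.to_numpy()

print(type(arr))   # a plain numpy.ndarray
print(arr.dtype)   # object -- the 'category' dtype information is gone
```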

Reproducible example

Hope the above information is sufficient to answer my question. If you need a more tangible example, please let me know.

Environment info

XGBoost version: 1.6.2
scikit-learn: 1.1.2
pandas: 1.5.1
NumPy: 1.23.4

Can you try running XGBoost without the sklearn pipeline? The support for sklearn pipelines is not yet fully mature.

Hello @hcho3,
I have built my solution using the pipeline to avoid data leakage and ensure all steps are consistently applied along with cross-validation. So, I couldn’t think of building my solution without a pipeline.

Are you suggesting that enable_categorical in XGBClassifier cannot work in my situation within a pipeline and that I will need to run xgboost standalone to make it work?

Or are you asking me to find out whether it can work standalone?

Thanks
Narayan

For now please run XGBoost standalone to use the categorical data support. Categorical data support is still labeled as experimental and may not work with sklearn pipelines. Alternatively, you can one-hot encode categorical data.

@hcho3,

Thank you for your clarification. So, as I understand it, XGBClassifier needs to be fed a pandas df with the categorical features marked as dtype 'category' for the enable_categorical feature to take effect.

The trouble is that my model has a number of categorical and other features that require transformation, and the moment you transform anything in sklearn, even without a pipeline, you end up with a numpy array.

So, in summary, unless there is a way to convert the input matrix into a pandas df after transformation and set the relevant columns to dtype ‘category’, one can’t use xgboost’s enable_categorical function even in standalone mode.

I’d appreciate it if you could confirm my understanding of the current limitations as outlined above.

Narayan

> unless there is a way to convert the input matrix into a pandas df after transformation and set the relevant columns to dtype 'category', one can't use xgboost's enable_categorical function even in standalone mode.

That’s right. You need to convert the input matrix to a pandas df.
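One possible sketch of that conversion: wrap the transformed numpy array back into a DataFrame and restore the 'category' dtype before fitting. The array contents, column names, and choice of which column is categorical below are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical: X_trans stands in for the output of an already-fitted
# ColumnTransformer; the column names/order are assumptions.
X_trans = np.array([[0.0, 1.5],
                    [1.0, 2.5],
                    [0.0, 3.5]])

X_df = pd.DataFrame(X_trans, columns=["cat_0", "num_0"])
X_df["cat_0"] = X_df["cat_0"].astype("category")

# X_df could now be passed to an XGBClassifier created with
# enable_categorical=True, outside of any sklearn pipeline.
print(X_df.dtypes)
```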