[Categorical Features] Potential issue on how XGBoost internally stores categorical features

Hello XGBoost team!

I recently encountered a potential issue related to how xgboost handles categorical features when making predictions on new observations.

To illustrate this behavior, let's consider a dummy dataset with a single categorical feature named cat_feat, whose unique values are ["A", "B", "C", "D"] in both the training and test sets. To use this feature with XGBClassifier, it must be cast to the pandas category dtype. This yields a pandas Series whose unique categories are ["A", "B", "C", "D"] and whose numerical codes are the index positions of the values in that categories array (accessible through df['cat_feat'].cat.categories and df['cat_feat'].cat.codes, respectively).
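For concreteness, here is a minimal sketch of that setup (the frame name df_train and the sample values are purely illustrative):

```python
import pandas as pd

# Hypothetical training frame; names and values are illustrative only
df_train = pd.DataFrame({"cat_feat": ["A", "B", "C", "D"]})
df_train["cat_feat"] = df_train["cat_feat"].astype("category")

print(df_train["cat_feat"].cat.categories)       # Index(['A', 'B', 'C', 'D'], dtype='object')
print(df_train["cat_feat"].cat.codes.tolist())   # [0, 1, 2, 3]
```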

The problem arises when a new single observation has cat_feat set to a value such as "C", which corresponds to the numerical code 2 in the training dataset. Because of the way xgboost currently handles categorical features, this new observation is internally assigned the code 0 ("C" being the only unique value available). Consequently, during prediction, xgboost interprets the value as the first element of the training-time categories array, which in this case is "A".
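A minimal reproduction of the mismatch on a single new observation (again using hypothetical names):

```python
# A single new observation containing only "C"
df_new = pd.DataFrame({"cat_feat": ["C"]})
df_new["cat_feat"] = df_new["cat_feat"].astype("category")

print(df_new["cat_feat"].cat.categories)       # Index(['C'], dtype='object')
print(df_new["cat_feat"].cat.codes.tolist())   # [0] -- the code "A" had at training time
```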

To address this issue, it is crucial to set the possible values of the cat_feat column in the new observation to match those seen in the training dataset, for instance with the following code before making predictions:

```python
df_new["cat_feat"] = df_new["cat_feat"].cat.set_categories(["A", "B", "C", "D"])
```

By explicitly setting the categories for the new observation, you ensure that the correct numerical code (2 for "C" in this example) is used during prediction, avoiding inaccurate results in real-world scenarios. Unlike LightGBM, which builds its own internal map between unique feature values and integer codes, xgboost requires users to carry this map over from the training dataset and apply it to new data before predicting.
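More generally, one way to carry the mapping over is to rebuild the dtype from the training column rather than hard-coding the category list; a sketch using the hypothetical frames from above:

```python
# Build the dtype from the training column so the category list
# is not hard-coded (df_train / df_new are the hypothetical frames above)
train_dtype = pd.CategoricalDtype(categories=df_train["cat_feat"].cat.categories)
df_new["cat_feat"] = df_new["cat_feat"].astype(train_dtype)

print(df_new["cat_feat"].cat.codes.tolist())   # [2] -- "C" now gets its training-time code
```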

Should xgboost adopt an approach similar to LightGBM's for handling categorical features, i.e., internally maintain a map between the unique values of categorical features and their numerical codes? If not, it might be beneficial to highlight this behavior in the documentation to prevent users from obtaining incorrect predictions in production environments.

XGBoost doc now has a section for handling categorical data for prediction: https://xgboost.readthedocs.io/en/latest/tutorials/categorical.html#data-consistency


Thanks a lot, @hcho3!