How do we encode missing categorical values?

Hi,
I’m a beginner so please bear with me for this question.

I have been trying to use XGBoost to handle the missing values in my data. I have a categorical column that has quite a few NaN values. From what I read online, we don’t have to handle missing values ourselves because XGBoost handles them.

from xgboost import XGBClassifier

# fit model on training data
model = XGBClassifier(enable_categorical=True)
model.fit(X_train, y_train)

But when I run it with my dataset, I get this error.

ValueError: DataFrame.dtypes for data must be int, float, bool or categorical.  When
                categorical type is supplied, DMatrix parameter
                `enable_categorical` must be set to `True`.STATE, OCCUPATION, INCOME_GROUP

STATE, OCCUPATION, and INCOME_GROUP are my categorical variables. They are passed in as-is, without encoding, and they contain missing values.

Do I have to encode my categorical data? Or is there some other way so that I can pass my data to this classifier?

I’m not sure how to encode my data because it has NaNs in it. Can someone please help me with this?

XGBClassifier(enable_categorical=True)

This parameter was added only a few days ago and hasn’t been released yet, so don’t use it. :wink:

Do I have to encode my categorical data?

For now, yes. There are plenty of encoding methods you can try with sklearn first, from basic one-hot encoding to target encoding.
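As a rough illustration of the basic one-hot route, here is a minimal sketch using sklearn's OneHotEncoder. The column names come from the question, but the values are made up, and this toy frame has no missing values (those are a separate step).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: only the column names come from the question.
X = pd.DataFrame({
    "STATE": ["CA", "NY", "CA", "TX"],
    "OCCUPATION": ["engineer", "teacher", "teacher", "engineer"],
})

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
# instead of raising an error at prediction time.
encoder = OneHotEncoder(handle_unknown="ignore")

# fit_transform returns a sparse matrix by default; convert it for inspection.
X_encoded = encoder.fit_transform(X).toarray()

# One output column per (input column, category) pair.
n_categories = sum(len(cats) for cats in encoder.categories_)
print(X_encoded.shape, n_categories)
```

The resulting numeric array is the kind of input the classifier accepts. Target encoding follows the same fit/transform pattern but needs the labels as well.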

I’m not sure how to encode my data because it has NaNs in it.

  • You can impute the data before encoding.
  • Or you can remove those rows before running the encoding, then put them back afterward with NaN in the corresponding columns and let XGBoost handle them.
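The two options above could be sketched roughly as follows, using pandas only. The data values, the "missing" placeholder, and the helper name encode_keep_nan are all assumptions for illustration. The second option uses simple integer codes, which is just one possible encoding.

```python
import numpy as np
import pandas as pd

# Hypothetical data: only the column names come from the question.
df = pd.DataFrame({
    "STATE": ["CA", np.nan, "NY", "CA"],
    "OCCUPATION": ["engineer", "teacher", np.nan, "teacher"],
})

# Option 1: impute first (here with a placeholder category), then one-hot encode.
imputed = df.fillna("missing")
one_hot = pd.get_dummies(imputed)

# Option 2: encode only the non-missing rows and leave NaN where values were missing.
def encode_keep_nan(series):
    """Map each non-missing category to an integer code; keep NaN as NaN."""
    mask = series.notna()
    categories = sorted(series[mask].unique())
    mapping = {cat: code for code, cat in enumerate(categories)}
    encoded = pd.Series(np.nan, index=series.index, dtype="float64")
    encoded.loc[mask] = series[mask].map(mapping).astype("float64")
    return encoded

encoded = df.apply(encode_keep_nan)
print(one_hot.columns.tolist())
print(encoded)
```

With the second option the frame is fully numeric but still contains NaN, so it can be passed to the classifier and the missing entries are left for XGBoost to deal with. Note that integer codes impose an arbitrary order on the categories, so one-hot encoding the non-missing rows may be a better fit for some data.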

Thank you very much for this response. I’ll work on it now.