How do we encode missing categorical values?

vishnureddys · June 5, 2021, 5:06pm

Hi,
I’m a beginner so please bear with me for this question.

I have been trying to use XGBoost to handle the missing values in my data. I have a categorical column that has quite a few NaN values. From what I have read online, we don’t have to handle missing values ourselves because XGBoost handles them.

from xgboost import XGBClassifier

# fit model on training data
model = XGBClassifier(enable_categorical=True)
model.fit(X_train, y_train)

But when I run it with my dataset, I get this error.

ValueError: DataFrame.dtypes for data must be int, float, bool or categorical.  When
                categorical type is supplied, DMatrix parameter
                `enable_categorical` must be set to `True`.STATE, OCCUPATION, INCOME_GROUP

STATE, OCCUPATION, and INCOME_GROUP are my categorical variables. They are passed as-is, without any encoding, and they contain missing values.

Do I have to encode my categorical data? Or is there some other way so that I can pass my data to this classifier?

I’m not sure how to encode my data because it has NaNs in it. Can someone please help me with this?

jiamingy · June 5, 2021, 6:27pm

XGBClassifier(enable_categorical=True)

This parameter was only added a few days ago and hasn’t been released yet, so don’t use it.

Do I have to encode my categorical data?

For now, yes. There are plenty of encoding methods you can try with sklearn first, from basic one-hot encoding to target encoding.
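As a rough illustration of the basic one-hot case, here is a minimal sketch. It uses pandas' `get_dummies` rather than sklearn, purely to keep the example short. The column names come from the question, but the values are made-up toy data. With `dummy_na=True`, each categorical column also gets an indicator column for missing entries, so no NaN is left in the encoded frame.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values; only the column names come from the question).
df = pd.DataFrame({
    "STATE": ["CA", np.nan, "NY", "CA"],
    "INCOME_GROUP": ["low", "high", np.nan, "low"],
    "AGE": [34, 51, 29, 42],
})

# One-hot encode the categorical columns.
# dummy_na=True adds one extra indicator column per feature that flags missing values.
encoded = pd.get_dummies(
    df,
    columns=["STATE", "INCOME_GROUP"],
    dummy_na=True,
    dtype=float,
)

print(encoded)
```

The resulting frame is all numeric, so it can be passed to the classifier without `enable_categorical`.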

I’m not sure how to encode my data because it has NaNs in it.

You can impute the data before encoding.
Alternatively, you can set those rows aside before encoding, then put them back afterward with NaN in the corresponding columns and let XGBoost handle them as missing values.
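The two options above could look roughly like the sketch below. It uses integer (ordinal) codes from pandas as a stand-in for whichever encoder you pick, and the toy values, the `"MISSING"` placeholder, and the helper name `encode_keep_nan` are all assumptions made for illustration. Note that ordinal codes impose an arbitrary order on the categories, so treat this as a starting point rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values; only the column names come from the question).
df = pd.DataFrame({
    "STATE": ["CA", np.nan, "NY", "CA"],
    "OCCUPATION": ["nurse", "clerk", np.nan, "nurse"],
})
cat_cols = ["STATE", "OCCUPATION"]

# Option 1: impute first, then encode.
# Missing values become their own category ("MISSING" is an arbitrary placeholder).
imputed = df[cat_cols].fillna("MISSING")
imputed_codes = imputed.apply(lambda s: s.astype("category").cat.codes)

# Option 2: encode only the known values and keep NaN where data was missing.
def encode_keep_nan(s):
    # cat.codes marks missing entries as -1; turn those back into NaN.
    codes = s.astype("category").cat.codes.astype("float64")
    return codes.where(codes >= 0, np.nan)

nan_codes = df[cat_cols].apply(encode_keep_nan)

print(imputed_codes)
print(nan_codes)
```

With the second option, the NaN entries survive encoding. XGBClassifier treats NaN as missing by default, so the encoded frame can be passed to `fit` directly.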

vishnureddys · June 5, 2021, 7:38pm

Thank you very much for this response. I’ll work on it now.