I have a dataset with ~4M rows and 1,250 features. I want to make sure that I have encoded all of my data correctly before running XGBoost, because I’m not sure whether XGBoost is handling my missing values correctly.
- The vast majority of the features in my dataset are binary (0/1) categorical features with no missing values. These features are currently each in their own column, and each row is assigned a 0 or 1 for each feature.
- A couple of features are non-binary categorical features that take multiple values (e.g., race: White, Black, Asian, Hispanic, Unknown, etc.), so I have one-hot encoded them.
- A couple of features are numeric features with some missing values, where the missing values are set to NaN.
Due to memory constraints, I have to put my data into a SciPy sparse matrix format prior to fitting with XGBoost. As such, each of the one-hot encoded categorical features (i.e., the second bullet above) occupies n-1 columns in the sparse matrix, where n is the number of values that feature can take.
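To make the setup concrete, here is a toy sketch of the layout described above. The values and the column layout are made up for illustration and my real data is much larger; I am also assuming CSR as the sparse format. It shows what worries me: in the SciPy matrix the zeros are not stored at all, while the NaN is stored explicitly.

```python
import numpy as np
from scipy import sparse

# Toy data, 4 rows. Hypothetical column layout:
#   col 0    : binary flag (0/1, no missing values)
#   cols 1-2 : one-hot columns for a 3-level categorical, first level dropped (n-1 columns)
#   col 3    : numeric feature, with a missing value stored as NaN
dense = np.array([
    [1.0, 0.0, 0.0, 2.5],
    [0.0, 1.0, 0.0, np.nan],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 7.1],
])

# Converting to CSR keeps only entries that are not equal to zero.
# NaN is not equal to zero, so it is kept as an explicitly stored entry.
X = sparse.csr_matrix(dense)

n_stored = X.nnz                             # explicitly stored entries
n_nan_stored = int(np.isnan(X.data).sum())   # stored entries that are NaN
n_implicit_zeros = dense.size - n_stored     # zeros that are simply absent

print(n_stored, n_nan_stored, n_implicit_zeros)
```

So from the matrix alone, a zero in a binary column and an "absent" entry look the same, which is the source of my questions below.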
My questions are as follows:
- Do I need to one-hot encode the binary features?
- Does XGBoost treat the zeros in the binary feature columns as missing values? Or does it treat only the NaNs in the numerical feature columns as missing values?
- I don’t want XGBoost to treat zeros as missing data points. My understanding (per the XGBoost online documentation) is that XGBoost treats only NaNs as missing values – is that correct?