Feature encoding

I have a dataset with ~4M rows and 1,250 features. I want to make sure that all of my data is encoded correctly before running XGBoost, because I’m not sure whether XGBoost is handling my missing values correctly.

  1. The vast majority of the features in my dataset are binary (0/1) categorical features with no missing values. These features are currently each in their own column, and each row is assigned a 0 or 1 for each feature.
  2. A couple of features are non-binary categorical features that take multiple values (e.g., race: White, Black, Asian, Hispanic, Unknown, etc.), so they are one-hot encoded.
  3. A couple of features are numeric features with some missing values, where the missing values are set to NaN.

Due to memory constraints, I have to convert my data to a SciPy sparse matrix before fitting with XGBoost. In that matrix, each one-hot encoded categorical feature (i.e., #2 above) with n levels occupies n-1 columns.
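
To make the setup concrete, here is a minimal sketch of how I build the matrix. It assumes a toy 3x4 array in place of my real data; the column layout and the variable names (`dense`, `X`, `n_stored_nan`, `n_implicit_zero`) are made up for illustration. It shows the distinction I am worried about: after conversion, SciPy does not store the zeros at all, while the NaNs remain as explicitly stored entries.

```python
import numpy as np
from scipy import sparse

# Toy stand-in for the real data (layout is illustrative only):
#   col 0    : binary (0/1) feature
#   cols 1-2 : one categorical feature with 3 levels, one-hot encoded as n-1 = 2 columns
#   col 3    : numeric feature whose missing value is set to NaN
dense = np.array([
    [1.0, 0.0, 1.0, 2.5],
    [0.0, 1.0, 0.0, np.nan],
    [0.0, 0.0, 0.0, 0.0],
])

# Converting to CSR drops the zeros; only non-zero entries (including NaN) are stored.
X = sparse.csr_matrix(dense)

n_stored = X.nnz                                    # explicitly stored entries
n_stored_nan = int(np.isnan(X.data).sum())          # NaNs survive as stored entries
n_implicit_zero = X.shape[0] * X.shape[1] - X.nnz   # zeros that are no longer stored

print(n_stored, n_stored_nan, n_implicit_zero)

# X is what I then hand to XGBoost, e.g. xgboost.DMatrix(X, label=y),
# where y would be my (hypothetical) label vector.
```

So in the sparse matrix, a 0 in a binary column and a 0.0 in the numeric column are both simply absent, and only the NaN is physically present as a value.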

My questions are as follows:

  • Do I need to one-hot encode the binary features?
  • Does XGBoost treat the zeros in the binary feature columns as missing values? Or does it treat only the NaNs in the numeric feature columns as missing values?
  • I don’t want XGBoost to treat zeros as missing data points. My understanding (per the XGBoost online documentation) is that XGBoost treats only NaNs as missing values – is that correct?