[XGBoost4J-Spark] Category-wise Feature Importance for One-Hot Encoded Features

kusumakarb · September 4, 2020, 5:11am

We extract feature importances from an ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel by calling model.nativeBooster.getScore(featureNames, "gain"), where featureNames lists the names of all numerical features and one-hot encoded vectors (created with org.apache.spark.ml.feature.OneHotEncoderEstimator) used in model training.

With this approach, we get a single feature importance value for each one-hot encoded vector. Is there a way to obtain a separate feature importance value for each category in a one-hot encoded variable? For example, if we have a variable called Co_Applicant with 3 categories (No, Yes-Different Address, Yes-Same Address), we currently get only one feature importance value for the whole variable. Is there a way to get a separate importance for each of the 3 categories? In the Python API, per-category importances are the default behaviour when we call model.feature_importances_. How can we achieve the same in XGBoost4J-Spark?
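To make the question concrete, one possible direction is to pass getScore one name per slot of the assembled feature vector instead of one name per input column. The sketch below only builds such a name list. SlotNames, Column and expand are hypothetical helpers, not part of XGBoost4J-Spark or Spark ML, and the Age column is made up for illustration. It assumes the columns are listed in the same order as they were assembled into the feature vector, that the category labels are given in the encoder's index order, and that the encoder ran with its default dropLast = true, so the last category has no slot of its own.

```scala
// Hypothetical helper, not part of XGBoost4J-Spark or Spark ML.
object SlotNames {

  /** One input column of the assembled feature vector.
    * `categories` is empty for a numerical column. For a one-hot encoded
    * column it holds the category labels in encoder index order (assumed
    * to be known, for example from the indexing step that preceded the
    * encoder). */
  final case class Column(name: String, categories: Seq[String] = Seq.empty)

  /** Expands columns into one name per vector slot.
    * `dropLast` mirrors the encoder parameter of the same name: when true,
    * the last category gets no slot and therefore no name. */
  def expand(columns: Seq[Column], dropLast: Boolean = true): Array[String] =
    columns.flatMap { c =>
      if (c.categories.isEmpty) Seq(c.name)
      else {
        val kept = if (dropLast) c.categories.dropRight(1) else c.categories
        // Replace anything non-alphanumeric with "_" as a precaution,
        // so the names contain no spaces or punctuation.
        kept.map(cat => s"${c.name}_${cat.replaceAll("[^A-Za-z0-9]+", "_")}")
      }
    }.toArray
}
```

For the Co_Applicant example, SlotNames.expand(Seq(SlotNames.Column("Age"), SlotNames.Column("Co_Applicant", Seq("No", "Yes-Different Address", "Yes-Same Address")))) returns Array("Age", "Co_Applicant_No", "Co_Applicant_Yes_Different_Address"). The resulting array could then be tried as the featureNames argument of model.nativeBooster.getScore(featureNames, "gain"). Whether that produces the desired per-category scores has not been verified here.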