Scala spark xgboost v0.81 SHAP problem

yaozhang2016 · April 24, 2019, 7:27pm

To Community,
We are running into a weird issue when analyzing the SHAP values (produced via .setContribPredictionCol) from Scala Spark XGBoost v0.81 on CDH. The issue is:

For classification, our model has 814 features, but the output SHAP field has only 813 values. We expected 815 values (one per feature, plus the bias term as the last one).
For regression, we don’t have this issue.
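To make the expectation concrete, here is a minimal sketch of the length check we would expect every row to pass. It assumes a binary classifier or regressor, where each row gets one contribution per feature plus a trailing bias term. The helper names are hypothetical and are not part of XGBoost.

```python
def expected_contrib_length(n_features):
    """Length of one SHAP contribution vector: one value per feature plus a trailing bias term."""
    return n_features + 1


def rows_with_wrong_length(contrib_rows, n_features):
    """Return (row_index, actual_length) for every vector whose length is not n_features + 1."""
    expected = expected_contrib_length(n_features)
    return [(i, len(row)) for i, row in enumerate(contrib_rows) if len(row) != expected]
```

Here `contrib_rows` would be the vectors collected from the contribution prediction column, converted to plain lists. An empty result means every row has the expected length.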

Did we miss something for SHAP in classification, or is it a bug?

Thanks
Yao

hcho3 · April 25, 2019, 9:13pm

Can you take a sample of your data and run it on a laptop using the Python package? See if you have the same issue. Also, try using the latest XGBoost (0.82).
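A minimal sketch of such a local check, assuming the sample has been loaded into a 2-D NumPy array `X` with 0/1 labels `y`. The function name `contrib_shape_report` and the training parameters are illustrative only. It trains a small binary classifier, requests SHAP contributions with `pred_contribs=True`, and reports the shape it got next to the shape one would expect.

```python
def contrib_shape_report(X, y, num_boost_round=10):
    """Train a small binary classifier and compare the SHAP output shape with the expected one.

    X: 2-D numpy array of features; y: 1-D array of 0/1 labels.
    """
    import xgboost as xgb  # imported lazily so the helper can be defined without xgboost

    dtrain = xgb.DMatrix(X, label=y)
    booster = xgb.train({"objective": "binary:logistic"}, dtrain,
                        num_boost_round=num_boost_round)

    # pred_contribs=True returns per-feature contributions with the bias term last
    contribs = booster.predict(xgb.DMatrix(X), pred_contribs=True)

    n_rows, n_features = X.shape
    return {
        "n_features": n_features,
        "contrib_shape": tuple(contribs.shape),
        "expected_shape": (n_rows, n_features + 1),
    }
```

If `contrib_shape` and `expected_shape` differ on the sample, that narrows the problem to the core library rather than the Spark wrapper.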

yaozhang2016 · May 1, 2019, 6:17pm

Here are our test results:

Using Scala Spark XGBoost 0.82, we still have this issue.
Using Python XGBoost 0.82, we have the same issue plus an extra one: it tries to reshape all SHAP scores into a 3-D array whose middle dimension has length 0.

We suspect this is a bug specific to XGBoost classification (tree learner). We are testing new data with a new XGBoost classification model, and will also try the independent Python SHAP package.

Thanks
Yao

hcho3 · May 1, 2019, 6:50pm

I think https://github.com/dmlc/xgboost/issues/4276 is related.

yaozhang2016 · May 1, 2019, 7:39pm

@hcho3 Here is our latest update:

Yes, the extra issue is similar to the one in your link.
One more update on this thread: we used Scala Spark XGBoost v0.82 to compute SHAP on a superset of the data with a larger model that has 4107 features (including the 814 reported originally), and it correctly gave back 4108 SHAP component scores.

This suggests that the SHAP implementation in XGBoost (both the Scala and Python versions) has a bug that our particular data and model happened to expose.

Thanks
Yao