How does XGBoost know the SHAP base values for each tree?

jackbennett · April 5, 2021, 7:54pm

How does XGBoost know the base value for each tree when computing shap values?

shap_values = bst.predict(x_i, pred_contribs=True)

There is a really nice explanation here that covers what SHAP values are, why they are useful, and how they are calculated for a given prediction. It’s a nice read.

What isn’t clear to me, though, is how a pre-trained XGBoost model can know the base value when computing SHAP values for a new, individual case. The article states that the base value should equal the “mean prediction for the training set”, although I have since learnt that in the XGBoost case it is actually based on the sums of Hessians for a given tree.

But still, where is this value stored? Can anybody clarify exactly how the base value is computed/where it comes from? Many thanks

hcho3 · April 13, 2021, 8:42pm

The sum of Hessians is used as a proxy for the number of data points that flow through each particular tree node. The value comes from gradient boosting. See https://dl.acm.org/doi/10.1145/2939672.2939785 for more details.
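
To illustrate why the sum of Hessians can stand in for a row count, here is a minimal sketch in plain Python. It assumes the usual second derivatives: 1 per row for squared error, and p * (1 - p) for logistic loss taken with respect to the raw margin. The function names and the sample margins are made up for illustration and are not XGBoost API.

```python
import math


def squared_error_hessian() -> float:
    # Second derivative of 0.5 * (pred - label)^2 with respect to pred.
    # It is constant, so every row contributes exactly 1.
    return 1.0


def logistic_hessian(margin: float) -> float:
    # Second derivative of the log loss with respect to the raw margin.
    p = 1.0 / (1.0 + math.exp(-margin))
    return p * (1.0 - p)


# Hypothetical raw margins of four rows that reach the same tree node.
margins = [0.0, 0.0, 2.0, -2.0]

# Squared error: the sum of Hessians equals the number of rows in the node.
cover_squared_error = sum(squared_error_hessian() for _ in margins)

# Logistic loss: confident rows (large |margin|) contribute less, so the
# sum is a weighted count rather than an exact one.
cover_logistic = sum(logistic_hessian(m) for m in margins)

print(cover_squared_error, cover_logistic)
```

So for squared error the "cover" of a node is literally the number of training rows in it, while for other objectives it is only a proxy.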

jackbennett · April 13, 2021, 9:15pm

Thanks! My question though is more on the implementation side; where is the sum of Hessians actually stored in a given XGBoost model object? This is what I do not understand, but the value must come from somewhere during computation. Thanks

hcho3 · April 13, 2021, 9:27pm

See the following code snippet. The sum_hess field contains the sum of Hessians.

https://github.com/dmlc/xgboost/blob/a4ce0eae43f7e0e2f91566ef2360830b86b9fdcf/include/xgboost/tree_model.h#L98-L106

struct RTreeNodeStat {
  /*! \brief loss change caused by current split */
  bst_float loss_chg;
  /*! \brief sum of hessian values, used to measure coverage of data */
  bst_float sum_hess;
  /*! \brief weight of current node */
  bst_float base_weight;
  /*! \brief number of child that is leaf node known up to now */
  int leaf_child_cnt {0};
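
To tie this back to the base value: as far as I understand it, a tree's contribution to the bias term is the cover-weighted average of its leaf values, with sum_hess as the weight, and it can be recomputed at prediction time from the stored node statistics. The sketch below shows that idea in plain Python. The `Node` class, `mean_value` function and the numbers are hypothetical and only mirror the `sum_hess` field; this is not the actual XGBoost implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Hypothetical stand-in for a tree node; sum_hess mirrors RTreeNodeStat.
    sum_hess: float
    leaf_value: Optional[float] = None  # set only on leaves
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def mean_value(node: Node) -> float:
    """Cover-weighted expected output of the subtree rooted at `node`."""
    if node.leaf_value is not None:
        return node.leaf_value
    left, right = node.left, node.right
    weighted = (mean_value(left) * left.sum_hess
                + mean_value(right) * right.sum_hess)
    return weighted / node.sum_hess


# A made-up stump: 6 units of cover go left, 4 go right.
tree = Node(
    sum_hess=10.0,
    left=Node(sum_hess=6.0, leaf_value=0.5),
    right=Node(sum_hess=4.0, leaf_value=-1.0),
)

# Expected value of this tree, i.e. its share of the base value.
tree_base_value = mean_value(tree)
print(tree_base_value)
```

Under this assumption, summing the per-tree values (plus the model's global base score) would give the last column returned by `pred_contribs=True`, which is why no training data is needed when explaining a new row.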