Feature importance with correlated features

I am confused about the derivation of importance scores for an xgboost model. My understanding is that xgboost (and, in fact, any gradient boosting model) examines all available features before deciding on an optimal split (I am aware that one can modify this behavior by introducing some randomness to avoid overfitting, e.g. by using the colsample_bytree option, but I'm ignoring this for now).
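
For reference, this is the kind of per-tree feature subsampling I mean; a minimal sketch only (not used in the code below, the 0.8 value is arbitrary, and it assumes a DMatrix like the dtrain_xgb built further down):

params_sub <- list(objective = "binary:logistic",
                   eval_metric = "logloss",
                   colsample_bytree = 0.8)   # each tree sees a random 80% of the features
# xgb_sub <- xgboost(params = params_sub, data = dtrain_xgb, nrounds = 100)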

Thus, for two correlated features where one is more strongly associated with the outcome of interest, my expectation was that the more strongly associated feature would be selected first, and that once it is selected, the other, correlated feature should contribute no additional useful information. This, however, does not always seem to be the case.

To put this concretely, I simulated the data below, where x1 and x2 are correlated (r=0.8) and where the outcome Y depends on x1 (plus a few of the binary covariates) but not on x2. A conventional GLM with all the features included correctly identifies x1 as the driving factor and correctly yields an OR of ~1 for x2. However, the importance scores from a (naively) trained xgboost model on the same data, whether gain or SHAP values, indicate that both x1 and x2 are important. Why is that? Presumably x1 will be chosen for the primary split (i.e. the stump), since it has the strongest association with the outcome, and once that split happens (even if spread over multiple trees due to a low learning rate), x2 should have no additional information to contribute to the classification. What am I getting wrong?

pacman::p_load(dplyr, xgboost, data.table, Matrix, MASS, broom, SHAPforxgboost)

expit <- function(x) {   # inverse logit
  exp(x)/(1 + exp(x))
}

# two correlated continuous features (r = 0.8) plus 10 binary noise covariates
r = 0.8
d = mvrnorm(n = 2000, mu = c(0, 0), Sigma = matrix(c(1, r, r, 1), nrow = 2), empirical = TRUE)
data = data.table(d,
                  replicate(10, rbinom(n = 2000, size = 1, prob = runif(1, min = 0.01, max = 0.6))))

colnames(data)[1:2] <- c("x1", "x2")
cor(data$x1, data$x2)
# the outcome depends on x1 and a few of the binary covariates, but not on x2
data[, Y := rbinom(n = 2000, size = 1, prob = expit(-4 + 2*x1 + V2 + V4 + V6 + V8 + V3))]

model <- glm(Y ~ ., data = data, family = "binomial")
mod <- tidy(model)
mod$or <- round(exp(mod$estimate), 2)   # odds ratios

sparse_matrix<-sparse.model.matrix(Y~.-1,data=data)
dtrain_xgb<-xgb.DMatrix(data=sparse_matrix,label=data$Y)

xgb <- xgboost(tree_method = "hist",
               booster = "gbtree",
               data = dtrain_xgb,
               nrounds = 2000,
               print_every_n = 10,
               objective = "binary:logistic",
               eval_metric = "logloss",
               maximize = FALSE)
shap <- shap.values(xgb, dtrain_xgb)            # SHAP values via SHAPforxgboost
mean_shap <- data.frame(shap$mean_shap_score)   # mean |SHAP| per feature
gain <- xgb.importance(model = xgb)             # gain-based importance

head(mod,14) #regression
head(mean_shap) #shap values
head(gain) #gain
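
To check the "x1 should take the first split" intuition directly, one can also look at which features the fitted trees actually split on; a minimal sketch reusing the xgb model above (xgb.model.dt.tree is part of the xgboost package):

tree_dt <- xgb.model.dt.tree(model = xgb)                 # one row per node across all trees
tree_dt[Feature != "Leaf", .N, by = Feature][order(-N)]   # split counts per feature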

(also asked on stats.SE: https://stats.stackexchange.com/q/612542/232706)

Surfacing this again. bmreiniger suggested this might be due to overfitting. If that were the cause, though, I would expect more random behavior, whereas the observation is very consistent across different seeds.
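
One way to probe the overfitting hypothesis would be to pick nrounds by cross-validation and recompute the importances from the smaller model; a sketch, assuming the same dtrain_xgb as above (the 50-round early-stopping window is arbitrary):

params <- list(objective = "binary:logistic",
               eval_metric = "logloss",
               tree_method = "hist")
cv <- xgb.cv(params = params, data = dtrain_xgb, nrounds = 2000,
             nfold = 5, early_stopping_rounds = 50, verbose = 0)
xgb_cv <- xgboost(params = params, data = dtrain_xgb,
                  nrounds = cv$best_iteration, verbose = 0)
xgb.importance(model = xgb_cv)
shap.values(xgb_cv, dtrain_xgb)$mean_shap_score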

Trying to resurface this.