Get the training data points which lie in each leaf of a base tree?


#1

Hi, I am trying to estimate the uncertainty associated with a trained regression BST model. I would like to know the “spread” of the training label values at each leaf of each base learner, not just the mean/median label at each leaf.

Is there a direct way to do that? Or do I have to traverse the generated base trees after training, group the data points by the leaf they end up in, and derive the spread manually with a script?

I am aware of the quantile loss, which some people have used to produce a kind of predictive interval in boosting. However, I would like to look at the uncertainty centred around the leaves of each tree produced.


#2

Hi. I am interested in this as well, for computing leave-one-out forecasts as an (almost) cross-validation. I say almost because this does not remove the effect each point had on the tree structure and the location of the splits, but conditional on the architecture it should give a good LOOCV approximation. Or perhaps this already exists?

In fact, in the classification case, one could get a pretty good ‘worst-case’ estimate of the above without the data, but one would need at least the sample size used to build the model, and probably the node counts or proportions.


#3

The per-example label information is not saved with the tree.
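Since the labels are not stored, the spread has to be rebuilt after training by mapping each training row to its leaf in every tree and grouping the labels. Here is a minimal sketch of that grouping step in plain Python. The function name `leaf_label_spread` and the toy data are made up for illustration. In XGBoost the leaf-index matrix can be obtained with `Booster.predict(dtrain, pred_leaf=True)`, which returns one leaf index per row and tree. Keep in mind that trees after the first are fit to gradients of the loss rather than to the raw labels, so the raw-label spread in a later tree's leaf is not what that leaf's value summarises.

```python
from collections import defaultdict
from statistics import mean, pstdev


def leaf_label_spread(leaf_ids, labels):
    """Group training labels by (tree, leaf) and summarise their spread.

    leaf_ids: one row per training example, one column per tree;
              leaf_ids[i][t] is the leaf that example i lands in for tree t.
    labels:   one training label per example.
    """
    groups = defaultdict(list)
    for row, y in zip(leaf_ids, labels):
        for tree, leaf in enumerate(row):
            groups[(tree, int(leaf))].append(float(y))
    return {
        key: {
            "n": len(ys),
            "mean": mean(ys),
            "std": pstdev(ys),  # population standard deviation within the leaf
            "min": min(ys),
            "max": max(ys),
        }
        for key, ys in groups.items()
    }


# Toy stand-in for the leaf-index matrix (4 examples, 2 trees).
# With XGBoost this would come from something like:
#   leaf_ids = booster.predict(dtrain, pred_leaf=True)
leaf_ids = [[1, 3], [1, 4], [2, 3], [2, 4]]
labels = [1.0, 3.0, 10.0, 14.0]

spread = leaf_label_spread(leaf_ids, labels)
for (tree, leaf), stats in sorted(spread.items()):
    print(tree, leaf, stats)
```

The result is keyed by `(tree, leaf)`, so the same dictionary can be looked up at prediction time with the leaf indices of a new row.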

For tree-based uncertainty estimation you might want to look into Quantile Regression Forests and Mondrian Forests, both part of the scikit-garden package. Both of these maintain every label at the leaves (which limits their scalability compared to XGBoost). Conformal prediction is another option.
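To give an idea of the conformal route, here is a rough sketch of split conformal prediction using absolute residuals as the score. It assumes a calibration set that was held out from training and is exchangeable with future data. The function names and the numbers are invented for illustration and are not part of XGBoost or scikit-garden; the predictions could come from any fitted model.

```python
import math


def split_conformal_halfwidth(calib_y, calib_pred, alpha=0.1):
    """Half-width of a (1 - alpha) interval from held-out absolute residuals."""
    scores = sorted(abs(y - p) for y, p in zip(calib_y, calib_pred))
    n = len(scores)
    # Finite-sample rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: interval is unbounded.
        return math.inf
    return scores[k - 1]


def predict_interval(pred, halfwidth):
    """Symmetric interval around a point prediction."""
    return (pred - halfwidth, pred + halfwidth)


# Toy calibration data: true labels and the model's predictions for them.
calib_y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
calib_pred = [1.5, 2.0, 2.0, 4.5, 5.0, 8.0, 7.0, 7.5, 9.0]

halfwidth = split_conformal_halfwidth(calib_y, calib_pred, alpha=0.25)
interval = predict_interval(5.0, halfwidth)
print(halfwidth, interval)
```

This gives one interval width for all inputs. It does not use the leaf structure at all, so it answers a different question from the per-leaf spread asked about above.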