Generating prediction intervals through sampling


I’m working on a methodology for creating prediction intervals using XGBoost. My idea is to generate the score for each leaf by drawing a random sample from the observations in that leaf, rather than predicting the mean of the leaf. A potentially more robust solution would be to draw multiple observations from each leaf with replacement and take the mean (i.e. bootstrap). Doing this multiple times for a single test observation would yield multiple predictions for that observation, which could be used to compute prediction intervals.
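To make the idea concrete, here is a rough sketch of it on a single decision tree (using scikit-learn rather than an XGBoost ensemble, just to keep the illustration self-contained; all data and names below are made up for the example): group the training targets by leaf, then for a test point repeatedly bootstrap-resample its leaf's observations and take the mean.

```python
# Hypothetical sketch of the leaf-sampling idea on a single decision tree.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)

tree = DecisionTreeRegressor(min_samples_leaf=25, random_state=0).fit(X, y)

# Map each leaf id to the training targets that fall into that leaf.
leaf_targets = {}
for leaf, target in zip(tree.apply(X), y):
    leaf_targets.setdefault(leaf, []).append(target)

def sampled_interval(x, n_draws=1000, alpha=0.1):
    """Bootstrap-mean predictions drawn from the test point's leaf,
    with empirical quantiles of those means as the interval."""
    obs = np.asarray(leaf_targets[tree.apply(x.reshape(1, -1))[0]])
    means = np.array([rng.choice(obs, size=obs.size, replace=True).mean()
                      for _ in range(n_draws)])
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

lo, hi = sampled_interval(np.array([0.5]))
```

Extending this to a boosted ensemble would mean repeating the leaf lookup per tree and summing the sampled contributions, which is where it gets less obvious.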

Has anyone experimented with this before? Does this methodology seem reasonable?

Hello @shkramer, the best way to get prediction intervals in XGBoost currently is to use the quantile regression objective.
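The quantile approach fits separate models for a lower and an upper quantile and uses their predictions as the interval bounds. Recent XGBoost releases expose this as the `reg:quantileerror` objective; the same idea is sketched below with scikit-learn's `GradientBoostingRegressor` (`loss="quantile"`) so the example runs without XGBoost installed. The data here is synthetic, purely for illustration.

```python
# Prediction intervals via quantile regression: fit one model per quantile.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)

# 5th- and 95th-percentile models together give a ~90% prediction interval.
lower = GradientBoostingRegressor(loss="quantile", alpha=0.05,
                                  random_state=0).fit(X, y)
upper = GradientBoostingRegressor(loss="quantile", alpha=0.95,
                                  random_state=0).fit(X, y)

x_test = np.array([[0.5]])
lo, hi = lower.predict(x_test)[0], upper.predict(x_test)[0]
```

In XGBoost itself the equivalent is setting `objective="reg:quantileerror"` with a `quantile_alpha` for each bound (check the docs for your installed version, since this objective is relatively recent).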

In general for tree ensembles and random forests, getting prediction intervals/uncertainty out of decision trees is a well researched subject.

I'm going to copy-paste here the related-work sub-section from our JMLR paper, where we do this for streaming random forests; see the full paper for references:

Decision and regression trees can be modified to output not only a single label, i.e., class or regression value, but also some estimate of the confidence that the model has in its prediction. Many alternative approaches have been proposed in the batch setting: Mentch and Hooker (2016) use formal statistical inference to produce prediction intervals, Wager et al. (2014) make use of the jackknife to estimate standard errors for random forests, and Chipman et al. (2010) propose Bayesian additive regression trees that estimate a distribution over decision trees which can be used to produce prediction intervals. Johansson et al. (2014b) propose a model for conformal prediction that uses a batch-trained decision tree that updates its predictions in an online manner. The structure of the tree, however, remains static, i.e., no online training is performed.

Lakshminarayanan et al. (2016) use Mondrian processes (Roy and Teh, 2009) to create tree-like online prediction models which have the property of being "extensible", i.e., the distribution of trees in the online version is the same as the batch one. This property can be leveraged to create online random forests that can be updated efficiently and produce models comparable to their batch version. Mondrian forests are able to predict the full conditional distribution for the predicted variable, and therefore can produce prediction intervals. However, the prediction model is parametric: it models the variable as a hierarchy of Gaussians, one for each node in the tree. The benefit of having access to the complete predictive posterior comes at a cost as well: the memory and computational cost for each tree grows for each data point processed, making them unsuitable for unbounded streaming data.

I should also mention Quantile Regression Forests, on which we base one of our methods.