Histogram Computation of the Bins

Hello!

I am currently working my way through the XGBoost algorithm and I have a question about how the histogram is computed.

Each time we arrive at a new node we need to re-compute the histogram. However, I am not sure about one thing: are the bins of the histogram computed only once at the start of the algorithm and then reused every time we create a histogram? Or do we re-compute the bins for each new histogram we create? If the latter, I don't think I fully understand their advantage (unless we use some approximate method to compute the bins rather than sorting the data; is this what's happening?).

Thank you!


From my understanding, if you are using the hist tree method, it will need to compute the histogram at every node; however, it won't always have to compute it from scratch.
For example, consider a parent node whose histogram has already been computed, and two child nodes for which two new histograms need to be calculated. For the child node with fewer records, the histogram is computed from scratch, which is an O(n) operation, where n is the number of records in that node. For the second child node, the histogram does not need to be computed from scratch; instead, the histogram of the first child can be subtracted from the histogram of the parent node. This means only one of the child histograms needs to be computed from the raw records, leading to large speed improvements. Additionally, another benefit of using histograms is that when deciding on the optimal split for a node on a given feature, only the histogram bin boundaries are used as candidate splits, rather than the actual raw values (as in the exact method), which can have much higher cardinality.

Also, for the histogram example, this applies separately to the histogram of each feature; a sketch of the subtraction trick follows below.
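To make the subtraction trick concrete, here is a minimal NumPy sketch for a single feature. The names (build_histogram, bin_ids, etc.) are my own, not XGBoost's internals: it builds the gradient/hessian histogram for the parent and the smaller child from the raw rows, then derives the sibling's histogram by subtraction.

```python
import numpy as np

def build_histogram(bin_ids, grad, hess, n_bins):
    """Accumulate per-bin gradient/hessian sums for one feature.

    bin_ids: pre-computed bin index of each row for this feature.
    """
    hist_g = np.zeros(n_bins)
    hist_h = np.zeros(n_bins)
    np.add.at(hist_g, bin_ids, grad)  # sum gradients into their bins
    np.add.at(hist_h, bin_ids, hess)  # sum hessians into their bins
    return hist_g, hist_h

rng = np.random.default_rng(0)
n, n_bins = 1000, 16
bin_ids = rng.integers(0, n_bins, size=n)  # cached bin IDs for one feature
grad = rng.normal(size=n)
hess = np.ones(n)

# Rows routed to the left/right child by some split; left is the smaller one.
goes_left = rng.random(n) < 0.3

parent_g, parent_h = build_histogram(bin_ids, grad, hess, n_bins)
left_g, left_h = build_histogram(bin_ids[goes_left], grad[goes_left],
                                 hess[goes_left], n_bins)

# Subtraction trick: the sibling's histogram comes for free.
right_g = parent_g - left_g
right_h = parent_h - left_h

# Sanity check against a from-scratch build of the right child.
check_g, check_h = build_histogram(bin_ids[~goes_left], grad[~goes_left],
                                   hess[~goes_left], n_bins)
assert np.allclose(right_g, check_g) and np.allclose(right_h, check_h)
```

In XGBoost this is done once per feature per node, so only the smaller child ever touches the raw rows.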

This does not address the question, which was about recomputing bin boundaries, not the counts.

Although I myself don’t know the ins and outs of XGBoost, I believe the answer is given in [New Feature] Fast Histogram Optimized Grower, 8x to 10x Speedup #1950:

How does the new method differ from tree_method=approx? The existing approximate splitting method in xgboost also buckets continuous features into discrete bins to speed up training. The approx method generates a new set of bins for each iteration, whereas the hist method re-uses the bins over multiple iterations.

The latter approach enables additional optimizations that are not possible for the former, as follows:

  • Cache of the bins: replace continuous feature values in the training data with bin IDs and cache the data structure.
  • Histogram subtraction trick: to compute the histogram for a node, we simply take the difference between the histograms for its parent and sibling.
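To connect this back to the original question: under tree_method=hist, the bin boundaries are computed once per feature before boosting starts, and every row's feature value is replaced by its cached bin ID. Below is a rough sketch of that one-time binning step, assuming plain quantile cut points as a stand-in for XGBoost's actual weighted quantile sketch (make_bin_boundaries is a hypothetical helper, not an XGBoost API).

```python
import numpy as np

def make_bin_boundaries(x, max_bins=256):
    """One-time step: quantile cut points for one feature.

    XGBoost uses a weighted quantile sketch; plain quantiles stand in here.
    """
    quantiles = np.linspace(0, 1, max_bins + 1)[1:-1]
    return np.unique(np.quantile(x, quantiles))

rng = np.random.default_rng(0)
x = rng.lognormal(size=10_000)              # one continuous feature

cuts = make_bin_boundaries(x, max_bins=16)  # computed once, before boosting
bin_ids = np.digitize(x, cuts)              # cached; reused at every node,
                                            # in every tree, under hist

# At any node, only len(cuts) candidate splits are evaluated for this
# feature, instead of up to len(np.unique(x)) under the exact method.
print(len(cuts), "candidate splits vs", len(np.unique(x)), "raw values")
```

Reusing the same bin IDs across iterations is what makes both the cached data structure and the subtraction trick possible; the approx method, which re-bins every iteration, gets neither benefit.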