Histogram Computation of the bins

Hello!

I am currently working my way through the XGBoost algorithm and I had a question about the way in which the histogram is computed.

Each time we arrive at a new node we need to re-compute the histogram. However, I am not sure about one thing. Are the bins of the histogram computed only once at the top of the algorithm and then used every time that we create a histogram? Or is it the case that we re-compute the bins for the new histogram that we create? If the latter I don’t think I fully understand their advantage (unless we use some approximate method to compute the bins other than sorting the data, is this what’s happening? ).

Thank you!

From my understanding, if you are using the hist tree method, it will need to compute the histogram at every node, however it won’t always have to compute it from scratch.
For example, given a parent node, with a histogram computed, and then two children nodes for which two new histograms need to be calculated. For the child node with fewer records, the histogram will be computed from scratch, this will be an O(n) operation, where n is the number of records in that node. For the second child node, the histogram does not need to be computed from scratch, instead the histogram of the other child node can be subtracted from the histogram of the parent node. This means only one of the child histograms need to be computed using the raw records, leading to large speed improvements. Additionally, one of the other benefits of using histograms, is that when deciding on the optimal split for a node using a given feature, only the histograms bins will be used as candidate splits, rather than the actual raw values such as using the exact method, which could have a much higher cardinality.

Also, for the histogram example, this would apply to the histogram of each feature.