I am reading the 2016 XGBoost paper, Section 4.1:

Time Complexity Analysis

Let`d`

be the maximum depth of the tree and`K`

be total number of trees. For the exact greedy algorithm, the time complexity of original spase aware algorithm is`O(K * d * nnz(x) * log n)`

. Here we use`nnz(x)`

to denote the number of non-missing entries in the training data.

On the other hand, tree boosting on the block structure only cost`O(K * d * nnz(x) + nnz(x) * log n)`

. Here`O(nnz(x) * log n)`

is the one time preprocessing cost that can be amortized. This analysis shows that the block structure helps to save an additional`log n`

factor, which is significant when`n`

is large. For the approximate algorithm, the time complexity of original algorithm withisbinary search`O(K * d * nnz(x) * log q)`

. Here`q`

is the number of proposal candidates in the dataset. While`q`

is usually between 32 and 100, the log factor still introduces overhead. Using the block structure, we can reduce the time to`O(K * d * nnz(x) + nnz(x) log B)`

, where`B`

is the maximum number of rows in each block. Again we can save the additional`log q`

factor in computation.

Could someone please explain where is the aforementioned binary search used and why is it needed?

I googled here and there, and thought a little bit, but I could not figure this out. I found this great answer, in which @hcho3 explains how histogram-based split evaluation and finding works in the parallel setting. But there is no mention of binary search there either.

I think I understand how we obtain `O(K * d * nnz(x) + nnz(x) * log n)`

: sorting a sparse dataset takes `O(nnz(x) * log n)`

, evaluating and finding splits at each level of a tree is `O(nnz(x))`

, hence `O(K * d * nnz(x))`

for `d`

levels and `K`

trees.

I also understand how `O(K * d * nnz(x) + nnz(x) log B)`

complexity is obtain in the “approximate algorithm” (i.e., when we use histogram-based aggregation). I don’t think it is achievable for the exact algorithm. Please correct me if I am wrong.

What I don’t understand is where `O(K * d * nnz(x) * log n)`

and `O(K * d * nnz(x) * log q)`

come from. The paper mentions binary search. We use binary search on ordered arrays. The only ordering we have is that by feature values. But why would we be looking for an element with a specific feature value?

I would highly appreciate a detailed explanation or even a short hint!