How does the "hist" tree_method compute the histogram bins?

Hi, where can I find a detailed explanation of how the bins are computed when using tree_method='hist' in XGBoost?
I read the weighted quantile sketch algorithm in the XGBoost paper, but when I trained a model and checked the split points, I got some unexpected results.
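For reference, this is how I understand the candidate-split selection in the paper (the weighted quantile sketch section), so please correct me if I got it wrong: for feature k, the rank of a value z is

r_k(z) = ( Σ h over all points with x_k < z ) / ( Σ h over all points ),

where h is the hessian of each point, and the candidate splits {s_1, ..., s_l} are picked so that consecutive candidates differ in rank by less than eps, i.e. |r_k(s_j) − r_k(s_{j+1})| < eps.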
I ran the following code to train an XGBoost classifier on the adult dataset while restricting the number of bins to 2 (using only 2 bins is bad practice, of course, but I did it to try to understand the binning algorithm):
import numpy as np
from xgboost import XGBClassifier

# X_train / y_train: features and labels prepared from the adult dataset
clf = XGBClassifier(max_depth=6, n_estimators=100, tree_method='hist',
                    max_bin=2, nthread=1)
clf.fit(X_train, y_train)

feature_vals, counts = np.unique(X_train['capital-gain'], return_counts=True)
print('feature_vals:', feature_vals)
print('counts:', counts)
print('chosen splits:')
booster = clf.get_booster()
print(booster.get_split_value_histogram('capital-gain'))

The results I get are as follows:
feature_vals: [ 0 114 594 914 991 1055 1086 1151 1173 1409 1424 1455
1471 1506 1639 1797 1831 1848 2009 2036 2050 2062 2105 2174
2176 2202 2228 2290 2329 2346 2354 2387 2407 2414 2463 2538
2580 2597 2635 2653 2829 2885 2907 2936 2961 2964 2977 2993
3103 3137 3273 3325 3411 3418 3432 3456 3464 3471 3674 3781
3818 3887 3908 3942 4064 4101 4386 4416 4508 4650 4687 4787
4865 4931 4934 5013 5178 5455 5556 5721 6097 6360 6418 6497
6514 6767 6849 7298 7430 7443 7688 7896 8614 9386 9562 10520
10566 10605 11678 13550 14084 14344 15020 15024 15831 18481 20051 22040
25124 25236 27828 34095 41310 99999]
counts: [20906 1 25 6 4 20 2 6 2 4 2 1
6 12 1 2 6 4 2 3 4 2 7 34
16 12 4 3 5 4 8 1 15 6 8 1
7 14 3 4 21 18 8 1 3 6 6 1
64 23 4 38 17 5 3 2 17 6 8 8
7 6 23 9 30 13 52 5 7 24 2 15
9 1 4 51 68 7 3 2 1 2 6 9
4 4 20 167 7 4 194 2 42 14 3 26
6 9 1 22 25 18 3 255 4 1 31 1
3 11 20 3 1 103]
chosen splits:
SplitValue Count
0 7298.5 195.0

It is clear that most of the feature values are 0 (about 20K out of roughly 30K values). Since the hessians at the start of training are 0.25 for all samples (p(1 − p) with the default base_score of 0.5), I would expect all samples to have the same weight. That means the single cut should be very close to the smallest feature value, which is 0, yet the split value I actually get is 7298.5. Why does this happen?
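To make my expectation concrete, here is a rough sketch of the computation I have in mind (my own simplification using a plain weighted median, not XGBoost's actual sketch implementation; weighted_median_cut is just a name I made up for illustration):

import numpy as np

def weighted_median_cut(values, weights):
    # Sort by value and return the first value at which the cumulative
    # weight reaches half of the total weight.
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    cum = np.cumsum(w)
    return v[np.searchsorted(cum, 0.5 * cum[-1])]

# With a hessian of 0.25 per sample the weights are uniform, so the value
# counts dominate: ~20906 of ~30K values are 0, the cumulative weight
# passes 50% already at the value 0, and the single cut lands on 0.
print(weighted_median_cut(feature_vals, 0.25 * counts))  # -> 0.0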

Thanks in advance!