Get_split_value_histogram Method Doesn't Return Actual Splits

jinlow · March 20, 2021, 7:28pm

Hi!
I am seeing values returned by the get_split_value_histogram method of an XGBoost model (Python API) that are not actually split points in the model. I was curious whether this is a bug or something I am misunderstanding.
For example…

import xgboost as xgb
import seaborn as sns

titanic = sns.load_dataset("titanic")
Xt = titanic.select_dtypes("number").drop(columns="survived")
y = titanic["survived"]
mod = xgb.train(
    params=dict(objective="binary:logitraw", eval_metric="auc", seed=0),
    dtrain=xgb.DMatrix(Xt, label=y),
)

mod.get_split_value_histogram("pclass")

#    SplitValue  Count
# 0         2.5    4.0
# 1         3.0    8.0

However, if we look at the actual splits in the model dump, pclass is only ever split at 2 or 3; there is no split at 2.5.

# Flatten every tree's text dump into one list of node lines.
mod_dmp = "\n".join(mod.get_dump()).replace("\t", "").split("\n")
[line for line in mod_dmp if "pclass" in line]

# ['0:[pclass<3] yes=1,no=2,missing=1',
#  '18:[pclass<2] yes=27,no=28,missing=27',
#  '0:[pclass<3] yes=1,no=2,missing=1',
#  '20:[pclass<2] yes=35,no=36,missing=35',
#  '0:[pclass<3] yes=1,no=2,missing=1',
#  '0:[pclass<3] yes=1,no=2,missing=1',
#  '8:[pclass<3] yes=17,no=18,missing=17',
#  '32:[pclass<3] yes=51,no=52,missing=51',
#  '33:[pclass<2] yes=53,no=54,missing=53',
#  '2:[pclass<3] yes=5,no=6,missing=5',
#  '5:[pclass<3] yes=9,no=10,missing=9',
#  '9:[pclass<2] yes=13,no=14,missing=13']
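
To tally these thresholds rather than eyeball them, here is a small sketch using only the standard library. The dump lines are pasted from the output above; the regular expression is my own assumption about how the text dump writes a split condition, not something taken from the XGBoost docs.

```python
import re
from collections import Counter

# Node lines copied verbatim from the dump output above.
dump_lines = [
    "0:[pclass<3] yes=1,no=2,missing=1",
    "18:[pclass<2] yes=27,no=28,missing=27",
    "0:[pclass<3] yes=1,no=2,missing=1",
    "20:[pclass<2] yes=35,no=36,missing=35",
    "0:[pclass<3] yes=1,no=2,missing=1",
    "0:[pclass<3] yes=1,no=2,missing=1",
    "8:[pclass<3] yes=17,no=18,missing=17",
    "32:[pclass<3] yes=51,no=52,missing=51",
    "33:[pclass<2] yes=53,no=54,missing=53",
    "2:[pclass<3] yes=5,no=6,missing=5",
    "5:[pclass<3] yes=9,no=10,missing=9",
    "9:[pclass<2] yes=13,no=14,missing=13",
]

# Assumed pattern for a split condition on pclass: "[pclass<threshold]".
pattern = re.compile(r"\[pclass<([\d.eE+-]+)\]")

# Count how often each threshold appears across all trees.
split_counts = Counter()
for line in dump_lines:
    match = pattern.search(line)
    if match:
        split_counts[float(match.group(1))] += 1

print(sorted(split_counts.items()))
```

This gives four splits at 2 and eight at 3, so the counts in the histogram line up with the dump even though the 2.5 value does not.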

Is this just a bug? It looks like it may come from how np.histogram lays out its bins (this is the function used inside get_split_value_histogram), so the values reported might be bin edges rather than the split thresholds themselves.
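
A minimal sketch of that guess, using only NumPy. It assumes the method bins the raw thresholds with one bin per unique value and then reports the right edge of each bin; that is my reading of the behaviour, not something I have confirmed against the library source. The thresholds are the ones from the dump above.

```python
import numpy as np

# Thresholds taken from the dump: eight splits at pclass<3, four at pclass<2.
thresholds = [3.0] * 8 + [2.0] * 4

# Assumption: by default, one bin per unique threshold.
n_bins = len(np.unique(thresholds))

# np.histogram spaces the bin edges evenly between min and max.
counts, edges = np.histogram(thresholds, bins=n_bins)

# Assumption: each count is labelled with the right edge of its bin.
right_edges = edges[1:]
for edge, count in zip(right_edges, counts):
    print(edge, count)
```

With two bins between 2 and 3, the edges are 2.0, 2.5 and 3.0, so the four splits at 2 end up labelled 2.5 and the eight splits at 3 end up labelled 3.0, which matches the output of get_split_value_histogram shown earlier.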