Hi!
I am seeing splits returned by the get_split_value_histogram method of an XGBoost model in the Python API that are not actually splits in the model. I was just curious whether this is a bug or something I'm not understanding.
For example…
import xgboost as xgb
import seaborn as sns
titanic = sns.load_dataset("titanic")
Xt = titanic.select_dtypes("number").drop(columns="survived")
y = titanic["survived"]
mod = xgb.train(
    params=dict(objective="binary:logitraw", eval_metric="auc", seed=0),
    dtrain=xgb.DMatrix(Xt, label=y),
)
mod.get_split_value_histogram("pclass")
#    SplitValue  Count
# 0         2.5    4.0
# 1         3.0    8.0
However, if we actually look at the splits in the model, there are no splits at 2.5.
mod_dmp = "\n".join(mod.get_dump()).replace("\t", "").split("\n")
[l for l in mod_dmp if "pclass" in l]
# ['0:[pclass<3] yes=1,no=2,missing=1',
# '18:[pclass<2] yes=27,no=28,missing=27',
# '0:[pclass<3] yes=1,no=2,missing=1',
# '20:[pclass<2] yes=35,no=36,missing=35',
# '0:[pclass<3] yes=1,no=2,missing=1',
# '0:[pclass<3] yes=1,no=2,missing=1',
# '8:[pclass<3] yes=17,no=18,missing=17',
# '32:[pclass<3] yes=51,no=52,missing=51',
# '33:[pclass<2] yes=53,no=54,missing=53',
# '2:[pclass<3] yes=5,no=6,missing=5',
# '5:[pclass<3] yes=9,no=10,missing=9',
# '9:[pclass<2] yes=13,no=14,missing=13']
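For an exact tally of the thresholds, the dump lines can be parsed directly. This is a minimal sketch that uses the twelve lines copied from the output above instead of a live model; count_splits is a hypothetical helper, not part of the XGBoost API.

```python
import re
from collections import Counter

# Dump lines copied from the output above (only the pclass splits).
dump_lines = [
    "0:[pclass<3] yes=1,no=2,missing=1",
    "18:[pclass<2] yes=27,no=28,missing=27",
    "0:[pclass<3] yes=1,no=2,missing=1",
    "20:[pclass<2] yes=35,no=36,missing=35",
    "0:[pclass<3] yes=1,no=2,missing=1",
    "0:[pclass<3] yes=1,no=2,missing=1",
    "8:[pclass<3] yes=17,no=18,missing=17",
    "32:[pclass<3] yes=51,no=52,missing=51",
    "33:[pclass<2] yes=53,no=54,missing=53",
    "2:[pclass<3] yes=5,no=6,missing=5",
    "5:[pclass<3] yes=9,no=10,missing=9",
    "9:[pclass<2] yes=13,no=14,missing=13",
]


def count_splits(lines, feature):
    """Tally the split thresholds for `feature` (hypothetical helper)."""
    # Matches "[feature<threshold]" and captures the threshold text.
    pattern = re.compile(r"\[" + re.escape(feature) + r"<([^\]]+)\]")
    thresholds = []
    for line in lines:
        thresholds.extend(float(v) for v in pattern.findall(line))
    return Counter(thresholds)


split_counts = count_splits(dump_lines, "pclass")
print(sorted(split_counts.items()))  # [(2.0, 4), (3.0, 8)]
```

So the dump has four splits at 2 and eight at 3, the same counts the histogram reports, only under different values.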
Curious if this is just a bug? It may come from how the np.histogram function formats its bins (this is the function used inside get_split_value_histogram).
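The np.histogram guess can be checked without XGBoost. Below is a minimal sketch, assuming the method bins the thresholds with np.histogram using one bin per unique value and reports the upper edge of each bin. That is inferred from the output above, not confirmed against the library source. The thresholds are the ones in the dump: four at 2 and eight at 3.

```python
import numpy as np

# Split thresholds for pclass as they appear in the dump:
# four splits at 2 and eight splits at 3.
thresholds = np.array([2.0] * 4 + [3.0] * 8)

# Assumption: one bin per unique threshold, so two bins here.
n_bins = len(np.unique(thresholds))
counts, edges = np.histogram(thresholds, bins=n_bins)

print(edges.tolist())   # [2.0, 2.5, 3.0]
print(counts.tolist())  # [4, 8]

# Assumption: each bin is labelled by its upper edge.
upper_edges = edges[1:]
for value, count in zip(upper_edges.tolist(), counts.tolist()):
    print(value, count)
```

Under that assumption, 2.5 is only the boundary between the two bins, not a split in any tree, and the four splits at 2 are reported under it.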