I was under the impression that if I have `n` trees, then these trees are labelled `0...n-1`, and that the `iteration_range` argument of `predict` takes a half-open interval, meaning that tree `i` is `iteration_range=(i, i+1)`, etc. But when I try to verify this I get very strange results:
```python
import numpy as np
import pandas as pd
import xgboost as xgb

assert xgb.__version__ == '1.7.4'

np.random.seed(0)
features = ['x1', 'x2']
df = pd.DataFrame(data=np.random.normal(size=(10_000, 2)), columns=features)
df['y'] = np.where(df['x1'] > df['x2'], 1, -1) + np.random.normal(size=(10_000,)) * 0.2

xgb_ds = xgb.DMatrix(df[features], label=df['y'])
xgb_booster = xgb.train({}, dtrain=xgb_ds, num_boost_round=5)
assert xgb_booster.num_boosted_rounds() == 5

mse = lambda l, r: np.mean((l - r)**2)

pred2 = 0
for i in range(xgb_booster.num_boosted_rounds()):
    pred1 = xgb_booster.predict(xgb_ds, iteration_range=(0, i+1))   # [0, i+1)
    pred2 += xgb_booster.predict(xgb_ds, iteration_range=(i, i+1))  # [i, i+1)
    print(mse(pred1, pred2))
```
This outputs:

```
0.0
0.25
1.0
2.25
4.0
```
But I would expect this to be 0 on every iteration, since both `pred1` and `pred2` should contain the predictions of trees `0...i` inclusive.
Clearly I am misunderstanding this parameter but I’ve read the documentation a few times now and I can’t find my mistake. What am I doing wrong?
> `iteration_range` (Tuple[int, int]) –
> Specifies which layer of trees are used in prediction. For example, if a random forest is trained with 100 rounds. Specifying iteration_range=(10, 20), then only the forests built during [10, 20) (half open set) rounds are used in this prediction.
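
One pattern worth noting: the printed errors are exactly `(0.5 * i)**2`. The sketch below is a toy model in plain Python, not xgboost's implementation; `TREES`, `toy_predict`, and `offset` are made-up names for illustration. It assumes that every predict call returns a constant offset plus the sum of the trees in the half-open range. With an offset of 0 the single-tree ranges add up to the prefix range, as expected. With an offset of 0.5 (which, as far as I know, is the default `base_score` in xgboost 1.7) the offset is counted once in the prefix prediction but `i + 1` times in the running sum, and the toy reproduces the numbers above. This is a hypothesis about the cause, not a confirmed diagnosis.

```python
# Toy stand-in for a boosted model: NOT xgboost, just plain Python.
# Each "tree" contributes one number per row (values chosen to be exact in binary).
TREES = [
    [0.25, -0.5],    # tree 0: contribution for row 0, row 1
    [0.125, 0.25],   # tree 1
    [-0.5, 0.125],   # tree 2
    [0.0625, -0.25], # tree 3
    [0.25, 0.5],     # tree 4
]


def toy_predict(begin, end, offset):
    """Sum the trees in the half-open range [begin, end), plus a per-call offset."""
    n_rows = len(TREES[0])
    return [offset + sum(tree[r] for tree in TREES[begin:end]) for r in range(n_rows)]


def mse(left, right):
    return sum((l - r) ** 2 for l, r in zip(left, right)) / len(left)


def run(offset):
    """Mirror the loop from the question: prefix range vs. sum of single-tree ranges."""
    errors = []
    summed = [0.0] * len(TREES[0])
    for i in range(len(TREES)):
        prefix = toy_predict(0, i + 1, offset)   # [0, i+1)
        single = toy_predict(i, i + 1, offset)   # [i, i+1)
        summed = [s + p for s, p in zip(summed, single)]
        errors.append(mse(prefix, summed))
    return errors


print(run(0.0))  # [0.0, 0.0, 0.0, 0.0, 0.0]
print(run(0.5))  # [0.0, 0.25, 1.0, 2.25, 4.0]
```

If the same thing is happening in xgboost, then the half-open interpretation of `iteration_range` is correct and the mismatch comes only from adding the constant offset once per call; training with `base_score` set to 0 would be one way to check this.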