`.predict` `iteration_range` unexpected behaviour

adfea9c0 · July 28, 2023, 6:33pm

I was under the impression that if I have n trees, they are labelled 0...n-1, and that the iteration_range argument of predict takes a half-open interval, so that tree i alone is selected by iteration_range=(i, i+1). But when I try to verify this I get very strange results:

import numpy as np
import pandas as pd
import xgboost as xgb
assert xgb.__version__ == '1.7.4'

np.random.seed(0)

features = ['x1', 'x2']
df = pd.DataFrame(data=np.random.normal(size=(10_000, 2)), columns=features)
df['y'] = np.where(df['x1'] > df['x2'], 1, -1) + np.random.normal(size=(10_000,)) * 0.2

xgb_ds = xgb.DMatrix(df[features], label=df['y'])
xgb_booster = xgb.train({}, dtrain=xgb_ds, num_boost_round=5)

assert xgb_booster.num_boosted_rounds() == 5

mse = lambda l, r: np.mean((l - r) ** 2)

pred2 = 0
for i in range(xgb_booster.num_boosted_rounds()):
    pred1 = xgb_booster.predict(xgb_ds, iteration_range=(0, i+1))   # [0, i+1)
    pred2 += xgb_booster.predict(xgb_ds, iteration_range=(i, i+1))  # [i, i+1)

    print(mse(pred1, pred2))

outputs:

0.0
0.25
1.0
2.25
4.0

But I would expect this to be 0 on every iteration, since both pred1 and pred2 should contain the summed predictions of trees 0...i inclusive.

Clearly I am misunderstanding this parameter but I’ve read the documentation a few times now and I can’t find my mistake. What am I doing wrong?

iteration_range (Tuple[int, int]) –
Specifies which layer of trees are used in prediction. For example, if a random forest is trained with 100 rounds. Specifying iteration_range=(10, 20), then only the forests built during [10, 20) (half open set) rounds are used in this prediction.

jiamingy · July 31, 2023, 2:48pm

Hi, there’s a base_score parameter in XGBoost, which represents the global bias, or intercept. The prediction function automatically adds it to the final prediction on every call, so summing the results of several predict calls counts it more than once.
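To make the double counting concrete, here is a minimal sketch in plain Python. It does not call XGBoost: the `predict` function below is a hypothetical stand-in for `Booster.predict(..., iteration_range=(start, end))`, the per-tree outputs are made-up numbers for a single row, and `BASE_SCORE = 0.5` is assumed to be the default for squared-error regression in the 1.7.x series used in the question.

```python
# Assumed default base_score for squared-error regression in XGBoost 1.7.x.
BASE_SCORE = 0.5

# Hypothetical per-tree outputs for a single row (not taken from the thread).
tree_outputs = [0.3, -0.1, 0.2, 0.05, -0.02]


def predict(start, end):
    """Stand-in for predict(iteration_range=(start, end)):
    the bias plus the outputs of trees in the half-open range [start, end)."""
    return BASE_SCORE + sum(tree_outputs[start:end])


naive = 0.0             # sums single-tree predictions as in the question
corrected = BASE_SCORE  # counts the bias once, then adds bias-free tree outputs
naive_errors = []
corrected_errors = []

for i in range(len(tree_outputs)):
    cumulative = predict(0, i + 1)                    # trees 0..i, bias once
    naive += predict(i, i + 1)                        # bias added on every call
    corrected += predict(i, i + 1) - BASE_SCORE       # strip the repeated bias
    naive_errors.append(round((cumulative - naive) ** 2, 6))
    corrected_errors.append(round((cumulative - corrected) ** 2, 6))

print(naive_errors)      # [0.0, 0.25, 1.0, 2.25, 4.0]
print(corrected_errors)  # [0.0, 0.0, 0.0, 0.0, 0.0]
```

The naive errors are (0.5 * i)^2, which matches the 0.0, 0.25, 1.0, 2.25, 4.0 sequence reported in the question. Under the same assumption, the original loop should agree if pred2 starts at base_score and each single-tree prediction has base_score subtracted before being added.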

adfea9c0 · August 1, 2023, 10:04pm

Can I find this value somewhere? It’s not in dump_model.
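One place the value appears to be exposed is the booster's JSON configuration, returned as a string by `Booster.save_config()`. The sketch below only parses such a string. The helper name `base_score_from_config` is hypothetical, and the key path `learner` / `learner_model_param` / `base_score` is an assumption based on the layout used by the 1.x series; other versions may store the value differently. The sample string is hand-written so that the block runs without XGBoost installed.

```python
import json


def base_score_from_config(config_json: str) -> float:
    """Extract base_score from a booster configuration string.

    Assumes the layout learner -> learner_model_param -> base_score,
    with the value stored as a string such as "5E-1".
    """
    config = json.loads(config_json)
    return float(config["learner"]["learner_model_param"]["base_score"])


# Hand-written stand-in for the string returned by Booster.save_config();
# a real configuration contains many more fields.
sample_config = json.dumps(
    {"learner": {"learner_model_param": {"base_score": "5E-1"}}}
)

print(base_score_from_config(sample_config))  # 0.5
```

With the booster from the question, the call would be `base_score_from_config(xgb_booster.save_config())`.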