XGBoost trees understanding

EZCocos · September 9, 2020, 10:46am

Hi All,
I have what I believe is a simple question, but I’m not sure of the answer.

While running XGBoost with say 100 trees, if we want to manually reconstruct the process for one observation, how should we use the trees to find the final leaf for this specific observation?
Is it the last tree (100) or a complex gradient calculation of all trees?

Many thanks

hcho3 · September 9, 2020, 5:38pm

No, you should obtain the final leaf for every tree in the ensemble. The prediction for a particular observation is obtained by summing the outputs of all the trees.
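To make the summation concrete, here is a toy sketch in plain Python. It does not use XGBoost; the two hand-written trees, the tuple layout, and the base_score value are assumptions chosen for illustration. It only shows the idea that each tree routes the observation to one leaf and the leaf outputs are added up, together with a global bias term.

```python
# Toy ensemble (hypothetical layout, not XGBoost's internal format).
# Internal node: (feature_index, threshold, left_child, right_child)
# Leaf: a plain float holding the leaf's output value.
tree_1 = (0, 5.0, -0.5, (1, 2.0, 0.25, 0.75))
tree_2 = (1, 1.0, -0.25, 0.125)


def leaf_value(node, x):
    """Walk one tree from the root until a leaf (a plain float) is reached."""
    while isinstance(node, tuple):
        feature, threshold, left, right = node
        # Go left when the feature value is below the threshold.
        node = left if x[feature] < threshold else right
    return node


def predict(trees, x, base_score=0.5):
    """Sum the leaf outputs of every tree, plus an assumed global bias."""
    return base_score + sum(leaf_value(tree, x) for tree in trees)


x = [7.0, 3.0]
prediction = predict([tree_1, tree_2], x)
print(prediction)  # 0.5 + 0.75 + 0.125 = 1.375
```

No single tree, including the last one, holds the final answer; every tree contributes one term to the sum.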

EZCocos · September 9, 2020, 11:04pm

Many thanks hcho3!

Unfortunately, that’s what I was concerned about.

I’m trying to assign each observation to a single, unique leaf or group. Is there a way I can achieve this with gradient boosting?
Thanks,

hcho3 · September 9, 2020, 11:15pm

Call the predict() function with the parameter pred_leaf=True. That will give you the leaf IDs that each observation ends up in, one per tree.

EZCocos · September 9, 2020, 11:28pm

Excellent! That’s great… Many thanks,
EZ

EZCocos · September 10, 2020, 6:35am

Hi hcho3,
I cannot see a pred_leaf=True parameter in XGBoost.predict(???).

I’m using XGBoost.XGBRegressor from the Python package.

A bit confused. Any suggestion?
Many thanks

hcho3 · September 10, 2020, 7:43am

Try clf.get_booster().predict(pred_leaf=True). get_booster() returns the underlying Booster object inside the XGBRegressor model.

EZCocos · September 10, 2020, 3:33pm

Hi hcho3,
clf.get_booster().predict(pred_leaf=True) did not work.

However, after many hours, using XGBRegressor.apply(X) gave what I think are the leaf IDs…

jm

EZCocos · September 11, 2020, 10:01am

Hi
Actually, I now have the final leaf from each of the 20 trees (boosting rounds).

What does it mean?

Do I need to sum or average the values of the 20 final leaves found for a specific observation,
or
does the 20th tree give the final leaf and value?

Cheers,
jm

hcho3 · September 11, 2020, 5:11pm

You should sum the leaf values (outputs) associated with the 20 final leaves.