Feature importance based on the optimal number of trees?


Let's say we fit a model with early stopping on a validation set and find that best_ntree_limit is 1,000, with early_stopping_rounds set to 500. The saved model object therefore contains 1,500 trees.

We would like to get feature importances back from this model, but only for the first 1,000 trees (the optimal model), not the overfit 1,500-tree model. Is that possible in either the Python or R API without having to compute them ourselves?



XGBoost only saves the final model, not the best one (the one at best_ntree_limit). So you have to train it again with exactly 1,000 trees.