Retrieve feature_names from pickled model

TalhaAsmal · July 13, 2021, 7:17am

I have a large number of models trained with previous versions of xgboost (mainly 1.2.x) that are saved as pickled objects. When I load them with 1.4.2, the model_features list is completely empty. Reverting to 1.2 brings that list back, so I know it’s still available in the pickled model.

I understand JSON is the standard going forward, so I’d like to know: is there a way to load the pickled model without losing feature_names and then re-save it as JSON?

I’ve tried all of the following, to no avail:

Load with xgboost 1.3 (feature_names is populated) and save using bst.save_model() to binary format (feature_names are lost when I re-load using bst.load_model())
Load pickled model and save to JSON using 1.4 (feature_names are lost when I load the pickled model)
Load pickled model (feature_names is populated) and save to JSON using 1.2 and 1.3 (feature_names are lost when I re-load using bst.load_model())

Is there any easy way to achieve what I’m trying, other than manually keeping track of the feature_names for each model and repopulating it before saving it in JSON format with the latest version of xgboost?

hcho3 · July 13, 2021, 7:36am

I understand your frustration of having to migrate a large number of models. Unfortunately, there isn’t an easy way to migrate the feature name information.

other than manually keeping track of the feature_names for each model and repopulating it before saving it in JSON format with the latest version of xgboost?

This is essentially what you’ll have to do. For this, you’d need to have two Python virtual environments, to use XGBoost 1.3 and 1.4 respectively.

hcho3 · July 13, 2021, 7:53am

In general, backward compatibility is difficult when it comes to Python pickles. That is, it is hard to guarantee that a pickle produced with a previous version of XGBoost can be read into a new version of XGBoost.

TalhaAsmal · July 13, 2021, 8:04am

Thanks for the quick response. I fully understand pickles are unreliable, which is why I also tried saving the models in the default xgboost binary format, which failed as well.

Let’s assume I started with XGBoost 1.2. How should I save a model to preserve all attributes, including feature_names and best_ntree_limit, for use with future versions of XGBoost? The reason I ask is that I have access to XGBoost 1.2, so I can easily load the pickled models in that version, save them in an approved format, and then use them in the latest version of XGBoost.

hcho3 · July 13, 2021, 8:25am

EDIT. best_ntree_limit is already saved as part of the model file.

You can’t. More precisely, neither binary format nor JSON format will save feature_names.

We are trying to move away from storing important information in Python attributes. The only guarantee we provide is that, if a piece of information is already available in a saved JSON file, it will be preserved when the JSON file is read in a future version of XGBoost. So we are in a difficult situation, since XGBoost 1.2 does not yet store feature_names in the JSON file.

hcho3 · July 13, 2021, 8:29am

You have two alternatives:

Manually export important attributes (like feature_names) as a separate JSON file, and then re-populate them after migrating the model to the latest version.
Keep a separate environment with an old version of XGBoost.
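
The first alternative could be sketched roughly as follows. This is only an illustration, not an official migration tool: the helper names `export_feature_names` and `restore_feature_names` and the sidecar file name are hypothetical, and the helpers assume nothing more than an object with a `feature_names` attribute, which is how the Booster is described in this thread.

```python
import json
from pathlib import Path


def export_feature_names(booster, path):
    """Run in the OLD environment: write booster.feature_names to a sidecar JSON file."""
    names = list(booster.feature_names or [])
    Path(path).write_text(json.dumps({"feature_names": names}))
    return names


def restore_feature_names(booster, path):
    """Run in the NEW environment: re-populate feature_names from the sidecar file."""
    names = json.loads(Path(path).read_text())["feature_names"]
    booster.feature_names = names
    return booster


# Intended usage (assumes xgboost is installed; file names are placeholders):
#
#   Old environment (e.g. XGBoost 1.2 or 1.3):
#       bst = pickle.load(open("model.pkl", "rb"))
#       export_feature_names(bst, "model.features.json")
#       bst.save_model("model.json")
#
#   New environment (e.g. XGBoost 1.4):
#       bst = xgboost.Booster()
#       bst.load_model("model.json")
#       restore_feature_names(bst, "model.features.json")
```

Keeping the names in a separate file next to each model means the pickle only has to be readable once, in the old environment.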

TalhaAsmal · July 13, 2021, 10:53am

Awesome, thanks for the advice. I think I’ve already resigned myself to needing to keep a log of the feature_names somewhere; I was just hoping there was an easier way.

In that case, where/how should we store important information?

hcho3 · July 13, 2021, 6:00pm

XGBoost developers (including me) are moving important information like feature_names into the saved JSON file. This way, it can be accessed portably in later versions of XGBoost.
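
To check whether a given saved JSON model file already carries the names, something like the sketch below may help. It uses only the standard library. The key path (`learner`, then `feature_names`) is an assumption about the file layout and the function name is hypothetical, so compare against a file written by your own XGBoost version before relying on it.

```python
import json


def feature_names_in_model_file(path):
    """Return the feature names stored in a saved JSON model file, or [] if none are found."""
    with open(path, "r", encoding="utf-8") as fh:
        model = json.load(fh)
    # Assumed layout: {"learner": {"feature_names": [...], ...}, ...}
    return model.get("learner", {}).get("feature_names", [])
```

An empty result suggests the file was written by a version that did not store the names, in which case they would still need to be tracked separately.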