Different inference time for models that were pickled vs saved

Hi! I was recently surprised by some XGBoost inference behaviour, and any feedback or insights would be welcome!

I had the same XGBClassifier model both dumped (pickled) and saved (in the more recent JSON format), and after loading these two back, inference takes a wildly different amount of time for them (a 5-10x difference with the models I’ve tried briefly). Any reason why this might happen?

I did a quick dummy test with the Iris dataset, as below. I trained and exported a quick model, resulting in a pickle and a JSON export of the same model; the code is in the next collapsed section.

Model training and export code

Here’s the code that I used to train and save the model:

import pickle
import time

import xgboost as xgb
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris["data"], iris["target"]

xgb_model = xgb.XGBClassifier(n_jobs=1).fit(X, y)

#  Model export: dump
with open("xgb_iris.pkl", "wb") as f:
    pickle.dump(xgb_model, f)
# Model export: save
xgb_model.save_model("xgb_iris.json")

Then I ran a quick benchmark that does inference a number of times on the whole dataset, one row at a time, and prints the average inference time per item. The code for the pickled/dumped model is in the collapsed section below:

Pickled model loading and benchmarking code
import pickle
import time

from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris["data"], iris["target"]

#  Model load from dump
with open("xgb_iris.pkl", "rb") as f:
    xgb_dumped = pickle.load(f)

repeats = 5
start = time.perf_counter()
for _ in range(repeats):
    for x_inference in X:
        xgb_dumped.predict_proba([x_inference])
stop = time.perf_counter()
print(f"Dumped model: Time per single inference: {(stop-start)*1000/len(X)/repeats:.3f} ms")

The result is pretty fast:

Dumped model: Time per single inference: 5.442 ms

Then I repeated the same thing for the JSON saved model:

JSON saved model loading and benchmarking code
import time

import xgboost as xgb
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris["data"], iris["target"]

#  Model load from save
xgb_saved = xgb.XGBClassifier()
xgb_saved.load_model("xgb_iris.json")

repeats = 5
start = time.perf_counter()
for _ in range(repeats):
    for x_inference in X:
        xgb_saved.predict_proba([x_inference])
stop = time.perf_counter()
print(f"Saved model: Time per single inference: {(stop-start)*1000/len(X)/repeats:.3f} ms")
The result is much slower:

Saved model: Time per single inference: 50.858 ms

This is a ~9x slowdown, even though the model does the same thing (I’ve checked elsewhere that the predictions are the same) :thinking:

Finally, if I alternate between the two objects in the same script, the dumped model slows down as well…

Combined/interleaved model benchmarking code
import pickle
import time

import xgboost as xgb
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris["data"], iris["target"]

# Inference

#  Model load from dump
with open("xgb_iris.pkl", "rb") as f:
    xgb_dumped = pickle.load(f)
#  Model load from save
xgb_saved = xgb.XGBClassifier()
xgb_saved.load_model("xgb_iris.json")


def run_timing(model, model_name, X, repeats=1):
    """Run timing and display results"""
    start = time.perf_counter()
    for _ in range(repeats):
        for x_inference in X:
            model.predict_proba([x_inference])
    stop = time.perf_counter()
    print(f"{model_name} model: Time per single inference: {(stop-start)*1000/len(X)/repeats:>7.3f} ms")

for _ in range(3):
    run_timing(xgb_dumped, "Dumped", X, repeats=5)
    run_timing(xgb_saved, "Saved ", X, repeats=5)

Here the first run of the dumped model is fast, but the repeated runs are about as slow as the JSON-loaded model:

Dumped model: Time per single inference:   5.156 ms
Saved  model: Time per single inference:  51.047 ms
Dumped model: Time per single inference:  47.337 ms
Saved  model: Time per single inference:  50.523 ms
Dumped model: Time per single inference:  50.274 ms
Saved  model: Time per single inference:  48.661 ms

Is there anything I’m missing? I expected both methods to run at the same speed. And I’m also wondering: if the pickled version can be fast, can the JSON-loaded version be sped up as well?

Cheers!

Given the size of the data, the actual inference time (i.e. CPU time spent on traversing the trees) is likely to be small. The remainder is probably overhead. Maybe garbage collection in Python is kicking in? Also, there is some overhead for converting the input data into an internal matrix representation.
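For example, a rough way to see how much of this is per-call overhead (just a sketch reusing the JSON-saved model from above; exact numbers will of course vary) is to compare per-row calls against a single batched predict_proba call over the whole dataset:

import time

import xgboost as xgb
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris["data"], iris["target"]

xgb_saved = xgb.XGBClassifier()
xgb_saved.load_model("xgb_iris.json")

# Per-row calls: the input conversion and other per-call overhead is paid once per sample
start = time.perf_counter()
for x_inference in X:
    xgb_saved.predict_proba([x_inference])
per_row = time.perf_counter() - start

# One batched call: the same overhead is paid only once for all rows
start = time.perf_counter()
xgb_saved.predict_proba(X)
batched = time.perf_counter() - start

print(f"Per-row total: {per_row * 1000:.1f} ms, batched total: {batched * 1000:.1f} ms")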

To reduce inference time to the bare minimum, you can use the C API of XGBoost or an optimized inference server, such as Triton.

Adding to hcho3’s reply, please make sure the number of threads is correctly set. Also, you can try inplace_predict.
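A rough sketch of both suggestions (assuming the xgb_iris.json file from the question; the nthread value of 1 here mirrors the n_jobs=1 used during training):

import xgboost as xgb
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris["data"], iris["target"]

# Load the JSON-saved model as before
xgb_saved = xgb.XGBClassifier()
xgb_saved.load_model("xgb_iris.json")

# Pin the thread count on the underlying booster; 1 mirrors the n_jobs=1 used at training time
booster = xgb_saved.get_booster()
booster.set_param({"nthread": 1})

# inplace_predict skips the per-call DMatrix construction; for this multi-class model
# it should return per-class probabilities
proba = booster.inplace_predict(X)
print(proba.shape)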

You can use .ubj to save the model as a binary JSON (UBJSON) file. It reduces the size of the model by half and has inference times similar to the pickled version:
https://xgboost.readthedocs.io/en/stable/tutorials/saving_model.html
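A minimal sketch of that, reusing the training snippet from the question (the .ubj extension tells save_model to write the binary UBJSON format):

import xgboost as xgb
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris["data"], iris["target"]

xgb_model = xgb.XGBClassifier(n_jobs=1).fit(X, y)

# Save in the binary UBJSON format instead of plain JSON
xgb_model.save_model("xgb_iris.ubj")

# Load it back the same way as the JSON model
xgb_loaded = xgb.XGBClassifier()
xgb_loaded.load_model("xgb_iris.ubj")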