Pickle file size inconsistency

Hi all
TL;DR: I am noticing that the on-disk size in bytes of a pickled xgboost.sklearn.XGBRegressor object changes after I reload it and write it back again. This is causing issues with version control (using dvc) of the machine learning models I create. My question: what is the right way to ensure this doesn't happen when using joblib/pickle?

For a quick glance, here is a screenshot of the notebook (see the highlighted part for the file sizes in bytes):

Here is code for reproduction:

from xgboost.sklearn import XGBRegressor
import pandas as pd
import numpy as np
import joblib

# Build a small random regression dataset
rs = np.random.RandomState(seed=42)
n = 10**4
X = pd.DataFrame(rs.randint(0, 100, size=(n, 3)), columns=['var1', 'var2', 'var3'])
y = pd.DataFrame(rs.randint(0, 100, size=n))

# Fit a regressor, dump it, then reload and re-dump it twice
xgb = XGBRegressor(n_jobs=4, random_state=42).fit(X, y)
_ = joblib.dump(xgb, "./test_xgb.joblib")
_ = joblib.dump(joblib.load("./test_xgb.joblib"), "./test_xgb_r.joblib")
_ = joblib.dump(joblib.load("./test_xgb_r.joblib"), "./test_xgb_r_r.joblib")
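To make the inconsistency visible without the notebook screenshot, one can print the on-disk size of each dump; this is a minimal sketch (file names match the snippet above, and os.path.getsize is used for the byte counts):

import os

# Per the report above, the size of the original dump differs from the
# re-dumped copies even though the model itself is unchanged.
for path in ["./test_xgb.joblib", "./test_xgb_r.joblib", "./test_xgb_r_r.joblib"]:
    print(path, os.path.getsize(path), "bytes")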

You may have better success using the save_model() and load_model() methods.
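For reference, a minimal sketch of the round trip being suggested, using the estimator from the snippet above; the file name is illustrative, and depending on the XGBoost version the native format may be JSON or binary rather than a pickle of the Python object:

# Save in XGBoost's native model format instead of pickling the wrapper object
xgb.save_model("./test_xgb_model.json")

# Load back into a fresh estimator
xgb_reloaded = XGBRegressor()
xgb_reloaded.load_model("./test_xgb_model.json")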

@hcho3 thanks for the suggestion. Indeed, the save_model method is consistent from a file-size point of view. However, I am actually trying to save a sklearn.pipeline.Pipeline object with an XGBRegressor inside it (it looks like I dropped that detail while constructing the MWE), so I am not sure how I can avoid pickling.
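For context, a minimal sketch of the kind of object being persisted; the StandardScaler step is purely illustrative, the point is that the whole Pipeline goes through joblib, so the embedded XGBRegressor is pickled along with the other steps:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The regressor is one step inside the Pipeline, so save_model() alone does not
# cover the other steps; the entire object has to go through joblib/pickle.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", XGBRegressor(n_jobs=4, random_state=42)),
])
pipe.fit(X, y)
_ = joblib.dump(pipe, "./test_pipeline.joblib")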

By the way, if the above is an actual issue with the pickling of the class, I would be happy to help fix it. Would GitHub issues be the right place to start in that direction?

Yes, you should go ahead and file a new issue on GitHub.