Hi all
TL;DR: I am noticing that the size in bytes of a pickled `xgboost.sklearn.XGBRegressor` object on disk changes after I reload it and write it back again. This is causing issues with version control (using DVC) of the machine learning models I create. My question: what is the right way to ensure this doesn't happen when using joblib/pickle?
For a quick glance, here is a snippet of the notebook (see the highlighted part for the file sizes in bytes):
Here is code to reproduce the issue:
import joblib
import numpy as np
import pandas as pd
from xgboost.sklearn import XGBRegressor

# Generate a reproducible toy dataset
rs = np.random.RandomState(seed=42)
n = 10**4
X = pd.DataFrame(rs.randint(0, 100, size=(n, 3)), columns=['var1', 'var2', 'var3'])
y = pd.DataFrame(rs.randint(0, 100, size=n))

xgb = XGBRegressor(n_jobs=4, random_state=42).fit(X, y)

# Dump the fitted model, then round-trip it (load + dump) twice;
# the three files end up with different sizes on disk
_ = joblib.dump(xgb, "./test_xgb.joblib")
_ = joblib.dump(joblib.load("./test_xgb.joblib"), "./test_xgb_r.joblib")
_ = joblib.dump(joblib.load("./test_xgb_r.joblib"), "./test_xgb_r_r.joblib")
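For anyone who wants to check this on their own objects, here is a minimal, library-agnostic sketch of the round-trip stability test, using the stdlib `pickle` and `hashlib` instead of joblib so it runs without extra dependencies (the `file_digest` helper is just an illustrative name, not part of any library):

```python
import hashlib
import os
import pickle
import tempfile

def file_digest(path):
    """Return (size, SHA-256) of a file, for byte-level comparison."""
    with open(path, "rb") as f:
        data = f.read()
    return len(data), hashlib.sha256(data).hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    p1 = os.path.join(tmp, "obj.pkl")
    p2 = os.path.join(tmp, "obj_r.pkl")

    obj = {"weights": list(range(100))}  # stand-in for a fitted model
    with open(p1, "wb") as f:
        pickle.dump(obj, f)

    # Round-trip: load and dump again, then compare the bytes on disk
    with open(p1, "rb") as f:
        reloaded = pickle.load(f)
    with open(p2, "wb") as f:
        pickle.dump(reloaded, f)

    print(file_digest(p1) == file_digest(p2))  # True for a plain dict
```

For simple built-in objects like the dict above, the round-trip is byte-for-byte stable; substituting the fitted `XGBRegressor` (and joblib for pickle) is where the sizes and hashes start to diverge, which is exactly what DVC then flags as a change.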