Save PySpark model, load it in another language

iamdvd · August 16, 2022, 2:54pm

Hi,

I am using xgboost 2.0 to train a model with PySpark, and I was wondering whether I can save this model and use it in Python. I’ve tried the ‘save’ function, but it is not working for me. The documentation for ‘save’ doesn’t mention the universal format.
Spark save function
Python save_model function

Thanks!

jiamingy · August 21, 2022, 8:03am

Yes, but PySpark also saves some other metadata, so users need to manually retrieve the model from the saved directory created by PySpark. I opened an issue, https://github.com/dmlc/xgboost/issues/8186, and will try to write some documentation later on.

iamdvd · August 31, 2022, 11:58am

Thanks @jiamingy, I took a deeper look and made some progress, but it is still not working.

PySpark saved the model in a folder (apart from the metadata) containing 2 parquet files. So I’ve done the following:

Ran spark.read.parquet in PySpark on the folder where these two parquet files are located.
Wrote the resulting object as JSON with object.write.json(path).
Tried to load this JSON in Python with xgboost’s load_model.

Just in case it’s relevant, I am using SparkXGBClassifier in PySpark and XGBClassifier in Python; the xgboost version is 2.0.0 in both.

I get the error: xgboost/json.h:79: Invalid cast, from Null to Object

iamdvd · September 1, 2022, 3:26pm

Just to update here, I finally managed to load the model in Python. These are the steps I followed:

Save the model in PySpark. This produces a folder with parquet files.
Read these parquet files in PySpark and write them out as JSON.
Load this JSON in Python and write it to a file in a properly formatted form.
Load the JSON file into an xgboost model with load_model().
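
For anyone following the same route, the third step could look roughly like the sketch below. It is only an illustration under assumptions the post does not confirm: that the JSON written by Spark is a JSON Lines part file with a single record, and that the record has a single column holding the model either as a JSON string or as a nested object. The function name `extract_model_json`, the paths, and the column name in the comments are hypothetical.

```python
import json
from pathlib import Path


def extract_model_json(spark_json_file, out_file, column=None):
    """Unwrap a model from a one-record JSON Lines file and save it on its own.

    Assumptions (not confirmed by the thread): the file has exactly one
    non-empty line, and the model sits in one column of that record.
    """
    text = Path(spark_json_file).read_text(encoding="utf-8")
    # JSON Lines: one JSON object per line; ignore blank lines.
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) != 1:
        raise ValueError(f"expected exactly one record, found {len(lines)}")

    record = json.loads(lines[0])
    if column is None:
        # Hypothetical convenience: accept a record with a single column.
        if len(record) != 1:
            raise ValueError(f"pass column=..., found columns {sorted(record)}")
        column = next(iter(record))

    payload = record[column]
    # The column may hold the model as a JSON string or as a nested object.
    model = json.loads(payload) if isinstance(payload, str) else payload
    if not isinstance(model, dict):
        raise ValueError("the selected column does not contain a JSON object")

    # Write the bare model document, without the wrapping record.
    Path(out_file).write_text(json.dumps(model), encoding="utf-8")
    return out_file
```

The file written to `out_file` is what would then be passed to load_model() in the last step.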

jiamingy · December 24, 2022, 8:43pm

I think the easier way is just to call the get_booster method from the PySpark estimators.
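
A minimal sketch of that suggestion, assuming `spark_model` is the fitted model returned by `SparkXGBClassifier.fit` and that it runs on the Spark driver. The helper name `export_booster` and the file name are hypothetical; `get_booster()` and `Booster.save_model()` are the only xgboost calls involved.

```python
def export_booster(spark_model, path):
    """Save the booster held by a fitted xgboost.spark model to `path`.

    `spark_model` is assumed to expose get_booster(), as suggested above.
    """
    booster = spark_model.get_booster()
    # With a ".json" extension, Booster.save_model writes the JSON model format.
    booster.save_model(path)
    return path


# Usage on the Spark driver (hypothetical names and arguments):
#   spark_model = SparkXGBClassifier(...).fit(train_df)
#   export_booster(spark_model, "model.json")
#
# Later, in plain Python without Spark:
#   import xgboost
#   booster = xgboost.Booster(model_file="model.json")
```

This skips the parquet and JSON conversion steps entirely, since the booster is written directly in a format that xgboost can load outside of Spark.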