Save and Load model in XGBoost4j with Databricks DBFS

Hi,

I am using Databricks (Spark 2.4.4), and XGBoost4J - 0.9.

I am able to save my model to an S3 bucket (using dbutils.fs.cp after saving it to the local file system); however, I can't load it back.

Code and errors are below:

val trainedModel = pipeline.fit(trainUpdated) // train model on pipeline (vectorAssembler + xgbregressor)

Create a directory to save the pipeline (again, model + vector assembler) -
dbutils.fs.mkdirs("/tmp/test-sage")
val trainedModelPath = "/dbfs/tmp/test-sage/m"

Save the model's native booster with -
trainedModel.stages(1).asInstanceOf[ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel].nativeBooster.saveModel(trainedModelPath)

Then, copy from DBFS to S3 -
dbutils.fs.cp("/tmp/test-sage/m", "/mnt/S3/XXXX-data-science/sandbox/save-test-xgboost/model")

I see a file in S3 called model (see screenshot attached) -

However, when I try to load it using -
val xgb = XGBoostRegressor.load("/mnt/S3/XXXX-data-science/sandbox/save-test-xgboost/model")
I get the error -
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: /mnt/S3/XXXX-data-science/sandbox/save-test-xgboost/model/metadata

i.e. the saved model is a single file (< 1 MB) and no metadata directory is saved alongside it.
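
As far as I understand, nativeBooster.saveModel writes only the raw booster binary, so I would expect that file to be readable only through the low-level xgboost4j API rather than the Spark ML reader, which looks for a metadata/ directory. Something along these lines (untested sketch):

import ml.dmlc.xgboost4j.scala.{Booster, XGBoost}

// Load the raw booster file written by nativeBooster.saveModel above
// (a plain file path; this is the low-level API, not the Spark ML reader)
val booster: Booster = XGBoost.loadModel("/dbfs/tmp/test-sage/m")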

Am I doing something wrong?

From a quick search I see this is a common pain point with XGBoost4J and Spark, so if this gets solved, I would be more than happy to write detailed documentation and open a relevant PR for it.

Thanks,
Daniel

@hcho3 FYI

@hcho3 can you assist as always? Thanks in advance :slight_smile:

I have no idea. Can you raise the issue with Databricks?

Will try them as well. Thank you.

@hcho3 is it even possible to save an XGBoost4J (Spark) model as a pickle? If so, can you elaborate on the best practice for doing it? Thanks

@hcho3, I think this relates to XGBoost4J -

When trying to save a pipeline (with XGBoost4J model), I get an error java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long
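
"Saving a pipeline" here means going through the standard Spark ML writer, roughly like this (a sketch; the path is a placeholder):

import org.apache.spark.ml.PipelineModel

// Save the fitted pipeline (VectorAssembler + XGBoostRegressor) with the Spark ML writer...
trainedModel.write.overwrite().save("dbfs:/tmp/test-sage/pipeline")

// ...and load it back
val reloaded = PipelineModel.load("dbfs:/tmp/test-sage/pipeline")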

Would love to hear from you about it. Thank you once again

Maybe try using the latest XGBoost snapshot? Again, I have no clue what's going on.

Hey @hcho3, it is the latest.

XGBoost 0.90 is not the latest. We have 1.0 in Maven Central and 1.1 in our private Maven repo.

Oh, got it. Will try 1.0 (the more publicly available one) and will update.

How to access snapshot version: https://xgboost.readthedocs.io/en/latest/jvm/index.html#access-snapshot-version.

I am using Databricks, so I will get it from Maven directly. Thanks.

Solved it; it was related to DBFS.

What was the solution?

Hi @jmpanfil,
the location I tried to save to didn't exist. You can try to save the file using:

  1. model.save
  2. The path, if you are using Databricks, can be something like:
    dbfs:/some_path_to_file
    or:
    file:/some_path_to_file
    or you can save directly to S3 (see the sketch below).
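
For reference, roughly the round trip that ended up working for me (a sketch; the path is just an example, adjust it to your workspace):

import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel

// Save the fitted XGBoost stage with the Spark ML writer; this writes a directory
// with metadata/ and data/ inside, which is what the matching load expects
val modelPath = "dbfs:/tmp/test-sage/xgb-model"  // or file:/..., or a mounted S3 path
trainedModel.stages(1)
  .asInstanceOf[XGBoostRegressionModel]
  .write.overwrite().save(modelPath)

// Load it back
val loaded = XGBoostRegressionModel.load(modelPath)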
You can also try MLeap (see https://docs.databricks.com/_static/notebooks/mleap-model-export-demo-python.html), which helped me a lot with saving the pipeline as well, since the pipeline I created is ultimately a Spark pipeline.

Good luck!