With the latest XGBoost4J-Spark, I was able to save the model to HDFS.
In your pom.xml, add the following dependency:
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j-spark</artifactId>
<version>0.8-SNAPSHOT</version>
</dependency>
Then add the following repository:
<repository>
<id>XGBoost4J-Spark Snapshot Repo</id>
<name>XGBoost4J-Spark Snapshot Repo</name>
<url>https://raw.githubusercontent.com/CodingCat/xgboost/maven-repo/</url>
</repository>
This repository hosts the latest JAR for XGBoost4J-Spark.
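The save/load examples below assume a model has already been trained. A minimal training sketch, where the hyper-parameter values and the input DataFrame `xgbInput` (with `features` and `label` columns) are assumptions for illustration:

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

// Hypothetical hyper-parameters; tune these for your own data.
val xgbParam = Map(
  "eta" -> 0.1,
  "max_depth" -> 6,
  "objective" -> "binary:logistic",
  "num_round" -> 10,
  "num_workers" -> 2
)

// xgbInput is assumed to be a DataFrame with "features" and "label" columns.
val model = new XGBoostClassifier(xgbParam)
  .setFeaturesCol("features")
  .setLabelCol("label")
  .fit(xgbInput)
```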
As for your question about the two methods: they serve different purposes.
-
model.save()
persists the model between Spark sessions:
/* Session 1 in spark shell */
model.save("hdfs://...")
/* Session 2 in spark shell */
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel
...
val xgbClassificationModel2 = XGBoostClassificationModel.load("hdfs://...")
xgbClassificationModel2.transform(xgbInput)
This function is available from the JAR hosted by the repository CodingCat/xgboost/maven-repo/.
-
model.nativeBooster.saveModel()
exports the model to a format readable by other XGBoost bindings (e.g. Python):
/* Session 1 in spark shell */
model.nativeBooster.saveModel(nativeModelPath)
# Session 2 in Python shell
import xgboost as xgb
bst = xgb.Booster({'nthread': 4})
bst.load_model(nativeModelPath)
This function is not available from the JAR hosted by the repository CodingCat/xgboost/maven-repo/. You'll have to compile from source, making sure to set USE_HDFS=ON in jvm-packages/create_jni.py.
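A sketch of that build, assuming a Unix shell with Git and Maven available; the sed edit flips the USE_HDFS flag in create_jni.py, but verify the exact flag spelling against your checkout:

```shell
# Clone XGBoost together with its submodules.
git clone --recursive https://github.com/dmlc/xgboost.git
cd xgboost/jvm-packages

# Enable HDFS support before building (create_jni.py holds the build flags).
sed -i 's/"USE_HDFS": "OFF"/"USE_HDFS": "ON"/' create_jni.py

# Build and install the JARs into the local Maven repository.
mvn -DskipTests install
```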