[jvm-packages] User-input parameters not saved when calling trainWithRDD

Reposting from https://github.com/dmlc/xgboost/issues/3435.

OS: RHEL 6
Scala version: 2.11.8
Spark version: 2.2.0
Package used (python/R/jvm/C++): jvm
xgboost version used: v0.72

Steps to reproduce

Train a model using XGBoost.trainWithRDD, then print the model parameters and save the model. The parameters returned by extractParamMap() and those found in the metadata of the saved model are always the defaults instead of the user-input parameters.

Code

import ml.dmlc.xgboost4j.scala.spark.XGBoost

// trainRDD is the training data RDD prepared elsewhere
val paramMap = List(
  "booster" -> "gbtree",
  "silent" -> 0,
  "nthread" -> 4,
  "objective" -> "reg:linear",
  "eta" -> 0.05f,
  "max_depth" -> 5,
  "min_child_weight" -> 2,
  "subsample" -> 0.5f,
  "alpha" -> 0,
  "lambda" -> 0,
  "seed" -> 27,
  "eval_metric" -> "auc",
  "tree_method" -> "approx").toMap

val xgboostModelRDD = XGBoost.trainWithRDD(trainRDD, paramMap, round=5, nWorkers=1, useExternalMemory=true)

print(xgboostModelRDD.extractParamMap())
xgboostModelRDD.write.overwrite().save("xgboostModel")

Output

{
	XGBoostRegressionModel_151ece6c7f93-alpha: 0.0,
	XGBoostRegressionModel_151ece6c7f93-booster: gbtree,
	XGBoostRegressionModel_151ece6c7f93-colsample_bylevel: 1.0,
	XGBoostRegressionModel_151ece6c7f93-colsample_bytree: 1.0,
	XGBoostRegressionModel_151ece6c7f93-eta: 0.3,
	XGBoostRegressionModel_151ece6c7f93-featuresCol: features,
	XGBoostRegressionModel_151ece6c7f93-gamma: 0.0,
	XGBoostRegressionModel_151ece6c7f93-grow_policy: depthwise,
	XGBoostRegressionModel_151ece6c7f93-labelCol: label,
	XGBoostRegressionModel_151ece6c7f93-lambda: 1.0,
	XGBoostRegressionModel_151ece6c7f93-lambda_bias: 0.0,
	XGBoostRegressionModel_151ece6c7f93-max_bin: 16,
	XGBoostRegressionModel_151ece6c7f93-max_delta_step: 0.0,
	XGBoostRegressionModel_151ece6c7f93-max_depth: 6,
	XGBoostRegressionModel_151ece6c7f93-min_child_weight: 1.0,
	XGBoostRegressionModel_151ece6c7f93-normalize_type: tree,
	XGBoostRegressionModel_151ece6c7f93-predictionCol: prediction,
	XGBoostRegressionModel_151ece6c7f93-rate_drop: 0.0,
	XGBoostRegressionModel_151ece6c7f93-sample_type: uniform,
	XGBoostRegressionModel_151ece6c7f93-scale_pos_weight: 1.0,
	XGBoostRegressionModel_151ece6c7f93-sketch_eps: 0.03,
	XGBoostRegressionModel_151ece6c7f93-skip_drop: 0.0,
	XGBoostRegressionModel_151ece6c7f93-subsample: 1.0,
	XGBoostRegressionModel_151ece6c7f93-tree_method: auto,
	XGBoostRegressionModel_151ece6c7f93-use_external_memory: false
}

What have you tried?

Training the model with XGBoost.trainWithDataFrame preserves the user-input parameters (both extractParamMap() and the saved model's metadata show them), whereas XGBoost.trainWithRDD appears to save only the default parameters.
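
For reference, a minimal sketch of that workaround, assuming trainWithDataFrame accepts the same round/nWorkers/useExternalMemory arguments as trainWithRDD and that trainDF is a DataFrame with the default "features" and "label" columns (trainDF and the output path are placeholders):

// Workaround sketch: trainWithDataFrame keeps the user-supplied parameters.
// trainDF is assumed to be a DataFrame with "features" (vector) and "label" columns.
val xgboostModelDF = XGBoost.trainWithDataFrame(trainDF, paramMap, round = 5, nWorkers = 1, useExternalMemory = true)

print(xgboostModelDF.extractParamMap())                  // shows the user-input values
xgboostModelDF.write.overwrite().save("xgboostModelDF")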

The upcoming 0.80 version of XGBoost has a revamped interface for Spark integration. There will be a single function to train the model:

import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
...
val xgbParam = Map("eta" -> 0.1f,
      "max_depth" -> 2,
      "objective" -> "multi:softprob",
      "num_class" -> 3,
      "num_round" -> 100,
      "num_workers" -> 2)
val xgbClassifier = new XGBoostClassifier(xgbParam).
      setFeaturesCol("features").
      setLabelCol("classIndex")
val xgbClassificationModel = xgbClassifier.fit(xgbInput)
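
With the new Estimator API, the fitted model is a regular Spark ML model, so the checks from the report above can be repeated directly; a minimal sketch, assuming the model exposes the standard extractParamMap() and MLWritable save (the output path is illustrative):

// Repeat the original parameter/save checks against the new API (sketch)
println(xgbClassificationModel.extractParamMap())        // should list the user-set values
xgbClassificationModel.write.overwrite().save("xgboostClassificationModel")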

We are currently working on a tutorial for the latest XGBoost4J-Spark. You can read the draft here. The tutorial will give you an overview of how data gets pre-processed and fed into the Booster.
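
As a rough illustration of the kind of pre-processing the tutorial covers, the snippet below builds the "features" and "classIndex" columns used by the classifier above with standard Spark ML transformers; rawInput and the column names are hypothetical iris-style placeholders:

import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Assemble raw numeric columns into the "features" vector expected by XGBoostClassifier.
val assembler = new VectorAssembler()
  .setInputCols(Array("sepal_length", "sepal_width", "petal_length", "petal_width"))
  .setOutputCol("features")

// Index the string label into the numeric "classIndex" column set on the classifier above.
val indexer = new StringIndexer()
  .setInputCol("species")
  .setOutputCol("classIndex")
  .fit(rawInput)

val xgbInput = indexer.transform(assembler.transform(rawInput)).select("features", "classIndex")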