I am trying to implement an xgboost model in scala, using zeppelin in dataproc (google cloud). This is the code I’m implementing:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.{DataFrame, Dataset, Row, SaveMode, SparkSession}
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
import scala.collection.mutable
import org.apache.spark.sql.{DataFrame, _}
import spark.implicits._
import org.apache.spark.ml.{Pipeline, PipelineStage}
Adding deppendency (also added jar in zeppelin notebook dependencies)
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j-spark</artifactId>
<version>0.72</version>
</dependency>
Dummy data:
val someData = Seq(
Row(8, 15 1),
Row(64, 25 1),
Row(27, 22 0)
)
val someSchema = List(
StructField("var1", IntegerType, true),
StructField("var2", IntegerType, true),
StructField("Classification", IntegerType, true)
)
val data= spark.createDataFrame(
spark.sparkContext.parallelize(someData),
StructType(someSchema)
)
Model implementation:
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.VectorAssembler
val stringIndexer = new StringIndexer().
setInputCol("Classification").
setOutputCol("label").
fit(data)
val labelTransformed = stringIndexer.transform(data).drop("Classification")
val vectorAssembler = new VectorAssembler().
setInputCols(Array("var1", "var2")).
setOutputCol("features")
val xgbInput = vectorAssembler.transform(labelTransformed).select("features", "label")
import ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator
val paramMap = Map[String, Any]("objective" -> "binary:logistic", "nworkers" -> 2)
val est = new XGBoostEstimator(paramMap)
val model = est.fit(xgbInput)
Everything works except for the very last line, where I get the following error:
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.156.0.33, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=2}
ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$postTrackerReturnProcessing(XGBoost.scala:406)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4.apply(XGBoost.scala:356)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4.apply(XGBoost.scala:337)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:336)
at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.train(XGBoostEstimator.scala:139)
at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.train(XGBoostEstimator.scala:36)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
... 69 elided
Once again, using scala on zeppelin through dataproc, spark version is 2.4.5.
Can anyone help me? If anyone has another alternative altogheter that isn’t at all this code, I also accept that (I appreciate any solution, as long as it works in scala on google-cloud-dataproc!)