Has anyone used xgboost4j-spark on spark-shell?


Hi

I am trying to use xgboost4j-spark on my Windows machine (not on a cluster, because I don’t have access to run Spark jobs on one yet), but I keep running into the following error:

Tracker started, with env={}
...
20/02/24 22:13:01 ERROR RabitTracker: Uncaught exception thrown by worker:
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(Unknown Source)

It looks like “Tracker started, with env={}” is related to the failure, but I am not sure how to resolve it.
Based on https://github.com/dmlc/xgboost/issues/3951, I tried adding “TrackerConf”, but it didn’t help.
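
For reference, a minimal sketch of what passing a TrackerConf looks like. It assumes the 0.90-era API, where the `tracker_conf` key takes the two-argument case class `TrackerConf(workerConnectionTimeout, trackerImpl)`, and it assumes the xgboost4j-spark jars are on the spark-shell classpath. The timeout value and the reduced parameter list are illustrative only, and the column names are the ones from the example further down.

```scala
// Assumes the xgboost4j-spark 0.90 jars are on the spark-shell classpath.
import ml.dmlc.xgboost4j.scala.spark.{TrackerConf, XGBoostClassifier}

// Assumption: TrackerConf(workerConnectionTimeout in ms, trackerImpl).
// "scala" selects the JVM tracker instead of the default Python one;
// a timeout of 0L is documented as disabling the timeout.
val trackerParams: Map[String, Any] = Map(
  "objective" -> "binary:logistic",
  "num_round" -> 10, // key name documented for XGBoost4J-Spark
  "num_workers" -> 2,
  "tracker_conf" -> TrackerConf(0L, "scala")
)

// "features" and "class" are the column names used in the example below.
val classifier = new XGBoostClassifier(trackerParams).setFeaturesCol("features").setLabelCol("class")
```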

settings

  • Windows spark-shell (spark-2.4.5-bin-hadoop2.6)
  • xgboost4j-spark-0.90-criteo-20190702_2.11.jar
  • xgboost4j-0.90-criteo-20190702_2.11-win64.jar
    I got the Windows-compatible xgboost4j jar files from the Criteo repo and added them to spark-2.4.5-bin-hadoop2.6/jars

example

// imports
import org.apache.spark.ml.feature.{VectorAssembler, VectorIndexer, StringIndexer}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType, IntegerType}
import org.apache.spark.sql.functions.{col, when}
// data prep
val schema = new StructType().add("sepal_length", DoubleType, true).add("sepal_width", DoubleType, true).add("petal_length", DoubleType, true).add("petal_width", DoubleType, true).add("species", StringType, true)
val iris = spark.read.schema(schema).option("header", "true").format("csv").load("path/iris.csv")
// binary label: 1 for setosa, 0 otherwise
val testDF = iris.withColumn("class", when(col("species") === "setosa", 1).otherwise(0))
val vectorAssembler = new VectorAssembler().setInputCols(Array("sepal_length", "sepal_width", "petal_length", "petal_width")).setOutputCol("features")
val xgbInput = vectorAssembler.transform(testDF).selectExpr("features", "class")
// model params and training
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
val paramMap = List("eta" -> 0.1f, "scale_pos_weight" -> 10, "objective" -> "binary:logistic", "seed" -> 123, "silent" -> 1, "missing" -> 0.0, "lambda" -> 0.7, "alpha" -> 0, "max_depth" -> 3, "min_child_weight" -> 2.0, "round" -> 10, "num_workers" -> 2).toMap
val initmodel = new XGBoostClassifier(paramMap).setFeaturesCol("features").setLabelCol("class")
val model = initmodel.fit(xgbInput)
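
One thing that may be worth ruling out before calling fit: as far as I understand, XGBoost4J-Spark needs all num_workers tasks to be scheduled at the same time, so a local master with fewer task slots than num_workers could stall or fail training. A minimal sketch of that check follows. `hasEnoughSlots` is a hypothetical helper, not part of Spark or xgboost4j-spark, and treating `defaultParallelism` as the number of local task slots is an assumption.

```scala
// Hypothetical helper (not a library function): true when every worker,
// with its threads, can get a task slot at the same time.
def hasEnoughSlots(availableSlots: Int, numWorkers: Int, threadsPerWorker: Int = 1): Boolean =
  availableSlots >= numWorkers * threadsPerWorker

// In spark-shell, where `spark` is predefined, the check could be run as:
//   hasEnoughSlots(spark.sparkContext.defaultParallelism, numWorkers = 2)
```

If the check returns false, starting spark-shell with a larger local master (for example `--master local[2]` or more) would be the first thing to try.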

error message
20/02/24 22:13:01 WARN XGBoostSpark: train_test_ratio is deprecated since XGBoost 0.82, we recommend to explicitly pass a training and multiple evaluation datasets by passing 'eval_sets' and 'eval_set_names'
Tracker started, with env={}
20/02/24 22:13:01 WARN XGBoostSpark: train_test_ratio is deprecated since XGBoost 0.82, we recommend to explicitly pass a training and multiple evaluation datasets by passing 'eval_sets' and 'eval_set_names'
20/02/24 22:13:01 ERROR RabitTracker: Uncaught exception thrown by worker:
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(Unknown Source)
        at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:206)
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:222)
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:157)
        at org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:243)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:729)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
        at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:980)
        at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:978)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
        at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:978)
        at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$2$$anon$1.run(XGBoost.scala:452)
ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
        at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$postTrackerReturnProcessing(XGBoost.scala:582)
        at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$2.apply(XGBoost.scala:459)
        at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$2.apply(XGBoost.scala:435)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.immutable.List.map(List.scala:296)
        at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:434)
        at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:194)
        at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:44)
        at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
        ... 49 elided

I would greatly appreciate any suggestions.