I am training an XGBoostClassifier (ml.dmlc.xgboost4j.scala.spark, version 0.82) on a binary classification dataset. Training fails with the error below. The same logic runs fine with a single Parquet file (a test done to verify the dataset).
19/05/23 22:49:46 ERROR ApplicationMaster: User class threw exception: ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$postTrackerReturnProcessing(XGBoost.scala:511)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$1.apply(XGBoost.scala:404)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$1.apply(XGBoost.scala:381)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:380)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:196)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:48)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
at ge.drawbrid.dpp.spark2.demo.UserLevelGenderTrain$.main(UserLevelGenderTrain.scala:72)
at ge.drawbrid.dpp.spark2.demo.UserLevelGenderTrain.main(UserLevelGenderTrain.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:721)
Below is the relevant code:
val trainDF = spark.read.parquet(trainDFPath)
  .as[DemoFeatureWithLabel]
  .repartition(numWorkers, col("dbid"))
  .toDF
val testDF = spark.read.parquet(testDFPath).as[DemoFeatureWithLabel].toDF
val paramMap = Map(
  "eta" -> 0.1f,
  "objective" -> "binary:logistic",
  "num_round" -> 100,
  "num_workers" -> numWorkers,
  //"eval_metric" -> "auc",
  "training_metric" -> "true",
  //"timeout_request_workers" -> 300000L,
  "verbosity" -> verbosity
)
val xgbClassifier = new XGBoostClassifier(paramMap)
  .setFeaturesCol("featureVec")
  .setLabelCol("label")
  .setNumEarlyStoppingRounds(10)
  .setMaximizeEvaluationMetrics(true)
  .setMaxDepth(6)
  .setSilent(isSilent)
  .setEvalSets(Map("train" -> trainDF, "test" -> testDF))
  .setUseExternalMemory(true)
val model = xgbClassifier.fit(trainDF)
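As a side note, one sanity check that may be relevant here (a sketch only, reusing the trainDF and numWorkers values above): XGBoost4J-Spark coordinates training across exactly num_workers tasks, so if the DataFrame's partition count does not match num_workers, the Rabit tracker can stall and the job can end with a generic "XGBoostModel training failed" error.

// Sanity-check sketch: confirm the partition count matches num_workers
// before calling fit, since a mismatch can stall distributed training.
val parts = trainDF.rdd.getNumPartitions
require(parts == numWorkers,
  s"Expected $numWorkers partitions but found $parts")

This check passes in my case, but I am including it in case the failure mode rings a bell.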
The training dataset is ~12 GB (~4M rows, ~100K features). The following spark-submit configuration (Spark 2.3.0) is used:
--driver-memory 20g
--master yarn
--deploy-mode cluster
--num-executors 300
--executor-memory 25g
--executor-cores 4
--queue eng-normal
--conf "spark.sql.shuffle.partitions=10001"
--conf "spark.yarn.executor.memoryOverhead=20480"
--conf "spark.dynamicAllocation.enabled=false"
--conf "spark.shuffle.service.enabled=false"
--packages ml.dmlc:xgboost4j-spark:0.82
--conf spark.kryoserializer.buffer.max=256m
The issue probably has something to do with one of these configuration settings. Any suggestions would be appreciated. Thanks!