XGBoostModel training fails as data size increases. Any help would be appreciated!



I am training an XGBoostClassifier using ml.dmlc.xgboost4j.scala.spark version 0.82 on a binary classification dataset. Training fails with the error shown below. The same code runs fine on a single parquet file (a test I ran to verify the dataset).

19/05/23 22:49:46 ERROR ApplicationMaster: User class threw exception: ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$postTrackerReturnProcessing(XGBoost.scala:511)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$1.apply(XGBoost.scala:404)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$1.apply(XGBoost.scala:381)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.immutable.List.map(List.scala:285)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:380)
	at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:196)
	at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:48)
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
	at ge.drawbrid.dpp.spark2.demo.UserLevelGenderTrain$.main(UserLevelGenderTrain.scala:72)
	at ge.drawbrid.dpp.spark2.demo.UserLevelGenderTrain.main(UserLevelGenderTrain.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:721)

The training code is below:

    val trainDF = spark.read.parquet(trainDFPath).as[DemoFeatureWithLabel].repartition(numWorkers, col("dbid")).toDF
    val testDF = spark.read.parquet(testDFPath).as[DemoFeatureWithLabel].toDF

    val paramMap = Map(
      "eta" -> 0.1f,
      "objective" -> "binary:logistic",
      "num_round" -> 100,
      "num_workers" -> numWorkers,
      //"eval_metric" -> "auc",
      "training_metric" -> "true",
      //"timeout_request_workers" -> 300000L,
      "verbosity" -> verbosity
    )

    val xgbClassifier = new XGBoostClassifier(paramMap)
      .setFeaturesCol("featureVec")
      .setLabelCol("label")
      .setNumEarlyStoppingRounds(10)
      .setMaximizeEvaluationMetrics(true)
      .setMaxDepth(6)
      .setSilent(isSilent)
      .setEvalSets(Map("train"-> trainDF, "test" -> testDF))
      .setUseExternalMemory(true)

    val model = xgbClassifier.fit(trainDF)

The training dataset is ~12GB (~4M rows and ~100K features). This is the Spark configuration (Spark 2.3.0) used:

--driver-memory 20g
--master yarn
--deploy-mode cluster
--num-executors 300
--executor-memory 25g
--executor-cores 4
--queue eng-normal
--conf "spark.sql.shuffle.partitions=10001"
--conf "spark.yarn.executor.memoryOverhead=20480"
--conf "spark.dynamicAllocation.enabled=false"
--conf "spark.shuffle.service.enabled=false"
--packages ml.dmlc:xgboost4j-spark:0.82
--conf spark.kryoserializer.buffer.max=256m
The issue probably has something to do with the configuration settings. I look forward to any suggestions. Thanks!