The container's stderr output is shown below:
[2022-12-30 16:45:52.684]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1: 545764 Aborted (core dumped) /usr/local/jdk8/bin/java -server -Xmx15360m -Djava.io.tmpdir=/data/hdfs/data7/yarn/local/usercache/root/appcache/application_1658828757310_6328150/container_e05_1658828757310_6328150_02_000003/tmp '-Dspark.driver.port=33425' '-Dspark.ui.port=0' -Dspark.yarn.app.container.log.dir=/data/hdfs/yarn/logs/application_1658828757310_6328150/container_e05_1658828757310_6328150_02_000003 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@sh-bs-b1-303-i3-hadoop-128-245:33425 --executor-id 1 --hostname sh-bs-b1-303-i4-hadoop-129-4 --cores 10 --app-id application_1658828757310_6328150 --user-class-path file:/data/hdfs/data7/yarn/local/usercache/root/appcache/application_1658828757310_6328150/container_e05_1658828757310_6328150_02_000003/app.jar > /data/hdfs/yarn/logs/application_1658828757310_6328150/container_e05_1658828757310_6328150_02_000003/stdout 2> /data/hdfs/yarn/logs/application_1658828757310_6328150/container_e05_1658828757310_6328150_02_000003/stderr
Last 4096 bytes of stderr :
container_e05_1658828757310_6328150_02_000003/tmp/10-cache-33693313126262929727/train.sorted.col.page
[16:45:37] SparsePage::Writer Finished writing to /data/hdfs/data7/yarn/local/usercache/root/appcache/application_1658828757310_6328150/container_e05_1658828757310_6328150_02_000003/tmp/10-cache-07149317088523727121/train.sorted.col.page
[16:45:37] SparsePageSource: Finished writing to /data/hdfs/data7/yarn/local/usercache/root/appcache/application_1658828757310_6328150/container_e05_1658828757310_6328150_02_000003/tmp/10-cache-33693313126262929727/train
[16:45:37] SparsePageSource: Finished writing to /data/hdfs/data7/yarn/local/usercache/root/appcache/application_1658828757310_6328150/container_e05_1658828757310_6328150_02_000003/tmp/10-cache-07149317088523727121/train
[16:45:37] SparsePage::Writer Finished writing to /data/hdfs/data7/yarn/local/usercache/root/appcache/application_1658828757310_6328150/container_e05_1658828757310_6328150_02_000003/tmp/10-cache-4924047746409874369/train.sorted.col.page
[16:45:37] SparsePageSource: Finished writing to /data/hdfs/data7/yarn/local/usercache/root/appcache/application_1658828757310_6328150/container_e05_1658828757310_6328150_02_000003/tmp/10-cache-4924047746409874369/train
[16:45:37] SparsePage::Writer Finished writing to /data/hdfs/data7/yarn/local/usercache/root/appcache/application_1658828757310_6328150/container_e05_1658828757310_6328150_02_000003/tmp/10-cache-93775136737660731039/train.sorted.col.page
[16:45:37] SparsePageSource: Finished writing to /data/hdfs/data7/yarn/local/usercache/root/appcache/application_1658828757310_6328150/container_e05_1658828757310_6328150_02_000003/tmp/10-cache-93775136737660731039/train
[16:45:37] SparsePage::Writer Finished writing to /data/hdfs/data7/yarn/local/usercache/root/appcache/application_1658828757310_6328150/container_e05_1658828757310_6328150_02_000003/tmp/10-cache-85776801237113420567/train.sorted.col.page
[16:45:37] SparsePageSource: Finished writing to /data/hdfs/data7/yarn/local/usercache/root/appcache/application_1658828757310_6328150/container_e05_1658828757310_6328150_02_000003/tmp/10-cache-85776801237113420567/train
[16:45:37] SparsePage::Writer Finished writing to /data/hdfs/data7/yarn/local/usercache/root/appcache/application_1658828757310_6328150/container_e05_1658828757310_6328150_02_000003/tmp/10-cache-54855043875011110821/train.sorted.col.page
[16:45:37] SparsePageSource: Finished writing to /data/hdfs/data7/yarn/local/usercache/root/appcache/application_1658828757310_6328150/container_e05_1658828757310_6328150_02_000003/tmp/10-cache-54855043875011110821/train
[16:45:37] SparsePage::Writer Finished writing to /data/hdfs/data7/yarn/local/usercache/root/appcache/application_1658828757310_6328150/container_e05_1658828757310_6328150_02_000003/tmp/10-cache-22729313882414221673/train.sorted.col.page
[16:45:37] SparsePageSource: Finished writing to /data/hdfs/data7/yarn/local/usercache/root/appcache/application_1658828757310_6328150/container_e05_1658828757310_6328150_02_000003/tmp/10-cache-22729313882414221673/train
[16:45:37] SparsePage::Writer Finished writing to /data/hdfs/data7/yarn/local/usercache/root/appcache/application_1658828757310_6328150/container_e05_1658828757310_6328150_02_000003/tmp/10-cache-76094794478847063278/train.sorted.col.page
[16:45:37] SparsePageSource: Finished writing to /data/hdfs/data7/yarn/local/usercache/root/appcache/application_1658828757310_6328150/container_e05_1658828757310_6328150_02_000003/tmp/10-cache-76094794478847063278/train
Too many nodes went down and we cannot recover…
Too many nodes went down and we cannot recover…
Too many nodes went down and we cannot recover…
Too many nodes went down and we cannot recover…
Too many nodes went down and we cannot recover…
Too many nodes went down and we cannot recover…
Too many nodes went down and we cannot recover…
Too many nodes went down and we cannot recover…
Too many nodes went down and we cannot recover…
Too many nodes went down and we cannot recover…
pure virtual method called
terminate called without an active exception
22/12/30 16:45:54 INFO scheduler.DAGScheduler: Job 7 failed: foreachPartition at XGBoost.scala:452, took 21.636377 s
22/12/30 16:45:54 ERROR java.RabitTracker: Uncaught exception thrown by worker:
org.apache.spark.SparkException: Job 7 cancelled because SparkContext was shut down
at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:932)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:930)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:930)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:2128)
at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2041)
at org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1949)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1948)
at org.apache.spark.TaskFailedListener$$anon$1$$anonfun$run$1.apply$mcV$sp(SparkParallelismTracker.scala:131)
at org.apache.spark.TaskFailedListener$$anon$1$$anonfun$run$1.apply(SparkParallelismTracker.scala:131)
at org.apache.spark.TaskFailedListener$$anon$1$$anonfun$run$1.apply(SparkParallelismTracker.scala:131)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.TaskFailedListener$$anon$1.run(SparkParallelismTracker.scala:130)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:933)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:933)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$2$$anon$1.run(XGBoost.scala:452)
22/12/30 16:45:54 INFO scheduler.DAGScheduler: ResultStage 10 (foreachPartition at XGBoost.scala:452) failed in 21.630 s due to Stage cancelled because SparkContext was shut down
22/12/30 16:45:54 WARN scheduler.TaskSetManager: Lost task 9.0 in stage 10.0 (TID 45, sh-bs-b1-303-i4-hadoop-129-4, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container from a bad node: container_e05_1658828757310_6328150_02_000003 on host: sh-bs-b1-303-i4-hadoop-129-4. Exit status: 134. Diagnostics: 25 --executor-id 1 --hostname sh-bs-b1-303-i4-hadoop-129-4 --cores 10 --app-id application_1658828757310_6328150 --user-class-path file:/data/hdfs/data7/yarn/local/usercache/root/appcache/application_1658828757310_6328150/container_e05_1658828757310_6328150_02_000003/app.jar > /data/hdfs/yarn/logs/application_1658828757310_6328150/container_e05_1658828757310_6328150_02_000003/stdout 2> /data/hdfs/yarn/logs/application_1658828757310_6328150/container_e05_1658828757310_6328150_02_000003/stderr
Last 4096 bytes of stderr :
[identical to the stderr excerpt shown above: the same SparsePage cache writes, repeated "Too many nodes went down and we cannot recover…" lines, then "pure virtual method called" / "terminate called without an active exception"]
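For context on where this fails: the `foreachPartition at XGBoost.scala:452` frame in the stack trace is inside XGBoost4J-Spark's `trainDistributed`, and the `10-cache-*/train.sorted.col.page` files in stderr indicate the job is training in external-memory mode. Below is a minimal sketch of a training call of this shape, assuming the XGBoost4J-Spark classifier API; all paths, column names, and parameter values are illustrative assumptions, not the actual job:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

val spark = SparkSession.builder().appName("xgb-train").getOrCreate()

// Assumed input: a DataFrame with a vector "features" column and a numeric "label" column.
val trainDF: DataFrame = spark.read.parquet("hdfs:///path/to/train") // hypothetical path

val params = Map(
  "eta" -> 0.1,
  "max_depth" -> 6,
  "objective" -> "binary:logistic",
  "num_round" -> 100,
  "num_workers" -> 10,           // one Rabit worker per Spark task
  "use_external_memory" -> true  // spills SparsePage caches to executor-local dirs, as seen in the log
)

val model = new XGBoostClassifier(params)
  .setFeaturesCol("features")
  .setLabelCol("label")
  .fit(trainDF)
```

When one executor's native XGBoost worker aborts (exit code 134 is 128 + SIGABRT; the "pure virtual method called" message comes from the C++ runtime just before the abort), the surviving Rabit workers print "Too many nodes went down and we cannot recover…", and XGBoost4J-Spark's TaskFailedListener shuts down the SparkContext, which is exactly the "Job 7 cancelled because SparkContext was shut down" stack trace above.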