[jvm-packages][spark] XGBoost training on 3T data failed with exit code 255

I am trying to use XGBoost on a really big dataset (~3 TB), and I have tried different spark-submit parameter combinations but failed to get results. The training went smoothly, with MAE results printing, but it failed at the end with exit code 255. I'm not sure what that means exactly; could someone help with this?

Spark-submit command:

spark-submit \
  --master yarn \
  --deploy-mode client \
  --queue ${SPARK_YARN_QUEUE} \
  --driver-cores 4 \
  --driver-memory 16g \
  --num-executors 1000 \
  --executor-cores 3 \
  --executor-memory 8g \
  --conf spark.task.cpus=3 \
  --conf spark.yarn.executor.memoryOverhead=12g \
  --conf spark.debug.maxToStringFields=1500 \
  --conf spark.network.timeout=2000000 \
  --conf spark.executor.heartbeatInterval=1000000 \
  --conf spark.memory.fraction=0.6 \
  --conf spark.memory.storageFraction=0.2 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1000 \
  --conf spark.dynamicAllocation.executorIdleTimeout=3600s \
  --conf spark.default.parallelism=1000 \
  --conf spark.sql.shuffle.partitions=1000 \
  --conf spark.shuffle.service.enabled=true \
  --class com.isf.menasor.Menasor \
  bin/menasor-jar-with-dependencies.jar \
  ${property}

Part of the log I got:

19/04/15 13:47:30 INFO TaskSetManager: Finished task 550.0 in stage 20.1 (TID 26219) in 1975626 ms on BJHTYD-Tyrande-158-101.hadoop.jd.local (executor 535) (591/1000)
19/04/15 13:47:30 INFO TaskSetManager: Finished task 180.0 in stage 20.1 (TID 25849) in 1975662 ms on BJHTYD-Tyrande-148-136.hadoop.jd.local (executor 558) (592/1000)
19/04/15 13:47:30 ERROR YarnScheduler: Lost executor 702 on BJHTYD-Tyrande-110-70.hadoop.jd.local: Container marked as failed: container_e02_1533628320510_19787367_01_000703 on host: BJHTYD-Tyrande-110-70.hadoop.jd.local. Exit status: 255. Diagnostics: Exception from container-launch.
Container id: container_e02_1533628320510_19787367_01_000703
Exit code: 255
Stack trace: org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: Launch container failed
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.launchContainer(DefaultLinuxContainerRuntime.java:111)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.launchContainer(DelegatingLinuxContainerRuntime.java:102)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:381)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:319)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:85)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Shell output: main : command provided 1
main : run as user is yarn
main : requested yarn user is ads_aof
Getting exit code file…
Creating script paths…
Writing pid file…
Writing to tmp file /data8/yarn1/local/nmPrivate/application_1533628320510_19787367/container_e02_1533628320510_19787367_01_000703/container_e02_1533628320510_19787367_01_000703.pid.tmp
Writing to cgroup task files…
Creating local dirs…
Launching container…
Getting exit code file…
Creating script paths…

Container exited with a non-zero exit code 255

19/04/15 13:47:30 WARN TaskSetManager: Lost task 68.0 in stage 20.1 (TID 25737, BJHTYD-Tyrande-110-70.hadoop.jd.local, executor 702): ExecutorLostFailure (executor 702 exited caused by one of the running tasks) Reason: Container marked as failed: container_e02_1533628320510_19787367_01_000703 on host: BJHTYD-Tyrande-110-70.hadoop.jd.local. Exit status: 255. Diagnostics: Exception from container-launch.
(stack trace and shell output omitted; identical to the container diagnostics above)


https://github.com/dmlc/xgboost/issues/3462 may be related.

Thanks! I tried the methods mentioned: reducing num_partitions by reducing the worker number, and adding spark.network.timeout=10000000 and spark.executor.heartbeatInterval=10000000 as suggested in the Stack Overflow answer, but neither worked.

And this is really tricky, since the error message doesn't appear every time; sometimes I do get the model when I'm lucky :). I'm not sure whether this is related to the fact that I'm running on a YARN cluster where resources aren't always stable. Any idea on the root cause and a workaround for this issue?

Thanks a lot for your help!

Does it mean that some executors randomly go down? XGBoost crashes if one of the executors assigned to it dies.
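
If occasional executor loss can't be avoided on your cluster, checkpointing may at least limit the damage. A minimal sketch, assuming your xgboost4j-spark version supports the checkpoint_path / checkpoint_interval parameters (trainingDF and the HDFS path are placeholders, not from your job):

import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor

val xgb = new XGBoostRegressor(Map(
  "objective" -> "reg:linear",   // regression with MAE reporting, as in your run
  "eval_metric" -> "mae",
  "num_round" -> 500,
  "num_workers" -> 1000,
  // placeholder path: any durable location the executors can reach works
  "checkpoint_path" -> "hdfs:///tmp/xgb-checkpoints",
  "checkpoint_interval" -> 10    // persist the booster every 10 rounds
))
// On a retry after a crash, training can resume from the last saved
// checkpoint instead of starting over from round 0.
val model = xgb.fit(trainingDF)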

Yes, I saw several tasks die and executors get lost after the training loop finished and the job started to save the model result.
But I'm not sure what the root cause is, since this only happens when I train on a really large dataset.

This error occurs when the cluster's resources are not enough, generally when the partitions are too big (bigger than the executors' resources) or the executors' memory isn't enough.

According to your settings:

  • spark.task.cpus=3 => each task needs 3 CPUs
  • executor-cores 3 => NUMBER_TASK_PER_EXECUTOR = executor-cores / spark.task.cpus, i.e., 1 in your case
  • num-executors 1000 => you should have 1000 * 1 = 1000 tasks per stage, i.e., the number of partitions of your data (println(myDataFrame.rdd.partitions.size)); see the sketch after this list
  • you can't use --conf spark.dynamicAllocation.enabled=true and --num-executors 1000 in the same submit: the latter keeps the number of executors constant, so you won't use the maximum capacity of your cluster, whereas the former adds executors when needed
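
A quick way to verify the partition count (a sketch in Scala; myDataFrame stands for your training DataFrame, and 1000 mirrors the settings above):

// Check how many partitions the training data actually has.
// This should line up with the expected 1000 tasks per stage.
println(myDataFrame.rdd.partitions.size)

// If it doesn't line up, repartition before handing the data to
// XGBoost so that each task gets a reasonably sized partition:
val numWorkers = 1000
val repartitioned = myDataFrame.repartition(numWorkers)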

Some ideas that could help:

executor-cores 6
executor-memory 20g
spark.task.cpus=3
spark.dynamicAllocation.minExecutors = #ClusterNodes * 2 (for example)
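
Put together, the submit options might look like this (a sketch using only the flags already discussed; the minExecutors value assumes roughly 64 cluster nodes, so adjust to yours):

spark-submit \
  --master yarn \
  --deploy-mode client \
  --executor-cores 6 \
  --executor-memory 20g \
  --conf spark.task.cpus=3 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=128 \
  --conf spark.shuffle.service.enabled=true \
  ...

Note that --num-executors is dropped here, since it conflicts with dynamic allocation as explained above.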

worker number = [(#WORKER_NODE_CPU - 1) * #WORKER_NODES] / spark.task.cpus; #WORKER_NODE_CPU - 1 because 1 CPU is reserved for the machine's OS.
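For example (illustrative numbers only): with 64 worker nodes of 48 cores each and spark.task.cpus=3, that gives (48 - 1) * 64 / 3 ≈ 1002 workers.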

What type of machines (CPU and memory) do you have, and how many worker nodes? That would give a better idea of the cluster's resources.