[jvm-packages][spark] XGBoost training on 3T data fails with exit code 255



I am trying to run XGBoost on a very large dataset (~3T). I have tried different combinations of spark-submit parameters, but none of them produced results. Training runs smoothly and prints MAE results, but it fails at the end with exit code 255. I am not sure what exactly this means. Could someone help with this?

spark-submit command:

spark-submit \
--master yarn \
--deploy-mode client \
--queue ${SPARK_YARN_QUEUE} \
--driver-cores 4 \
--driver-memory 16g \
--num-executors 1000 \
--executor-cores 3 \
--executor-memory 8g \
--conf spark.task.cpus=3 \
--conf spark.yarn.executor.memoryOverhead=12g \
--conf spark.debug.maxToStringFields=1500 \
--conf spark.network.timeout=2000000 \
--conf spark.executor.heartbeatInterval=1000000 \
--conf spark.memory.fraction=0.6 \
--conf spark.memory.storageFraction=0.2 \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.minExecutors=1000 \
--conf spark.dynamicAllocation.executorIdleTimeout=3600s \
--conf spark.default.parallelism=1000 \
--conf spark.sql.shuffle.partitions=1000 \
--conf spark.shuffle.service.enabled=true \
--class com.isf.menasor.Menasor \
bin/menasor-jar-with-dependencies.jar \
${property}
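
For reference, here is a rough sketch of what these settings ask YARN for. It assumes each executor container is sized as executor memory plus memoryOverhead, and it ignores the driver and any rounding to the scheduler's allocation increment. The variable names are only for illustration.

```shell
#!/usr/bin/env bash
# Values copied from the spark-submit command above.
EXECUTOR_MEMORY_GB=8      # --executor-memory 8g
MEMORY_OVERHEAD_GB=12     # spark.yarn.executor.memoryOverhead=12g
NUM_EXECUTORS=1000        # --num-executors 1000
EXECUTOR_CORES=3          # --executor-cores 3
TASK_CPUS=3               # spark.task.cpus=3

# Assumption: container size = heap + overhead (no scheduler rounding).
CONTAINER_GB=$((EXECUTOR_MEMORY_GB + MEMORY_OVERHEAD_GB))
TOTAL_GB=$((CONTAINER_GB * NUM_EXECUTORS))

# Each task reserves TASK_CPUS cores, so this is the number of
# tasks that can run at the same time on one executor.
TASKS_PER_EXECUTOR=$((EXECUTOR_CORES / TASK_CPUS))
CONCURRENT_TASKS=$((TASKS_PER_EXECUTOR * NUM_EXECUTORS))

echo "memory per executor container: ${CONTAINER_GB} GB"
echo "memory across all executors:   ${TOTAL_GB} GB"
echo "tasks per executor:            ${TASKS_PER_EXECUTOR}"
echo "concurrent tasks overall:      ${CONCURRENT_TASKS}"
```

So each executor runs a single task at a time, which matches the 1000 tasks per stage in the log below.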

Part of the log I got:

19/04/15 13:47:30 INFO TaskSetManager: Finished task 550.0 in stage 20.1 (TID 26219) in 1975626 ms on BJHTYD-Tyrande-158-101.hadoop.jd.local (executor 535) (591/1000)
19/04/15 13:47:30 INFO TaskSetManager: Finished task 180.0 in stage 20.1 (TID 25849) in 1975662 ms on BJHTYD-Tyrande-148-136.hadoop.jd.local (executor 558) (592/1000)
19/04/15 13:47:30 ERROR YarnScheduler: Lost executor 702 on BJHTYD-Tyrande-110-70.hadoop.jd.local: Container marked as failed: container_e02_1533628320510_19787367_01_000703 on host: BJHTYD-Tyrande-110-70.hadoop.jd.local. Exit status: 255. Diagnostics: Exception from container-launch.
Container id: container_e02_1533628320510_19787367_01_000703
Exit code: 255
Stack trace: org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: Launch container failed
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.launchContainer(DefaultLinuxContainerRuntime.java:111)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.launchContainer(DelegatingLinuxContainerRuntime.java:102)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:381)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:319)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:85)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Shell output: main : command provided 1
main : run as user is yarn
main : requested yarn user is ads_aof
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /data8/yarn1/local/nmPrivate/application_1533628320510_19787367/container_e02_1533628320510_19787367_01_000703/container_e02_1533628320510_19787367_01_000703.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
Getting exit code file...
Creating script paths...

Container exited with a non-zero exit code 255

19/04/15 13:47:30 WARN TaskSetManager: Lost task 68.0 in stage 20.1 (TID 25737, BJHTYD-Tyrande-110-70.hadoop.jd.local, executor 702): ExecutorLostFailure (executor 702 exited caused by one of the running tasks) Reason: Container marked as failed: container_e02_1533628320510_19787367_01_000703 on host: BJHTYD-Tyrande-110-70.hadoop.jd.local. Exit status: 255. Diagnostics: Exception from container-launch.
Container id: container_e02_1533628320510_19787367_01_000703
Exit code: 255
Stack trace: org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: Launch container failed
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.launchContainer(DefaultLinuxContainerRuntime.java:111)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.launchContainer(DelegatingLinuxContainerRuntime.java:102)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:381)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:319)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:85)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Shell output: main : command provided 1
main : run as user is yarn
main : requested yarn user is ads_aof
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /data8/yarn1/local/nmPrivate/application_1533628320510_19787367/container_e02_1533628320510_19787367_01_000703/container_e02_1533628320510_19787367_01_000703.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
Getting exit code file...
Creating script paths...

Container exited with a non-zero exit code 255
