Spark distributed training not utilizing all resources

jerdavis · June 8, 2019, 3:38pm

Xgboost 0.90.
Scala Library.
EMR 5.23.0

Given an arbitrary cluster (Lets say 4 node, 32 cores each.)
I’m noticing that no matter what I set for these variables (for example):
spark.executor.cores (32)
spark.executor.instances (4)
spark.task.cpus (1)
num_workers (64)
nthreads (1)

I see Rabbit start with --num_workers=64, I see “INFO @tracker All of 64 nodes getting started”
But looking at Spark I only see one or two active tasks, and Ganglia shows only a couple active cores.
I’ve tried switching around num_workers(4) and nthreads(16), but get the same end result (low activity)

I’ve tried various dataset sizes (from 100MB to 10GB), and various hyperparameters.

I’m expecting that some combination of parameters is going to light up more cores, but I can’t seem to get there. Are my expectations wrong?

jerdavis · June 8, 2019, 5:02pm

I guess chalk this up to user error for now… Task tracker definitely only shows 1-2 tasks, but I can get the cpu utilization to go up on the nodes by adjusting num_workers…

hq.jacob · June 11, 2019, 5:48am

@jerdavis, as my understanding, num_workers should be the number of executors to execute the training task, so it should be less than or equal to your assigned number of executors.

kharrmat · June 14, 2019, 1:54pm

Total-Cluster CPU: #worker_nodes * (32 - 1 ) CPUs = 3 * 31 = 93 CPUs
- The master won’t perform any computation
- 32 -1, because 1 CPU will be reserved for the OS of the node
spark.task.cpus =1 => each task will use 1 CPU
At maximum your cluster can perform 93 tasks simultaneously (each executor 31)
If you want to use all your CPUs, ensure that your data have enough (> 90) partitions, so that each task works on one partition
Look at dataset/dafaframe functions like repartition(), …