Understanding the Spark/EMR performance settings


I don’t think I fully understand the entire picture, can anyone help me fill in the details or tell me where I am wrong? I would really appreciate it.

Let’s say I’m using an m4.10xlarge instance for my Spark jobs. Looking at this website – https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.DBInstanceClass.html it tells me that each machine has 20 cores, 2 threads each.

  1. Each machine really has 40 virtual CPUs and that’s exactly what spark.task.cpus refers to? So I can set spark.task.cpus=40 and nthread to be anything in [1, 40]?

  2. If that’s the case, let’s say I have executor-cores=20 (I’m assuming I can’t set it to 40, this is really the physical cores that the flag refers to) and spark.task.cpus = 40, then what if I set nthread to 10. Does this mean that the executor will only take advantage of 1/4 of its cpu power? Should I set num_workers to 4 to train 4 trees in parallel?

  3. Lets say E executors can train a single model in parallel. Can I then extend this to K * E executors and set parallelism to K during grid search?

  4. Finally, how many partittions should I repartition my training set into given all of this?