[jvm-packages] [spark] Is it okay to have spark.task.maxFailures > 1 in the Spark job?


#1

Hi XGBoost developers, I have two questions regarding the Spark failure-handling settings for an XGBoost training job that uses XGBoost4J:

  1. Can spark.task.maxFailures be set to a number larger than 1?
  2. If so, does the Spark blacklisting feature (spark.blacklist.enabled=true) work smoothly with XGBoost training?

According to this Github issue: https://github.com/dmlc/xgboost/issues/3348#issuecomment-394798445, it seems that we cannot set spark.task.maxFailures to a number > 1, but in my experiment, it looks like jobs go smoothly with spark.task.maxFailures > 1. If I set spark.task.maxFailures to 1, training failures happens from time to time, but after it is set to 4, the failure (crush of the training job) does not happen anymore.