Hi XGBoost developers, I have two questions about the Spark failure settings for XGBoost training jobs that use XGBoost4J:
- Can `spark.task.maxFailures` be set to a number larger than 1?
- If so, does the Spark blacklisting feature (`spark.blacklist.enabled=true`) work smoothly with XGBoost training?

According to this GitHub issue comment: https://github.com/dmlc/xgboost/issues/3348#issuecomment-394798445, it seems that spark.task.maxFailures cannot be set to a number greater than 1. In my experiments, however, jobs run smoothly with spark.task.maxFailures > 1. With spark.task.maxFailures set to 1, training fails from time to time, but after setting it to 4, the failure (a crash of the training job) no longer happens.
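
For anyone who wants to reproduce the setup, here is a minimal sketch of how the two settings in question might be passed to `spark-submit`. The helper name `build_spark_submit_args` is hypothetical and only for illustration. The blacklisting property is written as `spark.blacklist.enabled`, the name used in Spark's configuration documentation. The application jar and main class are left out on purpose.

```python
# Hypothetical helper (not part of Spark or XGBoost): it only assembles
# the --conf arguments for the two settings discussed in this post.
def build_spark_submit_args(max_failures, blacklist_enabled):
    conf = {
        # Number of failures of a single task tolerated before the job is aborted.
        "spark.task.maxFailures": str(max_failures),
        # Spark's blacklisting switch; Spark expects lowercase true/false.
        "spark.blacklist.enabled": str(blacklist_enabled).lower(),
    }
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # The values from the experiment described above: maxFailures raised to 4.
    print(" ".join(build_spark_submit_args(4, True)))
```

The sketch only shows how the configuration would be supplied. It says nothing about whether XGBoost training actually tolerates task retries or blacklisting, which is the open question here.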