Fault tolerance of distributed XGBoost

Hi community,

To test the fault tolerance of distributed XGBoost in a Yarn cluster, is there a way to forcefully kill a node and see if the node recovers? Any way would be helpful (from inside C++, from Python, or through some Yarn command).

In the latest XGBoost, single-point recovery has been removed due to lack of maintenance. Please use checkpoints instead.

So, if I want to check whether checkpointing works, how can I simulate the behaviour of a node going down?

Let me elaborate on what I mean by checkpointing. You’d generate one checkpoint for each boosting iteration. If one of the nodes goes down, you would resume the training job from the last saved checkpoint.

You can test checkpointing by attempting to resume a training job from the last saved checkpoint. As for simulating the behavior of a node going down in a Yarn cluster, I suggest that you post on Stack Overflow or other forums.
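
On the checkpointing side, a minimal sketch with the Python package could look like this (untested; it assumes the `TrainingCallback` API available in roughly XGBoost 1.3 and later, and the directory and file names are just placeholders):

```python
import os

import numpy as np
import xgboost as xgb


class PerIterationCheckpoint(xgb.callback.TrainingCallback):
    """Write one checkpoint per boosting iteration."""

    def __init__(self, ckpt_dir):
        super().__init__()
        self.ckpt_dir = ckpt_dir
        os.makedirs(ckpt_dir, exist_ok=True)

    def after_iteration(self, model, epoch, evals_log):
        # `model` is the Booster being trained; save it after every iteration.
        model.save_model(os.path.join(self.ckpt_dir, f"iter_{epoch:04d}.json"))
        return False  # returning False keeps training going


# Toy data just to make the sketch runnable.
X = np.random.rand(1000, 10)
y = np.random.randint(2, size=1000)
dtrain = xgb.DMatrix(X, label=y)

bst = xgb.train(
    {"objective": "binary:logistic", "max_depth": 3},
    dtrain,
    num_boost_round=50,
    callbacks=[PerIterationCheckpoint("./checkpoints")],
)
```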

Just one doubt. How is this different from single-point recovery?

What I mean by “single-point recovery” is that a previously dead worker restarts and attempts to recover its data from the other remaining workers. Single-point recovery would thus let us avoid stopping the training job.

With the current approach, on the other hand, the death of a worker causes the whole training job to stop.

Single-point recovery is quite attractive on paper, but we found it difficult to maintain and to ensure its correctness. Therefore, we no longer support it.

So if the death of a worker stops the whole training job, will checkpointing save the binary model as a file up to the last completed boosting round so that it can be reloaded?

Yes, that is correct.
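
Resuming from that saved file could then look roughly like this (again only a sketch; it reuses the checkpoint naming from the sketch above, and relies on the `xgb_model` argument of `xgb.train`, which accepts a path to a saved model):

```python
import glob

import numpy as np
import xgboost as xgb

# Same toy data as in the previous sketch; a real job would rebuild the
# DMatrix from the original training data.
X = np.random.rand(1000, 10)
y = np.random.randint(2, size=1000)
dtrain = xgb.DMatrix(X, label=y)

# Pick the newest checkpoint written before the job died.
latest = sorted(glob.glob("./checkpoints/iter_*.json"))[-1]

# Continue boosting on top of the saved model: the new rounds are
# appended to the trees already stored in the checkpoint.
bst = xgb.train(
    {"objective": "binary:logistic", "max_depth": 3},
    dtrain,
    num_boost_round=20,   # however many rounds are still left to run
    xgb_model=latest,     # resume from the last saved checkpoint
)
```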

I had a doubt about checkpointing. In CLITrain, why are there two calls to CheckPoint during boosting?

After UpdateOneIter call: https://github.com/dmlc/xgboost/blob/ad1a52770938b9d7c6eb03b95ca77dba83c63e28/src/cli_main.cc#L235
After EvalOneIter call: https://github.com/dmlc/xgboost/blob/ad1a52770938b9d7c6eb03b95ca77dba83c63e28/src/cli_main.cc#L259

This is actually an artifact of fault-tolerant AllReduce (aka “single-point recovery”), which we no longer support. So checkpointing once in each iteration should be sufficient.
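
As a rough Python analogue of that CLI loop (not the actual CLI code, just a sketch with toy data): one update, an optional evaluation, then a single checkpoint per iteration.

```python
import os

import numpy as np
import xgboost as xgb

os.makedirs("./checkpoints", exist_ok=True)

# Toy data to keep the sketch self-contained.
X = np.random.rand(1000, 10)
y = np.random.randint(2, size=1000)
dtrain = xgb.DMatrix(X, label=y)

bst = xgb.Booster({"objective": "binary:logistic", "max_depth": 3}, [dtrain])

for i in range(50):
    bst.update(dtrain, i)                # boost one iteration (UpdateOneIter)
    print(bst.eval(dtrain, "train", i))  # evaluate (EvalOneIter)
    bst.save_model(f"./checkpoints/iter_{i:04d}.json")  # one checkpoint per iteration
```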