Fault tolerance of distributed XGBoost

Hi community,

To test the fault tolerance of distributed XGBoost in a Yarn cluster, is there a way to forcefully kill a node and see if the node recovers? Any way would be helpful (from inside C++, from Python, or through some Yarn command).

In the latest XGBoost, single-point recovery has been removed due to lack of maintenance. Please use checkpoints instead.

So, if I want to check whether checkpointing works, how can I simulate the behaviour of a node going down?

Let me elaborate on what I mean by checkpointing. You’d generate one checkpoint for each boosting iteration. If one of the nodes goes down, you would resume the training job from the last saved checkpoint.

You can test checkpointing by attempting to resume a training job from the last saved checkpoint. As for simulating the behavior of a node going down in a Yarn cluster, I suggest that you post on Stack Overflow or other forums.
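
On the checkpointing side, a minimal sketch with the Python package could look like this (untested; it assumes the `TrainingCallback` API available in roughly XGBoost 1.3 and later, and the directory and file names are just placeholders):

```python
import os

import numpy as np
import xgboost as xgb


class PerIterationCheckpoint(xgb.callback.TrainingCallback):
    """Write one checkpoint per boosting iteration."""

    def __init__(self, ckpt_dir):
        super().__init__()
        self.ckpt_dir = ckpt_dir
        os.makedirs(ckpt_dir, exist_ok=True)

    def after_iteration(self, model, epoch, evals_log):
        # `model` is the Booster being trained; save it after every iteration.
        model.save_model(os.path.join(self.ckpt_dir, f"iter_{epoch:04d}.json"))
        return False  # returning False keeps training going


# Toy data just to make the sketch runnable.
X = np.random.rand(1000, 10)
y = np.random.randint(2, size=1000)
dtrain = xgb.DMatrix(X, label=y)

bst = xgb.train(
    {"objective": "binary:logistic", "max_depth": 3},
    dtrain,
    num_boost_round=50,
    callbacks=[PerIterationCheckpoint("./checkpoints")],
)
```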

Just one doubt. How is this different from single-point recovery?

What I mean by “single-point recovery” is that a previously dead worker restarts and attempts to recover its data from the other remaining workers. Single-point recovery would thus let us avoid stopping the training job.

With the current approach, on the other hand, the death of a worker causes the whole training job to stop.

Single-point recovery is quite attractive on paper, but we found it difficult to maintain and to ensure its correctness. Therefore, we no longer support it.

So if the death of a worker stops the whole training job, will checkpointing save the binary model as a file up to the last completed boosting round so that it can be reloaded?

Yes, that is correct.
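
Resuming from that saved file could then look roughly like this (again only a sketch; it reuses the checkpoint naming from the sketch above, and relies on the `xgb_model` argument of `xgb.train`, which accepts a path to a saved model):

```python
import glob

import numpy as np
import xgboost as xgb

# Same toy data as in the previous sketch; a real job would rebuild the
# DMatrix from the original training data.
X = np.random.rand(1000, 10)
y = np.random.randint(2, size=1000)
dtrain = xgb.DMatrix(X, label=y)

# Pick the newest checkpoint written before the job died.
latest = sorted(glob.glob("./checkpoints/iter_*.json"))[-1]

# Continue boosting on top of the saved model: the new rounds are
# appended to the trees already stored in the checkpoint.
bst = xgb.train(
    {"objective": "binary:logistic", "max_depth": 3},
    dtrain,
    num_boost_round=20,   # however many rounds are still left to run
    xgb_model=latest,     # resume from the last saved checkpoint
)
```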

I had a doubt about checkpointing. In CLITrain, why are there two calls to CheckPoint during boosting?

After UpdateOneIter call: https://github.com/dmlc/xgboost/blob/ad1a52770938b9d7c6eb03b95ca77dba83c63e28/src/cli_main.cc#L235
After EvalOneIter call: https://github.com/dmlc/xgboost/blob/ad1a52770938b9d7c6eb03b95ca77dba83c63e28/src/cli_main.cc#L259

This is actually an artifact of fault-tolerant AllReduce (aka “single-point recovery”), which we no longer support. So checkpointing once in each iteration should be sufficient.
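
As a rough Python analogue of that CLI loop (not the actual CLI code, just a sketch with toy data): one update, an optional evaluation, then a single checkpoint per iteration.

```python
import os

import numpy as np
import xgboost as xgb

os.makedirs("./checkpoints", exist_ok=True)

# Toy data to keep the sketch self-contained.
X = np.random.rand(1000, 10)
y = np.random.randint(2, size=1000)
dtrain = xgb.DMatrix(X, label=y)

bst = xgb.Booster({"objective": "binary:logistic", "max_depth": 3}, [dtrain])

for i in range(50):
    bst.update(dtrain, i)                # boost one iteration (UpdateOneIter)
    print(bst.eval(dtrain, "train", i))  # evaluate (EvalOneIter)
    bst.save_model(f"./checkpoints/iter_{i:04d}.json")  # one checkpoint per iteration
```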