Distributed XGBoost -- standalone?

As a quite inexperienced xgboost user, I found documentation on distributed xgboost using YARN and there’s also mentioning about xgboost + Spark but not much else.

Is it possible to use distributed xgboost “standalone”? Maybe with dmlc-submit --cluster ssh?

I’ve also seen xgboost can run on MPI. Does it just mean rabit uses MPI instead of directly using TCP sockets? Should I expect a performance difference between MPI vs. no-MPI?

Is xgboost + Spark the “recommended” way for distributing xgboost?

Thanks!

Hello Jinliang,

In my experience the easiest way to launch distributed XGBoost clusters is to use EMR to launch a YARN cluster and then install XGBoost manually and launch jobs through YARN.

Setting up manual SSH access would be a real pain with permissions etc.

I have a couple scripts to automate the setup, I’ll try to post them soon to make launching ad-hoc clusters easier.

If you only want to run XGBoost jobs I’d stay use standalone XGBoost through YARN and skip the Spark layer.

–Theo

2 Likes

@jinliangwei If you really want to run distributed xgboost through ssh, you could refer to the implementation of ‘dmlc-submit --cluster local’.

Running xgboost on MPI means xgboost uses the MPI implementation of allreduce and broadcast, which doesn’t have the ‘checkpoint’ mechanism supported by rabit. As to performance, I can’t say anything.

Thanks. I don’t really mind setting up SSH, etc – I’ve done it many times before. The dollar cost on EMR would be a bigger concern for me right now. :stuck_out_tongue:

Thanks all. @codyyang @thvasilo I guess I didn’t make myself clear before. Launching processes on distributed machines isn’t a problem for me. What I meant to ask was – if I actually use SSH to launch processes, do I need to provide any additional arguments or environment variables to xgboost?

@jinliangwei In order to run distributed xgboost, you should start a rabit tracker, provide the ip and port of your rabit tracker to xgboost instance through environment variables rabit_tracker_uri and rabit_tracker_port.

There 's a implementation of rabit tracker in python.

1 Like

Great, thank you!

But did you actually mean dmlc-submit --cluster ssh when you said dmlc-submit --cluster local? I thought local was for lauching local jobs.

I’m afraid there’s no ‘ssh’ support for the --cluster option.
You could read the code of the ‘local’ version, make a little change to it, to support ssh connection.

Thank you! There is already a ssh version. I had to add ssh to submit.py, but it seems to work just fine.

Regarding xgboost on MPI, I thought xgboost always uses rabit and rabit has an MPI engine. Does xgboost on MPI mean it uses rabit’s MPI engine? And do you mean that rabit’s MPI engine doesn’t support checkpointing?

AFAIK, rabit is a ‘tailored’ version of MPI, it only implements the allreduce and broadcast API of MPI. Xgboost could use both rabit and MPI interchangably as the underlying data sync engine, but with rabit only, you got the additional ‘checkpoint’ supporting.