Distributed XGBoost -- standalone?

jinliangwei · August 9, 2018, 10:52pm

As a quite inexperienced xgboost user, I found documentation on distributed xgboost using YARN and there’s also mentioning about xgboost + Spark but not much else.

Is it possible to use distributed xgboost “standalone”? Maybe with dmlc-submit --cluster ssh?

I’ve also seen xgboost can run on MPI. Does it just mean rabit uses MPI instead of directly using TCP sockets? Should I expect a performance difference between MPI vs. no-MPI?

Is xgboost + Spark the “recommended” way for distributing xgboost?

Thanks!

thvasilo · August 9, 2018, 11:43pm

Hello Jinliang,

In my experience the easiest way to launch distributed XGBoost clusters is to use EMR to launch a YARN cluster and then install XGBoost manually and launch jobs through YARN.

Setting up manual SSH access would be a real pain with permissions etc.

I have a couple scripts to automate the setup, I’ll try to post them soon to make launching ad-hoc clusters easier.

If you only want to run XGBoost jobs I’d stay use standalone XGBoost through YARN and skip the Spark layer.

–Theo

codyyang · August 10, 2018, 1:49am

@jinliangwei If you really want to run distributed xgboost through ssh, you could refer to the implementation of ‘dmlc-submit --cluster local’.

Running xgboost on MPI means xgboost uses the MPI implementation of allreduce and broadcast, which doesn’t have the ‘checkpoint’ mechanism supported by rabit. As to performance, I can’t say anything.

jinliangwei · August 10, 2018, 2:29am

Thanks. I don’t really mind setting up SSH, etc – I’ve done it many times before. The dollar cost on EMR would be a bigger concern for me right now.

jinliangwei · August 10, 2018, 2:41am

Thanks all. @codyyang @thvasilo I guess I didn’t make myself clear before. Launching processes on distributed machines isn’t a problem for me. What I meant to ask was – if I actually use SSH to launch processes, do I need to provide any additional arguments or environment variables to xgboost?

codyyang · August 10, 2018, 3:36am

@jinliangwei In order to run distributed xgboost, you should start a rabit tracker, provide the ip and port of your rabit tracker to xgboost instance through environment variables rabit_tracker_uri and rabit_tracker_port.

There 's a implementation of rabit tracker in python.

jinliangwei · August 10, 2018, 7:19am

Great, thank you!

But did you actually mean dmlc-submit --cluster ssh when you said dmlc-submit --cluster local? I thought local was for lauching local jobs.

codyyang · August 10, 2018, 7:55am

I’m afraid there’s no ‘ssh’ support for the --cluster option.
You could read the code of the ‘local’ version, make a little change to it, to support ssh connection.

jinliangwei · August 10, 2018, 6:36pm

Thank you! There is already a ssh version. I had to add ssh to submit.py, but it seems to work just fine.

Regarding xgboost on MPI, I thought xgboost always uses rabit and rabit has an MPI engine. Does xgboost on MPI mean it uses rabit’s MPI engine? And do you mean that rabit’s MPI engine doesn’t support checkpointing?

codyyang · August 13, 2018, 1:14am

AFAIK, rabit is a ‘tailored’ version of MPI, it only implements the allreduce and broadcast API of MPI. Xgboost could use both rabit and MPI interchangably as the underlying data sync engine, but with rabit only, you got the additional ‘checkpoint’ supporting.