Network configuration of the rabit library for Federated XGBoost on Amazon EC2 instances

luckystarufo · November 13, 2022, 12:15am

Hello all,

I’ve recently tried to deploy the Federated XGBoost framework on Amazon EC2 instances (Ubuntu 18.04). There are three instances (one server and two clients), each in a different region. I’ve configured the network so that the three instances can ‘talk to each other’ over ssh.

However, when running the algorithm, the clients get stuck at the subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) call in remote_server.py and print the following lines to stdout:

retry connect to ip(retry time 1): [172.30.0.113]
retry connect to ip(retry time 2): [172.30.0.113]
retry connect to ip(retry time 3): [172.30.0.113]
retry connect to ip(retry time 4): [172.30.0.113]
connect to (failed): [172.30.0.113]
Socket Connection Error: Connection refused, shutting down …
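
One way to narrow this down is to check, from a client instance, whether a plain TCP connection to the address in the log succeeds at all. Working ssh only shows that port 22 is reachable, and 172.30.0.113 falls in the private 172.16.0.0/12 range. The following is a minimal sketch using only the Python standard library; the helper name can_connect and the port number 9091 are placeholders, not values taken from the framework, so the real port would have to be read from the actual configuration.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and unreachable hosts
        return False


# Example call (9091 is a hypothetical port, replace it with the real one):
# print(can_connect("172.30.0.113", 9091))
```

If this returns False for the port the clients are trying to reach while ssh works, the problem is more likely in the security group or routing setup than in the library itself.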

After digging a bit deeper, I see that the error stems from the C++ implementation of the rabit library, which is now merged into dmlc/xgboost.

So my question is: how does the rabit library handle the communication here, and what can I do to fix or test it? Are there any other network settings I need to configure so that FedXGBoost can run on AWS? Thanks!