Hello,

I’m trying to get distributed training of XGBoost working with rabit via MPI, but I’m having trouble getting the library to detect that I’m in distributed mode. Specifically, I’m using the tracker submit script to run the following (the launch command is shown after the script):
import os

import numpy as np
import xgboost
from pytorch_lightning import seed_everything

if __name__ == "__main__":
    seed_everything(seed=0)

    # Rank and world size as reported by Open MPI.
    LOCAL_RANK = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
    WORLD_SIZE = int(os.environ["OMPI_COMM_WORLD_SIZE"])
    WORLD_RANK = int(os.environ["OMPI_COMM_WORLD_RANK"])

    # Initialize rabit and compare its view of the world with the MPI env vars.
    xgboost.rabit.init()
    rank = xgboost.rabit.get_rank()
    world_size = xgboost.rabit.get_world_size()
    print(
        f"Initialized XGBoost RABIT rank {rank} / {world_size} (from env: rank {WORLD_RANK} / {WORLD_SIZE})."
    )

    print(f"Creating DMatrix on rank {rank} / {world_size}.")
    # In-memory data: random features and random 3-class labels.
    train_data = xgboost.DMatrix(
        data=np.random.randn(1000, 19), label=np.random.randint(0, 3, (1000))
    )

    xgboost.rabit.finalize()
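For reference, the launch command looks roughly like this (assuming the dmlc-core tracker layout; the worker count and script name are placeholders for my setup):

dmlc-core/tracker/dmlc-submit --cluster=mpi --num-workers=4 python train.py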
With this script, I don’t see this line in the output:

XGBoost distributed mode detected, will split data among workers

but when I load the training data from a text file, I do see it. For example, if I run:
import os

import numpy as np
import xgboost
from pytorch_lightning import seed_everything

if __name__ == "__main__":
    seed_everything(seed=0)

    LOCAL_RANK = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
    WORLD_SIZE = int(os.environ["OMPI_COMM_WORLD_SIZE"])
    WORLD_RANK = int(os.environ["OMPI_COMM_WORLD_RANK"])

    xgboost.rabit.init()
    rank = xgboost.rabit.get_rank()
    world_size = xgboost.rabit.get_world_size()
    print(
        f"Initialized XGBoost RABIT rank {rank} / {world_size} (from env: rank {WORLD_RANK} / {WORLD_SIZE})."
    )

    print(f"Creating DMatrix on rank {rank} / {world_size}.")
    # Loading from a text file instead of in-memory arrays.
    train_data = xgboost.DMatrix("train_data.txt")

    xgboost.rabit.finalize()
where train_data.txt is from here.
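In case the file format matters: as far as I understand, a text DMatrix like this is parsed as LibSVM by default on the XGBoost version I’m using, i.e. lines of "label index:value ..." such as (illustrative values only):

1 0:0.71 3:1.25 7:0.30
0 1:0.94 4:2.10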
Does anyone know why this happens?
Thanks in advance!