Distributed Mode Detection with RABIT/MPI

dthiagarajan · November 10, 2022, 4:02pm

Hello,

I’m trying to get distributed training of XGBoost working with rabit via MPI, but I’m having some trouble getting the library to detect that I’m in distributed mode. Specifically, I’m using the tracker submit script to run the following:

import os

import numpy as np
import xgboost
from pytorch_lightning import seed_everything

if __name__ == "__main__":
    seed_everything(seed=0)
    LOCAL_RANK = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
    WORLD_SIZE = int(os.environ["OMPI_COMM_WORLD_SIZE"])
    WORLD_RANK = int(os.environ["OMPI_COMM_WORLD_RANK"])

    xgboost.rabit.init()
    rank = xgboost.rabit.get_rank()
    world_size = xgboost.rabit.get_world_size()
    print(
        f"Initialized XGBoost RABIT rank {rank} / {world_size} (from env: rank {WORLD_RANK} / {WORLD_SIZE})."
    )
    print(f"Creating DMatrix on rank {rank} / {world_size}.")
    train_data = xgboost.DMatrix(
        data=np.random.randn(1000, 19), label=np.random.randint(0, 3, (1000))
    )
    xgboost.rabit.finalize()

With this script, I notice that I don’t see this line:

XGBoost distributed mode detected, will split data among workers

but when load training data from a text file, I do see that line, e.g. if I run:

import os

import numpy as np
import xgboost
from pytorch_lightning import seed_everything

if __name__ == "__main__":
    seed_everything(seed=0)
    LOCAL_RANK = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
    WORLD_SIZE = int(os.environ["OMPI_COMM_WORLD_SIZE"])
    WORLD_RANK = int(os.environ["OMPI_COMM_WORLD_RANK"])

    xgboost.rabit.init()
    rank = xgboost.rabit.get_rank()
    world_size = xgboost.rabit.get_world_size()

    print(
        f"Initialized XGBoost RABIT rank {rank} / {world_size} (from env: rank {WORLD_RANK} / {WORLD_SIZE})."
    )
    print(f"Creating DMatrix on rank {rank} / {world_size}.")
    train_data = xgboost.DMatrix("train_data.txt)
    xgboost.rabit.finalize()

where train_data.txt is from here.

Does anyone know why this happens?

Thanks in advance!

dthiagarajan · November 11, 2022, 10:17pm

Looks like the logging message that gets printed saying that distributed mode is detected is only applicable for text files, but the actual model still communicates even if you load data from arrays directly.

I found this out by saving out the model from each process, and seeing that they were all the same.