XGBoost 2.0.3 + Dask = "Multiple processes within communication group running on same CUDA device is not supported."

[Environment]
OS: Ubuntu 18.04
Hardware: 1 PC with 2 GPU boards, 8 GB VRAM each
Python: 3.10.13 with xgboost 2.0.3 & dask 2024.1.1

[Part of code]

    import dask.distributed
    import xgboost as xgb

    cluster = dask.distributed.LocalCluster()
    client = dask.distributed.Client(cluster)
    dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)
    output = xgb.dask.train(
        client,
        {"verbosity": 2, "tree_method": "hist", "device": "cuda", "objective": "reg:squarederror"},
        dtrain,
        num_boost_round=4,
        evals=[(dtrain, "train")],
    )

[What happened]
The following error occurs at the call to xgb.dask.train(). No error occurs if "device": "cuda" is removed from the parameter dictionary passed to train().

[18:14:16] /home/conda/feedstock_root/build_artifacts/xgboost-split_1705650282415/work/src/collective/nccl_device_communicator.cu:40: Check failed: n_uniques == world_size_ (1 vs. 4) : Multiple processes within communication group running on same CUDA device is not supported. 9e1253dbb8c3fe1928e2fed0d04a63d5

[Question]
dask.config.config shows a dictionary. I suspect some of its values need to be modified for my hardware configuration, following https://docs.dask.org/en/latest/configuration.html#directly-within-python, but I have no idea which keys to change.

Please help me out.

Perhaps I should have set n_workers=1 regardless of how many threads_per_worker there are.
I set n_workers=2 since I have 2 GPUs, but I got the same error. Maybe I should use LocalCUDACluster() or something instead of LocalCluster().
Thanks,

Yes, you should use LocalCUDACluster instead of LocalCluster. LocalCUDACluster ensures that each worker is assigned exactly one GPU.
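The check failure n_uniques == world_size_ (1 vs. 4) means the default LocalCluster spawned four workers that all ended up on the same CUDA device, whereas LocalCUDACluster starts one worker per visible GPU, so the NCCL world size matches the number of distinct devices. For reference, here is a minimal sketch of the same training call on top of LocalCUDACluster, assuming the dask-cuda package is installed and that X_train / y_train are the arrays from your snippet:

    import xgboost as xgb
    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster

    # One worker per visible GPU, so each NCCL rank gets its own CUDA device.
    cluster = LocalCUDACluster()  # with 2 GPUs this yields 2 workers
    client = Client(cluster)

    dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)
    output = xgb.dask.train(
        client,
        {"verbosity": 2, "tree_method": "hist", "device": "cuda", "objective": "reg:squarederror"},
        dtrain,
        num_boost_round=4,
        evals=[(dtrain, "train")],
    )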
