XGBoost 2.0.3 + Dask: "Multiple processes within communication group running on same CUDA device is not supported."

OS: Ubuntu 18.04
Hardware: one PC with two GPUs, 8 GB VRAM each
Python: 3.10.13 with xgboost 2.0.3 and dask 2024.1.1

[part of code]

    import dask.distributed
    import xgboost as xgb

    # X_train and y_train are Dask collections defined earlier (not shown).
    cluster = dask.distributed.LocalCluster()
    client = dask.distributed.Client(cluster)
    dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)
    output = xgb.dask.train(
        client,
        {"verbosity": 2, "tree_method": "hist", "device": "cuda", "objective": "reg:squarederror"},
        dtrain,
        evals=[(dtrain, "train")],
    )

[What happened]
The following error occurs at the xgb.dask.train() call. No error occurs when "device": "cuda" is omitted from the parameter dictionary passed to train().

[18:14:16] /home/conda/feedstock_root/build_artifacts/xgboost-split_1705650282415/work/src/collective/nccl_device_communicator.cu:40: Check failed: n_uniques == world_size_ (1 vs. 4) : Multiple processes within communication group running on same CUDA device is not supported. 9e1253dbb8c3fe1928e2fed0d04a63d5
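The "(1 vs. 4)" in the message reads as: 4 processes joined the communication group, but only 1 distinct CUDA device was found among them. A rough pure-Python sketch of that kind of check is below. It is only an illustration of the idea, not XGBoost's actual implementation; the function name check_device_assignment and the device labels are made up for the example.

```python
def check_device_assignment(device_ids):
    """Return (ok, n_uniques, world_size) for a list of per-worker device IDs.

    device_ids[i] is the identifier of the GPU that worker i is using.
    The assignment is valid only if every worker has a different GPU.
    """
    world_size = len(device_ids)      # number of worker processes
    n_uniques = len(set(device_ids))  # number of distinct GPUs among them
    return n_uniques == world_size, n_uniques, world_size


# Four workers that all default to the same GPU: the case in the log above.
print(check_device_assignment(["gpu0", "gpu0", "gpu0", "gpu0"]))

# Two workers, each on its own GPU: the layout the check expects.
print(check_device_assignment(["gpu0", "gpu1"]))
```

Under this reading, the number of workers has to match the number of distinct GPUs they actually use, not just the number of GPUs installed.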

dask.config.config shows a dictionary. I suspect some of its values need to be changed for my hardware configuration, following https://docs.dask.org/en/latest/configuration.html#directly-within-python, but I have no idea which keys they are.

Please help me out.

I should have set n_workers=1, regardless of threads_per_worker.
I also tried n_workers=2, since I have 2 GPUs, but got the same error. Maybe I should use something like LocalCUDACluster() instead of LocalCluster().

Yes, you should use LocalCUDACluster instead of LocalCluster. LocalCUDACluster ensures that each worker is assigned exactly one GPU.
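A minimal sketch of that change, assuming the dask-cuda package is installed (it provides LocalCUDACluster) and that X_train and y_train are Dask collections as in the question. The helper name train_on_all_gpus and the default num_boost_round=10 are illustrative choices, not taken from the original post.

```python
# Parameters copied from the question; "device": "cuda" enables GPU training.
PARAMS = {
    "verbosity": 2,
    "tree_method": "hist",
    "device": "cuda",
    "objective": "reg:squarederror",
}


def train_on_all_gpus(X_train, y_train, num_boost_round=10):
    """Train with one Dask worker per GPU (hypothetical helper)."""
    # Imported inside the function so this module can still be loaded on a
    # machine that has no GPU or no dask-cuda installation.
    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster
    from xgboost import dask as dxgb

    # By default LocalCUDACluster starts one worker per visible GPU,
    # so no two workers share a device.
    with LocalCUDACluster() as cluster, Client(cluster) as client:
        dtrain = dxgb.DaskDMatrix(client, X_train, y_train)
        output = dxgb.train(
            client,
            PARAMS,
            dtrain,
            num_boost_round=num_boost_round,
            evals=[(dtrain, "train")],
        )
    # train() returns a dict holding the trained booster and the eval history.
    return output["booster"], output["history"]
```

On the two-GPU machine described above, calling train_on_all_gpus(X_train, y_train) should start two workers, one per GPU, without passing n_workers explicitly.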
