[Environment]
OS: Ubuntu 18.04
Hardware: one PC with two GPU boards, 8 GB of VRAM each
Python: 3.10.13 with xgboost 2.0.3 & dask 2024.1.1
[Part of the code]
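import dask.distributed
import xgboost as xgb
import xgboost.dask  # ensures the xgb.dask submodule is loaded

# X_train and y_train are defined earlier (not shown)
cluster = dask.distributed.LocalCluster()
client = dask.distributed.Client(cluster)
dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)
output = xgb.dask.train(
    client,
    {"verbosity": 2, "tree_method": "hist", "device": "cuda", "objective": "reg:squarederror"},
    dtrain,
    num_boost_round=4,
    evals=[(dtrain, "train")],
)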
[What happened]
The following error is raised at the xgb.dask.train() call. No error occurs when "device": "cuda" is removed from the parameter dictionary passed to train().
[18:14:16] /home/conda/feedstock_root/build_artifacts/xgboost-split_1705650282415/work/src/collective/nccl_device_communicator.cu:40: Check failed: n_uniques == world_size_ (1 vs. 4) : Multiple processes within communication group running on same CUDA device is not supported.
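My reading of the message is that the number of distinct CUDA devices in use (1) is compared with the number of worker processes (4), so all four workers appear to have ended up on the same GPU. The snippet below is only a plain-Python illustration of that comparison, not XGBoost's actual code; the function name devices_are_unique and the integer device IDs are hypothetical.

```python
def devices_are_unique(worker_devices):
    """Mimic the 'n_uniques == world_size_' comparison from the error message.

    worker_devices holds one (hypothetical) device ID per worker process.
    Returns (ok, n_uniques, world_size).
    """
    world_size = len(worker_devices)       # number of worker processes
    n_uniques = len(set(worker_devices))   # number of distinct devices in use
    return n_uniques == world_size, n_uniques, world_size


# Assumed assignment matching the error: four workers, all on device 0.
print(devices_are_unique([0, 0, 0, 0]))  # (False, 1, 4)

# One worker per GPU on a two-GPU machine would satisfy the comparison.
print(devices_are_unique([0, 1]))  # (True, 2, 2)
```

If this reading is right, what I need is a way to start one worker per GPU, with each worker bound to a different device.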
[Question]
dask.config.config is a dictionary. I suspect that some of its values need to be changed to match my hardware configuration, following https://docs.dask.org/en/latest/configuration.html#directly-within-python, but I have no idea which keys those are.
Please help me out.