Multiple processes within communication group running on same CUDA device is not supported

Hello,

I want to perform hyper-parameter tuning of XGBoost using multiple GPUs in an HPC environment, but I keep running into the error "Multiple processes within communication group running on same CUDA device is not supported.". I use a LocalCUDACluster with n_workers=4, and for the hyper-parameter optimization I use the fmin() function from the hyperopt library. I am sure that each worker is assigned to a different GPU device:

{'tcp://127.0.0.1:37381': 'GPU-944354e8-8a62-3c0e-5adf-12711434804c', 'tcp://127.0.0.1:38521': 'GPU-44899cee-c31c-01a4-f045-eda8ede3e876', 'tcp://127.0.0.1:41189': 'GPU-a4f9c888-6cd5-82d3-c83a-cea4d747e035', 'tcp://127.0.0.1:46841': 'GPU-546f8d0d-c530-7e1c-3dcc-8b308b63b2b2'}
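
For context, here is a minimal sketch of my setup (simplified; the toy data, search space, and objective below are illustrative placeholders, not my exact code):

```python
import dask.array as da
import xgboost as xgb
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from hyperopt import fmin, hp, tpe

# One Dask worker per GPU.
cluster = LocalCUDACluster(n_workers=4)
client = Client(cluster)

# Toy regression data so the sketch is self-contained
# (in the real code the data would live on the GPUs, e.g. via dask_cudf).
X = da.random.random((100_000, 20), chunks=(25_000, 20))
y = da.random.random(100_000, chunks=25_000)
dtrain = xgb.dask.DaskDMatrix(client, X, y)

def objective(params):
    # Each hyperopt trial launches one distributed GPU training run.
    output = xgb.dask.train(
        client,
        {"tree_method": "hist", "device": "cuda", **params},
        dtrain,
        num_boost_round=50,
        evals=[(dtrain, "train")],
    )
    return output["history"]["train"]["rmse"][-1]

space = {
    "max_depth": hp.choice("max_depth", [4, 6, 8]),
    "eta": hp.uniform("eta", 0.01, 0.3),
}
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=20)
```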

When I check the workers' information, I see that only one worker actually utilizes its GPU:

Worker 3

Worker tcp://127.0.0.1:37381: info {'type': 'Worker', 'id': 3, 'host': '127.0.0.1', 'resources': {}, 'local_directory': '/scratch/tmp/3170541/dask-scratch-space/worker-y2b1_t5p', 'name': 3, 'nthreads': 1, 'memory_limit': 131072000000, 'last_seen': 1718727438.2354257, 'services': {'dashboard': 40993}, 'metrics': {'task_counts': {'memory': 15, 'error': 4, 'released': 2}, 'bandwidth': {'total': 2894637538.1045012, 'workers': {}, 'types': {}}, 'digests_total_since_heartbeat': {'latency': 0.0014104843139648438, 'profile-duration': 0.0010442733764648438, 'tick-duration': 0.5005083084106445, ('execute', 'dispatched_train', 'thread-cpu', 'seconds'): 10.324267772999974, ('execute', 'dispatched_train', 'thread-noncpu', 'seconds'): 0.19892153325918116, ('execute', 'dispatched_train', 'executor', 'seconds'): 0.0003533677663654089, ('execute', 'dispatched_train', 'other', 'seconds'): 0.0007818789454177022, ('get-data', 'serialize', 'seconds'): 0.020052231033332646, ('get-data', 'compress', 'seconds'): 5.982117727398872e-06}, 'managed_bytes': 18070586399, 'spilled_bytes': {'memory': 0, 'disk': 0}, 'transfer': {'incoming_bytes': 0, 'incoming_count': 0, 'incoming_count_total': 314, 'outgoing_bytes': 22715, 'outgoing_count': 1, 'outgoing_count_total': 848}, 'event_loop_interval': 0.020012521743774415, 'cpu': 100.0, 'memory': 6550106112, 'time': 1718727437.7344222, 'host_net_io': {'read_bps': 45878.502118908735, 'write_bps': 45558.582295896085}, 'host_disk_io': {'read_bps': 0.0, 'write_bps': 65519.579752992}, 'gil_contention': 0.0007561793900094926, 'num_fds': 79, 'gpu_utilization': 0, 'gpu_memory_used': 315949056}, 'status': 'running', 'nanny': 'tcp://127.0.0.1:37373'}

Worker 1

Worker tcp://127.0.0.1:38521: info {'type': 'Worker', 'id': 1, 'host': '127.0.0.1', 'resources': {}, 'local_directory': '/scratch/tmp/3170541/dask-scratch-space/worker-n2sel_1o', 'name': 1, 'nthreads': 1, 'memory_limit': 131072000000, 'last_seen': 1718727438.1410418, 'services': {'dashboard': 36741}, 'metrics': {'task_counts': {'memory': 1}, 'bandwidth': {'total': 100000000, 'workers': {}, 'types': {}}, 'digests_total_since_heartbeat': {'latency': 0.0012814998626708984, 'tick-duration': 0.4997572898864746}, 'managed_bytes': 368640000, 'spilled_bytes': {'memory': 0, 'disk': 0}, 'transfer': {'incoming_bytes': 0, 'incoming_count': 0, 'incoming_count_total': 0, 'outgoing_bytes': 0, 'outgoing_count': 0, 'outgoing_count_total': 114}, 'event_loop_interval': 0.01999349594116211, 'cpu': 0.0, 'memory': 1228947456, 'time': 1718727437.6411247, 'host_net_io': {'read_bps': 45660.679341810486, 'write_bps': 45340.63252467772}, 'host_disk_io': {'read_bps': 0.0, 'write_bps': 65545.58814879073}, 'gil_contention': 0.000758473586756736, 'num_fds': 77, 'gpu_utilization': 0, 'gpu_memory_used': 315949056}, 'status': 'running', 'nanny': 'tcp://127.0.0.1:46075'}

Worker 0

Worker tcp://127.0.0.1:41189: info {'type': 'Worker', 'id': 0, 'host': '127.0.0.1', 'resources': {}, 'local_directory': '/scratch/tmp/3170541/dask-scratch-space/worker-e70ktjrg', 'name': 0, 'nthreads': 1, 'memory_limit': 131072000000, 'last_seen': 1718727438.234162, 'services': {'dashboard': 46785}, 'metrics': {'task_counts': {'memory': 6, 'error': 4, 'released': 1}, 'bandwidth': {'total': 100000000, 'workers': {}, 'types': {}}, 'digests_total_since_heartbeat': {'latency': 0.0013418197631835938, 'tick-duration': 0.5002224445343018}, 'managed_bytes': 1844641228, 'spilled_bytes': {'memory': 0, 'disk': 0}, 'transfer': {'incoming_bytes': 0, 'incoming_count': 0, 'incoming_count_total': 1, 'outgoing_bytes': 0, 'outgoing_count': 0, 'outgoing_count_total': 434}, 'event_loop_interval': 0.01999287128448486, 'cpu': 4.0, 'memory': 1178628096, 'time': 1718727437.733471, 'host_net_io': {'read_bps': 45911.88407954344, 'write_bps': 45591.731477550544}, 'host_disk_io': {'read_bps': 0.0, 'write_bps': 65567.25288814466}, 'gil_contention': 0.0007572172326035798, 'num_fds': 78, 'gpu_utilization': 93, 'gpu_memory_used': 8766357504}, 'status': 'running', 'nanny': 'tcp://127.0.0.1:41639'}

Worker 2

Worker tcp://127.0.0.1:46841: info {'type': 'Worker', 'id': 2, 'host': '127.0.0.1', 'resources': {}, 'local_directory': '/scratch/tmp/3170541/dask-scratch-space/worker-m3uqd6x9', 'name': 2, 'nthreads': 1, 'memory_limit': 131072000000, 'last_seen': 1718727437.7701666, 'services': {'dashboard': 38307}, 'metrics': {'task_counts': {'memory': 1}, 'bandwidth': {'total': 100000000, 'workers': {}, 'types': {}}, 'digests_total_since_heartbeat': {'latency': 0.0016438961029052734, 'tick-duration': 0.49962377548217773}, 'managed_bytes': 160000, 'spilled_bytes': {'memory': 0, 'disk': 0}, 'transfer': {'incoming_bytes': 0, 'incoming_count': 0, 'incoming_count_total': 0, 'outgoing_bytes': 0, 'outgoing_count': 0, 'outgoing_count_total': 261}, 'event_loop_interval': 0.02000694751739502, 'cpu': 0.0, 'memory': 845303808, 'time': 1718727437.2693243, 'host_net_io': {'read_bps': 53149.34204102666, 'write_bps': 52708.58455868561}, 'host_disk_io': {'read_bps': 0.0, 'write_bps': 65648.82355159869}, 'gil_contention': 0.0013746030163019896, 'num_fds': 77, 'gpu_utilization': 0, 'gpu_memory_used': 315949056}, 'status': 'running', 'nanny': 'tcp://127.0.0.1:37861'}
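
Roughly how these per-worker dumps can be produced (a minimal sketch using client.scheduler_info(); my actual logging code may differ slightly):

```python
# Dump per-worker metadata (including the GPU metrics above) from the scheduler,
# reusing the `client` from the setup sketch earlier in this post.
workers = client.scheduler_info()["workers"]
for addr, info in workers.items():
    print(f"Worker {info['name']}")
    print(f"Worker {addr}: info {info}")
```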

ERROR

ERROR:root:Error occurred: [17:39:23] /apps/ACC/ANACONDA/2023.07/envs/ESP2/modules/xgboost-2.0.3/src/collective/nccl_device_communicator.cu:40: Check failed: n_uniques == world_size_ (1 vs. 2) : Multiple processes within communication group running on same CUDA device is not supported. d382d56c88c8f9a435e047d7a4ce3ac8