Strange GPU memory failure with dask + xgboost

I encountered a strange GPU memory error.

Here is the error (from just one of the Dask workers):
```
File "/***/xgboost/", line 433, in inner_f
    return f(**kwargs)
File "/***/xgboost/", line 1695, in fit
    return self._client_sync(self._fit_async, **args)
File "/***/xgboost/", line 1592, in _client_sync
    return self.client.sync(func, **kwargs, asynchronous=asynchronous)
File "/***/distributed/", line 309, in sync
    return sync(
File "/***/distributed/", line 376, in sync
    raise exc.with_traceback(tb)
File "/***/distributed/", line 349, in f
    result = yield future
File "/***/tornado/", line 762, in run
    value = future.result()
File "/***/xgboost/", line 1653, in _fit_async
    results = await self.client.sync(
File "/***/xgboost/", line 914, in _train_async
    results = await client.gather(futures, asynchronous=True)
File "/***/distributed/", line 2030, in _gather
    raise exception.with_traceback(traceback)
File "/***/xgboost/", line 870, in dispatched_train
    bst = worker_train(params=local_param,
File "/***/xgboost/", line 191, in train
    bst = _train_internal(params, dtrain,
File "/***/xgboost/", line 82, in _train_internal
    bst.update(dtrain, i, obj)
File "/***/xgboost/", line 1496, in update
File "/***/xgboost/", line 210, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [14:50:36] /opt/conda/envs/rapids/conda-bld/xgboost_1619020640907/work/src/tree/ Exception in gpu_hist: [14:50:36] /opt/conda/envs/rapids/conda-bld/xgboost_1619020640907/work/src/c_api/../data/../common/device_helpers.cuh:414: Memory allocation error on worker 2: Caching allocator

- Free memory: 7269908480
- Requested memory: 8825801215

Stack trace:
[bt] (0) /***/lib/ [0x7f1c46bad872]
[bt] (1) /***/lib/<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long)+0x395) [0x7f1c46bb1985]
[bt] (2) /***/lib/ thrust::cuda_cub::sort<thrust::detail::execute_with_allocator<dh::detail::XGBCachingDeviceAllocatorImpl&, thrust::cuda_cub::execute_on_stream_base>, thrust::detail::normal_iterator<thrust::device_ptr<xgboost::Entry> >, xgboost::common::detail::EntryCompareOp>(thrust::cuda_cub::execution_policy<thrust::detail::execute_with_allocator<dh::detail::XGBCachingDeviceAllocatorImpl&, thrust::cuda_cub::execute_on_stream_base> >&, thrust::detail::normal_iterator<thrust::device_ptr<xgboost::Entry> >, thrust::detail::normal_iterator<thrust::device_ptr<xgboost::Entry> >, xgboost::common::detail::EntryCompareOp)+0xcde) [0x7f1c46bc7cde]
[bt] (3) /***/lib/, xgboost::MetaInfo const&, xgboost::SparsePage const&, unsigned long, unsigned long, xgboost::common::SketchContainer*, int, unsigned long)+0x18e) [0x7f1c46bbf9de]
[bt] (4) /***/lib/, xgboost::DMatrix*, int, unsigned long)+0x68e) [0x7f1c46bc11ee]
[bt] (5) /***/lib/, xgboost::BatchParam const&)+0x3ba) [0x7f1c46c20aea]
[bt] (6) /***/lib/, xgboost::BatchParam const&)+0x2e) [0x7f1c46c20f6e]
[bt] (7) /***/lib/ const&)+0xa8) [0x7f1c46a23098]
[bt] (8) /***/lib/<xgboost::detail::GradientPairInternal >::InitDataOnce(xgboost::DMatrix*)+0xf2) [0x7f1c46d70dd2]

Stack trace:
[bt] (0) /***/lib/ [0x7f1c46d5f3f2]
[bt] (1) /***/lib/<xgboost::detail::GradientPairInternal >::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >*, xgboost::DMatrix*, std::vector<xgboost::RegTree*, std::allocator<xgboost::RegTree*> > const&)+0x703) [0x7f1c46d7b723]
[bt] (2) /***/lib/<xgboost::detail::GradientPairInternal >*, xgboost::DMatrix*, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> > > >)+0x1ac) [0x7f1c46a59dfc]
[bt] (3) /***/lib/, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >*, xgboost::PredictionCacheEntry*)+0x4c6) [0x7f1c46a5c7b6]
[bt] (4) /***/lib/ [0x7f1c46a70210]
[bt] (5) /***/lib/ [0x7f1c46968efe]
[bt] (6) /***/lib/python3.8/lib-dynload/../../ [0x7f32974429dd]
[bt] (7) /***/lib/python3.8/lib-dynload/../../ [0x7f3297442067]
[bt] (8) /***/lib/python3.8/lib-dynload/ [0x7f329745ad39]
```
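For reference, the two byte counts in the allocator message are both well under the card's 16 GB; a quick conversion (my own arithmetic, not part of the log):

```python
# Convert the byte counts from the allocator message to GiB.
free_bytes = 7_269_908_480       # "Free memory" from the error
requested_bytes = 8_825_801_215  # "Requested memory" from the error

GIB = 1024 ** 3
print(f"free:      {free_bytes / GIB:.2f} GiB")       # ~6.77 GiB
print(f"requested: {requested_bytes / GIB:.2f} GiB")  # ~8.22 GiB
```

So the caching allocator believes only ~6.8 GiB is free while the sort needs ~8.2 GiB.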

Data set:

  • one training set: 4M rows × 4K columns (all float)
  • four test sets: 3.4M rows × 4K columns in total (all float)

On disk, the binary DMatrix data is 122 GB.
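As a rough check on those sizes (my own back-of-envelope arithmetic, assuming 4-byte floats):

```python
# Back-of-envelope size of the raw data, assuming 4-byte floats.
train_bytes = 4_000_000 * 4_000 * 4  # one training set: 4M rows x 4K cols
test_bytes = 3_400_000 * 4_000 * 4   # four test sets combined: 3.4M rows x 4K cols

GB = 10 ** 9
print(f"train: {train_bytes / GB:.1f} GB")                 # 64.0 GB
print(f"test:  {test_bytes / GB:.1f} GB")                  # 54.4 GB
print(f"total: {(train_bytes + test_bytes) / GB:.1f} GB")  # 118.4 GB
```

The ~118 GB total is consistent with the 122 GB binary DMatrix on disk.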


Hardware:

  • two GPU servers, each with 4 GPU cards, 80 CPU vcores, and 1.5 TB RAM
  • each card is an NVIDIA Tesla V100-SXM2 16 GB, with NVIDIA driver 418.67 and CUDA 10.1


Software:

  • miniconda 4.9.2 with Python 3.8.11
  • xgboost 1.4.0dev.rapidsai0.19 (cuda10.1 py38_0 build, installed from the conda rapids channel)
  • dask + distributed 2022.03.0 (installed from the conda rapids channel)

I say it is strange because:

  • the ~7 GB reported in “Free memory: 7269908480” is wrong: all GPU cards are actually empty, so the full 16 GB should be available.
  • if I switch back to conda 4.3.1 + Python 3.6 + xgboost 1.1.0 (pip installed) + dask 2.19.0 (pip installed, distributed included), the error disappears, and the job only needs ~70 GB of GPU RAM in total.
  • if I reduce the data to 10% of its size, memory is fine, but training runs 2–3 times slower.
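To double-check the free-memory claim, something like the following can be run on every Dask worker (a sketch; it assumes the `pynvml` package is installed, and the scheduler address is a hypothetical placeholder):

```python
def to_gib(nbytes):
    """Bytes -> GiB, for readable reporting."""
    return nbytes / 1024 ** 3

def gpu_free_gib():
    """Query NVML for the free memory of each GPU visible to this process."""
    import pynvml  # assumes the pynvml package is installed
    pynvml.nvmlInit()
    try:
        return {
            i: to_gib(pynvml.nvmlDeviceGetMemoryInfo(
                pynvml.nvmlDeviceGetHandleByIndex(i)).free)
            for i in range(pynvml.nvmlDeviceGetCount())
        }
    finally:
        pynvml.nvmlShutdown()

# On the client, run the check on every worker:
# from dask.distributed import Client
# client = Client("scheduler-address:8786")  # hypothetical address
# print(client.run(gpu_free_gib))            # free GiB per GPU, per worker
```

If NVML reports ~16 GiB free everywhere, the ~6.8 GiB figure would have to come from the caching allocator's own bookkeeping rather than from the device.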

Is this because the xgboost 1.4.0 build bundles its own CUDA, which then conflicts with the system CUDA/driver? Or is there some other cause?