Strange memory failure on dask + GPU

I encountered a strange GPU memory error.

Here is the error info (from just one of the dask workers):
File "/***/xgboost/core.py", line 433, in inner_f
    return f(**kwargs)
File "/***/xgboost/dask.py", line 1695, in fit
    return self._client_sync(self._fit_async, **args)
File "/***/xgboost/dask.py", line 1592, in _client_sync
    return self.client.sync(func, **kwargs, asynchronous=asynchronous)
File "/***/distributed/utils.py", line 309, in sync
    return sync(
File "/***/distributed/utils.py", line 376, in sync
    raise exc.with_traceback(tb)
File "/***/distributed/utils.py", line 349, in f
    result = yield future
File "/***/tornado/gen.py", line 762, in run
    value = future.result()
File "/***/xgboost/dask.py", line 1653, in _fit_async
    results = await self.client.sync(
File "/***/xgboost/dask.py", line 914, in _train_async
    results = await client.gather(futures, asynchronous=True)
File "/***/distributed/client.py", line 2030, in _gather
    raise exception.with_traceback(traceback)
File "/***/xgboost/dask.py", line 870, in dispatched_train
    bst = worker_train(params=local_param,
File "/***/xgboost/training.py", line 191, in train
    bst = _train_internal(params, dtrain,
File "/***/xgboost/training.py", line 82, in _train_internal
    bst.update(dtrain, i, obj)
File "/***/xgboost/core.py", line 1496, in update
    _check_call(_LIB.XGBoosterUpdateOneIter(self.handle,
File "/***/xgboost/core.py", line 210, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [14:50:36] /opt/conda/envs/rapids/conda-bld/xgboost_1619020640907/work/src/tree/updater_gpu_hist.cu:793: Exception in gpu_hist: [14:50:36] /opt/conda/envs/rapids/conda-bld/xgboost_1619020640907/work/src/c_api/../data/../common/device_helpers.cuh:414: Memory allocation error on worker 2: Caching allocator

  • Free memory: 7269908480
  • Requested memory: 8825801215

Stack trace:
[bt] (0) /***/lib/libxgboost.so(+0x39c872) [0x7f1c46bad872]
[bt] (1) /***/lib/libxgboost.so(dh::detail::ThrowOOMError(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long)+0x395) [0x7f1c46bb1985]
[bt] (2) /***/lib/libxgboost.so(void thrust::cuda_cub::sort<thrust::detail::execute_with_allocator<dh::detail::XGBCachingDeviceAllocatorImpl&, thrust::cuda_cub::execute_on_stream_base>, thrust::detail::normal_iterator<thrust::device_ptr<xgboost::Entry> >, xgboost::common::detail::EntryCompareOp>(thrust::cuda_cub::execution_policy<thrust::detail::execute_with_allocator<dh::detail::XGBCachingDeviceAllocatorImpl&, thrust::cuda_cub::execute_on_stream_base> >&, thrust::detail::normal_iterator<thrust::device_ptr<xgboost::Entry> >, thrust::detail::normal_iterator<thrust::device_ptr<xgboost::Entry> >, xgboost::common::detail::EntryCompareOp)+0xcde) [0x7f1c46bc7cde]
[bt] (3) /***/lib/libxgboost.so(xgboost::common::ProcessBatch(int, xgboost::MetaInfo const&, xgboost::SparsePage const&, unsigned long, unsigned long, xgboost::common::SketchContainer*, int, unsigned long)+0x18e) [0x7f1c46bbf9de]
[bt] (4) /***/lib/libxgboost.so(xgboost::common::DeviceSketch(int, xgboost::DMatrix*, int, unsigned long)+0x68e) [0x7f1c46bc11ee]
[bt] (5) /***/lib/libxgboost.so(xgboost::EllpackPageImpl::EllpackPageImpl(xgboost::DMatrix*, xgboost::BatchParam const&)+0x3ba) [0x7f1c46c20aea]
[bt] (6) /***/lib/libxgboost.so(xgboost::EllpackPage::EllpackPage(xgboost::DMatrix*, xgboost::BatchParam const&)+0x2e) [0x7f1c46c20f6e]
[bt] (7) /***/lib/libxgboost.so(xgboost::data::SimpleDMatrix::GetEllpackBatches(xgboost::BatchParam const&)+0xa8) [0x7f1c46a23098]
[bt] (8) /***/lib/libxgboost.so(xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal >::InitDataOnce(xgboost::DMatrix*)+0xf2) [0x7f1c46d70dd2]

Stack trace:
[bt] (0) /***/lib/libxgboost.so(+0x54e3f2) [0x7f1c46d5f3f2]
[bt] (1) /***/lib/libxgboost.so(xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal >::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >*, xgboost::DMatrix*, std::vector<xgboost::RegTree*, std::allocator<xgboost::RegTree*> > const&)+0x703) [0x7f1c46d7b723]
[bt] (2) /***/lib/libxgboost.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >*, xgboost::DMatrix*, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> > > >*)+0x1ac) [0x7f1c46a59dfc]
[bt] (3) /***/lib/libxgboost.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >*, xgboost::PredictionCacheEntry*)+0x4c6) [0x7f1c46a5c7b6]
[bt] (4) /***/lib/libxgboost.so(+0x25f210) [0x7f1c46a70210]
[bt] (5) /***/lib/libxgboost.so(XGBoosterUpdateOneIter+0x4e) [0x7f1c46968efe]
[bt] (6) /***/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f32974429dd]
[bt] (7) /***/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7f3297442067]
[bt] (8) /***/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7f329745ad39]

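For context, training goes through the scikit-learn style dask API of xgboost (the fit in dask.py above). Below is a minimal sketch of the kind of setup that hits this code path; it is illustrative only, with placeholder random data, a single-node LocalCUDACluster standing in for my real two-server dask-cuda cluster, and DaskXGBRegressor used purely as an example estimator:

# Illustrative sketch only: placeholder data, single-node cluster, and an
# example estimator; the real job runs on a two-server dask-cuda cluster.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask.array as da
import xgboost as xgb

cluster = LocalCUDACluster()   # one dask worker per visible GPU
client = Client(cluster)

# Placeholder random data; the real input is ~4M x 4K dense float features.
X = da.random.random((4_000_000, 4_000), chunks=(100_000, 4_000))
y = da.random.random(4_000_000, chunks=100_000)

model = xgb.dask.DaskXGBRegressor(tree_method="gpu_hist")
model.fit(X, y)                # fails during gpu_hist sketching as in the traceback above
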
Data set:

  • one training set: 4M rows × 4K columns (all float)
  • four test sets: 3.4M rows × 4K columns in total (all float)
    The binary DMatrix on disk is 122 GB.

Hardware:

  • Two GPU servers, each with 4 GPU cards, 80 CPU vcores and 1.5 TB RAM
  • Each GPU card is an NVIDIA Tesla V100-SXM2 16 GB, with NVIDIA driver 418.67 and CUDA 10.1
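
For rough sizing, here is a back-of-the-envelope estimate (assuming dense float32 and no compression; the numbers are only meant to set the scale):

# Rough sizing, assuming dense float32 (4 bytes per value).
GiB = 1024 ** 3
train_bytes = 4_000_000 * 4_000 * 4   # ~59.6 GiB for the training matrix
test_bytes = 3_400_000 * 4_000 * 4    # ~50.7 GiB for the four test sets combined
aggregate_gpu_gb = 2 * 4 * 16         # 8 x V100 16 GB = 128 GB of GPU memory in total

print(train_bytes / GiB, test_bytes / GiB)    # ~59.6  ~50.7
print((train_bytes + test_bytes) / GiB)       # ~110 GiB raw, same order as the 122 GB binary DMatrix
print(aggregate_gpu_gb)                       # 128

So the raw training matrix alone is about 60 GiB, in line with the ~70 GB of GPU RAM the old xgboost 1.1.0 setup needs, and well under the 128 GB of aggregate GPU memory.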

Software:

  • miniconda 4.9.2 with python 3.8.11
  • xgboost 1.4.0dev.rapidsai0.19-cuda10.1py38_0 (installed from the conda rapids channel)
  • dask + distributed 2022.03.0 (installed from the conda rapids channel)
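
The versions above are what conda reports; a trivial check run inside the failing environment (nothing environment-specific assumed) confirms what actually gets imported:

import xgboost, dask, distributed
print(xgboost.__version__, xgboost.__file__)       # confirms the rapids-channel 1.4.0 dev build is the one imported
print(dask.__version__, distributed.__version__)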

I say it is strange because:

  • the free memory reported ("Free memory: 7269908480", ~7 GB) looks wrong, because all GPU cards are actually idle with the full 16 GB available (see the check sketched after this list).
  • if I switch back to conda 4.3.1 + python 3.6 + xgboost 1.1.0 (pip installed) + dask 2.19.0 (pip installed, distributed included), the error disappears and training only needs ~70 GB of GPU RAM.
  • if I reduce the data size to 10%, memory is fine, but training is 2-3 times slower.
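
The "all GPU cards are idle" point can be checked right before training starts; a minimal sketch with pynvml (nvidia-smi reports the same numbers):

# Print free/used memory on every visible GPU before training starts.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU {i}: free={info.free / 1024**3:.1f} GiB, used={info.used / 1024**3:.1f} GiB")
pynvml.nvmlShutdown()

On idle cards this shows close to 16 GiB free per GPU, which is why the 7269908480 bytes in the error message looks wrong.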

Could this be because the xgboost 1.4.0 build bundles its own CUDA, causing a conflict with the system CUDA/driver, or is there some other reason?