Why can't I use all my GPUs?


#1

My server has 4 V100 GPUs, but I can only train with 3 of them.

Here is my code:

import xgboost as xgb
d_train = xgb.DMatrix('./raw-data/stacking_train.00')
param = {'max_depth': 6,
         'eta': 0.1,
         'silent': 1,
         'objective': 'binary:logistic',
         'tree_method': 'gpu_hist',
         'subsample': 0.75,
         'colsample_bytree': 0.8,
         #'rate_drop': 0.1,
         'n_gpus': 3,  # works with 3; fails with 4 or -1
         'nthread': 10}

param['n_estimators'] = 4000  # likely has no effect here: xgb.train takes the round count from its third argument
param['eval_metric'] = ['error', 'error@0.004']
evallist = [(d_train, 'train')]
bst = xgb.train(param, d_train, 3000, evallist)

When I set 'n_gpus' to 4 or -1, I get the following errors:

[07:34:02] 100000x5021 matrix with 101394310 entries loaded from ./raw-data/stacking_train.00
[07:34:08] /workspace/xgboost/rabit/include/rabit/./internal/…/…/dmlc/./logging.h:300: [07:34:08] /workspace/xgboost/include/xgboost/./…/…/src/common/common.h:41: /workspace/xgboost/src/tree/updater_gpu_hist.cu: 286: invalid argument

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(_ZN4dmlc15LogMessageFatalD2Ev+0x3c) [0x7efdecd35a0c]
[bt] (1) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(+0x359890) [0x7efdecfa7890]
[bt] (2) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(_ZN7xgboost4tree23GPUHistMakerSpecialisedINS_6detail20GradientPairInternalIdEEE10UpdateTreeEPNS_16HostDeviceVectorINS3_IfEEEEPNS_7DMatrixEPNS_7RegTreeE+0x2032) [0x7efdecfd1152]
[bt] (3) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(_ZN7xgboost4tree23GPUHistMakerSpecialisedINS_6detail20GradientPairInternalIdEEE6UpdateEPNS_16HostDeviceVectorINS3_IfEEEEPNS_7DMatrixERKSt6vectorIPNS_7RegTreeESaISE_EE+0x14a) [0x7efdecfd193a]
[bt] (4) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(_ZN7xgboost3gbm6GBTree13BoostNewTreesEPNS_16HostDeviceVectorINS_6detail20GradientPairInternalIfEEEEPNS_7DMatrixEiPSt6vectorISt10unique_ptrINS_7RegTreeESt14default_deleteISC_EESaISF_EE+0xa8c) [0x7efdecd6dd2c]
[bt] (5) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(_ZN7xgboost3gbm6GBTree7DoBoostEPNS_7DMatrixEPNS_16HostDeviceVectorINS_6detail20GradientPairInternalIfEEEEPNS_11ObjFunctionE+0xc39) [0x7efdecd6f669]
[bt] (6) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(_ZN7xgboost11LearnerImpl13UpdateOneIterEiPNS_7DMatrixE+0x381) [0x7efdecdb16c1]
[bt] (7) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(XGBoosterUpdateOneIter+0x45) [0x7efdecd49235]
[bt] (8) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call_unix64+0x4c) [0x7efe77cbae20]
[bt] (9) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call+0x2eb) [0x7efe77cba88b]

[07:34:08] /workspace/xgboost/rabit/include/rabit/./internal/…/…/dmlc/./logging.h:300: [07:34:08] /workspace/xgboost/src/tree/updater_gpu_hist.cu:957: Exception in gpu_hist: [07:34:08] /workspace/xgboost/include/xgboost/./…/…/src/common/common.h:41: /workspace/xgboost/src/tree/updater_gpu_hist.cu: 286: invalid argument
Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(_ZN4dmlc15LogMessageFatalD2Ev+0x3c) [0x7efdecd35a0c]
[bt] (1) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(+0x359890) [0x7efdecfa7890]
[bt] (2) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(_ZN7xgboost4tree23GPUHistMakerSpecialisedINS_6detail20GradientPairInternalIdEEE10UpdateTreeEPNS_16HostDeviceVectorINS3_IfEEEEPNS_7DMatrixEPNS_7RegTreeE+0x2032) [0x7efdecfd1152]
[bt] (3) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(_ZN7xgboost4tree23GPUHistMakerSpecialisedINS_6detail20GradientPairInternalIdEEE6UpdateEPNS_16HostDeviceVectorINS3_IfEEEEPNS_7DMatrixERKSt6vectorIPNS_7RegTreeESaISE_EE+0x14a) [0x7efdecfd193a]
[bt] (4) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(_ZN7xgboost3gbm6GBTree13BoostNewTreesEPNS_16HostDeviceVectorINS_6detail20GradientPairInternalIfEEEEPNS_7DMatrixEiPSt6vectorISt10unique_ptrINS_7RegTreeESt14default_deleteISC_EESaISF_EE+0xa8c) [0x7efdecd6dd2c]
[bt] (5) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(_ZN7xgboost3gbm6GBTree7DoBoostEPNS_7DMatrixEPNS_16HostDeviceVectorINS_6detail20GradientPairInternalIfEEEEPNS_11ObjFunctionE+0xc39) [0x7efdecd6f669]
[bt] (6) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(_ZN7xgboost11LearnerImpl13UpdateOneIterEiPNS_7DMatrixE+0x381) [0x7efdecdb16c1]
[bt] (7) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(XGBoosterUpdateOneIter+0x45) [0x7efdecd49235]
[bt] (8) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call_unix64+0x4c) [0x7efe77cbae20]
[bt] (9) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call+0x2eb) [0x7efe77cba88b]

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(_ZN4dmlc15LogMessageFatalD2Ev+0x3c) [0x7efdecd35a0c]
[bt] (1) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(_ZN7xgboost4tree23GPUHistMakerSpecialisedINS_6detail20GradientPairInternalIdEEE6UpdateEPNS_16HostDeviceVectorINS3_IfEEEEPNS_7DMatrixERKSt6vectorIPNS_7RegTreeESaISE_EE+0x2b1) [0x7efdecfd1aa1]
[bt] (2) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(_ZN7xgboost3gbm6GBTree13BoostNewTreesEPNS_16HostDeviceVectorINS_6detail20GradientPairInternalIfEEEEPNS_7DMatrixEiPSt6vectorISt10unique_ptrINS_7RegTreeESt14default_deleteISC_EESaISF_EE+0xa8c) [0x7efdecd6dd2c]
[bt] (3) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(_ZN7xgboost3gbm6GBTree7DoBoostEPNS_7DMatrixEPNS_16HostDeviceVectorINS_6detail20GradientPairInternalIfEEEEPNS_11ObjFunctionE+0xc39) [0x7efdecd6f669]
[bt] (4) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(_ZN7xgboost11LearnerImpl13UpdateOneIterEiPNS_7DMatrixE+0x381) [0x7efdecdb16c1]
[bt] (5) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(XGBoosterUpdateOneIter+0x45) [0x7efdecd49235]
[bt] (6) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call_unix64+0x4c) [0x7efe77cbae20]
[bt] (7) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call+0x2eb) [0x7efe77cba88b]
[bt] (8) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(_ctypes_callproc+0x49a) [0x7efe77cb501a]
[bt] (9) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(+0x9fcb) [0x7efe77ca8fcb]


XGBoostError Traceback (most recent call last)
in
----> 1 import run

/workspace/data/run.py in
18 param['eval_metric'] = ['error','error@0.004']
19 evallist = [(d_train,'train')]
---> 20 bst = xgb.train(param, d_train, 3000, evallist)

/usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/training.py in train(params, dtrain, num_boost_round, evals, obj, feval, maximize, early_stopping_rounds, evals_result, verbose_eval, xgb_model, callbacks, learning_rates)
214 evals=evals,
215 obj=obj, feval=feval,
--> 216 xgb_model=xgb_model, callbacks=callbacks)
217
218

/usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/training.py in _train_internal(params, dtrain, num_boost_round, evals, obj, feval, xgb_model, callbacks)
72 # Skip the first update if it is a recovery step.
73 if version % 2 == 0:
---> 74 bst.update(dtrain, i, obj)
75 bst.save_rabit_checkpoint()
76 version += 1

/usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/core.py in update(self, dtrain, iteration, fobj)
1060 if fobj is None:
1061 _check_call(_LIB.XGBoosterUpdateOneIter(self.handle, ctypes.c_int(iteration),
-> 1062 dtrain.handle))
1063 else:
1064 pred = self.predict(dtrain)

/usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/core.py in _check_call(ret)
176 """
177 if ret != 0:
--> 178 raise XGBoostError(_LIB.XGBGetLastError())
179
180
XGBoostError: b’[07:34:08] /workspace/xgboost/src/tree/updater_gpu_hist.cu:957: Exception in gpu_hist: [07:34:08] /workspace/xgboost/include/xgboost/./…/…/src/common/common.h:41: /workspace/xgboost/src/tree/updater_gpu_hist.cu: 286: invalid argument\n\nStack trace returned 10 entries:\n[bt] (0) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(_ZN4dmlc15LogMessageFatalD2Ev+0x3c) [0x7efdecd35a0c]\n[bt] (1) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(+0x359890) [0x7efdecfa7890]\n[bt] (2) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(_ZN7xgboost4tree23GPUHistMakerSpecialisedINS_6detail20GradientPairInternalIdEEE10UpdateTreeEPNS_16HostDeviceVectorINS3_IfEEEEPNS_7DMatrixEPNS_7RegTreeE+0x2032) [0x7efdecfd1152]\n[bt] (3) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(_ZN7xgboost4tree23GPUHistMakerSpecialisedINS_6detail20GradientPairInternalIdEEE6UpdateEPNS_16HostDeviceVectorINS3_IfEEEEPNS_7DMatrixERKSt6vectorIPNS_7RegTreeESaISE_EE+0x14a) [0x7efdecfd193a]\n[bt] (4) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(_ZN7xgboost3gbm6GBTree13BoostNewTreesEPNS_16HostDeviceVectorINS_6detail20GradientPairInternalIfEEEEPNS_7DMatrixEiPSt6vectorISt10unique_ptrINS_7RegTreeESt14default_deleteISC_EESaISF_EE+0xa8c) [0x7efdecd6dd2c]\n[bt] (5) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(_ZN7xgboost3gbm6GBTree7DoBoostEPNS_7DMatrixEPNS_16HostDeviceVectorINS_6detail20GradientPairInternalIfEEEEPNS_11ObjFunctionE+0xc39) [0x7efdecd6f669]\n[bt] (6) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(_ZN7xgboost11LearnerImpl13UpdateOneIterEiPNS_7DMatrixE+0x381) [0x7efdecdb16c1]\n[bt] (7) 
/usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(XGBoosterUpdateOneIter+0x45) [0x7efdecd49235]\n[bt] (8) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call_unix64+0x4c) [0x7efe77cbae20]\n[bt] (9) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call+0x2eb) [0x7efe77cba88b]\n\n\n\nStack trace returned 10 entries:\n[bt] (0) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(_ZN4dmlc15LogMessageFatalD2Ev+0x3c) [0x7efdecd35a0c]\n[bt] (1) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(_ZN7xgboost4tree23GPUHistMakerSpecialisedINS_6detail20GradientPairInternalIdEEE6UpdateEPNS_16HostDeviceVectorINS3_IfEEEEPNS_7DMatrixERKSt6vectorIPNS_7RegTreeESaISE_EE+0x2b1) [0x7efdecfd1aa1]\n[bt] (2) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(_ZN7xgboost3gbm6GBTree13BoostNewTreesEPNS_16HostDeviceVectorINS_6detail20GradientPairInternalIfEEEEPNS_7DMatrixEiPSt6vectorISt10unique_ptrINS_7RegTreeESt14default_deleteISC_EESaISF_EE+0xa8c) [0x7efdecd6dd2c]\n[bt] (3) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(_ZN7xgboost3gbm6GBTree7DoBoostEPNS_7DMatrixEPNS_16HostDeviceVectorINS_6detail20GradientPairInternalIfEEEEPNS_11ObjFunctionE+0xc39) [0x7efdecd6f669]\n[bt] (4) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(_ZN7xgboost11LearnerImpl13UpdateOneIterEiPNS_7DMatrixE+0x381) [0x7efdecdb16c1]\n[bt] (5) /usr/local/lib/python3.5/dist-packages/xgboost-0.81-py3.5-linux-x86_64.egg/xgboost/./lib/libxgboost.so(XGBoosterUpdateOneIter+0x45) [0x7efdecd49235]\n[bt] (6) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call_unix64+0x4c) [0x7efe77cbae20]\n[bt] (7) 
/usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call+0x2eb) [0x7efe77cba88b]\n[bt] (8) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(_ctypes_callproc+0x49a) [0x7efe77cb501a]\n[bt] (9) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(+0x9fcb) [0x7efe77ca8fcb]\n’

When I use only 3 of my GPUs, training works fine.
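One way to narrow this down might be to check whether any 3 of the 4 GPUs work, or only one particular set of three. The sketch below is not part of the original script: the helper names are hypothetical, and it assumes the CUDA runtime in this build honors the standard CUDA_VISIBLE_DEVICES environment variable, which has to be set before the process first touches CUDA.

```python
import itertools
import os


def visible_device_sets(num_gpus, subset_size):
    """List every CUDA_VISIBLE_DEVICES value selecting subset_size of num_gpus devices."""
    return [",".join(str(i) for i in combo)
            for combo in itertools.combinations(range(num_gpus), subset_size)]


def env_for(devices):
    """Copy the current environment with only the given devices visible (hypothetical helper)."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = devices
    return env


# Each entry exposes a different trio of the four GPUs to a child process.
for devices in visible_device_sets(4, 3):
    print(devices)
```

Each value can then be passed to a fresh process, for example `subprocess.run([sys.executable, "run.py"], env=env_for("0,1,3"))`, with 'n_gpus' left at 3 in the script.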


#2

The error comes from a CUDA API call, specifically cudaMemset. How many rows does your dataset have?


#3

Opened an issue for you:


#4

100000x5021 matrix with 101394310 entries loaded from ./raw-data/stacking_train.00
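The numbers in that log line can be pulled out and turned into per-GPU row counts. This is only an illustrative sketch: the function names are hypothetical, and the even split across devices is an assumption about how the rows are divided, not something confirmed in this thread.

```python
import re

LOG_LINE = ("100000x5021 matrix with 101394310 entries loaded from "
            "./raw-data/stacking_train.00")


def parse_matrix_summary(line):
    """Extract (rows, cols, entries) from a DMatrix load message."""
    match = re.search(r"(\d+)x(\d+) matrix with (\d+) entries", line)
    if match is None:
        raise ValueError("not a matrix load summary: %r" % line)
    rows, cols, entries = (int(g) for g in match.groups())
    return rows, cols, entries


def even_split(rows, n_gpus):
    """Rows per device if rows are divided as evenly as possible (assumed scheme)."""
    base, extra = divmod(rows, n_gpus)
    return [base + 1 if i < extra else base for i in range(n_gpus)]


rows, cols, entries = parse_matrix_summary(LOG_LINE)
density = entries / (rows * cols)  # fraction of cells that are stored entries

print(rows, cols, entries)
print(even_split(rows, 3))
print(even_split(rows, 4))
```

So the dataset has 100000 rows and 5021 columns, and under the even-split assumption each of 4 GPUs would receive 25000 rows.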


#5

Thanks. This is a bug, and I will look into it in the future. Please follow that issue. It will take some time, since I currently don't have access to a machine with 4 GPUs.
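In the meantime, a possible stopgap is to try the larger GPU count first and fall back to the count that is known to work. The sketch below is an assumption-laden illustration, not an xgboost feature: `train_with_gpu_fallback` and `train_fn` are hypothetical names, and only the 'n_gpus' parameter comes from the original code. It also assumes the failure surfaces as a Python exception, as the XGBoostError above does.

```python
def train_with_gpu_fallback(train_fn, base_params, gpu_counts):
    """Call train_fn with each n_gpus value in turn; return (n_gpus, result) for the first success.

    train_fn is any callable taking a params dict (hypothetical wrapper around xgb.train).
    """
    last_error = None
    for n_gpus in gpu_counts:
        # Copy the params so the caller's dict is left untouched.
        params = dict(base_params, n_gpus=n_gpus)
        try:
            return n_gpus, train_fn(params)
        except Exception as exc:  # XGBoostError is an Exception subclass
            last_error = exc
    raise RuntimeError("training failed for every GPU count tried") from last_error
```

With the script from the first post this could be called as `train_with_gpu_fallback(lambda p: xgb.train(p, d_train, 3000, evallist), param, [4, 3])`. A failed CUDA call may leave the process in a bad state, so retrying inside the same process is not guaranteed to succeed; running each attempt in a fresh process is the safer variant.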