NVIDIA Quadro RTX 5000 Windows 10 Issue

I have an NVIDIA Quadro RTX 5000, which is listed as GPU 1 in Task Manager (Windows 10).

Per the GPU Support page, I’ve added these two settings to my training parameters:

parameters['gpu_id'] = 1  # NVIDIA Quadro RTX 5000
parameters['tree_method'] = 'gpu_hist'

I get the following error.

XGBoostError: [12:52:53] c:\users\administrator\workspace\xgboost-win64_release_1.0.0\src\common\common.h:41: c:\users\administrator\workspace\xgboost-win64_release_1.0.0\src\predictor…/common/device_helpers.cuh: 126: invalid device ordinal

I’ve tried a few other values for gpu_id, including 0, and none of them work.
(The Intel UHD Graphics 630 is at GPU 0.)

Is there something else that I need to do in order to get the NVIDIA GPU to work?

I have run tests using the CPU (Xeon), and so far those have gone well.

Thanks!

BTW I’m running XGBoost 1.0.2 through a Jupyter notebook launched from Anaconda.

Did you install the CUDA driver?
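
A quick way to check from a notebook cell (assuming the driver installer put nvidia-smi on the PATH, which it usually does):

!nvidia-smi

If the Quadro shows up in that table with a driver version, the driver side is in place.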

I think (hope) I have. 🙂

Unfortunately I’m still getting the same error, even after installing the NVIDIA CUDA Toolkit 10.2 and rebooting.

Thanks for your help with this!

Here’s what the GPU is running.

NVIDIA System Information report created on: 04/10/2020 20:40:54
System name: DESKTOP-ECFI88Q

[Display]
Operating System: Windows 10 Pro for Workstations, 64-bit
DirectX version: 12.0
GPU processor: Quadro RTX 5000
Driver version: 441.22
Driver Type: DCH
Direct3D API version: 12
Direct3D feature level: 12_1
CUDA Cores: 3072
Core clock: 1545 MHz
Memory data rate: 14.00 Gbps
Memory interface: 256-bit
Memory bandwidth: 448.06 GB/s
Total available graphics memory: 81796 MB
Dedicated video memory: 16384 MB GDDR6
System video memory: 0 MB
Shared system memory: 65412 MB
Video BIOS version: 90.04.52.00.36
IRQ: Not used
Bus: PCI Express x16 Gen3
Device Id: 10DE 1EB5 09271028
Part Number: 4914 0010

[Components]

nvui.dll 8.17.14.4122 NVIDIA User Experience Driver Component
nvxdplcy.dll 8.17.14.4122 NVIDIA User Experience Driver Component
nvxdbat.dll 8.17.14.4122 NVIDIA User Experience Driver Component
nvxdapix.dll 8.17.14.4122 NVIDIA User Experience Driver Component
NVCPL.DLL 8.17.14.4122 NVIDIA User Experience Driver Component
nvCplUIR.dll 8.1.940.0 NVIDIA Control Panel
nvCplUI.exe 8.1.940.0 NVIDIA Control Panel
nvWSSR.dll 26.21.14.4122 NVIDIA Workstation Server
nvWSS.dll 26.21.14.4122 NVIDIA Workstation Server
nvViTvSR.dll 26.21.14.4122 NVIDIA Video Server
nvViTvS.dll 26.21.14.4122 NVIDIA Video Server
nvLicensingS.dll 6.14.14.4122 NVIDIA Licensing Server
nvDevToolS.dll 26.21.14.4122 NVIDIA 3D Settings Server
nvDispSR.dll 26.21.14.4122 NVIDIA Display Server
nvDispS.dll 26.21.14.4122 NVIDIA Display Server
PhysX 09.19.0218 NVIDIA PhysX
NVCUDA.DLL 26.21.14.4122 NVIDIA CUDA 10.2.95 driver
nvGameSR.dll 26.21.14.4122 NVIDIA 3D Settings Server
nvGameS.dll 26.21.14.4122 NVIDIA 3D Settings Server

Here’s what I have for XGBoost.

!pip install --upgrade xgboost
Requirement already up-to-date: xgboost in c:\users\camda\anaconda3\lib\site-packages (1.0.2)
Requirement already satisfied, skipping upgrade: numpy in c:\users\camda\anaconda3\lib\site-packages (from xgboost) (1.18.1)
Requirement already satisfied, skipping upgrade: scipy in c:\users\camda\anaconda3\lib\site-packages (from xgboost) (1.4.1)

@camda03 I cannot reproduce the issue on my Windows machine (AWS EC2 G4 instance). I just ran this program:

import xgboost as xgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
dtrain = xgb.DMatrix(X, label=y)
params = {'max_depth': 2, 'eta': 0.1, 'gpu_id': 0, 
          'tree_method': 'gpu_hist', 'eval_metric': 'error'}

bst = xgb.train(params, dtrain, num_boost_round=250, evals=[(dtrain, 'train')])

My machine:

  • AWS EC2 G4 instance
  • Windows Server 2019
  • Miniconda 3
  • CUDA Toolkit 10.2

My suggestion is to remove the gpu_id parameter and instead use the environment variable CUDA_VISIBLE_DEVICES to choose the GPU you’d like to use.
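
For example, from inside the notebook, before anything touches the GPU, something like this minimal sketch. Note that CUDA enumerates only NVIDIA devices, so the Quadro is most likely CUDA device 0 even though Task Manager shows it as GPU 1; the Intel iGPU never gets a CUDA ordinal.

import os

# Must be set before the CUDA runtime initializes, i.e. before the first GPU call.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # assumed CUDA ordinal of the Quadro RTX 5000

import xgboost as xgb
# ...build the DMatrix and call xgb.train with 'tree_method': 'gpu_hist',
# leaving gpu_id at its default of 0...

With only one device exposed to CUDA, the default gpu_id of 0 will always point at the Quadro.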

That helped, thanks!

I tried this when I read your message.

The good news: it did connect to GPU 1 and ran for a while.

However, the kernel kept dying after four or five xgb.train calls.

I looked at the Jupyter log but I didn’t see anything to indicate why the kernel might have died.

I tried restarting Jupyter and then rebooting the machine.

Unfortunately, now it won’t run at all. I get:

XGBoostError: [08:33:51] c:\users\administrator\workspace\xgboost-win64_release_1.0.0\src\gbm\gbtree.h:308: Check failed: gpu_predictor_:

I’ve set the variable at the system and user levels.
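
To confirm the notebook kernel actually sees the value (Jupyter has to be restarted after changing a system- or user-level variable), it can be printed from a cell:

import os
print(os.environ.get('CUDA_VISIBLE_DEVICES'))

If that comes back as None, or as an ordinal CUDA doesn’t have, GPU training can fail even though the hardware is fine.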

Anyway, it did run for a while. I hope that this info is helpful.

Thanks!

P.S. Here’s a parameter sample and the full traceback.

{'rate_drop': 0.34390186718905363, 'eta': 0.7436749928679078, 'min_child_weight': 39, 'alpha': 3.6571649662944705, 'max_depth': 25, 'min_subsample': 0.07562723872365831, 'lambda': 2.6508283625950915, 'objective': 'reg:squaredlogerror', 'eval_metric': 'rmsle', 'tree_method': 'gpu_hist'}

XGBoostError                              Traceback (most recent call last)
<ipython-input> in <module>
      9 print(parameters)
     10
---> 11 bst, results = train_model(parameters, training_dmatrix, TESTS_PER_CYCLE, evallist, local_objective, local_metric)
     12
     13 print(bst)

<ipython-input> in train_model(params, dtrain, num_boost_round, evals, local_objective, local_metric)
     12 bst = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round, evals=evals,
     13                 evals_result=results, verbose_eval=VERBOSE_EVAL_INTERVAL, obj=local_objective,
---> 14                 feval=local_metric)
     15
     16

~\anaconda3\lib\site-packages\xgboost\training.py in train(params, dtrain, num_boost_round, evals, obj, feval, maximize, early_stopping_rounds, evals_result, verbose_eval, xgb_model, callbacks)
    207                           evals=evals,
    208                           obj=obj, feval=feval,
--> 209                           xgb_model=xgb_model, callbacks=callbacks)
    210
    211

~\anaconda3\lib\site-packages\xgboost\training.py in _train_internal(params, dtrain, num_boost_round, evals, obj, feval, xgb_model, callbacks)
     72     # Skip the first update if it is a recovery step.
     73     if version % 2 == 0:
---> 74         bst.update(dtrain, i, obj)
     75         bst.save_rabit_checkpoint()
     76         version += 1

~\anaconda3\lib\site-packages\xgboost\core.py in update(self, dtrain, iteration, fobj)
   1249                                                     dtrain.handle))
   1250         else:
-> 1251             pred = self.predict(dtrain, training=True)
   1252             grad, hess = fobj(pred, dtrain)
   1253             self.boost(dtrain, grad, hess)

~\anaconda3\lib\site-packages\xgboost\core.py in predict(self, data, output_margin, ntree_limit, pred_leaf, pred_contribs, approx_contribs, pred_interactions, validate_features, training)
   1450                                           ctypes.c_int(training),
   1451                                           ctypes.byref(length),
-> 1452                                           ctypes.byref(preds)))
   1453         preds = ctypes2numpy(preds, length.value, np.float32)
   1454         if pred_leaf:

~\anaconda3\lib\site-packages\xgboost\core.py in _check_call(ret)
    187     """
    188     if ret != 0:
--> 189         raise XGBoostError(py_str(_LIB.XGBGetLastError()))
    190
    191

XGBoostError: [08:33:51] c:\users\administrator\workspace\xgboost-win64_release_1.0.0\src\gbm\gbtree.h:308: Check failed: gpu_predictor_: