Validation set hosting during GPU training

Hey all,

Happy New Year!

I’d like to clarify something regarding the validation set that’s used during GPU training, and more specifically where it’s hosted. This isn’t a generic ML question, but rather a question about XGBoost’s implementation.

Suppose we train a simple XGB Classifier using the code below

import xgboost as xgb
import numpy as np

np.random.seed(0)
X = np.random.random((1000, 10))
y = np.random.binomial(1, 0.5, len(X))

vX = np.random.random((100000, 10))
vy = np.random.binomial(1, 0.1, len(vX))

clf = xgb.XGBClassifier(n_estimators=10, max_depth=4, tree_method="gpu_hist")  # I understand "gpu_hist" is an older approach
clf.fit(X, y, eval_set=[(X, y), (vX, vy)], eval_metric=["logloss"])

The machine we train on has a GPU available and we use it for training (the gpu_hist param). The model is evaluated on the training and validation sets. To train the model, the training set will be moved to the GPU (expected).

Question: Will the validation set (vX, vy) also be moved to GPU? From some personal observations, I think the answer is “yes”, but it’d be good to know for sure.

Question 2: If the answer to the above question is indeed “yes”, is there a way to tell XGBoost to only copy the training set to GPU but not the validation set?

Some Additional Context

Copying the validation set to the GPU takes up space in a very precious resource (GPU RAM). In fact, doing so can often cause OOM errors, depending on dataset characteristics, when the training and validation sets cannot both fit in GPU memory. Using external memory slows things down considerably, since it also seems to apply to both the training and the validation set(s), and moving data back and forth can add significant overhead. That behavior also depends on certain implementation details, but I digress.

However, the validation set is only needed for inference (prediction). With that in mind, I’d claim that only the training set needs to be on the GPU to boost training speed, while the validation set can remain in (regular) RAM. I understand this may be overly simplistic.

Separately, I’ve seen that one can use a gpu_predictor or cpu_predictor for inference, but I’m unclear whether this setting also affects whether the data are copied or not (specifically the validation set).

Before I make any feature requests, I would like to understand the current implementation a bit more.

Thanks in advance!