GPU-accelerated SHAP values crash Google Colab

Hi. I am trying to use Google Colab GPUs to speed up some of my work with XGBoost. By default, Colab machines come with Python 3.7 and XGBoost 0.90; running !pip install --upgrade xgboost upgrades it to version 1.4. However, when I run booster.predict(dmat, pred_contribs=True) in a loop (I am computing SHAP values for several different models on the same dataset), it runs for the first 10 or so iterations and then crashes the Colab kernel.
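For reference, the upgrade cell at the top of my notebook looks roughly like this (just a sketch; the exact 1.4.x version pip resolves may vary):

!pip install --upgrade xgboost

import xgboost
print(xgboost.__version__)  # 1.4.x after the upgrade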

I also run booster.set_param({'predictor': 'gpu_predictor'}) before calling booster.predict(..., pred_contribs=True).
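The loop itself has roughly this shape (a simplified sketch; models stands in for my list of fitted XGBClassifier objects and X for the shared dataset):

import xgboost

shap_values = []
for model in models:
    booster = model.get_booster()
    booster.set_param({'predictor': 'gpu_predictor'})
    dmat = xgboost.DMatrix(X)
    shap_values.append(booster.predict(dmat, pred_contribs=True))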

I don’t think the crash is related to resource consumption, as both RAM and GPU memory usage look low. My dataset and model are also fairly small, and XGBoost computes the SHAP values blazingly fast right up until it crashes.

I suspect this is some compatibility issue on the Colab end, but I was wondering whether there is anything I can try to resolve it, as I don’t have a GPU at home. Any advice would be greatly appreciated. Thanks!

UPDATE:

Here is the crash log from the Colab machine:

| Timestamp | Level | Message |
|---|---|---|
| Jul 8, 2021, 12:20:35 AM | WARNING | WARNING:root:kernel 1556ff02-ab27-4cf4-9743-414d48877806 restarted |
| Jul 8, 2021, 12:20:35 AM | INFO | KernelRestarter: restarting kernel (1/5), keep random ports |
| Jul 8, 2021, 12:20:32 AM | WARNING | what(): device free failed: an illegal memory access was encountered |
| Jul 8, 2021, 12:20:32 AM | WARNING | terminate called after throwing an instance of 'thrust::system::system_error' |

This is happening on a Tesla P100 card running Ubuntu 18.04 LTS.

After doing some experimentation, I found that the hyper-parameter tuning function in my code is what interacts with GPUTreeSHAP and causes the crash. Here is a minimal example:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier, DMatrix

# 10,000 x 100 matrix to compute SHAP values on; 100 x 100 training matrix
X = np.random.rand(10**4, 10**2)
X_train = np.random.rand(10**2, 10**2)
y_train = np.random.randint(0, 1, X_train.shape[0])  # note: randint(0, 1) only ever returns 0

shap_values = []
base_estimator = XGBClassifier(
    tree_method='gpu_hist', gpu_id=0,
    use_label_encoder=False, eval_metric='error', random_state=42
)

def tune_n_estimators(base_estimator, X_train, y_train, early_stopping_rounds=5):
    base_estimator_params = base_estimator.get_params()
    kfold = StratifiedKFold()  # not actually used in this minimal example
    best_iterations = []
    estimator = XGBClassifier(**base_estimator_params)
    # simple hold-out split
    X_train_fold, y_train_fold = X_train[:50], y_train[:50]
    X_cv_fold, y_cv_fold = X_train[50:], y_train[50:]
    # fit with early stopping against the hold-out fold
    estimator.fit(
        X_train_fold, y_train_fold,
        early_stopping_rounds=early_stopping_rounds,
        eval_set=[(X_cv_fold, y_cv_fold)],
        eval_metric='error',
        verbose=0
    )
    best_iterations.append(estimator.best_iteration)
    # rebuild the estimator with n_estimators set to the mean best iteration
    new_params = base_estimator_params.copy()
    new_params['n_estimators'] = int(np.mean(best_iterations))
    estimator = XGBClassifier(**new_params)
    return estimator

for i in range(100):
    estimator = tune_n_estimators(base_estimator, X_train, y_train)
    estimator.fit(X_train, y_train)
    booster = estimator.get_booster()
    booster.set_param({'predictor': 'gpu_predictor'})
    dmat = DMatrix(X)
    # GPU-accelerated SHAP values via pred_contribs
    shap_values.append(booster.predict(dmat, pred_contribs=True))
    print(i)

It looks like tuning with early_stopping_rounds is what causes the crash. If I remove early stopping, just call estimator.fit(X_train_fold, y_train_fold) inside the tuning function, and set n_estimators to some arbitrary number (e.g., 100), it does not crash.

Figured it out! The tuning function was returning 0 for n_estimators on some iterations. Is this expected behavior for early stopping? The problem went away after I bumped up early_stopping_rounds. Still, it is interesting that an n_estimators of 0 crashes the kernel.
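For anyone who hits the same thing, a small guard when converting best_iteration into n_estimators should also sidestep the problem (a sketch, not my exact code; the helper name is made up, and the + 1 reflects that best_iteration is zero-indexed):

import numpy as np

def n_estimators_from_best_iterations(best_iterations):
    # best_iteration is zero-indexed, so keep best_iteration + 1 trees,
    # and never fewer than 1 so the rebuilt model cannot have zero trees
    return max(1, int(np.mean([b + 1 for b in best_iterations])))

Dropping that in place of the int(np.mean(best_iterations)) line in tune_n_estimators means the rebuilt estimator always has at least one tree before pred_contribs is called.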