Predicting from multiple jobs & threading issues

Hi,

I’m trying to call my (sklearn API) XGBClassifier’s predict function from multiple jobs. Doing so, I run into several issues.

First and foremost, unrelated to the multiprocessing: I set the n_jobs parameter of my model to 1. However, when I run it, I notice that it creates a thread for every single estimator (144 in my case). This seems wrong… why does this happen, and can I do anything about it?

This causes issues when I try running it on e.g. 40 jobs, as it will either throw RuntimeErrors about not being able to create more threads, or just hang silently.

Secondly, the documentation of the predict method (https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier.predict) mentions that it is not thread-safe. If I understand correctly, this may (or will?) cause issues when calling the same classifier’s predict function across multiple workers, correct? The suggested remedy is xgb.copy(), but copy() does not exist on the XGBClassifier. Is it intended to be run on the booster? Would deepcopy(xgb_classifier) suffice?

Hope someone can help!

(Note: the reason I want to use multiprocessing instead of built-in threads is that there are some other steps that need to be done before XGBoost, and those are currently the bottleneck. Multiprocessing those steps separately is inefficient and also requires passing a lot of data back and forth between processes.)

Update: I’ve found some information about setting nthread with xgb._Booster.set_param('nthread', 1). This seems to work in a minimal example. However, if I do this from inside a multiprocessing job (or before starting the multiprocessing job) it hangs and doesn’t do anything…
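
For reference, the minimal example where it does seem to work looks roughly like this (just a sketch; note that _Booster is a private attribute of the sklearn wrapper, so this may not be stable across versions):

from xgboost import XGBClassifier
import numpy as np

x = np.random.rand(200, 10)
y = np.random.randint(0, 2, 200)

clf = XGBClassifier(n_jobs=1)
clf.fit(x, y)

# Cap the number of threads the underlying booster uses for prediction
clf._Booster.set_param('nthread', 1)
clf.predict(x)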

If I create a deepcopy of an XGBClassifier and also set xgb_copy._Booster = xgb_copy._Booster.copy(), setting nthread on one of them also seems to affect the number of threads used by the other one…

Further update:

export OMP_NUM_THREADS=N works, sort of. The number of threads still isn’t what I’d expect (n_parallel_jobs * OMP_NUM_THREADS), but it gets reduced by a lot.
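
For what it’s worth, the same thing can apparently also be set from inside Python, as long as it happens before xgboost is imported so that worker processes inherit it (just a sketch; I’m not sure how reliably the OpenMP runtime picks this up on every platform):

import os

# Must be set before xgboost (and hence OpenMP) is loaded, otherwise it is ignored
os.environ['OMP_NUM_THREADS'] = '1'

from xgboost import XGBClassifier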

I’ve also isolated the problem a bit more. It breaks down if I use the XGBClassifier for prediction in the main process, before multiprocessing. Any way around this…?

from multiprocessing import Pool
from xgboost import XGBClassifier
import numpy as np
import pickle

n_samples = 1000
x = np.random.rand(n_samples, 10)
y = np.random.randint(0, 2, n_samples)
load = False
predict_after_load = False
n_jobs = 1

if load:
    with open('xgb_example.p', 'rb') as rf:
        xgb = pickle.load(rf)
        
    if predict_after_load:
        xgb.predict(x)
else:
    xgb = XGBClassifier(n_jobs=n_jobs)
    xgb.fit(x, y)
    with open('xgb_example.p', 'wb') as wf:
        pickle.dump(xgb, wf)

with Pool(4) as p:
    result = np.vstack(p.map(xgb.predict_proba, np.array_split(x, 10)))

Here’s a simple example. In the form above it works, but whether it does depends on the values of n_jobs, load, and predict_after_load. With n_jobs = 1 it always works (that is, unless you saved a model that didn’t have n_jobs = 1…) (EDIT: actually it doesn’t seem to work for n_jobs = 1 either if predict_after_load = True). With n_jobs > 1, it only works if load == True and predict_after_load == False. In other words, it works as long as you don’t use an XGBClassifier with more than one job in the main process before starting the multiprocessing.

I really hope someone knows how to fix this, as it’s slowly driving me insane!

I’m sorry that XGBoost is not working well with multiprocessing. In your use case, can you limit yourself to n_jobs = 1, i.e. one thread per job? This may make sense when you have many concurrent jobs to run.


if the XGBClassifier has been used anywhere in the main process, it will cause multiprocessing to freeze

The XGBClassifier object stores the memory address of the native XGBoost object. So you cannot share the common XGBClassifier object across multiple processes, since processes have different address spaces. Deep copy is necessary to run prediction on multiple processes.

It appears that, strangely enough, n_jobs = 1 works if I fit the XGBClassifier in the same process. Then I can even predict with it before multiprocessing and it will still work. However, if I fit a model, even with n_jobs = 1, save it to pickle, then load it in a fresh process and call predict in the main process, multiprocessing will still freeze. Why would saving and loading the model break it…? The behavior is so inconsistent and unpredictable.

The XGBClassifier object stores the memory address of the native XGBoost object. So you cannot share the common XGBClassifier object across multiple processes, since processes have different address spaces. Deep copy is necessary to run prediction on multiple processes.

Could you maybe provide a minimal example of this? Because I’ve tried using both deepcopy(xgb) and xgb._Booster.copy(), and neither works. Do I create the deepcopy in each job individually, or could I create one per worker somehow?
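
Here’s roughly what I tried with deepcopy, for reference (just a sketch; predict_chunk is an illustrative helper of mine, not anything from xgboost):

from multiprocessing import Pool
from copy import deepcopy
from xgboost import XGBClassifier
import numpy as np

n_samples = 1000
x = np.random.rand(n_samples, 10)
y = np.random.randint(0, 2, n_samples)

xgb = XGBClassifier(n_jobs=1)
xgb.fit(x, y)

def predict_chunk(chunk):
    # Give each call its own copy instead of sharing the parent's object
    local_clf = deepcopy(xgb)
    return local_clf.predict_proba(chunk)

with Pool(4) as p:
    result = np.vstack(p.map(predict_chunk, np.array_split(x, 10)))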

This problem seems to occur even with two completely separate XGB models, which makes me wonder how exactly a deep copy would resolve it.

It seems that adding XGBClassifier().fit(np.array([[0]]), [0]) in front of the code snippet above makes things work. Because who needs logic…
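
For anyone running into the same thing: the only change is this warm-up fit at the top, before the pickle load and before any other use of XGBoost (sketch; the dummy data is arbitrary):

from xgboost import XGBClassifier
import numpy as np

# Throwaway warm-up fit in the main process; with this in place the
# Pool-based prediction from the snippet above no longer hangs for me
XGBClassifier().fit(np.array([[0]]), [0])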

@MvdSteeg90 I created a new issue to track the multiprocessing issue: https://github.com/dmlc/xgboost/issues/4246