Hi,
I’m trying to call the predict function of my XGBClassifier (scikit-learn API) from multiple worker processes, and I’m running into several issues.
First and foremost, unrelated to the multiprocessing: I set the n_jobs parameter of my model to 1. However, when I run predict, I see a thread being created for every single estimator (144 in my case). This seems wrong: why does this happen, and can I do anything about it?
This becomes a real problem when I run predictions across e.g. 40 worker processes: it either throws a RuntimeError about not being able to create more threads, or just hangs silently.
Secondly, the documentation of the predict method (https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier.predict) mentions that it is not thread-safe. If I understand correctly, this may (or will?) cause issues when calling the same classifier’s predict function across multiple workers, correct? The suggested remedy is xgb.copy(), but copy() does not exist on the XGBClassifier. Is it meant to be called on the underlying booster? Would deepcopy(xgb_classifier) suffice?
Hope someone can help!
(Note: the reason I want to use multiprocessing instead of XGBoost’s built-in threading is that some other steps need to happen before the xgb call, and those are currently the bottleneck. Multiprocessing those steps separately is inefficient and also requires passing a lot of data back and forth between processes.)