Training thousands of XGBoost models in parallel

Hi all,

I am looking to train thousands of XGBoost models in parallel. For my specific use case, it is more efficient to have each model be trained single-threaded. Right now, I am parallelizing in Python using multiprocessing. Roughly:

with concurrent.futures.ProcessPoolExecutor() as pool:
    results = list(, x_list))  # train_model fits one model per element of x_list

I know that Python is not the best at parallelism, and for other tasks I have gotten big speedups by instead parallelizing in a lower-level language like Rust. It looks like XGBoost doesn’t have an official Rust binding, but I was considering trying to write my code in C instead. Unfortunately, I barely know C at all, so before I jumped into learning a new language, I was wondering if anyone has experimented with this and if it is likely that re-writing my loop in C would bring non-trivial speedups.

Thanks a lot!

I had the exact same issue (slow training of many small models) but luckily found a great way around it: Ray. See Ray Core Quickstart -> Core: Parallelizing Functions with Ray Tasks. In my case it gave an enormous speedup training hundreds of small models at the same time versus training them serially. IME it gives the best speedup when all models train for around the same amount of time; if run times are uneven (e.g. because of early stopping), the gain over serial training shrinks.
I did not try Dask or other frameworks, but I imagine they would work similarly well.
Note that in my own case I did not specify the number of threads to use for each model. Letting Ray figure out on its own how to schedule the dynamic workload seems to work fine.

After some testing, I see a modest (1.31x) speed-up fitting multiple XGBoost models in parallel on one machine using threading (e.g. ThreadPoolExecutor). I suspect that xgboost or one of its dependencies releases the GIL down in the C code. For example:

I see a modest (1.5x) speed-up using multiprocessing on one machine (via ProcessPoolExecutor), making sure only the parameters are passed between processes to avoid IPC overhead (i.e. data loading and model saving happen inside each worker process). For example:

It requires careful tuning of the number of worker threads/processes and of the n_jobs parameter used to fit each model. It is also worth disabling OpenMP/BLAS threading via OMP_NUM_THREADS, just in case (as far as I can tell it only matters at inference time).
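Concretely, the thread caps mentioned above can be set like this (values are illustrative; the key point is to set the environment variables before numpy/xgboost are imported, since the native libraries read them when initialising their thread pools):

```python
import os

# Cap native thread pools so each worker's OpenMP/BLAS threads
# don't fight the process/thread pool for cores.
os.environ["OMP_NUM_THREADS"] = "1"       # OpenMP (used by xgboost)
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # BLAS, if numpy is OpenBLAS-backed
```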