Severely decreased performance when multiple xgboost processes are running

Hi all,

We’re currently running xgboost & openmp in a production environment via the R package. See: https://github.com/Displayr/flipMultivariates/blob/master/R/gradientboost.R#L147 for how we’re calling it.

We’ve noticed that if multiple xgboost processes are running at the same time, we get horrible runtime performance. As far as I can tell, the default xgboost behaviour is to spawn as many threads as there are cores on the machine, which causes a lot of contention. When running 4 xgboost processes on a 16-core machine, I get slowdowns of a factor of 30.

Is there a way to customize this behaviour? I saw that OpenMP has an OMP_DYNAMIC setting, but it doesn’t seem to help.

Has anyone else encountered such problems?

I observe lots of threads just busy waiting (threads calling do_spin()) around these openmp loops:


The majority of threads spend their time busy-waiting. Is there any way to cut down on that?

Try setting the environment variable OMP_NUM_THREADS to 1. This should force the OpenMP runtime to use a single thread per XGBoost process.
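For illustration, here is a rough sketch of how the thread count could be capped from the shell before launching the R processes. The numbers (16 cores, 4 processes) are taken from the scenario described above, and the variable names TOTAL_CORES, NUM_PROCESSES and THREADS_PER_PROCESS are made up for this example. Splitting the cores evenly is only one possible policy; using 1 as suggested above is the most conservative choice. OMP_WAIT_POLICY is a standard OpenMP variable that may reduce spinning, though I have not verified how much it helps in this particular setup.

```shell
# Assumed values from the scenario above: 4 concurrent XGBoost
# processes on a 16-core machine. Adjust to your own environment.
TOTAL_CORES=16
NUM_PROCESSES=4

# Split the cores evenly so the total thread count across all
# processes does not exceed the number of cores.
THREADS_PER_PROCESS=$(( TOTAL_CORES / NUM_PROCESSES ))
if [ "$THREADS_PER_PROCESS" -lt 1 ]; then
  THREADS_PER_PROCESS=1
fi

# Cap the OpenMP thread pool for every process started from this shell.
export OMP_NUM_THREADS="$THREADS_PER_PROCESS"

# Ask the OpenMP runtime to let waiting threads sleep instead of spinning.
export OMP_WAIT_POLICY=PASSIVE

echo "OMP_NUM_THREADS=$OMP_NUM_THREADS OMP_WAIT_POLICY=$OMP_WAIT_POLICY"
```

Any R process started from the same shell afterwards inherits these settings. The xgboost `nthread` parameter can also be set from the R call itself if a per-call limit is preferred.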