XGBoost not running multithreaded / parallel

Hi all,

I just installed XGBoost via pip3 install xgboost, set up a multiclass classification problem, and set nthread to -1, and I am still not getting any multicore performance. It installed a libxgboost.so in my .local, and it links against libgomp.so.1, but still no dice.

As a side note, I switched the backend to GPU on a P100, and nvidia-smi dmon showed no utilization of the graphics card. EDIT: fixed that; I forgot to set tree_method appropriately. Still no multithreaded CPU, though.
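
For context, this is roughly what my setup looks like (a minimal sketch with dummy data and made-up shapes; gpu_hist is the tree_method I set per the EDIT above):

import numpy as np
import xgboost as xgb

# Dummy multiclass data, just to illustrate the parameters
X = np.random.rand(1000, 20)
y = np.random.randint(0, 4, size=1000)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "multi:softprob",
    "num_class": 4,
    "nthread": -1,              # what I had set; see the replies below
    "tree_method": "gpu_hist",  # the fix from the EDIT above
}
booster = xgb.train(params, dtrain, num_boost_round=100)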

Do I need to compile from scratch?

@cptdime Can you run the following example and see if all cores are used?

#include <iostream>
#include <cstdint>
#include <omp.h>

int main(void) {
  const uint64_t N = 100000000000;  // big enough to keep the cores busy for a while
  uint64_t sum[8] = {0};            // one slot per thread
  // Each thread accumulates into its own slot, indexed by its thread id
  #pragma omp parallel for
  for (uint64_t i = 0; i < N; ++i) {
    sum[omp_get_thread_num()] += i;
  }
  // Print one partial sum per thread; with all cores used you should
  // see several distinct non-zero numbers
  for (int i = 0; i < omp_get_max_threads(); ++i) {
    std::cout << sum[i] << std::endl;
  }
  return 0;
}

Compile it by running g++ -o main main.cc -fopenmp -O3 -msse2.

It does not use more than one thread. What does that mean?

That means that OpenMP is not configured correctly on your machine. Can you post the output of the code? How many numbers were printed out?

Hi,

It means I’m an idiot. I didn’t know I had to set OMP_NUM_THREADS. Setting it to 8, I get the utilization I expect, but I have 88 threads, so I had to change the array size in that test program to avoid segmentation faults.
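
In case it helps anyone else: I set it in the shell (e.g. OMP_NUM_THREADS=88 ./main for the test binary). On the Python side it can, as far as I understand, also be set before importing xgboost, since libgomp reads the variable when it initializes:

import os

# Assumption: this must happen before the OpenMP runtime starts,
# i.e. before importing xgboost
os.environ["OMP_NUM_THREADS"] = "88"

import xgboost as xgb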

Thank you so much.

Oops, I just noticed the hard-coded 8 in the test program.

Anyway, does setting OMP_NUM_THREADS make XGBoost use all 88 threads?

No, it doesn’t seem so. I thought it was because I had tree_method set to a GPU one, but setting it back to auto still only uses one thread.

Try compiling from source. An easy way is to run

pip3 install --no-binary :all: xgboost

Also, I think you should set nthread to 0 to use all available threads; alternatively, try setting nthread=88.
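
In params that would look something like this (a sketch with dummy data; my understanding is that nthread=0 means “use all available threads”):

import numpy as np
import xgboost as xgb

X = np.random.rand(500, 10)
y = np.random.randint(0, 4, size=500)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "multi:softprob",
    "num_class": 4,
    "nthread": 0,  # 0 = all available threads; or an explicit count, e.g. 88
}
booster = xgb.train(params, dtrain, num_boost_round=50)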

That did it. I don’t know where I thought I got -1 from; I’ve been using many different packages. Even though that works, I still get 0 for all probabilities when I predict after building a model, but that’s off topic for this thread.

Thanks a million. That probably explains why I was also seeing single-core usage when using H2O & XGBoost.

One last thing: is there a way to compile in the GPU extensions with the pip3 command, or do I need to compile it normally?

The pip command only compiles the CPU code. You’ll need to use CMake to compile the GPU code.
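
Roughly, based on the XGBoost build docs (a sketch; exact steps may vary by version, but -DUSE_CUDA=ON is the flag that enables the GPU algorithms):

git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
mkdir build && cd build
cmake .. -DUSE_CUDA=ON
make -j4
cd ../python-package
python3 setup.py install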

Can I set nthread=10, or to some other limited number? The system admin will send me warning emails if I take all available cores on the server. Thanks.

Yes, you can use a limited number.

I set nthread=6 in params and passed it to xgb.train(), but it looks like only one core is used to run the model.

I have another model using XGBRegressor, and it takes 6 cores if I set n_jobs=6. By the way, is there a way to set a learning_rates list or learning rate decay with XGBRegressor?

Setting n_jobs for XGBRegressor is equivalent to setting nthread for xgboost.train().
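
For illustration, these two should behave the same with respect to threading (a sketch with dummy data):

import numpy as np
import xgboost as xgb

X = np.random.rand(500, 10)
y = np.random.rand(500)

# sklearn-style wrapper: n_jobs controls the thread count
reg = xgb.XGBRegressor(n_estimators=50, n_jobs=6)
reg.fit(X, y)

# native API: nthread in params controls the thread count
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"nthread": 6}, dtrain, num_boost_round=50)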

For your second question: yes, you can pass callbacks to XGBRegressor.fit(). See https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBRegressor.fit.
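
Something like this, if I’m reading the docs right (a sketch; reset_learning_rate is the callback I have in mind here, but double-check the name against your version):

import numpy as np
import xgboost as xgb

X = np.random.rand(500, 10)
y = np.random.rand(500)

# One learning rate per boosting round
rates = [0.1] * 25 + [0.05] * 25

reg = xgb.XGBRegressor(n_estimators=50, n_jobs=6)
reg.fit(X, y, callbacks=[xgb.callback.reset_learning_rate(rates)])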

It’s fine if I set a single learning_rate. The problem only happens when I set learning_rates to a list of decreasing rates.

Can you post a sample script with dummy data, to show that only one thread is used when a callback is added? This looks like a bug.

Here is a code snippet showing the issue.

  1. learn_rates_1k = [0.1]*10 + [0.05]*290 + [0.02]*400 + [0.01]*300
  2. booster = xgb.train(params, dtrain,
  3.                     # learning_rates=learn_rates_1k,
  4.                     num_boost_round=1000,
  5.                     evals=[(dtrain, "train")],  # early stopping needs an eval set
  6.                     early_stopping_rounds=20)

The code runs on multiple cores, but it runs on a single core if I un-comment line 3.

Which dataset should I use to reproduce the issue?