XGBoost not running multithreaded / parallel

Hi all,

I just installed XGBoost via pip3 install xgboost, set up a multiclass classification problem, and set nthread to -1, and I am still not getting any multicore performance. It installed a libxgboost.so in my .local, and it links against libgomp.so.1, but still no dice.

As a side note, I switched the backend to GPU on a P100, and nvidia-smi dmon showed no utilization of the graphics card. EDIT: fixed that; I forgot to set tree_method appropriately. Still no multithreaded CPU, though.
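
For context, this is roughly what my setup looks like (a minimal sketch with dummy data and made-up shapes; gpu_hist is the tree_method I set per the EDIT above):

import numpy as np
import xgboost as xgb

# Dummy multiclass data, just to illustrate the parameters
X = np.random.rand(1000, 20)
y = np.random.randint(0, 4, size=1000)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "multi:softprob",
    "num_class": 4,
    "nthread": -1,              # what I had set; see the replies below
    "tree_method": "gpu_hist",  # the fix from the EDIT above
}
booster = xgb.train(params, dtrain, num_boost_round=100)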

Do I need to compile from scratch?

@cptdime Can you run the following example and see if all cores are used?

#include <iostream>
#include <cstdint>
#include <omp.h>

int main(void) {
  const uint64_t N = 100000000000;  // big enough to keep the cores busy for a while
  uint64_t sum[8] = {0};            // one slot per thread
  // Each thread accumulates into its own slot, indexed by its thread id
  #pragma omp parallel for
  for (uint64_t i = 0; i < N; ++i) {
    sum[omp_get_thread_num()] += i;
  }
  // Print one partial sum per thread; with all cores used you should
  // see several distinct non-zero numbers
  for (int i = 0; i < omp_get_max_threads(); ++i) {
    std::cout << sum[i] << std::endl;
  }
  return 0;
}

Compile it by running g++ -o main main.cc -fopenmp -O3 -msse2.

It does not use more than one thread. What does that mean?

That means that OpenMP is not configured correctly on your machine. Can you post the output of the code? How many numbers were printed out?

Hi,

It means I’m an idiot. I didn’t know I had to set OMP_NUM_THREADS. Setting it to 8, I get the utilization I expect, but I have 88 threads, so I had to change the array size in that test program to avoid segmentation faults.
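
In case it helps anyone else: I set it in the shell (e.g. OMP_NUM_THREADS=88 ./main for the test binary). On the Python side it can, as far as I understand, also be set before importing xgboost, since libgomp reads the variable when it initializes:

import os

# Assumption: this must happen before the OpenMP runtime starts,
# i.e. before importing xgboost
os.environ["OMP_NUM_THREADS"] = "88"

import xgboost as xgb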

Thank you so much.

Oops, I just noticed the hard-coded 8 in the test program.

Anyway, does setting OMP_NUM_THREADS make XGBoost use all 88 threads?

No, it doesn’t seem so. I thought it was because I had tree_method set to a GPU one, but setting it back to auto still only uses one thread.

Try compiling from source. An easy way is to run

pip3 install --no-binary :all: xgboost

Also, I think you should set nthread to 0 to use all available threads; alternatively, try setting nthread=88.
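
In params that would look something like this (a sketch with dummy data; my understanding is that nthread=0 means “use all available threads”):

import numpy as np
import xgboost as xgb

X = np.random.rand(500, 10)
y = np.random.randint(0, 4, size=500)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "multi:softprob",
    "num_class": 4,
    "nthread": 0,  # 0 = all available threads; or an explicit count, e.g. 88
}
booster = xgb.train(params, dtrain, num_boost_round=50)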

That did it. I don’t know where I thought I got -1 from; I’ve been using many different packages. Even though that works, I still get 0 for all probabilities when I predict after building a model, but that’s off topic for this thread.

Thanks a million. That probably explains why I was also seeing single-core usage when using H2O & XGBoost.

One last thing: is there a way to compile in the GPU extensions with the pip3 command, or do I need to compile it normally?

The pip command only compiles the CPU code. You’ll need to use CMake to compile the GPU code.
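
Roughly, based on the XGBoost build docs (a sketch; exact steps may vary by version, but -DUSE_CUDA=ON is the flag that enables the GPU algorithms):

git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
mkdir build && cd build
cmake .. -DUSE_CUDA=ON
make -j4
cd ../python-package
python3 setup.py install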

Can I set nthread=10, or to some other limited number? The system admin will send me warning emails if I take all available cores on the server. Thanks.

Yes, you can use a limited number.

I set nthread=6 in params and passed it to xgb.train(), but it looks like only one core is used to run the model.

I have another model using XGBRegressor, and it takes 6 cores if I set n_jobs=6. By the way, is there a way to set a learning_rates list or learning rate decay with XGBRegressor?

Setting n_jobs for XGBRegressor is equivalent to setting nthread for xgboost.train().
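
For illustration, these two should behave the same with respect to threading (a sketch with dummy data):

import numpy as np
import xgboost as xgb

X = np.random.rand(500, 10)
y = np.random.rand(500)

# sklearn-style wrapper: n_jobs controls the thread count
reg = xgb.XGBRegressor(n_estimators=50, n_jobs=6)
reg.fit(X, y)

# native API: nthread in params controls the thread count
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"nthread": 6}, dtrain, num_boost_round=50)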

For your second question: yes, you can pass callbacks to XGBRegressor.fit(). See https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBRegressor.fit.
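
Something like this, if I’m reading the docs right (a sketch; reset_learning_rate is the callback I have in mind here, but double-check the name against your version):

import numpy as np
import xgboost as xgb

X = np.random.rand(500, 10)
y = np.random.rand(500)

# One learning rate per boosting round
rates = [0.1] * 25 + [0.05] * 25

reg = xgb.XGBRegressor(n_estimators=50, n_jobs=6)
reg.fit(X, y, callbacks=[xgb.callback.reset_learning_rate(rates)])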

It’s fine if I set a single learning_rate. The problem only happens when I set learning_rates to a list of decreasing rates.

Can you post a sample script with dummy data, to show that only one thread is used when a callback is added? This looks like a bug.

Here is a code snippet showing the issue.

  1. learn_rates_1k = [0.1]*10 + [0.05]*290 + [0.02]*400 + [0.01]*300
  2. booster = xgb.train(params, dtrain,
  3.                     # learning_rates=learn_rates_1k,
  4.                     num_boost_round=1000,
  5.                     evals=[(dtrain, "train")],  # early stopping needs an eval set
  6.                     early_stopping_rounds=20)

The code runs on multiple cores, but it runs on a single core if I un-comment line 3.

Which dataset should I use to reproduce the issue?