OMP_NUM_THREADS not working correctly for training

merleyc · December 29, 2020, 12:03am

Hi,

I am setting OMP_NUM_THREADS=36, but the XGBoost C++ code always uses my maximum CPU count (72). I noticed this when I printed nthread in the C++ code, for example in src/common/quantile.cc:
int nthread = omp_get_max_threads(); --> edited this line after posting it.

This happens in each of these scenarios:

Setting it inside my Python code: os.environ['OMP_NUM_THREADS'] = "36"
Setting it in my environment: export OMP_NUM_THREADS=36
Setting it when calling the DMLC command line:
PYTHONPATH=~/xgboost/python-package/ ~/xgboost/dmlc-core/tracker/dmlc-submit --cluster=local --num-workers=1 OMP_NUM_THREADS=36 python3 myfile.py
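
A possible cause worth ruling out in the first scenario: assuming the OpenMP runtime reads OMP_NUM_THREADS when it is first loaded, assigning os.environ after xgboost (or another OpenMP-using library) has been imported may have no effect. A minimal sketch of setting the variable first; set_omp_threads is a hypothetical helper name, not part of XGBoost:

```python
import os
import sys
import warnings


def set_omp_threads(n):
    """Set OMP_NUM_THREADS for this process and any child processes.

    Hypothetical helper (not an XGBoost API). It warns when xgboost is
    already imported, because the runtime may have been initialized by then.
    """
    if "xgboost" in sys.modules:
        warnings.warn("xgboost is already imported; OMP_NUM_THREADS may be ignored")
    os.environ["OMP_NUM_THREADS"] = str(n)
    return os.environ["OMP_NUM_THREADS"]


set_omp_threads(36)
# import xgboost as xgb  # import only after the variable has been set
```

Whether this ordering matters on a given build is an assumption to verify, for example by printing omp_get_max_threads() as done above.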

Details:
Centos 7
lscpu: CPU(s): 72

My script:
import xgboost as xgb
xgb.rabit.init()
dtrain = xgb.DMatrix()
param = {
    'verbosity': 3,
    'n_estimators': 4,
    'max_depth': 8,
    'max_leaves': 256,
    'reg_alpha': 0.9,
    'learning_rate': 0.1,
    'gamma': 0.1,
    'subsample': 1.0,
    'reg_lambda': 1.0,
    'scale_pos_weight': 2.0,
    'min_child_weight': 30.0,
    'max_bin': 16,
    'tree_method': 'hist',
    'objective': 'multi:softmax',
    'num_class': 3,
    'grow_policy': 'lossguide',
    'numWorkers': 1
}
watchlist = [(dtrain, 'train')]
# num_round is not defined in the posted snippet
bst = xgb.train(param, dtrain, num_round, watchlist)
xgb.rabit.finalize()

I understood that when setting OMP_NUM_THREADS=36, it should always use 36 threads to train the model. Is there a bug with OMP_NUM_THREADS?
Please advise if I am not using the variable correctly.

There is a past related issue: Predicting from multiple jobs & threading issues

Thank you.

hcho3 · December 28, 2020, 10:35pm

@merleyc Are you using the latest version of XGBoost? Also, have you tried explicitly specifying the nthread parameter?

merleyc · December 29, 2020, 12:04am

Hi
Yes, I just tried setting nthread=36 and, separately, njobs=36 in my Python script, and neither worked.
My C++ code is still printing 76 at this line: int nthread = omp_get_max_threads(); --> edited this line after posting it.

The latest commit I have is:
commit 0c85b90671a06f702bbe7489a126176642513b17
Author: Philip Hyunsu Cho chohyu01@cs.washington.edu
Date: Sun Nov 22 05:49:09 2020 -0800

Thanks

hcho3 · December 28, 2020, 11:57pm

@merleyc We pushed a bug fix for controlling number of threads: https://github.com/dmlc/xgboost/pull/6186. This is part of XGBoost 1.3.0 version, which is newer than the version you are currently using.

merleyc · December 29, 2020, 12:03am

After doing git pull origin master, now I have the latest commit, which is:

commit 610ee632ccceafd266f1dea8c8ebfed051f044a1 (origin/master, origin/HEAD)
Author: Jiaming Yuan jm.yuan@outlook.com
Date: Mon Dec 28 21:36:03 2020 +0800

Results:

I printed nthread=1 when using the script I pasted in my original post above (without setting either n_jobs or nthread in the Python script) and calling the DMLC command line:
PYTHONPATH=~/xgboost/python-package/ ~/xgboost/dmlc-core/tracker/dmlc-submit --cluster=local --num-workers=1 OMP_NUM_THREADS=36 python3 myfile.py
Setting 'nthread': 36 in the Python script, I printed nthread=1.
Setting 'n_jobs': 36 in the Python script, I printed nthread=1.

Should the nthread be 36 when printing it from int nthread = omp_get_max_threads();?

Thanks.

hcho3 · December 29, 2020, 1:01am

In your code, you are setting dtrain = xgb.DMatrix(), which will lead to an empty DMatrix. Please use a matrix with a sufficient number of rows.

merleyc · December 29, 2020, 1:05am

Sorry, I omitted the content in the post, but I am using:
dtrain = xgb.DMatrix('/train.csv?format=csv&label_column=0')

where my dataset has ncols=18 and nrows=8000.
Thanks.

hcho3 · December 29, 2020, 1:07am

@merleyc Here is a screenshot from my machine, with htop:

Notice that all the cores are active.

import xgboost as xgb
import numpy as np

X_train = np.random.rand(10000000, 10)
y_train = np.random.rand(10000000, 1)
dtrain = xgb.DMatrix(X_train, label=y_train)

params = {'objective': 'reg:squarederror', 'tree_method': 'hist'}
bst = xgb.train(params, dtrain, 100, evals=[(dtrain, 'train')])

On the other hand, setting nthread=4 will result in the following CPU usage, with only 4 cores active:

Can you check how many cores are being utilized with htop?
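
For reference, the nthread=4 variant of the script above could look roughly like this. The helper names (make_params, train_with_threads) and the smaller default row count are illustrative assumptions, not part of the original script:

```python
def make_params(nthread):
    """Parameters from the script above, plus an explicit thread count."""
    return {
        "objective": "reg:squarederror",
        "tree_method": "hist",
        "nthread": nthread,  # caps the number of threads XGBoost uses
    }


def train_with_threads(nthread, n_rows=1_000_000, num_round=100):
    """Train on random data with a fixed number of threads."""
    # Imported lazily so make_params stays usable without numpy/xgboost
    import numpy as np
    import xgboost as xgb

    X_train = np.random.rand(n_rows, 10)
    y_train = np.random.rand(n_rows)
    dtrain = xgb.DMatrix(X_train, label=y_train)
    return xgb.train(make_params(nthread), dtrain, num_round,
                     evals=[(dtrain, "train")])
```

Calling train_with_threads(4) while watching htop should then show roughly 4 busy cores, if the setting is honored.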

merleyc · December 29, 2020, 10:40pm

It works for me as well by setting nthread=4 in my python script:

However, it does not work when I run my script with the command line below:
PYTHONPATH=~/xgboost/python-package/ ~/xgboost/dmlc-core/tracker/dmlc-submit --cluster=local --num-workers=1 OMP_NUM_THREADS=4 python3 myscript.py

I will set nthread in the script from now on. But just to understand, should OMP_NUM_THREADS also work the same way as nthread?

Thanks!

hcho3 · December 29, 2020, 11:56pm

OMP_NUM_THREADS should work, although we officially recommend that users set the nthread parameter instead.

Also, we suggest using the Dask API for distributed training with Python. The old method of using the DMLC tracker is hard to set up, so the XGBoost team is currently investing heavily in Dask as a better alternative for distributed training.
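
As a rough idea of what the Dask route looks like, here is a minimal sketch on a single-machine cluster. It assumes dask, distributed, and a Dask-enabled xgboost are installed; the random data, chunk sizes, and the train_with_dask name are illustrative only, and using threads_per_worker to limit threads is an assumption to check against the Dask tutorials:

```python
PARAMS = {"objective": "reg:squarederror", "tree_method": "hist"}


def train_with_dask(n_workers=1, threads_per_worker=4, num_round=10):
    """Train on random data through the xgboost.dask interface."""
    # Imported lazily: requires dask, distributed, and xgboost
    import dask.array as da
    from dask.distributed import Client, LocalCluster
    from xgboost import dask as dxgb

    with LocalCluster(n_workers=n_workers,
                      threads_per_worker=threads_per_worker) as cluster:
        with Client(cluster) as client:
            X = da.random.random((100_000, 10), chunks=(10_000, 10))
            y = da.random.random(100_000, chunks=10_000)
            dtrain = dxgb.DaskDMatrix(client, X, y)
            # train() returns a dict holding the booster and the eval history
            output = dxgb.train(client, PARAMS, dtrain,
                                num_boost_round=num_round)
            return output["booster"]
```

Because LocalCluster starts worker processes, train_with_dask() is best called from under an if __name__ == "__main__": guard.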

Here are some tutorials to get started:

merleyc · December 30, 2020, 10:06pm

OMP_NUM_THREADS is not working for me but I am currently using the nthread param successfully.
Thanks for the help and links!
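
For anyone who still wants to drive the thread count from the environment, one workaround is to read OMP_NUM_THREADS in the script and pass it on as nthread explicitly. A minimal sketch; nthread_from_env is a hypothetical helper, and the params shown are only a subset of the ones in the first post:

```python
import os


def nthread_from_env(default=None):
    """Return OMP_NUM_THREADS as a positive int, or `default` otherwise."""
    raw = os.environ.get("OMP_NUM_THREADS", "")
    try:
        value = int(raw)
    except ValueError:
        # Unset, empty, or not a plain integer
        return default
    return value if value > 0 else default


params = {"tree_method": "hist", "objective": "multi:softmax", "num_class": 3}
nthread = nthread_from_env()
if nthread is not None:
    # Passed explicitly, so it no longer depends on how the launcher
    # forwards environment variables to OpenMP
    params["nthread"] = nthread
```

The resulting params dict can then be handed to xgb.train as usual.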