R-xgboost 0.71.2 multi-thread is much slower than 0.6.4

Hello,

This is related to:
MultiThreading not working with version 0.71.2 and
Big data will break the nthread setting in R-xgboost 0.71.2

Running the same model on R 3.5.0 and comparing 0.6.4.1 with 0.71-2 shows much worse timing for 0.71-2. I tested 0.6.4.1 with the approx and exact tree methods, and 0.71-2 with approx, exact, and hist. The machine has two CPUs (E5-2690), 14 cores each, for a total of 56 threads.

In general, on 0.71-2 neither exact nor approx manages to use as many threads as 0.6.4.1 does. Interestingly, hist does use a lot of cores, but it is still slower than exact and approx.

In terms of goodness of fit, the usual metrics (MAE, RMSE, etc.) are perfectly comparable across versions in my applications; I have not observed any deterioration.
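(For reference, the goodness-of-fit comparison is just plain MAE/RMSE on the predictions; a minimal sketch in Python, where `y_true` and `y_pred` are illustrative names:)

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean absolute error: average of |error|
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    # Root mean squared error: sqrt of average squared error
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])
print(mae(y_true, y_pred))   # 0.5
print(rmse(y_true, y_pred))  # ~0.6455
```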

library(xgboost)
library(microbenchmark)

set.seed(222)
N <- 2 * 10^5
p <- 350
x <- matrix(rnorm(N * p), ncol = p)
y <- rnorm(N)

bnc_exact <- microbenchmark(
  mymodel <- xgboost(data = x, label = y, nrounds = 5,
                     objective = "reg:linear", tree_method = "exact",
                     max_depth = 10, min_child_weight = 1, eta = 1,
                     subsample = 0.66, colsample_bytree = 0.33),
  times = 6)

bnc_approx <- microbenchmark(
  mymodel <- xgboost(data = x, label = y, nrounds = 5,
                     objective = "reg:linear", tree_method = "approx",
                     max_depth = 10, min_child_weight = 1, eta = 1,
                     subsample = 0.66, colsample_bytree = 0.33),
  times = 6)

bnc_hist <- microbenchmark(
  mymodel <- xgboost(data = x, label = y, nrounds = 5,
                     objective = "reg:linear", tree_method = "hist",
                     max_depth = 10, min_child_weight = 1, eta = 1,
                     subsample = 0.66, colsample_bytree = 0.33),
  times = 6)

For the above example:
On 0.6.4.1, approx and exact took ~4.5 s and ~3.0 s for their fastest runs, respectively.
On 0.71-2, approx and exact were comparable at ~14.4 s and ~14.1 s for their fastest runs; hist's fastest run was 57.5 s. On other applications with ~1.5M points the performance hit is nearly 10x.

The observed performance deterioration with exact and approx is problematic. I appreciate that dual-CPU multi-core systems are reasonably uncommon, but any help would be greatly appreciated. Are there any steps I could take to alleviate this performance hit?

I can downgrade my xgboost version to 0.6.4.1 for the time being, but could you please look into making newer versions of xgboost as fast as before?
(0.71.1 shows a minor decrease in performance compared to 0.6.4.1.)

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] microbenchmark_1.4-4 xgboost_0.6.4.1

It has been reported that hist scales poorly beyond 40 threads. See the performance pattern of XGBoost at https://public.tableau.com/shared/T3TTRW292?:toolbar=no&:display_count=no (ignore the Total column).

As for the behavior of exact and approx, let’s see if it can be reproduced. I’m not sure I have the bandwidth to tackle this before the 0.82 release, but I’d like to come back to it.

The machine has two CPUs (E5-2690), 14 cores each, for a total of 56 threads.

Is this a local machine? Would it exhibit similar performance characteristics as c5.9xlarge or c5.18xlarge EC2 instances?

Also, try setting the environment variable OMP_NUM_THREADS to force XGBoost to use all threads.
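For example (shell syntax; the thread count and script name below are placeholders for your setup):

```shell
# Ask the OpenMP runtime for a specific thread count before launching R
export OMP_NUM_THREADS=56
# Rscript your_benchmark.R   # placeholder for your actual benchmark script
echo "$OMP_NUM_THREADS"
```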

@hcho3: Thank you very much for your swift reply.

Particulars:

  1. Yes, it is a local machine.
  2. I do not know if it would exhibit similar characteristics. Probably, yes.
  3. Sorry, I forgot to mention that. I have tried setting OMP_NUM_THREADS and observed no difference. Thank you for suggesting it, though.
    Since tree_method = "hist" uses a lot of cores (but ends up treading water), I assume XGBoost can detect all the threads. When running with hist, CPU usage sits in the lower 5000% range, which is expected for this system. (@Xburtsch described something similar, with all cores engaged but none of them exceeding 50-60% usage.) With exact and approx, the load bounces between 400% and 1000%, signifying 4 to 10 fully engaged threads.
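(A quick way to confirm how many logical CPUs the runtime itself detects, independent of XGBoost; this only verifies detection, not actual utilisation:)

```python
import os
import multiprocessing

# Number of logical CPUs visible to this process; on the machine above
# this should report 56 if all hardware threads are exposed.
print(os.cpu_count())
print(multiprocessing.cpu_count())
```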

Thanks again for looking into this.

Do you see similar issues with Python, or is this problem specific to the R binding?

I have not tested this on Python. I will test it some time this week and let you know.

I tested this on Python, using xgboost 0.80, 0.72.1, and 0.6a2 installed on Python 3.6.3. Based on 0.80 and using the script:

import numpy as np
from xgboost import XGBRegressor

N = 2 * 10**5
p = 350
np.random.seed(1)

X = np.random.randn(N, p)
y = np.random.randn(N, 1)

model1 = XGBRegressor(nthread=56, max_depth=10, tree_method='exact',
                      subsample=0.66, learning_rate=1)
model1.fit(X, y)

model2 = XGBRegressor(nthread=56, max_depth=10, tree_method='hist',
                      subsample=0.66, learning_rate=1)
model2.fit(X, y)

model3 = XGBRegressor(nthread=56, max_depth=10, tree_method='approx',
                      subsample=0.66, learning_rate=1)
model3.fit(X, y)

exact was the fastest method; it consistently used 50+ threads. hist was the slowest method; it consistently used ~22 threads. approx performed similarly to exact.

Comparing between the versions and using the script:

import numpy as np
import xgboost as xgb
import time

N = 5 * 10**5
p = 350
np.random.seed(1)

X = np.random.randn(N, p)
y = np.random.randn(N, 1)

param = {'max_depth': 10, 'eta': 1, 'silent': 1, 'nthread': 56,
         'objective': 'reg:linear', 'subsample': 0.66,
         'eval_metric': 'rmse'}
dtrain = xgb.DMatrix(X, label=y)
nrounds = 10
for x in range(6):
    t0 = time.time()
    model1 = xgb.train(param, dtrain, nrounds)
    t1 = time.time()
    print(t1 - t0)

In terms of timing, 0.80 showed no noticeable differences compared with 0.72.1 and 0.6a2 (maybe 0.80 was a bit faster; all runs were ~14 s).

Thanks, this is valuable information. So there are two separate bugs:

  1. ‘hist’ is slow, regardless of version.
  2. ‘approx’ and ‘exact’ are slower in version 0.71 and higher when using the R package.

I posted an issue post to address the performance issue of ‘hist’: https://github.com/dmlc/xgboost/issues/3810

As for deterioration of ‘approx’ and ‘exact’, I am currently working on a bug fix. Hopefully I can get it in before the 0.82 release.

@hadjipantelis I ran your script on EC2 and here’s what I got:

c5.9xlarge                       'exact' run time (s)   'approx' run time (s)
latest master (commit d81fedb)   3.16                   3.16
0.71.2 (commit 1214081)          2.53                   3.49
0.71.1 (commit 098075b)          2.57                   3.52
0.6.4 (commit ce84af7)           2.91                   3.18

c5.18xlarge                      'exact' run time (s)   'approx' run time (s)
latest master (commit d81fedb)   3.23                   3.04
0.71.2 (commit 1214081)          2.28                   5.02
0.71.1 (commit 098075b)          2.50                   4.56
0.6.4 (commit ce84af7)           2.58                   4.47

So I’m not seeing the performance degradation you described (~3 s -> ~14 s).

This issue may be unique to physical, non-virtualized machines with dual processors. I’m afraid I can’t do much at this point since I don’t have access to such a machine.


@hcho3 as mentioned here (https://github.com/dmlc/xgboost/issues/3543), version 0.82.1 solved this performance issue for me. Thanks!