R-xgboost 0.71.2 multi-thread is much slower than 0.6.4

Hello,

This is related to:
MultiThreading not working with version 0.71.2 and
Big data will break the nthread setting in R-xgboost 0.71.2

Running the same model on R 3.5.0 and comparing 0.6.4.1 with 0.71-2 shows much worse timing for 0.71-2. I tested 0.6.4.1 with the approx and exact tree methods, and 0.71-2 with approx, exact, and hist. The machine has two CPUs (E5-2690), 14 cores each, for a total of 56 threads.

In general, on 0.71-2 neither exact nor approx manages to use as many threads as 0.6.4.1 does. Interestingly, hist does use a lot of cores, but it is still slower than exact and approx.

In terms of goodness of fit, the usual metrics (MAE, RMSE, etc.) are perfectly comparable across versions in my applications; I have not observed any deterioration.
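(For reference, the goodness-of-fit comparison is just plain MAE/RMSE on the predictions; a minimal sketch in Python, where `y_true` and `y_pred` are illustrative names:)

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean absolute error: average of |error|
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    # Root mean squared error: sqrt of average squared error
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])
print(mae(y_true, y_pred))   # 0.5
print(rmse(y_true, y_pred))  # ~0.6455
```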

library(xgboost)
library(microbenchmark)

set.seed(222)
N <- 2 * 10^5
p <- 350
x <- matrix(rnorm(N * p), ncol = p)
y <- rnorm(N)

bnc_exact <- microbenchmark(
  mymodel <- xgboost(data = x, label = y, nrounds = 5,
                     objective = "reg:linear", tree_method = "exact",
                     max_depth = 10, min_child_weight = 1, eta = 1,
                     subsample = 0.66, colsample_bytree = 0.33),
  times = 6)

bnc_approx <- microbenchmark(
  mymodel <- xgboost(data = x, label = y, nrounds = 5,
                     objective = "reg:linear", tree_method = "approx",
                     max_depth = 10, min_child_weight = 1, eta = 1,
                     subsample = 0.66, colsample_bytree = 0.33),
  times = 6)

bnc_hist <- microbenchmark(
  mymodel <- xgboost(data = x, label = y, nrounds = 5,
                     objective = "reg:linear", tree_method = "hist",
                     max_depth = 10, min_child_weight = 1, eta = 1,
                     subsample = 0.66, colsample_bytree = 0.33),
  times = 6)

For the above example:
On 0.6.4.1, approx and exact took ~4.5 s and ~3.0 s for their fastest runs, respectively.
On 0.71-2, approx and exact were comparable at ~14.4 s and ~14.1 s for their fastest runs; hist's fastest run was 57.5 s. On other applications with ~1.5M points the performance hit is nearly 10x.

The observed performance deterioration with exact and approx is problematic. I appreciate that dual-CPU multi-core systems are reasonably uncommon, but any help would be greatly appreciated. Are there any steps I could take to alleviate this performance hit?

I can downgrade my xgboost version to 0.6.4.1 for the time being, but could you please look into making newer versions of xgboost as fast as before?
(0.71.1 shows a minor decrease in performance compared to 0.6.4.1.)

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] microbenchmark_1.4-4 xgboost_0.6.4.1

It has been reported that hist scales poorly beyond 40 threads. See the performance pattern of XGBoost at https://public.tableau.com/shared/T3TTRW292?:toolbar=no&:display_count=no (ignore the Total column).

As for the behavior of exact and approx, let’s see if it can be reproduced. I’m not sure I have the bandwidth to tackle this before the 0.82 release, but I’d like to come back to it.

The machine has two CPUs (E5-2690), 14 cores each, for a total of 56 threads.

Is this a local machine? Would it exhibit similar performance characteristics as c5.9xlarge or c5.18xlarge EC2 instances?

Also, try setting the environment variable OMP_NUM_THREADS to force XGBoost to use all threads.
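For example (shell syntax; the thread count and script name below are placeholders for your setup):

```shell
# Ask the OpenMP runtime for a specific thread count before launching R
export OMP_NUM_THREADS=56
# Rscript your_benchmark.R   # placeholder for your actual benchmark script
echo "$OMP_NUM_THREADS"
```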

@hcho3: Thank you very much for your swift reply.

Particulars:

  1. Yes, it is a local machine.
  2. I do not know if it would exhibit similar characteristics. Probably, yes.
  3. Sorry, I forgot to mention that. I have tried setting OMP_NUM_THREADS and observed no difference. Thank you for suggesting it, though.
    Since tree_method = "hist" uses a lot of cores (but ends up treading water), I assume XGBoost can detect all the threads. When running with hist, CPU usage sits in the lower 5000% range, which is expected for this system. (@Xburtsch described something similar, with all cores engaged but none of them exceeding 50-60% usage.) With exact and approx, the load bounces between 400% and 1000%, signifying 4 to 10 fully engaged threads.
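(A quick way to confirm how many logical CPUs the runtime itself detects, independent of XGBoost; this only verifies detection, not actual utilisation:)

```python
import os
import multiprocessing

# Number of logical CPUs visible to this process; on the machine above
# this should report 56 if all hardware threads are exposed.
print(os.cpu_count())
print(multiprocessing.cpu_count())
```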

Thanks again for looking into this.

Do you see similar issues with Python, or is this problem specific to the R binding?

I have not tested this on Python. I will test it some time this week and let you know.

I tested this on Python, using xgboost 0.80, 0.72.1, and 0.6a2 installed on Python 3.6.3. Based on 0.80 and using the script:

import numpy as np
from xgboost import XGBRegressor

N = 2 * 10**5
p = 350
np.random.seed(1)

X = np.random.randn(N, p)
y = np.random.randn(N, 1)

model1 = XGBRegressor(nthread=56, max_depth=10, tree_method='exact',
                      subsample=0.66, learning_rate=1)
model1.fit(X, y)

model2 = XGBRegressor(nthread=56, max_depth=10, tree_method='hist',
                      subsample=0.66, learning_rate=1)
model2.fit(X, y)

model3 = XGBRegressor(nthread=56, max_depth=10, tree_method='approx',
                      subsample=0.66, learning_rate=1)
model3.fit(X, y)

exact was the fastest method; it consistently used 50+ threads. hist was the slowest method; it consistently used ~22 threads. approx performed similarly to exact.

Comparing between the versions and using the script:

import numpy as np
import xgboost as xgb
import time

N = 5 * 10**5
p = 350
np.random.seed(1)

X = np.random.randn(N, p)
y = np.random.randn(N, 1)

param = {'max_depth': 10, 'eta': 1, 'silent': 1, 'nthread': 56,
         'objective': 'reg:linear', 'subsample': 0.66,
         'eval_metric': 'rmse'}
dtrain = xgb.DMatrix(X, label=y)
nrounds = 10
for x in range(6):
    t0 = time.time()
    model1 = xgb.train(param, dtrain, nrounds)
    t1 = time.time()
    print(t1 - t0)

In terms of timing, 0.80 showed no noticeable differences compared with 0.72.1 and 0.6a2 (maybe 0.80 was a bit faster; all runs were ~14 s).

Thanks, this is valuable information. So there are two separate bugs:

  1. ‘hist’ is slow, regardless of version.
  2. ‘approx’ and ‘exact’ are slower in version 0.71 and higher when using the R package.

I posted an issue post to address the performance issue of ‘hist’: https://github.com/dmlc/xgboost/issues/3810

As for deterioration of ‘approx’ and ‘exact’, I am currently working on a bug fix. Hopefully I can get it in before the 0.82 release.

@hadjipantelis I ran your script on EC2 and here’s what I got:

c5.9xlarge                       'exact' run time (s)   'approx' run time (s)
latest master (commit d81fedb)   3.16                   3.16
0.71.2 (commit 1214081)          2.53                   3.49
0.71.1 (commit 098075b)          2.57                   3.52
0.6.4 (commit ce84af7)           2.91                   3.18

c5.18xlarge                      'exact' run time (s)   'approx' run time (s)
latest master (commit d81fedb)   3.23                   3.04
0.71.2 (commit 1214081)          2.28                   5.02
0.71.1 (commit 098075b)          2.50                   4.56
0.6.4 (commit ce84af7)           2.58                   4.47

So I’m not seeing the performance degradation you described (~3 s -> ~14 s).

This issue may be unique to physical, non-virtualized machines with dual processors. I’m afraid I can’t do much at this point since I don’t have access to such a machine.


@hcho3 as mentioned here (https://github.com/dmlc/xgboost/issues/3543), version 0.82.1 solved this performance issue for me. Thanks!