XGBoost CLI runs extremely slowly on 700 GB of data and never runs in parallel


#1

Hi guys,

I’m using an x1.32xlarge EC2 machine (128 CPUs, 1.9 TB of memory) to train an XGBoost model on very large datasets (training = 700 GB, test = 200 GB). I’m using the XGBoost command line tool (similar to what demo/binary_classification/ shows). The datasets are preprocessed libsvm files that the command line tool can read directly (they are small partitions contained in a training folder and a test folder). I have tried nthread = 32, 50, and 126, but in every case the tool had still not finished loading the data or started the first iteration after 13+ hours.

I then tried a smaller dataset (training = 226 GB, test = 142 GB) with nthread = 126 and 32. It loaded the data in 2700 seconds. However, after that it never finished the first iteration, even after 30+ hours! I also tried setting n_jobs = 32, since I saw that nthread is deprecated, but nothing changed. I also noticed that for the first 5 minutes, CPU usage could be as high as 11000%, meaning about 110 CPUs were in use. After that, though, it stays at 100% CPU. Memory usage also increases slowly (about 1 GB per minute).

Here are the parameters I use:

booster = gbtree
objective = binary:logistic
eval_metric = logloss
eval_metric = auc
#nthread = 126
n_jobs = 32
verbosity = 3

The tree method is automatically selected as ‘approx’.

Does anyone have any idea why it is so slow??

Thanks!
Lassell


#2

What is the feature dimension of your training data?


#3

About 1500 features (and about 80M rows). Some cells may be missing, but the missing proportion is not high (should be < 5%). Also, it has now finished loading the data, but that alone took about 20 hours:

[12:15:07] 60350020x1520 matrix with 89273843048 entries loaded from /home/ubuntu/data/train/
[14:40:00] 19450000x1520 matrix with 28788975345 entries loaded from /home/ubuntu/data/test/
[14:40:00] INFO: src/cli_main.cc:198: Loading data: 73744.9 sec
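For context, here is a quick back-of-envelope estimate I did of how much RAM those two matrices alone would need once loaded (the 8 bytes per stored entry is just my assumption, a 4-byte feature index plus a 4-byte float value, not something from the log):

# Rough estimate only; bytes_per_entry is an assumption, not from the log.
train_entries = 89_273_843_048
test_entries = 28_788_975_345
bytes_per_entry = 8  # assumed: 4-byte index + 4-byte float value
gb = (train_entries + test_entries) * bytes_per_entry / 1e9
print(f"~{gb:.0f} GB just for the raw matrices")  # prints roughly 945 GB

So roughly half of the 1.9 TB box would be occupied by the raw matrices alone, before any training data structures are built.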


#4

I wonder if this is a fundamental limitation of the LIBSVM format. Have you tried loading your data using other tools (Pandas, scikit-learn, etc.)? The XGBoost core itself should have no issue handling large data. For instance, Uber uses XGBoost on big datasets with billions of rows: https://github.com/dmlc/rabit/pull/73. (They use Spark to process their big data.)
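If you want to isolate the parsing cost, a minimal sketch along these lines would time how long one partition takes with scikit-learn’s libsvm reader versus DMatrix construction (the partition file name below is hypothetical):

import time
import xgboost as xgb
from sklearn.datasets import load_svmlight_file

part = "/home/ubuntu/data/train/part-00000"  # hypothetical partition name

t0 = time.time()
X, y = load_svmlight_file(part)  # scikit-learn's libsvm text parser
print(f"libsvm parse: {time.time() - t0:.1f} s, shape={X.shape}")

t0 = time.time()
dtrain = xgb.DMatrix(X, label=y)  # build a DMatrix from the parsed sparse matrix
print(f"DMatrix construction: {time.time() - t0:.1f} s")

If the parse step dominates, the bottleneck is the text format rather than XGBoost itself.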


#5

I use libsvm because I thought it would be faster (and using the command line tool is really neat! =D), but I will try other formats too :grinning:
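One thing I’m planning to try (just a sketch, the file layout is hypothetical, and newer XGBoost versions may want an explicit "?format=libsvm" suffix on the path) is converting each partition to XGBoost’s binary DMatrix buffer once, so later runs skip the text parsing:

import glob
import os
import xgboost as xgb

for path in glob.glob("/home/ubuntu/data/train/*"):
    dmat = xgb.DMatrix(path)            # parse the libsvm text file once
    dmat.save_binary(path + ".buffer")  # write a binary buffer next to it
    print("converted", os.path.basename(path))

Reloading a .buffer file should be much faster than re-parsing the text; whether my CLI build accepts the buffers directly is something I still need to verify.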


#7

Have you considered distributed training for this problem? If disk read speed is the bottleneck, partitioning the data across machines might help greatly (you should get essentially a linear speedup in the number of machines).
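For example, here is a minimal sketch of what distributed training could look like with XGBoost’s Dask interface (the scheduler address, shapes, and random placeholder data below are all assumptions; in practice each worker would load its own partitions):

import dask.array as da
import xgboost as xgb
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # hypothetical Dask scheduler

# Placeholder data; replace with your real partitions loaded into Dask collections.
X = da.random.random((1_000_000, 1500), chunks=(100_000, 1500))
y = da.random.randint(0, 2, size=(1_000_000,), chunks=(100_000,))

dtrain = xgb.dask.DaskDMatrix(client, X, y)
output = xgb.dask.train(
    client,
    {"objective": "binary:logistic", "eval_metric": ["logloss", "auc"]},
    dtrain,
    num_boost_round=100,
)
booster = output["booster"]  # trained model, same format as single-node

(Spark via xgboost4j-spark is another option if your data already lives on a Spark cluster.)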


#8

Presumably you are storing your data on EBS? What volume type are you using? It’s possible you are throttled on I/O. See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html.
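A quick way to check (just a rough sketch of my own, not an official tool; the partition file name is hypothetical) is to time a raw sequential read of one data file and compare the MB/s against your volume’s provisioned throughput:

import time

path = "/home/ubuntu/data/train/part-00000"  # hypothetical partition file
chunk = 64 * 1024 * 1024                     # read in 64 MiB chunks
total = 0

t0 = time.time()
with open(path, "rb") as f:
    while True:
        buf = f.read(chunk)
        if not buf:
            break
        total += len(buf)
elapsed = time.time() - t0
print(f"read {total / 1e9:.1f} GB in {elapsed:.1f} s ({total / 1e6 / elapsed:.0f} MB/s)")

Note that if the file was read recently, the OS page cache will inflate the number, so use a file you haven’t touched in the current session.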