XGBoost CLI runs extremely slowly on 700 GB of data and never runs in parallel

Hi guys,

I’m using an x1.32xlarge EC2 machine (128 vCPUs, 1.9 TB of memory) to train an XGBoost model on very large datasets (training = 700 GB, test = 200 GB). I’m using the XGBoost command line tool (similar to what demo/binary_classification/ shows). The datasets are pre-processed LIBSVM files that the command line tool can read directly (they are small partitions stored in a training folder and a test folder). I have tried nthread = 32, 50, and 126, but none of those runs finished loading the data and started the first iteration even after 13+ hours.

I then tried a smaller dataset (training = 226 GB, test = 142 GB) with nthread = 126 and 32. It loaded the data in 2700 seconds, but then it still had not finished the first iteration after 30+ hours! I also tried setting n_jobs = 32, since I see nthread is deprecated, but nothing changed. I also noticed that for the first 5 minutes, CPU usage could be as high as 11000%, meaning about 110 CPUs were in use. After that, it stays at 100% CPU (a single core), and memory usage slowly increases by about 1 GB per minute.

Here are the parameters I use:

booster = gbtree
objective = binary:logistic
eval_metric = logloss
eval_metric = auc
#nthread = 126
n_jobs = 32
verbosity = 3

The tree method is automatically selected as ‘approx’.

Does anyone have any idea why it is so slow??

Thanks!
Lassell

What is the feature dimension of your training data?

About 1500 features (and about 80M rows). Some cells can be missing, but the missing proportion is not high, probably < 5%. Oh, also, it has just finished loading the data, but that took 20 hours:

[12:15:07] 60350020x1520 matrix with 89273843048 entries loaded from /home/ubuntu/data/train/
[14:40:00] 19450000x1520 matrix with 28788975345 entries loaded from /home/ubuntu/data/test/
[14:40:00] INFO: src/cli_main.cc:198: Loading data: 73744.9 sec

I wonder if this is a fundamental limitation of the LIBSVM text format. Have you tried loading your data with other tools (pandas, scikit-learn, etc.)? The XGBoost core itself should have no issue handling large data. For instance, Uber uses XGBoost on datasets with billions of rows: https://github.com/dmlc/rabit/pull/73. (They use Spark to process their big data.)
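For example, here is a rough, untested sketch of what I mean (the file path and n_features are placeholders, not your actual layout): parse the LIBSVM text once with scikit-learn, build an in-memory DMatrix, and cache it in XGBoost’s binary format so later runs skip the text parsing entirely. Your data is split across many partition files, so you would loop over them and stack the pieces; that part is omitted here.

# Sketch: parse the LIBSVM text once, then cache a binary DMatrix.
# The path and n_features below are placeholders for your own data.
import xgboost as xgb
from sklearn.datasets import load_svmlight_file

X, y = load_svmlight_file("/home/ubuntu/data/train/part-00000", n_features=1520)
dtrain = xgb.DMatrix(X, label=y, nthread=32)

# Reloading this binary file later is far faster than re-parsing LIBSVM text.
dtrain.save_binary("/home/ubuntu/data/train.buffer")

# On subsequent runs:
dtrain = xgb.DMatrix("/home/ubuntu/data/train.buffer")

I believe the command line tool can also load that binary buffer directly, but I haven’t verified it with your version, so treat that part as something to test.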

I used LIBSVM because I thought it would be faster (and using the command line tool is really neat! =D), but I will try other formats too :grinning:

Have you considered distributed training for this problem? If disk read speed is the bottleneck here, having the data partitioned across machines might help greatly (you should see an essentially linear speedup in the number of machines).
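As a concrete illustration, recent XGBoost versions ship a dask integration; the sketch below assumes a dask cluster and a Parquet copy of the data, neither of which is part of your current setup, so treat it as an outline rather than a recipe.

# Rough sketch of distributed training with xgboost.dask.
# Scheduler address, Parquet path, and the "label" column are placeholders.
from dask.distributed import Client
import dask.dataframe as dd
import xgboost as xgb

client = Client("tcp://scheduler:8786")  # hypothetical dask cluster

df = dd.read_parquet("s3://my-bucket/train/")  # hypothetical Parquet copy of the data
X = df.drop(columns=["label"])
y = df["label"]

dtrain = xgb.dask.DaskDMatrix(client, X, y)
result = xgb.dask.train(
    client,
    {
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "auc"],
        "tree_method": "hist",  # typically much faster than 'approx' at this scale
    },
    dtrain,
    num_boost_round=100,
)
booster = result["booster"]

Each worker only reads its own partitions, which is where the roughly linear speedup in the number of machines comes from.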

Presumably you are storing your data on EBS? What volume type are you using? It’s possible you are throttled on I/O. See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html.
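One quick sanity check: time a plain sequential read of one of the partition files and compare the number against the volume’s advertised throughput. A minimal sketch (the path is a placeholder; run it on a file that hasn’t been read recently, otherwise the OS page cache will inflate the result):

# Minimal sequential-read benchmark to compare against the EBS volume's
# advertised throughput. The path is a placeholder for one of the partitions.
import time

CHUNK = 64 * 1024 * 1024  # 64 MiB reads
path = "/home/ubuntu/data/train/part-00000"

start = time.time()
total = 0
with open(path, "rb") as f:
    while True:
        chunk = f.read(CHUNK)
        if not chunk:
            break
        total += len(chunk)

elapsed = time.time() - start
print(f"read {total / 1e9:.1f} GB at {total / elapsed / 1e6:.0f} MB/s")

If the measured rate is stuck at the volume type’s documented cap, data loading is I/O bound and a faster volume type (or instance-store NVMe) would help.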