I’m using an x1.32xlarge EC2 machine (128 CPUs, 1.9T memory) to train an XGBoost model on very large datasets (training = 700G, test = 200G). I’m using the XGBoost command line tool (similar to what demo/binary_classification/ shows). The datasets are processed libsvm files that the command line tool can read directly (they are small partitions contained in a training folder and a test folder). I have tried nthread=32, 50, and 126, but in every case the job had neither finished loading the data nor started the first iteration after 13+ hours.
I then tried a smaller dataset (training = 226G, test = 142G) with nthread=126 and nthread=32. Loading the data took about 2700 seconds. However, after that, the first iteration had still not finished after more than 30 hours! I also tried setting n_jobs=32, since nthread appears to be deprecated, but nothing changed. I also noticed that for the first 5 minutes CPU usage could be as high as 11000%, meaning that 110 CPUs were in use. After that, however, it stays at 100% CPU. In addition, memory usage increases slowly (about 1G per minute).
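To put those figures in perspective, here is a rough sketch of the arithmetic in Python. The helper names are made up for illustration, and the memory estimate assumes the 1G per minute rate stays constant, which I have not verified.

```python
# Back-of-the-envelope conversions for the figures quoted above.
# Helper names are hypothetical; they are not part of XGBoost.

def cores_from_cpu_percent(cpu_percent: float) -> float:
    """top-style CPU% counts 100% per fully busy core."""
    return cpu_percent / 100.0


def memory_growth_gb(rate_gb_per_min: float, hours: float) -> float:
    """Total growth in GB, ASSUMING the rate stays constant."""
    return rate_gb_per_min * hours * 60.0


print(cores_from_cpu_percent(11000))  # 110.0 cores busy in the first 5 minutes
print(cores_from_cpu_percent(100))    # 1.0 core busy afterwards
print(2700 / 60)                      # 45.0 minutes spent loading the data
print(memory_growth_gb(1.0, 30))      # 1800.0 GB over 30 hours
```

In other words, after the first few minutes only about one core appears to be busy, and at a steady 1G per minute the process would add roughly 1800G over 30 hours, which is close to the machine's 1.9T of memory.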
Here are the parameters I use:
booster = gbtree
objective = binary:logistic
eval_metric = logloss
eval_metric = auc
#nthread = 126
n_jobs = 32
verbosity = 3
The tree method is automatically selected as ‘approx’.
Does anyone have any idea why it is so slow?