XGBoost distributed training is slow!

guanxiang · June 19, 2019, 10:18am

I added some debug code to analyze the issue and found that AllReduce synchronization is time-consuming.

I tested with 10 GB of data. One iteration takes 40 seconds,
and the AllReduce calls together account for nearly 20 seconds of that.

So how can I solve this issue, and how can I speed up training on large datasets?
Any reply is welcome, thanks!

hcho3 · June 26, 2019, 7:10pm

How many features (columns) does your data have? The amount of AllReduce communication increases linearly with the number of features.
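As a rough illustration of that linear scaling, here is a back-of-the-envelope sketch. It assumes a histogram-based tree method in which each feature contributes a fixed number of bins and each bin holds one gradient/hessian pair. The function name `histogram_payload_bytes` and the default values (256 bins, 16 bytes per bin) are illustrative assumptions, not figures taken from this thread or from XGBoost internals.

```python
def histogram_payload_bytes(n_features, max_bin=256, bytes_per_bin=16):
    """Rough size of one gradient histogram.

    Assumptions (hypothetical, for illustration only):
    - every feature has `max_bin` histogram bins
    - every bin stores one (gradient, hessian) pair of `bytes_per_bin` bytes
    """
    return n_features * max_bin * bytes_per_bin


# The payload grows in direct proportion to the number of features.
for n_features in (5_000, 50_000):
    size_mb = histogram_payload_bytes(n_features) / 1e6
    print(f"{n_features} features -> about {size_mb:.1f} MB per histogram")
```

Under these assumptions, ten times as many features means ten times as many bytes to synchronize at each AllReduce step.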

guanxiang · June 27, 2019, 2:27am

Hi hcho3,

Thanks for your reply!
There are 50,000+ features in total, with about 300-500 non-zero features per sample.
So is this time as expected?

Thanks!

hcho3 · June 27, 2019, 2:38am

And how many workers are you using?

guanxiang · June 28, 2019, 2:37am

I have 1,000,000 samples. Each sample has 300-500 non-zero features, out of about 50,000 features in total (many are one-hot encoded, many are float-valued).

I used 3 workers.

hcho3 · June 28, 2019, 2:46am

Are the workers connected through Ethernet? What’s the bandwidth of the connection?
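To see why bandwidth matters here, the sketch below estimates how long a single AllReduce might take on a given link. It uses the textbook ring AllReduce cost model, in which each worker sends roughly 2(n-1)/n times the payload. That model is an assumption for illustration and is not necessarily what XGBoost's communication layer does. The function name, the 200 MB payload, and the link speeds are all hypothetical.

```python
def ring_allreduce_seconds(payload_bytes, n_workers, bandwidth_gbit_per_s):
    """Estimate the transfer time of one AllReduce under a ring cost model.

    Assumptions (hypothetical, for illustration only):
    - each worker sends about 2 * (n - 1) / n * payload bytes
    - the link runs at full nominal bandwidth with no latency or overhead
    """
    bytes_per_second = bandwidth_gbit_per_s * 1e9 / 8
    traffic_bytes = 2 * (n_workers - 1) / n_workers * payload_bytes
    return traffic_bytes / bytes_per_second


# Hypothetical 200 MB payload, 3 workers, two common Ethernet speeds.
payload = 200_000_000
for gbit in (1.0, 10.0):
    seconds = ring_allreduce_seconds(payload, n_workers=3, bandwidth_gbit_per_s=gbit)
    print(f"{gbit:g} Gbit/s link: about {seconds:.2f} s per AllReduce")
```

Because this ignores latency and protocol overhead, it is a lower bound; measured times on a real cluster will be higher.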