XGBoost distributed training is slow!


#1

I added some debug code to analyze the issue and found that AllReduce synchronization is time-consuming.

I tested with 10 GB of data. One iteration takes 40 seconds,
and the total AllReduce time adds up to nearly 20 seconds.

So, how can I solve this issue, and how can I speed up training on large datasets?
Any reply is welcome, thanks!


#2

How many features (columns) does your data have? The amount of AllReduce communication increases linearly with the number of features.
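
As a rough illustration of that linear scaling, here is a back-of-the-envelope sketch. It assumes histogram-based split finding, where each worker contributes one gradient sum and one hessian sum per bin per feature for every tree node. The function name, the 256 bins per feature, and the 8 bytes per statistic are illustrative assumptions, not values taken from XGBoost's source.

```python
def allreduce_bytes_per_node(num_features, bins_per_feature=256, bytes_per_stat=8):
    """Rough size in bytes of one tree node's gradient histogram.

    Hypothetical model: every feature has the same number of bins, and
    each bin stores two statistics (gradient sum and hessian sum).
    """
    stats_per_bin = 2  # gradient sum and hessian sum
    return num_features * bins_per_feature * stats_per_bin * bytes_per_stat


# Ten times as many features means ten times as many bytes to synchronize.
small = allreduce_bytes_per_node(1_000)
large = allreduce_bytes_per_node(10_000)
print(large / small)
```

Real payloads depend on the tree method, the tree depth, and how sparse features are binned, so treat this only as a way to reason about the trend.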


#3

Hi hcho3,

Thanks for your reply!
There are 50,000+ features, with about 300-500 non-zero features per sample.
So is this time as expected?

Thanks!


#4

And how many workers are you using?


#5

I have 1,000,000 samples. There are about 50,000 features in total (many one-hot encoded, many float-valued), but each sample has only about 300-500 non-zero features.

I used 3 workers.


#6

Are the workers connected through Ethernet? What’s the bandwidth of the connection?
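
To show why the bandwidth matters, here is a minimal sketch that converts a payload size and a link speed into a lower bound on transfer time. The function name and the example numbers are hypothetical, and the estimate ignores latency, protocol overhead, and the AllReduce algorithm itself.

```python
def transfer_seconds(payload_bytes, link_gbit_per_s):
    """Lower bound on the time to push payload_bytes over one link."""
    bits = payload_bytes * 8
    return bits / (link_gbit_per_s * 1e9)


# Example: 1 GB of synchronized data over 1 Gbit/s vs. 10 Gbit/s Ethernet.
print(transfer_seconds(1_000_000_000, 1.0))
print(transfer_seconds(1_000_000_000, 10.0))
```

Comparing an estimate like this with the measured AllReduce time can hint at whether the network link is the bottleneck.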