I added some debug code to analyze the issue and found that allreduce synchronization is expensive.
I tested with 10 GB of data: one iteration takes 40 seconds,
and the allreduce calls together add up to nearly 20 seconds, about half of the iteration time.
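
For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of accumulating the time spent in one section of an iteration. It is not the actual debug code used above. `SectionTimer` and `fake_allreduce` are hypothetical names made up for illustration, and the `time.sleep` calls only stand in for real communication and compute.

```python
import time
from contextlib import contextmanager


class SectionTimer:
    """Accumulates wall-clock time per named section (hypothetical helper)."""

    def __init__(self):
        self.totals = {}

    @contextmanager
    def section(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.totals[name] = self.totals.get(name, 0.0) + elapsed

    def fraction(self, name, total_name):
        # Share of total_name's accumulated time that was spent in name.
        return self.totals[name] / self.totals[total_name]


def fake_allreduce(values):
    # Assumption: stand-in for a real allreduce call; it only sleeps and sums locally.
    time.sleep(0.01)
    return sum(values)


timer = SectionTimer()
with timer.section("iteration"):
    for _ in range(3):
        with timer.section("allreduce"):
            fake_allreduce([1, 2, 3])
    time.sleep(0.01)  # stand-in for the local compute part of the iteration

print("allreduce share of iteration: %.2f" % timer.fraction("allreduce", "iteration"))
```

With the numbers above (nearly 20 seconds of allreduce in a 40 second iteration), the reported share would be close to 0.5.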
So, how can I solve this issue? And how can training on large datasets be sped up?
Any reply is welcome, thanks!