Is this a true statement?
When using distributed training in Python with
xgb.dask, the data distribution is “data parallel”, meaning that each worker has a subset of the training data’s rows. Locally on each worker, the process of training a model is “feature parallel”, which means that candidate splits for different features are evaluated in parallel using multithreading.
As a result, the same feature + split combinations may be evaluated on multiple workers (using their local pieces of the data).
I ask because I’m trying to form good expectations for how training time with
dask.xgb should change as I add more machines.
If my statement above is correct, I’d expect a decreasing rate of improvement from adding more workers because some duplicated work is being done (e.g. workers 1, 2, and 3 might all evaluate the split
feat_1 > 0.75 on their local piece of the data).
I have read the original xgboost paper, and saw this description of the
approx tree method used in distributed training:
In order to reduce the cost of sorting, we propose to store the data in in-memory units, which we called block. Data in each block is stored in the compressed column (CSC) format, with each column sorted by the corresponding feature value. This input data layout only needs to be computed once before training, and can be reused in later iterations
The block structure also helps when using the approximate algorithms. Multiple blocks can be used in this case, with each block corresponding to subset of rows in the dataset.
Different blocks can be distributed across machines, or stored on disk in the out-of-core setting. Using the sorted structure, the quantile finding step becomes a linear scan over the sorted columns.
As far as I can tell, the
DaskDMatrix isn’t equivalent to that and can’t be, since Dask collections are lazy loaded and
xgb.dask tries to avoid moving any data off of the worker it’s loaded onto.
I’m using these terms “data parallel” and “feature parallel” because that is how LightGBM describes them: https://lightgbm.readthedocs.io/en/latest/Features.html#optimization-in-parallel-learning.
Notes for reviewers
I have tried searching this discussion board, XGBoost issues, Stack Overflow, and the source code for
rabit and could not find an answer to this. Apologies in advance if this is covered somewhere already.
Thanks for your time and consideration!