Can someone give a brief overview of how distributed training with XGBoost4J-Spark works?

I want to get a general overview of how XGBoost4J on Spark works, specifically:

  1. How is the training data partitioned across the XGBoost workers? Is the dataset simply partitioned randomly and equally among them?

  2. How do trees get constructed? If each XGBoost worker only has a subset of the data, how can splits on the full dataset be considered? How exactly do the workers “team up” to construct a single tree? I assume the trees still have to be built iteratively, not in parallel. Related: where is the current model ensemble stored during training? Does a copy reside on each node?

  3. If there is synchronization between the workers, which I assume there must be, where does it happen? Does it occur on the master node of my Spark app? I noticed that while training, my master node has almost no memory or CPU load, so it seems like nothing ever happens there.

  4. What is XGBoost using the Spark node memory for? For a dataset that is 107 GB on disk, run on a cluster with 8 nodes, XGBoost uses about 40 GB of physical memory and 140 GB of virtual memory per node on average while training, which is much more than the dataset split eight ways. I’m getting errors like “std::bad_alloc” with certain datasets, and they seem to be related to memory requests from the XGBoost core code.

Not specific to Spark, but this article provides an excellent introduction to how XGBoost uses multiple worker processes (nodes) to fit a tree ensemble.

As for 1): XGBoost has no control over how the Dataset is distributed in Spark. You can call repartition or coalesce to control how the data is distributed among the workers before training starts, as in the sketch below.
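Here is a minimal sketch of that idea in Scala, assuming the xgboost4j-spark artifact is on the classpath and you already have a pre-assembled features vector column. The path, column names, parameter values, and worker count are placeholders, not recommendations:

```scala
import org.apache.spark.sql.SparkSession
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

val spark = SparkSession.builder().appName("xgb-repartition-sketch").getOrCreate()

// Spark decides the initial partitioning when the data is loaded;
// repartition (or coalesce) lets you line the partition count up with
// the number of XGBoost workers you plan to use.
val numWorkers = 8
val raw = spark.read.parquet("/path/to/training/data") // placeholder path
val train = raw.repartition(numWorkers)

val xgb = new XGBoostClassifier(Map(
  "objective" -> "binary:logistic",
  "num_round" -> 100
))
  .setNumWorkers(numWorkers)
  .setFeaturesCol("features") // assumes a vector column produced e.g. by VectorAssembler
  .setLabelCol("label")

val model = xgb.fit(train)
```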

As for 3): No, the master node does no work. The workers perform an AllReduce operation to synchronize among themselves, and then each worker performs its slice of the computation.
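To make the AllReduce idea concrete, here is a toy Scala illustration (not Rabit's actual protocol or data layout): each worker computes gradient-statistic histograms over its own partition, the histograms are summed element-wise across workers, and afterwards every worker holds the same global histogram, so all of them can evaluate the same candidate splits as if they had seen the full dataset.

```scala
object AllReduceSketch {
  type Histogram = Array[Double]

  // Element-wise sum of per-worker histograms: the result every worker
  // would hold after an AllReduce-sum.
  def allReduceSum(perWorker: Seq[Histogram]): Histogram =
    perWorker.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })

  def main(args: Array[String]): Unit = {
    // Three workers, each holding partial statistics for four histogram bins.
    val worker0 = Array(1.0, 0.5, 0.0, 2.0)
    val worker1 = Array(0.0, 1.5, 1.0, 0.5)
    val worker2 = Array(2.0, 0.0, 0.5, 1.0)

    val global = allReduceSum(Seq(worker0, worker1, worker2))
    // Every worker ends up with the same global histogram: 3.0, 2.0, 1.5, 3.5
    println(global.mkString(", "))
  }
}
```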

Thanks! That article is very helpful.