I want to get a general overview of how XGBoost4J on Spark works, specifically:
- How is the training data partitioned to the xgboost workers? Is the dataset simply partitioned randomly and equally across the workers?
- How do trees get constructed? If each xgboost worker only has a subset of the data, how can splits on the full dataset be considered? How exactly do the workers “team up” to construct a single tree? I assume the trees still have to be generated iteratively, not in parallel. Related: where is the current model ensemble stored while training? Does it reside on each node?
- If there is synchronization between the workers, which I think there must be, where does it happen? Would that occur on the master node of my Spark app? I noticed that while training, my master node has almost no memory load or CPU usage, so it seems like nothing ever happens there.
- What is XGBoost using the Spark node memory for? For a dataset of 107 GB on disk, run on a cluster with 8 nodes, XGBoost uses about 40 GB of physical memory and 140 GB of virtual memory on average per node while training, which is far more than one node's share of the dataset (roughly 13–14 GB). I'm getting errors like `std::bad_alloc` with certain datasets, and they seem to originate from memory requests in the XGBoost core code.
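For context, my training setup looks roughly like this (simplified; the column names, parameter values, and `trainDf` are illustrative placeholders, not my exact job):

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

// trainDf is assumed to be a DataFrame with a "features" vector column
// (e.g. built with a VectorAssembler) and a numeric "label" column.
val xgb = new XGBoostClassifier(Map(
  "objective"   -> "binary:logistic",
  "num_round"   -> 100,
  "max_depth"   -> 8,
  "num_workers" -> 8,   // one xgboost worker per node on my 8-node cluster
  "nthread"     -> 4    // threads per worker; placeholder value
)).setFeaturesCol("features").setLabelCol("label")

val model = xgb.fit(trainDf)
```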