I have had trouble determining whether out-of-core training is possible with distributed XGBoost on Dask. By out-of-core I mean iteratively loading and training on, for example, a dataset that is larger than the available RAM on a single worker. I created an example below that is similar to the training example in the XGBoost Dask documentation.
I attempted to run the code below on a local cluster with a single worker limited to 500MB of memory. The total dataset is ~500MB and I created 20 chunks (10 for each array), so about 50MB per partition. Naively, I expected that each partition would be loaded onto the worker iteratively, XGBoost would train on a single partition, save the result back to disk, and then collect all results to create a single booster. However, when experimenting with this, the creation of the DaskDMatrix hangs and the worker fails and is restarted repeatedly. From reading the XGBoost docs it is unclear to me whether it is possible to train out-of-core with Dask as I am attempting, or whether I need to add additional workers so that my entire dataset (and the intermediate DMatrix) fits into the combined RAM of all my workers. (Note: I am actually interested in running on a cluster, but wanted to run a simple experiment to determine whether out-of-core training is possible.)
import dask.array as da
from dask.distributed import Client
import xgboost as xgb
client = Client(<address>)
client
num_obs = 1_000_000
num_features = 50
chunk_size = 100_000  # 10 chunks per array
X = da.random.random(size=(num_obs, num_features), chunks=(chunk_size, num_features))
y = da.random.random(size=(num_obs, 1), chunks=(chunk_size, 1))
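# Quick size check (added for clarity; assumes the default float64 dtype):
# X is 1_000_000 * 50 * 8 bytes ≈ 400MB in total and each 100_000-row chunk
# is ≈ 40MB, so ~500MB for the whole dataset (X plus y) is roughly right.
print(f"X total: {X.nbytes / 1e6:.0f} MB, per chunk: {X.nbytes / X.numblocks[0] / 1e6:.0f} MB")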
dtrain = xgb.dask.DaskDMatrix(client, X, y)
output = xgb.dask.train(
    client,
    {"verbosity": 2, "tree_method": "hist", "objective": "reg:squarederror"},
    dtrain,
    num_boost_round=4,
    evals=[(dtrain, "train")],
)
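For reference, a minimal sketch of how the memory-limited local cluster described above can be created with dask.distributed's LocalCluster (the single worker and 500MB limit match the experiment; threads_per_worker=1 is an assumption):

from dask.distributed import Client, LocalCluster

# Single worker capped at 500MB, matching the experiment above.
cluster = LocalCluster(
    n_workers=1,
    threads_per_worker=1,  # assumption; not specified above
    memory_limit="500MB",
)
client = Client(cluster)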