Kernel crashes with no error message when I train on my 100GB dataset

I am using the following code to create my DaskDMatrix and train on it. The kernel crashes without an error message on the last line:

import dask.distributed
import dask.dataframe as dd
import xgboost as xgb

cluster = dask.distributed.LocalCluster(n_workers=48, threads_per_worker=1)
client = dask.distributed.Client(cluster)

# Convert the in-memory pandas DataFrame into Dask DataFrames
xTrain = dd.from_pandas(db.iloc[:, 1:], chunksize=1000)
yTrain = dd.from_pandas(db.iloc[:, 0:1], chunksize=1000)

dTrain = xgb.dask.DaskDMatrix(client=client, data=xTrain, label=yTrain)
params = {'tree_method': 'hist', 'objective': 'reg:squarederror'}
reg = xgb.dask.train(client, params, dTrain, num_boost_round=numRounds, verbose_eval=1)

The dataset is ~100GB. I am training on an AWS instance with 480GB of RAM and 48 CPUs. I have no idea how to fix this since there is no error message.

Have you tried using the regular xgb.train() (no Dask)? Since you are using a single machine with multiple CPU cores, Dask may not give you much performance benefit.

As for the error, I have no idea what happened either, since there is no error message.
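
For reference, here is a minimal single-machine sketch of what the non-Dask path could look like, assuming (as in your snippet) that db is the in-memory pandas DataFrame and numRounds is already defined:

import xgboost as xgb

# Plain single-machine training; the 'hist' method already uses all CPU cores via nthread
dTrain = xgb.DMatrix(db.iloc[:, 1:], label=db.iloc[:, 0])
params = {'tree_method': 'hist', 'objective': 'reg:squarederror', 'nthread': 48}
reg = xgb.train(params, dTrain, num_boost_round=numRounds)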

OK, I will try it without Dask. Is there an additional memory spike when the training actually begins?

I have asked before about the memory spike when creating a DeviceQuantileDMatrix; maybe there is another one when training starts that is maxing out my RAM? Running out of RAM is the only kind of failure I have come across so far that doesn’t produce an error message.

DeviceQuantileDMatrix is only supported when you are using the GPU algorithm (hence the “Device” prefix). It won’t help you here since you are using the CPU right now.

I am asking whether there is a memory spike when training starts, similar to the memory spike that occurs when making the DMatrix (GPU or CPU).

The short answer is yes. DeviceQuantileDMatrix would lessen memory usage somewhat, but unfortunately it is not supported by the CPU algorithm.
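
For context, the GPU path would look roughly like this, assuming a GPU instance and that X_train_gpu / y_train_gpu are hypothetical cuDF or CuPy objects already resident on the device (so not applicable to the CPU setup in this thread):

import xgboost as xgb

# GPU-only sketch: DeviceQuantileDMatrix quantizes the data as it is constructed,
# avoiding the extra full copy a regular DMatrix makes
dTrain = xgb.DeviceQuantileDMatrix(X_train_gpu, label=y_train_gpu)
params = {'tree_method': 'gpu_hist', 'objective': 'reg:squarederror'}
reg = xgb.train(params, dTrain, num_boost_round=numRounds)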

OK, I moved to a bigger instance. 500GB of RAM was needed to cover the memory spikes from building the DMatrix and training the model. Thanks for your help.

I’m running into a similar issue processing an ~80GB dataset on a Coiled cluster (a hosted Dask cluster) with 20 workers (4 CPUs and 16GB of RAM each). My Jupyter notebook kernel dies without an error message when trying to create the XGBoost DMatrix using:

# Create the XGBoost DMatrix
dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)

I’ve tried the following without any luck:

  • processing only half the dataset
  • upgrading my workers’ RAM to 30GB

Any pointers on what I could tweak to make this work would be highly appreciated! Thanks for your time.