Kernel crashes with no error message when I train on my 100GB dataset

I am using the following code to create my DaskDMatrix and train on it. The kernel crashes without an error message on the last line:

import dask.distributed
import dask.dataframe as dd
import xgboost as xgb

cluster = dask.distributed.LocalCluster(n_workers=48, threads_per_worker=1)
client = dask.distributed.Client(cluster)

# Convert the in-memory pandas DataFrame into Dask DataFrames
xTrain = dd.from_pandas(db.iloc[:, 1:], chunksize=1000)
yTrain = dd.from_pandas(db.iloc[:, 0:1], chunksize=1000)

dTrain = xgb.dask.DaskDMatrix(client=client, data=xTrain, label=yTrain)
params = {'tree_method': 'hist', 'objective': 'reg:squarederror'}
reg = xgb.dask.train(client, params, dTrain, num_boost_round=numRounds, verbose_eval=1)

The dataset is ~100GB. I am training on an AWS instance with 480GB of RAM and 48 CPUs. I have no idea how to fix this since there is no error message.

Have you tried using the regular xgb.train() (no Dask)? Since you are using a single machine with multiple CPU cores, Dask may not give you much performance benefit.

As for the error, I have no idea what happened either, since there is no error message.
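
For reference, here is a minimal single-machine sketch of what the non-Dask path could look like, assuming (as in your snippet) that db is the in-memory pandas DataFrame and numRounds is already defined:

import xgboost as xgb

# Plain single-machine training; the 'hist' method already uses all CPU cores via nthread
dTrain = xgb.DMatrix(db.iloc[:, 1:], label=db.iloc[:, 0])
params = {'tree_method': 'hist', 'objective': 'reg:squarederror', 'nthread': 48}
reg = xgb.train(params, dTrain, num_boost_round=numRounds)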

OK, I will try it without Dask. Is there an additional memory spike when the training actually begins?

I have asked before about the memory spike when creating a DeviceQuantileDMatrix; maybe there is another one when training starts that is maxing out my RAM? Running out of RAM is the only kind of failure I have come across so far that doesn’t produce an error message.

DeviceQuantileDMatrix is only supported when you are using the GPU algorithm (hence the “Device” prefix). It won’t help you here since you are using the CPU right now.

I am asking whether there is a memory spike when training starts, similar to the memory spike that occurs when making the DMatrix (GPU or CPU).

The short answer is yes. DeviceQuantileDMatrix would lessen memory usage somewhat, but unfortunately it is not supported by the CPU algorithm.
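
For context, the GPU path would look roughly like this, assuming a GPU instance and that X_train_gpu / y_train_gpu are hypothetical cuDF or CuPy objects already resident on the device (so not applicable to the CPU setup in this thread):

import xgboost as xgb

# GPU-only sketch: DeviceQuantileDMatrix quantizes the data as it is constructed,
# avoiding the extra full copy a regular DMatrix makes
dTrain = xgb.DeviceQuantileDMatrix(X_train_gpu, label=y_train_gpu)
params = {'tree_method': 'gpu_hist', 'objective': 'reg:squarederror'}
reg = xgb.train(params, dTrain, num_boost_round=numRounds)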

OK, I moved to a bigger instance. 500GB of RAM was needed to cover the memory spikes from building the DMatrix and training the model. Thanks for your help.

I’m running into a similar issue processing an ~80GB dataset on a Coiled cluster (a hosted Dask cluster) with 20 workers (4 CPUs and 16GB of RAM each). My Jupyter notebook kernel dies without an error message when trying to create the XGBoost DMatrix using:

# Create the XGBoost DMatrix
dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)

I’ve tried the following without any luck:

  • processing only half the dataset
  • upgrading my workers’ RAM to 30GB

Any pointers on what I could tweak to make this work would be highly appreciated! Thanks for your time.