I have a dataset with around 150M rows and 250 columns, which comes to approximately 405 GB in RAM. Using less data is not an option.
I’ve successfully used Dask with XGBoost to distribute training across 4 GPUs, but only on a much smaller dataset. Training the full dataset that way would be far too expensive; I don’t even think AWS has an instance with that many GPUs.
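
For reference, this is roughly the setup that worked on the smaller dataset (simplified; the file path, target column name, and hyperparameters here are placeholders, not my real ones):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf
import xgboost as xgb

# One Dask worker per GPU
cluster = LocalCUDACluster(n_workers=4)
client = Client(cluster)

# Placeholder path and target column
df = dask_cudf.read_parquet("data/train.parquet")
X = df.drop(columns=["target"])
y = df["target"]

dtrain = xgb.dask.DaskDMatrix(client, X, y)
output = xgb.dask.train(
    client,
    {"tree_method": "gpu_hist", "objective": "reg:squarederror"},
    dtrain,
    num_boost_round=500,
)
booster = output["booster"]
```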
Is there a way of loading and training the dataset in chunks (and still getting the same or a similar answer as if I had trained on everything at once)?
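
For example, I was imagining something like the sketch below, which continues boosting from the previous chunk's booster via the `xgb_model` argument of `xgb.train` (chunk paths, parameters, and the number of chunks are placeholders). Would this kind of incremental approach converge to roughly the same model, or is there a better pattern for this?

```python
import pandas as pd
import xgboost as xgb

params = {"tree_method": "gpu_hist", "objective": "reg:squarederror"}
booster = None  # no model yet on the first chunk

# Placeholder chunk files; in reality these would be splits of the full dataset
for i in range(20):
    chunk = pd.read_parquet(f"data/chunk_{i}.parquet")
    dtrain = xgb.DMatrix(chunk.drop(columns=["target"]), label=chunk["target"])
    booster = xgb.train(
        params,
        dtrain,
        num_boost_round=50,
        xgb_model=booster,  # continue boosting from the trees built so far
    )
```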