How can I train a model on a ~400 GB dataset?

I have a dataset with around 150 million rows and 250 columns. It comes to approximately 405 GB in RAM. Using less data is not an option.

I’ve successfully trained on smaller datasets using Dask with XGBoost to distribute the data across 4 GPUs, but this dataset is far larger. Training it that way would be prohibitively expensive, and I don’t even think AWS has an instance with enough GPUs.

Is there a way of loading and training the dataset in chunks (and still getting the same or a similar answer as if I had trained on it all at once)?

No, XGBoost requires access to the whole dataset in memory. You may try the external memory feature, which keeps only part of the data in memory at a time, but keep in mind that it is experimental.

I don’t understand; your answer was essentially no, but also yes.

External memory is not really loading the data in chunks. It’s more like the “virtual memory” feature in an OS, where the hard drive is used as an extension of main memory. So there will be many accesses to the hard drive throughout the training process.

I am a bit hesitant to recommend this feature because 1) it will be slow (due to the use of the hard disk) and 2) it may break. There are many unresolved bugs around it that we currently lack the means to address. If this task is mission-critical for you, you should consider getting a big EC2 instance with lots of main memory and using the CPU for training.
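For reference, invoking the external memory mode looks roughly like the sketch below. The file name and cache prefix are placeholders, and the exact syntax depends on your XGBoost version (newer releases replace the cache-suffix form with an iterator-based API):

```python
import xgboost as xgb

# Hypothetical file. The "#dtrain.cache" suffix asks XGBoost to build an
# on-disk cache and page data in from the hard drive during training
# instead of keeping the full dataset in RAM.
dtrain = xgb.DMatrix("train.libsvm#dtrain.cache")

params = {"tree_method": "hist", "objective": "binary:logistic"}
bst = xgb.train(params, dtrain, num_boost_round=200)
```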

Ah OK, that makes sense. Is there any way of estimating training time on a CPU for something that large?

No, there isn’t a convenient way to do that.

OK thanks. How is this done in a commercial setting? Would professionals just use a massive GPU cluster to do this?

The short answer is yes.

Enterprises already maintain in-house clusters with many CPU cores. Spark and Dask are popular frameworks for managing these clusters, and GPU-enabled clusters are becoming increasingly common.

An alternative is to sign a contract with a cloud vendor like AWS to reserve a large number of GPUs in advance. Each EC2 instance only has a few GPUs, so the trick is to launch many EC2 instances.

In the end, you should always ask whether using GPUs gives you better performance per dollar. Sometimes it is better to use CPUs only, to reduce management overhead (e.g. a single EC2 instance with lots of RAM vs. many EC2 instances with GPUs).
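To make the cluster route concrete, here is a rough sketch of distributed training with XGBoost’s Dask interface. The scheduler address, S3 path, and column name are hypothetical, and the same pattern runs on GPU workers if you switch the tree method to the GPU histogram algorithm:

```python
from dask.distributed import Client
import dask.dataframe as dd
from xgboost import dask as dxgb

# Connect to an existing Dask cluster; the workers could be many EC2
# instances, each holding only a slice of the full dataset.
client = Client("tcp://scheduler-address:8786")  # hypothetical address

# Read the data lazily as partitioned Dask DataFrames (hypothetical path).
df = dd.read_parquet("s3://my-bucket/training-data/")
X = df.drop(columns=["target"])
y = df["target"]

# Each worker trains on its local partitions; gradient statistics are
# aggregated across workers, so no single machine needs all 400 GB.
dtrain = dxgb.DaskDMatrix(client, X, y)
output = dxgb.train(
    client,
    {"tree_method": "hist", "objective": "binary:logistic"},
    dtrain,
    num_boost_round=200,
)
booster = output["booster"]
```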

@krissyfond You should also consider using algorithms that let you load only a small chunk of data at a time, such as logistic regression or neural networks.
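For example, scikit-learn’s SGDClassifier with a logistic loss is logistic regression trained by stochastic gradient descent, and its partial_fit method accepts the data one chunk at a time. The CSV path, chunk size, and “target” column below are placeholders:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# partial_fit lets you stream the dataset from disk chunk by chunk,
# so only one chunk ever sits in memory.
model = SGDClassifier(loss="log_loss")  # use loss="log" on older scikit-learn
classes = np.array([0, 1])              # assuming a binary target

# Hypothetical file and column names.
for chunk in pd.read_csv("train.csv", chunksize=1_000_000):
    y = chunk["target"].to_numpy()
    X = chunk.drop(columns=["target"]).to_numpy()
    model.partial_fit(X, y, classes=classes)
```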

Thank you for all the advice. I’ll give it a go with a huge CPU-only instance on AWS.

It’s a complicated dataset with a very small signal and lots of noise… I’ve only had success with gradient boosting so far, as neural networks are just too slow and logistic regression doesn’t seem to be sophisticated enough.