How can I train a model on a ~400 GB dataset?

I have a dataset with around 150 million rows and 250 columns. It comes to approximately 405 GB in RAM. Using less data is not an option.

I’ve successfully trained on smaller datasets using Dask with XGBoost to distribute the data across 4 GPUs, but this dataset is far larger. Training it that way would be prohibitively expensive, and I don’t even think AWS has an instance with enough GPUs.

Is there a way of loading and training the dataset in chunks (and still getting the same or a similar answer as if I had trained on it all at once)?

No, XGBoost requires access to the whole dataset in memory. You may try the external memory feature, which keeps only part of the data in memory at a time, but keep in mind that it is experimental.

I don’t understand; your answer was essentially no, but also yes.

External memory is not really loading the data in chunks. It’s more like the “virtual memory” feature in an OS, where the hard drive is used as an extension of main memory. So there will be many accesses to the hard drive throughout the training process.

I am a bit hesitant to recommend this feature because 1) it will be slow (due to the use of the hard disk) and 2) it may break. There are many unresolved bugs around it that we currently lack the means to address. If this task is mission-critical for you, you should consider getting a big EC2 instance with lots of main memory and using the CPU for training.
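For reference, invoking the external memory mode looks roughly like the sketch below. The file name and cache prefix are placeholders, and the exact syntax depends on your XGBoost version (newer releases replace the cache-suffix form with an iterator-based API):

```python
import xgboost as xgb

# Hypothetical file. The "#dtrain.cache" suffix asks XGBoost to build an
# on-disk cache and page data in from the hard drive during training
# instead of keeping the full dataset in RAM.
dtrain = xgb.DMatrix("train.libsvm#dtrain.cache")

params = {"tree_method": "hist", "objective": "binary:logistic"}
bst = xgb.train(params, dtrain, num_boost_round=200)
```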

Ah OK, that makes sense. Is there any way of estimating training time on a CPU for something that large?

No, there isn’t a convenient way to do that.

OK thanks. How is this done in a commercial setting? Would professionals just use a massive GPU cluster to do this?

The short answer is yes.

Enterprises already maintain in-house clusters with many CPU cores. Spark and Dask are popular frameworks for managing these clusters, and GPU-enabled clusters are becoming increasingly common.

An alternative is to sign a contract with a cloud vendor like AWS to reserve a large number of GPUs in advance. Each EC2 instance only has a few GPUs, so the trick is to launch many EC2 instances.

In the end, you should always ask whether using GPUs gives you better performance per dollar. Sometimes it is better to use CPUs only, to reduce management overhead (e.g. a single EC2 instance with lots of RAM vs. many EC2 instances with GPUs).
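To make the cluster route concrete, here is a rough sketch of distributed training with XGBoost’s Dask interface. The scheduler address, S3 path, and column name are hypothetical, and the same pattern runs on GPU workers if you switch the tree method to the GPU histogram algorithm:

```python
from dask.distributed import Client
import dask.dataframe as dd
from xgboost import dask as dxgb

# Connect to an existing Dask cluster; the workers could be many EC2
# instances, each holding only a slice of the full dataset.
client = Client("tcp://scheduler-address:8786")  # hypothetical address

# Read the data lazily as partitioned Dask DataFrames (hypothetical path).
df = dd.read_parquet("s3://my-bucket/training-data/")
X = df.drop(columns=["target"])
y = df["target"]

# Each worker trains on its local partitions; gradient statistics are
# aggregated across workers, so no single machine needs all 400 GB.
dtrain = dxgb.DaskDMatrix(client, X, y)
output = dxgb.train(
    client,
    {"tree_method": "hist", "objective": "binary:logistic"},
    dtrain,
    num_boost_round=200,
)
booster = output["booster"]
```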

@krissyfond You should also consider using algorithms that let you load only a small chunk of data at a time, such as logistic regression or neural networks.
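For example, scikit-learn’s SGDClassifier with a logistic loss is logistic regression trained by stochastic gradient descent, and its partial_fit method accepts the data one chunk at a time. The CSV path, chunk size, and “target” column below are placeholders:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# partial_fit lets you stream the dataset from disk chunk by chunk,
# so only one chunk ever sits in memory.
model = SGDClassifier(loss="log_loss")  # use loss="log" on older scikit-learn
classes = np.array([0, 1])              # assuming a binary target

# Hypothetical file and column names.
for chunk in pd.read_csv("train.csv", chunksize=1_000_000):
    y = chunk["target"].to_numpy()
    X = chunk.drop(columns=["target"]).to_numpy()
    model.partial_fit(X, y, classes=classes)
```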

Thank you for all the advice. I’ll give it a go with a huge CPU-only instance on AWS.

It’s a complicated dataset with a very small signal and lots of noise… I’ve only had success with gradient boosting so far, as neural networks are just too slow and logistic regression doesn’t seem to be sophisticated enough.