AFAIK, xgboost provides two options: an in-memory version, which loads all the data into RAM, and an external memory version, which writes the data to cache files and loads one piece at a time during training. The external memory version is still in beta; if I decide to use it, I need to provide a cache file name, appended to the input file name after a '#'.
So if I understand correctly, the only way to keep xgboost's memory usage under a certain level is to estimate the scale of the data set beforehand and choose the appropriate mode (in-memory or external memory) to run.
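To make sure I have the external memory convention right, here is how I understand the input path is built (file names here are just placeholders I made up; the part after '#' is the cache-file prefix, not an actual file on disk):

```python
# Building an external-memory input path for xgboost:
# the text after '#' names the cache-file prefix.
data_file = "train.libsvm"      # hypothetical training file
cache_prefix = "dtrain.cache"   # hypothetical cache prefix
external_memory_path = f"{data_file}#{cache_prefix}"
print(external_memory_path)  # train.libsvm#dtrain.cache
# This path would then be passed as the data argument,
# e.g. xgb.DMatrix(external_memory_path)
```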
Is there an empirical formula that could help estimate the memory usage given the size of the data set as input?
Something like: n (number of non-zero entries) * factor = max_memory_usage
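To illustrate the kind of formula I have in mind, here is a rough sketch. The per-entry byte count and the overhead multiplier are pure assumptions on my part, not values documented by xgboost:

```python
def estimate_max_memory_bytes(n_nonzero, bytes_per_entry=16, overhead=2.0):
    """Back-of-envelope estimate (assumed constants, not an
    xgboost-documented formula): each non-zero entry is stored
    roughly as a (feature index, value) pair, and 'overhead'
    accounts for working copies made during tree construction."""
    return int(n_nonzero * bytes_per_entry * overhead)

# e.g. 100 million non-zero entries -> 3.2e9 bytes (~3.2 GB)
# under these assumed constants
print(estimate_max_memory_bytes(100_000_000))
```

What I am really asking is whether realistic values for such a factor are known.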
Thanks in advance!