What factors affect the memory use of XGBoost?

AFAIK, xgboost provides two options: an in-memory version, which loads all the data into RAM, and an external memory version, which writes the data into cache files and loads one piece at a time during training. The external memory version is still in beta; if I decide to use it, I need to provide a cache file name (after ‘#’) as a suffix to the input file name.
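For example (as I understand it), with the Python package the cache prefix goes after the ‘#’ in the file path. A minimal sketch; the file names here are just placeholders:

```python
import xgboost as xgb

# In-memory version: the whole libsvm file is loaded into RAM
dtrain = xgb.DMatrix("train.libsvm")

# External memory version: the part after '#' is the prefix
# used for the on-disk cache files written during training
dtrain_ext = xgb.DMatrix("train.libsvm#dtrain.cache")
```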

So if I am right, the only way to keep the memory use of xgboost under a certain level is to estimate the scale of the data set beforehand and choose the right mode (in-memory or external memory) to run.

Is there an empirical formula that could help estimate the memory usage given the size of a data set as input?
Something like: n (number of non-empty entries) * factor = max_memory_usage
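Right now the closest I can get is measuring it empirically. A rough sketch of what I mean, assuming the Python package and psutil (both my additions, not part of xgboost); the data here is just a random placeholder:

```python
import os
import numpy as np
import psutil
import scipy.sparse as sp
import xgboost as xgb

def rss_mb():
    # Resident set size of the current process, in MB
    return psutil.Process(os.getpid()).memory_info().rss / 1e6

# Placeholder sparse data; swap in the real data set
X = sp.random(100_000, 1_000, density=0.01, format="csr")
y = np.random.randint(0, 2, size=X.shape[0])

before = rss_mb()
dtrain = xgb.DMatrix(X, label=y)
bst = xgb.train({"max_depth": 6, "tree_method": "hist"}, dtrain, num_boost_round=10)
after = rss_mb()

# Crude per-entry "factor": extra RSS divided by number of non-zero entries
print(f"nnz={X.nnz}, approx. bytes per non-zero entry: {(after - before) * 1e6 / X.nnz:.1f}")
```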
Thanks in advance!

I don’t know of an easy empirical formula for the memory usage, but one of the main parameters that affects memory use is the max_depth parameter of the trees, and, if you are using the histogram updater (which you should for large data sizes), the max_sketch_size.

A formula for memory usage would definitely have to include these two parameters.
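To make that concrete, a minimal sketch of where those knobs sit in the Python API. Note that I’m using max_bin here as the histogram granularity parameter exposed by tree_method=“hist”; the sketch size the reply refers to is the underlying concept, and the toy data is just to keep the snippet self-contained:

```python
import numpy as np
import xgboost as xgb

# Toy data only, to make the snippet runnable
X = np.random.rand(10_000, 50)
y = np.random.randint(0, 2, size=10_000)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "tree_method": "hist",  # histogram updater, recommended for large data
    "max_depth": 6,         # deeper trees -> more nodes per tree -> more memory
    "max_bin": 256,         # fewer bins -> smaller histograms/sketches
}
bst = xgb.train(params, dtrain, num_boost_round=50)
```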

@thvasilo Thank you for your reply! I realize maybe I was asking for a short answer to a complicated question. :sweat:

So I’d like to change my question to: What factors affect the memory use of XGBoost?
Thank you.