XGBoost performance scaling with RAM speeds

Hey guys, looking to benefit from the community’s wisdom here and possibly spark a bit of discussion.

Short version: does anyone know if the training time of CPU implementations of tabular learning algorithms (XGBoost in particular, but also LightGBM, TabNet, etc.) depends on RAM speed?

Longer version: I recently switched from an i7 12700KF CPU to an i9 13900K. On a somewhat heavy AutoGluon training run that used to take 4 hours (most of the time is spent in the algorithms above), I got a 1.6X speedup from the newer processor, which is great: training now takes 2.5 hours, so more trials per day of work. My RAM is a 2x32GB kit of DDR4 memory that can run overclocked at 3200MHz. However, when I installed the new CPU it defaulted back to 2133MHz. At that speed, training was far slower; I don’t recall the exact figure, but it was something like 50% as fast. After overclocking to 3200MHz, I got the 1.6X speedup.

There are thousands of RAM benchmarks for games (where RAM speed has a limited impact), but I’ve found none for ML. The closest I got was this video from LTT https://www.youtube.com/watch?v=b-WFetQjifc where he shows that it has a major impact for some productivity apps, but none of those are ML applications.

So my question is: are these algorithms’ training times sensitive to RAM bandwidth? More so for CPUs with higher core counts?

Yes, they depend on both CPU cache and RAM, and the size of the effect changes with the number of threads you are using.

From previous profiling results, memory bandwidth can be a bottleneck for the histogram kernel used by the XGBoost hist and approx tree methods, as well as by LightGBM. During the histogram build, multiple threads write to the histogram in memory in parallel. As a result, having more threads can put greater pressure on host memory, making it a bottleneck for training.
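To picture the access pattern described above, here is a minimal single-threaded sketch of a gradient histogram build in plain Python. It is only an illustration, not XGBoost's or LightGBM's actual kernel; the function name `build_histograms` and the data layout are assumptions made for this example. Every row touches one bin per feature, so the work is mostly scattered reads and writes rather than arithmetic, which is why memory can become the limit once the data no longer fits in cache.

```python
from typing import List


def build_histograms(
    binned: List[List[int]],  # binned[row][feature] = bin index of that value
    grad: List[float],        # per-row gradient
    hess: List[float],        # per-row hessian
    n_bins: int,
) -> List[List[List[float]]]:
    """Accumulate [sum_grad, sum_hess] per (feature, bin).

    Hypothetical illustration only; real libraries do this in optimized
    native code with several threads working in parallel.
    """
    n_features = len(binned[0])
    hist = [[[0.0, 0.0] for _ in range(n_bins)] for _ in range(n_features)]
    for row, g, h in zip(binned, grad, hess):
        for feature, bin_idx in enumerate(row):
            cell = hist[feature][bin_idx]  # scattered memory access
            cell[0] += g
            cell[1] += h
    return hist


# Tiny made-up input: 3 rows, 2 features, 2 bins.
hist = build_histograms(
    binned=[[0, 1], [1, 1], [0, 0]],
    grad=[1.0, 2.0, 3.0],
    hess=[0.5, 0.5, 0.5],
    n_bins=2,
)
```

With many threads each doing this over a different slice of rows, the number of memory requests in flight grows with the thread count, which matches the point about higher core counts adding pressure.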

Hey, thank you so much for answering and for those insights.
I’ve picked up a DDR5@6600MHz kit which should be arriving in a few days.
I’ll try to put together a simple benchmark for these algorithms and run it with my current DDR4, both without XMP (2133MHz) and with XMP (3200MHz), and then with the DDR5, without XMP (4800MHz, I think) and with XMP (6600MHz).
That should give us a few datapoints to understand the scaling. I’ll let you know when I have those ready!
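A rough sketch of what the timing part of such a benchmark could look like, using only the standard library. This is not the code from the repo mentioned later in the thread; `time_fit` and its arguments are hypothetical names chosen for this example.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_fit(fit_fn: Callable[[], object], repeats: int = 3) -> Dict[str, float]:
    """Call fit_fn `repeats` times and report wall-clock seconds.

    fit_fn is any zero-argument function that trains a model; it is a
    placeholder here, not part of any library.
    """
    timings: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        fit_fn()
        timings.append(time.perf_counter() - start)
    return {"min": min(timings), "median": statistics.median(timings)}


# Stand-in workload so the sketch runs without any ML library installed.
result = time_fit(lambda: sum(i * i for i in range(100_000)), repeats=3)
print(result)
```

In a real run, `fit_fn` would wrap a training call (for example XGBoost with `tree_method="hist"` and a fixed thread count) on a dataset of a given size, and the same script would be repeated once per RAM setting after changing it in the BIOS.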

Hey guys, I still don’t have the RAM with me yet, but I’ve put together this repo for the benchmark.
It’s very much a WIP and definitely has a lot to improve, but it should work as a basis.
Comments, issues and PRs welcome!

Interesting. Modern systems are a bit more complicated. For instance, AMD is known to put a huge CPU cache in some of their CPUs, which might mitigate the pressure on main memory.

Hey there, so I finished running the benchmark and have the results with me.
This is the benchmark’s repo.
There’s a SystemInfo folder with HWInfo screenshots showing system information for the different RAM speeds. The 3200MHz one is identical to the 2133 one but I forgot to save the pic.
These are the speedups vs DDR4@2133MHz for XGBoost with default settings and using tree_method=hist.

We see relatively large gains for bigger dataset sizes. The 2000000 x 10 one saw a 193% speedup, nearly twice as fast. For reference, moving from an i7 12700K to an i9 13900K gave me a 50% speedup in a real-world AutoGluon problem, so 193% is quite respectable.
For smaller problem sizes there’s no real difference, though. I guess RAM speed starts to matter once we saturate the CPU caches.
AutoGluon in particular, which was the longest running of all algorithms, does see big gains as well, and starting from smaller but still relatively big problem sizes (500000 x 10).
The trend is more or less similar across all algorithms, save XGBoost when not using the histogram method, which sees little gain at all across RAM speeds.
You can find results for all models in this spreadsheet. Change the model filter in the Speedups sheet to get the graphics for the different models. The linear regression timings are not super accurate because they take too little time to execute so don’t pay much attention to them.
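For anyone redoing the numbers from their own runs, a speedup relative to a baseline can be computed as baseline time divided by the time for each configuration. This is a sketch under that assumption; it is not necessarily the exact formula used in the spreadsheet, and the timings below are made up for illustration, not taken from the benchmark.

```python
from typing import Dict


def speedups(seconds_by_config: Dict[str, float], baseline: str) -> Dict[str, float]:
    """Return baseline_time / time for every configuration.

    A value of 2.0 means the run took half as long as the baseline.
    """
    base = seconds_by_config[baseline]
    return {name: base / secs for name, secs in seconds_by_config.items()}


# Hypothetical timings in seconds, NOT real benchmark results.
example = {"DDR4@2133": 100.0, "DDR4@3200": 80.0, "DDR5@6600": 50.0}
result = speedups(example, baseline="DDR4@2133")
print(result)
```

Reporting the plain ratio (1.25x, 2.0x) avoids the ambiguity between "X% faster" and "X% of the baseline speed".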

And that’s that. I guess the final answer to the question of whether investing in faster RAM is worth it depends on how big your problem is and how often you need to retrain. For more than 2M rows, I’d say yes, it’s a no-brainer; for 1M it may be worth it; for less than that, it’s likely not. Also, these results are valid for the 13900K with 32 threads. I don’t know how well they would carry over to other CPUs, but anyone is welcome to test with the benchmark.

Thank you for doing the detailed benchmarks! Quite informative. Would be great if we can find an appropriate way to reference this in the document. I’m sure it will be useful to many others.

@Ludecan I made a copy of your spreadsheet to my private drive, but won’t share it without your permission.

Hey there @jiamingy, sure, please make whatever use of this info you see fit. My idea with this was to benefit the community with a piece of information I wasn’t able to find anywhere else. The only thing, if you don’t mind: please cite the original repo so any improvements and discussion can happen there as well.

Which is the document in which you would be referencing the benchmarks?

We are likely to do some more optimization, especially for multi-target training, which is quite memory intensive. Your benchmark can be a good starting point.
