XGBoost performance improvement using ARM SVE intrinsics

Hi!

This discussion aims to explore the potential performance improvements of XGBoost by utilizing ARM Scalable Vector Extension (SVE) intrinsics. ARM SVE provides advanced vector processing capabilities that could enhance the efficiency and speed of XGBoost’s computations, especially on ARM-based architectures.

I’m looking into the following functions:

  1. RowsWiseBuildHistKernel in /xgboost/src/common/hist_util.cc
  2. PartitionBuilder in /xgboost/src/tree/common_row_partitioner.h
  3. CalcSplitGain in /xgboost/src/tree/hist/evaluate_splits.h
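For context, split evaluation in XGBoost follows the standard gain expression from the XGBoost paper, so I am assuming the hot arithmetic inside CalcSplitGain is essentially repeated evaluation of terms of the form $G^2 / (H + \lambda)$:

$$
\mathrm{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma
$$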

Based on the performance analysis, I see potential for improvement by replacing the scalar C++ code in the above-mentioned functions with SVE intrinsics.

I first tried to optimize the RowsWiseBuildHistKernel function by improving memory access patterns, leveraging parallelism, reducing redundant computation, and making better use of the CPU cache. Below are the modifications (a simplified sketch follows the list):

  1. Parallelism with OpenMP: Parallelized the outer loop so that rows are processed across multiple threads.
  2. Prefetch Condition: Modified the prefetch condition to ensure we do not access out-of-bounds memory.
  3. Loop Unrolling: Unrolled the inner loop, reducing loop control overhead.
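
For reference, here is a rough, self-contained sketch of what these three changes look like on a simplified row-wise histogram loop. This is not the actual RowsWiseBuildHistKernel code; the names (`grad`, `hess`, `gidx`, `hist`) and the per-thread reduction are placeholders I am using purely to illustrate the idea:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-in for a row-wise histogram build. Not XGBoost's real
// kernel: grad/hess/gidx/hist are illustrative placeholders.
void BuildHistSketch(const std::vector<float>& grad,    // gradient per row
                     const std::vector<float>& hess,    // hessian per row
                     const std::vector<uint32_t>& gidx, // bin index per (row, feature)
                     std::size_t n_rows, std::size_t n_features,
                     std::vector<double>* hist) {       // 2 doubles (grad, hess) per bin
  constexpr std::size_t kPrefetchRows = 8;
#pragma omp parallel
  {
    // 1. Parallelism: each thread accumulates into its own local histogram
    //    to avoid write conflicts on the shared one, then merges at the end.
    std::vector<double> local(hist->size(), 0.0);
#pragma omp for schedule(static) nowait
    for (std::size_t r = 0; r < n_rows; ++r) {
      // 2. Prefetch: fetch the bin indices of a row further ahead, with a
      //    bounds check so we never prefetch past the end of gidx.
      if (r + kPrefetchRows < n_rows) {
        __builtin_prefetch(&gidx[(r + kPrefetchRows) * n_features]);
      }
      const uint32_t* row = &gidx[r * n_features];
      const double g = grad[r], h = hess[r];
      std::size_t f = 0;
      // 3. Unrolling: process two features per iteration to cut loop overhead.
      for (; f + 1 < n_features; f += 2) {
        local[2 * row[f]]         += g;
        local[2 * row[f] + 1]     += h;
        local[2 * row[f + 1]]     += g;
        local[2 * row[f + 1] + 1] += h;
      }
      for (; f < n_features; ++f) {
        local[2 * row[f]]     += g;
        local[2 * row[f] + 1] += h;
      }
    }
    // Merge thread-local histograms into the shared one.
#pragma omp critical
    for (std::size_t b = 0; b < local.size(); ++b) {
      (*hist)[b] += local[b];
    }
  }
}
```

The thread-local histogram in the sketch is only there to keep the toy example race-free; XGBoost already manages its own per-thread histogram buffers, so treat this purely as an illustration of the three changes.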

This approach did not give me much of a performance gain, so I am now planning to shift the focus to SVE intrinsics.
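
To make that direction concrete, below is a minimal, self-contained example of the vector-length-agnostic SVE style: a predicated loop that sums an array of doubles (e.g. gradients). It is only a toy showing the intrinsics' programming model, not a drop-in replacement for any XGBoost function; the histogram update itself would likely also need SVE gather/scatter loads and stores, which I have not prototyped yet.

```cpp
#include <arm_sve.h>
#include <cstdint>

// Toy example of the vector-length-agnostic SVE style: a predicated loop
// summing an array of doubles. Compile with something like
// -march=armv8.2-a+sve on GCC or Clang.
double SumGradientsSve(const double* grad, int64_t n) {
  svfloat64_t acc = svdup_n_f64(0.0);
  int64_t i = 0;
  svbool_t pg = svwhilelt_b64_s64(i, n);         // predicate for active lanes
  while (svptest_any(svptrue_b64(), pg)) {
    svfloat64_t g = svld1_f64(pg, grad + i);     // predicated load
    acc = svadd_f64_m(pg, acc, g);               // accumulate active lanes only
    i += static_cast<int64_t>(svcntd());         // advance by the vector length
    pg = svwhilelt_b64_s64(i, n);
  }
  return svaddv_f64(svptrue_b64(), acc);         // horizontal reduction
}
```

If this direction looks promising, the next step would be to prototype the actual histogram and split-gain hot loops with SVE and benchmark them against the scalar kernels on SVE-capable hardware (e.g. Graviton3 or A64FX).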

I invite community members with experience in ARM architecture, SVE intrinsics, and performance optimization to contribute to this discussion. Your insights and expertise will be invaluable in evaluating the potential of this initiative and in planning the next steps.

Thank you in advance!