Native RandomForestRegressor takes less time than XGBRFRegressor

I would like to add more estimators to the pipeline so our customers can try
them out based on their requirements. Unfortunately, I am unable to do that
with XGBRegressor because of memory constraints. Here is a bit of context:

I was trying to run some experiments with the native XGBRegressor in order to
understand how many resources are required based on the size of the data, the
number of estimators, max_depth, etc. before starting the actual training
process. Unfortunately, increasing max_depth increases memory consumption
dramatically. For example, with max_depth=40 and n_estimators=3000, training
consumed ~110 GB of memory, even though the data shape was only (6666, 8750),
i.e. ~445 MiB. Moreover, if I use the native cross-validation method with 5
folds, it consumes even more than 500 GB of memory. I am surprised that XGBoost
consumes such a huge amount of memory.
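Roughly, this is the kind of experiment I ran (a sketch from memory, not the
exact script; I am assuming xgboost.cv is what counts as the native
cross-validation here):

import numpy as np
import xgboost as xgb
from xgboost import XGBRegressor

np.random.seed(42)
X = np.random.rand(6666, 8750)
y = np.random.rand(6666)

# Single deep model: with these settings, memory climbed to ~110 GB.
deep_model = XGBRegressor(
    n_estimators=3000,
    max_depth=40,
    tree_method="hist",
    n_jobs=50,
    random_state=42,
)
deep_model.fit(X, y)

# Native 5-fold cross-validation: this pushed usage past 500 GB.
dtrain = xgb.DMatrix(X, label=y)
xgb.cv(
    params={"max_depth": 40, "tree_method": "hist", "nthread": 50},
    dtrain=dtrain,
    num_boost_round=3000,
    nfold=5,
)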

I was a bit skeptical, so I ran a small experiment to check whether the issue
lies in the sklearn wrapper. I trained models with XGBRFRegressor and
RandomForestRegressor using the same parameters, and it turned out that
RandomForestRegressor consumed only ~0.8 GB of memory and used all 50 threads.
I am not sure why XGBRFRegressor is not using all the threads and is consuming
so much memory.

I would like to know whether the native train function of XGBRegressor creates
copies of the data. Or am I missing something?
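To be explicit, this is the native path I am referring to (a rough sketch, not
my actual training script, using the X and y from the snippet below):

import xgboost as xgb

# The native training path in question: is X copied when the DMatrix is
# constructed, and again inside xgb.train?
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train(
    {"max_depth": 10, "tree_method": "hist", "nthread": 50},
    dtrain,
    num_boost_round=500,
)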

Below is the code snippet for reproducing the issue:

import numpy as np
from xgboost import XGBRFRegressor
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

X = np.random.rand(6666, 8750)
y = np.random.rand(6666)

# Training the below model took ~14.97 minutes. Also, only ~12 to 13 threads
# were used. The additional amount of memory consumed by this model was ~11 GB.
xgb_model = XGBRFRegressor(
    n_estimators=500,
    max_depth=10,
    tree_method="hist",
    n_jobs=50,
    random_state=42,
)
xgb_model.fit(X, y)

# Training the below model took ~8.30 minutes. Also, all 50 threads were
# used. The additional amount of memory consumed by this model was ~0.8 GB.
rf_model = RandomForestRegressor(
    n_estimators=500,
    max_depth=10,
    n_jobs=50,
    random_state=42,
)
rf_model.fit(X, y)

I used the following versions of the libraries:

  • xgboost==2.1.0
  • scikit-learn==1.5.1

Here is a bit more information about the system:

  • Processor: AMD EPYC 7702 64-Core
  • Cores: 64
  • Threads: 128
  • Memory: 512 GB
  • Architecture: x86_64
  • OS: Ubuntu 22.04.4 LTS

According to https://xgboost.readthedocs.io/en/stable/parameter.html:

Beware that XGBoost aggressively consumes memory when training a deep tree.

You may consider limiting max_depth to a lower number to reduce the amount of memory used.

Alternatively, you can limit the number of nodes in the trees, so that the resulting trees are less bushy and more sparse, by explicitly setting grow_policy='lossguide' and max_leaves to a reasonable number. Can you check how many nodes the RandomForestRegressor is generating per tree?
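For example, something along these lines should show the node counts (a sketch, assuming rf_model is the fitted forest from your snippet):

import numpy as np

# Number of nodes in each tree of the fitted scikit-learn forest.
node_counts = [est.tree_.node_count for est in rf_model.estimators_]
print("mean nodes per tree:", np.mean(node_counts))
print("max nodes per tree:", np.max(node_counts))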

In general, the memory footprint of XGBoost increases linearly with the number of nodes in the tree. So you may be able to grow deeper trees with a smaller memory footprint, as long as max_leaves is suitably set (to create a sparse tree).
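As a rough sketch of what I mean (the max_leaves value below is purely illustrative and would need tuning for your data):

from xgboost import XGBRFRegressor

xgb_model = XGBRFRegressor(
    n_estimators=500,
    max_depth=0,               # 0 = no depth limit; tree size is capped by max_leaves
    max_leaves=256,            # illustrative cap, tune for your data
    grow_policy="lossguide",
    tree_method="hist",
    n_jobs=50,
    random_state=42,
)
xgb_model.fit(X, y)

# Per-tree node counts of the fitted XGBoost forest, for comparison
# with the scikit-learn numbers above.
tree_df = xgb_model.get_booster().trees_to_dataframe()
print(tree_df.groupby("Tree").size().describe())

The trees_to_dataframe() dump lets you compare the per-tree node counts directly against the RandomForestRegressor figures.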