I would like to add more estimators to the pipeline so our customers can try
them out based on their requirements. Unfortunately, I am unable to do that
with XGBRegressor because of memory constraints. Here is a bit of context:
I was running some experiments with the native XGBRegressor to understand how
many resources are required depending on the size of the data, the number of
estimators, max_depth, etc., before starting the actual training process.
Unfortunately, increasing max_depth increases memory consumption dramatically.
For example, with max_depth=40 and n_estimators=3000, training consumed
~110 GB of memory, even though the data had shape (6666, 8750) and took up
only ~445 MiB. Moreover, the native cross-validation method with 5 folds
consumes more than 500 GB of memory. I am surprised that xgboost consumes
such a huge amount of memory.
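For concreteness, here is a minimal sketch of the kind of 5-fold run I mean,
using xgboost's native cv function (the objective, tree_method, and nthread
values are illustrative; the rest mirrors the numbers above):

import numpy as np
import xgboost as xgb

np.random.seed(42)
X = np.random.rand(6666, 8750)
y = np.random.rand(6666)

# This is the configuration that exceeded 500 GB of memory on my machine.
dtrain = xgb.DMatrix(X, label=y)
params = {
    "max_depth": 40,
    "tree_method": "hist",
    "nthread": 50,
    "objective": "reg:squarederror",  # illustrative choice
}
cv_results = xgb.cv(params, dtrain, num_boost_round=3000, nfold=5, seed=42)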
I was a bit skeptical, so I ran a small experiment to check whether the issue
lies in the scikit-learn wrapper. I trained models with XGBRFRegressor and
RandomForestRegressor using the same parameters, and it turned out that
RandomForestRegressor consumed only ~0.8 GB of memory and used all 50 threads.
I am not sure why XGBRFRegressor does not use all the threads and consumes so
much memory.
I would like to know whether the native train function of XGBRegressor creates
copies of the data. Or am I missing something?
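To make the question concrete: my understanding (which may be wrong) is that
fit() on the wrapper roughly reduces to the native calls below, and I would
like to know at which step, if any, the data gets copied:

import numpy as np
import xgboost as xgb

np.random.seed(42)
X = np.random.rand(6666, 8750)
y = np.random.rand(6666)

# Step 1: is X copied into the DMatrix here?
dtrain = xgb.DMatrix(X, label=y)

# Step 2: does train() itself make further copies, e.g. per thread?
params = {
    "max_depth": 40,
    "tree_method": "hist",
    "nthread": 50,
    "objective": "reg:squarederror",
}
booster = xgb.train(params, dtrain, num_boost_round=3000)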
Below is the code snippet for reproducing the issue:
import numpy as np
from xgboost import XGBRFRegressor
from sklearn.ensemble import RandomForestRegressor
np.random.seed(42)
X = np.random.rand(6666, 8750)
y = np.random.rand(6666)
# Training the below model took ~14.97 minutes. Also, only ~12 to 13 threads
# were used. The additional amount of memory consumed by this model was ~11 GB.
xgb_model = XGBRFRegressor(
    n_estimators=500,
    max_depth=10,
    tree_method="hist",
    n_jobs=50,
    random_state=42,
)
xgb_model.fit(X, y)

# Training the below model took 8.30 minutes. Also, all 50 threads were
# used. The additional amount of memory consumed by this model was ~0.8 GB.
rf_model = RandomForestRegressor(
    n_estimators=500,
    max_depth=10,
    n_jobs=50,
    random_state=42,
)
rf_model.fit(X, y)
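The timing and memory figures in the comments above came from watching the
process at the OS level. A minimal psutil-based sketch of one way to reproduce
that measurement (RSS delta around fit(), which is only a rough lower bound on
peak usage):

import time

import numpy as np
import psutil
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRFRegressor

proc = psutil.Process()

def fit_and_measure(model, X, y):
    # RSS delta around fit(); peak usage during training can be higher.
    rss_before = proc.memory_info().rss
    start = time.perf_counter()
    model.fit(X, y)
    minutes = (time.perf_counter() - start) / 60
    extra_gb = (proc.memory_info().rss - rss_before) / 1e9
    print(f"{type(model).__name__}: {minutes:.2f} min, ~{extra_gb:.2f} GB extra")

np.random.seed(42)
X = np.random.rand(6666, 8750)
y = np.random.rand(6666)
fit_and_measure(
    XGBRFRegressor(n_estimators=500, max_depth=10, tree_method="hist",
                   n_jobs=50, random_state=42),
    X, y,
)
fit_and_measure(
    RandomForestRegressor(n_estimators=500, max_depth=10, n_jobs=50,
                          random_state=42),
    X, y,
)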
I used the following library versions:
- xgboost==2.1.0
- scikit-learn==1.5.1
Here is a bit more information about the system:
- Processor: AMD EPYC 7702 64-Core
- Cores: 64
- Threads: 128
- Memory: 512 GB
- Architecture: x86_64
- OS: Ubuntu 22.04.4 LTS