Help understanding memory spike when creating DMatrix

I would like more information on the memory spike when creating DMatrix. There doesn't seem to be any pattern to how big the spike is - sometimes it's 3x the size of my original dataset and sometimes it's >10x.

Also, I cannot load a Parquet file after I create a DMatrix without memory spiking and crashing, even though the same file loads with no issues beforehand.

See Why does DMatrix copy numpy data even when it meets C_CONTIGUOUS and float32 constraints?. Currently, XGBoost creates a new internal representation of the data when a DMatrix is constructed. If you are using a GPU for training, you can use DeviceQuantileDMatrix to avoid the additional copy, as it builds the data representation "in place". Unfortunately, we don't yet support in-place construction for the CPU algorithms.
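For reference, here is a minimal sketch contrasting the two constructors. It assumes a CUDA-capable GPU, cupy installed, and an XGBoost version that still exposes DeviceQuantileDMatrix (newer releases rename it QuantileDMatrix); the data shapes are illustrative only.

```python
import cupy as cp
import numpy as np
import xgboost as xgb

# Illustrative data only; in practice this would be your training set.
X_cpu = np.ascontiguousarray(np.random.rand(100_000, 50).astype(np.float32))
y_cpu = np.random.rand(100_000).astype(np.float32)

# Regular DMatrix: XGBoost builds a separate internal representation,
# so peak memory is roughly the original data plus that copy.
dtrain = xgb.DMatrix(X_cpu, label=y_cpu)

# GPU path: move the data to the device and build a DeviceQuantileDMatrix,
# which quantizes the input in place and skips the extra full copy.
X_gpu = cp.asarray(X_cpu)
y_gpu = cp.asarray(y_cpu)
dtrain_gpu = xgb.DeviceQuantileDMatrix(X_gpu, label=y_gpu)

booster = xgb.train({"tree_method": "gpu_hist"}, dtrain_gpu, num_boost_round=10)
```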


Also, there is this issue that’s potentially related: https://github.com/dmlc/xgboost/issues/6552