Create DMatrix without huge spike in memory

I have been loading a pandas DataFrame from a Parquet file, splitting it into train and test sets, and loading those into a DMatrix. The original file is around 5-6 GB, and when I load it into a DMatrix the memory usage momentarily spikes to 15-20 GB.

Is there a better way to store/load my dataset and slice it to prevent the memory spike? My problem is that I will soon run out of memory: I only have 25 GB and am already getting occasional crashes.
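For reference, roughly what I am doing at the moment (the file name and target column are placeholders):

```python
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Load the full dataset into a pandas DataFrame (~5-6 GB on disk)
df = pd.read_parquet("data.parquet")           # placeholder file name

X = df.drop(columns=["label"])                 # placeholder target column
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# The memory spike happens here: the pandas objects and the DMatrix's
# internal representation are all alive at the same time
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
```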

See the discussion in Why does DMatrix copy numpy data even when it meets C_CONTIGUOUS and float32 constraints?. In short, it is not currently possible to avoid the memory spike.


What if I save it in libSVM format first and load that right into DMatrix?

@krissy_fong That may be a possible way to reduce peak memory consumption.
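Something along these lines, for example (a rough sketch; file names are placeholders, and the conversion step can be run in a separate process so the pandas copy is released before training starts):

```python
import xgboost as xgb
from sklearn.datasets import dump_svmlight_file

# One-time conversion: X_train / y_train are the arrays or DataFrames
# from the train/test split described above
dump_svmlight_file(X_train, y_train, "train.libsvm")
dump_svmlight_file(X_test, y_test, "test.libsvm")

# Later, load the text files directly into DMatrix without going through
# pandas; recent XGBoost versions expect the explicit ?format= hint
dtrain = xgb.DMatrix("train.libsvm?format=libsvm")
dtest = xgb.DMatrix("test.libsvm?format=libsvm")
```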

@hcho3 Can you elaborate on why the memory spike occurs? For example, my feature names are long strings; perhaps changing them to integers would help, given that the libSVM format repeats the feature names over and over?

@krissy_fong No, the primary reason is that there are multiple copies of the data in memory: one for the pandas DataFrame and another for the internal representation of the DMatrix. XGBoost requires all data to be in a particular internal representation for optimal performance.
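One mitigation that may reduce (but not eliminate) the peak is to hand the DMatrix constructor a float32, C-contiguous numpy array and delete the pandas objects as early as possible, so fewer copies are alive at the same time. A rough sketch, reusing the variable names from the earlier snippet:

```python
import gc
import numpy as np
import xgboost as xgb

# Convert up front to the layout XGBoost wants (float32, C-contiguous)
X_np = np.ascontiguousarray(X_train.to_numpy(dtype=np.float32))
y_np = y_train.to_numpy(dtype=np.float32)

del X_train, y_train   # drop the pandas copies before building the DMatrix
gc.collect()

dtrain = xgb.DMatrix(X_np, label=y_np)

del X_np               # the DMatrix keeps its own internal copy
gc.collect()
```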

@hcho3 Hmmmm, ok. Could I use a Dask array to get around this issue, so that it doesn't try to load the whole array at the same time?

EDIT: it seems that learning to rank is not supported by the distributed version, so I am out of luck…
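For anyone who lands here with a non-ranking objective, the Dask route I had in mind looks roughly like this (a sketch assuming the xgboost.dask interface; file and column names are placeholders, and how much this actually lowers peak memory on a single machine depends on partition sizes):

```python
import xgboost as xgb
import dask.dataframe as dd
from dask.distributed import Client

client = Client()                              # local cluster by default

# Read the Parquet file lazily; partitions are loaded on demand
ddf = dd.read_parquet("data.parquet")          # placeholder file name
X = ddf.drop(columns=["label"])                # placeholder target column
y = ddf["label"]

# DaskDMatrix holds references to the lazy collections rather than one big array
dtrain = xgb.dask.DaskDMatrix(client, X, y)

output = xgb.dask.train(
    client,
    {"objective": "reg:squarederror", "tree_method": "hist"},
    dtrain,
    num_boost_round=100,
)
booster = output["booster"]
```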