Hi all, I am trying to figure out the best way to get data stored in Amazon S3 as a Parquet dataset (a columnar data format) into a DMatrix for xgboost training using as little memory as possible. My current approach is to load the data into a pandas DataFrame with awswrangler, convert that DataFrame to a numpy array, and then construct a DMatrix from that array. It turns out that when certain constraints are met (and they are in my case), you can get a numpy array from a DataFrame without any data copying. The problem comes when I try to build a DMatrix from that numpy array.
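For reference, here is a minimal sketch of the loading path I described; the S3 read is replaced by a locally built DataFrame so the snippet is self-contained (in the real pipeline the frame comes from awswrangler's `wr.s3.read_parquet`):

```python
import numpy as np
import pandas as pd

# In the real pipeline the frame comes from awswrangler, roughly:
#   import awswrangler as wr
#   df = wr.s3.read_parquet(path="s3://...")  # path is a placeholder
# A locally built frame stands in here so the sketch is runnable.
df = pd.DataFrame(
    np.random.default_rng(0).random((1000, 10), dtype=np.float32)
)

# When every column shares one dtype, the frame is backed by a single
# block, so to_numpy() can return a view rather than a copy (this is
# version-dependent; copy-on-write in newer pandas changes the details).
arr = df.to_numpy()
print(arr.dtype, arr.shape)  # float32 (1000, 10)
```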
Looking at the xgboost source code for processing numpy arrays:
the comment says: “Initialize data from a 2-D numpy matrix. If mat does not have order='C' (aka row-major) or is not contiguous, a temporary copy will be made. If mat does not have dtype=numpy.float32, a temporary copy will be made. So there could be as many as two temporary data copies; be mindful of input layout and type if memory use is a concern.” With this in mind, I made sure that my numpy array’s memory layout is C_CONTIGUOUS (row-major) and that its dtype is float32. I also ran experiments with memory profiling and confirmed that having dtype=float64 or a column-major memory layout leads to temporary data copies. The strange thing is that even with row-major memory layout (C_CONTIGUOUS in numpy) and float32 dtype, DMatrix construction still seems to make a copy of my data; at least, that is what I think I’m observing in the memory profile graph below:
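One cheap way to verify both preconditions before handing the array to DMatrix is to exploit the fact that `np.ascontiguousarray` returns the input object itself (no copy) when the array is already C-contiguous with the requested dtype; a small sketch:

```python
import numpy as np

# astype() here produces a fresh C-contiguous float32 array.
arr = np.random.rand(1000, 10).astype(np.float32)

assert arr.flags["C_CONTIGUOUS"]
assert arr.dtype == np.float32

# ascontiguousarray is a no-op when no copy is needed, so identity
# is a cheap guard against an accidental layout or dtype change:
checked = np.ascontiguousarray(arr, dtype=np.float32)
assert checked is arr  # same object, so no copy was made

# The DMatrix would then be built from `checked`, e.g.:
#   import xgboost as xgb
#   dmat = xgb.DMatrix(checked)
```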
The code that creates my dummy numpy data is in the blue brackets; the spike there is probably the downcast from float64 to float32. The code that creates the DMatrix runs in the green brackets. At the opening green bracket, memory usage is what it was just before the DMatrix was created, when only my numpy array exists (about 1100 MB). Memory usage roughly triples during DMatrix creation (to about 3300 MB), so I assume the data was copied from my numpy array into the DMatrix. The same pattern holds for various data sizes. Is this intended behavior? Is there any way around it? I will be dealing with datasets close to memory capacity, so I can’t afford data copies if at all possible.
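For anyone wanting to reproduce the "expected" copy behaviour without a full memory profiler: the temporary copy caused by a wrong dtype can be made visible with `tracemalloc` on the numpy side (numpy registers its buffer allocations with tracemalloc in recent versions). This is only a sketch of the measurement idea, not of what DMatrix does internally:

```python
import tracemalloc

import numpy as np

n_rows, n_cols = 1_000_000, 10
base = np.random.rand(n_rows, n_cols)  # float64, C-contiguous (~80 MB)

tracemalloc.start()
as32 = base.astype(np.float32)  # forced copy: dtype conversion
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# The traced allocation should be ~ n_rows * n_cols * 4 bytes,
# i.e. the size of the new float32 buffer (~40 MB here).
print(f"traced allocation ~ {current / 1e6:.0f} MB")
```

The same technique shows the layout-driven copy if you pass a Fortran-ordered array through `np.ascontiguousarray`. In my graphs above, though, the tripling happens even when neither of these copies should be triggered.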