How are cached DMatrix used in a Booster?

Both the C API and Python API allow for multiple DMatrix to be cached when creating a Booster:

// --- start XGBoost class
/*!
 * \brief create xgboost learner
 * \param dmats matrices that are set to be cached
 * \param len length of dmats
 * \param out handle to the result booster
 * \return 0 when success, -1 when failure happens
 */
XGB_DLL int XGBoosterCreate(const DMatrixHandle dmats[],
                            bst_ulong len,
                            BoosterHandle *out);
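To make the `dmats`/`len` signature concrete, here is a toy Python sketch of the three ways the cache argument can be populated (empty, one matrix, several matrices). All names here (`ToyBooster`, `ToyDMatrix`) are invented for illustration; this is not XGBoost code.

```python
class ToyDMatrix:
    """Stand-in for a DMatrix handle."""
    def __init__(self, rows):
        self.rows = rows

class ToyBooster:
    """Toy stand-in for XGBoosterCreate: remembers which matrices
    were registered for caching at construction time."""
    def __init__(self, dmats=None):
        self.cached = list(dmats) if dmats else []

# Like passing NULL / len = 0: no cached matrices.
b_none = ToyBooster()

# Like passing an array of one or more DMatrix handles.
dtrain = ToyDMatrix([[1.0, 2.0]])
dvalid = ToyDMatrix([[3.0, 4.0]])
b_one = ToyBooster([dtrain])
b_two = ToyBooster([dtrain, dvalid])
```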

Is there any information on how these cached DMatrix are used?

Most Booster functions (boost/update/eval/predict, etc.) take one or more DMatrix as arguments, so I'm trying to figure out what the benefit is of providing matrices to cache when creating a Booster.

I’m also wondering whether the Booster makes a copy of these matrices or just keeps a handle to them, mostly to work out whether it’s safe to free a DMatrix that has been cached in a Booster.

It has something to do with the algorithm being used; we use an internal cached data structure to speed up the computation.

Thanks, so I guess it's recommended to always provide a DMatrix to the cache if you know it's going to be evaluated later.

Thanks for asking this question.

I am still unable to understand the functional/speed role of the cache DMatrix array being passed to the Booster. Could someone please explain it further?

There are multiple ways of creating the Booster:
a. Pass NULL as the cache
b. Pass a DMatrix array of size 1
c. Pass a DMatrix array of size greater than 1

  1. Could someone please shed more light on how the outcomes of these 3 scenarios differ? Is this latency sensitive?
  2. Also, in the cases where we do pass the cache DMatrix:
    a. How do we fill data into these? I have seen examples where they are filled with random data, which doesn't make sense to me.
    b. Do we need to keep the passed DMatrix alive until the end?
  3. Does the cache matter equally for model training vs. prediction-only use? Would our usage change depending on whether we train models or only run prediction?

@hcho3 Please help me find the answers about usage of cache DMatrix

@mohitk08 The cache argument should be set to empty (nullptr) for your use case, since this is only relevant for model training and not for model inference.

If you are also training models, then the cache argument should be set to the list of all DMatrix objects that are relevant to model training: the training and evaluation data matrices. The role of the cache is to make prediction zero-cost for those matrices, since the predictions have already been computed by the training process.
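To illustrate the zero-cost idea, here is a toy Python model (all names invented; this is not XGBoost's actual implementation): a boosting update refreshes stored predictions for registered matrices, so a later predict() on a registered matrix becomes a lookup instead of a full tree traversal.

```python
class ToyDMatrix:
    def __init__(self, rows):
        self.rows = rows

class ToyBooster:
    """Toy model of a training-time prediction cache."""
    def __init__(self, cache=()):
        # Only matrices registered here are eligible for caching.
        self._cache = {id(m): None for m in cache}
        self.full_predictions = 0  # counts expensive predictions

    def _raw_predict(self, data):
        # Stand-in for the expensive traversal of all trees.
        self.full_predictions += 1
        return [sum(row) for row in data.rows]

    def update(self, dtrain):
        # A real booster would grow a tree here; the relevant side
        # effect is that predictions for a cached matrix are refreshed
        # as part of the boosting step.
        if id(dtrain) in self._cache:
            self._cache[id(dtrain)] = self._raw_predict(dtrain)

    def predict(self, data):
        cached = self._cache.get(id(data))
        if cached is not None:
            return cached               # zero-cost reuse
        return self._raw_predict(data)  # full computation

dtrain = ToyDMatrix([[1, 2], [3, 4]])
booster = ToyBooster(cache=[dtrain])
booster.update(dtrain)           # training refreshes the cache
preds = booster.predict(dtrain)  # served from the cache
```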

Thank you, you are the lifeline of this forum.

In my use case, I'll be calling the predict method every few seconds/milliseconds on dynamic input data.
Just to confirm: will the DMatrix cache not be useful even in this repetitive prediction scenario?

Correct. The cache is strictly for training models only.

Wait, the cache is not training only.

m = xgb.DMatrix(X)
booster.predict(m)  # not cached
booster.predict(m)  # cached !

m_1 = xgb.DMatrix(X)
booster.predict(m_1)  # not cached
booster.predict(m)  # cached !
booster.predict(m_1) # cached !

del m
gc.collect()
m = xgb.DMatrix(X)   # recreate from the same data
booster.predict(m)   # not cached: the old entry was evicted
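One way to picture this eviction behavior is a toy cache keyed by weak references, written under the assumption (suggested by the example above, not a description of XGBoost internals) that cached entries do not keep the DMatrix alive. All names here are invented.

```python
import gc
import weakref

class ToyDMatrix:
    def __init__(self, rows):
        self.rows = rows

class ToyPredictionCache:
    """Toy model: entries keyed by weak references, so collecting
    a matrix silently drops its cached predictions."""
    def __init__(self):
        self._entries = {}

    def put(self, data, preds):
        # The callback fires when `data` is garbage collected and
        # removes the now-dead entry.
        ref = weakref.ref(data, lambda r: self._entries.pop(r, None))
        self._entries[ref] = preds

    def get(self, data):
        for ref, preds in self._entries.items():
            if ref() is data:
                return preds
        return None  # cache miss

m = ToyDMatrix([[1, 2], [3, 4]])
kept = ToyDMatrix([[5, 6]])
cache = ToyPredictionCache()
cache.put(m, [3, 7])
cache.put(kept, [11])

hit_before = cache.get(m)  # hit while m is alive

del m
gc.collect()
# m's entry is gone; kept's entry survives.
```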