Custom Matrix Implementation

pranavsingh3 · November 7, 2023, 9:51pm

Hello!

I was wondering what the recommended way is to implement a custom matrix representation in the XGBoost library. I am trying to follow the example of the SciPy sparse matrices that XGBoost supports, with some optimizations tailored to my use case.

I am trying to find the least invasive way of supporting my custom data format (ideally satisfying an internal interface). I would appreciate any tips or guidance!

Sincerely,

Pranav Singh

hcho3 · November 7, 2023, 10:34pm

Have you considered using the data iterator? The data iterator lets you pass data from a custom data source to XGBoost. See the example at https://xgboost.readthedocs.io/en/stable/python/examples/quantile_data_iterator.html.

pranavsingh3 · November 7, 2023, 10:57pm

Hello! Thank you for the prompt reply.

Does the data iterator approach you mention rely on external memory and writing to disk? With my custom matrix I can comfortably store everything in memory, so for performance reasons I want to avoid any disk writes.

If it doesn’t rely on external memory and I have a data iterator backed by in-memory numpy arrays, do you think I will get good performance during training?

hcho3 · November 8, 2023, 12:03am

“Does the data iterator approach you mention rely on external memory and writing to disk?”

No, the data iterator does not require the use of external memory.

“if I have a data iterator backed by in-memory numpy arrays do you think I will get good performance during training?”

In the best-case scenario, you’ll get performance similar to Dask XGBoost, which uses the data iterator to enable distributed training. Your mileage may vary.

pranavsingh3 · November 8, 2023, 1:05am

Thank you so much! I will investigate this approach.

One follow-up question: if I wanted to extend the implementation within C++ (in the event that the data iterator does not work for my use case), what interface should I be looking to fulfill with my custom matrix implementation?

hcho3 · November 8, 2023, 1:35am

If performance is a concern, you can use the C API of XGBoost. Use the function XGDMatrixCreateFromCallback. Here is an example: https://github.com/dmlc/xgboost/blob/master/demo/c-api/external-memory/external_memory.c. (The example says “external memory,” but it actually allocates arrays in memory, so you should be able to adapt it for your application.)

If you are asking about the possibility of replacing the QuantileDMatrix used in XGBoost, it will probably be a Herculean effort, since many parts of XGBoost are tightly integrated with the QuantileDMatrix class. You are free to fork XGBoost and modify the DMatrix class, at your own risk.

pranavsingh3 · November 8, 2023, 6:50pm

Thank you for the pointers!

Does the C API copy all the data into memory, or does it request batches during the training loop?

Essentially, I want to emulate what the DMatrix with external memory does during training, but instead of writing to disk, I want to provide a wrapper around my data structure.

Again really appreciate all of the pointers, super helpful!

hcho3 · November 8, 2023, 11:41pm

The original data don’t get duplicated, but they do get transformed into quantile bins after the quantile sketch. So the operation is not completely zero-copy. (You now have the original data plus the quantile bin IDs (integers) in memory.)
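To make the "original data + bin IDs" point concrete, here is a toy sketch of quantile binning for a single feature column. This is not XGBoost's actual sketching algorithm (which is approximate and weighted); it uses exact ranks on a tiny list, and the function names quantile_cuts and to_bin_ids are made up for illustration.

```python
from bisect import bisect_right


def quantile_cuts(values, max_bin):
    """Pick up to max_bin - 1 cut points at evenly spaced ranks (toy version)."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for k in range(1, max_bin):
        cut = ordered[(k * n) // max_bin]
        # Skip duplicates so the cut points stay strictly increasing.
        if not cuts or cut > cuts[-1]:
            cuts.append(cut)
    return cuts


def to_bin_ids(values, cuts):
    """Map each value to the integer index of the bin it falls into."""
    return [bisect_right(cuts, v) for v in values]


column = [0.5, 2.0, 3.5, 7.25, 8.0, 9.5, 12.0, 40.0]
cuts = quantile_cuts(column, max_bin=4)
bin_ids = to_bin_ids(column, cuts)

print(cuts)     # [3.5, 8.0, 12.0]
print(bin_ids)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

After this step both column (the original floats) and bin_ids (small integers) exist in memory, which is the extra footprint described above.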