Some __init__ parameters are ignored when initializing a DMatrix from a DataIter


Recently, I ran into an unfortunate problem that was difficult to track down. At my place of work, we are currently trying to train XGBoost models on a Kubernetes cluster, which imposes certain limitations on us (mostly memory limits). To get around those limitations, I attempted to implement my own DataIter.

When attempting to train the model using a DMatrix created from a DataIter, I kept receiving an exception saying that feature names are not unique. We passed feature_names to the __init__ of the DMatrix, yet, as I figured out later, the feature_names parameter is ignored completely when the DMatrix is initialized from an iterator. The fix was rather simple: all I had to do was call set_info right after the DMatrix was initialized, which set feature_names to its proper value, and I stopped receiving the exception.
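For reference, the workaround looks roughly like the sketch below. The helper name build_dmatrix is made up for illustration, and `it` stands for an instance of my own DataIter subclass; the only xgboost calls involved are the DMatrix constructor and set_info.

```python
def build_dmatrix(it, feature_names):
    """Create a DMatrix from a DataIter and set feature_names afterwards.

    `build_dmatrix` is a hypothetical helper name. xgboost is imported
    inside the function so the sketch can be defined without the library
    installed.
    """
    import xgboost as xgb

    # feature_names is deliberately not passed to the constructor here,
    # since it was ignored when the data came from an iterator.
    dtrain = xgb.DMatrix(it)

    # Set the names explicitly once the DMatrix exists.
    dtrain.set_info(feature_names=list(feature_names))
    return dtrain
```

With this in place, training on the returned DMatrix no longer raised the exception for me.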

I know the DataIter feature is still considered experimental, but I wanted to ask: is this behavior intended, or is it a bug? As far as I can tell, the fact that DMatrix.__init__ may ignore some parameters when initializing from an iterator is not documented anywhere, and I have not found anyone else reporting the same issue. I am therefore curious whether I am doing something wrong and feature_names is left uninitialized for a reason that still eludes me, or whether I should file a bug report or submit a pull request.

Thanks a lot and sorry if I’m just missing something.

I had a similar issue when attempting to use a custom DataIter with DeviceQuantileDMatrix.
You can pass the feature_names into the input_data call inside your DataIter implementation.
Take a look at the definition for input_data:

        A function with same data fields like `data`, `label` with

With that said, the other trick is to use parameter names when you call input_data,
i.e. input_data(data=x, label=y) instead of input_data(x, y).
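Putting both points together, a minimal sketch of such an iterator might look like the following. The names make_batch_iter and BatchIter are made up for illustration, and the batches are assumed to be a list of (X, y) pairs already held in memory; a real implementation would more likely load each batch from disk. The class is built inside a function only so that xgboost is imported lazily.

```python
def make_batch_iter(batches, feature_names):
    """Build a DataIter over a list of (X, y) batches.

    `make_batch_iter` and `BatchIter` are hypothetical names. xgboost is
    imported here rather than at module level so the sketch can be
    defined without the library installed.
    """
    import xgboost as xgb

    class BatchIter(xgb.DataIter):
        def __init__(self):
            self._pos = 0  # index of the next batch to hand over
            super().__init__()

        def next(self, input_data):
            if self._pos == len(batches):
                return 0  # signal that there are no more batches
            X, y = batches[self._pos]
            # Keyword arguments only, and feature_names goes along
            # with every batch.
            input_data(data=X, label=y, feature_names=feature_names)
            self._pos += 1
            return 1

        def reset(self):
            # Called by xgboost to start over from the first batch.
            self._pos = 0

    return BatchIter()
```

The returned object is what you would hand to the DMatrix constructor in place of the raw data. For DeviceQuantileDMatrix, X and y would need to be GPU arrays, as noted below.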

To get DeviceQuantileDMatrix to work, I also needed to pass a cupy array.

Take a look at the source for dispatch_proxy_set_data: