Hello.
Recently, I ran into a problem that was hard to track down. At work, we are training XGBoost models on a Kubernetes cluster, which imposes certain limitations on us (mostly memory limits). To get around them, I implemented my own DataIter.
When training the model on a DMatrix created from a DataIter, I kept getting an exception saying feature names are not unique. We passed feature_names to DMatrix.__init__, yet, as I figured out later, the feature_names parameter is ignored completely when the DMatrix is initialized from an iterator. The fix was rather simple: all I had to do was call set_info right after the DMatrix was initialized. That set feature_names to its proper value, and the exception went away.
I know the DataIter feature is still considered experimental, but I wanted to ask: is this behavior intended, or is it a bug? The fact that DMatrix.__init__ may ignore some parameters when initializing from an iterator is not documented anywhere that I could find, and I have not seen anyone else report the same issue. So I am curious whether I am doing something wrong and feature_names is left uninitialized for a reason that still eludes me, or whether I should file a bug report or submit a pull request.
Thanks a lot and sorry if I’m just missing something.