I’m using the Python API for XGBoost, with a pandas DataFrame as the input to the `xgb.DMatrix`. Each record in my training DataFrame corresponds to a customer record ID from a database, and I figured out I could set the index of the DataFrame to that customer record ID.
However, `Booster.predict` returns only a NumPy array: it has the same number of records, but there is no way to tell which customer ID each row belongs to, and I don’t feel comfortable assuming the output order is identical to the input order.
How do I match my predicted records back to my customer record IDs?
Furthermore, what happens when we parallelize training and inference via the Dask API, spreading the work across cores and across nodes in a cluster? Does this problem get worse, or simpler?
Does a different API, such as the Scala API on the JVM, solve this problem?