Python xgboost: how does a record ID or index align with the predict NumPy array?

robineway · January 10, 2020, 11:33pm

I’m using the Python API for xgboost, with a pandas DataFrame as the input to the xgb DMatrix. Each record in my training DataFrame corresponds to a customer record ID from a database. I figured out that I could set the index of the pandas DataFrame to be this customer record ID.

However, xgboost's predict method only returns a NumPy array. It has the same number of records, but nothing tells us which customer ID each row belongs to, and I don’t feel comfortable assuming the sort order is identical.

How do I match my predicted records back to my customer record ID?

Furthermore, what happens when we parallelize training and inference via the Dask API, spreading the work across cores and across nodes in a cluster? Does this problem get exacerbated, or simplified?

Does a different API, such as the Scala API on the JVM, solve this problem?

hcho3 · January 13, 2020, 8:09pm

The order is identical. We use the values field of the data frame to obtain a NumPy array, and that array has the same row order as the data frame. So the i-th prediction corresponds to the i-th row of the data frame you passed in.