How can I extract classification probabilities from xgboost without using the .predict_proba sklearn API?

Tony363 · April 6, 2022, 5:21pm

I have generated data from a uniform random distribution and divided it into 5 categories. The data's minimum value is 0 and its maximum value is 1; the values are floats. Category 1 covers 0.0-0.05, category 2 covers 0.05-0.1, category 3 covers 0.1-0.15, category 4 covers 0.15-0.2, and category 5 covers 0.2-1.0. These values are the “actual” X term. I then have an “actual” Y term, in which I assign each data point to its corresponding category. Finally, I have a Z term drawn from a normal distribution, which I add to the X term to form the Xe term. I fit xgboost with the multi:softmax objective on the Xe and Y terms, using a training/validation/test split of 0.6, 0.3, and 0.1.
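For concreteness, the setup described above might look roughly like the sketch below. The sample size, random seed, and noise standard deviation are not stated in the description, so the values here are placeholders. The labels are written as 0 to 4 rather than 1 to 5, on the assumption that they will be passed to xgboost, whose multi-class objectives expect labels in the range [0, num_class).

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed (assumption)
n = 10_000                      # sample size (assumption, not given in the post)
noise_std = 0.05                # std of the Z term (assumption, not given in the post)

# "Actual" X term: uniform floats in [0, 1).
x = rng.uniform(0.0, 1.0, size=n)

# "Actual" Y term: bin each X value into one of the 5 categories.
# np.digitize returns 0 for x < 0.05, 1 for 0.05 <= x < 0.10, ..., 4 for x >= 0.20.
edges = np.array([0.05, 0.10, 0.15, 0.20])
y = np.digitize(x, edges)

# Z term: normal noise, added to X to form the observed Xe term.
z = rng.normal(0.0, noise_std, size=n)
xe = x + z

# Shuffle the indices and split them 0.6 / 0.3 / 0.1
# into training, validation, and test sets.
idx = rng.permutation(n)
n_train = round(0.6 * n)
n_val = round(0.3 * n)
train_idx, val_idx, test_idx = np.split(idx, [n_train, n_train + n_val])
```

Each index array can then be used to slice `xe` (reshaped to a single feature column) and `y` before building the training, validation, and test matrices.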

I would like to extract the model's predicted probabilities for each of the 5 categories. My code does not use the sklearn API, so that I have finer control over xgboost. How can I use xgboost's native Python API to extract these probabilities from a trained model without going through the sklearn API?

I am currently exploring the xgb.Booster.predict method, which has an “output_margin” argument. I read in this post that probabilities can be obtained by passing the raw, untransformed output margin through the softmax function. However, when I tried to follow the description in that post, I got a matrix with 6 columns (categories), even though my input data has only 5 categories. Am I interpreting the output of .predict(dtest, output_margin=True) incorrectly? How should I interpret it? Are there other ways to extract my 5 category probabilities from xgboost without using the .predict_proba sklearn API?
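The margin-to-probability step described above can be sketched as follows. This is only an illustration of the softmax transform, not code from the linked post: `softmax_rows` and `booster_probabilities` are hypothetical helper names, and the sketch assumes that `predict(..., output_margin=True)` returns a 2-D array with one row per sample and one column per class.

```python
import numpy as np


def softmax_rows(margins):
    """Apply a row-wise softmax to a (n_samples, n_classes) array of raw margins."""
    margins = np.asarray(margins, dtype=float)
    # Subtract each row's maximum before exponentiating, for numerical stability.
    shifted = margins - margins.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def booster_probabilities(booster, dmatrix):
    """Hypothetical helper: turn a trained booster's raw margins into probabilities.

    `booster` is expected to be an xgb.Booster and `dmatrix` an xgb.DMatrix.
    The number of columns in the result equals the num_class training parameter.
    """
    margins = booster.predict(dmatrix, output_margin=True)
    return softmax_rows(margins)
```

Each row of the result sums to 1. Because the column count follows the `num_class` value passed at training time rather than the number of distinct labels in the data, comparing `probs.shape[1]` with that parameter is a quick check when the matrix has more columns than expected.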