How to normalise negative values output by XGB?

Fwd · August 16, 2019, 10:36am

Hello,

I’m using the scikit-learn implementation of XGBRanker, and my target labels are binary (0 or 1). For a given group (query), XGBRanker yields predictions that may include negative values, such as:
0.31
0.72
-0.27
0.56
-1.56
-2.51
I’d like to normalise these values such that the -2.51 document has a very low, but non-zero probability of being the target label 1. The most common solution suggests normalising these values such that -2.51 = 0 and 0.72 = 1, and then dividing by the sum to get the percentage/probability, but this does not solve my problem since the -2.51 document retains the 0 value.
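
To make the issue concrete, here is a minimal sketch of that min-max approach applied to the scores listed above. It uses plain Python, and the variable names are only illustrative. The lowest score ends up at exactly 0 after scaling, so it stays at 0 after dividing by the sum.

```python
# Scores for one group (query), copied from the example above.
scores = [0.31, 0.72, -0.27, 0.56, -1.56, -2.51]

lo, hi = min(scores), max(scores)

# Min-max scaling: the lowest score maps to 0 and the highest to 1.
scaled = [(s - lo) / (hi - lo) for s in scores]

# Divide by the sum so the values add up to 1.
total = sum(scaled)
shares = [v / total for v in scaled]

# The -2.51 document is stuck at zero, which is the problem described above.
print(shares[-1])
```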

I realise this question might be better suited to a general math/statistics audience, but I hoped someone had encountered a similar problem with negative XGB outputs and found a solution.

Thanks

hcho3 · August 16, 2019, 9:28pm

Take a look at the post “Evaluating XGBoost Ranking”. You should treat the outputs as indicators of the relative ordering between documents.

Fwd · August 17, 2019, 9:43am

Thanks, yes, I noticed that post previously and found it helpful. In this case, I’m looking to extract probabilities from the ranking values and convert them to betting odds, since I’m testing XGBoost on a sports betting problem. XGB ranking outperforms the XGB regressor, I presume because of its ability to embed context through grouping, which is why I’d like to obtain all-positive values from its output.

As you note in your post, the scores are arbitrary, but I guess they still represent relative strengths within the group, so I’d like to turn them into all-positive values somehow.

hcho3 · August 17, 2019, 4:53pm

That’s not possible, because the outputs are just arbitrary numbers, and they represent neither probabilities nor log odds. If you had trained a classifier instead, the outputs would have probabilistic meaning.

Suggestion: Given a set of test examples, you can sort them by predicted score and label the top K examples as positive and the rest as negative. Adjust K so that you don’t make too many errors on the training data.
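
A rough sketch of that top-K idea, in plain Python, might look like the following. The helper names (`top_k_labels`, `count_errors`, `choose_k`) and the labels are hypothetical and made up for illustration; only the scores come from the question above. In practice the groups passed to `choose_k` would be the training queries.

```python
def top_k_labels(scores, k):
    """Label the k highest-scoring items 1 and all others 0."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    predicted = [0] * len(scores)
    for i in order[:k]:
        predicted[i] = 1
    return predicted


def count_errors(scores, labels, k):
    """Number of items whose top-K label disagrees with the true label."""
    predicted = top_k_labels(scores, k)
    return sum(p != y for p, y in zip(predicted, labels))


def choose_k(groups, candidate_ks):
    """Pick the K with the fewest total errors across (scores, labels) groups."""
    return min(
        candidate_ks,
        key=lambda k: sum(count_errors(s, y, k) for s, y in groups),
    )


# Scores from the question; the labels are invented for this example.
scores = [0.31, 0.72, -0.27, 0.56, -1.56, -2.51]
hypothetical_labels = [0, 1, 0, 1, 0, 0]

best_k = choose_k([(scores, hypothetical_labels)], candidate_ks=range(1, 4))
print(best_k, top_k_labels(scores, best_k))
```

Only the ordering of the scores matters here, so negative values need no special handling.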