Evaluating XGBoost Ranking


Hi all,

I’m unsure if this is the correct place to ask this question, so apologies in advance. I’m using the Python implementation of XGBoost pairwise ranking.

My predictions come back as a list of probabilities. However, I am wondering what the best way is to evaluate such an outcome, and how to tell whether I made the correct predictions.

I’ve searched multiple online forums, but I can’t find a good answer on how to evaluate the predictions of XGBoost learning to rank.

To illustrate, I’ve followed the Python file on GitHub as an example, and it shows:
pred = model.predict(x_test)

When I run my code, the outcome is a list of values between 0 and 1. So how do I evaluate how good my predictions are? Is there a built-in way to see how the rankings implied by the predictions compare to the actual rankings?

Again, sorry if this is not the appropriate place to ask such a question.

Thanks in advance.


It’s best to think of the outputs as arbitrary scores you can use to rank documents. To evaluate your model on a query session, you first make predictions on the documents in the query session and then sort them by the predicted scores. Finally, you compute a ranking metric on the sorted list.

Consider an example query session with three documents:

Document      Relevance judgment (label)    Predicted score
Document 0    1                             -0.3
Document 1    2                             +0.2
Document 2    0                             +0.1

Sorting by the predicted score (descending order), we get

Document      Relevance judgment (label)    Predicted score
Document 1    2                             +0.2
Document 2    0                             +0.1
Document 0    1                             -0.3

This ordering (Document 1, Document 2, followed by Document 0) is the ranking predicted by the model.
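
As a rough illustration of the sorting step, here is a minimal sketch using NumPy. The array names (labels, scores, order) are made up for this example, and the values are the three documents from the table above.

```python
import numpy as np

# Relevance labels and predicted scores for Document 0, 1, 2 (from the example above).
labels = np.array([1, 2, 0])
scores = np.array([-0.3, 0.2, 0.1])

# argsort sorts ascending, so negate the scores to get descending order.
order = np.argsort(-scores)

# Labels rearranged into the order predicted by the model.
ranked_labels = labels[order]

print("Predicted ranking (document indices):", order.tolist())
print("Labels in predicted order:", ranked_labels.tolist())
```

Here order holds the document indices from best to worst according to the model, and ranked_labels is what goes into the DCG formula.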

Now we can compute DCG (Discounted Cumulative Gain):

DCG  = (2^2 - 1)/log_2(2) + (2^0 - 1)/log_2(3) + (2^1 - 1)/log_2(4) = 3.5

To put the DCG value 3.5 in context, we normalize it by IDCG (Ideal Discounted Cumulative Gain). IDCG is the highest possible DCG, and we obtain it by sorting the documents by their relevance judgments in descending order:

IDCG = (2^2 - 1)/log_2(2) + (2^1 - 1)/log_2(3) + (2^0 - 1)/log_2(4) = 3.6309297535714578

So the NDCG (Normalized Discounted Cumulative Gain) is given by

NDCG = DCG / IDCG = 3.5 / 3.6309297535714578 = 0.9639404333166532
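
If it helps, the arithmetic above can be reproduced with a few lines of NumPy. This is only a sketch under the same conventions as the worked example (gain 2^label - 1, discount log_2(position + 1)); the helper names dcg and ndcg are hypothetical and not part of XGBoost.

```python
import numpy as np

def dcg(relevances):
    """DCG of relevance labels listed in ranked order (best-ranked first)."""
    rel = np.asarray(relevances, dtype=float)
    positions = np.arange(1, rel.size + 1)
    # Gain 2^rel - 1, discounted by log2(position + 1).
    return float(np.sum((2.0 ** rel - 1.0) / np.log2(positions + 1)))

def ndcg(labels, scores):
    """NDCG for a single query session."""
    labels = np.asarray(labels)
    # Order the documents by predicted score, highest first.
    order = np.argsort(-np.asarray(scores, dtype=float))
    # Ideal ordering: labels sorted in descending order.
    ideal = dcg(np.sort(labels)[::-1])
    # Convention chosen here: NDCG is 0 when no document is relevant.
    return dcg(labels[order]) / ideal if ideal > 0 else 0.0

labels = [1, 2, 0]
scores = [-0.3, 0.2, 0.1]
print("DCG :", dcg([2, 0, 1]))
print("NDCG:", ndcg(labels, scores))
```

Note that other libraries may define the gain differently (for example, using the label itself instead of 2^label - 1), so their numbers need not match this sketch.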


Hi hcho3,

Many thanks for the elaborate response and for clearing up how evaluation is done for learning-to-rank methods! Again, apologies if the following question is a bit silly; I just want to understand this correctly.

If I were to plot my predictions, I’d want to combine my list of predictions (e.g. the list returned by pred = model.predict(x_test)) with the query data (e.g. x_test)?

So, in short, if I understand correctly: how do I connect the list of arbitrary scores back to their corresponding documents? Perhaps this isn’t the most difficult task, so again apologies if this isn’t the correct place or level of question for this forum.


When you first train your model, you are asked to divide the documents into query groups. After you compute predictions, divide the scores using the same query groups as the documents. Treat each query group separately from the others when interpreting the quality of the ranking.
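
A minimal sketch of that splitting step follows. It assumes the query groups are stored as a list of group sizes in row order, and that the predictions come back in the same row order as x_test, so score i belongs to row i. The names group_sizes, scores and doc_ids, and the values in them, are made up for illustration.

```python
import numpy as np

# Hypothetical setup: 5 test rows, first query has 3 documents, second has 2.
group_sizes = [3, 2]
scores = np.array([-0.3, 0.2, 0.1, 0.7, 0.4])  # stand-in for model.predict(x_test)
doc_ids = np.arange(len(scores))                # row index of each document in x_test

# Positions where one query group ends and the next begins.
boundaries = np.cumsum(group_sizes)[:-1]

# Rank the documents inside each query group by descending score.
rankings = [
    ids[np.argsort(-s)]
    for ids, s in zip(np.split(doc_ids, boundaries), np.split(scores, boundaries))
]

for query, ranking in enumerate(rankings):
    print(f"Query {query}: rows ranked best to worst -> {ranking.tolist()}")
```

Each entry in rankings lists row indices into x_test, so you can look up the corresponding documents (and their labels) and compute a metric such as NDCG one query group at a time.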