Evaluating XGBoost Ranking

Hi all,

I’m unsure if this is the correct place to ask this question, so apologies in advance. I’m using the Python implementation of XGBoost pairwise ranking.

The result of my prediction is a list of probabilities. However, I am wondering what the best way is to evaluate such an outcome, i.e., whether I made the correct predictions.

I’ve searched multiple online forums, but I can’t seem to find a good answer on how to evaluate the predictions of XGBoost learning to rank.

To illustrate, I’ve followed the Python file on GitHub as an example, and it shows:
pred = model.predict(x_test)

When I run my code, the outcome is a list of values between 0 and 1. How do I evaluate how good my predictions are? Is there a built-in way to see what rankings the predictions produce, compared to the actual rankings?

Again, sorry if this is not the appropriate place to ask such a question.

Thanks in advance.

It’s best to think of the outputs as arbitrary scores you can use to rank documents. To evaluate your model on a query session, first make predictions for the documents in that session and then sort them by the predicted scores. Finally, compute the ranking metric on the sorted list.

Consider an example query session with three documents:

| Document   | Relevance judgment (label) | Predicted score |
| ---------- | -------------------------- | --------------- |
| Document 0 | 1                          | -2.3            |
| Document 1 | 2                          | +1.2            |
| Document 2 | 0                          | +0.5            |

Sorting by the predicted score (descending order), we get

| Document   | Relevance judgment (label) | Predicted score |
| ---------- | -------------------------- | --------------- |
| Document 1 | 2                          | +1.2            |
| Document 2 | 0                          | +0.5            |
| Document 0 | 1                          | -2.3            |

This ordering (Document 1, Document 2, followed by Document 0) is the ranking predicted by the model.
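This sort can be done in a couple of lines of NumPy (a sketch, using the labels and scores from the example table above):

```python
import numpy as np

# Labels and predicted scores for Documents 0, 1, 2, in original order
labels = np.array([1, 2, 0])
scores = np.array([-2.3, 1.2, 0.5])

# Indices sorted by predicted score, descending: the predicted ranking
order = np.argsort(-scores)
print(order)          # [1 2 0]: Document 1, then Document 2, then Document 0
print(labels[order])  # [2 0 1]: labels rearranged into the predicted order
```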

Now we can compute DCG (Discounted Cumulative Gain):

DCG  = (2^2 - 1)/log_2(2) + (2^0 - 1)/log_2(3) + (2^1 - 1)/log_2(4) = 3.5

To put the DCG value 3.5 in context, we normalize it by IDCG (Ideal Discounted Cumulative Gain). IDCG is the highest possible DCG, and we obtain it by sorting the relevance judgment in descending order:

IDCG = (2^2 - 1)/log_2(2) + (2^1 - 1)/log_2(3) + (2^0 - 1)/log_2(4) = 3.6309297535714578

So the NDCG (Normalized Discounted Cumulative Gain) is given by

NDCG = DCG / IDCG = 3.5 / 3.6309297535714578 = 0.9639404333166532
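The same arithmetic in code, assuming the 2^rel - 1 gain and log2(position + 1) discount used above (the function names are mine):

```python
import numpy as np

def dcg(labels):
    """DCG of a list of relevance labels given in ranked order (best position first)."""
    labels = np.asarray(labels, dtype=float)
    positions = np.arange(1, len(labels) + 1)
    return np.sum((2.0 ** labels - 1.0) / np.log2(positions + 1))

def ndcg(labels):
    """Normalize DCG by the DCG of the ideal (descending-label) ordering."""
    ideal = np.sort(labels)[::-1]
    return dcg(labels) / dcg(ideal)

# Labels in the model's predicted order: Document 1, Document 2, Document 0
print(dcg([2, 0, 1]))   # 3.5
print(ndcg([2, 0, 1]))  # 0.9639404333166532
```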

Hi hcho3,

Many thanks for the elaborate response and for clearing up how evaluation is done for learning-to-rank methods! Again, apologies if the following question is a bit silly, I just want to understand this correctly.

If I were to plot my predictions, I’d want to combine my list of predictions (e.g. the list of scores resulting from pred = model.predict(x_test)) with the query data (e.g. x_test)?

So if I understand correctly in short; how do I connect the list of arbitrary scores back to their corresponding documents? Perhaps this isn’t the most difficult task, so again apologies if this isn’t the correct place or level of question for this forum.

When you first train your model, you are asked to divide the documents into query groups. So after you compute predictions, you should divide the scores using the same query groups as the documents. Treat each query group separately from the others when interpreting the quality of the ranking.
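For example, if the predictions come back as one flat array, you can split them back into per-query chunks with the same group sizes used at training time (a sketch; the group sizes and scores below are made up):

```python
import numpy as np

group_sizes = [3, 2, 4]                  # hypothetical: documents per query group
pred = np.array([0.1, 0.9, 0.4,          # stand-in for model.predict(x_test)
                 0.7, 0.2,
                 0.3, 0.8, 0.5, 0.6])

# Split the flat score array at the cumulative group boundaries
boundaries = np.cumsum(group_sizes)[:-1]  # [3, 5]
per_query = np.split(pred, boundaries)

# Rank documents within each query group independently
for q, scores in enumerate(per_query):
    print(q, np.argsort(-scores))
```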

But I think I would need to be able to specify the groups directly on, for example, test data (e.g., test_X) for which I am trying to predict. Otherwise, how are those relative scores by query group being determined? Sounds like you’re assuming we’re predicting on the training data, for which the groups were specified in the call to the fit method.

No, the only assumption is that you have query groups defined on the test data. Then you can compute the relative ordering between test documents.

Where do you specify the groups? I don’t see it in the documentation for predict here.

For now, you will have to predict one group at a time, since the prediction function doesn’t let you specify group boundaries. Just sort by the raw predictions within each group and that will give you the ordering.
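A minimal sketch of predicting one group at a time (the helper function and the dummy model are hypothetical; in practice `model` would be your trained ranker):

```python
import numpy as np

def rank_one_query(model, X_query):
    """Score the candidate documents of a single query group and return
    their indices, best first."""
    scores = model.predict(X_query)
    return np.argsort(-scores)

# Hypothetical stand-in model: scores a document by the sum of its features
class DummyModel:
    def predict(self, X):
        return X.sum(axis=1)

X_q0 = np.array([[0.1, 0.9],    # doc 0, score 1.0
                 [0.8, 0.3],    # doc 1, score 1.1
                 [0.4, 0.2]])   # doc 2, score 0.6
print(rank_one_query(DummyModel(), X_q0))  # [1 0 2]
```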

I agree that the ranking support has room for improvement.

Thanks.

On another note: your comments in this thread have assumed (if I understand correctly) that every document in the training data has a relevance label that communicates degree of relevance, e.g., say integers between 0 and 10. However, I’m wondering if the pairwise ranking approach can successfully be applied to 0-1 data, e.g., click or not?

Yes. In the original scenario, 0 means irrelevance and >0 means some relevance. You can apply the same interpretation with 0-1 data. And yes, you can compute NDCG with 0-1 data too.

I’m comparing a logistic regression (scikit-learn) with a pairwise ranking approach (XGBoost) where the relevance labels are 0-1 (click or not, as I mentioned above), and I’m getting very little difference in the rankings, which is not what I was hoping for or expecting! But this could be because the dataset is very unbalanced, with something like 1.2% 1s. Obviously, this means that many of the query groups will have “no signal”. Intuitively, it seems that it would be hard for pairwise ranking to perform well in this case, so I wonder if there is any insight into this use case.

You can set scale_pos_weight in order to give more weight to the minority class. Or you can assign individual weights to data points.
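A sketch of the first suggestion, computing scale_pos_weight from the class balance (the labels below are made up, and how per-document weights interact with query groups can vary by XGBoost version, so check the docs for your release):

```python
import numpy as np

y_train = np.array([0, 0, 0, 0, 0, 0, 0, 1])  # hypothetical 0/1 click labels

# Common heuristic: scale_pos_weight = (# negatives) / (# positives)
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
print(scale_pos_weight)  # 7.0

# This would then go into the training parameters, e.g.
params = {"objective": "rank:pairwise", "scale_pos_weight": scale_pos_weight}

# The alternative mentioned above: explicit per-point weights upweighting clicks
sample_weight = np.where(y_train == 1, scale_pos_weight, 1.0)
```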

Hi @hcho3 ,
I’ve read this post that is very interesting for me.
I’m using xgboost with LambdaMART model for ranking.
As you wrote, in the original scenario 0 means irrelevance and >0 means relevance.

In my specific case I’m computing the relevance label starting from a statistical click-through rate (stat_ctr) that can take continuous values from 0 to infinity. I would like to normalize these rates and round them to integers in order to use them as relevance labels for training. In my scenario, stat_ctr <= 1 means not relevant and stat_ctr > 1 means relevant, so I was wondering:

During the normalization/rounding to integers, would it be better to assign label 0 to all documents with stat_ctr <= 1, i.e.:

  • 0 <= stat_ctr <= 1 -> relevance_label = 0
  • 1 < stat_ctr -> relevance_label > 0

Or would it be better to use a classic round, so that both labels 0 and 1 can represent non-relevant values:

  • 0 <= stat_ctr < 0.5 -> relevance_label = 0
  • 0.5 <= stat_ctr -> relevance_label > 0

In this latter case, label 1 will represent both non-relevant and relevant values, a sort of uncertainty situation, because a stat_ctr of 0.8 will be mapped to a relevance label of 1, and the same for a stat_ctr of 1.3.
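The two mappings can be sketched like this (the exact rule for values above the threshold in option 1 is my own illustrative choice, not something prescribed):

```python
import numpy as np

stat_ctr = np.array([0.2, 0.8, 1.0, 1.3, 2.7, 5.4])  # hypothetical CTR stats

# Option 1: everything at or below 1 is labelled 0 (not relevant);
# values above 1 get a positive integer label (here: ceil(stat_ctr - 1))
labels_opt1 = np.where(stat_ctr <= 1.0, 0.0, np.ceil(stat_ctr - 1.0)).astype(int)
print(labels_opt1)  # [0 0 0 1 2 5]

# Option 2: classic rounding, so 0.8 (non-relevant) and 1.3 (relevant)
# both end up with label 1
labels_opt2 = np.round(stat_ctr).astype(int)
print(labels_opt2)  # [0 1 1 1 3 5]
```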

From what you said, I suppose the first option should be better.

Thank you.