Rank ndcg pair sample and position

doramir · February 28, 2019, 9:20am

I’m running xgboost with objective rank:ndcg and I have 3 questions:

Does the model know how to use the position of the data? Do I need to send the data to the model sorted by qid and by position (ascending)?
I know that the model uses num_pairsample. What is the value of this variable, and can we change it? Also, does this variable mean the number of pairs the model checks when computing the loss?

hcho3 · March 1, 2019, 7:30am

Data should be sorted by query ID (qid). On the other hand, the relative ordering of rows within the same query group in the training data does not matter. (Why? Because the rows already have relevance judgment labels, and the optimal ordering of relevance judgment labels is what LambdaMART is going to optimize for.)
You can set it like any other training parameter. The larger this number, the larger the sample of pairs used when computing the gradients for the ranking objective.
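As a rough illustration of both points, the pure-Python sketch below sorts some hypothetical rows by qid, derives the per-query group sizes, and then draws a fixed number of candidate pairs per row. The row layout, the helper name `sample_pairs`, and the sampling scheme are assumptions made for illustration; this is not XGBoost's internal code, and `pairs_per_row` only stands in for the role that num_pairsample plays.

```python
import random
from itertools import groupby

# Hypothetical rows: (qid, relevance_label, feature_value).
rows = [
    (2, 0, 0.3), (1, 2, 0.9), (2, 1, 0.5),
    (1, 0, 0.1), (1, 1, 0.4), (2, 2, 0.8),
]

# Rows of the same query must be contiguous, so sort by qid only.
# The order of rows inside a group is left as it happens to be.
rows.sort(key=lambda r: r[0])

# Number of rows per query, listed in qid order.
group_sizes = [len(list(g)) for _, g in groupby(rows, key=lambda r: r[0])]


def sample_pairs(group, pairs_per_row, rng):
    """For each row, draw pairs_per_row partners that have a different label.

    Rows with no differently labeled partner contribute no pairs.
    Returns (better, worse) tuples of row indices within the group.
    """
    pairs = []
    for i, (_, label_i, _) in enumerate(group):
        partners = [j for j, (_, label_j, _) in enumerate(group) if label_j != label_i]
        for _ in range(pairs_per_row if partners else 0):
            j = rng.choice(partners)
            # Put the row with the higher label first.
            pairs.append((i, j) if label_i > group[j][1] else (j, i))
    return pairs


rng = random.Random(0)
group_q1 = [r for r in rows if r[0] == 1]
pairs_small = sample_pairs(group_q1, pairs_per_row=1, rng=rng)
pairs_large = sample_pairs(group_q1, pairs_per_row=3, rng=rng)
print(group_sizes, len(pairs_small), len(pairs_large))
```

In this sketch a larger `pairs_per_row` simply means more (better, worse) pairs contribute to the gradient estimate for each query group, at the cost of more computation.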

doramir · March 3, 2019, 10:58am

Thanks for the answer!
About the first answer: if the relative ordering of rows within the same query group does not matter, how does the model learn which item should rank above the others? Only by the label? If one item is labeled larger than 0 at position 12 and a different item is labeled at position 4, don't they get different relevance scores within the same group?

hcho3 · March 4, 2019, 6:07am

Yes, only the label matters.

Think of the tree ensemble model as a function f(data) -> score, where data is a collection of feature values and score is a real-valued output. The model will predict the relative ordering within each query group by first computing f(document_1), f(document_2), …, f(document_n) and then sorting the documents by the output of the function f. The goal of the ranking task is to get this predicted ordering to be as close to the optimal ordering as possible. The optimal ordering is the one you get by sorting the documents by their labels in descending order.
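The description above can be sketched in a few lines of pure Python. Here `f` is a hypothetical stand-in for the trained tree ensemble (a hand-written linear function, not a real model), and the documents, the labels, and the particular NDCG formula (gain 2^label - 1 with a log2 discount) are assumptions chosen for illustration.

```python
import math

# Hypothetical query group: (doc_id, feature_value, relevance_label).
docs = [("a", 0.2, 0), ("b", 0.9, 2), ("c", 0.5, 1), ("d", 0.7, 0)]


def f(feature_value):
    """Stand-in for the trained model: maps features to a real-valued score."""
    return 3.0 * feature_value - 1.0


def dcg(labels):
    """Discounted cumulative gain of labels listed in ranked order."""
    return sum((2 ** lab - 1) / math.log2(pos + 2) for pos, lab in enumerate(labels))


def ndcg(docs_in_group):
    # Predicted ordering: sort documents by model score, highest first.
    predicted = sorted(docs_in_group, key=lambda d: f(d[1]), reverse=True)
    # Optimal ordering: sort documents by label, highest first.
    ideal = sorted(docs_in_group, key=lambda d: d[2], reverse=True)
    ideal_dcg = dcg([d[2] for d in ideal])
    return dcg([d[2] for d in predicted]) / ideal_dcg if ideal_dcg > 0 else 0.0


predicted_ids = [d[0] for d in sorted(docs, key=lambda d: f(d[1]), reverse=True)]
score = ndcg(docs)

# Reordering the input rows changes neither the predicted ranking nor NDCG.
score_reversed = ndcg(list(reversed(docs)))
print(predicted_ids, round(score, 4), score == score_reversed)
```

The position of a row in the input never enters the computation: only the feature values (through `f`) and the labels (through the ideal ordering) do, which is why the row order within a query group does not matter.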