Very large NDCG result

Using XGBoost with the following parameters:
{'objective': 'rank:ndcg', 'eta': 0.1, 'gamma': 1.0, 'eval_metric': 'ndcg@3', 'min_child_weight': 0.1, 'max_depth': 6}
After building a model I get NDCG@3 larger than 1:
Given what I know of NDCG (that it is normalized by the DCG of the ideal ranking and has to be within the [0, 1] range), this has to be a bug, right?
I’m not that concerned with this bug in the evaluation since I can compute NDCG on my own from the ranking and the labels. However, I am concerned about how the model was trained given that the objective may also reach wrong values.

Another thing to note: the labels were originally between 0 and 1. But I discovered that XGBoost rounds the labels down to the nearest integer, meaning that most labels were 0. To overcome this issue, I multiplied all the labels by a factor of 1000 (is there another way around this?). That resulted in some very large labels.
Going back to the large NDCG values, I guess it might be caused by an overflow (raising 2 to the power of a large number in the computation of NDCG).
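To make the overflow guess concrete, here is a minimal sketch in Python. It assumes the common exponential gain 2^label - 1, and it emulates a gain computed with a 32-bit shift whose shift count wraps modulo 32. That wrapping behavior is an assumption for illustration, not a confirmed detail of XGBoost's implementation; `exact_gain`, `wrapped_gain`, and the sample labels are hypothetical.

```python
import math


def dcg_at_k(labels, k, gain):
    """DCG over the first k labels, which are given in ranked order."""
    return sum(gain(rel) / math.log2(pos + 2) for pos, rel in enumerate(labels[:k]))


def ndcg_at_k(ranked_labels, k, gain):
    """DCG of the given ranking divided by the DCG of the label-sorted ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_labels, reverse=True), k, gain)
    return dcg_at_k(ranked_labels, k, gain) / ideal_dcg if ideal_dcg > 0 else 0.0


def exact_gain(rel):
    # Arbitrary-precision Python integers: no wrap-around.
    return 2 ** rel - 1


def wrapped_gain(rel):
    # ASSUMPTION: emulates a 32-bit shift where the shift count wraps modulo 32.
    return (1 << (rel % 32)) - 1


ranked = [10, 100, 3]  # hypothetical labels, in the order the model ranked them
print(ndcg_at_k(ranked, 3, exact_gain))    # stays at or below 1
print(ndcg_at_k(ranked, 3, wrapped_gain))  # exceeds 1 because gain(100) < gain(10)
```

With the wrapped gain, sorting by label no longer maximizes DCG, so the "ideal" denominator can end up smaller than the numerator.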

Any suggestions?

Thank you!

“The labels were originally between 0 and 1.”

I think this is the issue. XGBoost assumes the label is a nonnegative integer, e.g. 0, 1, 2, 3, … Please transform your labels into discrete levels of relevance, and do not multiply by 1000.

@hcho3, thanks for your response!
That’s exactly what I was trying to do.
I’m transforming the label values from:
0.0001, 0.001, 0.002, 0.003, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1
to:
0, 1, 2, 3, 10, 100, 200, 300, 400,…,1000
These are non-negative integers.
Did you mean something different? Is there a max value for the labels?
As an experiment, I removed all examples with a label above 100. I still got a very large NDCG value.

1000 is a very big number. Typically, relevance judgment is out of 5 or 10. For example:

  • 0: not relevant at all
  • 1: a little relevant
  • 2: moderately relevant
  • 3: quite relevant
  • 4: highly relevant
  • 5: extremely relevant.

Try using 0-5 or 0-10 for your labels so that you don’t run into numerical overflow (which seems to be what happened in your original example).

@hcho3, thanks again!

Is there a recommended method to transform such values:
0.0001, 0.001, 0.002, 0.003, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1
into buckets from 0 to 10?

Maybe categorize them in bands? So 0-0.01 would be mapped to 0, 0.01-0.02 would be mapped to 1, and so forth.
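The banding idea can be sketched as follows. The band edges below are hypothetical and were picked only to spread the values listed in the question over grades 0-5; choose edges that reflect what the labels mean in your data.

```python
from bisect import bisect_right

# ASSUMPTION: hypothetical band edges; each edge starts a new relevance grade.
EDGES = [0.001, 0.01, 0.1, 0.4, 0.7]


def to_grade(label):
    """Map a label in [0, 1] to an integer grade in 0..len(EDGES)."""
    return bisect_right(EDGES, label)


labels = [0.0001, 0.001, 0.002, 0.003, 0.01, 0.1, 0.2, 0.3,
          0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
grades = [to_grade(x) for x in labels]
print(grades)
```

Because the values span several orders of magnitude, equal-width bands would put most of the small labels into grade 0, which is why the edges here are uneven.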

Hi, @hcho3, may I know why you suggest that “relevance judgment is out of 5 or 10”? Is this based on your personal experience, or do you have any references for that? Thanks.

Hi, @hcho3, may I ask another related question? According to the definition of NDCG, having relevance judgments of 0, 1, 2, and 3 should be the same as having those of 0, 10, 20, and 30 when using NDCG as the objective. But this doesn’t seem to be true with XGBoost. Have you any idea why?
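Whether rescaling the labels changes NDCG depends on the gain function. A minimal sketch, assuming the exponential gain 2^label - 1 mentioned earlier in the thread (the helper names and the sample ranking are hypothetical, and this is not confirmed here as the full explanation for XGBoost's behavior):

```python
import math


def ndcg(ranked_labels, gain):
    """NDCG over the full list; labels are given in ranked order."""
    def dcg(labels):
        return sum(gain(rel) / math.log2(pos + 2) for pos, rel in enumerate(labels))
    ideal = dcg(sorted(ranked_labels, reverse=True))
    return dcg(ranked_labels) / ideal if ideal > 0 else 0.0


def linear_gain(rel):
    return rel


def exponential_gain(rel):
    return 2 ** rel - 1


ranked = [1, 3, 2, 0]                  # hypothetical ranking
scaled = [10 * rel for rel in ranked]  # same ranking, labels multiplied by 10

linear_diff = abs(ndcg(ranked, linear_gain) - ndcg(scaled, linear_gain))
exponential_diff = abs(ndcg(ranked, exponential_gain) - ndcg(scaled, exponential_gain))
print(linear_diff, exponential_diff)
```

With a linear gain the factor of 10 cancels between numerator and denominator. With the exponential gain it does not cancel, so the two label scales give different NDCG values for the same ranking.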