Imbalanced Dataset: Difference between the two ways to improve

orrymr · November 19, 2019, 8:25am

I refer to https://xgboost.readthedocs.io/en/latest/tutorials/param_tuning.html

The page says that “there are two ways to improve [the model]”, depending on what you are trying to improve:

Firstly:
If you care only about the overall performance metric (AUC) of your prediction
Secondly:
If you care about predicting the right probability

What is the difference between these cases? When would you prefer one over the other?

hcho3 · December 6, 2019, 6:41pm

If you correct for class imbalance by assigning data weights, you introduce a bias into the predicted probabilities: the predicted probability of the minority class will be overestimated.
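
To make the size of that bias concrete, here is a minimal sketch under an idealized assumption: the model minimizes weighted log loss exactly, with each positive example given weight w. In that case the predicted odds come out as w times the true odds, so a prediction q relates to the true probability p by q = w*p / (w*p + 1 - p). The function names and the weight of 9.0 are hypothetical, chosen only for illustration; a real boosted model will only approximate this relationship.

```python
def weighted_prediction(p, w):
    """Prediction that minimizes log loss when positives carry weight w.

    Idealized assumption: the model fits the weighted loss exactly,
    so the predicted odds are w times the true odds p / (1 - p).
    """
    return w * p / (w * p + (1.0 - p))


def undo_weighting(q, w):
    """Invert weighted_prediction by dividing the predicted odds by w."""
    return q / (q + w * (1.0 - q))


w = 9.0  # hypothetical positive-class weight, e.g. for a 9:1 negative:positive ratio
true_probs = [0.01, 0.05, 0.10, 0.50]

biased = [weighted_prediction(p, w) for p in true_probs]
recovered = [undo_weighting(q, w) for q in biased]

for p, q, r in zip(true_probs, biased, recovered):
    print(f"true={p:.2f}  weighted={q:.3f}  recovered={r:.3f}")
```

With w = 9, a true probability of 0.10 comes out as 0.50. Every probability strictly between 0 and 1 is pushed upward, which is the overestimation described above. Under the same idealized assumption, dividing the predicted odds by w recovers the original probability.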

orrymr · December 11, 2019, 6:59am

Right, so using scale_pos_weight may overestimate the predicted probability of the minority class, then.
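
A rough sketch of why that overestimation does not necessarily hurt AUC: AUC depends only on how the examples are ranked, so any strictly increasing distortion of the scores leaves it unchanged. The labels, scores, and the weight of 9.0 below are made up for illustration, and `inflate` is a hypothetical stand-in for the idealized effect of weighting positives (odds multiplied by w). In a real model, changing scale_pos_weight also changes the fitted trees, so the ranking can shift as well.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    correct = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                correct += 1.0
            elif sp == sn:
                correct += 0.5
    return correct / (len(pos) * len(neg))


def inflate(p, w):
    """Multiply the odds of p by w (idealized effect of weighting positives)."""
    return w * p / (w * p + (1.0 - p))


# Made-up labels and predicted probabilities, for illustration only.
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.02, 0.05, 0.08, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.60]

w = 9.0  # hypothetical stand-in for scale_pos_weight
inflated = [inflate(p, w) for p in scores]

auc_before = auc(labels, scores)
auc_after = auc(labels, inflated)
mean_before = sum(scores) / len(scores)
mean_after = sum(inflated) / len(inflated)

print(f"AUC:  before={auc_before:.3f}  after={auc_after:.3f}")
print(f"mean: before={mean_before:.3f}  after={mean_after:.3f}")
```

The two AUC values are identical because the distortion preserves the ordering of the scores, while the mean predicted probability rises. So the ranking can be just as good even though the probabilities themselves are no longer trustworthy.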

Curious, though: in what cases would you be more interested in optimizing AUC than in getting the right probability? I.e., why optimize for the first rather than the second?