Why can't you rebalance the dataset if you want to predict probabilities?

OldMortality · September 19, 2020, 8:04pm

In the notes on parameter tuning

https://xgboost.readthedocs.io/en/latest/tutorials/param_tuning.html

it says:

“If you care about predicting the right probability: In such a case, you cannot re-balance the dataset”

Why is that?

Many thanks,

hcho3 · September 20, 2020, 5:27am

See the discussion in “How does scale_pos_weight affect probabilities?”. If your goal is to predict well-calibrated probabilities, you should not use scale_pos_weight or assign data weights.

OldMortality · September 20, 2020, 6:27am

Hi,
Thank you. Yes, I had read that. I think that the root of the matter is that the prediction is the proportion of the positive class in that particular part of the trees, and when you balance the data, you change those proportions. So would it be possible to estimate the model on the balanced data, and then run it on the original unbalanced data to get the predictions?
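To make the "proportion in a leaf" intuition above concrete, here is a minimal arithmetic sketch. It is a simplification: xgboost leaves hold additive log-odds scores rather than raw proportions, and the helper name `leaf_positive_rate` is hypothetical, not part of any library.

```python
def leaf_positive_rate(n_pos, n_neg, pos_weight=1.0):
    """Weighted share of positives among the rows that fall in one leaf.

    pos_weight is an illustrative stand-in for any re-balancing that
    makes each positive row count pos_weight times.
    """
    weighted_pos = pos_weight * n_pos
    return weighted_pos / (weighted_pos + n_neg)


# A leaf holding 5 positives and 95 negatives.
original = leaf_positive_rate(5, 95)          # true share of positives: 0.05
rebalanced = leaf_positive_rate(5, 95, 19.0)  # positives upweighted to parity: 0.5
```

The rows in the leaf are the same in both cases; only the weighting changes, and with it the proportion the model would learn to report.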
Thank you again.

hcho3 · September 20, 2020, 7:07am

You can try it and see if you are able to get well-calibrated probabilities. To my knowledge, any re-balancing is incompatible with well-calibrated probabilities.
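One way to "try it and see" is to bin the predictions and compare the mean predicted probability with the observed positive rate in each bin. The sketch below uses only the standard library; `calibration_table` is a hypothetical helper, and the toy labels and probabilities are made up for illustration, not output from a real model.

```python
def calibration_table(labels, probs, n_bins=5):
    """Return (mean predicted prob, observed positive rate, count) per bin.

    Predictions are grouped into equal-width probability bins; empty
    bins are skipped. For a well-calibrated model the first two values
    in each row should be close.
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    rows = []
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for _, p in members) / len(members)
        observed = sum(y for y, _ in members) / len(members)
        rows.append((mean_pred, observed, len(members)))
    return rows


# Toy data: one positive out of four rows.
toy_labels = [0, 0, 0, 1]
calibrated = calibration_table(toy_labels, [0.25] * 4)  # predicted matches observed
inflated = calibration_table(toy_labels, [0.75] * 4)    # predicted far above observed
```

Run on held-out data with the original class balance, a table like this would show whether a model trained on re-balanced data over-predicts the positive class.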

OldMortality · September 20, 2020, 7:31am

I would like to try it.

Now I don’t know xgboost that well, so I am not entirely sure how to achieve this. Fortunately, I did not use scale_pos_weight to balance the dataset, but instead I added a weight parameter when I created the x-matrix. So would it be right that all I need to do is run predict on the original x-matrix, with no weights parameter? Many thanks for your help, I am quite at sea here.
Cheers,

Michel

hcho3 · September 20, 2020, 7:36am

Yes, the prediction value itself should not be affected by the data weights passed at prediction time, since at that point the weights only control how the evaluation metric gets computed.

OldMortality · September 20, 2020, 8:00am

If that is the case, then I don’t see why setting the weights should mess up the predicted probabilities. I can understand why that would happen if setting a weight of, say, 10, would be equivalent to cloning that data point 9 times. But if the weights are used only in the loss function, then why are the probabilities not calibrated?

hcho3 · September 20, 2020, 6:37pm

Actually, that cloning-like behavior is what happens at training time.

OldMortality · September 21, 2020, 4:23am

It seems to me, from some toy examples, that setting the weights in the matrix does not clone the observations, but instead it presumably adds a weight in the loss function, while scale_pos_weight clones the observations for the minority class. Could that be right?

hcho3 · September 21, 2020, 5:20am

No, scale_pos_weight does not clone observations. It behaves as if you assigned an identical data weight to the positive class. The effect is the same as cloning observations, however, in that you would have poorly calibrated probability predictions.
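The equivalence described above can be sketched for the logistic loss, where each row contributes a gradient `p - y` and a Hessian `p * (1 - p)`. This is a minimal illustration, assuming a weight simply multiplies both terms and that every row shares the same current margin; `grad_hess_sums` is a hypothetical helper, not an xgboost function.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_hess_sums(labels, margin, weights):
    """Sum weighted logistic-loss gradients and Hessians over rows.

    All rows share one margin (log-odds), as they would inside a
    single leaf before its value is updated.
    """
    p = sigmoid(margin)
    grad = sum(w * (p - y) for y, w in zip(labels, weights))
    hess = sum(w * p * (1.0 - p) for w in weights)
    return grad, hess


margin = -1.0

# One positive with weight 3 and three negatives with weight 1 ...
labels = [1, 0, 0, 0]
weighted = grad_hess_sums(labels, margin, [3.0 if y == 1 else 1.0 for y in labels])

# ... versus the positive physically repeated three times, all weights 1.
cloned_labels = [1, 1, 1, 0, 0, 0]
cloned = grad_hess_sums(cloned_labels, margin, [1.0] * len(cloned_labels))
```

Both calls produce the same gradient and Hessian sums (up to floating-point rounding), so a training step driven by those sums cannot tell a weighted positive from a duplicated one, and either way the learned probabilities shift toward the positive class.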

zahs123 · March 16, 2022, 12:23pm

OK, a question on this: I understand the effect of using scale_pos_weight. However, if I rebalance my dataset, why does this have the same effect as using scale_pos_weight? Does xgboost take into account the ratio of positives to negatives in the training set?