Why can't you rebalance the dataset if you want to predict probabilities?

In the notes on parameter tuning

https://xgboost.readthedocs.io/en/latest/tutorials/param_tuning.html

it says:

“If you care about predicting the right probability: In such a case, you cannot re-balance the dataset.”

Why is that?

Many thanks,

See the discussion in “How does scale_pos_weight affect probabilities?”. If your goal is to predict well-calibrated probabilities, you should not use scale_pos_weight or assign data weights.
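
For example, here is a quick sketch of what goes wrong (the toy data, parameter values, and the choice of scale_pos_weight below are made up purely for illustration):

```python
# Sketch: re-balancing with scale_pos_weight vs. an unweighted fit.
import xgboost as xgb
from sklearn.datasets import make_classification

# Imbalanced toy data: roughly 10% positives.
X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
dtrain = xgb.DMatrix(X, label=y)

params = {"objective": "binary:logistic", "eta": 0.1, "max_depth": 4}

# Unweighted model: its mean prediction should track the ~10% base rate.
plain = xgb.train(params, dtrain, num_boost_round=100)

# Re-balanced model: scale_pos_weight set to (#negatives / #positives).
ratio = float((y == 0).sum()) / (y == 1).sum()
rebal = xgb.train({**params, "scale_pos_weight": ratio},
                  dtrain, num_boost_round=100)

print("base rate:               ", y.mean())
print("mean prob (unweighted):  ", plain.predict(dtrain).mean())
print("mean prob (re-balanced): ", rebal.predict(dtrain).mean())
# The re-balanced model's mean prediction is typically pulled well above
# the base rate, i.e. its probabilities are no longer calibrated.
```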

Hi,
Thank you. Yes, I had read that. I think the root of the matter is that the prediction is the proportion of the positive class in that particular part of the trees, and when you balance the data, you change those proportions. So would it be possible to estimate the model on the balanced data and then run it on the original, unbalanced data to get the predictions?
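
For concreteness, this is my (possibly imperfect) understanding of how the leaf values are fitted for binary:logistic, with $a_i$ denoting the data weights, so treat it as a sketch:

$$
w_{\text{leaf}} = -\frac{\sum_{i \in \text{leaf}} a_i\, g_i}{\sum_{i \in \text{leaf}} a_i\, h_i + \lambda},
\qquad g_i = p_i - y_i, \quad h_i = p_i\,(1 - p_i).
$$

Each leaf output is a regularized Newton step computed from these weighted sums, so up-weighting the positive class changes the effective “proportion” the leaf encodes, which would explain why the probabilities shift.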
Thank you again.

You can try it and see if you are able to get well-calibrated probabilities. To my knowledge, any re-balancing is incompatible with well-calibrated probabilities.

I would like to try it.

Now I don’t know xgboost that well, so I am not entirely sure how to achieve this. Fortunately, I did not use scale_pos_weight to balance the dataset; instead I added a weight parameter when I created the DMatrix. So would it be right that all I need to do is run predict on the original DMatrix, with no weight parameter? Many thanks for your help, I am quite at sea here.
Cheers,

Michel

Yes, the prediction values themselves are not affected by the data weights attached to the DMatrix you predict on; at prediction time, the weights only control how the evaluation metric gets computed.
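
Concretely, something like this (a sketch using the Python API; the data and weight values are invented for illustration):

```python
# Sketch: train with per-row weights, then predict on the unweighted data.
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)

# Per-row weights used only at training time (up-weight the positives).
train_weights = np.where(y == 1, 9.0, 1.0)
dtrain = xgb.DMatrix(X, label=y, weight=train_weights)
booster = xgb.train({"objective": "binary:logistic"}, dtrain,
                    num_boost_round=50)

# Predict on the original, unweighted data.
p_plain = booster.predict(xgb.DMatrix(X))

# Attaching weights to the prediction DMatrix does not change the predicted
# values; at predict time they only matter for evaluation metrics.
p_weighted = booster.predict(xgb.DMatrix(X, weight=train_weights))

print(np.allclose(p_plain, p_weighted))  # expected: True
```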

If that is the case, then I don’t see why setting the weights should mess up the predicted probabilities. I can understand why that would happen if setting a weight of, say, 10 were equivalent to cloning that data point 9 times. But if the weights are used only in the loss function, then why are the probabilities not calibrated?

Actually, that is effectively what happens at training time: a weight of 10 scales that data point’s contribution to the loss (its gradient and hessian) as if it were cloned 9 more times.

It seems to me, from some toy examples, that setting the weights in the DMatrix does not clone the observations but presumably adds a weight in the loss function, whereas scale_pos_weight clones the observations for the minority class. Could that be right?

No, scale_pos_weight does not clone observations. It behaves as if you had assigned an identical data weight to every instance in the positive class. The effect is the same as cloning observations, however, in that you end up with poorly calibrated probability predictions.
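
If you want to convince yourself, a sanity check along these lines should show the two giving the same booster (Python API, toy data; tree_method="exact" is used here only so the comparison is not blurred by histogram binning):

```python
# Sketch: scale_pos_weight=k vs. an identical per-row weight k on positives.
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
params = {"objective": "binary:logistic", "max_depth": 3, "eta": 0.1,
          "tree_method": "exact",  # avoid any differences from binning
          "base_score": 0.5}       # fix the starting score for both runs
k = 5.0

# Option 1: scale_pos_weight applied to the positive class.
d1 = xgb.DMatrix(X, label=y)
m1 = xgb.train({**params, "scale_pos_weight": k}, d1, num_boost_round=50)

# Option 2: the same constant weight k attached to every positive row.
d2 = xgb.DMatrix(X, label=y, weight=np.where(y == 1, k, 1.0))
m2 = xgb.train(params, d2, num_boost_round=50)

# The two should give (numerically) the same predictions, and both are
# equally mis-calibrated relative to the true base rate.
print(np.allclose(m1.predict(d1), m2.predict(d1), atol=1e-6))
print("base rate:", y.mean(), "mean prob:", m1.predict(d1).mean())
```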

OK, a question on this: I understand the effect of using scale_pos_weight. However, if I re-balance my dataset, why does this have the same effect as using scale_pos_weight? Does xgboost take into account the ratio of positives to negatives in the training set?