Why can't you rebalance the dataset if you want to predict probabilities?

In the notes on parameter tuning, it says:

“If you care about predicting the right probability: In such a case, you cannot re-balance the dataset”

Why is that?

Many thanks,

See discussion in How does scale_pos_weight affect probabilities?. If your goal is to predict well-calibrated probabilities, you should not use scale_pos_weight or assign data weights.

Thank you. Yes, I had read that. I think that the root of the matter is that the prediction is the proportion of the positive class in that particular part of the trees, and when you balance the data, you change those proportions. So would it be possible to estimate the model on the balanced data, and then run it on the original unbalanced data to get the predictions?
Thank you again.

You can try it and see if you are able to get well-calibrated probabilities. To my knowledge, any re-balancing is incompatible with well-calibrated probabilities.

I would like to try it.

Now I don’t know xgboost that well, so I am not entirely sure how to achieve this. Fortunately, I did not use scale_pos_weight to balance the dataset, but instead I added a weight parameter when I created the x-matrix. So would it be right that all I need to do is run predict on the original x-matrix, with no weights parameter? Many thanks for your help, I am quite at sea here.


Yes, at prediction time the predicted value itself is not affected by the data weights; at that stage the weights only control how the evaluation metric gets computed.

If that is the case, then I don’t see why setting the weights should mess up the predicted probabilities. I can understand why that would happen if setting a weight of, say, 10, would be equivalent to cloning that data point 9 times. But if the weights are used only in the loss function, then why are the probabilities not calibrated?

Actually, that is effectively what happens at training time: a weight of 10 acts like having 10 copies of that data point in the loss.

It seems to me, from some toy examples, that setting the weights in the matrix does not clone the observations, but instead it presumably adds a weight in the loss function, while scale_pos_weight clones the observations for the minority class. Could that be right?

No, scale_pos_weight does not clone observations. It behaves as if you assigned an identical data weight to the positive class. The effect is the same as cloning observations, however, in that you would have poorly calibrated probability predictions.
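To make the equivalence concrete, here is a toy sketch in plain NumPy. It is not XGBoost itself: it only finds the single constant probability that minimizes a weighted log loss, which is an assumption standing in for what happens inside one leaf. The 10/90 class split and the weight of 9 are made-up numbers.

```python
import numpy as np

def weighted_log_loss(p, y, w):
    """Weighted binary log loss for a single constant prediction p."""
    return -np.sum(w * (y * np.log(p) + (1 - y) * np.log(1 - p)))

def best_constant(y, w):
    """Grid search for the constant probability that minimizes the loss."""
    grid = np.linspace(0.01, 0.99, 99)
    losses = [weighted_log_loss(p, y, w) for p in grid]
    return grid[int(np.argmin(losses))]

# 10 positives and 90 negatives: the true positive rate is 0.1.
y = np.array([1.0] * 10 + [0.0] * 90)

# No weights: the best constant matches the observed positive rate.
p_unweighted = best_constant(y, np.ones_like(y))

# Weight of 9 on every positive, the way scale_pos_weight=9 would act.
pos_weight = 9.0
p_weighted = best_constant(y, np.where(y == 1, pos_weight, 1.0))

# Literally repeating each positive so it appears 9 times, with no weights.
y_cloned = np.concatenate([np.repeat(y[y == 1], 9), y[y == 0]])
p_cloned = best_constant(y_cloned, np.ones_like(y_cloned))

print(p_unweighted, p_weighted, p_cloned)
```

Weighting and cloning land on the same optimum, and it is no longer the true positive rate. That is why either form of re-balancing shifts the predicted probabilities.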