Why has rebalancing my data actually led to better calibration?

I have a binary classification problem where I am trying to predict default using XGBoost. My data originally has a class imbalance of 5% positives; upon training I get an accuracy of 0.91 and a recall of 0.5.
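Roughly what my baseline setup looks like (a simplified sketch, with synthetic data standing in for my actual default dataset and placeholder parameters):

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score

# synthetic stand-in for the real default data: ~5% positives
X, y = make_classification(
    n_samples=20000, n_features=20, weights=[0.95, 0.05], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = xgb.XGBClassifier(eval_metric="logloss")
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print("recall:  ", recall_score(y_test, pred))
```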

I then up-sample the minority class, which brings the class balance up to 25% positives. Upon training I get an accuracy of 0.7 and a recall of 0.75. I also noticed that my calibration plot looked a lot better. This goes against what I expected: I thought that up-sampling the minority class would lead to worse calibration. I thought upsampling/undersampling had the same effect as scale_pos_weight, i.e. that it would distort the predicted probabilities, but in my case it has led to better-looking calibration:
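The upsampling step is roughly this (continuing from the snippet above; I resample only the training set, and the resampling factor is just whatever gets positives to ~25%):

```python
import pandas as pd
from sklearn.utils import resample

train = pd.DataFrame(X_train)
train["y"] = y_train
pos = train[train["y"] == 1]
neg = train[train["y"] == 0]

# upsample positives with replacement until they are ~25% of the training set
pos_up = resample(pos, replace=True, n_samples=len(neg) // 3, random_state=42)
train_bal = pd.concat([neg, pos_up]).sample(frac=1, random_state=42)

model_bal = xgb.XGBClassifier(eval_metric="logloss")
model_bal.fit(train_bal.drop(columns="y").values, train_bal["y"].values)
```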

After upsampling:

[calibration plot after upsampling]

The Brier score of the uncalibrated model above is 0.15,

whereas before upsampling it was:

[calibration plot before upsampling]

and the Brier score of the uncalibrated model was 0.05. So the Brier score was in fact lower before upsampling, and it increased to 0.15 after upsampling.
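For reference, this is how I'm getting the Brier scores and the reliability curves (same held-out test set for both models, continuing from the snippets above):

```python
from sklearn.metrics import brier_score_loss
from sklearn.calibration import calibration_curve

proba_orig = model.predict_proba(X_test)[:, 1]     # trained on the original ~5% data
proba_bal = model_bal.predict_proba(X_test)[:, 1]  # trained on the upsampled ~25% data

print("Brier before upsampling:", brier_score_loss(y_test, proba_orig))
print("Brier after upsampling: ", brier_score_loss(y_test, proba_bal))

# points for the two reliability diagrams (fraction of positives vs mean predicted probability)
frac_orig, mean_orig = calibration_curve(y_test, proba_orig, n_bins=10)
frac_bal, mean_bal = calibration_curve(y_test, proba_bal, n_bins=10)
```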

Why is this happening? I thought upsampling led to worse calibration, but from the plots above the calibration looks better after upsampling, even though the Brier score is worse.


I’m curious about this as well, since the docs state that:
“If you care about predicting the right probability
In such a case, you cannot re-balance the dataset”
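For context, the scale_pos_weight route mentioned above (weighting instead of resampling) would look something like this sketch, reusing the same training split as the earlier snippets; the value is roughly #negatives / #positives in the training data:

```python
import xgboost as xgb

# weight the positive class instead of duplicating rows;
# with ~5% positives this ratio is about 19
spw = (y_train == 0).sum() / (y_train == 1).sum()

model_w = xgb.XGBClassifier(scale_pos_weight=spw, eval_metric="logloss")
model_w.fit(X_train, y_train)
```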