How does scale_pos_weight affect probabilities?

I am using scale_pos_weight on an imbalanced dataset. I am trying to optimize the Brier score, since obtaining well-calibrated probabilities is important to me. However, I find that without post-hoc calibration (e.g. isotonic regression) my calibration curve is not very well calibrated. I have read online that adding scale_pos_weight could be distorting my results… but why is this? What is the mathematical basis?

Also, when I plot the KDE of my isotonic-calibrated probabilities, it is not smooth… what could I be doing wrong?

Setting scale_pos_weight gives greater weight to the positive class, which has roughly the same effect as oversampling the positive-class data points. Try disabling scale_pos_weight and see if it gives you better-calibrated results.
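One way to see the distortion: upweighting positives by a factor s multiplies the odds the model learns by roughly s, the same as duplicating each positive point s times. Here is a minimal sketch of undoing that prior shift, assuming the weighting is the only source of miscalibration (`correct_weighted_probability` is my own helper name, not an XGBoost function):

```python
def correct_weighted_probability(p_weighted, s):
    """Map a probability learned with positives upweighted by s back to
    the unweighted scale by dividing the odds by s."""
    odds = p_weighted / (1.0 - p_weighted)  # weighted odds ~ s * true odds
    odds /= s
    return odds / (1.0 + odds)

# With scale_pos_weight = 49 (roughly (1 - 0.02) / 0.02 for 2% positives),
# a weighted prediction of 0.5 corresponds to a true probability near 0.02.
print(correct_weighted_probability(0.5, 49.0))  # ≈ 0.02
```

In practice, post-hoc calibration (Platt scaling or isotonic regression) is usually safer than this analytic correction, since a tree ensemble can distort the probability distribution in more ways than a single prior shift.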

Yes, I disabled it and now my XGBoost curve lies more or less along the reliability diagonal, as you can see below. However, I have an imbalanced problem where only 2% of my data are in the positive class, so I thought using scale_pos_weight would be beneficial, yet it is very important that I have well-calibrated results. Wouldn't disabling scale_pos_weight then fail to reflect the fact that I have an imbalanced problem?

[Screenshot: calibration curve after disabling scale_pos_weight]

Does increasing scale_pos_weight mean it increases the effect it has on the weights vector? Therefore, when we use the logit transform equation to calculate the probability, would a higher weight decrease the probability…?

A possible alternative for handling the imbalanced problem is to use a suitable metric together with a hyperparameter search. If your metric is robust to highly imbalanced problems (e.g. AUCPR), the hyperparameter search will choose the model that adequately accounts for the minority class. (After all, the risk in an imbalanced problem is that you'd ignore the minority class. For example, with the accuracy metric, you could always predict the negative class and obtain 98% accuracy.) You could even consider building a custom metric yourself that gives higher weight to the fit on the minority class.
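A quick illustration of the accuracy trap, using scikit-learn's `average_precision_score` as a stand-in for AUCPR (the data here is synthetic, just to mimic a 2% positive rate):

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.02).astype(int)  # roughly 2% positives

# Degenerate model: always predict the negative class.
always_negative = np.zeros(1000)
print(accuracy_score(y_true, always_negative))           # high (~0.98), looks great
print(average_precision_score(y_true, always_negative))  # low (~0.02), exposes the problem
```

In XGBoost itself you can set eval_metric to "aucpr" so the same quantity is monitored during training.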

I am using the Brier score, and I am not sure whether this is suitable or not. The reason for choosing it is that I would like well-calibrated probabilities from my model. Can I ask, though, how exactly scale_pos_weight acts on the model? Am I correct in thinking it acts on the first-order and second-order (Hessian) gradients of the loss function, i.e. the 'weights'? If so, when we come to calculate probabilities via the inverse logit, i.e. 1/(1+e^(-z)) where z = Xw and w = -g(i)/(h(i)+lambda), having a larger weight vector will make the probability smaller. Is this correct?

The scale_pos_weight hyperparameter scales the gradient pair (the first-order and second-order gradients) associated with each positive-class data point.
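Concretely, for the logistic objective each data point contributes a pair (g, h) of first- and second-order gradients of the loss, and scale_pos_weight multiplies both members of the pair when the label is positive. A sketch of my understanding (`gradient_pair` is a hypothetical helper, not XGBoost's API):

```python
import math

def gradient_pair(pred_margin, label, scale_pos_weight=1.0):
    """Gradient pair of the logistic loss for one data point; the pair
    for a positive-class point is multiplied by scale_pos_weight."""
    p = 1.0 / (1.0 + math.exp(-pred_margin))  # sigmoid of the raw margin
    g = p - label                             # first-order gradient
    h = p * (1.0 - p)                         # second-order gradient (Hessian)
    if label == 1:
        g *= scale_pos_weight
        h *= scale_pos_weight
    return g, h

print(gradient_pair(0.0, 1, scale_pos_weight=49.0))  # (-24.5, 12.25)
```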

having a larger weight vector will make the probability smaller

Not necessarily, since the scaling occurs inside the loss function being optimized, not directly on the final probabilities.
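To make "not necessarily" concrete: in the paper's derivation the optimal leaf weight is w* = -Σg / (Σh + λ), so scaling every g and h in a leaf by the same factor changes w* only through the regularizer λ. In this toy leaf of three positive points (values invented for illustration), scaling actually makes the leaf weight larger, hence the predicted probability larger, not smaller:

```python
def leaf_weight(grads, hess, lam=1.0):
    """Optimal leaf weight from the XGBoost derivation: w* = -sum(g) / (sum(h) + lambda)."""
    return -sum(grads) / (sum(hess) + lam)

g = [-0.6, -0.5, -0.7]   # first-order gradients (g = p - 1 < 0 for positives)
h = [0.24, 0.25, 0.21]   # second-order gradients
s = 10.0                 # scale_pos_weight

print(leaf_weight(g, h))                                        # ≈ 1.06
print(leaf_weight([gi * s for gi in g], [hi * s for hi in h]))  # ≈ 2.25, larger
```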

My suggestion is to avoid scale_pos_weight altogether if you want calibrated probabilities. You can try to improve the metric so that the hyperparameter search picks a model that performs adequately on the minority class.

Is there a mathematical resource for this you can point me to, so I can better understand what these gradient pairs are?

Here is the KDD paper: https://www.kdd.org/kdd2016/papers/files/rfp0697-chenAemb.pdf. If you want more detailed mathematical derivations, take a look at my master's thesis: https://drive.google.com/file/d/0B0c0MbnP6Nn-eUNRRkVOOGpkbFk/view?usp=sharing

Many thanks. Just so I am clear: in your thesis, what do you mean by instance sets? Is this the set of values for a given feature in the training matrix? Also, when we add weights, is it similar to the below, where alpha is our weights, i.e. the scale_pos_weight vector for each data point?

No. The instance set is defined to be the set of data points that are associated with each tree node. From the thesis:

Each leaf node gives rise to an instance set, the set of all data points for which traversal ended at that leaf node.

Yes, that is correct.

many thanks for your help

Instead of using scale_pos_weight, we can also add weights to the DMatrix, as in:

xgb.DMatrix(data=mydata, label=mylabels, weight=myweights)

Does that come to the same thing as using scale_pos_weight?

@OldMortality Yes, setting scale_pos_weight to a value X is equivalent to assigning weight X to every data point with the positive class label.
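A small sketch of that equivalence, reusing the variable names from the thread (`labels` and the commented-out DMatrix line are placeholders, since real data is needed to build a DMatrix):

```python
import numpy as np

labels = np.array([0, 1, 0, 0, 1])
X = 49.0  # the value you would otherwise pass as scale_pos_weight

# Equivalent per-instance weights: X for positive points, 1 for negatives.
myweights = np.where(labels == 1, X, 1.0)
print(myweights)  # weights of 1 for negatives, 49 for positives
# dtrain = xgb.DMatrix(data=mydata, label=labels, weight=myweights)
```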
