Weighting: best practice for weighted and imbalanced datasets?


#1

Hello,

I am using xgboost to separate simulated signal from simulated noise. My events carry simulation weights (from reweighting to an energy spectrum) and the classes are imbalanced (many more background events than signal events). Given this, how should I represent it in xgboost's parameters?
At the moment, I am leaving scale_pos_weight at its default of 1 and normalizing the total weight of the background and of the signal to 0.5 each. Is this the best way to do it?
Also, how does this interact with other parameters like min_child_weight? I guess it sums the integrated weight of the events instead of just counting them, but with this method, interpreting min_child_weight becomes very hard. Or should I keep the weights unnormalised and instead scale the signal up with scale_pos_weight?

As I said, I am only interested in binary classification (I have read that there can be problems getting properly calibrated probabilities for each class when using scale_pos_weight).

Thanks for any help understanding proper weighting and its effect in xgboost!
Frederik


#2

Hello @FrederikLauber, have you tried out classic class imbalance approaches?

They could very well be of more help here than messing around with boosting parameters.

Check this review paper and a sklearn package.


#3

Hi,
thanks for your answer, but my problem is not understanding how to handle weighted and imbalanced classes in general. I have already done that in the past in the Keras framework, where it normally came down to adjusting the event weights accordingly; most other parameters (like the learning rate) were independent of this scaling.

My problem is understanding how to implement what I learned there in the context of xgboost.
For example:
Is setting scale_pos_weight=10 the same as increasing the weight of every positive event by a factor of 10? Or are there differences between these two cases? I would, for example, imagine that scale_pos_weight is only taken into account for the loss minimization and the boosting, but not for things like min_child_weight. That would make a lot of sense, because the interpretation of min_child_weight would become much easier. The problem is that the xgboost documentation is often not clear on this.
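To make my question concrete: here is what I would expect if scale_pos_weight simply multiplies the per-instance weight of positive events before the gradients and hessians of the binary log loss are computed (a numeric sketch with made-up predictions; I am not certain this is what the implementation actually does):

```python
import numpy as np

def logloss_grad_hess(p, y, w):
    """Per-instance gradient and hessian of the weighted binary log loss."""
    return w * (p - y), w * p * (1.0 - p)

p = np.array([0.3, 0.8, 0.6])   # made-up current predicted probabilities
y = np.array([0, 1, 1])         # labels
w = np.ones(3)                  # unit sample weights

# Variant A: scale_pos_weight = 10, modeled as multiplying positive weights
w_a = np.where(y == 1, 10.0 * w, w)
g_a, h_a = logloss_grad_hess(p, y, w_a)

# Variant B: explicitly multiply every positive event's sample weight by 10
w_b = w.copy()
w_b[y == 1] *= 10.0
g_b, h_b = logloss_grad_hess(p, y, w_b)

# Under this assumption the two variants are identical, so the hessian
# sums that min_child_weight compares against would also be identical.
assert np.allclose(g_a, g_b) and np.allclose(h_a, h_b)
```

If that assumption is right, the two cases cannot differ anywhere, including in min_child_weight. If scale_pos_weight is instead applied only in the loss, the difference would show up exactly there.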
Take min_child_weight again; this is its documentation:
“Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression task, this simply corresponds to minimum number of instances needed to be in each node. The larger min_child_weight is, the more conservative the algorithm will be.”
From that, I do not understand how it interacts with, for example, scale_pos_weight. Is it the same as scaling all the positive weights, or is it not? It's very unclear to me.

Same for the learning rate and other parameters. I would think that if I scale up all weights by a factor of lets say 10**6 should change nothing as only the relatives weights should matter for the optimization but for some parameters (i.e. min_child_weight) this would make a difference because they might just sum up over all the weights without any normalization. I am pretty sure this is the case for min_child_weight but for what other parameter does it make a difference?