Consider this simple setup, where a single feature can take one of 4 values. The value X=4 is rare (it occurs only once in the training set) but it matters a lot to the loss function: its sample weight is 120, versus 1 for every other example.
| Feature value (X) | Count | Total weight | Sample mean | True mean |
|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 2 |
| 2 | 100 | 100 | 2.01 | 2 |
| 3 | 100 | 100 | 2.93 | 3 |
| 4 | 1 | 120 | 2 | 3 |
If min_child_weight < 120, the single item with X=4 can get its own leaf, leading to overfitting. But any min_child_weight large enough to prevent that (> 120) also exceeds the total weight (100) of the X=2 and X=3 buckets, forcing them to be merged and leading to underfitting.
If min_child_weight were compared against the raw counts (or the sum of the unweighted hessians), we could use 1 < min_child_weight ≤ 100 and recover the optimal split. But this is not how XGBoost works - it seems to assume that giving an item 100x weight means we are 100x more confident in its label.
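The arithmetic above can be sketched in a few lines. This is a minimal sketch, not XGBoost code: it assumes a squared-error objective (each example's hessian is 1, so a node's "child weight" is just the sum of its sample weights), and the helper names (`leaf_allowed`, `overfits`, `underfits`) are mine for illustration.

```python
def leaf_allowed(hessian_sum, min_child_weight):
    # XGBoost-style rule: a node is kept only if its total hessian
    # reaches min_child_weight.
    return hessian_sum >= min_child_weight

# Per-bucket totals from the table: weighted hessian sums vs raw counts.
weighted = {1: 1, 2: 100, 3: 100, 4: 120}
counts   = {1: 1, 2: 100, 3: 100, 4: 1}

def overfits(h, mcw):
    # The single X=4 item can be isolated in its own leaf.
    return leaf_allowed(h[4], mcw)

def underfits(h, mcw):
    # X=2 and X=3 cannot each get their own leaf, so they merge.
    return not (leaf_allowed(h[2], mcw) and leaf_allowed(h[3], mcw))

# With weighted hessians, every threshold overfits or underfits:
assert all(overfits(weighted, m) or underfits(weighted, m)
           for m in range(1, 200))

# With raw counts, any 1 < min_child_weight <= 100 does neither:
assert not overfits(counts, 50) and not underfits(counts, 50)
```

The two assertions make the dilemma concrete: under the weighted comparison the "good" intervals (> 120 to block X=4, ≤ 100 to keep X=2/X=3 separate) do not overlap, while under the count-based comparison they do.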
Does anyone know of a way to work around this issue and regularize effectively across the full range of weights?