Consider this simple setup, where a single feature can take one of 4 values. The value X=4 is rare (it occurs only once in the training set) but it matters a lot to the loss function: its sample weight is 120, versus 1 for every other example.
| Feature value (X) | Count | Total weight | Sample mean | True mean |
|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 2 |
| 2 | 100 | 100 | 2.01 | 2 |
| 3 | 100 | 100 | 2.93 | 3 |
| 4 | 1 | 120 | 2 | 3 |
If min_child_weight < 120, the single item with X=4 can get its own leaf, leading to overfitting. But any min_child_weight large enough to prevent that (> 120) also exceeds the total weight (100) of the X=2 and X=3 buckets, forcing them to be merged and leading to underfitting.
If min_child_weight were compared against the raw counts (or the sum of the unweighted hessians), we could use 1 < min_child_weight ≤ 100 and recover the optimal split. But this is not how XGBoost works - it seems to assume that giving an item 100x weight means we are 100x more confident in its label.
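The arithmetic above can be sketched in a few lines. This is a minimal sketch, not XGBoost code: it assumes a squared-error objective (each example's hessian is 1, so a node's "child weight" is just the sum of its sample weights), and the helper names (`leaf_allowed`, `overfits`, `underfits`) are mine for illustration.

```python
def leaf_allowed(hessian_sum, min_child_weight):
    # XGBoost-style rule: a node is kept only if its total hessian
    # reaches min_child_weight.
    return hessian_sum >= min_child_weight

# Per-bucket totals from the table: weighted hessian sums vs raw counts.
weighted = {1: 1, 2: 100, 3: 100, 4: 120}
counts   = {1: 1, 2: 100, 3: 100, 4: 1}

def overfits(h, mcw):
    # The single X=4 item can be isolated in its own leaf.
    return leaf_allowed(h[4], mcw)

def underfits(h, mcw):
    # X=2 and X=3 cannot each get their own leaf, so they merge.
    return not (leaf_allowed(h[2], mcw) and leaf_allowed(h[3], mcw))

# With weighted hessians, every threshold overfits or underfits:
assert all(overfits(weighted, m) or underfits(weighted, m)
           for m in range(1, 200))

# With raw counts, any 1 < min_child_weight <= 100 does neither:
assert not overfits(counts, 50) and not underfits(counts, 50)
```

The two assertions make the dilemma concrete: under the weighted comparison the "good" intervals (> 120 to block X=4, ≤ 100 to keep X=2/X=3 separate) do not overlap, while under the count-based comparison they do.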
Does anyone know of a way to work around this issue and regularize effectively across the full range of weights?