Justification for using sum of hessian for min_child_weight?

Why is the sum of the hessians used for min_child_weight, and not just the sum of the sample weights?

I believe that for squared-error loss the hessian is directly proportional to the sample weight, so I can see why using the hessian works in that case. What is the justification for other loss functions?
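For concreteness, with a per-sample weight $w_i$ and current prediction $\hat y_i$, weighted squared error gives

$$
\ell_i = \tfrac{1}{2}\, w_i (y_i - \hat y_i)^2, \qquad
g_i = \frac{\partial \ell_i}{\partial \hat y_i} = w_i (\hat y_i - y_i), \qquad
h_i = \frac{\partial^2 \ell_i}{\partial \hat y_i^2} = w_i,
$$

so the sum of hessians over a node is exactly the sum of the sample weights in that node.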

Is this a performance optimization, given that the sum of the (weighted) hessians will already have been computed as part of the splitting procedure?
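To illustrate what I have in mind, here is a minimal sketch of an exact greedy split search in Python (hypothetical names, not XGBoost’s actual internals), where the running hessian sums needed for the gain formula are the same quantities a min_child_weight check compares against the threshold:

```python
# Toy exact split search over one pre-sorted feature.
# Names (find_best_split, lam, min_child_weight) are illustrative only,
# not XGBoost's real implementation.

def leaf_score(G, H, lam=1.0):
    """Structure score G^2 / (H + lambda) of a leaf with grad/hess sums G, H."""
    return G * G / (H + lam)

def find_best_split(samples, lam=1.0, min_child_weight=1.0):
    """samples: list of (feature_value, grad, hess), pre-sorted by feature_value."""
    G_total = sum(g for _, g, _ in samples)
    H_total = sum(h for _, _, h in samples)

    best_gain, best_threshold = 0.0, None
    G_left = H_left = 0.0
    for x, g, h in samples[:-1]:
        G_left += g
        H_left += h
        G_right, H_right = G_total - G_left, H_total - H_left

        # The hessian sums are already accumulated for the gain below,
        # so the min_child_weight check costs nothing extra.
        if H_left < min_child_weight or H_right < min_child_weight:
            continue

        gain = (leaf_score(G_left, H_left, lam)
                + leaf_score(G_right, H_right, lam)
                - leaf_score(G_total, H_total, lam))
        if gain > best_gain:
            best_gain, best_threshold = gain, x
    return best_gain, best_threshold
```

In this toy version the same H_left / H_right sums serve double duty: they appear in the gain and they are what min_child_weight prunes on, which is what I meant by the check coming essentially for free.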

To me it’s more or less a heuristic. In Newton’s method, the hessian represents the step size of a sample during gradient boosting, so it’s another form of “weight”.

I think it’s more accurate to say that the inverse of the hessian gives the step size in Newton’s method, would you agree? In any case, I worked through the math and think I now see why the loss, under a second-order approximation, is equivalent to a squared-error problem with the hessian as the weight. Here are my notes (ignoring the regularization terms):
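In sketch form: for a leaf containing samples $i$ with current predictions $\hat y_i$, per-sample gradients $g_i$ and hessians $h_i$, and a single leaf value $f$,

$$
\sum_i \ell(y_i, \hat y_i + f)
\;\approx\; \sum_i \Big[\ell(y_i, \hat y_i) + g_i f + \tfrac{1}{2} h_i f^2\Big]
\;=\; \sum_i \tfrac{1}{2} h_i \Big(f - \big(-\tfrac{g_i}{h_i}\big)\Big)^2 + \text{const}.
$$

So to second order, choosing the leaf value is a weighted squared-error problem with pseudo-targets $-g_i/h_i$ and weights $h_i$. Minimizing gives the Newton step

$$
f^{*} = -\,\frac{\sum_i g_i}{\sum_i h_i},
$$

and the “total weight” of the leaf in this squared-error view is $\sum_i h_i$, which is exactly the quantity that min_child_weight thresholds.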
