Justification for using sum of hessian for min_child_weight?

pat · April 2, 2021, 2:50pm

Why is the sum of the hessian used for min_child_weight and not just the sum of the weight?

I believe in squared error loss the hessian is directly proportional to the weight so I can see why the hessian works in this situation. What is the justification for other loss functions?
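To make the comparison concrete, here is a minimal sketch of the per-instance hessians for two losses. It assumes the squared error is written as `0.5 * (pred - label)**2` and that the logistic loss is taken with respect to the raw margin; the function names and the sample values are made up for illustration and are not XGBoost's API.

```python
import numpy as np

def squared_error_hessian(pred, weight):
    # Second derivative of weight * 0.5 * (pred - label)^2 w.r.t. pred
    # is the weight itself, independent of pred and label.
    return weight * np.ones_like(pred)

def logistic_hessian(margin, weight):
    # Second derivative of the weighted log loss w.r.t. the raw margin
    # is weight * p * (1 - p), where p = sigmoid(margin).
    p = 1.0 / (1.0 + np.exp(-margin))
    return weight * p * (1.0 - p)

# Illustrative (made-up) sample weights and current raw predictions
weight = np.array([1.0, 2.0, 0.5, 1.5])
margin = np.array([0.0, 2.0, -1.0, 4.0])

se_cover = squared_error_hessian(margin, weight).sum()
logit_cover = logistic_hessian(margin, weight).sum()

print("sum of weights:          ", weight.sum())
print("squared error hessian sum:", se_cover)
print("logistic hessian sum:     ", logit_cover)
```

For squared error the two sums coincide, while for the logistic loss the hessian sum also depends on the current predictions and is at most a quarter of the weight sum.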

Is this a performance optimization given that the sum of the (weighted) hessian will already have been calculated as part of the splitting procedure?

jiamingy · April 8, 2021, 11:28am

To me it’s more or less a heuristic. With Newton’s method, the hessian represents the step size of a sample during gradient boosting, so it’s another form of “weight”.

pat · April 12, 2021, 8:22am

I think it’s more accurate to say that the inverse of the hessian is the step size in Newton’s method, would you agree? In any case, I worked through the math and think I now see why the loss, following a second-order approximation, is equivalent to a squared error problem with the hessian as the weight. Here are my notes (ignoring the regularization terms):
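As a sketch of the standard argument behind this equivalence (not necessarily identical to the notes mentioned above), take $g_i$ and $h_i$ to be the first and second derivatives of the loss $l$ at the current prediction $\hat{y}_i^{(t-1)}$, assume $h_i > 0$, and ignore regularization:

```latex
\begin{align*}
\sum_i l\bigl(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\bigr)
  &\approx \sum_i \Bigl[ l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)
     + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \Bigr] \\
  &= \sum_i \tfrac{1}{2} h_i \Bigl( f_t(x_i) + \frac{g_i}{h_i} \Bigr)^2
     + \underbrace{\sum_i \Bigl[ l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)
     - \frac{g_i^2}{2 h_i} \Bigr]}_{\text{constant in } f_t}
\end{align*}
```

Read this way, fitting $f_t$ is a weighted least squares problem with targets $-g_i/h_i$ (the per-instance Newton steps) and weights $h_i$. For a leaf with a single value $w$, the minimizer is $w^* = -\sum_i g_i / \sum_i h_i$, the $h_i$-weighted mean of those targets.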

sebov · February 23, 2023, 9:52am

Hi there. This is an old topic, but as it is the closest one to my questions, I am writing here.

I would like to ask whether there are any intuitions or interpretations behind weighting instances by their hessians. For squared error, the hessian equals 1, so you can think of the coverage/weights as the size of the node (expressed as a number of instances). But are there natural interpretations for other loss functions, ones that can be understood without going deep into the maths and loss function derivatives?
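One possible reading for the logistic loss, offered as a sketch: the hessian of an instance is `p * (1 - p)`, which is also the variance of a Bernoulli label with probability `p`. So the hessian sum of a node measures how much uncertainty remains in it, not how many instances it holds. The helper name `logistic_cover` and the two example nodes below are made up for illustration; unit instance weights are assumed.

```python
import numpy as np

def logistic_cover(p):
    # Sum of hessians p * (1 - p) over unit-weight instances in a node
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1.0 - p)))

min_child_weight = 1.0  # XGBoost's default value for this parameter

# Two hypothetical nodes with the same number of instances
confident = np.full(50, 0.99)  # predictions already close to 1
uncertain = np.full(50, 0.5)   # predictions still undecided

for name, probs in [("confident", confident), ("uncertain", uncertain)]:
    cover = logistic_cover(probs)
    passes = cover >= min_child_weight
    print(f"{name}: n={len(probs)}, cover={cover:.3f}, passes={passes}")
```

Both nodes contain 50 instances, yet the confident node has a hessian sum below the threshold while the uncertain one is well above it. Under this reading, `min_child_weight` stops the tree from carving out children in which the model is already nearly certain.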

I also would like to ask about the fragment of the “XGBoost: A Scalable Tree Boosting System” paper that refers to hessians used as weights:

To see why $h_i$ represents the weight, we can rewrite Eq (3) as $\sum_{i=1}^n \frac{1}{2}h_i(f_t(x_i) - \frac{g_i}{h_i})^2 + \Omega(f_t) + constant$

I’m probably missing something, but is this formula correct? Looking only at the summation term, shouldn’t it rather be $\frac{1}{2}h_i(f_t(x_i) - (- \frac{g_i}{h_i}))^2$, i.e., “-(-g_i/h_i)” instead of just “-g_i/h_i”?
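The sign can be checked numerically. The sketch below assumes that the summand of Eq (3) is $g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2$ and uses random, made-up values for `g`, `h`, and `f`. It compares that summand with both completed-square forms, using the same per-instance constant $-g_i^2 / (2 h_i)$ in each.

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=5)               # made-up gradients
h = rng.uniform(0.1, 2.0, size=5)    # made-up positive hessians
f = rng.normal(size=5)               # made-up tree outputs f_t(x_i)

# Second-order term of the objective, per instance
taylor = g * f + 0.5 * h * f**2

# Constant left over after completing the square (does not depend on f)
const = -g**2 / (2.0 * h)

plus_form = 0.5 * h * (f + g / h) ** 2 + const   # f - (-g/h)
minus_form = 0.5 * h * (f - g / h) ** 2 + const  # f - g/h, as quoted

print("plus form matches: ", np.allclose(taylor, plus_form))
print("minus form matches:", np.allclose(taylor, minus_form))
```

Only the form with $f_t(x_i) + g_i/h_i$, that is “-(-g_i/h_i)”, reproduces the expansion. Its minimizer $f_t(x_i) = -g_i/h_i$ is the Newton step. The role of $h_i$ as the weight is the same under either sign.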