Hinge loss explanation

nikolas · August 8, 2023, 6:24pm

Hello,

I’ve been trying to understand how the hinge loss for binary classification is implemented but I haven’t been able to recover sufficient information from the code itself. In addition, it seems like no one uses hinge for decision trees so I hope someone here can help me understand.

We know that the hinge loss is h = max{0, 1 - label * prediction}. When we use it in something like an SVM, it’s easy to represent the prediction as the output of a function, for example prediction = w * x. From there we can take the derivative with respect to w (wherever it’s defined, or take subgradients). In my attempts to formulate the hinge loss for a decision tree I’ve always been stumped, because the only way I’ve been able to write the output of a decision tree is as a piecewise mess of conjunctions that has zero gradient everywhere. Moreover, there is no second-order derivative.
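To make the SVM case concrete, here is a minimal sketch of the hinge loss for a linear model and one valid subgradient with respect to w. The function names (LinearScore, HingeLoss, HingeSubgradient) are made up for illustration and are not taken from XGBoost; the label is assumed to be in {-1, +1}.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Linear score w . x (hypothetical helper, not from XGBoost).
double LinearScore(const std::vector<double>& w, const std::vector<double>& x) {
  assert(w.size() == x.size());
  double s = 0.0;
  for (std::size_t i = 0; i < w.size(); ++i) s += w[i] * x[i];
  return s;
}

// Hinge loss max(0, 1 - y * (w . x)) for a label y in {-1, +1}.
double HingeLoss(const std::vector<double>& w, const std::vector<double>& x,
                 double y) {
  return std::max(0.0, 1.0 - y * LinearScore(w, x));
}

// One valid subgradient with respect to w:
//   -y * x  when y * (w . x) < 1 (margin violated),
//   0       otherwise (at the kink y * (w . x) == 1 we choose 0).
std::vector<double> HingeSubgradient(const std::vector<double>& w,
                                     const std::vector<double>& x, double y) {
  std::vector<double> g(w.size(), 0.0);
  if (y * LinearScore(w, x) < 1.0) {
    for (std::size_t i = 0; i < w.size(); ++i) g[i] = -y * x[i];
  }
  return g;
}
```

For example, with w = (0.5, -1), x = (1, 2) and y = +1 the score is -1.5, so the loss is 2.5 and the subgradient is (-1, -2). This only works because the prediction is an explicit differentiable function of w, which is exactly what I can’t write down for a tree.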

In the code of hinge.cu we have that:

bst_float p = _preds[_idx];
bst_float w = is_null_weight ? 1.0f : _weights[_idx];
bst_float y = _labels[_idx] * 2.0 - 1.0;
bst_float g, h;
if (p * y < 1.0) {
  g = -y * w;
  h = w;
} else {
  g = 0.0;
  h = std::numeric_limits<bst_float>::min();
}
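
To check that I’m reading the snippet correctly, here is the same logic pulled out into a standalone function. This is only a sketch: the struct GradHess and the function HingeGradHess are names I made up, bst_float is replaced by plain float, and the label is assumed to be 0 or 1 as in the original code.

```cpp
#include <limits>

// Hypothetical container for the two per-example values; not an XGBoost type.
struct GradHess {
  float g;  // value stored as the "gradient"
  float h;  // value stored as the "hessian"
};

// Standalone restatement of the snippet from hinge.cu quoted above.
// pred:    raw prediction for one example (p in the snippet)
// label01: label in {0, 1}
// weight:  example weight (1 when no weights are given)
GradHess HingeGradHess(float pred, float label01, float weight = 1.0f) {
  const float y = label01 * 2.0f - 1.0f;  // map {0, 1} to {-1, +1}
  if (pred * y < 1.0f) {
    // Margin violated: g = -y * w, h = w.
    return {-y * weight, weight};
  }
  // Margin satisfied: g = 0, h = smallest positive normal float.
  return {0.0f, std::numeric_limits<float>::min()};
}
```

So for a prediction of 0.2 with label 1 and no weight this gives g = -1 and h = 1, and for a prediction of 2.0 with label 1 it gives g = 0 and a tiny positive h. Both values depend only on the scalar prediction, the label and the weight.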

With respect to what is the derivative taken to define g? Why is h = w? Is there some regularization parameter that is not being mentioned?