Hello,
I’ve been trying to understand how the hinge loss for binary classification is implemented but I haven’t been able to recover sufficient information from the code itself. In addition, it seems like no one uses hinge for decision trees so I hope someone here can help me understand.
We know that the hinge loss is h = max{0, 1 - label * prediction}. When we use it in something like an SVM, it’s easy to represent the prediction as the output of a function, for example prediction = w * x. From there we can take the derivative with respect to w (wherever it’s defined, or take subgradients otherwise). In my attempts to formulate the hinge loss for a decision tree I’ve always been stumped, because the only way I’ve been able to write the output of a decision tree is as a piecewise-constant mess of conjunctions that has zero gradient wherever it is differentiable. Moreover, there is no second-order derivative.
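To make the SVM case concrete, here is the derivative I have in mind. The notation is my own: y is the label mapped to {-1, +1}, x is the feature vector, and I write the loss as \ell so it does not clash with the h in the code further down.

```latex
\ell(w) = \max\{0,\; 1 - y\, w^\top x\}, \qquad
\partial_w \ell(w) \ni
\begin{cases}
-y\,x & \text{if } y\, w^\top x < 1,\\
0     & \text{if } y\, w^\top x > 1.
\end{cases}
```

At the kink y w^T x = 1 any point between the two cases is a valid subgradient. The second derivative with respect to w is zero wherever it exists.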
In the code of hinge.cu we have that:
    bst_float p = _preds[_idx];
    bst_float w = is_null_weight ? 1.0f : _weights[_idx];
    bst_float y = _labels[_idx] * 2.0 - 1.0;
    bst_float g, h;
    if (p * y < 1.0) {
      g = -y * w;
      h = w;
    } else {
      g = 0.0;
      h = std::numeric_limits<bst_float>::min();
    }
What is the derivative taken with respect to in order to define g? And how does h = w come about? Is there some regularization parameter that is not being mentioned?
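My best guess so far is that g is the subgradient of w * max(0, 1 - y * p) taken with respect to the prediction p itself, not with respect to any model parameter. To check that reading I restated the per-row logic as a standalone function. This is only my own sketch, not the actual XGBoost code: GradPair and HingeGradPair are names I made up, and I am assuming bst_float is a plain float.

```cpp
#include <limits>

// Hypothetical names for illustration only; not the XGBoost API.
struct GradPair {
  float grad;  // g in the snippet
  float hess;  // h in the snippet
};

// pred    : current raw score for the row (p in the snippet)
// label01 : label in {0, 1}, as stored in _labels
// weight  : per-row weight (1 when no weights are given)
inline GradPair HingeGradPair(float pred, float label01, float weight = 1.0f) {
  const float y = label01 * 2.0f - 1.0f;  // map {0, 1} to {-1, +1}
  if (pred * y < 1.0f) {
    // Inside the margin the loss is weight * (1 - y * pred),
    // so d/dpred gives -y * weight. The snippet sets h to the weight.
    return {-y * weight, weight};
  }
  // Outside the margin the loss is flat in pred, so the gradient is 0.
  // The snippet sets h to the smallest positive normal float here.
  return {0.0f, std::numeric_limits<float>::min()};
}
```

For example, HingeGradPair(0.0f, 1.0f) gives g = -1 and h = 1, while HingeGradPair(2.0f, 1.0f) gives g = 0 and the tiny placeholder h. If this reading is right it accounts for g, but it still does not tell me where h = w comes from, since the second derivative of the hinge with respect to p is zero wherever it exists.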