Catastrophically bad performance with absolute-error-like objective functions

I was using a custom log-cosh objective function for a study in which I’m interested in absolute error rather than squared error, and despite what I believe was good tuning, the mean absolute error (MAE) of a log-cosh-fit model always seems to be a little worse than the MAE of a model fit with reg:squarederror. Experimenting, I found that the built-in reg:pseudohubererror (which, like a log-cosh objective, is motivated as a differentiable substitute for absolute error) seems to behave similarly. I was able to construct a toy example in which XGBoost does particularly badly with pseudo-Huber loss. Things did not seem to improve with tuning or when using a separate test set (omitted for simplicity).
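
For reference, a minimal log-cosh objective for the R interface looks something like this (a sketch of the idea, not my exact code):

# grad and hess are the first and second derivatives of log(cosh(r)) with
# respect to the predictions, where r = prediction - label; 1 - tanh(r)^2
# equals sech(r)^2 but avoids overflowing cosh() for large residuals.
logcosh_obj = function(preds, dtrain)
    {r = preds - xgboost::getinfo(dtrain, "label")
    list(grad = tanh(r), hess = 1 - tanh(r)^2)}

The toy example below needs only the built-in losses, though. Here’s a reproducible version in R: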

library(xgboost)

rmse = function(x, y)
    sqrt(mean((x - y)^2))
mae = function(x, y)
    mean(abs(x - y))

set.seed(5)

N = 1000
x = rep(c(0L, 1L, 10L), length.out = N)
y = x^2 + rnorm(N)^2

for (loss in c("reg:squarederror", "reg:pseudohubererror"))
    {m = xgboost::xgboost(
         verbose = 0,
         params = list(objective = loss),
         data = matrix(x),
         label = y,
         nrounds = 50)
    p = predict(m, newdata = matrix(x))
    message("RMSE ", loss, " - ", rmse(y, p))
    message("MAE ", loss, " - ", mae(y, p))}

The result is:

RMSE reg:squarederror - 1.46276476010163
MAE reg:squarederror - 0.999394763853888
RMSE reg:pseudohubererror - 51.5824812474803
MAE reg:pseudohubererror - 33.1663704182267

Simplifying further, consider this trivial example:

x = c(0L, 1L)
y = x

for (loss in c("reg:squarederror", "reg:pseudohubererror"))
    {m = xgboost::xgboost(
         verbose = 0,
         params = list(
             lambda = 0, eta = 1,
             objective = loss),
         data = matrix(x),
         label = y,
         nrounds = 1)
    p = predict(m, newdata = matrix(x))
    message("Predictions for ", loss, ": ")
    print(p)}

This prints:

Predictions for reg:squarederror: 
[1] 0 1
Predictions for reg:pseudohubererror: 
[1] 0.5 0.5

So under reg:squarederror, XGBoost can reproduce the input exactly with a single tree, as expected, but mysteriously it cannot under reg:pseudohubererror: both predictions stay at 0.5, which is XGBoost’s default base_score.

What’s going on? Are there underlying statistical reasons that XGBoost can’t handle absolute loss well? Or is there a bug?

Looks like a bug to me. Two possibilities:

  • Software bug.
  • Statistical issue: XGBoost optimizes a second-order Taylor approximation of the true objective rather than the objective itself (see the paper). The approximation can be framed as Majorization-Minimization, where we optimize a function F by optimizing a surrogate that is an upper bound on F; the pseudo-Huber loss may need a different constant in that bound than is usually used. The sketch after this list shows the second-order quantities involved.
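
To make the second possibility concrete, here are the per-observation gradient and Hessian of the pseudo-Huber loss in its usual slope-1 form, sqrt(1 + r^2) - 1 with r = prediction - label, evaluated at XGBoost's default base score of 0.5 for the two-point example above. This is only a sketch of the quantities the second-order approximation works with, not a diagnosis of the exact failure:

pseudohuber_grad = function(r) r / sqrt(1 + r^2)
pseudohuber_hess = function(r) (1 + r^2)^(-3/2)

# Residuals at the default base score of 0.5 for y = c(0, 1):
r = c(0.5, -0.5)
pseudohuber_grad(r)  # about 0.447 and -0.447
pseudohuber_hess(r)  # about 0.716 for both, versus exactly 1 under squared error
# A Newton-style leaf weight, -grad / hess, comes out to about -0.625 and 0.625,
# so the second-order step is no longer the residual itself, unlike squared
# error, where grad = r and hess = 1 make the step reproduce the labels exactly.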

I think a mathematically rigorous whitepaper on the XGBoost algorithm, with all the steps spelled out, is warranted. I sadly haven’t gotten around to writing one.