I have a toy binary classification problem in xgboost, and I would like to understand how the training error and validation error are calculated. I had assumed I could use the model to predict on the training set, and that the training error would then be the mean of y*log(p) + (1-y)*log(1-p), while the validation error would be mean((y-p)^2). Is that wrong? In my toy example, none of these numbers match. The training output ends with train-error:0.022000 val-error:0.373000, but:
mean(y*log(pred.train) + (1-y)*log(1-pred.train))
[1] -1.258822
mean((pred.train-y)^2)
[1] 0.3731826
mean((pred.val-y)^2)
[1] 0.2698876
mean(y*log(pred.val) + (1-y)*log(1-pred.val))
[1] -0.8609891
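For reference, the xgboost documentation describes the "error" metric as a binary classification error rate with a 0.5 threshold, so (if I read that correctly) the reported numbers would correspond to something like the following misclassification rates rather than either formula above:

# Assumption based on the docs, not verified here: "error" is the
# fraction of predictions on the wrong side of a 0.5 cutoff
mean((pred.train > 0.5) != X[, 1])
mean((pred.val > 0.5) != X.val[, 1])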
The code is below. Many thanks.
library(xgboost)

N <- 1000

# Simulate three uniform covariates and a binary outcome whose success
# probability is the squared norm of the covariates, rescaled to [0, 1]
getData <- function(N) {
  x1 <- runif(N)
  x2 <- runif(N)
  x3 <- runif(N)
  z <- x1^2 + x2^2 + x3^2
  z <- (z - min(z)) / (max(z) - min(z))
  y <- rbinom(N, size = 1, prob = z)
  X <- as.matrix(cbind(y, x1, x2, x3))
  return(X)
}
X <- getData(N)      # training data
X.val <- getData(N)  # validation data
xgtrain <- xgb.DMatrix(X[, -1], label = X[, 1])
xgval <- xgb.DMatrix(X.val[, -1], label = X.val[, 1])
watchlist <- list(train = xgtrain, val = xgval)
param <- list(max_depth = 2, eta = 0.3, nthread = 2, gamma = 0, min_child_weight = 1,
              objective = "binary:logistic", eval_metric = "error",
              subsample = 1, colsample_bytree = 1)
m <- xgb.train(param, xgtrain, nrounds = 1000, watchlist = watchlist, verbose = TRUE)
pred.train <- predict(m, X[, -1])
pred.val <- predict(m, X.val[, -1])
y <- X[, 1]  # training labels
mean((pred.train - y)^2)                           # mean squared error, train
mean(y*log(pred.train) + (1-y)*log(1-pred.train))  # mean log-likelihood, train
y <- X.val[, 1]  # validation labels
mean((pred.val - y)^2)                             # mean squared error, validation
mean(y*log(pred.val) + (1-y)*log(1-pred.val))      # mean log-likelihood, validation
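As a further check (my own sketch, not part of the run above): if the eval metric is swapped to "logloss", I would expect the printed train-logloss and val-logloss to equal the negative of the mean log-likelihoods computed here.

# Unverified sketch: retrain with "logloss" as the eval metric; the
# reported values should match -mean(y*log(p) + (1-y)*log(1-p)) on each
# dataset at the final round
param2 <- modifyList(param, list(eval_metric = "logloss"))
m2 <- xgb.train(param2, xgtrain, nrounds = 1000, watchlist = watchlist, verbose = TRUE)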