I have a toy binary classification with xgboost, and I would like to understand how the training error and validation error are calculated. I would have thought I could use the model to predict on the training set, and then the training error would be the mean of y*log(p) + (1-y)*log(1-p), while the validation error would be mean((y-p)^2). Is that wrong? In my toy example, none of these numbers match. The training output ends with train-error:0.022000 val-error:0.373000, but:

mean(y*log(pred.train) + (1-y)*log(1-pred.train))
[1] -1.258822

mean((pred.train-y)^2)
[1] 0.3731826

mean((pred.val-y)^2)
[1] 0.2698876

mean(y*log(pred.val) + (1-y)*log(1-pred.val))
[1] -0.8609891

The code is below. Many thanks.

library(xgboost)

N <- 1000

getData <- function(N) {
  x1 <- runif(N)
  x2 <- runif(N)
  x3 <- runif(N)
  z <- x1^2 + x2^2 + x3^2
  z <- (z - min(z)) / (max(z) - min(z))
  y <- rbinom(N, size = 1, prob = z)
  X <- as.matrix(cbind(y, x1, x2, x3))
  return(X)
}

X <- getData(N)
X.val <- getData(N)

xgtrain <- xgb.DMatrix(X[, -1], label = X[, 1])
xgval <- xgb.DMatrix(X.val[, -1], label = X.val[, 1])

watchlist <- list(train = xgtrain, val = xgval)

param <- list(max_depth = 2, eta = 0.3, nthread = 2, gamma = 0,
              min_child_weight = 1, objective = "binary:logistic",
              eval_metric = "error", subsample = 1, colsample_bytree = 1)

m <- xgb.train(param, xgtrain, nrounds = 1000, watchlist = watchlist, verbose = TRUE)

pred.train <- predict(m, X[, -1])
pred.val <- predict(m, X.val[, -1])

y <- X[, 1]
mean((pred.train - y)^2)
mean(y * log(pred.train) + (1 - y) * log(1 - pred.train))

y <- X.val[, 1]
mean((pred.val - y)^2)
mean(y * log(pred.val) + (1 - y) * log(1 - pred.val))
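For comparison, I also tried computing a plain misclassification rate, since the xgboost documentation describes the error metric as #(wrong cases)/#(all cases) with predictions above 0.5 counted as positive. A minimal standalone sketch of that calculation (toy p and y vectors, not my actual data):

```r
# Misclassification rate at threshold 0.5, computed by hand
# (what eval_metric = "error" is documented to report).
p <- c(0.9, 0.2, 0.6, 0.4)  # predicted probabilities
y <- c(1, 0, 0, 0)          # true labels
mean(as.numeric(p > 0.5) != y)  # 0.25: one of four predictions is on the wrong side
```

In my real example the analogous call would be mean(as.numeric(pred.train > 0.5) != X[, 1]), which I would expect to reproduce the reported train-error if that is indeed what xgboost prints.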