I have a toy binary classification problem in xgboost, and I would like to understand how the training error and validation error are calculated. I had assumed I could use the model to predict on the training set, and that the training error would then be the mean of y*log(p) + (1-y)*log(1-p), while the validation error would be mean((y-p)^2). Is that wrong? In my toy example, none of these numbers match. The training output ends with train-error:0.022000 val-error:0.373000, but:
mean(y*log(pred.train) + (1-y)*log(1-pred.train))
[1] -1.258822
mean((pred.train-y)^2)
[1] 0.3731826
mean((pred.val-y)^2)
[1] 0.2698876
mean(y*log(pred.val) + (1-y)*log(1-pred.val))
[1] -0.8609891
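For reference, the xgboost documentation describes the "error" metric as a binary classification error rate with a 0.5 threshold, so (if I read that correctly) the reported numbers would correspond to something like the following misclassification rates rather than either formula above:

# Assumption based on the docs, not verified here: "error" is the
# fraction of predictions on the wrong side of a 0.5 cutoff
mean((pred.train > 0.5) != X[, 1])
mean((pred.val > 0.5) != X.val[, 1])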
The code is below. Many thanks.
library(xgboost)

N <- 1000

# Simulate three uniform covariates and a binary outcome whose success
# probability is the squared norm of the covariates, rescaled to [0, 1]
getData <- function(N) {
  x1 <- runif(N)
  x2 <- runif(N)
  x3 <- runif(N)
  z <- x1^2 + x2^2 + x3^2
  z <- (z - min(z)) / (max(z) - min(z))
  y <- rbinom(N, size = 1, prob = z)
  X <- as.matrix(cbind(y, x1, x2, x3))
  return(X)
}
X <- getData(N)      # training data
X.val <- getData(N)  # validation data
xgtrain <- xgb.DMatrix(X[, -1], label = X[, 1])
xgval <- xgb.DMatrix(X.val[, -1], label = X.val[, 1])
watchlist <- list(train = xgtrain, val = xgval)
param <- list(max_depth = 2, eta = 0.3, nthread = 2, gamma = 0, min_child_weight = 1,
              objective = "binary:logistic", eval_metric = "error",
              subsample = 1, colsample_bytree = 1)
m <- xgb.train(param, xgtrain, nrounds = 1000, watchlist = watchlist, verbose = TRUE)
pred.train <- predict(m, X[, -1])
pred.val <- predict(m, X.val[, -1])
y <- X[, 1]  # training labels
mean((pred.train - y)^2)                           # mean squared error, train
mean(y*log(pred.train) + (1-y)*log(1-pred.train))  # mean log-likelihood, train
y <- X.val[, 1]  # validation labels
mean((pred.val - y)^2)                             # mean squared error, validation
mean(y*log(pred.val) + (1-y)*log(1-pred.val))      # mean log-likelihood, validation
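As a further check (my own sketch, not part of the run above): if the eval metric is swapped to "logloss", I would expect the printed train-logloss and val-logloss to equal the negative of the mean log-likelihoods computed here.

# Unverified sketch: retrain with "logloss" as the eval metric; the
# reported values should match -mean(y*log(p) + (1-y)*log(1-p)) on each
# dataset at the final round
param2 <- modifyList(param, list(eval_metric = "logloss"))
m2 <- xgb.train(param2, xgtrain, nrounds = 1000, watchlist = watchlist, verbose = TRUE)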