 Calculating probabilities with XGBoost - binary:logistic vs custom logloss give different results

#1

I’m getting started with XGBoost in R, and am trying to match up the predictions from the binary:logistic model with what’s generated by using a custom log loss function. I’d expect the following two calls to predict to generate the same results:

require(xgboost)

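loglossobj <- function(preds, dtrain) {
    labels <- getinfo(dtrain, "label")
    # preds arrive as raw scores; map them to probabilities
    preds <- 1/(1 + exp(-preds))
    grad <- preds - labels
    hess <- preds * (1 - preds)
    # xgboost expects the gradient and hessian back as a list
    return(list(grad = grad, hess = hess))
}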

data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train<-agaricus.train
test<-agaricus.test

model <- xgboost(data = train$data, label = train$label, nrounds = 2, objective = "binary:logistic")
preds <- predict(model, test$data)

model <- xgboost(data = train$data, label = train$label, nrounds = 2, objective = loglossobj, eval_metric = "error")
preds <- predict(model, test$data)
x <- 1 / (1 + exp(-preds))

The output of a model trained with a custom log loss function does not have the logistic transformation 1/(1+exp(-x)) applied. However, even after I apply it myself, the resulting probabilities differ between the two calls to predict:

 0.2582498 0.7433221 0.2582498 0.2582498 0.2576509 0.2750908
versus

 0.3076240 0.7995583 0.3076240 0.3076240 0.3079328 0.3231709

Any suggestions?

#2

(Cross-posted from Stack Overflow)

It turns out this behaviour is due to initial conditions. xgboost implicitly assumes base_score=0.5 when calling binary:logistic or binary:logitraw, and for these objectives it is interpreted as a probability, which corresponds to an initial raw score of 0. With a custom loss function, base_score is used directly as the initial raw score, so it must be set to 0.0 to replicate their output. Here, base_score is the initial prediction score of all instances.

To illustrate, the following R code generates the same predictions in all three cases:

require(xgboost)

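loglossobj <- function(preds, dtrain) {
    labels <- getinfo(dtrain, "label")
    # preds arrive as raw scores; map them to probabilities
    preds <- 1/(1 + exp(-preds))
    grad <- preds - labels
    hess <- preds * (1 - preds)
    # xgboost expects the gradient and hessian back as a list
    return(list(grad = grad, hess = hess))
}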

data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train<-agaricus.train
test<-agaricus.test

model <- xgboost(data = train$data, label = train$label, objective = "binary:logistic", nrounds = 10, eta = 0.1, verbose = 0)
preds <- predict(model, test$data)
print(head(preds))

model <- xgboost(data = train$data, label = train$label, objective = "binary:logitraw", nrounds = 10, eta = 0.1, verbose = 0)
preds <- predict(model, test$data)
x <- 1 / (1 + exp(-preds))
print(head(x))

model <- xgboost(data = train$data, label = train$label, objective = loglossobj, base_score = 0.0, nrounds = 10, eta = 0.1, verbose = 0)
preds <- predict(model, test$data)
x <- 1 / (1 + exp(-preds))
print(head(x))

which outputs

 0.1814032 0.8204284 0.1814032 0.1814032 0.1837782 0.1952717
 0.1814032 0.8204284 0.1814032 0.1814032 0.1837782 0.1952717
 0.1814032 0.8204284 0.1814032 0.1814032 0.1837782 0.1952717
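
The base_score point above can be checked by hand without training anything. The following is a minimal base R sketch, not xgboost's own code. It assumes that the built-in logistic objectives treat base_score as a probability and convert it to a raw score with the logit, while a custom objective starts from base_score as given. The function and variable names are made up for the illustration.

```r
# Logistic (sigmoid) function and its inverse, the logit
sigmoid <- function(x) 1 / (1 + exp(-x))
logit <- function(p) log(p / (1 - p))

# Assumed built-in behaviour: base_score = 0.5 is a probability,
# so boosting starts from a raw score of logit(0.5) = 0
builtin_start <- logit(0.5)

# Assumed custom-objective behaviour: base_score is used as the raw score
custom_start_default <- 0.5   # default base_score left unchanged
custom_start_fixed <- 0.0     # base_score = 0.0, as in the code above

# Initial probabilities implied by each starting point
print(sigmoid(builtin_start))         # 0.5
print(sigmoid(custom_start_default))  # about 0.622
print(sigmoid(custom_start_fixed))    # 0.5
```

Under these assumptions, a custom objective left at the default starts every instance at a probability of about 0.622 instead of 0.5, which would be enough to shift the predictions after only a few rounds.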

#3

I’m having a similar problem which is not readily fixed by changing the base score of the custom loss function model to zero. I define the same loss function, generate some dummy data and train on it. The results from the built-in objectives “binary:logistic” and “reg:logistic” are materially different from those of the custom objective, no matter how I set the base score. Did I misunderstand something?

The script below reproduces the problem. I am aware RMSE is not really the right metric, but it shows the differences in behaviour very neatly. Differences are also noticeable in other metrics (e.g. AUC).

# Attempt to reproduce log-loss objective

library(data.table)
library(xgboost)

# custom objective function
logloss <- function(preds, dtrain){

    # Get labels
    labels <- getinfo(dtrain, "label")

    # Apply logistic transform to predictions
    preds <- 1/(1 + exp(-preds))

    # Find gradient and hessian
    grad <- (preds - labels)
    hess <- preds * (1-preds)

    # Return both as the list xgboost expects
    return(list(grad = grad, hess = hess))
}

# Generate test data
generate_test_data <- function(n_rows = 1e5, feature_count = 5){

    # Make targets
    test_data <- data.table(
        target = sign(runif(n = n_rows, min=-1, max=1))
    )

    # Add feature columns. These are normally distributed and shifted by the target
    # in order to create a noisy signal
    for(feature in 1:feature_count){

        # Randomly choose the parameters of the noise
        mu <- runif(1, min=-1, max=1)
        sdev <- runif(1, min=5, max=10)

        # Create noisy signal
        test_data[, paste0("feature_", feature) := rnorm(
            n=n_rows, mean = mu, sd = sdev)*target + target]
    }

    # Make vector of feature names
    feature_names <- paste0("feature_", 1:feature_count)

    # The step that created split_data is missing from the script as posted;
    # assume here that all rows are used for training
    split_data <- list(train = test_data)

    # Make training matrix and labels
    split_data[["train_trix"]] <- as.matrix(split_data$train[, feature_names, with=FALSE])
    split_data[["train_labels"]] <- as.logical(split_data$train$target + 1)

    return(split_data)
}

# Build the tree
build_model <- function(split_data, objective, params = list()){

    # Make evaluation matrix
    train_dtrix <-
        xgb.DMatrix(
            data = split_data$train_trix, label = split_data$train_labels)

    # Train the model
    model <- xgb.train(
        data = train_dtrix,
        watchlist = list(
            train = train_dtrix),
        nrounds = 5,
        objective = objective,
        eval_metric = "rmse",
        params = params
    )

    return(model)
}

split_data <- generate_test_data()
cat("\nUsing built-in binary:logistic objective.\n")
test_1 <- build_model(split_data, "binary:logistic")
cat("\nUsing built-in reg:logistic objective.\n")
test_2 <- build_model(split_data, "reg:logistic")
cat("\n\nUsing custom objective\n")
test_3 <- build_model(split_data, logloss, params = list(base_score = 0.0))

This produces the following output:

Using built-in binary:logistic objective.
	train-rmse:0.476833
	train-rmse:0.463433
	train-rmse:0.455049
	train-rmse:0.449588
	train-rmse:0.446047

Using built-in reg:logistic objective.
	train-rmse:0.476833
	train-rmse:0.463433
	train-rmse:0.455049
	train-rmse:0.449588
	train-rmse:0.446047

Using custom objective
	train-rmse:0.481920
	train-rmse:0.554571
	train-rmse:0.641242
	train-rmse:0.719437
	train-rmse:0.784012

I would have expected the custom objective to produce output very close to that of reg:logistic and binary:logistic.

#4

Seeing that your RMSE increases with each iteration, my guess would be that there’s something wrong with your gradient calculation.

As you can see in the XGB code, there are safeguards against exploding gradients.

Try printing out your gradient and see what you observe.
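
One way to act on this is to evaluate the objective outside of xgboost and compare its gradient and hessian with finite differences of the log loss. The sketch below uses base R only. Here grad_hess is a stand-in for the body of the custom objective above (it takes the labels directly instead of a DMatrix), logloss_value is a helper written for this check, and the sample raw scores and labels are made up.

```r
# Log loss of each observation as a function of the raw score (margin)
logloss_value <- function(margin, label) {
    p <- 1 / (1 + exp(-margin))
    -(label * log(p) + (1 - label) * log(1 - p))
}

# Same arithmetic as the custom objective, with labels passed directly
grad_hess <- function(margin, label) {
    p <- 1 / (1 + exp(-margin))
    list(grad = p - label, hess = p * (1 - p))
}

margins <- c(-2, -0.5, 0, 0.5, 2)
labels <- c(0, 1, 0, 1, 1)
eps <- 1e-4

analytic <- grad_hess(margins, labels)

# Central finite differences of the loss with respect to the margin
numeric_grad <- (logloss_value(margins + eps, labels) -
                 logloss_value(margins - eps, labels)) / (2 * eps)
numeric_hess <- (logloss_value(margins + eps, labels) -
                 2 * logloss_value(margins, labels) +
                 logloss_value(margins - eps, labels)) / eps^2

# Print analytic and numeric values side by side
print(cbind(margin = margins, label = labels,
            grad = analytic$grad, numeric_grad = numeric_grad,
            hess = analytic$hess, numeric_hess = numeric_hess))
```

If the analytic and numeric columns agree, the gradient itself is probably fine, and it may be worth checking what the evaluation metric is computed on. As far as I understand (an assumption to verify against the xgboost documentation), with a custom objective the built-in metrics see the raw scores rather than probabilities, which on its own could make an RMSE against 0/1 labels grow as the scores move away from zero.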