How is the offset handled in XGBoost's binary:logistic objective function?

arinadak · April 22, 2019, 4:52pm

I am working on a mortality prediction (binary outcome) problem with a "base mortality probability" as my offset in the XGBoost model.
I have used the gbtree booster and the binary:logistic objective function. In my data I have multiple observations/records with the same X values but different offset values.

As per my understanding (please correct me if I am wrong), XGBoost under the binary:logistic setup fits a model of the form log(p/(1-p)) = offset + F(x), where F(x) is optimized (for a specific loss function) using splits on the X values.

Thus, when the X values are exactly the same, I should be able to get F(x) by taking the predicted output (with the outputmargin = TRUE option) and subtracting the offset from it. However, with this approach I get different values of F(x) for the same X. I believe the way the offset is handled internally in XGBoost differs from my understanding. Can anyone explain the method/mathematical formulation XGBoost uses to handle the offset?

I am specifically interested in extracting the value of F(x) (as this is the additional information the model provides) by adjusting the model prediction for the offset values.
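To make the assumed relation log(p/(1-p)) = offset + F(x) concrete, here is a minimal base-R sketch. It does not call XGBoost; the value of `f_x` and the two offsets are hypothetical numbers chosen only for illustration. It shows what the subtraction step would recover if the relation held exactly.

```r
# Hypothetical values, for illustration only (not taken from any fitted model)
f_x    <- 0.3             # assumed contribution F(x) for one fixed x
offset <- c(-2.0, -1.0)   # two different offsets attached to the same x

margin <- offset + f_x    # log-odds, i.e. log(p / (1 - p))
p      <- plogis(margin)  # inverse logit: 1 / (1 + exp(-margin))

# Going back: qlogis(p) returns the log-odds, and subtracting the
# offset gives back the assumed F(x) for both rows
recovered <- qlogis(p) - offset
print(data.frame(offset, margin, p, recovered))
```

Under this assumption the two rows have different probabilities but the same recovered value, up to floating-point error.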

Here is the sample code:

library(xgboost)
x1 = runif(1000)
y1 = as.numeric(runif(1000)>.8)
y2 = as.numeric(runif(1000)>.8)
off1 = runif(1000)
off2 = runif(1000)

#stacking the data to have same X values
x= c(x1,x1)
y = c(y1,y2)
off = c(off1,off2)

length(unique(off)) # shows unique 2000 values
length(unique(x)) # shows unique 1000 values, i.e. each X is repeated once (as expected)

fulldata = cbind.data.frame(x,y,off)

train_dMtrix = xgb.DMatrix(data = as.matrix(x),
                           label = y,
                           base_margin = off)

params_list = list(booster = "gblinear", objective = "binary:logistic", eta = 0.05,
                   max_depth = 4, min_child_weight = 10, eval_metric = "logloss")

set.seed(100)

xgbmodel = xgb.train(params = params_list, data = train_dMtrix, nrounds=100)

Getting the prediction in link format

fulldata$Predicted_link = predict(xgbmodel, train_dMtrix, outputmargin = TRUE)

Assuming Predicted_link = offset + F(x), calculating F(x) for each values of X

fulldata$F_x = fulldata$Predicted_link - fulldata$off

As per my understanding, since F(x) is purely independent of the offset, the model's F_x values (not the predicted probabilities) should be exactly the same for the same values of x, irrespective of the corresponding offsets. Given that I have 1000 distinct X values, I am expecting 1000 distinct F_x values.

length(unique(fulldata$F_x))

This shows almost 2000 unique values, which is contrary to my expectation.

hcho3 · April 22, 2019, 5:03pm

This is not the case. The base_margin parameter specifies the base function

base(x_i) = base_margin(i)

from which boosting starts.

Recall that in gradient boosting, each new decision tree is fit using the residuals between the prediction of the current ensemble and the true label. The base function is used for fitting the first decision tree. (By default, the base function is base(x) = 0.5, and you can override it using the base_margin vector.)

If you assign different margins to data points, the residuals used in boosting will differ in the first boosting round, so your assumption that F(x) is independent of the offset does not hold.
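The effect on the first boosting round can be sketched in base R, assuming the standard log-loss derivatives for a logistic objective (gradient p - y and Hessian p(1 - p), with p the sigmoid of the current margin). The labels and margins below are hypothetical, and XGBoost's internal implementation may differ in detail.

```r
# Sigmoid maps a margin (log-odds) to a probability
sigmoid <- function(m) 1 / (1 + exp(-m))

# Hypothetical example: two rows with the same x and the same label,
# but different base margins (offsets)
y           <- c(1, 1)
base_margin <- c(-2.0, 0.5)

p    <- sigmoid(base_margin)  # starting prediction before any boosting
grad <- p - y                 # first-order derivative of log loss w.r.t. the margin
hess <- p * (1 - p)           # second-order derivative of log loss

print(data.frame(base_margin, p, grad, hess))
```

Although both rows share x and y, their gradient and Hessian values differ, so the offsets influence what the first round is fit to.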