Hi,
After training an R xgboost model as described below, I would like to compute the predicted probability
by hand from the tree table returned by xgb.model.dt.tree().
For a test row, I thought the correct calculation would use the leaf values from all 4 trees, shown here:
Tree Node ID Feature Split Yes No Missing Quality Cover
1: 0 0 0-0 V8 0.012865 0-1 0-2 0-1 20.127027500 61.50000
2: 0 1 0-1 Leaf NA <NA> <NA> <NA> 0.009677419 30.00000
3: 0 2 0-2 Leaf NA <NA> <NA> <NA> 0.350769252 31.50000
4: 1 0 1-0 V15 0.625835 1-1 1-2 1-1 19.353305800 60.54989
5: 1 1 1-1 Leaf NA <NA> <NA> <NA> -0.034775745 30.22977
6: 1 2 1-2 Leaf NA <NA> <NA> <NA> 0.300693214 30.32012
7: 2 0 2-0 Leaf NA <NA> <NA> <NA> 0.098971337 59.27218
8: 3 0 3-0 Leaf NA <NA> <NA> <NA> 0.071213789 58.25556
I expected that the probability for the following input:
# V8 V15
# -0.93597 -0.51685
would be the result of the following calculation:
logfun <- function(x){1/(1 + exp(-x))}  # standard logistic (sigmoid) function
logfun(sum(0.009677419, -0.034775745, 0.098971337, 0.071213789))
# 0.5362082
where each of the numbers is the leaf value (the Quality column) of one of the 4 trees.
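To make the calculation explicit, here is a minimal sketch of the hand computation. It assumes the objective is binary:logistic with the default base_score of 0.5 (whose logit is 0), so the margin is simply the sum of the leaf values:

```r
# Leaf values (Quality column) taken from the tree table above
leaf_values <- c(0.009677419, -0.034775745, 0.098971337, 0.071213789)

# Logistic (sigmoid) transform maps the summed margin to a probability
sigmoid <- function(x) 1 / (1 + exp(-x))
sigmoid(sum(leaf_values))
# 0.5362082
```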
However, the predict() function produces the following:
predict(xgbModel,test_row_1)
#0.5184599
The leaf indices returned by predict(xgbModel, test_row_1, predleaf = TRUE)
are c(1, 1, 0), which suggests that the last tree is not considered by the predict function.
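For reference, this is the kind of check I am trying to do, written as a self-contained sketch on the agaricus data that ships with xgboost (my actual model and test row differ): map the leaf indices from predleaf back to the Quality column of the tree table and apply the sigmoid.

```r
library(xgboost)
data(agaricus.train, package = "xgboost")

# Small model, same objective as in the question
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
               nrounds = 4, objective = "binary:logistic", verbose = 0)

dt <- xgb.model.dt.tree(model = bst)
row1 <- agaricus.train$data[1, , drop = FALSE]

# One leaf index per tree for this row
idx <- predict(bst, row1, predleaf = TRUE)

# Build IDs like "0-1", "1-1", ... and look up their leaf values
leaf_ids <- paste(seq_along(idx[1, ]) - 1, idx[1, ], sep = "-")
leaves <- dt[ID %in% leaf_ids & Feature == "Leaf", Quality]

1 / (1 + exp(-sum(leaves)))  # compare against predict(bst, row1)
```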
Is my approach to hand-calculation correct? Should all 4 trees be considered when calculating the probability?
Thank you.