 # Predicted probabilities fail tests

For a binary classification problem, when I minimize the log-loss on my training dataset, I expect that:

1. I should not be able to rescale the predicted probabilities and lower the log-loss of any model produced. Rescaling does not change the complexity of the model, so the fitting procedure should already have minimized the log-loss for me.

2. The expected number of positive labels in my training dataset (the sum of the predicted probabilities) should equal the actual number.
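Both expectations can be written as a small numerical check. This is a minimal sketch under stated assumptions: the helper name `calibration_report` and the toy labels are hypothetical and not part of the experiment. It uses a constant prediction equal to the base rate, which minimizes the log-loss among constant predictions, so both properties should hold for it.

```python
import numpy as np

def log_loss(y, p):
    """Mean binary log-loss of probabilities p against 0/1 labels y."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def calibration_report(y, p, scale=0.95):
    """Hypothetical helper: the quantities behind expectations 1 and 2."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return {
        "pred_count": float(p.sum()),          # expected number of positives
        "actual_count": float(y.sum()),        # observed number of positives
        "logloss": log_loss(y, p),
        "logloss_rescaled": log_loss(y, scale * p),
    }

# Toy data (an assumption): 200 positives out of 1000, predicted at the base rate.
y = np.array([1] * 200 + [0] * 800)
p = np.full(y.shape, y.mean())
report = calibration_report(y, p)
print(report)
```

For this toy case the predicted and actual counts agree and scaling the probabilities by 0.95 does not lower the log-loss, which is the behavior I expected from the fitted model as well.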

I am finding that this is not true with XGBoost in the following case. I create a test dataset where the probability of the class being 1 is 75% if x <= 0.01 and 1% otherwise. If I fit a model with 1 tree and 1 split, I get the results below:

count of trees 0
Pred tot counts 1294.1549072 Actual tot counts 173.0000000 logloss 0.1552956 logloss rescaled prob 0.1490095
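One possible reading of these numbers, offered as a sketch and not a confirmed diagnosis: with `eta = 1` and `lambda = 0`, a single boosting round takes one second-order (Newton) step per leaf from the initial prediction instead of solving for the leaf's optimal value. The code below assumes an initial probability of 0.5 (XGBoost's `base_score`, whose default may differ between releases), a split exactly at x = 0.01, the nominal rates of 75% and 1%, and leaf sizes of 100 and 9900. The function name `newton_step_prob` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_step_prob(rate, base_prob=0.5, eta=1.0, reg_lambda=0.0, n=1.0):
    """Probability in a leaf after ONE Newton step on the logistic loss.

    rate is the fraction of positives in the leaf and n the leaf size.
    The per-row gradient is (p - y) and the per-row Hessian is p * (1 - p),
    both evaluated at the starting probability base_prob (an assumption).
    """
    grad = n * (base_prob - rate)
    hess = n * base_prob * (1.0 - base_prob)
    weight = -grad / (hess + reg_lambda)
    base_margin = np.log(base_prob / (1.0 - base_prob))
    return float(sigmoid(base_margin + eta * weight))

p_left = newton_step_prob(0.75, n=100)    # leaf with x <= 0.01
p_right = newton_step_prob(0.01, n=9900)  # leaf with x > 0.01
expected_count = 100 * p_left + 9900 * p_right
print(p_left, p_right, expected_count)
```

Under these assumptions the leaf with a 1% positive rate is predicted at roughly 12%, which puts the summed probabilities close to 1,295, while the nominal number of positives is 100 × 0.75 + 9900 × 0.01 = 174. That is the same order as the totals reported above, so a single Newton step that has not yet converged may account for the gap.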

Furthermore, the results reported for the first tree appear to depend on the number of trees.
The code and initial output are below:

``````
import numpy as np
import xgboost as xgb

np.random.seed(42)
xtrain = np.random.rand(10000)
noise = np.random.rand(10000)
"""
y_train will be 1 with probability of 75% if xtrain <= .01
and 1% if xtrain > .01
"""
y_train = (xtrain <= .01).astype(int) * (noise < .75).astype(int) + \
    (xtrain > .01).astype(int) * (noise < .01).astype(int)
dtrain = xgb.DMatrix(xtrain.reshape((-1, 1)), label=y_train)
params = {
    'objective': 'binary:logistic',
    'n_estimators': 1,  # reported as unused by xgb.train (see the warning in the output)
    'max_depth': 1,
    'verbose': 1,       # reported as unused by xgb.train (see the warning in the output)
    'eta': 1.,
    'lambda': 0.
}
"""
Train models of 1 to 10 trees with learning rate of 1.0
We would expect for each model and each level of trees that
E( number of y=1 from prediction ) = number of y=1 in data
"""
for n_rounds in range(1, 11):
    print("doing nrounds ", n_rounds)
    model = xgb.train(params, dtrain, verbose_eval=True,
                      evals=[(dtrain, 'train')],
                      num_boost_round=n_rounds)
    for ix in range(model.best_ntree_limit):
        y_train_pred_prob = model.predict(dtrain, ntree_limit=ix)
        # Log-loss of the predicted probabilities as returned by the model
        logloss = -(y_train * np.log(y_train_pred_prob) +
                    (1 - y_train) * np.log(1 - y_train_pred_prob)).sum() / \
            len(y_train)
        # Log-loss after scaling every predicted probability by 0.95
        logloss2 = -(y_train * np.log(.95 * y_train_pred_prob) +
                     (1 - y_train) * np.log(1 - .95 * y_train_pred_prob)).sum() / \
            len(y_train)
        print("# trees %5d" % ix,
              "Pred tot counts   %12.7f" % y_train_pred_prob.sum(),
              "Actual tot counts %12.7f" % y_train.sum(),
              "logloss %12.7f" % logloss,
              "logloss rescaled prob %12.7f" % logloss2)
xgb.__version__
``````

OUTPUT:

``````
doing nrounds  1
[20:28:45] WARNING: ../src/learner.cc:767:
Parameters: { "n_estimators", "verbose" } are not used.

	train-logloss:0.15458
# trees     0 Pred tot counts   1296.6328125 Actual tot counts  176.0000000 logloss    0.1545767 logloss rescaled prob    0.1483100
doing nrounds  2
[20:28:45] WARNING: ../src/learner.cc:767:
Parameters: { "n_estimators", "verbose" } are not used.

	train-logloss:0.15458
	train-logloss:0.08212
# trees     0 Pred tot counts    542.7750244 Actual tot counts  176.0000000 logloss    0.0821235 logloss rescaled prob    0.0802639
# trees     1 Pred tot counts   1296.6328125 Actual tot counts  176.0000000 logloss    0.1545767 logloss rescaled prob    0.1483100
doing nrounds  3
[20:28:45] WARNING: ../src/learner.cc:767:
Parameters: { "n_estimators", "verbose" } are not used.

	train-logloss:0.15458
	train-logloss:0.08212
	train-logloss:0.06375
# trees     0 Pred tot counts    286.4510193 Actual tot counts  176.0000000 logloss    0.0637475 logloss rescaled prob    0.0632427
# trees     1 Pred tot counts   1296.6328125 Actual tot counts  176.0000000 logloss    0.1545767 logloss rescaled prob    0.1483100
# trees     2 Pred tot counts    542.7750244 Actual tot counts  176.0000000 logloss    0.0821235 logloss rescaled prob    0.0802639
doing nrounds  4
[20:28:45] WARNING: ../src/learner.cc:767:
Parameters: { "n_estimators", "verbose" } are not used.

	train-logloss:0.15458
	train-logloss:0.08212
	train-logloss:0.06375
	train-logloss:0.06025
# trees     0 Pred tot counts    198.4432678 Actual tot counts  176.0000000 logloss    0.0602455 logloss rescaled prob    0.0602680
# trees     1 Pred tot counts   1296.6328125 Actual tot counts  176.0000000 logloss    0.1545767 logloss rescaled prob    0.1483100
# trees     2 Pred tot counts    542.7750244 Actual tot counts  176.0000000 logloss    0.0821235 logloss rescaled prob    0.0802639
# trees     3 Pred tot counts    286.4510193 Actual tot counts  176.0000000 logloss    0.0637475 logloss rescaled prob    0.0632427
``````
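
When reading the rows above, it may help to recall that the `ntree_limit` argument of `predict` has been documented as treating 0 as "use all trees". Assuming that convention applies to the release used here, the row labelled `# trees 0` would be the full model and not an empty one. The sketch below relabels the rows of the `nrounds 3` run on that assumption; the function name `effective_trees` is hypothetical, and the log-loss values are copied from the output.

```python
def effective_trees(ntree_limit, total_trees):
    """Trees actually used, assuming ntree_limit == 0 means 'all trees'."""
    return total_trees if ntree_limit == 0 else min(ntree_limit, total_trees)

# Rows of the "nrounds 3" run: ntree_limit -> reported log-loss.
rows = {0: 0.0637475, 1: 0.1545767, 2: 0.0821235}
# train-logloss printed by xgb.train after rounds 1, 2 and 3.
train_logloss = [0.15458, 0.08212, 0.06375]

relabelled = {}
for limit, loss in rows.items():
    k = effective_trees(limit, total_trees=3)
    relabelled[k] = loss
    print(f"ntree_limit={limit} -> {k} tree(s), logloss {loss:.7f}, "
          f"training log {train_logloss[k - 1]:.5f}")
```

Under this reading each relabelled row matches the `train-logloss` line for the same number of rounds, and the `# trees 1` row (the first tree alone) is identical in every run at 1296.6328125, so the first tree itself would not depend on the number of rounds.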