predict_proba with ntree_limit in the sklearn Python API

Hi,

I am using the sklearn Python wrapper from xgboost 0.72.1 to train multiple boosted decision trees (BDTs) for a binary classification, each with early stopping, so that best_ntree_limit differs between them.
When I use predict_proba on some data, I see that the ranges of the probabilities differ a lot between the models, so I am pretty sure the output does not correspond to a probability.
Here are example plots for data from category 0 and data from category 1, both showing the probability of belonging to category 1 as predicted by two different BDTs.

Both BDTs were trained with early_stopping_rounds=50; for one the best iteration is 212, for the other 71.

Is it possible that the normalisation of the output breaks when not all trained trees are used to calculate it?

I saw that there was once an issue in the git repo (see https://github.com/dmlc/xgboost/issues/1897), where the solution (or error) was the keyword argument output_margin that you could pass to predict_proba. However, this was removed in a more recent release, so I don't know how to deal with this problem.

If the function is named predict_proba, I would naively expect the output to be a probability, so maybe it's a bug?
Or am I making a mistake?

Many thanks in advance!

Can you clarify this? I see the histogram and all values are between 0 and 1. What is your reasoning for suspecting that the output is not a probability?

Hi,

thanks a lot for your response!

I think you might be right concerning the probability question.
Sorry for explaining my problem so badly.

I am training the BDT on the same data that I am analysing. To avoid bias, I employ cross-validation with 10 folds, so I end up with 10 BDTs whose probability predictions cover different ranges. In the next step I would like to set a threshold above which I consider data to be category 1 rather than category 0. Because the 10 BDTs cover different intervals, I would need 10 different thresholds, which is a complication I would like to avoid. It would be much more convenient to have a common threshold for all 10 folds.

It would be helpful to keep in mind how the probability output is computed. For each prediction, XGBoost first computes the sum of leaf outputs from all trees used. This sum is referred to as the margin score:

[margin score] =   [leaf output from tree 0]
                 + [leaf output from tree 1]
                 + ...
                 + [leaf output from tree (ntree_limit - 1)]

Then the probability output is given by the sigmoid function:

[probability output] = sigmoid([margin score])
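
To make the mapping concrete, here is a minimal NumPy sketch (for illustration only, not part of the XGBoost API) of how margin scores translate to probabilities and back:

import numpy as np

def sigmoid(margin):
    # probability = sigmoid(margin score)
    return 1.0 / (1.0 + np.exp(-margin))

def logit(prob):
    # inverse mapping: margin score = log(p / (1 - p))
    return np.log(prob / (1.0 - prob))

margins = np.array([-2.0, 0.0, 3.5])
probs = sigmoid(margins)
print(probs)         # approximately [0.1192 0.5 0.9707]
print(logit(probs))  # recovers [-2. 0. 3.5]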

So it may be helpful to plot the distribution of margin scores from the 10 BDTs, as sketched below. (Use predict() with output_margin=True.) My suspicion is that each BDT is somehow overfit to its specific partition, so that the 10 BDTs show divergent behaviours. You may want to adjust the training parameters to reduce overfitting.
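
A minimal sketch of such a comparison might look like this (clfs is a hypothetical list holding your 10 fitted classifiers, X the data to score):

import matplotlib.pyplot as plt

# clfs: hypothetical list of the 10 fitted XGBClassifier objects
# X: feature matrix to score
for i, clf in enumerate(clfs):
    margins = clf.predict(X, output_margin=True)
    plt.hist(margins, bins=50, histtype='step', label='fold %d' % i)
plt.xlabel('margin score')
plt.legend()
plt.show()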

predict() returns integer values (in my case 0 or 1), regardless of whether I pass output_margin=False or output_margin=True.

Concerning the overfitting, I checked the learning curves, and to me they don't look too bad:

So if I understood correctly, predict() with output_margin=True should return floating-point numbers representing the margin scores. However, for me only integers are returned. Here is a small piece of code to reproduce this behaviour:

from xgboost.sklearn import XGBClassifier

import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Build a binary problem from the iris data by dropping class 2
iris = datasets.load_iris()
data = pd.DataFrame(
    data=np.c_[iris['data'], iris['target']],
    columns=iris['feature_names'] + ['target']
).query("target != 2")

X_train, X_test, y_train, y_test = train_test_split(
    data[iris['feature_names']],
    data['target'],
    test_size=0.33,
    random_state=42
)

clf = XGBClassifier()
clf.fit(X_train.values, y_train.values)

# The second call should return floating-point margin scores,
# but both return the same integer class labels
pred = clf.predict(X_test.values)
pred_margin = clf.predict(X_test.values, output_margin=True)

print(pred)
print(pred_margin)

For me the output is

[1. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 1. 0. 1. 1. 0. 0. 1. 1. 0. 0.
 1. 0. 0. 1. 0. 1. 0. 0. 1.]
[1. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 1. 0. 1. 1. 0. 0. 1. 1. 0. 0.
 1. 0. 0. 1. 0. 1. 0. 0. 1.]
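
For comparison, here is a minimal sketch that bypasses the sklearn wrapper and queries the underlying Booster directly (assuming get_booster() is available in this version), which should return floating-point margin scores:

import xgboost as xgb

# Bypass the sklearn wrapper: ask the Booster itself for margins
booster = clf.get_booster()
margins = booster.predict(xgb.DMatrix(X_test.values), output_margin=True)
print(margins)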

I find that some trees may not be used. Is that right?