Discrepancy between predict_proba and predict (using output_margin=True) for multi:softprob

It's my understanding that for an XGBoost classifier with objective='multi:softprob', model.predict(data, output_margin=True) returns the class probabilities for each row in data. It's also my understanding that model.predict_proba returns the class probabilities.

This understanding is based on the code here:


However, when I try the following, the resulting plot is not at all 1:1.

import matplotlib.pyplot as plt
import xgboost as xgb

model = xgb.XGBClassifier(objective='multi:softprob')
model.fit(X_train, y_train)

# compare the first class's margin score against its predicted probability
plt.plot(
    [x[0] for x in model.predict(X_all, output_margin=True)],
    [y[0] for y in model.predict_proba(X_all)],
    '.',
)

[figure "discrepancy": scatter plot of margin scores vs. predicted probabilities, clearly not 1:1]

What causes this discrepancy? Thanks!

Not true. The margin scores from model.predict(data, output_margin=True) are raw, untransformed scores; they need to be passed through the softmax function to become class probabilities. Note that the X axis in your plot ranges from -15 to 5, so the margin scores cannot be proper probabilities.
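To illustrate, here is a small sketch (using made-up margin values, not output from an actual model) showing that a row-wise softmax turns margin scores into values that behave like the probabilities predict_proba returns:

```python
import numpy as np
from scipy.special import softmax

# Hypothetical margin scores for 2 rows and 3 classes, i.e. the kind of
# array model.predict(data, output_margin=True) would return.
margins = np.array([
    [-15.0,  2.0, 5.0],
    [  0.5, -1.0, 1.5],
])

# Apply softmax along the class axis to recover probabilities.
probs = softmax(margins, axis=1)

# Each row now sums to 1, as class probabilities must.
print(probs.sum(axis=1))  # → [1. 1.]
```

Equivalently, probs equals np.exp(margins) normalized by each row's sum, which is exactly the transformation multi:softprob applies internally before predict_proba returns its result.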