Why is XGBoost model evaluation different from actual prediction?

I am training an XGBoost model for a binary classification problem that predicts loan defaults: 1 is the event of a default and 0 is the normal case. I built a random grid search CV to tune the model's hyperparameters, using xgboost.cv() with the metrics parameter set to 'auc'. The outcome of this tuning was very successful (the AUC was about 89% on average).

`import random

import pandas as pd
import xgboost


def objective(hyperparameters, iteration):
    """Objective function for grid and random search. Returns
    the cross-validation score for a set of hyperparameters."""

    # Perform k-fold cross-validation (train_boost is the training DMatrix)
    cv_results = pd.DataFrame(xgboost.cv(hyperparameters, train_boost, num_boost_round = 300, nfold = 5,
                                         early_stopping_rounds = 10, metrics = 'auc', seed = 50))
    print(cv_results)

    # Result to return: the final mean test AUC
    score = cv_results['test-auc-mean'].iloc[-1]

    return score


def random_search(param_grid, max_evals = 5):
    """Random search for hyperparameter optimization."""

    # DataFrame to hold the results
    results = pd.DataFrame(columns = ['score', 'params', 'iteration'],
                           index = list(range(max_evals)))

    # Keep searching until we reach max_evals evaluations
    for i in range(max_evals):

        # Choose random hyperparameters
        hyperparameters = {j: random.sample(k, 1)[0] for j, k in param_grid.items()}
        print(hyperparameters)

        # Force full subsampling when the booster is 'gbtree'
        if hyperparameters['booster'] == 'gbtree':
            hyperparameters['subsample'] = 1.0

        # Evaluate the randomly selected hyperparameters
        eval_score = objective(hyperparameters, i)
        results.loc[i, 'score'] = eval_score
        results.loc[i, 'params'] = str(hyperparameters)
        results.loc[i, 'iteration'] = i + 1

    # Sort with the best score on top
    results.sort_values('score', ascending = False, inplace = True)
    results.set_index('iteration', inplace = True)
    return results`
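
For reference, a minimal usage sketch of the search above. The param_grid values are illustrative placeholders (the exact grid is not shown here), and train_boost is assumed to be the xgboost.DMatrix built from the training data:

`import random

import numpy as np

# Illustrative placeholder grid, not the exact grid used for the results below
param_grid = {
    'booster': ['gbtree', 'dart'],
    'learning_rate': list(np.logspace(-3, 0, 100)),
    'max_depth': list(range(3, 12)),
    'subsample': list(np.linspace(0.5, 1.0, 100)),
    'colsample_bytree': list(np.linspace(0.5, 1.0, 10)),
    'reg_alpha': list(np.linspace(0.0, 1.0, 50)),
    'reg_lambda': list(np.linspace(0.0, 1.0, 50)),
    'objective': ['binary:logistic'],
}

random.seed(50)
results = random_search(param_grid, max_evals = 5)
print(results.head())`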

However, when I fit the model and predict on the entire training set, I get the following results.

`xgb_clf_final = xgboost.XGBClassifier(booster = 'dart', learning_rate = 0.0028230894981783133,
                                  max_depth = 10, reg_alpha = 0.36734693877551017,
                                  reg_lambda = 0.5510204081632653, colsample_bytree = 0.6888888888888889,
                                  subsample = 0.7929292929292929, n_estimators = 28, n_jobs = -1)
xgb_clf_final.fit(X_train_final, y_train_final)
train_pred = xgb_clf_final.predict(X_train_final)
cm = confusion_matrix(y_train_final, train_pred)
cm

>>>array([[894536,      3],
          [ 76267,      8]], dtype=int64)`

I thought the cross-validation with AUC might be arbitrary, so I changed the cross-validation scoring to f1_micro, and the model performed just as well. I then switched to a random forest and predicted on the training set; it performs quite well.

`rnd_clf.fit(X_train_final, y_train_final)
train_pred = rnd_clf.predict(X_train_final)
cm = confusion_matrix(y_train_final, train_pred)
cm

>>>array([[894539, 0],
          [19,  76256]], dtype=int64)`

Although the random forest's trees are not regularized, there should not be such a great difference. Could someone help me figure this out? The evaluation below makes the XGBoost model look quite good, but when I fit the model and predict actual instances, it could hardly be worse. Why is this?

`xgb_clf_final.fit(X_train_final, y_train_final, eval_metric = 'auc',
                  eval_set = [(X_train_final, y_train_final), (X_train_final,
                               y_train_final)], early_stopping_rounds = 10)

>>> [0] validation_0-auc: 0.89303    validation_1-auc:0.8930303
          ..................................................
    [27] validation_0-auc: 0.89715   validation_1-auc:0.89715`

My versions: xgboost 1.3.3, scikit-learn 0.24.1.

When you call predict(), it applies a 0.5 threshold to the score output by the classifier. You may want to use predict_proba() instead and then apply a threshold different from 0.5. This is because the AUC-ROC metric measures the general performance of a probabilistic classifier over a range of possible thresholds (not just 0.5). See https://www.hpl.hp.com/techreports/2003/HPL-2003-4.pdf for more details.
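
A minimal sketch of that, assuming the fitted xgb_clf_final and training data from your question (the 0.3 threshold is only an illustrative value, not a recommendation):

`import numpy as np
from sklearn.metrics import confusion_matrix

# Probability of the positive class (default = 1) for each row
proba = xgb_clf_final.predict_proba(X_train_final)[:, 1]

# Apply a custom threshold instead of the 0.5 implicit in predict()
threshold = 0.3  # illustrative value only
train_pred = (proba >= threshold).astype(int)

print(confusion_matrix(y_train_final, train_pred))`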

Big thanks, sir. Now I am able to adjust the threshold to change the predictions. But how do I get the optimal threshold for the model?

You should decide on a cost function for false positives and false negatives, and then optimize that cost function by trying out many thresholds between 0 and 1.
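
A rough sketch of that search, again assuming the fitted xgb_clf_final from the question; the per-error costs below are placeholders you would replace with your real business costs:

`import numpy as np
from sklearn.metrics import confusion_matrix

# Placeholder costs: replace with the real cost of each type of error
COST_FP = 1.0   # cost of flagging a good loan as a default
COST_FN = 10.0  # cost of missing an actual default

proba = xgb_clf_final.predict_proba(X_train_final)[:, 1]

best_threshold, best_cost = 0.5, float('inf')
for threshold in np.linspace(0.01, 0.99, 99):
    pred = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_train_final, pred).ravel()
    cost = fp * COST_FP + fn * COST_FN
    if cost < best_cost:
        best_threshold, best_cost = threshold, cost

print(best_threshold, best_cost)`

Ideally, evaluate the candidate thresholds on a held-out validation set rather than the training data, so the chosen threshold is not tuned to training noise.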