I trained an XGBoost model for a binary classification problem: predicting loan default, where 1 marks a default and 0 the normal case. I built a random search with cross-validation to tune the model's hyperparameters, using xgboost.cv() with the metrics parameter set to 'auc'. The tuning looked very successful (the mean AUC was about 0.89).
`import random

import pandas as pd
import xgboost


def objective(hyperparameters, iteration):
    """Objective function for grid and random search. Returns
    the cross validation score from a set of hyperparameters."""
    # Perform k-fold cross validation (train_boost is the training DMatrix)
    cv_results = pd.DataFrame(xgboost.cv(hyperparameters, train_boost, num_boost_round = 300, nfold = 5,
                                         early_stopping_rounds = 10, metrics = 'auc', seed = 50))
    print(cv_results)
    # Result to return: mean test AUC of the last boosting round kept
    score = cv_results['test-auc-mean'].iloc[-1]
    return score


def random_search(param_grid, max_evals = 5):
    """Random search for hyperparameter optimization"""
    # Dataframe for results
    results = pd.DataFrame(columns = ['score', 'params', 'iteration'],
                           index = list(range(max_evals)))
    # Keep searching until reaching max evaluations
    for i in range(max_evals):
        # Choose random hyperparameters
        hyperparameters = {j: random.sample(k, 1)[0] for j, k in param_grid.items()}
        print(hyperparameters)
        # Force subsample to 1.0 for gbtree; otherwise keep the sampled value
        if hyperparameters['booster'] == 'gbtree':
            hyperparameters['subsample'] = 1.0
        # Evaluate randomly selected hyperparameters
        eval_score = objective(hyperparameters, i)
        # Use .loc to avoid chained assignment
        results.loc[i, 'score'] = eval_score
        results.loc[i, 'params'] = str(hyperparameters)
        results.loc[i, 'iteration'] = i + 1
    # Sort with best score on top
    results.sort_values('score', ascending = False, inplace = True)
    results.set_index('iteration', inplace = True)
    return results`
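The param_grid passed to random_search is not shown above. As a minimal sketch of the sampling step only, assuming a purely hypothetical grid (the names `sample_params` and `rng` and all the values below are illustrative, not the real search space), one draw could look like this:

```python
import random

# Hypothetical search space; the real param_grid is not shown in the post.
param_grid = {
    "booster": ["gbtree", "dart"],
    "max_depth": list(range(3, 11)),
    "subsample": [0.5, 0.75, 1.0],
}


def sample_params(grid, rng):
    """Pick one value per hyperparameter, uniformly at random."""
    return {name: rng.choice(values) for name, values in grid.items()}


# A seeded generator keeps the draw reproducible between runs.
rng = random.Random(50)
sampled = sample_params(param_grid, rng)
print(sampled)
```

Each key of the grid maps to a list of candidate values, so one draw yields a complete hyperparameter dictionary of the kind passed to objective.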
However, when I fit the model on the entire training set and then predict on that same set, I get the following results.
`xgb_clf_final = xgboost.XGBClassifier(booster = 'dart', learning_rate = 0.0028230894981783133,
                                      max_depth = 10, reg_alpha = 0.36734693877551017,
                                      reg_lambda = 0.5510204081632653, colsample_bytree = 0.6888888888888889,
                                      subsample = 0.7929292929292929, n_estimators = 28, n_jobs = -1)
xgb_clf_final.fit(X_train_final, y_train_final)
train_pred = xgb_clf_final.predict(X_train_final)
cm = confusion_matrix(y_train_final, train_pred)
cm
>>>array([[894536,      3],
          [ 76267,      8]], dtype=int64)`
I thought the choice of AUC for cross-validation might be arbitrary, so I changed the cross-validation scoring to f1_micro, and the model performed just as well. I then switched the model to a random forest, and when predicting the training set it performs quite well.
`rnd_clf.fit(X_train_final, y_train_final)
train_pred = rnd_clf.predict(X_train_final)
cm = confusion_matrix(y_train_final, train_pred)
cm
>>>array([[894539, 0],
[19, 76256]], dtype=int64)`
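On the f1_micro scoring mentioned above: for a single-label problem, micro-averaged F1 pools the counts of both classes, which makes it equal to plain accuracy. The sketch below is only an illustration; the helper names `micro_f1`, `accuracy` and `recall_positive` are made up here, and the only input is the XGBoost confusion matrix posted earlier. It suggests why that score can look good while almost no defaults are caught.

```python
def micro_f1(cm):
    """Micro-averaged F1 from a confusion matrix (rows = true, cols = predicted)."""
    n = len(cm)
    tp = sum(cm[i][i] for i in range(n))
    # Every off-diagonal cell is a false positive for one class
    # and a false negative for another, so both sums are equal.
    fp = sum(cm[j][i] for i in range(n) for j in range(n) if j != i)
    fn = sum(cm[i][j] for i in range(n) for j in range(n) if j != i)
    return 2 * tp / (2 * tp + fp + fn)


def accuracy(cm):
    """Share of all instances on the diagonal."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def recall_positive(cm):
    """Share of true class-1 instances predicted as class 1."""
    return cm[1][1] / (cm[1][0] + cm[1][1])


# XGBoost confusion matrix copied from the post: [[TN, FP], [FN, TP]]
xgb_cm = [[894536, 3], [76267, 8]]

print("micro F1:", micro_f1(xgb_cm))
print("accuracy:", accuracy(xgb_cm))
print("recall on defaults:", recall_positive(xgb_cm))
```

With roughly 92% of the rows in class 0, a model that predicts 0 almost everywhere still scores about 0.92 on both micro F1 and accuracy, while recall on the default class stays close to zero.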
Even though the random forest's trees are not regularized, the difference should not be this large. Could someone help me figure this out? Below is the evaluation of the XGBoost model: the AUC looks quite good, but when I fit the model and predict the actual instances, the result could hardly be worse. Why is this?
`xgb_clf_final.fit(X_train_final, y_train_final, eval_metric = 'auc',
                  eval_set = [(X_train_final, y_train_final), (X_train_final,
                              y_train_final)], early_stopping_rounds = 10)
>>> [0] validation_0-auc: 0.89303 validation_1-auc:0.8930303
..................................................
[27] validation_0-auc: 0.89715 validation_1-auc:0.89715`
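A minimal sketch of how these two numbers can disagree, using made-up labels and scores rather than the loan data: AUC depends only on how the predicted scores rank positives against negatives, whereas predict() on a binary XGBClassifier cuts the predicted probability at 0.5. The helpers `auc` and `confusion` are written out by hand here as assumptions for illustration; they are not library functions.

```python
def auc(y_true, scores):
    """Rank-based AUC: share of (positive, negative) pairs ordered correctly.
    Ties count as half a correct pair."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion(y_true, scores, threshold=0.5):
    """Return [[TN, FP], [FN, TP]], the same layout as sklearn's confusion_matrix."""
    cm = [[0, 0], [0, 0]]
    for y, s in zip(y_true, scores):
        cm[y][int(s > threshold)] += 1
    return cm


# Made-up example: positives are ranked above every negative,
# but all scores sit below the 0.5 cut-off.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.40, 0.41, 0.42, 0.43, 0.44, 0.45, 0.47, 0.48]

auc_value = auc(y_true, scores)
cm_default = confusion(y_true, scores)                 # threshold 0.5
cm_lower = confusion(y_true, scores, threshold=0.46)   # hypothetical lower cut-off

print(auc_value, cm_default, cm_lower)
```

In this toy case the ranking is perfect, yet nothing is predicted as class 1 at the 0.5 cut-off. Whether the same thing is happening with the model above would need checking against the predict_proba output, for example by looking at how the predicted probabilities are distributed around 0.5.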
My xgboost version is 1.3.3
scikit-learn 2.4.1