Hi,
I am trying to define a custom loss function for a highly imbalanced medical dataset, and as a first step I want it to replicate plain xgboost under a particular parameter setting. To make sure my code works, I implemented the basic user-defined log loss, but I get very different results from the built-in logistic loss for what should be the same implementation. Can you please tell me where I may be going wrong? I have been stuck on this for days.
I followed various advice from this forum, such as setting base_score to a value near 0 and converting the logits to probabilities, but with no better results yet.
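For reference, the logit-to-probability conversion I am assuming is the standard sigmoid (the margin values below are made up purely for illustration; real margins would come from something like bst.predict(dtest, output_margin=True)):

```python
import numpy as np

# Hypothetical raw margins (logits) for illustration only
margin = np.array([-2.0, 0.0, 2.0])

# Standard sigmoid: note the minus sign inside exp()
prob = 1.0 / (1.0 + np.exp(-margin))
```

A margin of 0 should map to a probability of exactly 0.5, and larger margins to larger probabilities.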
My confusion matrices and params:

Custom logloss implementation:
params = {'max_depth': 2, 'eta': 0.4, 'disable_default_eval_metric': 1, 'base_score': 1e16}, with obj=logregobj_for_alpha_not_1 and feval=evalerror passed to xgb.train
            pred 0   pred 1
actual 0    75404     1019
actual 1     7882     3633
Native 'binary:logistic' loss:
params = {'max_depth': 2, 'eta': 0.4, 'objective': 'binary:logistic'}
            pred 0   pred 1
actual 0    74829     1594
actual 1     7175     4340
Code Sample:
import numpy as np
from sklearn.metrics import roc_curve, auc

def evalerror(self, preds, dtrain):
    labels = dtrain.get_label()
    preds = 1.0 / (1.0 + np.exp(preds))
    fpr, tpr, thresholds = roc_curve(labels, preds)
    return 'alphaerror', auc(fpr, tpr)

def logregobj_for_alpha_not_1(self, preds, dtrain):
    labels = dtrain.get_label()
    prob = 1.0 / (1.0 + np.exp(preds))  # transform raw leaf weight
    grad = prob - labels
    hess = prob * (1.0 - prob)
    return grad, hess
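One way to check whether the gradient and hessian formulas themselves are right is to compare the analytic gradient against finite differences of the log loss. This is just a debugging sketch (not part of my pipeline), assuming preds are raw margins and labels are in {0, 1}:

```python
import numpy as np

def logloss(preds, labels):
    # Binary cross-entropy on raw margins, via the standard sigmoid
    prob = 1.0 / (1.0 + np.exp(-preds))
    return -(labels * np.log(prob) + (1.0 - labels) * np.log(1.0 - prob))

preds = np.array([0.3, -1.2, 2.0])      # made-up margins
labels = np.array([1.0, 0.0, 1.0])      # made-up labels
eps = 1e-6

# Central finite-difference approximation of d(loss)/d(margin)
grad_fd = (logloss(preds + eps, labels) - logloss(preds - eps, labels)) / (2 * eps)

# Analytic gradient/hessian that binary:logistic is documented to use
prob = 1.0 / (1.0 + np.exp(-preds))
grad_analytic = prob - labels
hess_analytic = prob * (1.0 - prob)
```

If grad_fd and grad_analytic disagree, the custom objective cannot match the native one regardless of the other parameters.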
I would love to know what causes the difference between the two confusion matrices and how to avoid it. Thanks!