Replication of logloss for highly imbalanced dataset

Hi,
I am trying to define a custom loss function for a highly imbalanced medical dataset that replicates the original plain XGBoost under a particular parameter setting. To make sure my code works, I tried implementing the basic user-defined log loss, but I get very different results for what should be the same logistic-loss implementation. Can you please tell me where I may be going wrong? I have been stuck on this for days.
I have followed various pieces of advice from this forum, such as setting base_score to a value near 0 and converting the logits to probabilities, but with no better results yet.

My confusion matrices and params:

  1. Custom log-loss implementation:
    { 'max_depth': 2, 'eta': 0.4, 'disable_default_eval_metric': 1, 'base_score': 1e-16 }, obj=logregobj_for_alpha_not_1, feval=evalerror
              0       1
    0   [[75404    1019]
    1    [ 7882    3633]]

  2. Native 'binary:logistic' loss:
    { 'max_depth': 2, 'eta': 0.4, 'objective': 'binary:logistic' }
              0       1
    0   [[74829    1594]
    1    [ 7175    4340]]

Code Sample:

import numpy as np
from sklearn.metrics import roc_curve, auc

# Custom evaluation metric: AUC computed on probabilities. With a custom
# objective, preds arrive as raw margins, so apply the sigmoid first.
def evalerror(self, preds, dtrain):
    labels = dtrain.get_label()
    preds = 1.0 / (1.0 + np.exp(-preds))
    fpr, tpr, thresholds = roc_curve(labels, preds)
    return 'alpha-error', auc(fpr, tpr)

# Custom objective: plain logistic-loss gradient and hessian w.r.t. the raw margin.
def logregobj_for_alpha_not_1(self, preds, dtrain):
    labels = dtrain.get_label()
    prob = 1.0 / (1.0 + np.exp(-preds))  # transform raw leaf weight into a probability
    grad = prob - labels
    hess = prob * (1.0 - prob)
    return grad, hess
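
For completeness, here is roughly how I wire the objective and metric into training. This is a simplified, self-contained sketch: the data and the number of boosting rounds are placeholders, and the two functions are plain-function copies of the methods above (in my real code they are bound methods).

import numpy as np
import xgboost as xgb
from sklearn.metrics import roc_curve, auc

# Synthetic placeholder data with ~8% positives; my runs use the real dataset.
rng = np.random.RandomState(0)
X = rng.normal(size=(5000, 20))
y = (rng.uniform(size=5000) < 0.08).astype(np.float32)
dtrain = xgb.DMatrix(X, label=y)

def logregobj(preds, dtrain):
    labels = dtrain.get_label()
    prob = 1.0 / (1.0 + np.exp(-preds))
    return prob - labels, prob * (1.0 - prob)

def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    prob = 1.0 / (1.0 + np.exp(-preds))
    fpr, tpr, _ = roc_curve(labels, prob)
    return 'alpha-error', auc(fpr, tpr)

params = {'max_depth': 2, 'eta': 0.4,
          'disable_default_eval_metric': 1, 'base_score': 1e-16}

# num_boost_round is a placeholder; I use the same value for both the native
# and the custom run when comparing confusion matrices.
bst = xgb.train(params, dtrain, num_boost_round=100,
                obj=logregobj, feval=evalerror,
                evals=[(dtrain, 'train')])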

I would love to know what causes the difference in the confusion matrices and how to avoid it. Thanks!

Did you use data weights or scale_pos_weight? Right now, your custom logloss implementation does not account for data weights.

When I used scale_pos_weight, the difference between the confusion matrices of the native and custom log-loss implementations was huge. To debug this difference, I wanted to run without scale_pos_weight, but as you can see there is still a noteworthy difference in the confusion matrices. Can you please help me understand why this happens or how to debug it?
When I account for scale_pos_weight, my custom log loss changes as below:

def logregobj_for_alpha_not_1(self, preds, dtrain):
    labels = dtrain.get_label()
    weights = np.where(labels == 1.0, self.vxgb_param['scale_pos_weight'], 1.0)
    prob = 1.0 / (1.0 + np.exp(-preds))  # transform raw leaf weight
    grad = prob - labels
    hess = prob * (1.0 - prob)
    return grad, hess

Also, please let me know if you need anything from my side to find out why this happens. Thanks a lot !

You should multiply grad and hess by data weights, like this:
return (grad * weights), (hess * weights).
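
For example, something along these lines. This is only a sketch: weighted_logregobj is an illustrative name, and the weights come either from the DMatrix or, as a fallback, from your scale_pos_weight value (pass it in via functools.partial or a closure when handing the function to xgb.train).

import numpy as np

def weighted_logregobj(preds, dtrain, scale_pos_weight=1.0):
    labels = dtrain.get_label()
    # Use the per-row weights attached to the DMatrix if they exist...
    weights = dtrain.get_weight()
    if weights.size == 0:
        # ...otherwise derive them from scale_pos_weight (1.0 for negatives).
        weights = np.where(labels == 1.0, scale_pos_weight, 1.0)
    prob = 1.0 / (1.0 + np.exp(-preds))
    grad = (prob - labels) * weights
    hess = prob * (1.0 - prob) * weights
    return grad, hess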

Do you also see the difference when you run without scale_pos_weight?

Also, please make sure that you are running the latest version of XGBoost. Older versions of XGBoost may handle custom objectives and metrics differently.

Thanks for the prompt response.

  1. I am sorry, I missed the multiplication by the weights when I copy-pasted. I do use the same return (grad * weights), (hess * weights), so that piece of code is correct on my side. I still see the difference without scale_pos_weight; it is shown in the original post.

  2. Yes, I am using the latest XGBoost version - 1.2.0

Here is a working example of the logloss implemented as a custom objective: https://github.com/dmlc/xgboost/blob/master/demo/guide-python/custom_objective.py (This example assumes unweighted data.) Try running it yourself and see if you can adapt it.

Thanks, but I am using the exact same code on my dataset and I still get differences between the native and custom implementations. One noteworthy point is that I get the same results for small datasets such as the sklearn breast_cancer or Hastie datasets. But for my dataset, which has 1200 features, a ~350K-row train set, and a 90K-row test set with ~8% positive class, the results are not the same even for the exact implementation that you have shared.

I have no idea. It could be due to truncation error in floating-point calculations; floating-point arithmetic is known not to be associative.
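
For example, in float32 the order of operations alone can change a result:

import numpy as np

a = np.float32(1e8)
b = np.float32(1.0)
print((a + b) - a)   # 0.0: b is absorbed when it is added to the large value first
print((a - a) + b)   # 1.0: mathematically the same quantity, different order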

If that is the case, how does native XGBoost handle it better? Can you please suggest any way to overcome it?

Unfortunately I have no idea.

No problem. Thank you so much for the help. Can you please direct me to someone who might be able to help with my case? I am working on a RAPID-funded contact-tracing application, and the custom loss that I want to implement is the crux of the project. Unfortunately, I am not getting better results, and this could be one of the reasons.

You may consider writing your objective in C++, by adding a new class to https://github.com/dmlc/xgboost/blob/master/src/objective/regression_loss.h. Adding a new objective in C++ is not too difficult if your objective is element-wise (i.e. grad[i] and hess[i] can be computed solely from the i-th row of the input data matrix).

See https://github.com/dmlc/xgboost/pull/4541 for an example of adding a new objective and a metric in C++.

@goku_grad_asu1 One suggestion: you should try explicitly specifying the float32 data type in your customized objective function. Otherwise, NumPy will use float64 for functions like np.exp, whereas XGBoost consistently uses float32 for all internal calculations.

import numpy as np
from sklearn.metrics import roc_curve, auc

def evalerror(self, preds, dtrain):
    labels = dtrain.get_label()
    one_scalar = np.array([1.0], dtype=np.float32)
    # Force float32 throughout, matching XGBoost's internal precision.
    preds = one_scalar / (one_scalar + np.exp(-preds, dtype=np.float32))
    fpr, tpr, thresholds = roc_curve(labels, preds)
    return 'alpha-error', auc(fpr, tpr)

def logregobj_for_alpha_not_1(self, preds, dtrain):
    labels = dtrain.get_label()
    one_scalar = np.array([1.0], dtype=np.float32)
    prob = one_scalar / (one_scalar + np.exp(-preds, dtype=np.float32))
    grad = prob - labels
    hess = prob * (one_scalar - prob)
    return grad, hess
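
As a quick sanity check, you can also print the dtypes inside the objective to see whether anything is silently promoted to float64 (a debugging aid only, not part of the fix):

import numpy as np

def logregobj_dtype_check(preds, dtrain):
    labels = dtrain.get_label()
    one_scalar = np.array([1.0], dtype=np.float32)
    prob = one_scalar / (one_scalar + np.exp(-preds, dtype=np.float32))
    # If any of these prints float64, an up-cast is happening somewhere.
    print(preds.dtype, labels.dtype, prob.dtype)
    grad = prob - labels
    hess = prob * (one_scalar - prob)
    return grad, hess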

Thank you for these suggestions. I will try them out and let you know the results :slight_smile:

Also, I want to localize the problem. The gradients and hessians could be different because of floating-point precision. Would it be possible to debug that? In other words, is there a way I can get the gradient and hessian values of the native XGBoost 'binary:logistic' implementation?

You can build XGBoost with the CMake option -DUSE_DEBUG_OUTPUT=ON.

Note that this option produces a lot of console output, so use it with a small amount of data.
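
Alternatively, you can reconstruct what binary:logistic computes from the Python side: predict the raw margin after a given number of trees and apply the textbook logistic gradient and hessian to it. A sketch, where bst is a trained Booster and dtrain your training DMatrix:

import numpy as np

def reference_grad_hess(bst, dtrain, num_trees):
    """Approximate the binary:logistic gradients/hessians after num_trees trees."""
    labels = dtrain.get_label()
    # Raw, untransformed margin (sum of leaf weights) of the first num_trees trees.
    margin = bst.predict(dtrain, output_margin=True, ntree_limit=num_trees)
    prob = 1.0 / (1.0 + np.exp(-margin))
    grad = prob - labels
    # XGBoost clamps the hessian away from zero internally; mirror that here.
    hess = np.maximum(prob * (1.0 - prob), 1e-16)
    return grad, hess

You can then compare these values against what your custom objective returns for the same margins.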

Hi @hcho3,
I just wanted to provide an update on the NumPy float32/float64 discrepancy. I tried your suggestion but I am still getting the same results :confused:
I am also installing and building XGBoost with CMake, both to get the debugging output and to implement the custom loss function in C++.

I am following this part of the instructions to install and build XGBoost with CMake: https://xgboost.readthedocs.io/en/latest/build.html#building-on-linux-distributions
I will update you soon on this. Please let me know if I should use any link other than the one specified. Thank you.

Hi @hcho3,
I tried a couple of your suggestions and still can't figure out why the native XGBoost performs better than the custom log-loss implementation.

However, while debugging the gradients, I found that they differ noticeably: the native model reports values with much higher precision than the Python custom log loss.
For example, for one data point from the breast_cancer dataset:
[NATIVE MODEL] G = 0.0024246801622211933 / H = 0.0024188011884689331
[CUSTOM Logloss MODEL] G = 0.00061967276 / H = 0.00061928877

Could this, together with floating-point non-associativity, be the reason the discrepancy grows as the dataset size increases?

However, I also want to try implementing the alpha loss (https://arxiv.org/pdf/2006.12406.pdf) for imbalanced classes. When I tried implementing it in Python, I couldn't get the expected results.
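
For concreteness, my Python attempt looks roughly like the sketch below. I am assuming the binary alpha-loss has the form l_alpha(y, p) = alpha/(alpha-1) * (1 - p_y^(1 - 1/alpha)), where p_y is the predicted probability of the true class (this should be double-checked against the paper), and alpha_loss_obj is just my working name:

import numpy as np

def alpha_loss_obj(preds, dtrain, alpha=2.0):
    labels = dtrain.get_label()
    c = 1.0 - 1.0 / alpha
    prob = 1.0 / (1.0 + np.exp(-preds))
    p_y = np.where(labels == 1.0, prob, 1.0 - prob)   # probability of the true class
    sign = np.where(labels == 1.0, 1.0, -1.0)         # d p_y / d margin = sign * p * (1 - p)
    # First and second derivatives w.r.t. the raw margin, derived by chain rule
    # from the assumed loss form; at alpha = 1 they reduce to p - y and p * (1 - p).
    grad = -sign * p_y ** c * (1.0 - p_y)
    hess = p_y ** c * (1.0 - p_y) * (p_y - c * (1.0 - p_y))
    # For alpha > 1 the loss is not convex in the margin, so the true second
    # derivative can be negative; clamp it so the split solver stays stable.
    hess = np.maximum(hess, 1e-16)
    return grad, hess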

So, I would like to implement it in C++ and see the performance difference. Can you please tell me how to write a custom objective in C++? I can document this and contribute it to XGBoost :slight_smile:

Take a look at https://github.com/dmlc/xgboost/pull/4763/files#diff-d50bd6f39ba8e23288ec1852970dc9c980244d1cac07d8ddcda7ddbb2e95ba3e. This is a good example of implementing a new loss function in XGBoost.