Replication of logloss for highly imbalanced dataset

Hi,
I am trying to define a custom loss function for a highly imbalanced medical dataset that replicates the original plain XGBoost under a particular parameter setting. To make sure my code works, I tried implementing the basic user-defined log loss, but I get very different results for what should be the same logistic-loss implementation. Can you please tell me where I may be going wrong? I have been stuck on this for days.
I have followed various pieces of advice from this forum, such as setting base_score to a value near 0 and converting the logits to probabilities, but with no better results yet.

My confusion matrices and params:

  1. Custom log-loss implementation:
    { 'max_depth': 2, 'eta': 0.4, 'disable_default_eval_metric': 1, 'base_score': 1e-16 }, obj=logregobj_for_alpha_not_1, feval=evalerror
              0       1
    0   [[75404    1019]
    1    [ 7882    3633]]

  2. Native 'binary:logistic' loss:
    { 'max_depth': 2, 'eta': 0.4, 'objective': 'binary:logistic' }
              0       1
    0   [[74829    1594]
    1    [ 7175    4340]]

Code Sample:

import numpy as np
from sklearn.metrics import roc_curve, auc

# Custom evaluation metric: AUC computed on probabilities. With a custom
# objective, preds arrive as raw margins, so apply the sigmoid first.
def evalerror(self, preds, dtrain):
    labels = dtrain.get_label()
    preds = 1.0 / (1.0 + np.exp(-preds))
    fpr, tpr, thresholds = roc_curve(labels, preds)
    return 'alpha-error', auc(fpr, tpr)

# Custom objective: plain logistic-loss gradient and hessian w.r.t. the raw margin.
def logregobj_for_alpha_not_1(self, preds, dtrain):
    labels = dtrain.get_label()
    prob = 1.0 / (1.0 + np.exp(-preds))  # transform raw leaf weight into a probability
    grad = prob - labels
    hess = prob * (1.0 - prob)
    return grad, hess
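
For completeness, here is roughly how I wire the objective and metric into training. This is a simplified, self-contained sketch: the data and the number of boosting rounds are placeholders, and the two functions are plain-function copies of the methods above (in my real code they are bound methods).

import numpy as np
import xgboost as xgb
from sklearn.metrics import roc_curve, auc

# Synthetic placeholder data with ~8% positives; my runs use the real dataset.
rng = np.random.RandomState(0)
X = rng.normal(size=(5000, 20))
y = (rng.uniform(size=5000) < 0.08).astype(np.float32)
dtrain = xgb.DMatrix(X, label=y)

def logregobj(preds, dtrain):
    labels = dtrain.get_label()
    prob = 1.0 / (1.0 + np.exp(-preds))
    return prob - labels, prob * (1.0 - prob)

def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    prob = 1.0 / (1.0 + np.exp(-preds))
    fpr, tpr, _ = roc_curve(labels, prob)
    return 'alpha-error', auc(fpr, tpr)

params = {'max_depth': 2, 'eta': 0.4,
          'disable_default_eval_metric': 1, 'base_score': 1e-16}

# num_boost_round is a placeholder; I use the same value for both the native
# and the custom run when comparing confusion matrices.
bst = xgb.train(params, dtrain, num_boost_round=100,
                obj=logregobj, feval=evalerror,
                evals=[(dtrain, 'train')])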

I would love to know what causes the difference in the confusion matrices and how to avoid it. Thanks!

Did you use data weights or scale_pos_weight? Right now, your custom logloss implementation does not account for data weights.

When I used scale_pos_weight, the difference between the confusion matrices of the native and custom log-loss implementations was huge. To debug this difference, I wanted to run without scale_pos_weight, but as you can see there is still a noteworthy difference in the confusion matrices. Can you please help me understand why this happens or how to debug it?
When I account for scale_pos_weight, my custom log loss changes as below:

def logregobj_for_alpha_not_1(self, preds, dtrain):
    labels = dtrain.get_label()
    weights = np.where(labels == 1.0, self.vxgb_param['scale_pos_weight'], 1.0)
    prob = 1.0 / (1.0 + np.exp(-preds))  # transform raw leaf weight
    grad = prob - labels
    hess = prob * (1.0 - prob)
    return grad, hess

Also, please let me know if you need anything from my side to find out why this happens. Thanks a lot !

You should multiply grad and hess by data weights, like this:
return (grad * weights), (hess * weights).
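
For example, something along these lines. This is only a sketch: weighted_logregobj is an illustrative name, and the weights come either from the DMatrix or, as a fallback, from your scale_pos_weight value (pass it in via functools.partial or a closure when handing the function to xgb.train).

import numpy as np

def weighted_logregobj(preds, dtrain, scale_pos_weight=1.0):
    labels = dtrain.get_label()
    # Use the per-row weights attached to the DMatrix if they exist...
    weights = dtrain.get_weight()
    if weights.size == 0:
        # ...otherwise derive them from scale_pos_weight (1.0 for negatives).
        weights = np.where(labels == 1.0, scale_pos_weight, 1.0)
    prob = 1.0 / (1.0 + np.exp(-preds))
    grad = (prob - labels) * weights
    hess = prob * (1.0 - prob) * weights
    return grad, hess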

Do you also see the difference when you run without scale_pos_weight?

Also, please make sure that you are running the latest version of XGBoost. Older versions of XGBoost may handle custom objectives and metrics differently.

Thanks for the prompt response.

  1. I am sorry, I missed the multiplication by the weights when I copy-pasted. I do use the same return (grad * weights), (hess * weights), so that piece of code is correct on my side. I still see the difference without scale_pos_weight; it is shown in the original post.

  2. Yes, I am using the latest XGBoost version - 1.2.0

Here is a working example of the logloss implemented as a custom objective: https://github.com/dmlc/xgboost/blob/master/demo/guide-python/custom_objective.py (This example assumes unweighted data.) Try running it yourself and see if you can adapt it.

Thanks, but I am using the exact same code on my dataset and I still get differences between the native and custom implementations. One noteworthy point is that I get the same results for small datasets such as the sklearn breast_cancer or Hastie datasets. But for my dataset, which has 1200 features, a ~350K-row train set, and a 90K-row test set with ~8% positive class, the results are not the same even for the exact implementation that you have shared.

I have no idea. It could be due to truncation error in floating-point calculations; floating-point arithmetic is known not to be associative.
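
For example, in float32 the order of operations alone can change a result:

import numpy as np

a = np.float32(1e8)
b = np.float32(1.0)
print((a + b) - a)   # 0.0: b is absorbed when it is added to the large value first
print((a - a) + b)   # 1.0: mathematically the same quantity, different order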

If that is the case, how does native XGBoost handle it better? Can you please suggest any way to overcome it?

Unfortunately I have no idea.

No problem. Thank you so much for the help. Can you please direct me to someone who might be able to help with my case? I am working on a RAPID-funded contact-tracing application, and the custom loss that I want to implement is the crux of the project. Unfortunately, I am not getting better results, and this could be one of the reasons.

You may consider writing your objective in C++, by adding a new class to https://github.com/dmlc/xgboost/blob/master/src/objective/regression_loss.h. Adding a new objective in C++ is not too difficult if your objective is element-wise (i.e. grad[i] and hess[i] can be computed solely from the i-th row of the input data matrix).

See https://github.com/dmlc/xgboost/pull/4541 for an example of adding a new objective and a metric in C++.

@goku_grad_asu1 One suggestion: you should try explicitly specifying the float32 data type in your customized objective function. Otherwise, NumPy will use float64 for functions like np.exp, whereas XGBoost consistently uses float32 for all internal calculations.

import numpy as np
from sklearn.metrics import roc_curve, auc

def evalerror(self, preds, dtrain):
    labels = dtrain.get_label()
    one_scalar = np.array([1.0], dtype=np.float32)
    # Force float32 throughout, matching XGBoost's internal precision.
    preds = one_scalar / (one_scalar + np.exp(-preds, dtype=np.float32))
    fpr, tpr, thresholds = roc_curve(labels, preds)
    return 'alpha-error', auc(fpr, tpr)

def logregobj_for_alpha_not_1(self, preds, dtrain):
    labels = dtrain.get_label()
    one_scalar = np.array([1.0], dtype=np.float32)
    prob = one_scalar / (one_scalar + np.exp(-preds, dtype=np.float32))
    grad = prob - labels
    hess = prob * (one_scalar - prob)
    return grad, hess
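
As a quick sanity check, you can also print the dtypes inside the objective to see whether anything is silently promoted to float64 (a debugging aid only, not part of the fix):

import numpy as np

def logregobj_dtype_check(preds, dtrain):
    labels = dtrain.get_label()
    one_scalar = np.array([1.0], dtype=np.float32)
    prob = one_scalar / (one_scalar + np.exp(-preds, dtype=np.float32))
    # If any of these prints float64, an up-cast is happening somewhere.
    print(preds.dtype, labels.dtype, prob.dtype)
    grad = prob - labels
    hess = prob * (one_scalar - prob)
    return grad, hess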

Thank you for these suggestions. I will try them out and let you know the results :slight_smile:

Also, I want to localize the problem. The gradients and hessians could be different because of floating-point precision. Would it be possible to debug that? In other words, is there a way I can get the gradient and hessian values of the native XGBoost 'binary:logistic' implementation?

You can build XGBoost with the CMake option -DUSE_DEBUG_OUTPUT=ON.

Note that this option produces a lot of console output, so use it with a small amount of data.
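
Alternatively, you can reconstruct what binary:logistic computes from the Python side: predict the raw margin after a given number of trees and apply the textbook logistic gradient and hessian to it. A sketch, where bst is a trained Booster and dtrain your training DMatrix:

import numpy as np

def reference_grad_hess(bst, dtrain, num_trees):
    """Approximate the binary:logistic gradients/hessians after num_trees trees."""
    labels = dtrain.get_label()
    # Raw, untransformed margin (sum of leaf weights) of the first num_trees trees.
    margin = bst.predict(dtrain, output_margin=True, ntree_limit=num_trees)
    prob = 1.0 / (1.0 + np.exp(-margin))
    grad = prob - labels
    # XGBoost clamps the hessian away from zero internally; mirror that here.
    hess = np.maximum(prob * (1.0 - prob), 1e-16)
    return grad, hess

You can then compare these values against what your custom objective returns for the same margins.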

Hi @hcho3,
I just wanted to provide an update on the NumPy float32/float64 discrepancy. I tried your suggestion but I am still getting the same results :confused:
I am also installing and building XGBoost with CMake, both to get the debugging output and to implement the custom loss function in C++.

I am following this part of the instructions to install and build XGBoost with CMake: https://xgboost.readthedocs.io/en/latest/build.html#building-on-linux-distributions
I will update you soon on this. Please let me know if I should use any link other than the one specified. Thank you.

Hi @hcho3,
I tried a couple of your suggestions and still can't figure out why the native XGBoost performs better than the custom log-loss implementation.

However, while debugging the gradients, I found that they differ noticeably: the native model reports values with much higher precision than the Python custom log loss.
For example, for one data point from the breast_cancer dataset:
[NATIVE MODEL] G = 0.0024246801622211933 / H = 0.0024188011884689331
[CUSTOM Logloss MODEL] G = 0.00061967276 / H = 0.00061928877

Could this, together with floating-point non-associativity, be the reason the discrepancy grows as the dataset size increases?

However, I also want to try implementing the alpha loss (https://arxiv.org/pdf/2006.12406.pdf) for imbalanced classes. When I tried implementing it in Python, I couldn't get the expected results.
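
For concreteness, my Python attempt looks roughly like the sketch below. I am assuming the binary alpha-loss has the form l_alpha(y, p) = alpha/(alpha-1) * (1 - p_y^(1 - 1/alpha)), where p_y is the predicted probability of the true class (this should be double-checked against the paper), and alpha_loss_obj is just my working name:

import numpy as np

def alpha_loss_obj(preds, dtrain, alpha=2.0):
    labels = dtrain.get_label()
    c = 1.0 - 1.0 / alpha
    prob = 1.0 / (1.0 + np.exp(-preds))
    p_y = np.where(labels == 1.0, prob, 1.0 - prob)   # probability of the true class
    sign = np.where(labels == 1.0, 1.0, -1.0)         # d p_y / d margin = sign * p * (1 - p)
    # First and second derivatives w.r.t. the raw margin, derived by chain rule
    # from the assumed loss form; at alpha = 1 they reduce to p - y and p * (1 - p).
    grad = -sign * p_y ** c * (1.0 - p_y)
    hess = p_y ** c * (1.0 - p_y) * (p_y - c * (1.0 - p_y))
    # For alpha > 1 the loss is not convex in the margin, so the true second
    # derivative can be negative; clamp it so the split solver stays stable.
    hess = np.maximum(hess, 1e-16)
    return grad, hess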

So, I would like to implement it in C++ and see the performance difference. Can you please tell me how to write a custom objective in C++? I can document this and contribute it to XGBoost :slight_smile:

Take a look at https://github.com/dmlc/xgboost/pull/4763/files#diff-d50bd6f39ba8e23288ec1852970dc9c980244d1cac07d8ddcda7ddbb2e95ba3e. This is a good example of implementing a new loss function in XGBoost.