Behavior of eta and max_depth not as expected

Can anyone help me understand the behavior below?

  • Using a lower learning rate, say 0.01, and more trees, I end up with a poorer model as measured by AUC.
  • Using a higher learning rate, say 0.4, and fewer trees, I end up with a superior model as measured by AUC.

A smaller, more performant model is preferable to a larger, less performant one, so I am happy with the results. Still, the outcome differs from what I have learned about boosting with respect to learning rate and number of trees. I suppose it is simply a difference between classifiers. Still, I would like to understand the why as best as possible, as I plan to replace the old classifier with XGBoost in a production environment next year.

The dataset is highly imbalanced. Below is the parameter setup.

  trainlabel <- trainloandata$event
  trainpred <- trainloandata[,!(names(trainloandata) %in% c('event'))]
  dtrain <- xgb.DMatrix(data = data.matrix(trainpred), label = trainlabel)
  
  testlabel <- testloandata$event
  testpred <- testloandata[,!(names(testloandata) %in% c('event'))]
  dtest <- xgb.DMatrix(data = data.matrix(testpred), label = testlabel)
  
  params <- list(booster = "gbtree",
                 objective = "binary:logistic",
                 tree_method = 'approx',
                 eta = 0.4,
                 gamma = 0,
                 max_depth = 8,
                 max_delta_step = 2,
                 min_child_weight = 1,
                 subsample = .5,
                 colsample_bytree = 1.0)
  
  params_constrained <- params
  watchlist <- list(train = dtrain, test = dtest)
  
  xgboost_model <- xgb.train(params = params, 
                             data = dtrain, 
                             nrounds = 500, 
                             watchlist = watchlist, 
                             print_every_n = 1, 
                             early_stopping_rounds = 5, 
                             maximize = TRUE, 
                             eval_metric = "auc",
                             metric_name = "dtrain_auc")

Can you post the value of the AUC metric on both the training and the test data? It may be that the first model was underfitting, leading to low AUC on both.

@hcho3 - Sure, I will have to make another run at the lower eta with more trees. It takes a bit of time as the datasets are quite large. The training set I am working with now is 100mm+ rows, and I think I need to do some work in terms of OpenMP as I cannot get distributed training working yet. Will post later tonight.

Attached is a scatter of predicted vs. actual using eta = .4 and 26 trees. I no longer have the graph for eta = .01 and 126 trees, as it was overwritten. In the eta = .01 case, most of the predicted-vs.-actual observations above 25% actual fell below the lower bound of the channel.

That said, I have been working on this in XGBoost for some time, and today is a new configuration of the ML pipeline setup, so I should try to replicate the outcome again. FYI, this is mortgage prepayment, and the upper and lower bounds are +/- 2 CPR, so the performance thus far is excellent in that it is very difficult to get these models to stay on the 45-degree line. I still have a few steps to go. Right now, I am working on reducing the overall memory footprint given the size of the datasets.

I cleared the environment and ran the same model with the setup below. The only difference is eta; the AUC score is .80 vs. .81, but the model performs very poorly as shown by predicted vs. actual.

  params <- list(booster = "gbtree",
                 objective = "binary:logistic",
                 tree_method = 'approx',
                 eta = 0.01,
                 gamma = 0,
                 max_depth = 8,
                 max_delta_step = 2,
                 min_child_weight = 1,
                 subsample = .5,
                 colsample_bytree = 1.0)
  
  params_constrained <- params
  watchlist <- list(train = dtrain, test = dtest)
  
  xgboost_model <- xgb.train(params = params,
                             data = dtrain,
                             nrounds = 500,
                             watchlist = watchlist,
                             print_every_n = 1,
                             early_stopping_rounds = 5,
                             maximize = TRUE,
                             eval_metric = "auc",
                             metric_name = "dtrain_auc")

Looking at the features shows that most have all but been zeroed out, even though both L1 and L2 regularization are at their defaults. max_delta_step is set to 2; this is to make the model more conservative and also to account for the imbalance in the dataset.

Finally, I calculate AUC myself and plot the curve; I get an AUC of 0.8. This leads me to conclude that the lower the eta, the less conservative the model needs to be (?), which implies a relationship/trade-off between max_delta_step and eta, I think. Going to re-run with the default max_delta_step.
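As a sketch of that follow-up run (assuming the rest of the setup stays the same), dropping max_delta_step from the parameter list lets it fall back to its default of 0, i.e. no cap on the tree leaf outputs; `params_default_delta` is a hypothetical name for this variant:

  params_default_delta <- list(booster = "gbtree",
                               objective = "binary:logistic",
                               tree_method = 'approx',
                               eta = 0.01,
                               gamma = 0,
                               max_depth = 8,
                               # max_delta_step omitted: defaults to 0 (no cap on leaf outputs)
                               min_child_weight = 1,
                               subsample = .5,
                               colsample_bytree = 1.0)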

Was the AUC metric computed on the training data or the test data? I asked for both training and test AUC and you provided only a single value.

Setting lower eta will in general make the model more conservative, so having a lower AUC on the training data may well be justified.

Also, AUC may not be the best metric for your application. Consider using AUCPR metric instead, or design a custom metric function.
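In the R package, AUCPR is available as a built-in evaluation metric (assuming a reasonably recent XGBoost build), so switching only requires changing the eval_metric argument; a sketch against the call above:

  xgboost_model <- xgb.train(params = params,
                             data = dtrain,
                             nrounds = 500,
                             watchlist = watchlist,
                             early_stopping_rounds = 5,
                             maximize = TRUE,
                             eval_metric = "aucpr")  # area under the precision-recall curve

Early stopping will then monitor AUCPR on the last entry of the watchlist instead of AUC.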

@hcho3 the AUC in the graph is the test-data AUC. I do get this warning when starting training:

[13:09:22] WARNING: amalgamation/…/src/learner.cc:516:
Parameters: { metric_name } might not be used.

This may not be accurate due to some parameters are only used in language bindings but
passed down to XGBoost core. Or some parameters are not used but slip through this
verification. Please open an issue if you find above cases.

I think it is a case of messing around with XGBoost to learn how it works. I know I can get a performant model at 600 KB in size, which is really good given that there are 53 sub-models.

Finally, FYI, I have set the problem up so that the negative class is 0 and the positive class is 1. I believe I read that it is best to use error if one is interested in correct probabilities, which I am.

I have no idea then. Maybe fitting more trees compensated for the lower eta.

use error if one is interested in correct probabilities

Not quite. The XGBoost doc says that you should avoid setting scale_pos_weight if you want well-calibrated probabilities. The choice of evaluation metric does not interfere with either model fitting or prediction calibration; the metric only helps you choose between competing model alternatives.
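A minimal sketch of the two configurations this implies, assuming `params` from above and hypothetical class counts `neg_n`/`pos_n` for the negative and positive classes:

  # For pure ranking performance on imbalanced data, the docs suggest
  # scale_pos_weight = (count of negatives) / (count of positives) --
  # but this distorts the predicted probabilities:
  params_rank <- modifyList(params, list(scale_pos_weight = neg_n / pos_n))

  # For well-calibrated probabilities, leave scale_pos_weight at its
  # default of 1 and control imbalance via max_delta_step instead:
  params_calib <- modifyList(params, list(scale_pos_weight = 1))

modifyList() is base R; it simply overrides the named entries of the original list.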

It is said that AUCPR is better than AUC for evaluating models for imbalanced data: https://towardsdatascience.com/sampling-techniques-for-extremely-imbalanced-data-281cc01da0a8.

@hcho3 thanks for the link, I will check it out. I think you are correct with respect to trees and eta. I am going to play with the settings because it looks to me, at least now, that one can tune such that eta is larger and the number of trees much smaller if the model is made more conservative. This would be great if it is the case, as one could then create a performant model with fewer trees and a smaller memory footprint. Who knows, but it is worth investigating.