Cross Validation Error - average of each fold?

brebbles · October 22, 2018, 11:39pm

I’ve been using a custom objective and eval error (which happens to be a raw sum of errors, rather than an average per record) with xgb.cv. After digging into some results, it seems that the value returned for the eval error in CV is the simple average of the eval errors of the individual folds.
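For concreteness, a raw-sum eval of this kind might look like the sketch below. It assumes the `feval(preds, dtrain)` convention of returning a `(name, value)` pair. The metric name `sum_abs_err` and the use of absolute error are illustrative assumptions, not the actual metric from this post.

```python
def sum_abs_error(preds, dtrain):
    """Custom eval: raw sum of absolute errors over all records.

    `dtrain` is expected to expose get_label(), as xgboost.DMatrix does.
    The label 'sum_abs_err' is a hypothetical name used for illustration.
    """
    labels = dtrain.get_label()
    # Sum over records, deliberately NOT divided by the number of records.
    total = sum(abs(float(p) - float(y)) for p, y in zip(preds, labels))
    return 'sum_abs_err', total
```

A function like this would be passed to `xgb.cv` through its `feval` argument, so each fold reports one summed value.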

I appreciate that the point of CV is to have the same number of records in each fold, but where the number of records per fold differs even slightly, this could let some (even minor) bias creep in.

Is there any way to have CV report the error metric as the sum of the errors across the folds, rather than the average?

hcho3 · October 23, 2018, 7:00am

I think you can manually edit this line to use np.sum instead of np.mean:

https://github.com/dmlc/xgboost/blob/e26b5d63b228729598351cb68935d6b776bb8267/python-package/xgboost/training.py#L316


            k, v = it.split(':')
            if k not in cvmap:
                cvmap[k] = []
            cvmap[k].append(float(v))
    msg = idx
    results = []
    for k, v in sorted(cvmap.items(), key=lambda x: (x[0].startswith('test'), x[0])):
        v = np.array(v)
        if not isinstance(msg, STRING_TYPES):
            msg = msg.decode()
        mean, std = np.mean(v), np.std(v)
        results.extend([(k, mean, std)])
    return results
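To see what that change would do without patching the package, here is a standalone sketch of the aggregation step for a single metric. `aggregate_folds` and its `reducer` argument are hypothetical names, not part of xgboost, and the per-fold values are made up.

```python
import math

def aggregate_folds(fold_values, reducer="mean"):
    """Aggregate one metric's per-fold values, loosely mirroring the excerpt.

    fold_values: list of floats, one per fold.
    reducer: "mean" (as in the excerpt) or "sum" (the suggested edit).
    Returns (aggregate, std); std is the population std, np.std's default.
    """
    k = len(fold_values)
    mean = sum(fold_values) / k
    std = math.sqrt(sum((v - mean) ** 2 for v in fold_values) / k)
    if reducer == "mean":
        agg = mean
    elif reducer == "sum":
        agg = sum(fold_values)
    else:
        raise ValueError("reducer must be 'mean' or 'sum'")
    return agg, std

fold_errors = [10.0, 12.0, 14.0]  # made-up eval values for three folds
print(aggregate_folds(fold_errors, "mean"))
print(aggregate_folds(fold_errors, "sum"))
```

Only the aggregate changes between the two calls; the spread across folds is computed the same way in both.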





brebbles · October 30, 2018, 11:24pm

Many thanks for the reply - I’ll have a look for the same line in the R package.

thvasilo · October 31, 2018, 12:57pm

Just wanted to note: by taking the sum, you increase the chance that one fold shows a worse metric simply because it has more samples in it.
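A small illustration of that effect, using made-up per-record errors in which every record is equally wrong, so neither fold is really "worse":

```python
# Per-record absolute errors for two folds of different size (made-up numbers).
fold_a = [0.5] * 100
fold_b = [0.5] * 110

sum_a, sum_b = sum(fold_a), sum(fold_b)
mean_a, mean_b = sum_a / len(fold_a), sum_b / len(fold_b)

print(sum_a, sum_b)    # fold_b's sum is larger only because it has more records
print(mean_a, mean_b)  # the per-record means are identical
```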

I’m not sure why you would use the sum in a CV scenario.

MattWenham · January 6, 2019, 8:52pm

Wouldn’t the sum of the errors just be the mean multiplied by the number of folds? Or am I missing something?
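Written out, with $e_i$ denoting the eval value reported for fold $i$ and $k$ the number of folds (notation introduced here for illustration):

```latex
\sum_{i=1}^{k} e_i \;=\; k \cdot \frac{1}{k} \sum_{i=1}^{k} e_i \;=\; k\,\bar{e}
```

For a fixed number of folds, the two aggregates differ only by the constant factor $k$, so they would rank models in the same order.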

brebbles · January 6, 2019, 11:28pm

Yes, I think you are right - I wasn’t thinking about the problem clearly. I thought it would be taking an average of averages, rather than an average of sums.
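The distinction can be checked on a toy case. The sketch below uses made-up per-record errors in two folds of unequal size and compares the average of per-fold sums, the average of per-fold averages, and the pooled per-record error.

```python
# Per-record errors in two folds of unequal size (made-up numbers).
folds = [
    [1.0, 1.0, 1.0, 1.0],  # 4 records
    [2.0, 2.0],            # 2 records
]

fold_sums = [sum(f) for f in folds]            # what a raw-sum eval reports per fold
fold_means = [sum(f) / len(f) for f in folds]  # what a per-record eval would report

avg_of_sums = sum(fold_sums) / len(folds)      # CV result with a raw-sum eval
avg_of_avgs = sum(fold_means) / len(folds)     # CV result with a per-record eval
pooled_mean = sum(fold_sums) / sum(len(f) for f in folds)  # overall error per record

print(avg_of_sums, avg_of_avgs, pooled_mean)
```

The average of sums is just the pooled total divided by the number of folds, so unequal fold sizes do not skew it. The average of averages is the one that drifts from the pooled per-record error when fold sizes differ.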