Cross Validation Error - average of each fold?


#1

I’ve been using a custom objective and eval metric (which happens to be a raw sum of errors, rather than a per-record average) with xgb.cv. After digging into some results, it seems that the eval error reported by CV is the simple average of the eval errors of the individual folds.

I appreciate that CV aims to put the same number of records in each fold, but when the fold sizes differ even slightly, this could let some (even minor) bias creep in.

Is there any way for the error metric in CV to be passed as the sum of the errors across the folds, rather than the average?


#2

I think you can manually edit this line to use np.sum instead of np.mean:


#3

Many thanks for the reply - I’ll look for the equivalent in the R package.


#4

Just wanted to note that taking the sum increases the chance that one fold will show a worse metric simply because it has more samples in it.

I’m not sure why you would use the sum in a CV scenario.


#5

Wouldn’t the sum of the errors just be the mean multiplied by the number of folds? Or am I missing something?
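A quick way to check this is a minimal NumPy sketch. The per-fold values below are made up for illustration and are not output from xgb.cv; each one stands for a fold's eval error that is already a raw sum over that fold's records.

```python
import numpy as np

# Hypothetical per-fold eval errors (each a raw sum over the fold's records).
fold_errors = np.array([12.0, 15.0, 9.0, 14.0, 10.0])

k = len(fold_errors)              # number of folds
mean_error = fold_errors.mean()   # simple average over folds, as CV reports it
total_error = fold_errors.sum()   # the sum across folds asked about above

# The two differ only by the constant factor k.
assert np.isclose(total_error, mean_error * k)
print(mean_error, total_error)
```

Since k is fixed for a given CV run, ranking models or boosting rounds by the mean gives the same order as ranking them by the sum.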


#6

Yes I think you are right - I wasn’t thinking about the problem clearly. I thought it would be taking an average of averages, rather than an average of sums.
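For anyone who lands here later, this is a small sketch of the distinction, using made-up per-record errors and deliberately unequal fold sizes (none of this comes from a real xgb.cv run). With a sum-based metric, the average over folds is just the grand total divided by the number of folds. With a per-record average metric, the average over folds can drift away from the pooled per-record mean.

```python
import numpy as np

# Hypothetical per-record errors for three folds of unequal size.
folds = [
    np.array([1.0, 2.0, 3.0, 4.0]),  # 4 records
    np.array([2.0, 2.0, 2.0]),       # 3 records
    np.array([5.0, 1.0, 6.0]),       # 3 records
]
all_records = np.concatenate(folds)

# Metric is a raw sum per fold: averaging over folds only rescales the total.
fold_sums = np.array([f.sum() for f in folds])
avg_of_sums = fold_sums.mean()
assert np.isclose(avg_of_sums * len(folds), all_records.sum())

# Metric is a per-record mean per fold: averaging over folds weights
# every fold equally, so it need not match the pooled mean.
fold_means = np.array([f.mean() for f in folds])
avg_of_averages = fold_means.mean()
pooled_mean = all_records.mean()

print(avg_of_sums, avg_of_averages, pooled_mean)
```

So the fold-size concern applies to the average-of-averages case, not to an average of per-fold sums.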