According to Ian Goodfellow, Yoshua Bengio, and Aaron Courville in their book Deep Learning: “…indeed, we can show how—in the case of a simple linear model with a quadratic error function and simple gradient descent—early stopping is equivalent to L2 regularization.”
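To make the quoted claim concrete, here is a minimal sketch for the linear, quadratic-error case only. It compares how much of the unregularized solution survives along one eigen-direction of the Hessian under early stopping versus under an L2 penalty. The function names are my own, the starting point is assumed to be zero, the penalty is assumed to be written as 0.5 * alpha * w**2, and the matching alpha = 1 / (steps * learning_rate) is the usual small-step approximation, so the two only agree roughly.

```python
def early_stopping_shrinkage(eigenvalue, learning_rate, steps):
    """Fraction of the unregularized optimum reached after `steps` plain
    gradient descent steps from zero, along one eigen-direction."""
    return 1.0 - (1.0 - learning_rate * eigenvalue) ** steps


def l2_shrinkage(eigenvalue, alpha):
    """Fraction of the unregularized optimum kept when the quadratic loss
    has the penalty 0.5 * alpha * w**2 added to it."""
    return eigenvalue / (eigenvalue + alpha)


learning_rate = 0.01
steps = 100
# Assumed matching penalty: fewer steps correspond to a larger alpha.
alpha = 1.0 / (steps * learning_rate)

for eigenvalue in (0.01, 0.1, 1.0):
    es = early_stopping_shrinkage(eigenvalue, learning_rate, steps)
    l2 = l2_shrinkage(eigenvalue, alpha)
    print(f"eigenvalue={eigenvalue}: early stopping={es:.4f}, L2={l2:.4f}")
```

The two columns are close for small eigenvalues and drift apart for larger ones, which is one reason to be careful about carrying the equivalence over to a nonlinear model.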
Okay, my model is not linear, but I think we are dealing with quadratic error and “simple” gradient descent (or with a Taylor series that approximates the loss as a quadratic inside the gradient descent formula).
Anyway, I have trouble with early stopping. First, some of my data has a time-stamp and is technically a time series. Therefore, cross-validation may not really be valid (although it may work). I use walk-forward validation when possible.
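For anyone unfamiliar with walk-forward validation, this is roughly what I mean, as a sketch rather than my actual code. It assumes the rows are already sorted by time stamp, oldest first, and uses an expanding training window. The function name and its parameters are made up for illustration.

```python
def walk_forward_splits(n_samples, n_folds, min_train):
    """Yield (train_indices, validation_indices) for an expanding window.

    Rows are assumed to be sorted by time stamp, oldest first, so every
    validation block lies strictly after the rows used for training.
    """
    fold_size = (n_samples - min_train) // n_folds
    if fold_size < 1:
        raise ValueError("not enough samples for the requested folds")
    for fold in range(n_folds):
        train_end = min_train + fold * fold_size
        valid_end = train_end + fold_size
        yield list(range(train_end)), list(range(train_end, valid_end))


for train_idx, valid_idx in walk_forward_splits(n_samples=10, n_folds=3, min_train=4):
    print(f"train 0..{train_idx[-1]}, validate {valid_idx[0]}..{valid_idx[-1]}")
```

Unlike shuffled cross-validation, no fold ever trains on rows that come after the ones it is validated on.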
Second, have you ever looked at the individual validations in a cross-validation to see where early stopping lands in each one? For my data, a mean or median gets printed out, but the early stopping point for each validation is highly variable, with a wide range. 10 samples (or 5) are probably unbiased, I guess. But with that variance, I wonder what would happen with different data.
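This is the kind of check I mean: look at the spread of the per-fold stopping points instead of only the mean or median. The numbers below are made up for illustration and are not my real results, and the function name is hypothetical.

```python
import statistics


def summarize_stopping_points(best_iterations):
    """Summarize the early stopping round found in each validation fold."""
    return {
        "mean": statistics.mean(best_iterations),
        "median": statistics.median(best_iterations),
        "min": min(best_iterations),
        "max": max(best_iterations),
        "stdev": statistics.stdev(best_iterations),  # sample standard deviation
    }


# Made-up stopping points for five folds, for illustration only.
example_best_iterations = [120, 340, 95, 410, 210]
summary = summarize_stopping_points(example_best_iterations)
print(summary)
```

When the range is as wide as the mean itself, the single number that gets printed out hides most of the uncertainty.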
Plus, when I move on to fit a model on all of the data for prediction, shouldn’t there be a correction to the early stopping point for the increased data (albeit a small one)?
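One correction I could imagine, purely as an assumption on my part and not an established result, is to scale the number of rounds by the ratio of the full data to the size of a training fold. With k folds, each training set holds (k - 1) / k of the rows, so the factor would be k / (k - 1). The function name is hypothetical.

```python
def scale_rounds_for_full_data(cv_rounds, n_folds):
    """Scale a cross-validated round count up to the full data set.

    Assumed rule of thumb (not a proven result): the useful number of
    rounds grows in proportion to the number of training rows, and each
    k-fold training set holds (k - 1) / k of the full data.
    """
    if n_folds < 2:
        raise ValueError("n_folds must be at least 2")
    return round(cv_rounds * n_folds / (n_folds - 1))


print(scale_rounds_for_full_data(cv_rounds=200, n_folds=5))
print(scale_rounds_for_full_data(cv_rounds=200, n_folds=10))
```

Given how much the per-fold stopping points vary, a correction of this size may well be lost in the noise.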
This is already getting long, so here is my question: can I use L2 regularization to replace early stopping (does the result quoted from Goodfellow et al. also hold for nonlinear models)? And do I want to use L2 regularization, gamma, or both?
And to show how limited my knowledge is: what kind of regularization does gamma produce? Is it L1, L2, or something else? A Lagrange multiplier, perhaps?
Thank you for any answer and for your tolerance of someone not formally trained in machine learning.