I know that early stopping is a form of regularization. I am currently using validation-based early stopping, and I believe it can stop training and save the model before it starts overfitting. In this case, do I also need to use other regularization, e.g. L1/L2, to improve performance?
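For concreteness, this is roughly what I mean by validation-based early stopping (a minimal sketch on a toy linear model; the `patience` value and the synthetic data are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples, 20 features, noisy linear target
X = rng.normal(size=(200, 20))
w_true = rng.normal(size=20)
y = X @ w_true + rng.normal(scale=2.0, size=200)

# Train/validation split
X_tr, X_val = X[:150], X[150:]
y_tr, y_val = y[:150], y[150:]

w = np.zeros(20)     # model weights
lr = 0.01            # learning rate
patience = 10        # epochs to wait for a validation improvement
best_val, best_w, wait = np.inf, w.copy(), 0

for epoch in range(5000):
    # One full-batch gradient step on the training squared error
    grad = X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
    w -= lr * grad

    # Early stopping: keep the weights with the best validation loss
    val_loss = np.mean((X_val @ w - y_val) ** 2)
    if val_loss < best_val:
        best_val, best_w, wait = val_loss, w.copy(), 0
    else:
        wait += 1
        if wait >= patience:
            break

w = best_w  # restore the best checkpoint
```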
If you have enough data, you can use separate training and test sets, where ["validation" is done only within the training set](http://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set). Then you could compare early stopping vs. regularization vs. both, by comparing performance on the common test set (a.k.a. holdout set; so the inter-method comparison is only *after* all methods are trained). – GeoMatt22 Oct 09 '16 at 05:13
2 Answers
Logically the answer is no: early stopping is an alternative to L2 regularization, used mainly because it is faster than tuning a regularization penalty, so it is not meant to be combined with a regularized cost function.
L1 regularization serves a different purpose (inducing sparsity), and I don't think early stopping is meant to be equivalent to it.
Early stopping is possibly not as precise as L2 regularization. It is hard to say definitively, but I have not read of a case where early stopping outperforms L2 regularization in accuracy. I think of it as a lower-quality form of L2 regularization, even if the difference may be very small on large datasets.
Assuming you use (well-tuned) L2 regularization, early stopping will not provide better accuracy.
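The intuition behind this can be made precise for linear least squares trained by gradient descent from a zero initialization (a standard textbook sketch, assuming full-rank $X$ with SVD $X = UDV^\top$ and a small enough learning rate $\eta$). After $t$ gradient steps on $\tfrac{1}{2}\|y - Xw\|^2$, the component of the least-squares solution along the $j$-th right singular vector is shrunk by the factor
$$1 - (1 - \eta d_j^2)^t,$$
whereas ridge regression with penalty $\lambda$ shrinks it by
$$\frac{d_j^2}{d_j^2 + \lambda}.$$
The two factors approximately match when $\lambda \approx 1/(\eta t)$, so stopping earlier (smaller $t$) plays the role of a larger L2 penalty.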

You don't specify which model you're applying early stopping vs. regularization to, but, in general, I think early stopping is not necessarily a substitute for l1 regularization.
Consider partial least squares (PLS) with early stopping, for example. PLS can be viewed as choosing, at each step, a descent direction out of a finite set of directions (one per variable). If you look at *The Elements of Statistical Learning*, Section 3.8, the effect of early stopping on PLS is similar to ridge (l2) regularization.
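A rough way to see this empirically (a sketch assuming scikit-learn; the data are synthetic) is to treat the number of PLS components as the stopping point: with few components the coefficients are strongly shrunk, and as components are added the fit approaches OLS, much like decreasing the ridge penalty:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=2.0, size=300)

# Fewer PLS components = earlier stopping = smaller coefficient norm
for k in (2, 5, 10):
    pls = PLSRegression(n_components=k).fit(X, y)
    print(f"PLS, {k} components: ||coef|| = {np.linalg.norm(pls.coef_):.3f}")

# Ridge shows the same trend as the penalty shrinks
for lam in (100.0, 10.0, 1.0):
    ridge = Ridge(alpha=lam).fit(X, y)
    print(f"Ridge, alpha={lam}: ||coef|| = {np.linalg.norm(ridge.coef_):.3f}")
```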
l1 regularization can lead to sparse models; l2 regularization, on its own, can't. If the true model is sparse, adding l1 regularization (e.g., using the Elastic Net) can improve prediction performance.
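To illustrate the sparsity point (a small sketch assuming scikit-learn; the sparse ground truth is synthetic):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
w_true = np.zeros(50)
w_true[:5] = rng.normal(size=5)    # only 5 of 50 features matter
y = X @ w_true + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)   # l1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # l2 penalty

print("nonzero coefficients (l1):", np.sum(lasso.coef_ != 0))  # few survive
print("nonzero coefficients (l2):", np.sum(ridge.coef_ != 0))  # all 50 stay nonzero
```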
For more complex regressors (e.g., neural nets), it is more difficult to describe the effect of early stopping (see *Regularization Versus Early Stopping: A Case Study With A Real System*).
Personally, I'd go with GeoMatt22's suggestion. You have (at least) three options: 1. early stopping, 2. l1 regularization, and 3. l2 regularization (and, of course, combinations). Cross-validation can be used to see what works best for your specific problem.
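For example (a sketch assuming scikit-learn, whose `SGDRegressor` supports both built-in validation-based early stopping and l1/l2 penalties; the penalty strengths below are illustrative, and in practice you would tune them by cross-validation before the final test-set comparison):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 30))
y = X @ rng.normal(size=30) + rng.normal(scale=3.0, size=1000)

# Common held-out test set, used only for the final comparison
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {
    "early stopping only": SGDRegressor(
        penalty=None, early_stopping=True,
        validation_fraction=0.2, n_iter_no_change=5, random_state=0),
    "l1 only": SGDRegressor(penalty="l1", alpha=1e-3, random_state=0),
    "l2 only": SGDRegressor(penalty="l2", alpha=1e-3, random_state=0),
    "l2 + early stopping": SGDRegressor(
        penalty="l2", alpha=1e-3, early_stopping=True,
        validation_fraction=0.2, n_iter_no_change=5, random_state=0),
}

for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    print(f"{name}: test R^2 = {model.score(X_te, y_te):.3f}")
```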
