There are so many regularization techniques that it's not practical to try out every combination:
- L1/L2
- max norm
- dropout
- early stopping
- ...
It seems that most people are happy with a combination of dropout and early stopping (a minimal sketch of that setup is below): are there cases where reaching for other techniques makes sense?
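To be concrete, this is the kind of default setup I mean, sketched in Keras; the layer sizes, dropout rate, and patience are arbitrary placeholders, not recommendations:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Toy data just so the snippet runs end to end.
x = np.random.rand(1000, 20).astype("float32")
y = np.random.randint(0, 2, size=(1000, 1))

model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),                   # dropout regularization
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Early stopping: halt training once validation loss stops improving.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)
model.fit(x, y, validation_split=0.2, epochs=100, callbacks=[early_stop])
```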
For example, if you want a sparse model, you can add in a bit of L1 regularization (also sketched below). Other than that, are there strong arguments in favor of sprinkling in other regularization techniques?
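As I understand it, adding that L1 term would just mean attaching a kernel_regularizer to the layers whose weights should become sparse (Keras again; the 1e-4 penalty strength is a placeholder):

```python
from tensorflow.keras import layers, regularizers

# Same hidden layer as above, but with an L1 penalty that pushes
# individual weights toward exactly zero (sparsity).
sparse_hidden = layers.Dense(
    64,
    activation="relu",
    kernel_regularizer=regularizers.l1(1e-4),  # placeholder strength
)
```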
I know about the no-free-lunch theorem: in theory I would have to try out all combinations of regularization techniques, but it's not worth the effort if it almost never yields a significant performance boost.