
Recent work such as Deep Double Descent suggests that overfitting is not really a problem for large models, even without any data augmentation or explicit regularization (L2 weight decay, dropout, etc.).

Edit: OK, maybe that is a wrong conclusion to draw from this work. It only shows that overfitting decreases (or generalization improves) as the model gets larger; it does not say that overfitting disappears or is no longer a problem.

However, we know from countless works that regularization methods (dropout, etc.) and data augmentation (e.g. SpecAugment for speech) still massively improve models.

I often read the statement that regularization and augmentation still improve generalization. My understanding so far was that overfitting is the inverse of generalization. So how does this fit together, if the models do not overfit anyway, even without augmentation or regularization?

Or is overfitting not the inverse of generalization? How is generalization defined then?

How would you measure overfitting and generalization? I would measure the gap between the loss on a held-out validation set and the loss on the training set (under the same conditions, of course, i.e. no dropout, etc.). The larger the gap, the more overfitting and the less generalization; the smaller the gap, the less overfitting and the more generalization.
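To make the measurement above concrete, here is a minimal sketch. The `model.loss(X, y)` interface is a hypothetical assumption (not from any specific framework); it is assumed to evaluate the average loss in inference mode, i.e. with dropout and augmentation disabled:

```python
import numpy as np

def generalization_gap(model, train_data, val_data):
    """Estimate overfitting as the validation/training loss gap.

    `model.loss(X, y)` is an assumed interface returning the average
    loss in inference mode (no dropout, no augmentation).
    """
    train_loss = model.loss(*train_data)
    val_loss = model.loss(*val_data)
    # A large positive gap suggests overfitting; a gap near zero
    # suggests good generalization (under this definition).
    return val_loss - train_loss
```

Note that this only measures the gap; a model can have a small gap and still generalize poorly in absolute terms if it underfits both sets.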

Edit: I just thought about synthetically generated datasets, where you can potentially create an infinite amount of data. In this setting, would the approach to measuring overfitting and generalization that I described still make sense? Can there be overfitting if every training example is different?
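For illustration, a toy sketch of such an "infinite" synthetic dataset, where every batch is freshly sampled (the task itself, noisy sums of random inputs, is made up for this example):

```python
import numpy as np

def synthetic_batches(rng, batch_size=32, dim=8, noise=0.1):
    """Yield an endless stream of freshly sampled (X, y) batches.

    Because no example is ever repeated, the training loss is itself
    an unbiased estimate of the expected (generalization) loss, so the
    train/validation gap should stay near zero. The model can still
    underfit, but it cannot memorize specific training examples.
    """
    while True:
        X = rng.standard_normal((batch_size, dim))
        y = X.sum(axis=1) + noise * rng.standard_normal(batch_size)
        yield X, y
```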

I just found the paper Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets, which is probably also relevant for the question.

Albert
    It's still easy to overfit big models: try a Kaggle competition with "just" 50k images, fine-tune a large model, and you easily get into overfitting territory. Do you mean with huge datasets (e.g. GPT-3-sized)? Even then, I'm not sure the double descent results really suggest there is no overfitting without regularization; isn't it more that there is a point where performance improves both as you go less complex and as you go more complex? There's a nice Twitter thread by Daniela Witten on this: https://twitter.com/daniela_witten/status/1292293122752262145?lang=en – Björn Oct 23 '21 at 12:06
    @Björn Thanks for the comment. I guess you are right, and I drew a wrong conclusion from this work. I rephrased my question, as I'm still not sure about the exact definitions or how to measure them. – Albert Oct 23 '21 at 13:14

0 Answers