Recent work such as Deep Double Descent suggests that overfitting is not really a problem with large models, even without any data augmentation or regularization (L2 weight penalty, dropout, and so on).
Edit: Ok, maybe this is the wrong conclusion to draw from this work. It only shows that overfitting decreases, or generalization improves, the larger the model is; it doesn't say that there is no overfitting anymore, or that overfitting is no longer a problem.
However, we know from countless works that regularization methods (dropout, etc.) and data augmentation (e.g. SpecAugment for speech) still massively improve models.
I often read the statement that regularization and augmentation still improve generalization. My understanding so far was that overfitting is the inverse of generalization. So how does this fit together, if the models do not overfit anyway, even without augmentation or regularization?
Or is overfitting not the inverse of generalization? How is generalization defined then?
How would you measure overfitting and generalization? I would measure the difference between the loss on some held-out validation set and the loss on the training set (under the same conditions of course, i.e. no dropout, etc.). The larger the gap, the more overfitting and the less generalization; the smaller the gap, the less overfitting and the more generalization.
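For concreteness, this is roughly how I would compute that gap, e.g. in PyTorch (a minimal sketch; `model`, the data loaders and `loss_fn` are assumed to exist, and the names are just placeholders, not from any particular codebase):

```python
import torch

def mean_loss(model, loader, loss_fn, device="cpu"):
    """Average loss over a data loader, with dropout etc. disabled (eval mode)."""
    model.eval()  # same conditions for train and validation: no dropout, fixed batch-norm stats
    total, count = 0.0, 0
    with torch.no_grad():
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.to(device)
            total += loss_fn(model(inputs), targets).item() * targets.size(0)
            count += targets.size(0)
    return total / count

def generalization_gap(model, train_loader, valid_loader, loss_fn, device="cpu"):
    """Validation loss minus training loss; larger gap = more overfitting."""
    train_loss = mean_loss(model, train_loader, loss_fn, device)
    valid_loss = mean_loss(model, valid_loader, loss_fn, device)
    return valid_loss - train_loss
```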
Edit: I just thought about synthetically generated datasets, where you can potentially create an infinite amount of data. In this setting, would the approach to measuring overfitting or generalization that I described still make sense? Can there be overfitting if every training example is different?
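To illustrate the setting I mean: something like the following toy sketch, where every batch is freshly sampled from a synthetic rule and no example is ever reused, so the training data is effectively the data distribution itself (the rule and all names are made up for illustration):

```python
import torch

def sample_batch(batch_size=32, dim=10):
    """Freshly sampled synthetic batch; every example is new."""
    x = torch.randn(batch_size, dim)
    y = (x.sum(dim=1, keepdim=True) > 0).float()  # arbitrary synthetic labeling rule
    return x, y

def train_steps(model, loss_fn, optimizer, num_steps=1000):
    """Training loop that never sees the same example twice."""
    model.train()
    for _ in range(num_steps):
        x, y = sample_batch()
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
```

In this case the training loss and the loss on freshly sampled held-out data estimate the same quantity, so I would expect the gap from above to be near zero in expectation; the question is whether "overfitting" is still the right concept here.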
I just found the paper Grokking: Generalization Beyond Overfitting On Small Algorithmic Data Sets, which is probably also relevant to this question.