9

I often read that training "overparameterized" networks works well in practice, and that perhaps no one knows exactly why yet. However, when I look at the number of samples and parameters that many NNs use, they are still fit with more data points than parameters.

Consider, for example, the recently announced GPT-3 language model with as many as 175 billion parameters. Its authors never even tried fitting a model with more parameters than training tokens (300 billion tokens).

Would one consider this neural net overparameterized?

If so, what is the criterion, heuristic, or rule of thumb that would merit that designation for a model? Is it, for example (a code sketch of the first two checks follows the list):

  • the ratio of the number of model parameters $p$ to the number of data points $n$
  • the fact that a model interpolates the training data (the model achieves a training loss of 0)
  • all / any of the above
  • any other measures?
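
To make the first two bullets concrete, here is a minimal sketch of what those checks could look like (PyTorch assumed; the tiny MLP, the loss tolerance, and the closing GPT-3 arithmetic are purely illustrative, not anything official):

```python
import torch
import torch.nn as nn

def param_to_sample_ratio(model: nn.Module, n_samples: int) -> float:
    """Ratio p/n of trainable parameters to training samples."""
    p = sum(t.numel() for t in model.parameters() if t.requires_grad)
    return p / n_samples

def interpolates(model: nn.Module, X: torch.Tensor, y: torch.Tensor,
                 tol: float = 1e-3) -> bool:
    """True if the model (effectively) achieves zero training loss."""
    with torch.no_grad():
        loss = nn.functional.mse_loss(model(X), y)
    return loss.item() < tol

# Toy regression setup: 50 points, a model with ~770 parameters.
X = torch.randn(50, 10)
y = torch.randn(50, 1)
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))

print("p/n =", param_to_sample_ratio(model, X.shape[0]))   # > 1 for this toy model
print("interpolates:", interpolates(model, X, y))           # False before training

# For comparison, GPT-3's headline numbers give a ratio well below 1:
print("GPT-3 p/n ≈", 175e9 / 300e9)                         # ~0.58 parameters per token
```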


Josh

1 Answer

6

An "overparameterized" model has more parameters than there were data points in the training set. More formally, it is not only about the number of parameters, but about the capacity to memorize the data, where the number of parameters is just a cheap proxy for measuring it.

You are correct that even huge models like GPT-3 are much smaller than would be needed to fully memorize their training data. Overparameterized models are achievable on small datasets, however. For example, Neal et al. (2018) trained such a model on a subsample of 100 examples from MNIST. These are not something you would want to use on real tasks, since they would be impractical and require enormous computational power.
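
As a rough sketch of what such an interpolating, overparameterized model looks like (PyTorch assumed; random data stands in for the 100-example MNIST subsample, and the architecture and step count are arbitrary choices of mine, not from the paper):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, k = 100, 784, 10                        # 100 samples of MNIST-sized inputs, 10 classes
X = torch.randn(n, d)
y = torch.randint(0, k, (n,))

model = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, k))
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params}, samples: {n}")   # ~100k parameters vs. 100 samples

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(1000):                            # full-batch training until the data is memorized
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(X), y)
    loss.backward()
    opt.step()

print("final training loss:", loss.item())       # close to 0: the model interpolates
```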

Check this answer for some related references.

Tim
    Agree. I think capacity to memorize data is a better definition than # of parameters. Technically, a model with one parameter can completely memorize the data by encoding it in its decimal expansion. See also https://arxiv.org/pdf/1904.12320.pdf – Cam.Davidson.Pilon May 30 '20 at 16:38
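
A toy illustration of the comment's point (my own construction for illustration, not the exact scheme from the linked paper): a "one-parameter model" that memorizes a label sequence by packing it into the digits of a single number, held here as a Python integer to sidestep floating-point precision.

```python
def fit_single_parameter(labels):
    """'Train' by packing the label digits (0-9) into one integer parameter."""
    return int("".join(str(d) for d in labels))

def predict(theta, n_labels, i):
    """Recover the i-th training label from the parameter's digits."""
    digits = str(theta).zfill(n_labels)          # restore any leading zeros
    return int(digits[i])

y_train = [3, 1, 4, 1, 5, 9, 2, 6]
theta = fit_single_parameter(y_train)
recovered = [predict(theta, len(y_train), i) for i in range(len(y_train))]
assert recovered == y_train                      # perfect "training accuracy" with one parameter
print("theta =", theta)                          # 31415926
```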