5

What is the state-of-the-art knowledge on how generalization of interpolating models behaves with respect to the number of parameters?

Does it look like this: [figure: landscape of generalization; the test loss is bad due to overfitting, but then decreases again]

(Picture from Mikhail Belkin's talk on https://www.youtube.com/watch?v=OBCciGnOJVs&t=1185s)

In other words, can overfitting always be overcome by adding more parameters?

Let's say we don't use regularization, but train only for some natural-looking interpolation loss. I'm mainly interested in what is true for neural networks with several layers, but anything goes.

  • @LBogaardt that's not what Belkin et al posit in their double descent conjecture – Firebug Nov 13 '20 at 22:20
  • Okay, then I conjecture that, following the 2nd descent, there will be a 2nd ascent. Note that the x-axis (# parameters) will necessarily end once the df's of the model equal the # datapoints. But I guess this is rarely reached in deep learning. – LBogaardt Nov 13 '20 at 22:30
  • Perhaps related: https://stats.stackexchange.com/a/490912/49090 – LBogaardt Nov 13 '20 at 22:49
  • @LBogaardt that's not how conjectures work. [Alex-Net has over 60 million parameters, and was trained on ImageNet, with 1.2 million images](https://papers.nips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf). – Firebug Nov 13 '20 at 22:51
  • Lol, I know :P But without reading the paper, I am assuming those 60M parameters are heavily regularised such that the # df is smaller than 1.2M (I'm unsure if 1 image can even be counted as 1 datapoint). – LBogaardt Nov 13 '20 at 23:03
  • @LBogaardt: I think we are talking #degrees_of_freedom = #datapoints * #classes (in a classification problem) for the interpolation threshold in the picture above, but I'm not sure. – Daniel Paleka Nov 13 '20 at 23:06

2 Answers

4

According to recent work on the Double Descent phenomenon, especially Belkin's, yes, you may be able to fix overfitting by adding more parameters.

That happens because, according to their hypothesis, when you have just enough parameters to interpolate the training data, the solution space becomes constrained, precluding you from reaching a lower-norm solution.

Adding more parameters (in the limit, infinitely many) "opens up" the solution space again, allowing for smaller-norm solutions that still interpolate.

The interesting part is that the interpolating regime often achieves a smaller test loss than the non-interpolating regime.

That helps to explain how absurdly over-parametrized deep networks work in practice: stochastic optimization is inherently regularized, and SGD converges to minimum-norm solutions in the over-parametrized regime.
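
A minimal numerical sketch of this argument, assuming random ReLU features and a minimum-norm least-squares fit (`np.linalg.lstsq` returns the minimum-norm solution once the system is underdetermined); the target function, noise level, training-set size, and feature counts are arbitrary illustration choices. The test error typically peaks near the interpolation threshold (p ≈ n) and then descends again as p grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, rng):
    x = rng.uniform(-1.0, 1.0, size=(n, 1))
    y = np.sin(3.0 * x[:, 0]) + 0.1 * rng.standard_normal(n)  # noisy 1-D target
    return x, y

def relu_features(x, W, b):
    # Random ReLU features: phi(x) = max(0, x @ W + b)
    return np.maximum(0.0, x @ W + b)

n_train = 40
x_train, y_train = make_data(n_train, rng)
x_test, y_test = make_data(2000, rng)

for p in [5, 10, 20, 35, 40, 45, 80, 160, 640, 2560]:  # number of random features
    test_mses = []
    for _ in range(20):                                 # average over feature draws
        W = rng.standard_normal((1, p))
        b = rng.standard_normal(p)
        Phi_tr = relu_features(x_train, W, b)
        Phi_te = relu_features(x_test, W, b)
        # SVD-based lstsq returns the minimum-norm solution once p > n_train,
        # i.e. the interpolating solution with the smallest coefficient norm.
        coef, *_ = np.linalg.lstsq(Phi_tr, y_train, rcond=None)
        test_mses.append(np.mean((Phi_te @ coef - y_test) ** 2))
    print(f"p={p:5d}  mean test MSE={np.mean(test_mses):.4f}")
```

Here the second descent comes purely from the implicit minimum-norm bias of the pseudoinverse, playing the role the answer attributes to the implicit regularization of SGD in deep networks.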

Firebug
  • I will accept this, as this is the better answer. But I would like to see some work which checks this in practice - Belkin's work is great, but from what I've read, it's not talking about deep networks or other state-of-the-art models. – Daniel Paleka Nov 14 '20 at 19:17
0

Adding parameters will lead to more overfitting. The more parameters, the more models you can represent. The more models, the more likely you'll find one that fits your training data exactly.

To avoid overfitting, choose the simplest model that does not underfit, and use cross-validation to make sure.
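
A minimal sketch of the classical picture described above, assuming a 1-D linear target with Gaussian noise and plain least-squares polynomial fits (the data, noise level, and degrees are arbitrary choices): the training error keeps shrinking as the degree grows, while the test error eventually blows up.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)
n = 20
x_train = np.linspace(-1.0, 1.0, n)
y_train = 2.0 * x_train + 0.2 * rng.standard_normal(n)  # noisy straight line
x_test = np.linspace(-1.0, 1.0, 500)
y_test = 2.0 * x_test                                    # noise-free ground truth

for degree in [1, 3, 5, 10, 15, 19]:            # degree 19 interpolates the 20 points
    coefs = P.polyfit(x_train, y_train, degree)  # ordinary least squares
    train_mse = np.mean((P.polyval(x_train, coefs) - y_train) ** 2)
    test_mse = np.mean((P.polyval(x_test, coefs) - y_test) ** 2)
    print(f"degree={degree:2d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```

Holding out data, as in cross-validation, would likely select one of the low-degree fits here, in line with the advice above.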

  • Can you add references? – Daniel Paleka Nov 13 '20 at 22:38
  • Also, "The more models, the more likely you'll find one that fits your training data exactly" -- can you formalize this in any sensible way? – Daniel Paleka Nov 13 '20 at 22:38
  • Though I am unfamiliar with deep learning etc., I will attempt to paraphrase the OP's question: can we find a local minimum during cross-validation? – LBogaardt Nov 13 '20 at 22:39
  • I suggest the most recent works on Double Descent, particularly Belkin's. Stochastic optimization is intrinsically regularized in a way that, having more parameters than necessary to interpolate data, can be beneficial. – Firebug Nov 13 '20 at 22:53
  • The typical example is that a polynomial of sufficiently high degree can exactly fit any data set. For example, take 1000 points on a straight line and add a tiny bit of noise. You can fit a polynomial of degree 999 that passes through the data exactly, but it will not be very accurate on test data. "Double descent" does not fix that. If you have more parameters, you'll need either more data or some regularization to avoid overfitting. – Robby the Belgian Nov 13 '20 at 23:03
  • @Firebug: can you post some authoritative source on the inherent regularization of SGD (and the interplay of that with the generalization curves) as an answer? – Daniel Paleka Nov 13 '20 at 23:04