
I stumbled upon the paper [Reconciling modern machine learning practice and the bias-variance trade-off](https://arxiv.org/abs/1812.11118) and do not completely understand how the authors justify the double descent risk curve (see below) described in it.

[Figure: the double descent risk curve from the paper]

In the introduction they say:

> By considering larger function classes, which contain more candidate predictors compatible with the data, we are able to find interpolating functions that have smaller norm and are thus "simpler". Thus increasing function class capacity improves performance of classifiers.

From this I can understand why the test risk decreases as a function of the function class capacity.

What I don't understand with this justification, however, is why the test risk first increases up to the interpolation point and only then decreases again. And why is the interpolation point exactly where the number of data points $n$ equals the number of function parameters $N$?
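
To make this concrete, here is a small simulation I put together (my own sketch with random ReLU features and a minimum-norm least-squares fit, not the exact setup from the paper) that seems to reproduce the shape of the curve, with the peak near $N = n$:

```python
import numpy as np

rng = np.random.default_rng(0)

n_train, n_test, d = 40, 1000, 10
X_tr = rng.normal(size=(n_train, d))
X_te = rng.normal(size=(n_test, d))
w_true = rng.normal(size=d)
y_tr = X_tr @ w_true + 0.5 * rng.normal(size=n_train)  # noisy linear target
y_te = X_te @ w_true

def test_mse(N, trials=20):
    """Average test MSE of a min-norm least-squares fit on N random ReLU features."""
    errs = []
    for _ in range(trials):
        V = rng.normal(size=(d, N))            # random feature directions
        F_tr = np.maximum(X_tr @ V, 0.0)       # ReLU features, shape (n_train, N)
        F_te = np.maximum(X_te @ V, 0.0)
        beta = np.linalg.pinv(F_tr) @ y_tr     # minimum-norm least-squares solution
        errs.append(np.mean((F_te @ beta - y_te) ** 2))
    return float(np.mean(errs))

for N in [5, 10, 20, 35, 40, 45, 60, 100, 200, 400]:
    print(f"N = {N:4d}   test MSE = {test_mse(N):10.3f}")
# The test error typically rises as N approaches n_train = 40 (the interpolation
# threshold) and falls again for N >> n_train, tracing out the double descent shape.
```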

I would be happy if someone could help me out here.

Samuel
  • [Other posts mentioning this paper](https://stats.stackexchange.com/search?q=https%3A%2F%2Farxiv.org%2Fabs%2F1812.11118), but no duplicates. – Stephan Kolassa Jun 23 '21 at 12:07
  • This phenomenon is closely related to how deep learning models achieve almost zero training loss, i.e. interpolate the training set, yet do not overfit, which runs counter to well-known statistical learning theory; and of course the classical bias-variance trade-off is not resolved. See the recent exposition from Oriol Vinyals' team, [Understanding Deep Learning (Still) Requires Rethinking Generalization](https://cacm.acm.org/magazines/2021/3/250713-understanding-deep-learning-still-requires-rethinking-generalization/fulltext). – msuzen Jun 23 '21 at 20:54

1 Answer


The main point of Belkin's double descent is that, at the interpolation threshold, i.e. the smallest model capacity at which the training data can be fit exactly, the number of solutions is very constrained. The model has to "stretch" to interpolate the data with such limited capacity.

When you increase capacity beyond that point, the space of interpolating solutions opens up, allowing optimization to reach lower-norm interpolating solutions. These tend to generalize better, which is why you get the second descent in test risk.
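
As a rough illustration (a minimal sketch with random ReLU features and a pseudoinverse fit, in the spirit of the paper's random-feature experiments but not their exact setup), you can watch the norm of the minimum-norm interpolating solution shrink once capacity grows past the threshold $N = n$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 40, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)  # noisy training targets

for N in [40, 50, 80, 160, 320, 640]:                  # N >= n: interpolating regime
    V = rng.normal(size=(d, N))
    F = np.maximum(X @ V, 0.0)                          # random ReLU features, n x N
    beta = np.linalg.pinv(F) @ y                        # minimum-norm interpolating fit
    train_mse = np.mean((F @ beta - y) ** 2)            # generically ~0: the fit interpolates
    print(f"N = {N:4d}   ||beta|| = {np.linalg.norm(beta):10.3f}   train MSE = {train_mse:.1e}")
# The norm of the interpolating solution is typically largest right at N = n and
# shrinks as capacity grows: extra parameters let the fit interpolate "gently",
# which is the mechanism behind the second descent.
```

The pseudoinverse picks the minimum-norm solution among all interpolators, which is exactly the quantity the argument above is about.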

Firebug
  • What is worth highlighting is that this is just a guess at an explanation, a hypothesis that cannot really be proven or disproven. This is how the authors try to explain the phenomenon. – Tim Jun 23 '21 at 12:13
  • @Tim It's a formalized hypothesis. They gave multiple experiments corroborating the finding. – Firebug Jun 23 '21 at 12:16
  • @Firebug Thank you for answering. However, your answer raises some more questions. By "number of solutions", do you mean predictors that fit the training data exactly? If so, why is the number of solutions so constrained exactly at this point? And why is it that the training data can be fitted exactly at this point? Why is it that lower-norm interpolating solutions can be found beyond the interpolation threshold? – Samuel Jun 23 '21 at 18:39
  • @Tim Can you think of promising approaches to prove or disprove the hypothesis? Wouldn't it be possible to test it on very large networks and thereby support it if it is true? – Samuel Jun 23 '21 at 18:42
  • @Samuel this is what the paper you refer to & [similar ones](https://stats.stackexchange.com/a/444014/35989) try to achieve. – Tim Jun 23 '21 at 18:56
  • @Samuel the "number of solutions" is the number of parameter sets, not predictors, that achieve a given loss. So the number of solutions at the interpolation threshold is the number of networks that interpolate the training data. This set is much smaller than the set after this threshold is surpassed. – Firebug Jun 24 '21 at 10:57
  • @Samuel "why is it that the training data can be fitted exactly at this point?" Because that's the definition of the interpolation threshold: the point of least capacity at which the training data can be interpolated. At higher capacity, the network parameters have more leeway in how they achieve this interpolation, leading to lower-norm solutions. – Firebug Jun 24 '21 at 10:58