Known bounds on the Vapnik–Chervonenkis (VC) dimension of neural networks range from $O(E)$ to $O(E^2)$, with $O(E^2V^2)$ in the worst case, where $E$ is the number of edges and $V$ is the number of nodes. The number of training samples needed for a strong generalization guarantee grows linearly with the VC dimension.
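For reference, this linear dependence comes from the standard PAC sample-complexity bound; a textbook form (the symbols $\epsilon$, $\delta$, and $c$ are generic and not tied to any particular network) is

$$m \;\ge\; \frac{c}{\epsilon}\left(d_{\mathrm{VC}}\,\log\frac{1}{\epsilon} + \log\frac{1}{\delta}\right),$$

where $m$ is the number of training samples, $\epsilon$ the allowed generalization error, and $\delta$ the failure probability. For fixed $\epsilon$ and $\delta$, the required $m$ scales linearly with $d_{\mathrm{VC}}$.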
This means that for a network with billions of edges, as in successful deep learning models, the training set would need billions of samples in the best case, up to quadrillions in the worst case. The largest training sets currently have about a hundred billion samples. Since there is not enough training data, it is unlikely that deep learning models are generalizing. Instead, they are overfitting the training data. This means the models will not perform well on data that is dissimilar to the training data, which is an undesirable property for machine learning.
Given the inability of deep learning to generalize according to this VC-dimension analysis, why are deep learning results so hyped? Merely having high accuracy on some dataset does not mean much in itself. Is there something special about deep learning architectures that reduces the VC dimension significantly?
If you do not think the VC-dimension analysis is relevant, please provide evidence or an explanation that deep learning is generalizing and not overfitting. That is, does it have good recall AND precision, or just good recall? 100% recall is trivial to achieve, as is 100% precision; getting both close to 100% is very difficult. A quick illustration with a toy classifier is sketched below.
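Here is a minimal sketch of that point, using made-up labels and two hypothetical trivial classifiers (nothing here comes from any real model):

```python
# Toy illustration: recall or precision alone is easy to saturate,
# but pushing both toward 100% simultaneously is the hard part.

def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

y_true = [1, 0, 0, 1, 0, 0, 0, 1]                     # hypothetical ground-truth labels

say_yes_to_everything = [1] * len(y_true)             # trivially gets 100% recall
say_yes_once_when_sure = [1, 0, 0, 0, 0, 0, 0, 0]     # trivially gets 100% precision

print(precision_recall(y_true, say_yes_to_everything))   # (0.375, 1.0)
print(precision_recall(y_true, say_yes_once_when_sure))  # (1.0, 0.33...)
```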
As a counterpoint, here is evidence that deep learning is overfitting. An overfit model is easy to fool, since it has incorporated deterministic and stochastic noise. See the following image for an example of overfitting.
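In case the image does not come through, here is a minimal numpy sketch of the same phenomenon; the data, noise level, and polynomial degrees are made up purely for illustration, with the over-parameterized fit standing in for an over-parameterized network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a simple linear trend plus noise.
x = np.linspace(0, 1, 12)
y = 2 * x + rng.normal(scale=0.2, size=x.size)

# Fit an under- and an over-parameterized polynomial to the same points.
low = np.polynomial.Polynomial.fit(x, y, deg=1)
high = np.polynomial.Polynomial.fit(x, y, deg=11)   # enough capacity to memorize the noise

# Fresh points drawn from the same process expose the difference.
x_new = np.linspace(0, 1, 100)
y_new = 2 * x_new + rng.normal(scale=0.2, size=x_new.size)

print("train error, deg 1 :", np.mean((low(x) - y) ** 2))
print("train error, deg 11:", np.mean((high(x) - y) ** 2))          # near zero: fits the noise
print("test  error, deg 1 :", np.mean((low(x_new) - y_new) ** 2))
print("test  error, deg 11:", np.mean((high(x_new) - y_new) ** 2))  # typically much larger
```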
Also, see the lower-ranked answers to this question to understand the problems with an overfit model despite good accuracy on test data.
Some have responded that regularization solves the problem of a large VC dimension. See this question for further discussion.

Comments are not for extended discussion; this conversation has been moved to chat. – D.W. – 2017-05-15T03:37:15.120

I don't think questions about why something is "hyped" are good. The answer is "because people". People take interest in things for a plethora of reasons, including marketing. – luk32 – 2017-05-16T13:49:26.787
Deep learning works in practice. It might be overfitting. It might be completely unjustified. It might be learning secrets of the universe from an eldritch deity. But the hype is coming from practitioners who are suddenly able to write 30 lines of code and teach a camera to scan signatures and match them with stored ones to validate bank transactions. Or tag unknown people in photographs. Etc. Maybe you've heard the line "it's not an insult if it's true"? Well, it's not hype if it works. There are lots of problems it doesn't work on, and excessive popular hype. But it works in real-life applications. – Stella Biderman – 2019-06-27T15:51:43.697
@StellaBiderman Ease of tooling around standard machine learning techniques is good and all. But the interest seems to have more to do with the supposed learning ability of DNNs that perhaps rivals human ability, which appears to be overhype given the VC analysis of the model. Such a high VC dimension implies the models will not generalize, and are instead memorizing the datasets, making them very brittle. All the adversarial example papers appear to demonstrate this point. – yters – 2019-06-27T15:57:45.773
@yters The fact that adversarial examples exist is somewhat to the side of generalization. I can train a neural network on 10k data points and then test it on 100k data points and it'll do amazingly. That is generalization. The fact that we can deliberately concoct adversarial examples doesn't mean that the network doesn't work, any more than the fact that we can concoct optical illusions means that your vision doesn't work. Your entire question is predicated on the idea that neural networks don't generalize. But it's an empirical fact that they do. – Stella Biderman – 2019-06-27T20:07:08.547
@StellaBiderman You are conflating good out-of-sample accuracy with generalization. Generalization implies good out-of-sample accuracy, but good out-of-sample accuracy does not imply generalization. One example I've heard about is when a neural network "learned" to distinguish criminals from non-criminals by memorizing the white background of the prison photos. While the NN had great accuracy on the out-of-sample dataset, it did not learn any general principle about criminal facial features. This is why we can concoct such absurd adversarial examples for NNs: the NN is identifying 'cat' with incidental features. – yters – 2019-06-30T20:39:08.933