This question is about empirical (real-life) usage of neural networks. In an ML class that I am taking right now, the instructor went through the basics of neural networks, from the basic perceptron, through a basic single-layer feedforward network, to one with a hidden layer, etc.
One thing that stood out to me was the Universal Approximation Theorem. Cybenko showed in 1989 that any continuous function on a compact set can be approximated to arbitrary accuracy by a feedforward network with a single hidden layer of sigmoidal units (see Approximation by Superpositions of a Sigmoidal Function, [Cybenko, 1989]). Of course, the paper doesn't say how many units that hidden layer needs, or anything about the learnability of the parameters.
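To state what I mean (paraphrasing the theorem in my own notation, not Cybenko's exact symbols): the theorem says that finite sums of the form

$$ G(x) = \sum_{j=1}^{N} \alpha_j \, \sigma\!\left(w_j^\top x + b_j\right) $$

are dense in the space of continuous functions on the unit cube, i.e. a single hidden layer of $N$ sigmoidal units (with the $\alpha_j$, $w_j$, $b_j$ as free parameters) can get within any $\varepsilon$ of the target function for some sufficiently large $N$.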
I was reminded of the Gizmodo post Google Street View Uses An Insane Neural Network To ID House Numbers, which describes a network with 11 hidden layers that Google used to identify house numbers. In fact, the underlying paper, Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks [Goodfellow et al., 2013], reports that the deepest network achieved the highest accuracy, with accuracy increasing with the depth of the network.
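Just to make "stacking more layers" concrete for myself, here is a toy sketch (assuming PyTorch; this is *not* the architecture from Goodfellow et al.'s paper) where depth is simply a knob you can turn up:

    # Toy sketch, not the paper's architecture: "stacking more layers"
    # just means appending more blocks to the sequential model.
    import torch
    import torch.nn as nn

    def make_convnet(num_hidden_layers: int, num_classes: int = 10) -> nn.Sequential:
        """Build a small CNN whose depth is a hyperparameter."""
        layers = [nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU()]
        for _ in range(num_hidden_layers - 1):
            layers += [nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU()]
        layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes)]
        return nn.Sequential(*layers)

    shallow = make_convnet(num_hidden_layers=2)
    deep = make_convnet(num_hidden_layers=11)   # same idea, just more depth
    x = torch.randn(1, 3, 64, 64)               # dummy RGB image batch
    print(shallow(x).shape, deep(x).shape)      # both: torch.Size([1, 10])

Empirically, it seems that the deeper version of something like this tends to win, which is exactly what puzzles me.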
Why is this the case? Why does "stacking more layers" help? Doesn't the theorem already say that a single hidden layer is enough?