This question is about empirical (real-life) usage of neural networks. In an ML class that I am taking right now, the instructor went through the basics of neural networks, from the basic perceptron, through a basic single-layer feedforward network, to one with a hidden layer, etc.
One thing that stood out to me was the Universal Approximation Theorem. Cybenko showed in 1989 that any continuous function on a compact set can be approximated to arbitrary accuracy by a feedforward network with a single hidden layer of sigmoidal units (see Approximation by Superpositions of a Sigmoidal Function, [Cybenko, 1989]). Of course, the paper doesn't say how many units that hidden layer needs, or anything about the learnability of the parameters.
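To state what I mean (paraphrasing the theorem in my own notation, not Cybenko's exact symbols): the theorem says that finite sums of the form

$$ G(x) = \sum_{j=1}^{N} \alpha_j \, \sigma\!\left(w_j^\top x + b_j\right) $$

are dense in the space of continuous functions on the unit cube, i.e. a single hidden layer of $N$ sigmoidal units (with the $\alpha_j$, $w_j$, $b_j$ as free parameters) can get within any $\varepsilon$ of the target function for some sufficiently large $N$.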
I was reminded of the Gizmodo post Google Street View Uses An Insane Neural Network To ID House Numbers, which describes a network with 11 hidden layers that Google used to identify house numbers. In fact, the underlying paper, Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks [Goodfellow et al., 2013], reports that the deepest network achieved the highest accuracy, with accuracy increasing with the depth of the network.
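Just to make "stacking more layers" concrete for myself, here is a toy sketch (assuming PyTorch; this is *not* the architecture from Goodfellow et al.'s paper) where depth is simply a knob you can turn up:

    # Toy sketch, not the paper's architecture: "stacking more layers"
    # just means appending more blocks to the sequential model.
    import torch
    import torch.nn as nn

    def make_convnet(num_hidden_layers: int, num_classes: int = 10) -> nn.Sequential:
        """Build a small CNN whose depth is a hyperparameter."""
        layers = [nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU()]
        for _ in range(num_hidden_layers - 1):
            layers += [nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU()]
        layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes)]
        return nn.Sequential(*layers)

    shallow = make_convnet(num_hidden_layers=2)
    deep = make_convnet(num_hidden_layers=11)   # same idea, just more depth
    x = torch.randn(1, 3, 64, 64)               # dummy RGB image batch
    print(shallow(x).shape, deep(x).shape)      # both: torch.Size([1, 10])

Empirically, it seems that the deeper version of something like this tends to win, which is exactly what puzzles me.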
Why is this the case? Why does "stacking more layers" help? Doesn't the theorem already say that a single hidden layer is enough?