Basically the above. To clarify, I'm referring to feedforward networks (would recurrent networks technically have infinite depth?). A reference would be appreciated, as I'm going to use this for sixth form work.
3 Answers
I think that if we do not consider computational constraints, there is no limit on the depth. In other words, we can define and build a neural network with a very complex structure, but we may not be able to train it.
In the real world, the depth will depend on your data size (if you have huge or effectively unlimited data, say, cat pictures on the Internet, you may need a very deep model), on your hardware and software (if you are using a GPU, training a deep model may be much faster than on a CPU), and especially on the number of hidden units per layer. The number of parameters of the model depends on both the depth and the "width" of the network. Roughly speaking, a network with 10,000 layers of 2 units each and a network with 1,000 layers of 20 units each would have "similar" complexity.
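To make that depth/width trade-off concrete, here is a minimal sketch (plain Python; the helper `count_params` and the input/output sizes are my own choices for illustration) that counts the weights and biases of a fully-connected stack:

```python
def count_params(depth: int, width: int, n_in: int = 1, n_out: int = 1) -> int:
    """Count weights + biases of `depth` hidden layers, each with `width` units."""
    sizes = [n_in] + [width] * depth + [n_out]
    # A layer of size b fed by a layer of size a contributes a*b weights + b biases.
    return sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))

print(count_params(depth=10_000, width=2))   # very deep, very narrow
print(count_params(depth=1_000, width=20))   # 10x shallower, 10x wider
```

Both example configurations have the same total number of hidden units (20,000), although the wider one carries more weights per layer; plugging in your own numbers shows how both dimensions drive the total.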
If we consider what people are using, this link says
even beyond 1200 layers and still yield meaningful improvements in test error

In this paper, the authors demonstrate that if you work to find good initializations, you can train a 10,000-layer CNN.
Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S. Schoenholz, Jeffrey Pennington. "Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks." 2018.
In recent years, state-of-the-art methods in computer vision have utilized increasingly deep convolutional neural network architectures (CNNs), with some of the most successful models employing hundreds or even thousands of layers. A variety of pathologies such as vanishing/exploding gradients make training such deep networks challenging. While residual connections and batch normalization do enable training at these depths, it has remained unclear whether such specialized architecture designs are truly necessary to train deep CNNs. In this work, we demonstrate that it is possible to train vanilla CNNs with ten thousand layers or more simply by using an appropriate initialization scheme. We derive this initialization scheme theoretically by developing a mean field theory for signal propagation and by characterizing the conditions for dynamical isometry, the equilibration of singular values of the input-output Jacobian matrix. These conditions require that the convolution operator be an orthogonal transformation in the sense that it is norm-preserving. We present an algorithm for generating such random initial orthogonal convolution kernels and demonstrate empirically that they enable efficient training of extremely deep architectures. [Emphasis added]
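The paper's construction is a delta-orthogonal initialization for convolution kernels; as a loose, simplified illustration of the norm-preserving idea only (a dense-layer analogue sketched in NumPy, not the paper's algorithm), one could draw random orthogonal weight matrices like this:

```python
import numpy as np

def random_orthogonal(n: int, rng: np.random.Generator) -> np.ndarray:
    """Draw a random n x n orthogonal matrix via QR decomposition."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))   # sign fix makes the distribution uniform (Haar)

rng = np.random.default_rng(0)
W = random_orthogonal(256, rng)
x = rng.standard_normal(256)
print(np.allclose(W @ W.T, np.eye(256)))                       # True: W is orthogonal
print(np.isclose(np.linalg.norm(W @ x), np.linalg.norm(x)))    # True: lengths preserved
```

The norm-preservation checked in the last line is the property the abstract emphasizes; the paper's contribution is showing how to achieve it for convolution operators and why that keeps signals well-behaved across thousands of layers.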

"Conventional" neural networks often don't go much deeper than a few hundred layers. In actuality, with the prominent use of skip-connections, the "effective" depth of the network is often one layer deep, it's just that that super layer becomes really expressive (one example of that is the U-net for image segmentation, which has several lower resolution levels, but has skip-connections at the high resolution levels that make it, in fact, quite shallow).
The "layer" concept of plain feed-forward networks does not translate as well to other architectures these days.
For a Neural ODE, for example (see What are the practical uses of Neural ODEs?), we have a "layer analogue" that goes as deep as you want.
For Deep Equilibrium Models, the depth is mathematically equivalent to infinity (see my other question here: How do Deep Equilibrium Models achieve "infinite depth"?).
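As a loose illustration of that "infinite depth" reading (a naive toy in NumPy; real Deep Equilibrium Models solve for the fixed point with a root-finding solver rather than brute-force iteration), the same weight-tied layer is applied until its output stops changing:

```python
import numpy as np

rng = np.random.default_rng(0)
W = 0.3 * rng.standard_normal((8, 8)) / np.sqrt(8)   # small weights so the iteration contracts
x = rng.standard_normal(8)                           # the input is re-injected at every "depth"

def layer(z: np.ndarray, x: np.ndarray) -> np.ndarray:
    """The same weight-tied transformation at every depth: z_{k+1} = tanh(W z_k + x)."""
    return np.tanh(W @ z + x)

z = np.zeros(8)
for _ in range(1000):          # stacking the layer "infinitely" many times
    z = layer(z, x)

# At the fixed point z* = layer(z*, x), one more layer changes essentially nothing,
# which is the sense in which the model behaves as if it had unbounded depth.
print(np.max(np.abs(layer(z, x) - z)))
```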
