Underdetermined systems are only underdetermined if you impose no constraints other than the data. Sticking with your example, fitting a degree-4 polynomial (five coefficients) to 4 data points leaves one degree of freedom unconstrained by the data, which leaves you with a line (in coefficient space) of equally good solutions. However, you can use various regularization techniques to make the problem tractable. For example, by imposing a penalty on the squared L2-norm (i.e. the sum of squares) of the coefficients, you ensure that there is always a single solution that minimizes the penalized objective.
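Here's a minimal sketch of that idea (ridge regression) for the polynomial example; the data points and the penalty strength `lam` are made up purely for illustration:

```python
import numpy as np

# Four data points -- fewer than the five coefficients of a degree-4 polynomial,
# so the unpenalized least-squares problem is underdetermined.
x = np.array([-1.0, 0.0, 1.0, 2.0])
y = np.array([1.0, 0.5, 2.0, 3.0])

# Design matrix with columns 1, x, x^2, x^3, x^4 (five unknown coefficients).
X = np.vander(x, N=5, increasing=True)

# Ridge (L2) penalty: minimize ||X b - y||^2 + lam * ||b||^2.
# Adding lam * I makes the normal-equations matrix invertible,
# so the solution is unique even though X has more columns than rows.
lam = 1e-2
coeffs = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
print(coeffs)
```

Without the `lam * np.eye(5)` term the matrix `X.T @ X` would be singular and there would be infinitely many solutions; the penalty picks out the one with small coefficients.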
Regularization techniques also exist for neural networks, so the short answer to your question is 'yes, you can'. Of particular interest is a technique called "dropout", in which, for each update of the weights, you randomly 'drop' a certain subset of nodes from the network. That is, for that particular iteration of the learning algorithm, you pretend these nodes don't exist. Without dropout, the net can learn very complex representations of the input that depend on all the nodes working together just right. Such representations are likely to 'memorize' the training data, rather than finding patterns that generalize. Dropout ensures that the network cannot use all nodes at once to fit the training data; it has to be able to represent the data well even when some nodes are missing, and so the representations it comes up with are more robust.
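To make the mechanism concrete, here is a rough numpy sketch of (inverted) dropout applied to one layer's activations. The function name, drop probability, and array shapes are illustrative only, not the API of any particular library:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, drop_prob=0.5, training=True):
    """Inverted dropout: zero each unit with probability drop_prob during
    training and rescale the survivors so the expected activation is unchanged."""
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

# One hidden layer's activations for a batch of 4 examples, 8 units each.
h = rng.standard_normal((4, 8))

h_train = dropout_forward(h, drop_prob=0.5, training=True)   # random subset of units per update
h_eval = dropout_forward(h, training=False)                  # all units active at test time
```

Each weight update sees a different random mask, so no representation can rely on one specific combination of units always being present.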
Also note that when using dropout, the effective degrees of freedom at any given point during training can actually be smaller than the number of training samples, even though the full network has more weights than you have training samples.