What to do when a neural network cannot overfit one training sample?

Question

Other questions have addressed what to do when a network does not reach good performance on a medium or large training set, or have pointed out that overfitting a single training sample requires enough capacity.

However, what if a network has enough capacity and is still not able to overfit one training sample? I have a CNN operating on 3D data that, regardless of training set size (in the range [1, 256]), cannot get below a loss of ~1e-3. I have tried tuning the learning rate, changing the initialization, simplifying the architecture, trying different activation functions, ...

I would appreciate any tips.

What is a CNN with 3d data? What is the task? What is the loss function? What loss value would you accept as indicating overfitting? What activation functions have you tried? Are you sure your model output is on the correct scale for your choice of loss function? — Sycorax, Oct 15 '20 at 19:49

Tom Dörr · Accepted Answer · 2020-10-15T20:20:46.073

2

What loss function are you using? When you want to perfectly overfit, you should use L2 and not L1 loss.

The reason is that the derivative of the L2 loss with respect to the error $l$ is $\frac{d\,l^2}{dl} = 2l$, while the derivative of the L1 loss is $\frac{d\,|l|}{dl} = \operatorname{sign}(l)$, which always has magnitude 1. The gradient update for the L2 loss therefore gets smaller as the error decreases, which makes it easier to improve the network in the later stages of training, when only small adjustments are needed. The L1 loss, on the other hand, does not reduce the magnitude of the gradient update, so with a fixed step size the updates keep overshooting the minimum you want the network to reach.
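To make the difference concrete, here is a minimal sketch that fits a single scalar weight to a target with plain gradient descent and a fixed step size. It is a toy stand-in, not the CNN from the question: the function name `fit_scalar`, the step size, and the target value are all assumptions chosen for illustration.

```python
def fit_scalar(loss, lr=0.3, steps=200, w=0.0, target=1.0):
    """Plain gradient descent on one weight w (hypothetical toy problem)."""
    for _ in range(steps):
        err = w - target
        if loss == "l2":
            grad = 2.0 * err              # d(err^2)/d(err): shrinks with the error
        else:
            grad = (err > 0) - (err < 0)  # d|err|/d(err) = sign(err): magnitude 1
        w -= lr * grad
    return abs(w - target)

l2_err = fit_scalar("l2")  # steps shrink, so w settles on the target
l1_err = fit_scalar("l1")  # steps stay at size lr, so w bounces around the target

print(f"final |error| with L2: {l2_err:.3e}")
print(f"final |error| with L1: {l1_err:.3e}")
```

With a fixed step size the L1 run ends up hopping back and forth around the target at a distance on the order of `lr`, while the L2 run keeps shrinking its steps and converges.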

edited Oct 15 '20 at 20:20

answered Oct 15 '20 at 19:50

Tom Dörr


Thanks, this worked. However, it is not fully satisfying, since an L2 loss of $(10^{-3})^{2}$ is 0 in terms of float32. Is it unreasonable to expect the L1 loss to reach 0 (even for one example)? – NightRain23 Oct 15 '20 at 20:04
Float32 should be able to represent much smaller values. Reaching exactly 0 is likely not possible, but I would expect that you can get close. I updated my answer to explain why the L2 loss is better in this situation. – Tom Dörr Oct 15 '20 at 20:28
Your explanation seems to be correct: I was able to achieve near-zero loss (~1e-6) with L1 by decaying the learning rate (even though Adam is being used). This now makes sense, as L1 needs a little more 'help' not to skip over minima as the loss decreases. Does this seem reasonable? It is still strange, since I expected Adam to have an adaptive learning rate... – NightRain23 Oct 15 '20 at 20:32
Yes, that seems very reasonable. I actually thought about adding learning rate decay to my answer. Adam can adapt the learning rate, but only to a certain extent. – Tom Dörr Oct 15 '20 at 20:35
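The effect of learning rate decay on the L1 loss can be sketched on a toy problem with a single scalar weight. Note the assumptions: this uses plain gradient descent with an exponentially decayed step size rather than Adam, and the function name `l1_descent`, the initial rate, and the decay factor are hypothetical values picked for illustration.

```python
def l1_descent(lr0=0.3, decay=1.0, steps=200, w=0.0, target=1.0):
    """Gradient descent on |w - target|; lr is multiplied by `decay` each step."""
    lr = lr0
    for _ in range(steps):
        err = w - target
        grad = (err > 0) - (err < 0)  # sign(err): magnitude 1 regardless of error
        w -= lr * grad
        lr *= decay                   # decay=1.0 keeps the step size fixed
    return abs(w - target)

fixed_err = l1_descent(decay=1.0)    # constant step size
decayed_err = l1_descent(decay=0.9)  # step size shrinks by 10% per step

print(f"final |error|, fixed lr:   {fixed_err:.3e}")
print(f"final |error|, decayed lr: {decayed_err:.3e}")
```

Because the L1 gradient never shrinks on its own, the step size has to shrink instead: once the weight has crossed the target, the remaining error in this sketch stays bounded by the current step size, so it falls as the learning rate decays.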
