Other questions have addressed what to do when a network does not reach good performance on a medium or large training set, and have pointed out that overfitting a single training sample requires enough capacity.
However, what if a network has enough capacity and is still unable to overfit a single training sample? I have a CNN trained on 3D data that, regardless of training set size (anywhere from 1 to 256 samples), cannot get below a loss of ~1e-3. I have tried tuning the learning rate, changing the initialization, simplifying the architecture, trying different activation functions, ...
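To pin down what "overfitting one training sample" means here, the check can be sketched roughly as below. This is only a toy stand-in, not the 3D CNN in question: it assumes a tiny one-hidden-layer tanh network in plain Python, a made-up input `x` and target `y`, a squared-error loss, and plain gradient descent. The function name `overfit_one_sample` and all hyperparameters are hypothetical.

```python
import math
import random


def overfit_one_sample(steps=2000, lr=0.01, hidden=8, seed=0):
    """Train a tiny tanh MLP on a single (x, y) pair and return the loss per step."""
    rng = random.Random(seed)
    x = [0.5, -1.0, 2.0]  # hypothetical single training input
    y = 0.7               # hypothetical single training target
    n_in = len(x)

    # Small random initial weights, zero biases (an assumption for this sketch).
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0

    losses = []
    for _ in range(steps):
        # Forward pass: hidden = tanh(W1 x + b1), pred = W2 . hidden + b2
        h = [
            math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W1, b1)
        ]
        pred = sum(w * hi for w, hi in zip(W2, h)) + b2
        err = pred - y
        losses.append(err * err)  # squared-error loss on the one sample

        # Backward pass: d(loss)/d(pred) = 2 * err
        g = 2.0 * err
        for j in range(hidden):
            # Gradient w.r.t. the pre-activation of hidden unit j,
            # computed with W2[j] before it is updated.
            gh = g * W2[j] * (1.0 - h[j] ** 2)
            W2[j] -= lr * g * h[j]
            for i in range(n_in):
                W1[j][i] -= lr * gh * x[i]
            b1[j] -= lr * gh
        b2 -= lr * g
    return losses


if __name__ == "__main__":
    history = overfit_one_sample()
    print("first loss:", history[0])
    print("last loss:", history[-1])
```

In a check like this, the loss on the single sample is expected to keep falling toward zero. My situation is that the equivalent check on my actual network stalls at a plateau instead.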
I would appreciate any tips.