
This tutorial shows that a CNN's accuracy on MNIST improves even when the training set is expanded with operations as simple as a one-pixel translation. The same tutorial then asks:

The idea of convolutional layers is to behave in an invariant way across images. It may seem surprising, then, that a network can learn more when all we've done is translate the input data. Can you explain why this is actually quite reasonable?
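For concreteness, the augmentation I have in mind is something like the following sketch (the helper names `shift_one_pixel` and `augment` are mine, not the tutorial's, and I'm assuming zero-filling at the vacated edge, which is harmless on MNIST's black background):

```python
import numpy as np

def shift_one_pixel(image, dy, dx):
    """Translate a 2-D image by (dy, dx) pixels, zero-filling the vacated edge.

    Illustrative helper, not taken from the tutorial.
    """
    out = np.zeros_like(image)
    h, w = image.shape
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        image[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return out

def augment(images):
    """Return the original images plus their four one-pixel translations."""
    shifts = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    extra = [shift_one_pixel(img, dy, dx) for img in images for dy, dx in shifts]
    return np.concatenate([images, np.stack(extra)])
```

This turns a training set of N images into 5N images that differ only by these tiny shifts, and yet the tutorial reports a measurable accuracy gain.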

My guess was that it is because the pooling layer (max or L2) is a kind of approximation and loses the precise spatial information. This answer instead says it is due to the fully connected layers:

But the fully-connected layer (if there is one, or the output layer if not) still sees as its input a spatial map, which is probably, depending on the network architecture, mostly invariant to small shifts but less so to larger ones.
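To sanity-check that claim, here is a minimal PyTorch sketch (random input and an arbitrary untrained 3x3 filter, purely for illustration). Convolution is translation-equivariant rather than invariant: shifting the input shifts the feature map, so the map itself changes, and the difference survives max pooling:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
img = torch.randn(1, 1, 28, 28)    # stand-in for an MNIST digit
kernel = torch.randn(1, 1, 3, 3)   # arbitrary untrained filter, just for illustration

# Translate the image one pixel to the right, zero-filling the vacated column.
shifted = torch.zeros_like(img)
shifted[..., 1:] = img[..., :-1]

out = F.conv2d(img, kernel, padding=1)
out_shift = F.conv2d(shifted, kernel, padding=1)

# Equivariance: away from the border, the feature map moves one column too.
print(torch.allclose(out_shift[..., 2:-1], out[..., 1:-2]))  # True
# But not invariance: the map as a whole is different.
print(torch.equal(out_shift, out))                           # False

# Max pooling shrinks the difference but does not erase it.
print((F.max_pool2d(out, 2) - F.max_pool2d(out_shift, 2)).abs().mean())
```

So the downstream fully-connected layer really does receive a different input for the shifted digit, which would be consistent with the quoted answer.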

Does anyone have a better explanation of why this happens?

Francesco Boi
  • Does this help https://stats.stackexchange.com/questions/350220/utility-of-feature-engineering-why-create-new-features-based-on-existing-featu/350238#350238 ? – Tim May 28 '20 at 10:35
  • @Tim No, not really. My question is specific to convolutional neural networks, which are purposely built to be invariant. That question and its answers don't relate to CNNs. – Francesco Boi Jun 04 '20 at 10:29
