In the CS294A lecture notes, Andrew Ng writes (about autoencoders): "Usually weight decay is not applied to the bias terms... Applying weight decay to the bias units usually makes only a small difference to the final network, however".
Is there a particular reason why we shouldn't apply weight decay to the bias terms? Does doing so reduce the performance of the network?
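
To make the question concrete, here is a minimal sketch of the two options I am comparing, assuming a plain L2 penalty of the form (lam / 2) * (sum of squared parameters). The names `l2_penalty`, `l2_penalty_grad`, `layers`, `lam` and `decay_biases` are my own and do not come from the lecture notes.

```python
def l2_penalty(layers, lam, decay_biases=False):
    """Return (lam / 2) * sum of squares of the penalized parameters.

    layers: list of (W, b) pairs, where W is a list of rows and b a list of floats.
    decay_biases: hypothetical flag; False penalizes the weights only.
    """
    total = 0.0
    for W, b in layers:
        total += sum(w * w for row in W for w in row)
        if decay_biases:
            total += sum(x * x for x in b)
    return 0.5 * lam * total


def l2_penalty_grad(layers, lam, decay_biases=False):
    """Gradient of l2_penalty with respect to each W and each b."""
    grads = []
    for W, b in layers:
        grad_W = [[lam * w for w in row] for row in W]
        # Without bias decay the penalty contributes nothing to the bias gradient.
        grad_b = [lam * x if decay_biases else 0.0 for x in b]
        grads.append((grad_W, grad_b))
    return grads


# One toy layer with a 2x2 weight matrix and two biases.
layers = [([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0])]
weights_only = l2_penalty(layers, lam=0.1)
with_biases = l2_penalty(layers, lam=0.1, decay_biases=True)
```

In both cases this term would be added to the data-fitting cost; the only difference is whether the bias vectors appear in the sum.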