I am currently working on an auto-encoder to create latent vectors from multivariate time series. The architectures I've tested so far all revolve around various flavours and combinations of 1D convolutions and resnets (WaveNet style). My basic building block is
1DConv()->ActivationFunction()
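For concreteness, here is a minimal sketch of what I mean by that block, written in PyTorch; the channel counts, kernel size, and default activation are placeholders rather than my exact setup:

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """One building block: a 1D convolution followed by an optional activation."""
    def __init__(self, in_channels, out_channels, kernel_size=3, activation=None):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2)
        # Passing nn.Identity() gives the "no non-linearity" variant discussed below.
        self.activation = activation if activation is not None else nn.ReLU()

    def forward(self, x):
        return self.activation(self.conv(x))
```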
So far I've tested ReLUs as well as tanh and sigmoid gates. I train all networks on the reconstruction error of my time series, using MSE or variants thereof.
I've observed that removing the non-linear activations entirely actually tends to speed up learning and improve the reconstruction quality of my networks. This seems counterintuitive to me, as I thought it is precisely the activation functions that give neural networks their expressive power. How can this be?
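To illustrate why this puzzles me: without activations, the stacked convolutions compose into a single linear (affine, with biases) map, which a quick numerical check confirms. The layer sizes here are purely illustrative, not my actual architecture:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two stacked 1D convolutions with no activation in between (hypothetical sizes).
stack = nn.Sequential(
    nn.Conv1d(4, 8, kernel_size=3, padding=1, bias=False),
    nn.Conv1d(8, 4, kernel_size=3, padding=1, bias=False),
)

x = torch.randn(1, 4, 64)
y1 = stack(x)
y2 = stack(2 * x) / 2  # linearity check: scaling the input scales the output

print(torch.allclose(y1, y2, atol=1e-6))  # True: the stack acts as one linear map
```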
Some additional info:
- My data is rather noisy, and I suspect there is little structural information to be learned and compressed by my auto-encoders.