I read the answer to "Why are non zero-centered activation functions a problem in backpropagation?" but I still can't understand it.
Assume $$f=\sum_i w_ix_i+b, \qquad \sigma(x)=\dfrac{1}{1+e^{-x}},$$ and that the loss $L$ depends on the weights only through the neuron's output $\sigma(f)$.
To my understanding, the gradient is $$\dfrac{dL}{dw_i}=\dfrac{dL}{df}x_i=\dfrac{dL}{d\sigma}\dfrac{d\sigma}{df}x_i,$$ so $\dfrac{dL}{dw_i}$ actually depends on the so-called upstream gradient $\dfrac{dL}{d\sigma}$: since $\dfrac{d\sigma}{df}$ is always positive, the sign of $\dfrac{dL}{dw_i}$ is determined by the signs of $\dfrac{dL}{d\sigma}$ and $x_i$ alone.
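To check my chain-rule reasoning numerically, here is a minimal sketch in plain NumPy; all the numbers ($x$, $w$, $b$, and the upstream gradient `dL_dsigma`) are placeholders I made up:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy values, made up purely for illustration
x = np.array([0.5, 1.2, 0.3])    # inputs x_i (all positive here)
w = np.array([0.1, -0.4, 0.7])   # weights w_i
b = 0.2

f = w @ x + b                                 # f = sum_i w_i x_i + b
dsigma_df = sigmoid(f) * (1.0 - sigmoid(f))   # sigmoid derivative, always in (0, 0.25]

dL_dsigma = -0.8                   # placeholder upstream gradient dL/dsigma
dL_dw = dL_dsigma * dsigma_df * x  # chain rule: dL/dw_i = dL/dsigma * dsigma/df * x_i

print(dsigma_df > 0)    # True: the derivative factor is always positive
print(np.sign(dL_dw))   # [-1. -1. -1.]: each sign is sign(dL/dsigma) * sign(x_i)
```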
So, to my understanding, the non-zero-centred output of the activation function ($\sigma(x)$) is not the problem; the problem is the non-zero-centred derivative of the activation function ($\dfrac{d\sigma}{df}$).
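To make the point concrete: because $\dfrac{d\sigma}{df}$ is a common positive factor, every component of the gradient has sign $\operatorname{sign}\!\left(\dfrac{dL}{d\sigma}\right)\cdot\operatorname{sign}(x_i)$, so when all $x_i$ are positive the whole gradient flips sign together. A quick self-contained sketch (toy numbers again):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 1.2, 0.3])   # all x_i > 0, e.g. outputs of a previous sigmoid layer
w = np.array([0.1, -0.4, 0.7])
f = w @ x + 0.2

# Try both signs of the (placeholder) upstream gradient:
for dL_dsigma in (-0.8, +0.8):
    dL_dw = dL_dsigma * sigmoid(f) * (1.0 - sigmoid(f)) * x
    print(np.sign(dL_dw))  # [-1 -1 -1] then [1 1 1]: components always share a sign
```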
Is there anything wrong with my reasoning?
And what is the mathematical definition of "zero-centred"? Is it $\int_{-\infty}^{\infty} f(x)\,dx=0$?
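For what it's worth, here is the quick numeric check I tried (the interval is arbitrary): the integral of $\sigma$ over a symmetric interval $[-a,a]$ comes out to $a$ rather than $0$ (and diverges as $a\to\infty$), so I suspect "zero-centred" might instead refer to the mean of the outputs, but I am not sure:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

xs = np.linspace(-50.0, 50.0, 100001)   # arbitrary symmetric interval
dx = xs[1] - xs[0]

# Candidate 1: integral of sigma (Riemann sum) -- grows with the interval
print(sigmoid(xs).sum() * dx)   # ~50.0, not 0

# Candidate 2: mean output over symmetric inputs -- sigma(x) + sigma(-x) = 1
print(sigmoid(xs).mean())       # ~0.5, not 0 either
```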