I read the answer to "Why are non zero-centered activation functions a problem in backpropagation?" but I still can't understand it.
Assume $$f=\sum_i w_ix_i+b, \qquad \sigma(x)=\dfrac{1}{1+e^{-x}},$$ and that the loss $L$ depends on the weights only through the neuron's output $\sigma(f)$.
To my understanding, the gradient is $$\dfrac{dL}{dw_i}=\dfrac{dL}{df}x_i=\dfrac{dL}{d\sigma}\dfrac{d\sigma}{df}x_i,$$ so $\dfrac{dL}{dw_i}$ actually depends on the so-called upstream gradient $\dfrac{dL}{d\sigma}$: since $\dfrac{d\sigma}{df}$ is always positive, the sign of $\dfrac{dL}{dw_i}$ is determined by the signs of $\dfrac{dL}{d\sigma}$ and $x_i$ alone.
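To check my chain-rule reasoning numerically, here is a minimal sketch in plain NumPy; all the numbers ($x$, $w$, $b$, and the upstream gradient `dL_dsigma`) are placeholders I made up:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy values, made up purely for illustration
x = np.array([0.5, 1.2, 0.3])    # inputs x_i (all positive here)
w = np.array([0.1, -0.4, 0.7])   # weights w_i
b = 0.2

f = w @ x + b                                 # f = sum_i w_i x_i + b
dsigma_df = sigmoid(f) * (1.0 - sigmoid(f))   # sigmoid derivative, always in (0, 0.25]

dL_dsigma = -0.8                   # placeholder upstream gradient dL/dsigma
dL_dw = dL_dsigma * dsigma_df * x  # chain rule: dL/dw_i = dL/dsigma * dsigma/df * x_i

print(dsigma_df > 0)    # True: the derivative factor is always positive
print(np.sign(dL_dw))   # [-1. -1. -1.]: each sign is sign(dL/dsigma) * sign(x_i)
```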
So, to my understanding, the non-zero-centred output of the activation function ($\sigma(x)$) is not the problem; the problem is the non-zero-centred derivative of the activation function ($\dfrac{d\sigma}{df}$).
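To make the point concrete: because $\dfrac{d\sigma}{df}$ is a common positive factor, every component of the gradient has sign $\operatorname{sign}\!\left(\dfrac{dL}{d\sigma}\right)\cdot\operatorname{sign}(x_i)$, so when all $x_i$ are positive the whole gradient flips sign together. A quick self-contained sketch (toy numbers again):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 1.2, 0.3])   # all x_i > 0, e.g. outputs of a previous sigmoid layer
w = np.array([0.1, -0.4, 0.7])
f = w @ x + 0.2

# Try both signs of the (placeholder) upstream gradient:
for dL_dsigma in (-0.8, +0.8):
    dL_dw = dL_dsigma * sigmoid(f) * (1.0 - sigmoid(f)) * x
    print(np.sign(dL_dw))  # [-1 -1 -1] then [1 1 1]: components always share a sign
```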
Is there anything wrong with my reasoning?
And what is the mathematical definition of "zero-centred"? Is it $\int_{-\infty}^{\infty} f(x)\,dx=0$?
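For what it's worth, here is the quick numeric check I tried (the interval is arbitrary): the integral of $\sigma$ over a symmetric interval $[-a,a]$ comes out to $a$ rather than $0$ (and diverges as $a\to\infty$), so I suspect "zero-centred" might instead refer to the mean of the outputs, but I am not sure:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

xs = np.linspace(-50.0, 50.0, 100001)   # arbitrary symmetric interval
dx = xs[1] - xs[0]

# Candidate 1: integral of sigma (Riemann sum) -- grows with the interval
print(sigmoid(xs).sum() * dx)   # ~50.0, not 0

# Candidate 2: mean output over symmetric inputs -- sigma(x) + sigma(-x) = 1
print(sigmoid(xs).mean())       # ~0.5, not 0 either
```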