
I read here the following:

  • Sigmoid outputs are not zero-centered. This is undesirable since neurons in later layers of processing in a Neural Network (more on this soon) would be receiving data that is not zero-centered. This has implications on the dynamics during gradient descent, because if the data coming into a neuron is always positive (e.g. $x > 0$ elementwise in $f = w^Tx + b$), then the gradient on the weights $w$ will, during backpropagation, become either all positive or all negative (depending on the gradient of the whole expression $f$). This could introduce undesirable zig-zagging dynamics in the gradient updates for the weights. However, notice that once these gradients are added up across a batch of data the final update for the weights can have variable signs, somewhat mitigating this issue. Therefore, this is an inconvenience but it has less severe consequences compared to the saturated activation problem above.

Why would having all $x>0$ (elementwise) lead to all-positive or all-negative gradients on $w$?


Amelio Vazquez-Reina

1 Answer


$$f=\sum w_ix_i+b$$ $$\frac{df}{dw_i}=x_i$$ $$\frac{dL}{dw_i}=\frac{dL}{df}\frac{df}{dw_i}=\frac{dL}{df}x_i$$

Because $x_i>0$, the gradients $\dfrac{dL}{dw_i}$ all have the same sign as $\dfrac{dL}{df}$, so they are either all positive or all negative.
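Here is a minimal numpy sketch of this; the inputs, weights and upstream gradient are made-up values, chosen only to show the sign argument:

```python
import numpy as np

# Made-up positive inputs (e.g. sigmoid outputs of a previous layer) and weights.
x = np.array([0.2, 0.9, 0.5, 0.7])   # x_i > 0 elementwise
w = np.array([0.3, -1.2, 0.7, 0.1])
b = 0.0

f = w @ x + b
dL_df = -0.7                          # some scalar gradient flowing back into f

dL_dw = dL_df * x                     # chain rule: dL/dw_i = dL/df * x_i
print(np.sign(dL_dw))                 # [-1. -1. -1. -1.]: all share the sign of dL/df
```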

Update
Say there are two parameters $w_1$ and $w_2$. If the gradients in the two dimensions always have the same sign (i.e., either both are positive or both are negative), we can only move roughly in the northeast or southwest direction in parameter space.

If our goal happens to be in the northwest, we can only move in a zig-zagging fashion to get there, just like parallel parking in a narrow space. (forgive my drawing)

[Hand-drawn sketch: a zig-zagging path of gradient updates working its way toward a goal in the northwest.]

Therefore activation functions whose outputs are all positive or all negative (ReLU, sigmoid) can make gradient-based optimization difficult. To mitigate this we can normalize the data in advance so that it is zero-centered, as in batch/layer normalization.
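To make the picture concrete, here is a toy SGD simulation; the target weights, learning rate and input distribution are made up purely for illustration. Because every input is elementwise positive, each update can only point roughly northeast or southwest, so reaching a goal to the northwest takes many zig-zag steps:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy SGD run on a single linear neuron f = w.x with squared error.
# The (made-up) target weights lie to the "northwest" of the start:
# w_1 must decrease while w_2 must increase.
w_true = np.array([-2.0, 3.0])
w = np.array([0.0, 0.0])
lr = 0.05
path = [w.copy()]

for _ in range(2000):
    x = rng.uniform(0.1, 1.0, size=2)     # inputs are elementwise positive
    dL_df = 2.0 * (w @ x - w_true @ x)    # scalar upstream gradient
    w = w - lr * dL_df * x                # both components move with the same sign
    path.append(w.copy())

steps = np.diff(np.array(path), axis=0)
# Every step points roughly northeast or southwest (its two components always
# share a sign), so the path can only zig-zag toward the northwest goal.
print(np.all(np.sign(steps[:, 0]) == np.sign(steps[:, 1])))  # True
print(w)                                  # ends up near w_true, but only after many zig-zag steps
```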

Another solution I can think of is to add a bias term for each input, so the layer becomes $$f=\sum w_i(x_i+b_i).$$ The gradient is then $$\frac{dL}{dw_i}=\frac{dL}{df}(x_i+b_i),$$ so its sign no longer depends solely on $x_i$.
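A quick sketch of that idea; the biases here are made-up constants rather than learned parameters, just to show that the gradient components can now have different signs:

```python
import numpy as np

# Made-up values: positive inputs, per-input offsets, and weights.
x = np.array([0.2, 0.9, 0.5])        # all positive, as before
b = np.array([-0.5, -0.1, -0.8])     # per-input biases; some shift x_i + b_i below zero
w = np.array([0.3, -1.2, 0.7])

f = w @ (x + b)
dL_df = 1.3                          # some scalar upstream gradient

dL_dw = dL_df * (x + b)              # chain rule: dL/dw_i = dL/df * (x_i + b_i)
print(np.sign(dL_dw))                # [-1.  1. -1.]: mixed signs
```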

dontloo
  • Please correct me if I am wrong, but shouldn't the value of dL/df be the transpose of x, i.e. x.T, since we would be using the idea of the Jacobian here? – chinmay Feb 11 '18 at 14:54
  • @chinmay sorry for the late reply, I think $f$ here is the outcome of $w^Tx+b$ so the value of dL/df does not depend on x, and usually $L$ is a scalar, $w$ and $x$ are 1d vectors, so dL/df should also be a scalar, right? – dontloo Feb 23 '18 at 05:47
  • Yes, it is a big typo from my end. I meant df/dw ... but I think it depends more on whether x is a row vector or a column vector. – chinmay Mar 28 '18 at 15:46
  • @dontloo sorry for the very late reply, but what is the problem with the gradients having the same sign as $dL/df$? Why is that a bad thing? – floyd Jul 31 '19 at 19:30
  • @floyd hi, I just added some updates for your question. – dontloo Aug 01 '19 at 10:31
  • Doesn't the argument work only for a specific case (as in the picture)? If the source is at the top right and the target is at the bottom left (or vice versa), then we will not have zig-zag dynamics, right? I could not understand how we are generalizing here. – Vinay Feb 25 '20 at 11:22
  • @Vinay yes, I don't think it is a broadly applicable case either; I'm not an expert on optimization methods though. – dontloo Feb 25 '20 at 22:27