
As far as I understand, one of the main claimed problems with initializing, e.g., a feed-forward neural network (with several $\text{tanh}$ or $\text{ReLU}$ layers) with $W=0$ is that it doesn't break "network symmetry": backpropagation would propagate the same error through all such units (i.e. "nudging all weights in the same direction"). This is, I presume, undesirable because we would not be learning "different" computations along different paths of the network.

However, I'm confused about why that even matters in this case, given that if $W$ ever drops to $0$ we will effectively be propagating no gradients at all through the network, since $W=0$ multiplies all the errors coming back from the output and prevents any learning.

Put another way, setting aside the fact that $W=0$ does not break network symmetry (wasting computations and paths in the network), is it correct to say that if $W=0$ (e.g. by initialization) we are effectively killing the gradients in the neural network, and thus no learning can take place?
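
To make the question concrete, here is the kind of minimal numerical check I have in mind (a toy single-hidden-layer $\text{tanh}$ network with a squared loss; the shapes, loss and target are just illustrative, nothing canonical):

```python
# Toy check: which gradients survive an all-zero initialization?
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))      # a single input
t = np.array([[1.0]])            # target

W1 = np.zeros((4, 3)); b1 = np.zeros((4, 1))   # hidden layer, W = 0
W2 = np.zeros((1, 4)); b2 = np.zeros((1, 1))   # output layer, W = 0

# forward pass
z1 = W1 @ x + b1
h  = np.tanh(z1)                 # tanh(0) = 0, so h is all zeros
y  = W2 @ h + b2

# backward pass for L = 0.5 * ||y - t||^2, chain rule written out
dy  = y - t                      # dL/dy, generally nonzero
dW2 = dy @ h.T                   # = 0 because h = 0
db2 = dy                         # nonzero: the output bias still gets a gradient
dh  = W2.T @ dy                  # = 0 because W2 = 0: the path I'm worried about
dz1 = dh * (1 - h**2)            # tanh'(z1) = 1 - tanh(z1)^2
dW1 = dz1 @ x.T                  # = 0
db1 = dz1                        # = 0

print(dW2, db2, dW1, db1, sep="\n\n")
```

In this toy setup the only nonzero gradient I get is the one for the output bias, which is what makes me suspect that $W=0$ effectively kills the gradients for the weights themselves.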

Josh
  • Related: https://stats.stackexchange.com/questions/45087/why-doesnt-backpropagation-work-when-you-initialize-the-weights-the-same-value – Sycorax Jun 02 '20 at 15:51

1 Answer


It's simpler than that. When the weights start out identical, the gradient of the objective function has the form $\frac{\partial \mathcal L}{\partial w_i}=\alpha$, i.e. the gradient components for corresponding weights are equal within a layer. Since the optimization step is proportional to the gradient, you make the same step for each of those weights, $\Delta w_i=\alpha\eta$, where $\eta$ denotes the learning rate (the step size in a parameter direction is proportional to the gradient and the learning rate), so the weights stay equal within a layer at all times.
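
To make this concrete, here is a small numerical sketch of the symmetry (a toy one-hidden-layer $\text{tanh}$ network with a squared loss, where every weight starts at the same constant $c$ instead of $0$ so that the gradients are visibly nonzero; the shapes and constants are just for illustration):

```python
# Toy check: with identical initial weights, units in a layer get identical gradients.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))
t = np.array([[1.0]])
c = 0.5                                    # common initial value for every weight

W1 = np.full((4, 3), c); b1 = np.zeros((4, 1))
W2 = np.full((1, 4), c); b2 = np.zeros((1, 1))

# forward pass
z1 = W1 @ x + b1                           # every hidden unit sees the same pre-activation
h  = np.tanh(z1)                           # so every hidden activation is identical
y  = W2 @ h + b2

# backward pass for L = 0.5 * ||y - t||^2
dy  = y - t
dW2 = dy @ h.T                             # all entries equal, since the h_j are identical
dh  = W2.T @ dy                            # all entries equal, since the entries of W2 are identical
dz1 = dh * (1 - h**2)
dW1 = dz1 @ x.T                            # every row identical: each hidden unit gets the same update

print(dW2)
print(dW1)                                 # identical rows
```

Because each hidden unit receives exactly the same gradient, a gradient step keeps the units identical to one another, and the same argument applies at every subsequent iteration; this is the symmetry that a random initialization is meant to break.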

Aksakal
  • Thanks, but wouldn't you argue that the $\Delta$'s (at least in the lower layers) will actually be **0**, since they get multiplied by $W=0$ through the chain rule from the upper layers? – Josh May 28 '20 at 03:08
  • not necessarily, because they'll be multiplied by nonlinear functions of weights, not the weights themselves in the gradient calculation – Aksakal May 28 '20 at 03:10
  • Thanks - (1) I think I need to see an actual full derivation of the gradient formula in a lower layer to see what you are saying (if you know of any links showing this, I'd love to see the actual expression). (2) Separately, what do you mean by _"your weights will be the same all the time"_? You wrote $\Delta w_i = \alpha \Delta x$. That implies that $w_i$ actually changes in value doesn't it? What am I missing? – Josh May 28 '20 at 03:13
  • *the same* refers to the weights being the same across a layer – Aksakal May 28 '20 at 03:15
  • Ok thanks, but I think $W$'s (from upper layers) **always** multiply derivatives of activations (see https://en.wikipedia.org/wiki/Backpropagation#Matrix_multiplication for the derivation for a feedforward network). So if $W=0$ the weights won't change at all (i.e. no learning) – Josh May 28 '20 at 03:43
  • no. consider a derivative of the weight in the last layer. it's not zero. so the weight will step out of zero, then in the next iteration the next layer's derivative is not zero etc. – Aksakal May 28 '20 at 04:08
  • +1 I think I follow. Great point. Thanks! – Josh May 28 '20 at 04:16
  • For final clarity, what's $x$ in your $\Delta w_i=\alpha \Delta x$ expression? In [the link](https://en.wikipedia.org/wiki/Backpropagation#Matrix_multiplication) I cited above, I don't see $x$ appearing anywhere in the expression. If it's the input to the neural net, I don't see how that could happen, i.e. changes in lower levels don't get propagated to upper levels, right? (and in the forward pass we just forward the actual value, not a value change) – Josh May 28 '20 at 04:19