Since my last question on the topic, I have tried searching on my own for how zero weight initialization impedes learning, but I can't quite seem to wrap my head around the concept. The CS231n course notes explain that zero-initializing the weights is a bad idea because
[...] if every neuron in the network computes the same output, then they will also all compute the same gradients during backpropagation and undergo the exact same parameter updates.
I am unable to understand this explanation. I understand the first bit, that all the neurons will compute the same output in the forward pass. However, I don't see why the gradients flowing back would be the same.
For example, take the network below. Say the gradient flowing back to Hidden Layer 2 from the output layer is $a$. In Hidden Layer 2, this upstream gradient gets multiplied by the local gradient, say $b$, to give the gradient $ba$ on this layer's weights. Next, these gradients flow back to Hidden Layer 1, and if the neurons in Hidden Layer 1 also have a local gradient of $b$, then the gradient on Hidden Layer 1's weights comes out to be $b^2a$. I might be overlooking some detail here, but if this reasoning is right, then it seems like different gradients are flowing through the network: from the output layer we have $a$ flowing back, from Hidden Layer 2 we have $ba$ flowing back, and from Hidden Layer 1 we have $b^2a$ flowing back. But according to the course notes, the gradients flowing back should be the same.
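To make my confusion concrete, here is a small NumPy sketch of the kind of check I had in mind: a two-hidden-layer network where every weight starts at the same value, with the per-layer weight gradients printed so they can be compared. (The layer sizes, the sigmoid activation, the squared-error loss, and the constant 0.5 are just arbitrary choices for this check, not anything taken from the course notes.)

```python
import numpy as np

# Tiny 2-hidden-layer network with every weight set to the same constant.
np.random.seed(0)
x = np.random.randn(3, 1)        # single input with 3 features
y = np.array([[1.0]])            # single target

W1 = np.full((4, 3), 0.5)        # Hidden Layer 1: 4 neurons
W2 = np.full((4, 4), 0.5)        # Hidden Layer 2: 4 neurons
W3 = np.full((1, 4), 0.5)        # output layer: 1 neuron

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# forward pass
h1 = sigmoid(W1 @ x)             # every entry of h1 is identical
h2 = sigmoid(W2 @ h1)            # every entry of h2 is identical
out = W3 @ h2
loss = 0.5 * (out - y) ** 2      # squared-error loss; its gradient w.r.t. out is (out - y)

# backward pass
d_out = out - y                  # gradient flowing back from the loss
dW3 = d_out @ h2.T
d_h2 = W3.T @ d_out              # upstream gradient into Hidden Layer 2
d_z2 = d_h2 * h2 * (1 - h2)      # times the local sigmoid gradient
dW2 = d_z2 @ h1.T
d_h1 = W2.T @ d_z2               # upstream gradient into Hidden Layer 1
d_z1 = d_h1 * h1 * (1 - h1)
dW1 = d_z1 @ x.T

print("dW2:\n", dW2)             # all rows (one per neuron) come out identical
print("dW1:\n", dW1)             # all rows identical here too, but different from dW2
```

Going by my $a$, $ba$, $b^2a$ reasoning above, I would expect `dW1` and `dW2` to differ from each other, which is what I can't reconcile with the quoted passage.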
Edit
I was also wondering why symmetry breaking is called that. I am confused about why the network is said to be "symmetrical" if all the neurons have the same outputs. I suppose one could consider the network symmetrical about its longitudinal axis, but that is just me guessing. I have so far not found any sources on how the term came about.