I have been trying to understand the transformer network, and specifically the multi-head attention bit. As I understand it, multiple attention-weighted linear combinations of the input features are calculated, one per head.
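Concretely, what I have in mind (if I am reading the "Attention Is All You Need" paper correctly) is that each head $i$ has its own learned projection matrices $W_i^Q, W_i^K, W_i^V$ and the heads are concatenated at the end:

$$\text{head}_i = \text{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)$$
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O$$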
My question is: what stops the network from learning the same weights or linear combination for each of these heads, i.e. basically making the multiple heads redundant? Can that happen? I am guessing it would have to happen in, for example, the trivial case where the translation only depends on the word in the current position.
I also wonder whether we actually use the full input vector for each of the heads. Say my input vector has length 256 and I am using 8 heads. Would I split the input into 8 vectors of length $256 / 8 = 32$, perform attention on each of these, and concatenate the results, or would I use the full vector for each head and then combine the results?
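To make that second question concrete, here is a minimal sketch of what I mean by the second interpretation: every head sees the full 256-dimensional input but projects it down to 32 dimensions with its own matrices, and the 8 outputs are concatenated back to 256. The names and shapes are just my own illustration, not taken from any particular library:

```python
import numpy as np

d_model, n_heads = 256, 8
d_head = d_model // n_heads          # 32 dimensions per head
seq_len = 10

rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d_model))     # toy input sequence

# One (d_model x d_head) projection per head: each head still reads the
# full 256-dim input, but works in its own 32-dim subspace.
W_q = rng.standard_normal((n_heads, d_model, d_head)) / np.sqrt(d_model)
W_k = rng.standard_normal((n_heads, d_model, d_head)) / np.sqrt(d_model)
W_v = rng.standard_normal((n_heads, d_model, d_head)) / np.sqrt(d_model)

heads = []
for i in range(n_heads):
    q, k, v = x @ W_q[i], x @ W_k[i], x @ W_v[i]        # (seq_len, 32) each
    scores = q @ k.T / np.sqrt(d_head)                  # scaled dot-product
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over keys
    heads.append(weights @ v)                           # (seq_len, 32)

out = np.concatenate(heads, axis=-1)                    # back to (seq_len, 256)
print(out.shape)                                        # (10, 256)
```

Is this roughly what happens, or is the input itself sliced into 32-dimensional chunks before any projection?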