
After reading the paper, "Attention is all you need," I have two questions.

1) What is the need for the multi-head attention mechanism? The paper says that "Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions". My understanding is that it helps with anaphora resolution. For example: "The animal didn't cross the street because it was too ..... (tired/wide)". Here "it" can refer to the animal or the street depending on the last word. My doubt is: why can't a single attention head learn this link over time?

2) I also don't understand the need for the feed-forward neural network in the encoder module of the Transformer.


Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).

Zephyr

1 Answer


1) Multi-head attention can be evaluated in parallel, which also simplifies multi-GPU implementations. A single attention head could learn the link you describe, but the authors found empirically that using multiple heads is "beneficial"; it is essentially a black-box statement backed by their experiments, with the intuition that each head can attend to a different representation subspace.

2) Feed-forward layers add a "space" for mixing the information coming out of the previous layer. To quote the paper: "Another way of describing this is as two convolutions with kernel size 1." A convolution with kernel size 1 can be interpreted as a layer that forms a linear combination of all channels, applied independently at every position.
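To make the "two convolutions with kernel size 1" reading concrete, here is a minimal PyTorch sketch (the dimensions d_model = 512 and d_ff = 2048 are the values from the paper; the module and variable names are just placeholders I chose). It shows that the position-wise feed-forward network written as two Linear layers and written as two Conv1d layers with kernel_size=1 compute the same thing: at every position, the channels are linearly mixed with shared weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, d_ff, seq_len, batch = 512, 2048, 10, 2
x = torch.randn(batch, seq_len, d_model)

# Position-wise FFN as two linear layers: FFN(x) = max(0, x W1 + b1) W2 + b2
linear_ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)

# The same computation as two 1D convolutions with kernel size 1.
conv_ffn = nn.Sequential(
    nn.Conv1d(d_model, d_ff, kernel_size=1),
    nn.ReLU(),
    nn.Conv1d(d_ff, d_model, kernel_size=1),
)

# Copy the weights so both versions are numerically identical.
# Linear weight (out, in) -> Conv1d weight (out, in, 1).
conv_ffn[0].weight.data = linear_ffn[0].weight.data.unsqueeze(-1)
conv_ffn[0].bias.data = linear_ffn[0].bias.data
conv_ffn[2].weight.data = linear_ffn[2].weight.data.unsqueeze(-1)
conv_ffn[2].bias.data = linear_ffn[2].bias.data

out_linear = linear_ffn(x)                               # (batch, seq, d_model)
out_conv = conv_ffn(x.transpose(1, 2)).transpose(1, 2)   # Conv1d expects (batch, channels, seq)

print(torch.allclose(out_linear, out_conv, atol=1e-5))   # True
```

Because the kernel size is 1, neither version mixes information across positions; that mixing is left entirely to the attention sublayers.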

podludek