
According to the paper Attention Is All You Need, each sublayer in every encoder and decoder layer is wrapped by a residual connection followed by layer normalization ("Add & Norm" in the following figure):

[Figure: the Transformer encoder/decoder architecture from the paper, with an "Add & Norm" step after each sublayer]

I more or less understand how the "Add & Norm" layers work, but what I don't understand is their purpose. Why are residual connections and layer normalization needed? What exactly do they do? Do they improve the model's performance, and if so, how?

kodkod

1 Answer


Add & Norm are in fact two separate steps. The add step is a residual connection:

[Figure: a residual connection, where the input x bypasses the layer F(x) and is added to its output]

It means that we sum the output of a layer with its input, $\mathcal{F}(\mathbf{x}) + \mathbf{x}$. The idea was introduced by He et al. (2015) with the ResNet model. Because the input skips around the layer and is passed through unchanged, gradients can flow directly back through the addition, which makes residual connections one of the solutions to the vanishing gradient problem.
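A minimal sketch of what this looks like in code (PyTorch here, to match The Annotated Transformer; the wrapper module and the dimensions are illustrative, not the paper's exact setup):

```python
import torch
import torch.nn as nn

class ResidualConnection(nn.Module):
    """Wraps a sublayer F so that the block outputs x + F(x)."""
    def __init__(self, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer  # this is F in F(x) + x

    def forward(self, x):
        # The skip path carries x unchanged, so gradients flow directly
        # back through the addition even if the sublayer's gradients shrink.
        return x + self.sublayer(x)

# Example: wrap a feed-forward sublayer (hypothetical sizes).
ff = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
block = ResidualConnection(ff)
out = block(torch.randn(10, 512))  # shape is preserved: (10, 512)
```

Note that the sublayer must preserve the input's shape, otherwise the addition is not defined; this is one reason the Transformer keeps the same model dimension throughout.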

The norm step is layer normalization (Ba et al., 2016), another way of normalizing activations: each sample's feature vector is rescaled to zero mean and unit variance, rather than normalizing across the batch as batch normalization does. TL;DR it is one of the many computational tricks that make the model easier to train, hence improving performance and training time.
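A minimal sketch of the computation, assuming PyTorch and omitting LayerNorm's learnable gain and bias:

```python
import torch
import torch.nn as nn

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector (the last dimension) to zero
    # mean and unit variance; batch norm would normalize across the batch.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return (x - mean) / torch.sqrt(var + eps)

x = torch.randn(2, 10, 512)  # (batch, sequence, features)
manual = layer_norm(x)
builtin = nn.LayerNorm(512, elementwise_affine=False)(x)
print(torch.allclose(manual, builtin, atol=1e-5))  # True
```

In the actual model, a learnable gain and bias are applied after the normalization (the default in nn.LayerNorm).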

You can find more details on the Transformer model in the great blog post The Annotated Transformer, which explains the paper in depth and illustrates it with code.

Tim