
The language model I am referring to is the one outlined in "Attention is All You Need": [Transformer encoder-decoder architecture diagram from the paper]

My understanding is (please correct me if I am wrong) that when the task is translation, the encoder's input could be "Hi, my name is John." and the decoder's input could be "Bonjour, je m'appelle", and then the Transformer would output "John" as the next word.

However, when it comes to language modeling, I don't see what the encoder's input could be (just as there is no encoder in an RNN language model).

So if we're dealing with language modeling, is the left part of the transformer (in Attention is All You Need), the encoder, removed? If it is still used, what is the input to it?

Thanks in advance!

1 Answer


Yes, that’s right—the left part is removed.

You no longer have anything to condition on except previous parts of the sequence. Your autoregressive model predicts $p(x_i \mid x_{<i})$ instead of $p(x_i \mid x_{<i}, y)$ where $x$ is the target language sentence and $y$ is the source language sentence.
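As a concrete illustration, here is a minimal sketch of such a decoder-only language model. It assumes PyTorch, which the question does not mention, and names like `TinyCausalLM` are made up for illustration; the point is just that there is no encoder and no cross-attention, only a causal mask over the sequence itself.

```python
import torch
import torch.nn as nn

class TinyCausalLM(nn.Module):
    """Sketch of a decoder-only (autoregressive) Transformer language model."""
    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=2, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        # These "encoder" layers act as decoder blocks without cross-attention;
        # the causal mask below is what makes the model autoregressive.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, x):                       # x: (batch, seq_len) token ids
        seq_len = x.size(1)
        pos = torch.arange(seq_len, device=x.device)
        h = self.tok_emb(x) + self.pos_emb(pos)
        # Additive mask with -inf strictly above the diagonal, so position i
        # only attends to positions <= i: the model predicts p(x_i | x_{<i}),
        # with no source sentence y to condition on.
        causal_mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=x.device), diagonal=1
        )
        h = self.blocks(h, mask=causal_mask)
        return self.lm_head(h)                  # next-token logits at each position

logits = TinyCausalLM(vocab_size=1000)(torch.randint(0, 1000, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 1000])
```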


If you were to alter the diagram, the two arrows carrying keys and values from the encoder on the left into the decoder on the right would be removed, and that attention block would instead work like the masked multi-head attention below it on the diagram's right side. Its input is now the hidden state representing the earlier part of the sequence.
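To make that difference concrete, here is a small sketch (assuming PyTorch's `nn.MultiheadAttention`; the tensors are random stand-ins) contrasting the translation case, where keys and values come from the encoder output, with the language-modeling case, where they come from the decoder's own hidden states under a causal mask.

```python
import torch
import torch.nn as nn

d_model, n_heads = 64, 4
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

dec_h = torch.randn(1, 5, d_model)   # decoder hidden states (target sequence so far)
enc_h = torch.randn(1, 7, d_model)   # encoder output (only exists in translation)

# Translation (encoder-decoder): queries from the decoder,
# keys/values from the encoder output.
cross_out, _ = attn(query=dec_h, key=enc_h, value=enc_h)

# Language modeling (decoder-only): keys/values come from the same sequence,
# with a causal mask so each position only sees earlier positions.
seq_len = dec_h.size(1)
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
self_out, _ = attn(query=dec_h, key=dec_h, value=dec_h, attn_mask=causal_mask)
```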

Arya McCarthy
  • Thanks a lot! You mentioned that the two arrows going from the left to the right would be removed and replaced. Just to confirm, do you mean that the Keys and Values are generated by the multi-head on the lower right instead? – Matthew Yang Jun 13 '21 at 02:10
  • Yes, that’s right. It works identically to the section below it in the diagram. – Arya McCarthy Jun 13 '21 at 02:12