
I am trying to understand the Transformer model from *Attention Is All You Need*, following the Annotated Transformer.

The architecture looks like this:

[figure: Transformer architecture diagram]

Everything is essentially clear, save for the output embedding on the bottom right. During training, I understand that one can use the actual target as input - all one needs to do is

  • shift the target by one position to the right
  • use a mask to prevent using - say - the $n+k$-th word from the output to predict the $n$-th one (a minimal sketch of such a mask follows below)
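For concreteness, here is a minimal sketch of such a mask in PyTorch (in the spirit of the `subsequent_mask` helper from the Annotated Transformer; the exact name and layout here are mine):

```python
import torch

def subsequent_mask(size):
    """Boolean mask: position i may attend only to positions <= i."""
    # True above the diagonal marks "future" positions, which are then forbidden.
    future = torch.triu(torch.ones(size, size), diagonal=1).bool()
    return ~future

print(subsequent_mask(4))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```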

What is not clear to me is how to use the model at inference time. When doing inference, one of course does not have the output - what goes there?

Andrea Ferretti

1 Answer


A popular method for such sequence-generation tasks is beam search. It keeps the K best partial sequences generated so far as candidate outputs.

In the original paper, different beam sizes were used for different tasks. With a beam size of K=1, it reduces to the greedy method described in the blog post you mentioned.
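To make the K=1 (greedy) case concrete, here is a rough sketch of the inference loop, assuming a model that exposes `encode`, `decode`, and `generator` in the style of the Annotated Transformer (treat the names and signatures as illustrative, not as the exact API):

```python
import torch

def greedy_decode(model, src, src_mask, max_len, start_symbol):
    """Generate one token at a time, always taking the most probable next token."""
    memory = model.encode(src, src_mask)                     # encoder runs once
    ys = torch.full((1, 1), start_symbol, dtype=torch.long)  # start token
    for _ in range(max_len - 1):
        n = ys.size(1)
        tgt_mask = ~torch.triu(torch.ones(n, n), diagonal=1).bool()  # causal mask
        out = model.decode(memory, src_mask, ys, tgt_mask)
        log_probs = model.generator(out[:, -1])              # next-token distribution
        next_token = log_probs.argmax(dim=-1, keepdim=True)
        ys = torch.cat([ys, next_token], dim=1)              # feed the prediction back in
    return ys
```

Beam search follows the same loop but keeps the K highest-scoring partial sequences at each step instead of a single one.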

dontloo
  • Yeah, I saw it mentioned in the blog post, but without much explanation. If I understand correctly, the greedy method amounts to starting from any random output (all 1s seem to be used in the post), using that to generate the first token, then repeating with the first token to generate the second token, and so on. If so, it seems that - unlike in training - decoding at inference time happens in a loop, one token at a time, not unlike what happens with RNNs. Am I understanding correctly? – Andrea Ferretti Nov 12 '18 at 16:37
  • Beam search would amount to using the same strategy, but keeping the best k tokens at each step, if I understand correctly. Again, while training can exploit parallelization, it seems that decoding would have to happen serially. Please tell me if I am missing something. – Andrea Ferretti Nov 12 '18 at 16:39
  • @AndreaFerretti Yes! It has to be done step by step; to generate the first token, we usually initialize the output with a fixed `start_symbol`, like all ones. – dontloo Nov 13 '18 at 02:40
  • Thank you, everything is clear. I was misled by the remarks I heard that the transformer is more hardware-friendly than usual RNN architectures, due to the fact that it avoids the sequential loop; in fact that works at training time, just not for inference. – Andrea Ferretti Nov 13 '18 at 08:52
  • @AndreaFerretti Yeah, actually there are two tasks at inference time, evaluation and decoding (like [HMMs](http://jedlik.phy.bme.hu/~gerjanos/HMM/node6.html)); for the decoding task I believe we need the loop, while for the evaluation task the transformer model has better parallelism than RNNs. – dontloo Nov 13 '18 at 10:32
  • @dontloo Can you explain a little more about the difference between evaluation and decoding for the transformer? Surely, in order to evaluate the performance of the Transformer, you have to encode the input and then decode the output? – jonathanking Sep 24 '19 at 11:51
  • @jonathanking Hi, by decoding I meant generating sequences; by evaluation I meant computing the probability of a given sequence. You may take a look at how it is defined in the context of HMMs: http://jedlik.phy.bme.hu/~gerjanos/HMM/node6.html – dontloo Sep 25 '19 at 09:15
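To illustrate the distinction between the two tasks: scoring a *given* target sequence needs only one parallel, teacher-forced decoder pass, whereas generating a sequence needs the token-by-token loop above. A hedged sketch, assuming the same `encode`/`decode`/`generator` interface as before:

```python
import torch

def sequence_log_prob(model, src, src_mask, tgt):
    """Score a known target sequence with a single parallel decoder pass."""
    memory = model.encode(src, src_mask)
    inp, gold = tgt[:, :-1], tgt[:, 1:]          # shifted decoder input / expected next tokens
    n = inp.size(1)
    tgt_mask = ~torch.triu(torch.ones(n, n), diagonal=1).bool()
    out = model.decode(memory, src_mask, inp, tgt_mask)
    log_probs = model.generator(out)             # (batch, n, vocab) log-probabilities
    token_lp = log_probs.gather(-1, gold.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum()                        # log p(tgt | src)
```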