
Title is the question.

The papers I've read, e.g., "Attention is All You Need", fail to specify exactly what word embeddings are used in these machine translation networks. In most cases they mention the dimensionality, e.g., 512, but they don't specify exactly how tokens are mapped into input vectors.

Note: I'm not asking about the general features of constructing a word embedding, I'm asking which specific word embedding is used in this paper, or subsequent ones that refine the approach.

Dave

1 Answer


The embeddings are trained jointly with the rest of the network. In the beginning, the embeddings are initialized randomly and the error gets back-propagated through the entire network down to the embeddings.
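Here is a minimal PyTorch sketch (not the authors' code; the vocabulary size and the tiny stand-in network are made up for illustration) showing that the token embeddings are just another trainable parameter: they start random and receive gradients through ordinary back-propagation.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 32000, 512    # 512 is the dimensionality quoted in the paper

embedding = nn.Embedding(vocab_size, d_model)   # randomly initialized lookup table
model = nn.Sequential(                          # stand-in for the Transformer body
    nn.Linear(d_model, d_model), nn.ReLU(),
    nn.Linear(d_model, vocab_size),
)

tokens = torch.randint(0, vocab_size, (8, 16))  # a batch of token ids
logits = model(embedding(tokens))               # embeddings feed the network
loss = nn.functional.cross_entropy(logits.view(-1, vocab_size), tokens.view(-1))
loss.backward()                                 # gradients flow back into embedding.weight
```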

When you train the embeddings jointly with the rest of the model, a common problem is that the embeddings of rare words only get updated once in a while and tend to drift out of sync with the rest of the model. The Transformer tries to avoid this problem by:

  • Using a not-too-large sub-word vocabulary in which infrequent words are split into smaller units that appear frequently enough.
  • Sharing the embedding between the encoder and the decoder.
  • Reusing the embeddings as the parameters of the final output layer (i.e., the classification can then be interpreted as measuring a sort of dot-product similarity between the output state and the embeddings); see the sketch after this list.
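Below is a hedged sketch of the last two points (class and variable names are my own, not the reference implementation): a single embedding matrix serves the encoder input, the decoder input, and the output projection, so the final logits are dot products between the decoder state and every embedding vector.

```python
import torch
import torch.nn as nn

class TiedTransformerLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # shared by encoder and decoder
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)

    def forward(self, src_ids, tgt_ids):
        src = self.embed(src_ids)            # encoder input embeddings
        tgt = self.embed(tgt_ids)            # decoder input embeddings (same matrix)
        hidden = self.transformer(src, tgt)  # (batch, tgt_len, d_model)
        # Output layer reuses the embedding weights:
        # logits[b, t, v] = hidden[b, t] · embed.weight[v]
        return hidden @ self.embed.weight.T  # (batch, tgt_len, vocab_size)

model = TiedTransformerLM()
logits = model(torch.randint(0, 32000, (2, 10)), torch.randint(0, 32000, (2, 9)))
```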
Jindřich