The title is the question.
The papers I've read, e.g. "Attention is All You Need", fail to specify exactly which word embeddings are used in these machine translation networks. In most cases they mention the dimensionality, e.g. 512, but they don't specify exactly how tokens are mapped into input vectors.
Note: I'm not asking about the general features of constructing a word embedding; I'm asking which specific word embedding is used in this paper, or in subsequent ones that refine the approach.
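
For concreteness, here is a minimal sketch of the kind of mapping I mean, using a plain learned lookup table in PyTorch. The vocabulary size is hypothetical, and I'm not claiming this is what the paper actually does; that's exactly what I'm asking.

```python
# Sketch only: a generic learned token-to-vector lookup, NOT necessarily
# the embedding used in "Attention is All You Need".
import torch
import torch.nn as nn

vocab_size = 37000   # hypothetical subword vocabulary size
d_model = 512        # the dimensionality the paper reports

embedding = nn.Embedding(vocab_size, d_model)  # one learned 512-dim vector per token id

token_ids = torch.tensor([[5, 112, 7]])        # a toy batch with three token ids
input_vectors = embedding(token_ids)           # shape: (1, 3, 512)
print(input_vectors.shape)
```

Is the mapping something like this, trained jointly with the rest of the network, or is it a pretrained embedding (word2vec, GloVe, etc.) taken from elsewhere?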