Methods for aggregating a set of feature vectors into a single feature vector relevant to a context.
Questions tagged [attention]
85 questions
158
votes
9 answers
What exactly are keys, queries, and values in attention mechanisms?
How should one understand the keys, queries, and values that are often mentioned in attention mechanisms?
I've tried searching online, but all the resources I find only speak of them as if the reader already knows what they are.
Judging by the paper…

Sean
- 2,184
- 2
- 9
- 22
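For the question above, a minimal sketch of single-head scaled dot-product attention may make the three roles concrete. All names, shapes, and the use of PyTorch below are illustrative assumptions, not taken from any particular answer.

```python
# Illustrative sketch of single-head scaled dot-product attention.
import torch
import torch.nn.functional as F

d_model = d_k = 16
x = torch.randn(5, d_model)          # 5 input tokens, each a d_model-dim feature vector

W_q = torch.randn(d_model, d_k)      # projects each token into a "query" (what am I looking for?)
W_k = torch.randn(d_model, d_k)      # projects each token into a "key"   (what do I contain?)
W_v = torch.randn(d_model, d_k)      # projects each token into a "value" (what do I contribute?)

Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Every query is compared against every key; the softmax turns those scores into
# weights that decide how much of each value flows into the output at that position.
scores = Q @ K.T / d_k ** 0.5        # (5, 5)
weights = F.softmax(scores, dim=-1)  # rows sum to 1
output = weights @ V                 # (5, d_k) context vectors
```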
31
votes
1 answer
What are attention mechanisms exactly?
Attention mechanisms have been used in various Deep Learning papers in the last few years. Ilya Sutskever, head of research at OpenAI, has enthusiastically praised them:
https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0
Eugenio…

DeltaIV
- 15,894
- 4
- 62
- 104
9
votes
3 answers
What stops the network from learning the same weights in the multi-head attention mechanism?
I have been trying to understand the transformer network, and specifically the multi-head attention bit. As I understand it, multiple attention-weighted linear combinations of the input features are calculated.
My question is what stops the…

Luca
- 4,410
- 3
- 30
- 52
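For the question above, one detail a sketch makes visible is that each head has its own randomly initialized projection matrices; nothing ties them together. The per-head layout below is written for readability only (real implementations usually fuse the projections into one big matrix), and all sizes are assumptions.

```python
# Hypothetical per-head layout: each head owns its own randomly initialized projections,
# so the heads start out (and usually stay) different.
import torch
import torch.nn as nn

d_model, n_heads = 32, 4
d_head = d_model // n_heads

heads = nn.ModuleList([
    nn.ModuleDict({
        "q": nn.Linear(d_model, d_head, bias=False),
        "k": nn.Linear(d_model, d_head, bias=False),
        "v": nn.Linear(d_model, d_head, bias=False),
    })
    for _ in range(n_heads)
])

x = torch.randn(10, d_model)                        # 10 tokens
outputs = []
for head in heads:
    q, k, v = head["q"](x), head["k"](x), head["v"](x)
    attn = torch.softmax(q @ k.T / d_head ** 0.5, dim=-1)
    outputs.append(attn @ v)

out = torch.cat(outputs, dim=-1)                    # concatenate heads back to (10, d_model)
```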
7
votes
1 answer
On masked multi-head attention and layer normalization in the transformer model
I came to read Attention Is All You Need by Vaswani et al. Two questions came up for me:
1. How is it possible to mask out illegal connections in the decoder multi-head attention?
It says that by setting something to negative infinity, they could prevent…

ChosunTequilla
- 73
- 1
- 4
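A small sketch of the masking trick the excerpt refers to: scores at positions a query must not attend to are set to negative infinity before the softmax, so they receive (numerically) zero weight. The causal upper-triangular mask and the PyTorch code here are an illustrative assumption, not the paper's code.

```python
# Sketch of masking with -inf before the softmax.
import torch

T = 5
scores = torch.randn(T, T)                                        # raw query-key scores
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))           # block "future" positions
weights = torch.softmax(scores, dim=-1)                           # masked entries become 0
```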
7
votes
1 answer
How to use the transformer for inference
I am trying to understand the transformer model from Attention Is All You Need, following the Annotated Transformer.
The architecture looks like this:
Everything is essentially clear, save for the output embedding on the bottom right. While…

Andrea Ferretti
- 173
- 1
- 5
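For the inference question above, a hedged sketch of greedy autoregressive decoding: at test time the "output embedding" input is simply the tokens generated so far, fed back in one step at a time. `model.encode`, `model.decode`, `bos_id`, and `eos_id` are hypothetical placeholders, not the Annotated Transformer's actual API.

```python
# Hypothetical greedy decoding loop; the model interface is a placeholder.
import torch

def greedy_decode(model, src, bos_id, eos_id, max_len=50):
    memory = model.encode(src)                        # encoder runs once on the source
    ys = torch.tensor([[bos_id]])                     # decoder input starts as just <bos>
    for _ in range(max_len):
        logits = model.decode(ys, memory)             # "output embedding" = tokens so far
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        ys = torch.cat([ys, next_id], dim=1)          # feed the new token back in
        if next_id.item() == eos_id:
            break
    return ys
```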
6
votes
1 answer
Why are residual connections needed in transformer architectures?
Residual connections are often motivated by the fact that very deep neural networks tend to "forget" some features of their input dataset samples during training.
This problem is circumvented by adding the input x to the result of a typical…

Ramiro Hum-Sah
- 163
- 4
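A minimal sketch of the residual (skip) connection the excerpt describes. The LayerNorm placement follows the original post-norm Transformer block, but variants differ; the class below is illustrative only. Because the block outputs x plus a learned correction, the input features have a direct path through the network.

```python
# Minimal sketch of a residual connection around a sublayer.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, sublayer: nn.Module, d_model: int):
        super().__init__()
        self.sublayer = sublayer                  # must map d_model -> d_model
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # The input is added back to the sublayer's output, so the block only has to
        # learn a correction to x instead of reconstructing x from scratch.
        return self.norm(x + self.sublayer(x))
```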
6
votes
1 answer
Does attention help with standard auto-encoders?
I understand the use of attention mechanisms in the encoder-decoder for sequence-to-sequence problems such as language translation.
I am just trying to figure out whether it is possible to use attention mechanisms with standard auto-encoders for…

Amhs_11
- 173
- 8
5
votes
1 answer
Deciding between Decoder-only or Encoder-only Transformers (BERT, GPT)
I just started learning about transformers and looked into the following 3 variants:
The original one from Attention Is All You Need (Encoder & Decoder)
BERT (Encoder only)
GPT-2 (Decoder only)
How does one generally decide whether their…

Athena Wisdom
- 159
- 4
5
votes
1 answer
Why are K and V not the same in Transformer attention?
My understanding is that for a translation task K should be the same as V, but in the Transformer K and V are generated by two different (randomly initialized) matrices $W^K$ and $W^V$, and are therefore not the same. Can anyone tell me why?

eric2323223
- 277
- 1
- 3
- 14
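A sketch that may clarify the question above: K and V are produced from the same tokens, but through different learned projections, because they play different roles (keys are only compared against queries; values are the content that actually gets mixed). Shapes and names below are assumptions.

```python
# Sketch: K and V come from the same tokens via different learned projections.
import torch

d_model = 16
x = torch.randn(7, d_model)                 # the same 7 tokens feed all three projections

W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)         # keys: used only to compute matching scores
W_v = torch.randn(d_model, d_model)         # values: the content that actually gets mixed

Q, K, V = x @ W_q, x @ W_k, x @ W_v
weights = torch.softmax(Q @ K.T / d_model ** 0.5, dim=-1)
context = weights @ V                       # weights decided by K, content supplied by V
```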
5
votes
1 answer
Attention methods
When using attention, for example with an LSTM (but not necessarily), one can use the following methods to attend:
MLP: $ug(W^1v+W^2q)$
dot product: $v \cdot q$
biaffine transform: $v^TWq$
($v$ is the attended vector which is used for prediction, $q$…

Jjang
- 201
- 1
- 7
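A small sketch of the three scoring functions listed in the question, for single vectors $v$ and $q$; the nonlinearity $g$ is taken to be tanh here purely as an illustrative assumption.

```python
# The three scoring functions from the question, for single vectors v and q.
import torch

d = 8
v, q = torch.randn(d), torch.randn(d)

# MLP / additive scoring: u g(W1 v + W2 q), with g assumed to be tanh
W1, W2, u = torch.randn(d, d), torch.randn(d, d), torch.randn(d)
mlp_score = u @ torch.tanh(W1 @ v + W2 @ q)

# Dot-product scoring: v . q
dot_score = v @ q

# Biaffine / bilinear scoring: v^T W q
W = torch.randn(d, d)
biaffine_score = v @ W @ q
```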
4
votes
1 answer
When calculating self-attention for Transformer ML architectures, why do we need both a key and a query weight matrix?
I'm trying to understand the math behind Transformers, specifically self-attention. This link, and many others, gives the formula to compute the output vectors from the input embeddings…

itrase
- 43
- 3
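One commonly given argument, sketched here as an illustration rather than as the linked post's derivation: if queries and keys shared a single projection, every self-attention score matrix would be symmetric, whereas separate $W^Q$ and $W^K$ allow token i to attend to token j differently than j attends to i.

```python
# Illustration: a shared projection forces symmetric self-attention scores.
import torch

n, d = 6, 4
X = torch.randn(n, d)

W_shared = torch.randn(d, d)
A = X @ W_shared
scores_shared = A @ A.T                                  # score(i, j) == score(j, i) always
print(torch.allclose(scores_shared, scores_shared.T))    # True

W_q, W_k = torch.randn(d, d), torch.randn(d, d)
scores_qk = (X @ W_q) @ (X @ W_k).T                      # generally not symmetric
print(torch.allclose(scores_qk, scores_qk.T))            # False (with overwhelming probability)
```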
4
votes
2 answers
Why do attention models need to choose a maximum sentence length?
I was going through the seq2seq translation tutorial for PyTorch and found the following sentence:
Because there are sentences of all sizes in the training data, to actually create and train this layer we have to choose a maximum sentence length…

Charlie Parker
- 5,836
- 11
- 57
- 113
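A hedged sketch of why a fixed maximum length shows up in that tutorial's attention, assuming (as the quoted sentence suggests) that the decoder computes its attention weights with a linear layer whose output size is max_length, so encoder outputs must be padded or truncated to that many positions. Sizes and names below are illustrative, not the tutorial verbatim.

```python
# Tutorial-style decoder attention whose weight layer has a fixed output size of max_length.
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size, max_length = 32, 10

attn = nn.Linear(hidden_size * 2, max_length)            # produces exactly max_length scores

decoder_hidden = torch.randn(1, hidden_size)
prev_embedded = torch.randn(1, hidden_size)
encoder_outputs = torch.zeros(max_length, hidden_size)   # padded to max_length positions
encoder_outputs[:7] = torch.randn(7, hidden_size)        # a 7-token source sentence

weights = F.softmax(attn(torch.cat([prev_embedded, decoder_hidden], dim=1)), dim=1)
context = weights @ encoder_outputs                      # (1, hidden_size) context vector
```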
4
votes
1 answer
Can attention be implemented without encoder / decoder?
I just got into models beyond the biLSTM and would like to start by applying attention to my existing network (an RNN). The examples of attention I find always use an encoder-decoder architecture; is it possible to use attention without an encoder-decoder?…

xyz
- 141
- 3
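A sketch of one way to do this for the question above: attention used as a pooling layer over the hidden states of an existing biLSTM classifier, with no encoder-decoder at all. Module names and sizes are illustrative assumptions.

```python
# Attention as a pooling layer over biLSTM hidden states, no encoder-decoder.
import torch
import torch.nn as nn

class AttnPoolClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, n_classes):
        super().__init__()
        self.rnn = nn.LSTM(input_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.score = nn.Linear(2 * hidden_dim, 1)         # one relevance score per time step
        self.out = nn.Linear(2 * hidden_dim, n_classes)

    def forward(self, x):                                 # x: (batch, time, input_dim)
        h, _ = self.rnn(x)                                # (batch, time, 2*hidden_dim)
        weights = torch.softmax(self.score(h), dim=1)     # (batch, time, 1)
        pooled = (weights * h).sum(dim=1)                 # attention-weighted average of states
        return self.out(pooled)
```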
3
votes
1 answer
Operation modes in the Neural Turing Machine (Graves, 2014)
I am reading the paper "Neural Turing Machines" by Alex Graves (2014), and there are two points that are unclear to me. I would be very grateful if someone could help me out.
More specifically, my questions are about the last step performed by the…

Alf
- 75
- 4
3
votes
0 answers
Why is PyTorch MultiheadAttention considered an activation function?
When I scroll through all the activation functions available in the PyTorch package (here), I find that nn.MultiheadAttention is described there. Can you please explain why it's considered an activation function? Maybe I am misunderstanding something, but Multihead…

demo
- 131
- 1
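A short usage sketch for the question above, showing that nn.MultiheadAttention is a full learned layer with its own projection weights rather than a pointwise nonlinearity like ReLU or tanh; why the PyTorch docs group it on that page is a documentation choice this sketch does not settle.

```python
# Minimal use of nn.MultiheadAttention as a self-attention layer.
import torch
import torch.nn as nn

embed_dim, num_heads, seq_len, batch = 16, 4, 5, 2
mha = nn.MultiheadAttention(embed_dim, num_heads)   # holds learnable in/out projections

x = torch.randn(seq_len, batch, embed_dim)          # default layout: (seq, batch, embed)
attn_output, attn_weights = mha(x, x, x)            # self-attention: query = key = value
print(attn_output.shape, attn_weights.shape)        # (5, 2, 16) and (2, 5, 5)
```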