Methods for aggregating a set of feature vectors into a single feature vector relevant to a context.
Questions tagged [attention]
85 questions
158
votes
9 answers
What exactly are keys, queries, and values in attention mechanisms?
How should one understand the keys, queries, and values that are often mentioned in attention mechanisms?
I've tried searching online, but all the resources I find only speak of them as if the reader already knows what they are.
Judging by the paper…

Sean
- 2,184
- 2
- 9
- 22
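For the question above, a minimal sketch of single-head scaled dot-product attention may make the three roles concrete. All names, shapes, and the use of PyTorch below are illustrative assumptions, not taken from any particular answer.

```python
# Illustrative sketch of single-head scaled dot-product attention.
import torch
import torch.nn.functional as F

d_model = d_k = 16
x = torch.randn(5, d_model)          # 5 input tokens, each a d_model-dim feature vector

W_q = torch.randn(d_model, d_k)      # projects each token into a "query" (what am I looking for?)
W_k = torch.randn(d_model, d_k)      # projects each token into a "key"   (what do I contain?)
W_v = torch.randn(d_model, d_k)      # projects each token into a "value" (what do I contribute?)

Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Every query is compared against every key; the softmax turns those scores into
# weights that decide how much of each value flows into the output at that position.
scores = Q @ K.T / d_k ** 0.5        # (5, 5)
weights = F.softmax(scores, dim=-1)  # rows sum to 1
output = weights @ V                 # (5, d_k) context vectors
```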
31
votes
1 answer
What are attention mechanisms exactly?
Attention mechanisms have been used in various Deep Learning papers in the last few years. Ilya Sutskever, head of research at OpenAI, has enthusiastically praised them:
https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0
Eugenio…

DeltaIV
- 15,894
- 4
- 62
- 104
9
votes
3 answers
What stops the network from learning the same weights in the multi-head attention mechanism?
I have been trying to understand the transformer network, and specifically the multi-head attention bit. As I understand it, multiple attention-weighted linear combinations of the input features are calculated.
My question is what stops the…

Luca
- 4,410
- 3
- 30
- 52
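For the question above, one detail a sketch makes visible is that each head has its own randomly initialized projection matrices; nothing ties them together. The per-head layout below is written for readability only (real implementations usually fuse the projections into one big matrix), and all sizes are assumptions.

```python
# Hypothetical per-head layout: each head owns its own randomly initialized projections,
# so the heads start out (and usually stay) different.
import torch
import torch.nn as nn

d_model, n_heads = 32, 4
d_head = d_model // n_heads

heads = nn.ModuleList([
    nn.ModuleDict({
        "q": nn.Linear(d_model, d_head, bias=False),
        "k": nn.Linear(d_model, d_head, bias=False),
        "v": nn.Linear(d_model, d_head, bias=False),
    })
    for _ in range(n_heads)
])

x = torch.randn(10, d_model)                        # 10 tokens
outputs = []
for head in heads:
    q, k, v = head["q"](x), head["k"](x), head["v"](x)
    attn = torch.softmax(q @ k.T / d_head ** 0.5, dim=-1)
    outputs.append(attn @ v)

out = torch.cat(outputs, dim=-1)                    # concatenate heads back to (10, d_model)
```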
7
votes
1 answer
On masked multi-head attention and layer normalization in the transformer model
I came to read Attention Is All You Need by Vaswani et al. Two questions came up for me:
1. How is it possible to mask out illegal connections in the decoder multi-head attention?
It says that by setting something to negative infinity, they could prevent…

ChosunTequilla
- 73
- 1
- 4
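A small sketch of the masking trick the excerpt refers to: scores at positions a query must not attend to are set to negative infinity before the softmax, so they receive (numerically) zero weight. The causal upper-triangular mask and the PyTorch code here are an illustrative assumption, not the paper's code.

```python
# Sketch of masking with -inf before the softmax.
import torch

T = 5
scores = torch.randn(T, T)                                        # raw query-key scores
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))           # block "future" positions
weights = torch.softmax(scores, dim=-1)                           # masked entries become 0
```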
7
votes
1 answer
How to use the transformer for inference
I am trying to understand the transformer model from Attention Is All You Need, following the Annotated Transformer.
The architecture looks like this:
Everything is essentially clear, save for the output embedding on the bottom right. While…

Andrea Ferretti
- 173
- 1
- 5
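For the inference question above, a hedged sketch of greedy autoregressive decoding: at test time the "output embedding" input is simply the tokens generated so far, fed back in one step at a time. `model.encode`, `model.decode`, `bos_id`, and `eos_id` are hypothetical placeholders, not the Annotated Transformer's actual API.

```python
# Hypothetical greedy decoding loop; the model interface is a placeholder.
import torch

def greedy_decode(model, src, bos_id, eos_id, max_len=50):
    memory = model.encode(src)                        # encoder runs once on the source
    ys = torch.tensor([[bos_id]])                     # decoder input starts as just <bos>
    for _ in range(max_len):
        logits = model.decode(ys, memory)             # "output embedding" = tokens so far
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        ys = torch.cat([ys, next_id], dim=1)          # feed the new token back in
        if next_id.item() == eos_id:
            break
    return ys
```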
6
votes
1 answer
Why are residual connections needed in transformer architectures?
Residual connections are often motivated by the fact that very deep neural networks tend to "forget" some features of their input dataset samples during training.
This problem is circumvented by adding the input x to the result of a typical…

Ramiro Hum-Sah
- 163
- 4
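A minimal sketch of the residual (skip) connection the excerpt describes. The LayerNorm placement follows the original post-norm Transformer block, but variants differ; the class below is illustrative only. Because the block outputs x plus a learned correction, the input features have a direct path through the network.

```python
# Minimal sketch of a residual connection around a sublayer.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, sublayer: nn.Module, d_model: int):
        super().__init__()
        self.sublayer = sublayer                  # must map d_model -> d_model
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # The input is added back to the sublayer's output, so the block only has to
        # learn a correction to x instead of reconstructing x from scratch.
        return self.norm(x + self.sublayer(x))
```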
6
votes
1 answer
Does attention help with standard auto-encoders?
I understand the use of attention mechanisms in the encoder-decoder for sequence-to-sequence problems such as language translation.
I am just trying to figure out whether it is possible to use attention mechanisms with standard auto-encoders for…

Amhs_11
- 173
- 8
5
votes
1 answer
Deciding between Decoder-only or Encoder-only Transformers (BERT, GPT)
I just started learning about transformers and looked into the following 3 variants:
The original one from Attention Is All You Need (Encoder & Decoder)
BERT (Encoder only)
GPT-2 (Decoder only)
How does one generally decide whether their…

Athena Wisdom
- 159
- 4
5
votes
1 answer
Why are K and V not the same in Transformer attention?
My understanding is that for a translation task K should be the same as V, but in the Transformer K and V are generated by two different (randomly initialized) matrices $W^K$ and $W^V$, and are therefore not the same. Can anyone tell me why?

eric2323223
- 277
- 1
- 3
- 14
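A sketch that may clarify the question above: K and V are produced from the same tokens, but through different learned projections, because they play different roles (keys are only compared against queries; values are the content that actually gets mixed). Shapes and names below are assumptions.

```python
# Sketch: K and V come from the same tokens via different learned projections.
import torch

d_model = 16
x = torch.randn(7, d_model)                 # the same 7 tokens feed all three projections

W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)         # keys: used only to compute matching scores
W_v = torch.randn(d_model, d_model)         # values: the content that actually gets mixed

Q, K, V = x @ W_q, x @ W_k, x @ W_v
weights = torch.softmax(Q @ K.T / d_model ** 0.5, dim=-1)
context = weights @ V                       # weights decided by K, content supplied by V
```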
5
votes
1 answer
Attention methods
When using attention, for example with an LSTM (but not necessarily), one can use the following methods to attend:
MLP: $ug(W^1v+W^2q)$
dot product: $v \cdot q$
biaffine transform: $v^TWq$
($v$ is the attended vector which is used for prediction, $q$…

Jjang
- 201
- 1
- 7
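A small sketch of the three scoring functions listed in the question, for single vectors $v$ and $q$; the nonlinearity $g$ is taken to be tanh here purely as an illustrative assumption.

```python
# The three scoring functions from the question, for single vectors v and q.
import torch

d = 8
v, q = torch.randn(d), torch.randn(d)

# MLP / additive scoring: u g(W1 v + W2 q), with g assumed to be tanh
W1, W2, u = torch.randn(d, d), torch.randn(d, d), torch.randn(d)
mlp_score = u @ torch.tanh(W1 @ v + W2 @ q)

# Dot-product scoring: v . q
dot_score = v @ q

# Biaffine / bilinear scoring: v^T W q
W = torch.randn(d, d)
biaffine_score = v @ W @ q
```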
4
votes
1 answer
When calculating self-attention for Transformer ML architectures, why do we need both a key and a query weight matrix?
I'm trying to understand the math behind Transformers, specifically self-attention. This link, and many others, gives the formula to compute the output vectors from the input embeddings…

itrase
- 43
- 3
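One commonly given argument, sketched here as an illustration rather than as the linked post's derivation: if queries and keys shared a single projection, every self-attention score matrix would be symmetric, whereas separate $W^Q$ and $W^K$ allow token i to attend to token j differently than j attends to i.

```python
# Illustration: a shared projection forces symmetric self-attention scores.
import torch

n, d = 6, 4
X = torch.randn(n, d)

W_shared = torch.randn(d, d)
A = X @ W_shared
scores_shared = A @ A.T                                  # score(i, j) == score(j, i) always
print(torch.allclose(scores_shared, scores_shared.T))    # True

W_q, W_k = torch.randn(d, d), torch.randn(d, d)
scores_qk = (X @ W_q) @ (X @ W_k).T                      # generally not symmetric
print(torch.allclose(scores_qk, scores_qk.T))            # False (with overwhelming probability)
```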
4
votes
2 answers
Why do attention models need to choose a maximum sentence length?
I was going through the seq2seq translation tutorial for PyTorch and found the following sentence:
Because there are sentences of all sizes in the training data, to actually create and train this layer we have to choose a maximum sentence length…

Charlie Parker
- 5,836
- 11
- 57
- 113
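A hedged sketch of why a fixed maximum length shows up in that tutorial's attention, assuming (as the quoted sentence suggests) that the decoder computes its attention weights with a linear layer whose output size is max_length, so encoder outputs must be padded or truncated to that many positions. Sizes and names below are illustrative, not the tutorial verbatim.

```python
# Tutorial-style decoder attention whose weight layer has a fixed output size of max_length.
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size, max_length = 32, 10

attn = nn.Linear(hidden_size * 2, max_length)            # produces exactly max_length scores

decoder_hidden = torch.randn(1, hidden_size)
prev_embedded = torch.randn(1, hidden_size)
encoder_outputs = torch.zeros(max_length, hidden_size)   # padded to max_length positions
encoder_outputs[:7] = torch.randn(7, hidden_size)        # a 7-token source sentence

weights = F.softmax(attn(torch.cat([prev_embedded, decoder_hidden], dim=1)), dim=1)
context = weights @ encoder_outputs                      # (1, hidden_size) context vector
```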
4
votes
1 answer
Can attention be implemented without encoder / decoder?
I just got into models beyond the biLSTM and would like to start by applying attention to my existing network (an RNN). The examples of attention I find always use an encoder-decoder architecture; is it possible to use attention without an encoder-decoder?…

xyz
- 141
- 3
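A sketch of one way to do this for the question above: attention used as a pooling layer over the hidden states of an existing biLSTM classifier, with no encoder-decoder at all. Module names and sizes are illustrative assumptions.

```python
# Attention as a pooling layer over biLSTM hidden states, no encoder-decoder.
import torch
import torch.nn as nn

class AttnPoolClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, n_classes):
        super().__init__()
        self.rnn = nn.LSTM(input_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.score = nn.Linear(2 * hidden_dim, 1)         # one relevance score per time step
        self.out = nn.Linear(2 * hidden_dim, n_classes)

    def forward(self, x):                                 # x: (batch, time, input_dim)
        h, _ = self.rnn(x)                                # (batch, time, 2*hidden_dim)
        weights = torch.softmax(self.score(h), dim=1)     # (batch, time, 1)
        pooled = (weights * h).sum(dim=1)                 # attention-weighted average of states
        return self.out(pooled)
```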
3
votes
1 answer
Operation modes in the Neural Turing Machine (Graves, 2014)
I am reading the paper "Neural Turing Machines" by Alex Graves (2014), and there are two points that are unclear to me. I would be very grateful if someone could help me out.
More specifically, my questions are about the last step performed by the…

Alf
- 75
- 4
3
votes
0 answers
Why is PyTorch MultiheadAttention considered an activation function?
When I scroll through all the activation functions available in the PyTorch package (here), I find that nn.MultiheadAttention is described there. Can you please explain why it's considered an activation function? Maybe I am misunderstanding something, but Multihead…

demo
- 131
- 1
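A short usage sketch for the question above, showing that nn.MultiheadAttention is a full learned layer with its own projection weights rather than a pointwise nonlinearity like ReLU or tanh; why the PyTorch docs group it on that page is a documentation choice this sketch does not settle.

```python
# Minimal use of nn.MultiheadAttention as a self-attention layer.
import torch
import torch.nn as nn

embed_dim, num_heads, seq_len, batch = 16, 4, 5, 2
mha = nn.MultiheadAttention(embed_dim, num_heads)   # holds learnable in/out projections

x = torch.randn(seq_len, batch, embed_dim)          # default layout: (seq, batch, embed)
attn_output, attn_weights = mha(x, x, x)            # self-attention: query = key = value
print(attn_output.shape, attn_weights.shape)        # (5, 2, 16) and (2, 5, 5)
```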