
I was going through the seq2seq-translation tutorial on pytorch and found the following sentence:

Because there are sentences of all sizes in the training data, to actually create and train this layer we have to choose a maximum sentence length (input length, for encoder outputs) that it can apply to. Sentences of the maximum length will use all the attention weights, while shorter sentences will only use the first few.

which didn't really make sense to me. My understanding (based on the Pointer Network paper) is that attention at time step $t$ is computed as follows:

$$ u^{<t,j>} = v^\top \tanh( W_1 e_j + W_2 d_t ) = NN_u(e_j, d_t) $$
$$ \alpha^{<t,j>} = \mathrm{softmax}( u^{<t,j>} ) = \frac{\exp(u^{<t,j>})}{Z^{<t>}} = \frac{\exp(u^{<t,j>})}{\sum^{T_x}_{k=1} \exp( u^{<t,k>} )} $$
$$ d'_{<t+1>} = \sum^{T_x}_{j=1} \alpha^{<t,j>} e_j $$

which basically means that a specific attention weight does not depend on the length of the encoder sequence (i.e., the number of encoder steps $T_x$ can vary and the equations above are unaffected).
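To make that concrete, here is a minimal PyTorch sketch of the additive (Pointer-Network-style) attention above; the names (`hidden_size`, `W1`, `W2`, `v`, `attention`) are mine, not from the tutorial, and the same parameters handle any $T_x$:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of the additive attention in the equations above.
# All names are illustrative, not taken from the tutorial.
hidden_size = 8
W1 = nn.Linear(hidden_size, hidden_size, bias=False)  # applied to encoder states e_j
W2 = nn.Linear(hidden_size, hidden_size, bias=False)  # applied to decoder state d_t
v = nn.Linear(hidden_size, 1, bias=False)             # plays the role of v^T

def attention(encoder_states, decoder_state):
    # encoder_states: (T_x, hidden_size) -- T_x can be anything
    # decoder_state:  (hidden_size,)
    u = v(torch.tanh(W1(encoder_states) + W2(decoder_state))).squeeze(-1)  # (T_x,)
    alpha = F.softmax(u, dim=0)                                            # (T_x,)
    context = (alpha.unsqueeze(-1) * encoder_states).sum(dim=0)            # (hidden_size,)
    return alpha, context

# The same parameters work for T_x = 5 and T_x = 50:
for T_x in (5, 50):
    e = torch.randn(T_x, hidden_size)
    d = torch.randn(hidden_size)
    alpha, ctx = attention(e, d)
    print(T_x, alpha.shape, ctx.shape)
```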

If that is true, then why does the tutorial insist on a maximum sentence length?

They also say:

There are other forms of attention that work around the length limitation by using a relative position approach. Read about “local attention” in Effective Approaches to Attention-based Neural Machine Translation.

which also confused me. Any clarification?


Perhaps related:

https://discuss.pytorch.org/t/attentiondecoderrnn-without-max-length/13473


Crossposted:

https://discuss.pytorch.org/t/why-do-attention-models-need-to-choose-a-maximum-sentence-length/47201

https://www.reddit.com/r/deeplearning/comments/bxbypj/why_do_attention_models_need_to_choose_a_maximum/?

Charlie Parker
  • yeah, that doesn't make any sense to me - the tutorial is wrong – shimao Jun 06 '19 at 01:47
  • @shimao in the tutorial they reference this paper...I wonder if it might be useful to read it... – Charlie Parker Jun 06 '19 at 03:52
  • i'm familiar with the paper, but i don't think it has anything to do with the tutorial – shimao Jun 06 '19 at 04:08
  • "normal" attention isn't restricted to fixed length or bounded length sequences either. it's just that long sequences take more and more computation and memory. the author of this tutorial seems to have proposed a weird variant of attention which only works on bounded size sequences – shimao Jun 06 '19 at 04:09
  • @shimao weird he’d do that... the flexibility to handle inputs of any length is my favorite thing about attention! What exactly introduces this limitation and how can we remove it? – Charlie Parker Jun 06 '19 at 11:34
  • @shimao did you look at the current answer? What do you think? – Charlie Parker Jun 08 '19 at 00:01
  • the current answer is talking about attention mechanisms as they are normally implemented. as i've explained, the tutorial implements attention wrong, which is why they have a hard limit on the sentence length. – shimao Jun 08 '19 at 00:06

2 Answers


It is only an efficiency issue. In theory, the attention mechanism can work with arbitrarily long sequences. The maximum length exists because all sequences in a batch must be padded to the same length.

Sentences of the maximum length will use all the attention weights, while shorter sentences will only use the first few.

By this sentence they mean they want to avoid batches like this:

A B C D E F G H I K L M N O
P Q _ _ _ _ _ _ _ _ _ _ _ _
R S T U _ _ _ _ _ _ _ _ _ _
V W _ _ _ _ _ _ _ _ _ _ _ _ 

Because of the single long sequence, most of the memory is wasted on padding and contributes nothing to the weight updates.

A common strategy to avoid this problem (not included in the tutorial) is bucketing, i.e., having batches with an approximately constant number of words, but a different number of sequences in each batch, so the memory is used efficiently.
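For illustration, a rough sketch of what bucketing could look like (the function and the `max_tokens` parameter are made up for this example, not taken from the tutorial):

```python
# Rough sketch of length bucketing: sort sentences by length, then cut batches
# so that each batch contains roughly `max_tokens` padded tokens in total.
def bucket_batches(sentences, max_tokens=4096):
    batches, batch = [], []
    for sent in sorted(sentences, key=len):
        batch.append(sent)
        # padded size of this batch = number of sentences * longest sentence in it
        if len(batch) * len(batch[-1]) >= max_tokens:
            batches.append(batch)
            batch = []
    if batch:
        batches.append(batch)
    return batches
```

Short sentences end up grouped together in large batches and long sentences in small batches, so very little memory goes to padding.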

Jindřich
  • hmmm so what happens when the current input sentence is longer than maximum length? – Charlie Parker Jun 07 '19 at 21:42
  • most implementations just throw away such examples at training time (e.g., Tensor2Tensor, OpenNMT), some (Neural Monkey) crop the sentences to the maximum length – Jindřich Jun 17 '19 at 15:22

A "typical" attention mechanism might assign the weight $w_i$ to one of the source vectors as $w_i \propto \exp(u_i^Tv)$ where $u_i$ is the $i$th "source" vector and $v$ is the query vector. The attention mechanism described in OP from "Pointer Networks" opts for something slightly more involved: $w_i \propto \exp(q^T \tanh(W_1u_i + W_2v))$, but the main ideas are the same -- you can read my answer here for a more comprehensive exploration of different attention mechanisms.


The tutorial mentioned in the question appears to have the peculiar mechanism

$$w_i \propto \exp(a_i^Tv)$$

where $a_i$ is the $i$th row of a learned weight matrix $A$. I say that it is peculiar because the weight on the $i$th input element does not actually depend on any of the $u_i$ at all! In fact, we can view this mechanism as attention over word slots: how much attention to pay to the first word, the second word, the third word, etc., without any regard to which words occupy which slots.

Since $A$, a learned weight matrix, must be fixed in size, the number of word slots must also be fixed, which means the input sequence length must be capped at a fixed maximum (shorter inputs can be padded). Of course, this peculiar attention mechanism doesn't really make sense, so I wouldn't read too much into it.
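In code, the kind of slot mechanism described above looks roughly like the following sketch (made-up sizes, not the tutorial's exact module): the score for slot $i$ comes from a row of the linear layer's weight matrix rather than from the encoder output at that position, which is why `max_length` must be fixed up front.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of slot-based attention: a Linear layer maps the query straight to
# one score per position, so its output size (the number of slots) is fixed.
hidden_size, max_length = 8, 10
attn = nn.Linear(hidden_size, max_length)        # rows of its weight matrix play the role of a_i

query = torch.randn(1, hidden_size)              # the query vector v
weights = F.softmax(attn(query), dim=1)          # (1, max_length): one weight per slot
encoder_outputs = torch.randn(max_length, hidden_size)  # shorter inputs would have to be padded
context = weights @ encoder_outputs              # (1, hidden_size)
```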


Regarding length limitations in general: the only limitation to attention mechanisms is a soft one: longer sequences require more memory, and memory usage scales quadratically with sequence length (compare this to linear memory usage for vanilla RNNs).

I skimmed the "Effective Approaches to Attention-based Neural Machine Translation" paper mentioned in the question, and from what I can tell they propose a two-stage attention mechanism: the decoder first selects a fixed-size window of the encoder outputs to focus on, and attention is then applied only to the source vectors within that window. This is more efficient than a typical "global" attention mechanism.
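As I read the paper, a rough sketch of that local attention might look like this (all names and the window radius `D` are illustrative; the Gaussian weighting with $\sigma = D/2$ follows the paper's predictive "local-p" variant):

```python
import torch
import torch.nn.functional as F

# Rough sketch of "local" (predictive) attention, as I understand the paper:
# predict a centre position p_t from the decoder state, then attend only to
# encoder states inside a window of radius D around p_t.
def local_attention(encoder_states, decoder_state, W_p, v_p, D=5):
    S = encoder_states.size(0)                        # source length
    p_t = S * torch.sigmoid(v_p @ torch.tanh(W_p @ decoder_state))  # predicted centre in [0, S)
    lo = int(max(p_t.item() - D, 0))
    hi = int(min(p_t.item() + D, S - 1)) + 1
    window = encoder_states[lo:hi]                    # (window_len, hidden)
    scores = window @ decoder_state                   # dot-product scores inside the window only
    align = F.softmax(scores, dim=0)
    positions = torch.arange(lo, hi, dtype=torch.float)
    align = align * torch.exp(-((positions - p_t) ** 2) / (2 * (D / 2) ** 2))  # favour positions near p_t
    return align @ window                             # context vector

# Example with made-up shapes:
hidden = 16
enc = torch.randn(30, hidden)
dec = torch.randn(hidden)
W_p = torch.randn(hidden, hidden)
v_p = torch.randn(hidden)
ctx = local_attention(enc, dec, W_p, v_p, D=5)
```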

shimao
  • Now I realize that I was thrown off/confused cuz the attention weights depend on the input (through the embedding) rather than the encoder hidden state. I thought the `max_length` window was a way to use only `max_length` of the encoder hidden states, which seemed reasonable if @Jindřich's other answer was correct. If that had been correct, I think it would have seemed reasonable to use `a_j` (but then we would have the problem of undefined encoder hidden states when the sequences are too short). Regardless, your answer was useful! I knew there was something I had missed. Thanks. – Charlie Parker Jun 09 '19 at 17:13
  • Let's ask this question to make sure I truly got it: couldn't the tutorial have defined attention to still be fixed-length but depend on the encoder hidden states $e_j$? – Charlie Parker Jun 09 '19 at 17:26
  • Now I am also observing that the context vector in the tutorial is paying attention over embeddings rather than over the encoder hidden vectors. I.e., usually it is $d' = \sum^{T_x}_{j=1} \alpha^{<t,j>} e_j$, but the tutorial is doing something like $d' = \sum^{MaxLength}_{j=1} \alpha^{<t,j>} \, embedding(x_j)$, which in the context of the tutorial I guess makes sense because they are computing the attention using the input embeddings. But in the standard attention I am familiar with, the attention is computed using the encoder hidden states, and the context vector is computed using the encoder hidden states too. – Charlie Parker Jun 09 '19 at 18:02
  • 1. Well yes, you can fix the max length of the sequence arbitrarily even if there is no reason to do so, and it'll still be compatible with normal attention mechanisms. 2. No, it's not embeddings of the input sequence, but rather embeddings of the decoder output in the previous step. – shimao Jun 09 '19 at 19:08
  • Response to 2: not sure what you're saying no to. It's clear what the tutorial is computing the attention from: `attn_weights = F.softmax( self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1) ` it **IS** using the embedded input. – Charlie Parker Jun 09 '19 at 19:09
  • What the code calls "input" is not the input sequence but rather the target output sequence. – shimao Jun 09 '19 at 19:28
  • Crucial! **How much attention to pay to the first word, the second word, the third word, etc., without any regard to which words occupy which slots.** You are right, it's attention to the location of the token, but it doesn't take the actual word there into account! What a ridiculous tutorial. How are they not embarrassed to have such a tutorial on their official site, or at least comment on how much of a toy example the tutorial is. Horrible. – Charlie Parker Jun 19 '19 at 21:08
  • yes, there sadly isn't much in the way of quality control on online tutorials – shimao Jun 19 '19 at 21:54