
Attention mechanisms have been used in various deep learning papers over the last few years. Ilya Sutskever, head of research at OpenAI, has enthusiastically praised them.

Eugenio Culurciello at Purdue University has claimed that RNNs and LSTMs should be abandoned in favor of purely attention-based neural networks:

https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0

This seems like an exaggeration, but it's undeniable that purely attention-based models have done quite well on sequence modeling tasks: we all know about the aptly named paper from Google, "Attention Is All You Need".

However, what exactly are attention-based models? I've yet to find a clear explanation of them. Suppose I want to forecast the new values of a multivariate time series, given its historical values. It's quite clear how to do that with an RNN with LSTM cells. How would I do the same with an attention-based model?

DeltaIV

1 Answer


Attention is a method for aggregating a set of vectors $v_i$ into just one vector, often via a lookup vector $u$. Usually, the $v_i$ are either the inputs to the model, the hidden states of previous time-steps, or the hidden states one level down (in the case of stacked LSTMs).

The result is often called the context vector $c$, since it contains the context relevant to the current time-step.

This additional context vector $c$ is then fed into the RNN/LSTM as well (it can be simply concatenated with the original input). Therefore, the context can be used to help with prediction.

The simplest way to do this is to compute the probability vector $p = \text{softmax}(V^T u)$ and set $c = \sum_i p_i v_i$, where $V$ is the matrix whose columns are all the previous $v_i$. A common choice for the lookup vector $u$ is the current hidden state $h_t$.
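
For concreteness, here is a minimal NumPy sketch of that formula (my own illustration, not part of the code further down; the names and dimensions are arbitrary):

import numpy as np

def attention(V, u):
    # V: (d, T) matrix whose columns are the vectors v_i; u: (d,) lookup vector
    z = V.T @ u                 # one logit per v_i
    p = np.exp(z - z.max())
    p = p / p.sum()             # p = softmax(V^T u)
    c = V @ p                   # c = sum_i p_i * v_i
    return c, p

# toy usage: aggregate T = 5 vectors of dimension d = 3
V = np.random.randn(3, 5)
u = np.random.randn(3)
c, p = attention(V, u)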

There are many variations on this, and you can make things as complicated as you want. For example, instead of using $v_i^T u$ as the logits, one may choose $f(v_i, u)$, where $f$ is an arbitrary neural network.

A common attention mechanism for sequence-to-sequence models uses $p = \text{softmax}(z)$ with $z_i = q^T \tanh(W_1 v_i + W_2 h_t)$, where the $v_i$ are the hidden states of the encoder and $h_t$ is the current hidden state of the decoder. $q$ and both $W$s are learned parameters.
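
A small sketch of that additive scoring in the same NumPy style (again my own illustration; W1, W2 and q are placeholder parameters with made-up shapes, not taken from any particular implementation):

import numpy as np

def additive_attention(encoder_states, h_t, W1, W2, q):
    # encoder_states: (T, d_enc); h_t: (d_dec,); W1: (d_a, d_enc); W2: (d_a, d_dec); q: (d_a,)
    z = np.array([q @ np.tanh(W1 @ v + W2 @ h_t) for v in encoder_states])  # one score per v_i
    p = np.exp(z - z.max())
    p = p / p.sum()              # softmax over the T encoder states
    return p @ encoder_states    # context: weighted sum of the encoder states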

Some papers which show off different variations on the attention idea:

Pointer Networks use attention to reference inputs in order to solve combinatorial optimization problems.

Recurrent Entity Networks maintain separate memory states for different entities (people/objects) while reading text, and update the correct memory state using attention.

Transformer models also make extensive use of attention. Their formulation of attention is slightly more general and also involves key vectors $k_i$: the attention weights $p$ are computed between the keys and the lookup, and the context is then constructed from the $v_i$.
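
A rough single-query sketch of that key/value formulation (my own NumPy illustration of scaled dot-product attention, using the Transformer convention of dividing by $\sqrt{d_k}$; not code from the paper itself):

import numpy as np

def kv_attention(K, V, u):
    # K: (T, d_k) keys, V: (T, d_v) values, u: (d_k,) lookup/query vector
    z = K @ u / np.sqrt(K.shape[1])   # score each key against the lookup
    p = np.exp(z - z.max())
    p = p / p.sum()                   # attention weights over the T positions
    return p @ V                      # context is built from the values, not the keys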


Here is a quick implementation of one form of attention, although I can't guarantee correctness beyond the fact that it passed some simple tests.

Basic RNN:

import tensorflow as tf  # TensorFlow 1.x-style API (tf.get_variable, etc.)

# hidden_dim, in_dim and batch are assumed to be defined elsewhere
def rnn(inputs_split):
    bias = tf.get_variable('bias', shape = [hidden_dim, 1])
    weight_hidden = tf.tile(tf.get_variable('hidden', shape = [1, hidden_dim, hidden_dim]), [batch, 1, 1])
    weight_input = tf.tile(tf.get_variable('input', shape = [1, hidden_dim, in_dim]), [batch, 1, 1])

    hidden_states = [tf.zeros((batch, hidden_dim, 1), tf.float32)]
    for i, input in enumerate(inputs_split):
        input = tf.reshape(input, (batch, in_dim, 1))
        last_state = hidden_states[-1]
        # vanilla RNN update: h_t = tanh(W_x x_t + W_h h_{t-1} + b)
        hidden = tf.nn.tanh(tf.matmul(weight_input, input) + tf.matmul(weight_hidden, last_state) + bias)
        hidden_states.append(hidden)
    return hidden_states[-1]

With attention, we add only a few lines before the new hidden state is computed:

        # score each previous hidden state against the current one (a scaled dot product)
        if len(hidden_states) > 1:
            logits = tf.transpose(tf.reduce_mean(last_state * hidden_states[:-1], axis = [2, 3]))
            probs = tf.nn.softmax(logits)
            probs = tf.reshape(probs, (batch, -1, 1, 1))
            # context vector: probability-weighted sum of the previous hidden states
            context = tf.add_n([v * prob for (v, prob) in zip(hidden_states[:-1], tf.unstack(probs, axis = 1))])
        else:
            context = tf.zeros_like(last_state)

        # feed the context in alongside the previous state; note that weight_hidden must
        # now have shape [1, hidden_dim, 2 * hidden_dim] to match the concatenated vector
        last_state = tf.concat([last_state, context], axis = 1)

        hidden = tf.nn.tanh(tf.matmul(weight_input, input) + tf.matmul(weight_hidden, last_state) + bias)

the full code

shimao
  • Really nice, but there's a part which is not clear: you write $p = \text{softmax}(V^Tu)$ without a subscript $i$, then you write $c = \sum_i p_i v_i$. So how are the various $p_i$ computed? What exactly are $V^T$ and $v$? Looking at the code, it seems like $V^T$ is a collection of hidden states at preceding time steps. How many time steps? $v$ seems to be the current hidden state. Is this correct? – DeltaIV May 10 '18 at 08:12
  • another way to write it is $z_i = v_i^T u$ and then $p = \text{softmax}(z)$, i.e. $p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$ – shimao May 10 '18 at 08:13
  • hmmm, so $p$ is a vector and $p_i$ are its components, right? – DeltaIV May 10 '18 at 08:17
  • yes, that's what i meant – shimao May 10 '18 at 08:17
  • @shimao I created a [chat room](https://chat.stackexchange.com/rooms/77504/chat), let me know if you'd be interested to talk (not about this question) – DeltaIV May 15 '18 at 06:38
  • @shimao Can you explain why the probability distribution of the $v_i$'s will match our intuition of attention? For example, as humans we naturally might attend to the eyes, ears, nose of a dog when classifying a picture of a dog. That would be equivalent to giving a high probability to the feature $v_i$'s that represent eyes, ears, nose and zeroing out the rest. However, the only reason I can see that the learned attention mechanism would do so is because learning to do so increases classification accuracy. But wouldn't the weights in a convnet also learn to zero out irrelevant features as well? – user3180 Dec 22 '18 at 22:37
  • @user3180 the weights of a convnet can only properly "zero out" irrelevant features if the convnet also has access to the query vector. you can feed this in as additional channels to the convnet of course, but that's a much more unconstrained and possibly more computationally expensive approach versus just taking dot products. – shimao Dec 23 '18 at 00:27
  • @shimao Let's say we have weakly labeled (binary indicator of dog) messy, multiple object scenes, and we want to learn the concept 'has dog'. The goal of an attention system would be to output the subset of the image pixels that contains the dog. However, we could also have convnet where the last layer has only a single channel. Are these systems very similar conceptually? The attention system should learn to collapse ("ignore") sections of the image that don't have a dog in it, and the convnet system should learn a dog filter such that the single output channel locates dogs in pixel space – user3180 Dec 23 '18 at 00:54
  • I am interested in attention for 'concept' learning. In the above example, the concept is 'has dog'. In these single image (not sequence) scenarios the query is just the image itself. In this case, is the convnet still more unconstrained? – user3180 Dec 23 '18 at 00:56
  • why does the PointerNet version of attention not use biases? i.e. why is it $\alpha = v^\top \tanh( W_1 e_j + W_2 d_t )$ and not $\alpha = v^\top \tanh( W_1 e_j + W_2 d_t + b_\alpha)$, where $b_\alpha$ is a trainable bias parameter? – Charlie Parker Jun 09 '19 at 17:56