
When using attention, for example with an LSTM (though not necessarily), one can compute the attention scores with the following methods:

  1. MLP: $u^T g(W^1 v + W^2 q)$
  2. dot product: $v \cdot q$
  3. biaffine transform: $v^T W q$

($v$ is one of the encoded input vectors being attended to, whose weighted sum is then used for prediction; $q$ is the query vector that determines the attention weights over those encoded input vectors.)
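For concreteness, here is a minimal sketch of all three scoring functions in PyTorch; the dimension sizes, tensor names, and the choice of $\tanh$ for $g$ are illustrative assumptions, not taken from any particular implementation:

```python
import torch
import torch.nn.functional as F

d = 64                    # dimension of the encoded vectors v_i and the query q (assumed)
n = 10                    # number of encoded input vectors (assumed)
V = torch.randn(n, d)     # encoded inputs; each row is one v_i
q = torch.randn(d)        # query vector

# 1. MLP (additive) scoring: u^T g(W1 v + W2 q), with g = tanh here
W1, W2, u = torch.randn(d, d), torch.randn(d, d), torch.randn(d)
scores_mlp = torch.tanh(V @ W1.T + q @ W2.T) @ u   # shape (n,)

# 2. dot-product scoring: v . q
scores_dot = V @ q                                 # shape (n,)

# 3. biaffine scoring: v^T W q
W = torch.randn(d, d)
scores_biaffine = V @ W @ q                        # shape (n,)

# Whichever scoring rule is used, the weights and the attended summary
# are obtained the same way:
alpha = F.softmax(scores_dot, dim=0)               # attention weights, shape (n,)
attended = alpha @ V                               # weighted sum of the v_i, shape (d,)
```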

What are the pros/cons of using each attention method? Or, what are the practical differences in the result?

Franck Dernoncourt
Jjang

1 Answer


I think the main difference is just the amount of model capacity/complexity you want -- you'd probably start with the dot product, and then pick increasingly elaborate methods if that doesn't fit the data well.

Another consideration is the "type" of the vectors. If both $v$ and $q$ are word/sentence embeddings, a dot product seems straightforward, but what if $v$ is a sentence embedding and $q$ is the encoded form of an image? Then, taking the dot product makes less sense, since you are saying that the components of $v$ should somehow correspond to the same components in $q$.

Of course, this can also be a good thing if you are trying to come up with a single embedding space for multimodal inputs. So depending on this, you may or may not try to use biaffine attention, which doesn't assume $v$ and $q$ are the same type.
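As a concrete (made-up) illustration of that last point: with a biaffine score the two vectors do not even need the same dimensionality, because $W$ can be rectangular, whereas a plain dot product requires matching dimensions.

```python
import torch

v = torch.randn(300)      # e.g. a 300-d sentence embedding (sizes are arbitrary)
q = torch.randn(2048)     # e.g. a 2048-d image feature vector

# The dot product is not even defined here, since the dimensions differ:
# score = v @ q           # would raise a shape error

# The biaffine form still works, because W maps between the two spaces:
W = torch.randn(300, 2048)
score = v @ W @ q         # a scalar score
```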

As for biaffine vs. MLP, note that the biaffine form easily models quadratic (pairwise multiplicative) interactions between $v$ and $q$, whereas the MLP is more "linear" in them. (See this related question on quadratic neurons in NTNs: What is the "expressive power" of the composition function in a Recursive Neural Tensor Network?)
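To spell out the "quadratic" point: expanding the biaffine score component-wise shows that it contains every pairwise product $v_i q_j$, while the MLP score only sees $v$ and $q$ through a linear combination inside the nonlinearity:

$$v^T W q = \sum_{i,j} W_{ij}\, v_i\, q_j \qquad \text{vs.} \qquad u^T g(W^1 v + W^2 q).$$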

shimao