
When using attention, for example with an LSTM (though not necessarily), one can compute the attention scores with the following methods:

  1. MLP: $u^T g(W^1 v + W^2 q)$
  2. dot product: $v \cdot q$
  3. biaffine transform: $v^T W q$

($v$ is one of the encoded input vectors being attended to, whose weighted sum is then used for prediction; $q$ is the query vector that determines the attention weights over those encoded input vectors.)
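For concreteness, here is a minimal sketch of all three scoring functions in PyTorch; the dimension sizes, tensor names, and the choice of $\tanh$ for $g$ are illustrative assumptions, not taken from any particular implementation:

```python
import torch
import torch.nn.functional as F

d = 64                    # dimension of the encoded vectors v_i and the query q (assumed)
n = 10                    # number of encoded input vectors (assumed)
V = torch.randn(n, d)     # encoded inputs; each row is one v_i
q = torch.randn(d)        # query vector

# 1. MLP (additive) scoring: u^T g(W1 v + W2 q), with g = tanh here
W1, W2, u = torch.randn(d, d), torch.randn(d, d), torch.randn(d)
scores_mlp = torch.tanh(V @ W1.T + q @ W2.T) @ u   # shape (n,)

# 2. dot-product scoring: v . q
scores_dot = V @ q                                 # shape (n,)

# 3. biaffine scoring: v^T W q
W = torch.randn(d, d)
scores_biaffine = V @ W @ q                        # shape (n,)

# Whichever scoring rule is used, the weights and the attended summary
# are obtained the same way:
alpha = F.softmax(scores_dot, dim=0)               # attention weights, shape (n,)
attended = alpha @ V                               # weighted sum of the v_i, shape (d,)
```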

What are the pros/cons of using each attention method? Or, what are the practical differences in the result?

Franck Dernoncourt
Jjang

1 Answer


I think the main difference is just the amount of model capacity/complexity you want -- you'd probably start with the dot product, and then pick increasingly elaborate methods if that doesn't fit the data well.

Another consideration is the "type" of the vectors. If both $v$ and $q$ are word/sentence embeddings, a dot product seems straightforward, but what if $v$ is a sentence embedding and $q$ is the encoded form of an image? Then, taking the dot product makes less sense, since you are saying that the components of $v$ should somehow correspond to the same components in $q$.

Of course, this can also be a good thing if you are trying to come up with a single embedding space for multimodal inputs. So depending on this, you may or may not try to use biaffine attention, which doesn't assume $v$ and $q$ are the same type.
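As a concrete (made-up) illustration of that last point: with a biaffine score the two vectors do not even need the same dimensionality, because $W$ can be rectangular, whereas a plain dot product requires matching dimensions.

```python
import torch

v = torch.randn(300)      # e.g. a 300-d sentence embedding (sizes are arbitrary)
q = torch.randn(2048)     # e.g. a 2048-d image feature vector

# The dot product is not even defined here, since the dimensions differ:
# score = v @ q           # would raise a shape error

# The biaffine form still works, because W maps between the two spaces:
W = torch.randn(300, 2048)
score = v @ W @ q         # a scalar score
```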

As for biaffine vs. MLP, note that the biaffine form easily models quadratic (pairwise multiplicative) interactions between $v$ and $q$, whereas the MLP is more "linear" in them. (See this related question on quadratic neurons in NTNs: What is the "expressive power" of the composition function in a Recursive Neural Tensor Network?)
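To spell out the "quadratic" point: expanding the biaffine score component-wise shows that it contains every pairwise product $v_i q_j$, while the MLP score only sees $v$ and $q$ through a linear combination inside the nonlinearity:

$$v^T W q = \sum_{i,j} W_{ij}\, v_i\, q_j \qquad \text{vs.} \qquad u^T g(W^1 v + W^2 q).$$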

shimao