
When I scrolled through the activation functions available in the PyTorch package (here), I found that nn.MultiheadAttention is listed among them. Can you please explain why it's considered an activation function? Maybe I'm misunderstanding something, but multi-head attention has its own learnable weights, so it seems better suited to the Layers section than to activation functions. Can you please explain what I'm getting wrong?
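To illustrate what I mean, here is a quick check (a minimal sketch, assuming a standard PyTorch install) comparing the parameter counts of nn.MultiheadAttention and a typical activation like nn.ReLU:

```python
import torch.nn as nn

# nn.MultiheadAttention carries learnable projection weights,
# unlike a stateless activation such as nn.ReLU.
mha = nn.MultiheadAttention(embed_dim=8, num_heads=2)
relu = nn.ReLU()

print(sum(p.numel() for p in mha.parameters()))   # non-zero: has trainable weights
print(sum(p.numel() for p in relu.parameters()))  # 0: no parameters at all
```

So unlike ReLU, Sigmoid, etc., MultiheadAttention behaves like a layer with trainable state.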

Thank you!
