While scrolling through the list of activation functions available in the PyTorch package (here), I found that nn.MultiheadAttention is listed there. Can you please explain why it's considered an activation function? Maybe I'm misunderstanding something, but multi-head attention has its own learnable weights, so it seems to belong with layers rather than with activation functions. Can you please explain what I'm getting wrong?
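To illustrate what I mean, here is a small snippet comparing parameter counts; the embed_dim=8 and num_heads=2 values are arbitrary, just for demonstration:

```python
import torch.nn as nn

# A typical activation function like ReLU has no learnable parameters...
relu = nn.ReLU()
print(sum(p.numel() for p in relu.parameters()))  # prints 0

# ...whereas nn.MultiheadAttention carries learnable projection weights,
# which is why it looks more like a layer to me.
mha = nn.MultiheadAttention(embed_dim=8, num_heads=2)
print(sum(p.numel() for p in mha.parameters()))  # prints a number > 0
```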
Thank you!