
Compared with the sinusoidal positional encoding used in the Transformer, BERT's learned-lookup-table solution has two drawbacks in my mind:

  1. Fixed length
  2. Cannot reflect relative distance

Could anyone please tell me the considerations behind this design?
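For concreteness, here is a minimal PyTorch sketch (my own illustration, not the actual BERT or Transformer code) of the two approaches, using BERT-base-like sizes (512 positions, hidden size 768):

```python
import math
import torch
import torch.nn as nn

def sinusoidal_encoding(num_positions: int, d_model: int) -> torch.Tensor:
    """Fixed encoding from 'Attention Is All You Need':
    PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...).
    Nothing is learned, and it can be evaluated for any num_positions."""
    position = torch.arange(num_positions).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(num_positions, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# BERT-style learned lookup table: one trainable vector per position,
# capped at a maximum length fixed before pretraining (512 in the released BERT models).
learned_pe = nn.Embedding(num_embeddings=512, embedding_dim=768)

print(sinusoidal_encoding(1024, 768).shape)  # works for any length -> torch.Size([1024, 768])
print(learned_pe(torch.arange(512)).shape)   # torch.Size([512, 768]); position 512+ raises IndexError
```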

eric2323223

2 Answers


Here is my current understanding of my own question.

It is probably related to BERT's transfer-learning background. The learned lookup table does add some learning effort in the pretraining stage, but the extra parameters are almost negligible compared to the number of trainable parameters in the Transformer encoder, and the cost is acceptable given that pretraining is a one-time effort that is expected to be time-consuming anyway.
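As a rough back-of-the-envelope check (assuming BERT-base's published sizes: 512 positions, hidden size 768, and roughly 110M parameters in total):

```python
max_positions = 512            # BERT-base max sequence length
hidden_size = 768              # BERT-base hidden size
total_params = 110_000_000     # ~110M parameters reported for BERT-base

position_table = max_positions * hidden_size           # 393,216 extra parameters
print(f"{position_table:,} parameters in the position table, "
      f"about {position_table / total_params:.2%} of the whole model")  # ~0.36%
```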

In the fine-tuning and prediction stages, on the other hand, the lookup table is faster, because the sinusoidal positional encoding would need to be computed at every position.

eric2323223
  1. Fixed length

BERT, the same as the Transformer, uses attention as a key feature. The attention used in those models has a fixed span as well.

  2. Cannot reflect relative distance

We assume neural networks to be universal function approximators. If that is the case, why wouldn't the network be able to learn the Fourier terms by itself?

Why did they use it? Because it is more flexible than the approach used in the Transformer. It is learned, so it can possibly figure out something better by itself; that's the general assumption behind deep learning as a whole. It also simply proved to work better.
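One way to see the flexibility argument: the learned table's hypothesis space contains the sinusoidal encoding itself, since its weights can be set to exactly those values. A minimal sketch, assuming PyTorch and BERT-base-like sizes:

```python
import math
import torch
import torch.nn as nn

max_len, d_model = 512, 768                      # BERT-base-like sizes (assumed)
pos = torch.arange(max_len).unsqueeze(1).float()
div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
sinusoidal = torch.zeros(max_len, d_model)
sinusoidal[:, 0::2] = torch.sin(pos * div)
sinusoidal[:, 1::2] = torch.cos(pos * div)

table = nn.Embedding(max_len, d_model)           # BERT-style learned position table
with torch.no_grad():
    table.weight.copy_(sinusoidal)               # the table now holds the Fourier terms exactly

# Training could keep this solution, refine it, or move to something entirely different.
print(torch.allclose(table(torch.arange(max_len)), sinusoidal))  # True
```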

Tim
  • Your first claim is not correct. The attention weights are not learned, they are computed based on keys and queries which are different for every input, so attention can in principle generalize to different input lengths. – ondra.cifka Nov 11 '21 at 20:42
  • @ondra.cifka it has learned parameters, hence it’s “learned”. – Tim Nov 11 '21 at 20:45
  • The weights (parameters) of the network (i.e. the linear layers) are learned, but these do not depend on position. The _attention weights_ (coefficients) are computed based on these parameters, so strictly speaking they are not learned, only the way to compute them is. And I don't see how this would result in it having a "fixed span". – ondra.cifka Nov 12 '21 at 21:36
  • @ondra.cifka attention by itself not, but it depends on position embeddings that do. But agree, the wording might have been confusing so I removed that part. – Tim Nov 12 '21 at 22:27