From the BERT paper:
> Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself”, and the model could trivially predict the target word in a multi-layered context.
I do not understand this. To my understanding, training a standard conditional language model means collecting n-grams and computing ratios, e.g. $p(w_c | w_a, w_b) = \frac{c(w_a,w_b,w_c)}{c(w_a, w_b, *)}$. That is, the probability of $w_c$ after observing $w_a, w_b$ is the number of times the sequence $w_a, w_b, w_c$ is observed, divided by the number of times a sequence $w_a, w_b$ followed by an arbitrary third element is observed.
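To make concrete what I mean by counting, here is a minimal sketch (Python; the two-sentence corpus and the function name `p_next` are made up purely for illustration):

```python
from collections import Counter

# Made-up toy corpus, only for illustration.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "cat", "lay", "on", "the", "mat"],
]

trigram_counts = Counter()  # c(w_a, w_b, w_c)
bigram_counts = Counter()   # c(w_a, w_b, *)
for sentence in corpus:
    for a, b, c in zip(sentence, sentence[1:], sentence[2:]):
        trigram_counts[(a, b, c)] += 1
        bigram_counts[(a, b)] += 1

def p_next(w_a, w_b, w_c):
    """Maximum-likelihood estimate of p(w_c | w_a, w_b)."""
    return trigram_counts[(w_a, w_b, w_c)] / bigram_counts[(w_a, w_b)]

print(p_next("the", "cat", "sat"))  # 0.5 on this toy corpus
```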
And this can certainly be extended to the bidirectional case like this: $p(w_c | w_a, w_b, w_d, w_e) = \frac{c(w_a,w_b,w_c,w_d,w_e)}{c(w_a, w_b, *, w_d, w_e)}$, i.e. counting context words on both sides of the target position.
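The bidirectional version I have in mind is the same kind of counting, just with context on both sides of the target word (again a toy sketch; the corpus and `p_middle` are made up, and the snippet is repeated so it runs on its own):

```python
from collections import Counter

# Same made-up toy corpus as above.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "cat", "lay", "on", "the", "mat"],
]

five_gram_counts = Counter()  # c(w_a, w_b, w_c, w_d, w_e)
context_counts = Counter()    # c(w_a, w_b, *, w_d, w_e)
for sentence in corpus:
    for a, b, c, d, e in zip(sentence, sentence[1:], sentence[2:],
                             sentence[3:], sentence[4:]):
        five_gram_counts[(a, b, c, d, e)] += 1
        context_counts[(a, b, d, e)] += 1

def p_middle(w_a, w_b, w_c, w_d, w_e):
    """Maximum-likelihood estimate of p(w_c | w_a, w_b, w_d, w_e)."""
    return five_gram_counts[(w_a, w_b, w_c, w_d, w_e)] / context_counts[(w_a, w_b, w_d, w_e)]

print(p_middle("the", "cat", "sat", "on", "the"))  # 0.5 on the same corpus
```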
So I must be misunderstanding something here. What is it? What else would left-to-right training of a standard conditional language model be?
This question has been asked in other places, either with no answer:
https://ai.stackexchange.com/questions/11755/how-does-bidirectional-encoding-allow-the-predicted-word-to-indirectly-see-itse
or with an answer that I did not find satisfying: https://datascience.stackexchange.com/questions/57000/bi-directionality-in-bert-model
I'm asking here again because it is about machine learning in general, and I have deliberately left out everything specific to BERT or more advanced methods.