From the BERT paper:
> Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself”, and the model could trivially predict the target word in a multi-layered context.
I do not understand this. To my understanding, training a standard conditional language model means collecting n-grams and computing ratios, e.g. $p(w_c | w_a, w_b) = \frac{c(w_a,w_b,w_c)}{c(w_a, w_b, *)}$. That is, the probability of $w_c$ after observing $w_a, w_b$ is the number of times the sequence $w_a, w_b, w_c$ is observed, divided by the number of times a sequence $w_a, w_b$ followed by an arbitrary third element is observed.
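To make concrete what I mean by counting, here is a minimal sketch (Python; the two-sentence corpus and the function name `p_next` are made up purely for illustration):

```python
from collections import Counter

# Made-up toy corpus, only for illustration.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "cat", "lay", "on", "the", "mat"],
]

trigram_counts = Counter()  # c(w_a, w_b, w_c)
bigram_counts = Counter()   # c(w_a, w_b, *)
for sentence in corpus:
    for a, b, c in zip(sentence, sentence[1:], sentence[2:]):
        trigram_counts[(a, b, c)] += 1
        bigram_counts[(a, b)] += 1

def p_next(w_a, w_b, w_c):
    """Maximum-likelihood estimate of p(w_c | w_a, w_b)."""
    return trigram_counts[(w_a, w_b, w_c)] / bigram_counts[(w_a, w_b)]

print(p_next("the", "cat", "sat"))  # 0.5 on this toy corpus
```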
And this can certainly be extended to the bidirectional case like this: $p(w_c | w_a, w_b, w_d, w_e) = \frac{c(w_a,w_b,w_c,w_d,w_e)}{c(w_a, w_b, *, w_d, w_e)}$, i.e. counting context words on both sides of the target position.
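The bidirectional version I have in mind is the same kind of counting, just with context on both sides of the target word (again a toy sketch; the corpus and `p_middle` are made up, and the snippet is repeated so it runs on its own):

```python
from collections import Counter

# Same made-up toy corpus as above.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "cat", "lay", "on", "the", "mat"],
]

five_gram_counts = Counter()  # c(w_a, w_b, w_c, w_d, w_e)
context_counts = Counter()    # c(w_a, w_b, *, w_d, w_e)
for sentence in corpus:
    for a, b, c, d, e in zip(sentence, sentence[1:], sentence[2:],
                             sentence[3:], sentence[4:]):
        five_gram_counts[(a, b, c, d, e)] += 1
        context_counts[(a, b, d, e)] += 1

def p_middle(w_a, w_b, w_c, w_d, w_e):
    """Maximum-likelihood estimate of p(w_c | w_a, w_b, w_d, w_e)."""
    return five_gram_counts[(w_a, w_b, w_c, w_d, w_e)] / context_counts[(w_a, w_b, w_d, w_e)]

print(p_middle("the", "cat", "sat", "on", "the"))  # 0.5 on the same corpus
```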
So I must be misunderstanding something here. What is it? What else would left-to-right training of a standard conditional language model be?
This question has been asked in other places, either with no answer:
https://ai.stackexchange.com/questions/11755/how-does-bidirectional-encoding-allow-the-predicted-word-to-indirectly-see-itse
or with an answer that I did not find satisfying: https://datascience.stackexchange.com/questions/57000/bi-directionality-in-bert-model
I'm asking here again because it is about machine learning in general, and I have deliberately left out everything specific to BERT or more advanced methods.