
I am trying to find scientific literature that studies whether, when we already have enough parallel data, adding monolingual data can further improve performance.

I have not been able to find anything yet, but it seems reasonable to me that adding a target-side language model, for instance, should improve translation. Or perhaps the parallel dataset could be augmented further by back-translating target-side monolingual data. Does anyone know of literature on this?
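To make the language-model idea concrete, what I have in mind is something like shallow fusion, where the decoder's token scores are interpolated with an external target-side LM at each decoding step. A minimal toy sketch (the weight `lam` and the toy distributions are my own made-up example, not from any particular paper):

```python
import math

def shallow_fusion_step(nmt_logprobs, lm_logprobs, lam=0.3):
    """Interpolate NMT token scores with an external target-side LM
    (shallow fusion). Both dicts map candidate tokens to log-probs;
    `lam` weights the LM contribution."""
    return {tok: score + lam * lm_logprobs.get(tok, -math.inf)
            for tok, score in nmt_logprobs.items()}

# One toy step of beam search: the LM nudges the decoder toward the
# token it considers more fluent, even though the NMT model slightly
# prefers the other one.
nmt = {"cat": math.log(0.4), "dog": math.log(0.6)}
lm = {"cat": math.log(0.9), "dog": math.log(0.1)}
fused = shallow_fusion_step(nmt, lm, lam=0.5)
print(max(fused, key=fused.get))  # -> "cat"
```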

1 Answer


There are quite a few papers on this topic.

There have been several recent attempts to use pre-trained language models in MT.

Back-translation is now considered somewhat tricky. Several recent papers (e.g., Domain, Translationese and Noise in Synthetic Data for Neural Machine Translation or Translationese as a Language in "Multilingual" NMT) showed that back-translation improves translation quality mostly in an artificial setup where the source side is a human translation and the target side is a native sentence in the target language. (The normal use case, however, is having a native source-language sentence as the input.)
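In case it helps to ground the terminology, here is a minimal sketch of the back-translation augmentation loop; the `reverse_translate` callable stands in for a trained target-to-source model, and the toy sentence pair is made up:

```python
def back_translate(mono_tgt, reverse_translate):
    """Build synthetic parallel data: genuine target sentences are
    paired with machine-translated (synthetic) source sentences."""
    return [(reverse_translate(t), t) for t in mono_tgt]

# Stand-in for a trained German->English (target->source) model.
toy_reverse = {"Das ist gut.": "That is good."}

mono_de = ["Das ist gut."]
synthetic_pairs = back_translate(mono_de, toy_reverse.get)
# Concatenate these pairs with the real parallel data and retrain
# the English->German (source->target) model.
print(synthetic_pairs)  # -> [('That is good.', 'Das ist gut.')]
```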

The first paper (btw. also discussed in this blog post) shows that in a high-resource setup, data augmentation by translating source-side monolingual data with the model itself (forward translation) might be as good as back-translation.
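Forward translation is the mirror image of the sketch above; a hypothetical version, reusing the same made-up toy data, would pair genuine source sentences with the forward model's own outputs:

```python
def forward_translate(mono_src, translate):
    """Build synthetic parallel data: genuine source sentences are
    paired with machine-translated (synthetic) target sentences."""
    return [(s, translate(s)) for s in mono_src]

# Stand-in for the English->German (source->target) model itself.
toy_forward = {"That is good.": "Das ist gut."}
print(forward_translate(["That is good."], toy_forward.get))
```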

Jindřich
Thank you very much, I will read these papers and follow the breadcrumbs. Regarding back-translation, though, I think Edunov et al. (https://arxiv.org/pdf/1908.05204.pdf) have recently shown that it may not be as bad as BLEU scores alone might suggest. – Hill Farmer Jan 17 '20 at 18:26