
I know that a phrase-based statistical machine translator estimates the probability of a correct translation by analyzing a bilingual corpus, mapping phrases in one language to phrases in the other. From the frequency of these phrase mappings it derives the probability.

What I don't understand, and can't find anywhere, is how the corpus analyzer knows which word groups form phrases. Is this incorporated in the text somehow, i.e. has the text been modified so that phrases are read as a group? I'm not sure how else this could be done.

An answer would be sincerely appreciated! (PS: I am not entirely sure where to ask this question. If this is the wrong place, please say so.)


1 Answer


I'm not sure how the machine translators that you're talking about work, but, regardless, generating phrases from corpora is known as "phrase modeling", and it is a standard task in natural language processing (NLP). The Python package gensim comes equipped with tools to make it easy.

The basic idea is described in this tutorial on spaCy, a Python package for NLP. Essentially, if two tokens occur next to each other far more often than would be expected by chance, then we assume that those two tokens actually constitute a phrase.

Here's the formula:

\begin{equation} \left(\frac{m_{AB} - m_{min}}{m_A\, m_B} \right) N > \varepsilon \end{equation}

where $m_A$ and $m_B$ are the number of times token A and token B appear in the corpus, respectively, $m_{AB}$ is the number of times the phrase "[token A] [token B]" appears in the corpus, $N$ is the size of the vocabulary, $m_{min}$ is some minimum count, to ensure that we don't pick up very rare phrases, and $\varepsilon$ is some user-specified threshold.
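To make the rule concrete, here is a small sketch of the check in Python. The `is_phrase` helper and the counts are hypothetical, invented only to illustrate the arithmetic:

```python
# Hypothetical helper and counts, purely to illustrate the scoring rule above.
def is_phrase(count_a, count_b, count_ab, vocab_size, min_count, threshold):
    """Return True if the bigram (A, B) scores above the threshold."""
    score = (count_ab - min_count) / (count_a * count_b) * vocab_size
    return score > threshold

# Suppose "ice" appears 50 times, "cream" 40 times, and "ice cream" 30 times
# in a corpus whose vocabulary has 10,000 distinct tokens.
print(is_phrase(count_a=50, count_b=40, count_ab=30,
                vocab_size=10_000, min_count=5, threshold=10.0))
# (30 - 5) / (50 * 40) * 10000 = 125 > 10, so "ice cream" is kept as a phrase.
```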

"Token", here, can mean either a word or a phrase, so, for instance, if we've already discovered that "ice cream" is a phrase, then we can construct a larger phrase "vanilla ice cream" by evaluating this equation with the token "vanilla" and the token "ice cream". So, first we can run gensim's phrase modeler once to get two-word phrases, and then a second time to get three-word phrases, etc.
