I know that a phrase-based statistical machine translation system estimates the probability of a translation by analyzing a bilingual corpus: it maps phrases in one language to phrases in the other, and derives the probability from how frequently specific phrases are mapped to each other.
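To make my understanding concrete, here is a minimal sketch of the frequency step as I picture it. It assumes the phrase pairs have already been extracted from the corpus by some earlier step; the toy data and the function name `phrase_translation_probs` are made up for illustration and are not taken from any real system.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (source phrase, target phrase) pairs, assumed to have
# been extracted from a bilingual corpus by some earlier step.
phrase_pairs = [
    ("das haus", "the house"),
    ("das haus", "the house"),
    ("das haus", "the building"),
    ("ein haus", "a house"),
]


def phrase_translation_probs(pairs):
    """Estimate P(target | source) as count(source, target) / count(source)."""
    pair_counts = Counter(pairs)
    source_counts = Counter(src for src, _ in pairs)
    probs = defaultdict(dict)
    for (src, tgt), n in pair_counts.items():
        probs[src][tgt] = n / source_counts[src]
    return dict(probs)


probs = phrase_translation_probs(phrase_pairs)
```

Under these assumptions, `probs["das haus"]` maps each candidate translation of "das haus" to its relative frequency among the extracted pairs.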
What I don't understand, and can't find anywhere, is how the corpus analyzer knows which word groups form phrases. Is this somehow encoded in the text, i.e. has the text been annotated so that phrases are read as groups? I'm not sure how else this could be done.
An answer would be sincerely appreciated! (PS: I am not entirely sure where to ask this question. If this is the wrong place for it, please say so.)