
For machine translation I use the BLEU score, which seems to be the evaluation metric of choice (it is used in Sutskever et al.'s 2014 sequence-to-sequence paper).

The goal is to get as high a BLEU score as possible (it ranges from 0 to 1).

The following gibberish gets an extraordinarily high BLEU score (0.77):

from nltk import bleu_score

reference = ['The moon is very bright']
hypothesis = ['Dee dd ss eee']
reference = [[r.split()] for r in reference]
hypothesis = [[h.split()] for h in hypothesis]

bleu_score.corpus_bleu(reference, hypothesis)

Why does BLEU give such a high score to gibberish? Which other tools could I use to validate machine translation output?


2 Answers


Which other tools for validation could I use for machine translation?

NLTK's BLEU implementation has a few issues. One of the most commonly used BLEU implementations is the one provided with Moses (see its documentation).
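The comment below also mentions SacreBLEU, which scores detokenized strings directly and so leaves less room for this kind of input-shape mistake. A minimal sketch of scoring the sentence pair from the question with it, assuming the sacrebleu package is installed (pip install sacrebleu):

import sacrebleu

hypotheses = ['Dee dd ss eee']               # one detokenized hypothesis string per sentence
references = [['The moon is very bright']]   # one reference stream, aligned with the hypotheses

# corpus_bleu handles tokenization itself; the returned score is on a 0-100 scale.
result = sacrebleu.corpus_bleu(hypotheses, references)
print(result.score)  # close to 0 for this gibberish hypothesis
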

Franck Dernoncourt
  • Commenting for posterity: while this is true, the high BLEU score here was user error. (The other answer addresses this.) Following Franck's guidance by using the Moses implementation of BLEU (or the more recent SacreBLEU) would have made the user error obvious. – Arya McCarthy Apr 25 '21 at 04:29

You are calling the scoring function incorrectly. This is how to do it:

from nltk import bleu_score

# Each reference is a tokenized sentence; sentence_bleu takes a list of references.
references = ['The moon is very bright'.split()]
# The hypothesis is a single tokenized sentence, not wrapped in an extra list.
hypothesis = 'Dee dd ss eee'.split()
bleu_score.sentence_bleu(references, hypothesis)

It will return 0, as expected, since the hypothesis shares no n-grams with the reference.
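If you want the corpus-level score from the question, the same fix applies: only the references get the extra level of nesting, while each hypothesis stays a flat list of tokens. A minimal sketch of the corrected call (variable names are mine):

from nltk import bleu_score

references = ['The moon is very bright']
hypotheses = ['Dee dd ss eee']

# corpus_bleu expects one list of reference token lists per hypothesis,
# and a flat list of tokenized hypotheses -- no extra nesting around each hypothesis.
list_of_references = [[r.split()] for r in references]
tokenized_hypotheses = [h.split() for h in hypotheses]

bleu_score.corpus_bleu(list_of_references, tokenized_hypotheses)

This also returns 0 for the gibberish hypothesis.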