
I am trying to evaluate an NLP model using BLEU and ROUGE. However, I am a bit confused about the difference between those scores. While I am aware that ROUGE is aimed at recall whilst BLEU measures precision, all ROUGE implementations I have come across also output precision and the F-score. The original ROUGE paper only briefly mentions precision and the F-score, so I am unsure what meaning they have for ROUGE. Is ROUGE mainly about recall, with precision and the F-score just added as a complement, or is ROUGE considered to be the combination of those three scores?

What confuses me even more is that, to my understanding, ROUGE-1 precision should be equal to BLEU when using the weights (1, 0, 0, 0), but that does not seem to be the case. The only explanation I can think of is the brevity penalty. However, I checked that the accumulated length of the references is shorter than the length of the hypothesis, which should mean that the brevity penalty is 1. Nonetheless, BLEU with w = (1, 0, 0, 0) scores 0.55673 while ROUGE-1 precision scores 0.7249.

What am I getting wrong?

I am using nltk to compute BLEU and the rouge-metric package for ROUGE.

Disclaimer: I already posted this question on Data Science, however after not receiving any replies and doing some additional research on the differences between Data Science and Cross Validated, I figured that this question might be better suited for Cross Validated (correct me if I am wrong).

jdepoix

1 Answer


BLEU computes a similarity score based on (1) n-gram precision (usually for 1-, 2-, 3-, and 4-grams) and (2) a penalty for system translations that are too short.

$$\text{BLEU}=\text{BP}\cdot\exp\left(\sum_{n=1}^N w_n \log p_n\right)$$ where $p_n$ is the modified n-gram precision, the logarithm is the natural logarithm, $w_n$ is a weight between 0 and 1 for $\log p_n$ with $\sum_{n=1}^N w_n = 1$, and BP is the brevity penalty that penalizes machine translations that are too short.

\begin{equation} \text{BP} = \begin{cases} 1 & \text{if $c>r$}\\ \exp(1-\frac{r}{c}) & \text{if $c\le r$} \end{cases} \end{equation}

Plugging in $w = (1, 0, 0, 0)$ and $\text{BP} = 1$, we obtain $\text{BLEU}=p_1$, which is the modified unigram precision.
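
As a quick sanity check, here is a minimal sketch (the toy sentences are made up) showing that nltk's `sentence_bleu` with weights (1, 0, 0, 0) coincides with the modified unigram precision whenever the hypothesis is at least as long as the closest reference:

```python
from nltk.translate.bleu_score import sentence_bleu, modified_precision

# Hypothetical toy example: the hypothesis is longer than the reference, so BP = 1.
references = [["the", "cat", "sat", "on", "the", "mat"]]
hypothesis = ["the", "cat", "sat", "on", "the", "mat", "today"]

bleu_1 = sentence_bleu(references, hypothesis, weights=(1, 0, 0, 0))
p_1 = float(modified_precision(references, hypothesis, n=1))

print(bleu_1, p_1)  # both should be 6/7 ≈ 0.857
```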

I suspect that in your case BP might not be 1. The penalty is less than one if $r$, the effective reference length (the length of the reference closest in length to the hypothesis), is greater than $c$, the length of the candidate. I am not sure how you computed the accumulated lengths of the references to conclude that the brevity penalty is 1.
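
To make the effective reference length concrete, here is a small sketch (with hypothetical token lists) using the helper functions that nltk's own BLEU implementation relies on:

```python
from nltk.translate.bleu_score import closest_ref_length, brevity_penalty

# Hypothetical token lists: reference lengths 6 and 10, hypothesis length 9.
references = [["w"] * 6, ["w"] * 10]
hypothesis = ["w"] * 9

hyp_len = len(hypothesis)
r = closest_ref_length(references, hyp_len)  # 10, because |10 - 9| < |6 - 9|
bp = brevity_penalty(r, hyp_len)             # exp(1 - 10/9) ≈ 0.895, i.e. BP < 1
print(r, bp)
```

Even though one reference is shorter than the hypothesis, the effective reference length is the longer one here, so BP drops below 1.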

Is ROUGE mainly about recall, with precision and the F-score just added as a complement, or is ROUGE considered to be the combination of those three scores?

ROUGE is primarily recall-based, but the $F_1$ version (a combination of precision and recall) is often reported as well. Please refer to this answer.
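
For illustration, here is a simplified, hand-rolled ROUGE-1 sketch on made-up sentences that makes the recall/precision/$F_1$ relationship explicit; actual packages such as rouge-metric add multi-reference handling and other details on top of this:

```python
from collections import Counter

def rouge_1(hypothesis, reference):
    """Hand-rolled ROUGE-1 on tokenized sentences (single reference only)."""
    hyp_counts, ref_counts = Counter(hypothesis), Counter(reference)
    # Clipped unigram overlap between hypothesis and reference.
    overlap = sum(min(count, ref_counts[token]) for token, count in hyp_counts.items())
    recall = overlap / len(reference)      # what ROUGE-1 is defined on
    precision = overlap / len(hypothesis)  # the extra number implementations report
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

print(rouge_1("the cat sat on the mat today".split(),
              "the cat sat on the mat".split()))
# -> (0.857..., 1.0, 0.923...)
```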

References:

  1. Bilingual Evaluation Understudy (BLEU)
  2. BLEU: a Method for Automatic Evaluation of Machine Translation
  3. What is ROUGE and how it works for evaluation of summarization tasks?
  4. Lecture 15: Natural language generation
Lerner Zhang
  • Thanks for the detailed reply. However, as stated in my initial post, I made sure that the brevity penalty is 1 in my case, so it is not the cause of the difference in precision scores. By now I have a theory based on a discussion I had over on the Data Science exchange: BLEU uses a micro-average while the ROUGE implementation I use employs a macro-average. However, I am still uncertain whether this is an error in that ROUGE implementation or whether it is correct. From reading the ROUGE paper I would have guessed it is micro-averaged as well, but a few implementations I found use a macro-average. – jdepoix Jan 16 '21 at 11:52
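
For concreteness, here is a toy illustration (all numbers made up) of the micro- versus macro-averaging distinction raised in the comment above:

```python
# Hypothetical per-segment clipped unigram matches and hypothesis lengths.
matches = [2, 9]
hyp_lens = [4, 10]

# Micro-average (what corpus BLEU does): pool the counts first, then divide.
micro = sum(matches) / sum(hyp_lens)                                  # 11 / 14 ≈ 0.786

# Macro-average (what some ROUGE implementations do): score each segment, then average.
macro = sum(m / l for m, l in zip(matches, hyp_lens)) / len(matches)  # (0.5 + 0.9) / 2 = 0.70

print(micro, macro)
```

With segments of unequal length the two averages diverge, which could account for a gap like 0.55673 versus 0.7249 even when the brevity penalty is 1.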