4

ROUGE seems to be the standard way of evaluating the quality of machine-generated summaries of text documents by comparing them with human-generated reference summaries. $$\mathrm{ROUGE}_{n}= \frac{\sum_{s\in \text{Ref Summaries}} \sum_{gram_{n}\in s} Count_{match}(gram_{n})}{\sum_{s\in \text{Ref Summaries}} \sum_{gram_{n}\in s} Count(gram_{n})}$$

Based on the formula above, ROUGE checks only for recall, so I could just generate a summary that is the concatenation of all the reference summaries and get a perfect ROUGE score.
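For concreteness, here is a minimal sketch of the recall formula above, assuming whitespace tokenization and clipped n-gram counting; the example sentences and the `ngrams` / `rouge_n_recall` names are only illustrative, not taken from any particular ROUGE implementation. It also shows the concatenation trick driving recall to 1:

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, references, n=1):
    """ROUGE-N recall as in the formula above: clipped n-gram matches
    summed over the references, divided by the total n-gram count in the references."""
    cand_counts = ngrams(candidate.lower().split(), n)
    matches = total = 0
    for ref in references:
        ref_counts = ngrams(ref.lower().split(), n)
        # Count_match: a reference n-gram is matched at most as often as it
        # occurs in the candidate (clipping).
        matches += sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
        total += sum(ref_counts.values())
    return matches / total if total else 0.0

references = ["the cat sat on the mat", "a cat was sitting on the mat"]
# Gaming the metric: concatenating the references reproduces every reference n-gram.
gamed_summary = " ".join(references)
print(rouge_n_recall(gamed_summary, references, n=2))  # 1.0 (perfect recall)
```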

Is it always the case that ROUGE has to be considered in light of some other metric related to precision (either BLEU or some cap on the length of the summary)?

wabbit
  • I just read some papers on text summarization, and all of them used ROUGE-1, ROUGE-2 and ROUGE-L as their measure of performance. I also stumbled on a paper that deals with gaming evaluation metrics (http://arxiv.org/abs/1603.08023) - I haven't read it yet, but another paper mentioned it in that context. – bam Jul 01 '17 at 16:59

1 Answer

3

Based on the formula above, ROUGE checks only for recall, so I could just generate a summary that is the concatenation of all the reference summaries and get a perfect ROUGE score.

Yes, but typically your summarization algorithm will not have access to the reference summaries at test time. You could, however, get a perfect ROUGE recall simply by outputting the entire source text as your summary.

Is it always the case that ROUGE has to be considered in light of some other metric related to precision (either BLEU or some cap on the length of the summary)?

You can compute both ROUGE recall and ROUGE precision, or combine them into a ROUGE F1 score. Alternatively, set a maximum length for the summary.
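As a rough sketch of how precision and F1 follow from the same clipped n-gram counts (single reference, whitespace tokenization; the `rouge_n` name and example sentences are illustrative, not from any particular package):

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Clipped n-gram overlap scored three ways: recall, precision and F1."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum(min(c, cand[g]) for g, c in ref.items())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}

reference = "the cat sat on the mat"
padded = "the cat sat on the mat and many other unrelated words"
# Recall is perfect, but precision (and hence F1) penalizes the padding.
print(rouge_n(padded, reference, n=1))
# {'recall': 1.0, 'precision': 0.545..., 'f1': 0.705...}
```

Copying the whole document maximizes recall but is penalized by precision and therefore by F1; in practice you would use an existing ROUGE package rather than rolling your own, but the arithmetic is the same.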

Lerner Zhang
Franck Dernoncourt