Note: I'll focus on $ROUGE-1$, but the same holds for $ROUGE-N$.
For a machine-produced summary $M$ and a set of reference summaries $RefSummaries$, I believe $ROUGE-1$ can be calculated in the following manner:
$$ROUGE\textrm{-}1 = \frac{\sum\limits_{R \in \{RefSummaries\}} \sum\limits_{unigram \: i \in R} min(count(i, M), count(i, R))}{\sum\limits_{R \in \{RefSummaries\}} \sum\limits_{unigram \: i \in R} count(i, R)}$$
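In code, my reading of that formula would be something like the following minimal Python sketch (the tokenization, function name, and example inputs are mine, purely for illustration):

```python
from collections import Counter

def rouge_1(machine_tokens, reference_summaries):
    """ROUGE-1 as written above: clipped unigram counts summed over all
    references, divided by the total number of unigrams in the references."""
    sys_counts = Counter(machine_tokens)
    overlap, total = 0, 0
    for ref_tokens in reference_summaries:
        ref_counts = Counter(ref_tokens)
        for unigram, count in ref_counts.items():
            overlap += min(sys_counts[unigram], count)  # min(count(i, M), count(i, R))
            total += count                              # count(i, R)
    return overlap / total

# Toy example with a single reference summary.
machine = "the cat was found near the bed".split()
references = ["the cat was under the bed".split()]
print(rouge_1(machine, references))  # 5/6 ≈ 0.83
```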
For simplicity, let's suppose that there's only one reference summary $R$ with 10 unigrams (i.e., words). For any single unigram $i$, the per-unigram value $\frac{min(count(i, M), \; count(i, R))}{count(i, R)}$ is at most 1, which makes perfect sense to me, because the clipped count in the numerator can never exceed the reference count in the denominator. If we define the total value of $ROUGE-1$ as the sum of these per-unigram values over all unigrams in $R$, it then follows that the maximum possible total value of $ROUGE-1$ is 10, because there are 10 unigrams, and each of them contributes at most 1. Why doesn't this final value get normalized by the number of unigrams, i.e. $\frac{10}{10}=1.0$?
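To make that bookkeeping concrete, here is a small sketch of the per-unigram values and the "total value" I just described, using a hypothetical 10-word reference; `per_unigram`, `total`, and the final division are my own definitions for the purpose of this question, not the standard ROUGE-1 computation:

```python
from collections import Counter

# Hypothetical single reference with 10 distinct unigrams, plus a machine summary.
reference = "police killed the gunman in a shootout near downtown yesterday".split()
machine = "police killed the gunman in a shootout".split()

ref_counts = Counter(reference)
sys_counts = Counter(machine)

# Per-unigram value as described above: clipped count divided by the reference count.
per_unigram = {w: min(sys_counts[w], c) / c for w, c in ref_counts.items()}

total = sum(per_unigram.values())     # at most 10, one contribution per unigram
normalized = total / len(ref_counts)  # the 10/10-style normalization I'm asking about

print(total, normalized)  # 7.0 0.7
```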
When reading research papers, I've noticed that they never normalize that value either, which is why I'm asking. For example, in the paper "A Neural Attention Model for Abstractive Sentence Summarization" by Alexander M. Rush et al., a $ROUGE-1$ score of 26.55 is reported for the ABS model. The goal of that paper is sentence-level summarization, and in Section 7.1, which describes the dataset, the authors note that the average sentence length in the dataset is 31.3 words. If the reported $ROUGE-1$ value were normalized by the average sentence length, the result would be $\frac{26.55}{31.3} = 0.8482$. To me, that value is much more meaningful than 26.55, especially when comparing results from different papers on different datasets.
Lest you think it's a one-off, "ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training" by Weizhen Qi et al. didn't normalize the value either, and reported a $ROUGE-1$ score of 43.68 on the CNN/DM dataset.
I'm sure that I'm missing something and that plenty of other people have thought of this before me, so why doesn't $ROUGE-N$ get normalized by the number of $N$-grams present in the reference summary?