
Given an ordering of tokens extracted from a document, with a ground-truth ordering available, what would be the correct way to evaluate the ordering?

I took a look at some machine translation evaluation metrics such as Word Error Rate (WER) and BLEU, and these might score the correct ordering highly, but I wonder whether insertions and deletions as operations in WER make sense if I am only interested in evaluating the ordering (not, for example, whether a word was extracted correctly).
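For concreteness, here is a minimal sketch of the issue (plain Python; the sentences are made up): WER computed as token-level edit distance still penalizes a hypothesis that contains exactly the right tokens in the wrong order, and the only operations it counts are insertions, deletions, and substitutions, none of which is order-specific.

```python
def wer(reference, hypothesis):
    """Word error rate: token-level Levenshtein distance divided by
    the reference length (substitutions, insertions, deletions)."""
    r, h = reference, hypothesis
    # DP table: d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(r)][len(h)] / len(r)

ref = "the quick brown fox jumps".split()
hyp = "quick the brown jumps fox".split()  # same tokens, wrong order
print(wer(ref, hyp))  # 0.8: ordering errors surface as ins/del/sub edits
```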

The BLEU metric relies on n-gram precision and would probably not work well if the document repeats words or n-grams a lot. It also seems designed for short sequences, so if a whole paragraph is out of order (which I expect will be the more common case) it won't work well.
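To illustrate both concerns (this assumes NLTK is installed; the sentences are toy examples): BLEU-1 ignores order entirely, and repeated n-grams prop up the higher-order precisions even when the ordering is badly wrong.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

ref = "the cat sat on the mat and the dog sat on the rug".split()
hyp = ref[7:] + ref[:7]  # two halves swapped: badly out of order

smooth = SmoothingFunction().method1
bleu1 = sentence_bleu([ref], hyp, weights=(1, 0, 0, 0))
bleu4 = sentence_bleu([ref], hyp, smoothing_function=smooth)

print(bleu1)  # 1.0 -- every unigram matches, order is ignored
print(bleu4)  # much lower, but still nonzero thanks to repeated n-grams
```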

What other metrics would you suggest?

1 Answer


BLEU captures word order to some extent. You can get a better idea by trying the BLEU-$n$ metric, where $n$ is the length of the longest $n$-gram being considered. (In the early days of image captioning, the typical evaluation metric was BLEU-1.) But as you noted, it becomes less reliable as the text grows longer. If you can reasonably sentence-split the documents, though, it should work well.
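As a hedged sketch of that sweep (again using NLTK; `ref` and `hyp` are toy stand-ins for your sentence-split data): `sentence_bleu` takes BLEU-$n$ as uniform weights over the first $n$ n-gram orders.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

ref = "the cat sat on the mat and the dog sat on the rug".split()
hyp = ref[7:] + ref[:7]  # same tokens, two halves swapped
smooth = SmoothingFunction().method1

for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))
    score = sentence_bleu([ref], hyp, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")  # score drops as n grows
```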

Alternatively, I would suggest measuring perplexity with respect to a language model. If you worry that the perplexity would be too strongly influenced by lexical choice, you might consider a language model trained on sequences of POS tags.
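A minimal sketch of that idea, assuming NLTK for POS tagging (the tagger data must be downloaded separately), a toy two-sentence training corpus, and add-one smoothing as a placeholder smoothing choice:

```python
import math
from collections import Counter

import nltk  # also needs: nltk.download('averaged_perceptron_tagger')

def pos_tags(tokens):
    """Map a token sequence to its POS-tag sequence."""
    return [tag for _, tag in nltk.pos_tag(tokens)]

def train_bigram_lm(tag_sequences):
    """Add-one-smoothed bigram model over POS tags."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for tags in tag_sequences:
        padded = ["<s>"] + tags + ["</s>"]
        vocab.update(padded)
        unigrams.update(padded[:-1])
        bigrams.update(zip(padded[:-1], padded[1:]))
    V = len(vocab)
    return lambda prev, cur: (bigrams[(prev, cur)] + 1) / (unigrams[prev] + V)

def perplexity(prob, tags):
    padded = ["<s>"] + tags + ["</s>"]
    logp = sum(math.log(prob(p, c)) for p, c in zip(padded[:-1], padded[1:]))
    return math.exp(-logp / (len(padded) - 1))

# Toy corpus standing in for a real POS-tagged training set.
train = [pos_tags(s.split()) for s in [
    "the quick brown fox jumps over the lazy dog",
    "a small cat sleeps on the warm mat",
]]
lm = train_bigram_lm(train)

good = pos_tags("the brown dog sleeps on the mat".split())
bad = list(reversed(good))  # same tags, scrambled order
print(perplexity(lm, good))  # should be lower: familiar tag transitions
print(perplexity(lm, bad))   # should be higher: unfamiliar transitions
```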

Jindřich
  • Sentence splitting is going to be tricky. I thought I could also mix and match BLEU and WER: a high WER together with a high word-level BLEU-1 means the order is probably wrong, since word-level precision is high. – Youssef Fares Jul 05 '20 at 17:52
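That heuristic from the comment could be sketched as a simple rule; the threshold values here are arbitrary placeholders, not recommendations.

```python
def looks_misordered(wer_score, bleu1_score,
                     wer_thresh=0.5, bleu1_thresh=0.9):
    """High WER (many edit operations) despite high unigram precision
    suggests ordering errors rather than extraction errors."""
    return wer_score > wer_thresh and bleu1_score > bleu1_thresh
```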