Given an ordering of tokens extracted from a document, and a ground-truth ordering to compare it against, what would be the correct way to evaluate the predicted ordering?
I took a look at some machine translation evaluation metrics such as Word Error Rate (WER) and BLEU. These might score the correct ordering highly, but I wonder whether WER's insertion and deletion operations make sense when I am only interested in evaluating the ordering (and not, for example, whether a word was extracted correctly).
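To make the WER concern concrete, here is a minimal sketch of word-level edit distance applied to a pure reordering. The function names and the unit cost for every operation are my own assumptions for illustration, not taken from any particular WER library.

```python
def word_edit_distance(reference, hypothesis):
    """Word-level Levenshtein distance; every operation costs 1 (an assumption)."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all remaining reference words
    for j in range(cols):
        dist[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[-1][-1]


def wer(reference, hypothesis):
    """Edit distance normalized by the reference length."""
    return word_edit_distance(reference, hypothesis) / len(reference)


reference = "the cat sat on the mat".split()
# Same tokens, first word moved to the very end.
moved_far = "cat sat on the mat the".split()
# Same tokens, first two words swapped.
swapped_adjacent = "cat the sat on the mat".split()

print(wer(reference, moved_far))
print(wer(reference, swapped_adjacent))
```

Both hypotheses contain exactly the reference tokens, yet each gets an edit distance of 2: the long-range move is scored as one deletion plus one insertion, the same cost as the adjacent swap. So WER can only describe a reordering in terms of operations that have nothing to do with ordering, and it does not distinguish how far a token was displaced.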
BLEU relies on n-gram precision, so it would probably not work well if the document repeats words or n-grams a lot. I think it is designed to work well on short sequences, so if a whole paragraph is out of order (which I expect to be the more common case), it won't work well.
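The paragraph-level concern can be illustrated with a small sketch of clipped n-gram precision, the quantity BLEU is built on. This is a simplified stand-in under my own assumptions (hypothetical helper names, a single reference, no brevity penalty, no geometric mean over n), not a full BLEU implementation.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(reference, hypothesis, n):
    """Fraction of hypothesis n-grams also found in the reference,
    with each n-gram's count clipped to its count in the reference."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


# Two toy "paragraphs"; the hypothesis has them in the wrong order.
paragraph_a = "a b c d".split()
paragraph_b = "e f g h".split()
reference = paragraph_a + paragraph_b
paragraphs_swapped = paragraph_b + paragraph_a

print(clipped_precision(reference, paragraphs_swapped, 1))
print(clipped_precision(reference, paragraphs_swapped, 2))
```

Because the hypothesis is a permutation of the reference, unigram precision is 1.0 no matter what the order is. Bigram precision is 6/7 here: only the single bigram spanning the boundary between the two blocks is wrong, even though every token sits in the wrong half of the document. The longer the blocks, the closer the score gets to 1.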
What other metrics would you suggest?