
In speech-to-text, one common metric is the word error rate (WER).

WER is based on the word-level Levenshtein distance, i.e. the minimum number of substitutions ($S$), deletions ($D$), and insertions ($I$) needed to transform the prediction into the ground truth, normalized by the ground-truth sequence length $N$:

$$WER = \frac{I+S+D}{N}$$
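
For concreteness, here is a minimal sketch of how the per-utterance WER can be computed with the standard dynamic-programming edit distance. The `word_edit_distance` and `wer` helpers are illustrative names of my own, not functions from any particular toolkit:

    def word_edit_distance(ref, hyp):
        """Minimum number of substitutions, deletions and insertions
        needed to turn the word list hyp into the word list ref."""
        # Standard Levenshtein DP table: d[i][j] is the edit distance between
        # the first i reference words and the first j hypothesis words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i  # delete all i reference words
        for j in range(len(hyp) + 1):
            d[0][j] = j  # insert all j hypothesis words
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution or match
        return d[len(ref)][len(hyp)]

    def wer(ref_text, hyp_text):
        ref = ref_text.split()
        return word_edit_distance(ref, hyp_text.split()) / len(ref)

    print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 0.33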

Now, there are two ways I would interpret $WER$ over a data set.

One way is to take the average of the per-utterance WER, denoted $g_{WER}$:

$$g_{WER} = \mathbb{E}\left[{\frac{I+S+D}{N}}\right]$$

which is heavily skewed by errors on short ground-truth sequences.
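
To see the skew concretely, consider a hypothetical toy corpus with one 1-word utterance and one 20-word utterance (reusing the illustrative `wer` helper sketched above; the numbers are purely for illustration):

    # Hypothetical toy corpus: one very short and one long utterance.
    refs = ["yes",
            "please schedule a meeting with the finance team for nine thirty "
            "on tuesday morning in the large conference room downstairs"]
    hyps = ["no",   # 1 error on 1 word   -> per-utterance WER = 1.0
            "please schedule a meeting with the finance team for nine thirteen "
            "on tuesday morning in the large conference room downstairs"]  # 1 error on 20 words -> 0.05

    g_wer = sum(wer(r, h) for r, h in zip(refs, hyps)) / len(refs)
    print(g_wer)  # (1.0 + 0.05) / 2 = 0.525, dominated by the one-word utterance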

The other way, denoted $f_{WER}$, is to sum the errors over all points and then normalize by the sum of the sequence lengths:

$$f_{WER} = \frac{\sum_i (I+S+D)_i}{\sum_i N_i} = \frac{\mathbb{E}\left[ I+S+D \right]}{\mathbb{E}\left[ N \right]}$$

This removes the strong dependence on errors in short utterances. It can also be interpreted as the $WER$ of one very long concatenated sequence.
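
On the same hypothetical toy corpus, the corpus-level normalization gives a very different number (again using the illustrative helpers above):

    # Sum the errors over all utterances, then divide once by the total word count.
    total_errs = sum(word_edit_distance(r.split(), h.split())
                     for r, h in zip(refs, hyps))
    total_words = sum(len(r.split()) for r in refs)
    f_wer = total_errs / total_words
    print(f_wer)  # 2 / 21 ≈ 0.095, the same as the WER of the concatenated corpus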

Which version is canonical?


1 Answer

The second version is the one most commonly used. For example, in Kaldi's compute-wer.cc you can see that they do not normalize by length for each individual ground-truth sequence, but only once at the end:

https://github.com/kaldi-asr/kaldi/blob/85a3dd5f0b71e419abf1169a26b759bfc423a543/src/bin/compute-wer.cc#L94:

    int32 num_words = 0, word_errs = 0;

    // Main loop, accumulate WER stats
    for (; !ref_reader.Done(); ref_reader.Next()) {
      [...]
      num_words += ref_sent.size();
      word_errs += LevenshteinEditDistance(ref_sent, hyp_sent, &ins, &del, &sub);
      [...]
    }

    // Compute WER, SER,
    BaseFloat percent_wer = 100.0 * static_cast<BaseFloat>(word_errs)
          / static_cast<BaseFloat>(num_words);

FYI: How to normalize text when computing the word error rate of a speech recognition system?
