In speech to text, one common metric is the word error rate (WER).
WER is based on the word-level Levenshtein distance: the minimum number of substitutions ($S$), deletions ($D$), and insertions ($I$) needed to transform the prediction into the ground truth, normalized by the ground-truth sequence length $N$.
$$WER = \frac{I+S+D}{N}$$
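As a concrete sketch of the definition above, here is a minimal word-level implementation, assuming whitespace tokenization (the function names are illustrative, not from any particular library):

```python
def word_edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (S + D + I)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of ref's first i tokens
    for j in range(n + 1):
        d[0][j] = j  # insert all of hyp's first j tokens
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

def wer(ref, hyp):
    """WER = (I + S + D) / N, with N the ground-truth length."""
    ref_tokens, hyp_tokens = ref.split(), hyp.split()
    return word_edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)
```

For example, `wer("the cat sat", "the cat sat down")` is one insertion against a three-word reference, giving $1/3$.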
Now, there are two reasonable ways to aggregate $WER$ over a data set.
One way is to take the average WER, denoted $g_{WER}$, over all data points:
$$g_{WER} = \mathbb{E}\left[{\frac{I+S+D}{N}}\right]$$
which is heavily skewed by errors on short ground-truth sequences.
The other way, denoted as $f_{WER}$, is to sum the errors of all points then normalize by the sum of sequence lengths.
$$f_{WER} = \frac{\sum_i (I+S+D)_i}{\sum_i N_i} = \frac{\mathbb{E}\left[ I+S+D \right]}{\mathbb{E}\left[ N \right]}$$
This removes the strong dependence on errors in short statements. It can also be interpreted as the $WER$ of the entire data set concatenated into one very large sequence.
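To make the difference concrete, here is a sketch comparing the two aggregation schemes on made-up per-utterance counts (each pair is an illustrative $(I+S+D, N)$ for one utterance):

```python
# (edit errors, ground-truth length) per utterance; illustrative data only.
data = [(1, 1), (2, 20), (3, 30)]

# g_WER: mean of per-utterance WERs.
g_wer = sum(e / n for e, n in data) / len(data)

# f_WER: pooled errors over pooled reference length.
f_wer = sum(e for e, _ in data) / sum(n for _, n in data)
```

The single one-word utterance with one error contributes a WER of $1.0$ to the average, pushing $g_{WER}$ to $0.4$, while $f_{WER} = 6/51 \approx 0.118$ barely notices it.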
Which version is canonical?