Often, the loss function used for a sequence is the cross-entropy loss between $y_{true}$ and $y_{pred}$, where both are of size $SeqLength \times NumClasses$. When $y_{pred}=y_{true}$ we get the lowest loss. However, if we shift the values of $y_{pred}$ left or right temporally, or swap some adjacent tokens, we get a much higher loss, even when the shifted result is a semantically similar sequence.
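For concreteness, here is a minimal PyTorch sketch of the problem (the sequence, class count, and logit scale are made-up values):

```python
import torch
import torch.nn.functional as F

# A "perfect" prediction vs. the same sequence shifted one step in time.
y_true = torch.tensor([0, 1, 2, 3, 1])                             # (SeqLength,)
logits_exact = F.one_hot(y_true, num_classes=4).float() * 10.0     # near-one-hot logits
y_shifted = torch.roll(y_true, shifts=1)                           # temporal shift by 1
logits_shifted = F.one_hot(y_shifted, num_classes=4).float() * 10.0

print(F.cross_entropy(logits_exact, y_true).item())    # ~0: matches the target
print(F.cross_entropy(logits_shifted, y_true).item())  # large: every position "wrong"
```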
Levenshtein/edit distance comes to mind as a suitable building block for a loss function that accounts for temporal shift. However, I do not know (1) how it would be extended to operate on probability distributions over tokens rather than discrete tokens, and (2) how to implement the algorithm so that it can be auto-differentiated. Does such a loss function, or something similar, exist?
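To make the question concrete, here is a rough (possibly naive) sketch of what I am picturing: the hard min in the Levenshtein recurrence replaced by a soft-min (the logsumexp trick used by Soft-DTW, Cuturi & Blondel 2017), and the 0/1 substitution cost replaced by the expected mismatch probability. The function name, the unit insert/delete costs, and the choice of gamma are all my own placeholders, not an existing API:

```python
import torch
import torch.nn.functional as F

def softmin(*vals, gamma=1.0):
    # Smooth, differentiable stand-in for min(), as in Soft-DTW;
    # smaller gamma -> closer to a hard min.
    return -gamma * torch.logsumexp(-torch.stack(vals) / gamma, dim=0)

def soft_edit_distance(pred_probs, true_onehot, gamma=1.0):
    """Levenshtein DP with a soft-min, so every step is differentiable.

    pred_probs:  (N, C) predicted token distributions (rows sum to 1).
    true_onehot: (M, C) one-hot encoded target sequence.
    Substitution cost is the expected mismatch probability 1 - <p_i, q_j>;
    insert/delete use a flat cost of 1, mirroring unit edit costs.
    """
    n, m = pred_probs.shape[0], true_onehot.shape[0]
    sub = 1.0 - pred_probs @ true_onehot.T          # (N, M) expected mismatch
    # Table of scalar tensors so autograd can trace every cell.
    D = [[None] * (m + 1) for _ in range(n + 1)]
    D[0][0] = torch.zeros(())
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + 1.0                 # delete all predicted tokens
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + 1.0                 # insert all target tokens
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = softmin(
                D[i - 1][j] + 1.0,                    # deletion
                D[i][j - 1] + 1.0,                    # insertion
                D[i - 1][j - 1] + sub[i - 1, j - 1],  # (soft) substitution
                gamma=gamma,
            )
    return D[n][m]

# Toy usage: gradients flow back to the logits through the DP.
logits = torch.randn(5, 4, requires_grad=True)
target = F.one_hot(torch.tensor([0, 1, 2, 3, 1]), num_classes=4).float()
loss = soft_edit_distance(torch.softmax(logits, dim=-1), target)
loss.backward()
print(loss.item(), logits.grad.shape)
```

Even if something like this runs, I am not sure it is principled (or efficient, given the Python-level $O(NM)$ loop), hence the question.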