
Often, the loss function used for a sequence is the cross-entropy loss between $y_{true}$ and $y_{pred}$, where both are of size $SeqLength \times NumClasses$. When $y_{pred} = y_{true}$ we get the lowest loss; however, if we shift the values of $y_{pred}$ to the left or right temporally, or swap some adjacent tokens, we get a much higher loss, even if the result of this shift is a semantically similar sequence.
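To make the issue concrete, here is a small illustration (PyTorch assumed; the toy token sequences are made up for the example) of how a one-step temporal shift of an otherwise identical sequence blows up the per-token cross-entropy:

```python
import torch
import torch.nn.functional as F

num_classes = 5

# A "true" token sequence (0 used as padding) and a prediction that is the
# same sequence shifted right by one position -- semantically almost identical.
y_true    = torch.tensor([1, 2, 3, 4, 0, 0])
y_shifted = torch.tensor([0, 1, 2, 3, 4, 0])

# Turn both into confident logits of shape (SeqLength, NumClasses).
logits_exact   = F.one_hot(y_true, num_classes).float() * 10.0
logits_shifted = F.one_hot(y_shifted, num_classes).float() * 10.0

print(F.cross_entropy(logits_exact, y_true))    # ~0: perfect alignment
print(F.cross_entropy(logits_shifted, y_true))  # large: 5 of 6 positions "wrong"
```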

Levenshtein/edit distance comes to mind as a suitable building block for a loss function that takes temporal shift into account. However, I do not know (1) how this would be implemented to take the probability distribution over tokens into account, and (2) how to implement the algorithm so that it can be auto-differentiated. Does such a loss function, or something similar, exist?
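One possible direction is to smooth the hard `min` in the Levenshtein recurrence with a soft-min (negative log-sum-exp), which makes the whole dynamic program differentiable. Below is a minimal sketch of that idea, assuming PyTorch; the function name `soft_edit_distance`, the use of the expected mismatch probability as the substitution cost, and the `gamma` temperature are all my own choices for illustration, not an established API:

```python
import torch

def soft_edit_distance(log_probs, target, gamma=1.0):
    """Differentiable (soft) edit distance between a predicted distribution
    sequence and an integer target sequence.

    log_probs: (T_pred, num_classes) log-probabilities for each predicted step.
    target:    (T_true,) integer token ids.
    gamma:     smoothing temperature; smaller values approach the hard min.
    """
    T_pred = log_probs.shape[0]
    T_true = target.shape[0]

    # Substitution cost = expected mismatch probability: 1 - P(pred_i == target_j).
    probs = log_probs.exp()
    sub_cost = 1.0 - probs[:, target]                     # (T_pred, T_true)

    def softmin(*vals):
        stacked = torch.stack(vals)
        return -gamma * torch.logsumexp(-stacked / gamma, dim=0)

    # DP table stored as nested lists of 0-dim tensors to keep autograd simple.
    D = [[None] * (T_true + 1) for _ in range(T_pred + 1)]
    D[0][0] = log_probs.new_zeros(())
    for i in range(1, T_pred + 1):
        D[i][0] = D[i - 1][0] + 1.0                       # deletions only
    for j in range(1, T_true + 1):
        D[0][j] = D[0][j - 1] + 1.0                       # insertions only
    for i in range(1, T_pred + 1):
        for j in range(1, T_true + 1):
            D[i][j] = softmin(
                D[i - 1][j] + 1.0,                        # delete predicted token
                D[i][j - 1] + 1.0,                        # insert target token
                D[i - 1][j - 1] + sub_cost[i - 1, j - 1]  # (soft) substitution
            )
    return D[T_pred][T_true]
```

This is only a sketch: the Python double loop is O(T_pred * T_true) per sequence, so a practical version would vectorise over the batch and/or the anti-diagonals of the DP table.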

Avelina
  • Jaccard similarity might be better if we are dealing with token sets in a batch; it would be a custom implementation. Looks like such a loss does indeed exist in the literature: https://arxiv.org/abs/1911.01685 – msuzen Jul 26 '21 at 03:09
  • @MehmetSuzen I found a Jaccard implementation which works, however it exhibits unusual behaviour while training: it seems to cause the model to converge to predicting only padding tokens for several epochs, while cross entropy does not display such an issue. I had an alternate idea... what if we convolved $y_{true}$ with a Gaussian kernel along the temporal axis as a form of temporal label smoothing for cross entropy loss (see the sketch after these comments)? Could that work? The idea is that it would allow for temporal deviation, with a distance penalty controlled by a stddev smoothing hyperparameter. – Avelina Jul 26 '21 at 13:21
  • Probably this is an artefact of the Jaccard loss not being as smooth as cross-entropy; intuitively speaking, it could also be that the dataset is not large enough. Yes, some sort of smoothing may help, but we need to be careful: smoothing usually curates additional data that is not there. – msuzen Jul 26 '21 at 16:59
  • @MehmetSuzen the dataset is actually generative, so I have a near-infinite combination of sequence sources and targets. And can I ask what you mean by "smoothing usually curates additional data that is not there"? Does that mean smoothing could result in underfitting? – Avelina Jul 27 '21 at 15:36
  • Yes, depending on the dataset characteristics and the smoothing technique, it may overfit or underfit. So first trying to understand how the smoothing behaves against a ground truth may help avoid upstream issues. – msuzen Jul 27 '21 at 15:52
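
A minimal sketch of the Gaussian "temporal label smoothing" idea raised in the comments (PyTorch assumed; the helper name `temporal_label_smoothing` and the default kernel parameters are mine, not from the question):

```python
import torch
import torch.nn.functional as F

def temporal_label_smoothing(targets, num_classes, std=1.0, kernel_size=5):
    """Convolve one-hot targets with a Gaussian kernel along the time axis.

    targets: (T,) integer token ids.
    Returns a (T, num_classes) soft target distribution per time step.
    """
    # Build a normalised 1-D Gaussian kernel.
    half = kernel_size // 2
    x = torch.arange(-half, half + 1, dtype=torch.float32)
    kernel = torch.exp(-0.5 * (x / std) ** 2)
    kernel = kernel / kernel.sum()

    one_hot = F.one_hot(targets, num_classes).float()        # (T, C)
    # Treat each class as a channel and convolve along the time axis.
    x_in = one_hot.t().unsqueeze(0)                          # (1, C, T)
    weight = kernel.view(1, 1, -1).repeat(num_classes, 1, 1) # (C, 1, k)
    smoothed = F.conv1d(x_in, weight, padding=half, groups=num_classes)
    smoothed = smoothed.squeeze(0).t()                       # (T, C)
    # Renormalise so each time step is a valid distribution.
    return smoothed / smoothed.sum(dim=-1, keepdim=True)

# The soft targets can then be used with a soft cross-entropy / KL-style loss:
#   loss = -(soft_targets * log_probs).sum(-1).mean()
# where std controls how much temporal deviation is tolerated before the
# penalty grows.
```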

0 Answers