1

Background: I have an interesting little problem where I need some sort of metric to compare strings (or list of numbers of whatever), with the added caveat that the base/ground string (the thing which I am comparing against) is somewhat (but not fully) uncertain. And to be perfectly honest; both strings have some uncertainty in them, hence what I would really want is a string matching algorithm that takes this uncertainty into account before issuing a similarity score.


Here is an example that I stole directly from the Wikipedia article for the Hamming distance. The Hamming distance for

"karolin" and "kathrin" is 3.

i.e. the number of positions at which the corresponding symbols are different.

My problem is more along the lines of (where we assume the name to the far left is the ground-truth)

"ka[r,k]ol[i,l]n" and "kathrin"

where I have replaced the third position of the name with my uncertain as to what it should be, and I have two options r or k (inside the brackets). The same for the fifth position.

Question:
How can I take my uncertainty into account when comparing my generated string (in this example kathrin) to my uncertain ground-truth (in this example ka[r,k]ol[i,l]n)? Would it be better to perhaps consider the mutual information, or the KL divergence (somehow) or something else?

EDIT 1:

It would appear that mine falls under the 'approximate string matching' problem. Still looking for a version where one could e.g. specify densities over uncertain parts of the string (if such a thing exists).

Ben
  • 91,027
  • 3
  • 150
  • 376
Astrid
  • 745
  • 7
  • 17
  • 1
    Here's one approach: William Winkler, *Approximate String Comparator Search Strategies for Very Large Administrative Lists*... https://www.census.gov/srd/papers/pdf/rrs2005-02.pdf – Mike Hunter Mar 21 '17 at 12:09
  • Thanks to the moderator who updated my question. This is much better. – Astrid Mar 21 '17 at 15:22
  • 1
    Check [string kernel](https://en.wikipedia.org/wiki/String_kernel) – Haitao Du Mar 21 '17 at 16:57

0 Answers0