Jaccard distance vs Levenshtein distance for fuzzy matching

Question

My data is similar to the following data, but far bigger and more complex.

Apple
Banana
Those fruits
Tomato 
Cucumber
These vegetables

I would like to get the following result:

Those fruits
These vegetables

Using the agrep/agrepl functions in R, I obtained a first result. However, agrep and agrepl use the Levenshtein distance by default. An alternative would be the Jaccard distance.

Jaccard distance vs Levenshtein distance: Which distance is better for fuzzy matching?

There is already a similar question: Properties of Levenshtein, N-Gram, cosine and Jaccard distance coefficients - in sentence matching. However, I would like to know specifically which distance works best for fuzzy matching.

Extra credit: Are other distance measures (e.g. N-Gram, Cosine, Geometric, Manhattan) also useful for fuzzy matching? Implementations in R are also welcome.
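To make the comparison concrete, here is a minimal sketch of both distances. It is written in Python purely for illustration; the function names (`levenshtein`, `qgrams`, `jaccard_distance`) and the choice of character bigrams (`q = 2`) are assumptions, not something taken from agrep. Levenshtein counts single-character edits, while Jaccard compares the sets of q-grams the two strings contain.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


def qgrams(s: str, q: int = 2) -> set:
    """Set of overlapping character q-grams (bigrams by default; an assumption)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard_distance(a: str, b: str, q: int = 2) -> float:
    """1 - |A ∩ B| / |A ∪ B| over the q-gram sets of a and b."""
    set_a, set_b = qgrams(a, q), qgrams(b, q)
    if not set_a and not set_b:
        return 0.0  # convention chosen here: two empty sets are identical
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)
```

With these definitions, `levenshtein("Cocumber", "Cucumber")` is 1, while the bigram Jaccard distance between the same two strings is 4/9, because a single substitution changes two bigrams. Conversely, the Jaccard distance between "those fruits" and "fruits those" is only 4/13, since reordering words leaves most bigrams intact, whereas an edit distance has to pay for the move character by character. Which behaviour is "better" therefore depends on whether the mismatches in the data are mostly typos or mostly reorderings. In R, the stringdist package mentioned in the comments provides both metrics via `stringdist(a, b, method = "lv")` and `method = "jaccard"`.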

If your data are far bigger, something like LSH might be faster and also preserve the "fuzzy" property in a very specific sense. Some LSH schemes can easily be shown to be probabilistic estimates of Jaccard similarity. – Sycorax, Oct 14 '16 at 16:12
In R you have the [stringdist](https://cran.r-project.org/web/packages/stringdist/) package. You might want to check that one out. – phiver, Oct 15 '16 at 06:47
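
As an illustration of the LSH comment above, MinHash is one scheme whose collision rate estimates Jaccard similarity: for a random hash function, the probability that two sets share the same minimum hash value equals their Jaccard similarity. The sketch below is a minimal, unoptimized version in Python; the function names, the use of salted BLAKE2b as the hash family, and the signature length of 256 are all assumptions made for the example.

```python
import hashlib


def qgrams(s: str, q: int = 2) -> set:
    """Set of overlapping character q-grams (bigrams by default)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def minhash_signature(items: set, num_hashes: int = 256) -> list:
    """One minimum hash value per salted hash function (hypothetical scheme)."""
    if not items:
        raise ValueError("MinHash needs a non-empty set")
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(x.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for x in items
        ))
    return signature


def estimated_jaccard_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of positions where the two signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

For example, `estimated_jaccard_similarity(minhash_signature(qgrams("Cocumber")), minhash_signature(qgrams("Cucumber")))` should land near the exact bigram similarity of 5/9, with an error that shrinks as `num_hashes` grows. The benefit on large data is that signatures are computed once per string and can then be bucketed, so not every pair has to be compared.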

score -1 · Answer 1 · answered Feb 11 '17 at 18:13

You can use the Naive Bayes algorithm:

Naive Bayes - Wikipedia

Abraham Romano Cohen

This is being automatically flagged as low quality, probably because it is so short. At present it is more of a comment than an answer by our standards. Can you expand on it? We can also turn it into a comment. – gung - Reinstate Monica Feb 11 '17 at 18:19
