
My data is similar to the following data, but far bigger and more complex.

Apple
Banana
Those fruits
Tomato
Cucumber
These vegetables

I would like to get the following result:

Those fruits
These vegetables

Using the agrep/agrepl functions in R, I got a first result. However, agrep and agrepl use the Levenshtein distance by default. An alternative would be the Jaccard distance.
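To make the comparison concrete, here is a minimal sketch of both distances, written in Python for illustration. It assumes the Jaccard distance is taken over sets of character bigrams, which is one common convention and not the only one; the function names are hypothetical. In R, base `adist` and the stringdist package cover the same measures.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion and substitution each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def qgrams(s: str, q: int = 2) -> set:
    """Set of overlapping character q-grams (bigrams by default; an assumption)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard_distance(a: str, b: str, q: int = 2) -> float:
    """1 - |A & B| / |A | B| computed on the q-gram sets of a and b."""
    set_a, set_b = qgrams(a, q), qgrams(b, q)
    if not set_a and not set_b:
        return 0.0  # two strings shorter than q are treated as identical here
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    pairs = [("Those fruits", "These vegetables"), ("Apple", "Banana")]
    for a, b in pairs:
        print(a, "|", b, "->", levenshtein(a, b), round(jaccard_distance(a, b), 3))
```

Levenshtein is sensitive to character order and string length, while the set-based Jaccard distance ignores position and repetition, so the two can rank the same candidate pairs differently.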

Jaccard distance vs Levenshtein distance: Which distance is better for fuzzy matching?

There is already a similar question: Properties of Levenshtein, N-Gram, cosine and Jaccard distance coefficients - in sentence matching. However, I would like to know which distance works best for fuzzy matching.

Extra credit: Are other distance measures (e.g. N-Gram, Cosine, Geometric, Manhattan) also useful for fuzzy matching? Implementations in R are also welcome.
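For reference, two of the measures named above can be sketched on character q-gram counts: cosine distance, and the Manhattan (L1) distance between count vectors, often called the q-gram distance. This is a Python illustration under the assumption of bigram profiles, with hypothetical function names; the stringdist package in R offers comparable `cosine` and `qgram` methods.

```python
from collections import Counter
from math import sqrt


def qgram_profile(s: str, q: int = 2) -> Counter:
    """Count of each overlapping character q-gram in s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def cosine_distance(a: str, b: str, q: int = 2) -> float:
    """1 - cosine similarity between the q-gram count vectors of a and b."""
    pa, pb = qgram_profile(a, q), qgram_profile(b, q)
    norm_a = sqrt(sum(v * v for v in pa.values()))
    norm_b = sqrt(sum(v * v for v in pb.values()))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: empty vs empty is 0, otherwise 1.
        return 0.0 if norm_a == norm_b else 1.0
    dot = sum(pa[g] * pb[g] for g in pa.keys() & pb.keys())
    return 1.0 - dot / (norm_a * norm_b)


def qgram_distance(a: str, b: str, q: int = 2) -> int:
    """Manhattan (L1) distance between the q-gram count vectors of a and b."""
    pa, pb = qgram_profile(a, q), qgram_profile(b, q)
    return sum(abs(pa[g] - pb[g]) for g in pa.keys() | pb.keys())


if __name__ == "__main__":
    a, b = "Those fruits", "These vegetables"
    print(round(cosine_distance(a, b), 3), qgram_distance(a, b))
```

Unlike the set-based Jaccard distance, both of these use q-gram counts, so repeated substrings affect the result.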

Ferdi
  • 2
    If your data are far bigger, something like LSH might be faster and also preserve the "fuzzy" property in a very specific sense. Some LSH schemes are easily demonstrated to be probabilistic Jaccard similarity. – Sycorax Oct 14 '16 at 16:12
  • 2
    In R you have the [stringdist](https://cran.r-project.org/web/packages/stringdist/) package. You might want to check that one out. – phiver Oct 15 '16 at 06:47
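The LSH remark in the comments above can be illustrated with MinHash, one scheme whose collision rate estimates Jaccard similarity. The sketch below is in Python and only builds signatures; it omits the banding step that a full LSH index would add. The bigram shingles, the 128 hash functions, and the seeded BLAKE2b hashing are assumptions made for illustration, and the function names are hypothetical.

```python
import hashlib


def shingles(s: str, q: int = 2) -> set:
    """Set of overlapping character q-grams of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def _hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of item, varied by seed."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items: set, num_hashes: int = 128) -> list:
    """For each seeded hash function, keep the minimum hash over the set.

    The set must be non-empty.
    """
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of positions where two signatures agree.

    In expectation this equals the Jaccard similarity of the underlying sets.
    """
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    sig_a = minhash_signature(shingles("Those fruits"))
    sig_b = minhash_signature(shingles("These vegetables"))
    print(estimated_jaccard(sig_a, sig_b))
```

Signatures have a fixed length regardless of string size, which is what makes this approach attractive for much larger data sets.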

1 Answer


You can use the Naive Bayes algorithm:

Naive Bayes - Wikipedia

  • 1
    This is being automatically flagged as low quality, probably because it is so short. At present it is more of a comment than an answer by our standards. Can you expand on it? We can also turn it into a comment. – gung - Reinstate Monica Feb 11 '17 at 18:19