Suppose we have a stream of sentences and need to compare each new sentence with previously received ones, for example with the sentences received in the last 30 minutes. What is the best method to do that? Can we use the Mahalanobis distance for this, and if so, how?
1 Answer
You probably can (see the answers to this question), but depending on your application, I'm not sure that you should. For whatever you're doing, is there a useful interpretation of the Mahalanobis distance between two strings?
There are other measures of string similarity which might be more appropriate. For example, there are "edit distances", which measure the number of insertions/deletions/substitutions needed to turn one string into another. I think that there are even language- and device-specific cost functions available, if you're trying to correct for typos or something. Navarro (2001) is a decent, if somewhat old, review of string-matching techniques. Note that most of these are pairwise and moderately computationally intensive. If you wanted to compare a new string to a collection of other strings, you could turn it into a nearest-neighbor problem.
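To make the nearest-neighbor idea concrete, here is a minimal sketch in Python. It uses plain Levenshtein distance (unit cost for every insertion, deletion, and substitution) and a brute-force scan over the sentences still inside the time window. The names `levenshtein` and `RecentSentences` are made up for illustration, timestamps are assumed to be seconds that arrive in non-decreasing order, and the 30-minute window comes from the question. Treat it as one possible starting point rather than a recommended design.

```python
from collections import deque


def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


class RecentSentences:
    """Hypothetical helper: keeps sentences from the last
    window_seconds and finds the closest one by edit distance."""

    def __init__(self, window_seconds: float = 30 * 60):
        self.window_seconds = window_seconds
        self._items = deque()  # (timestamp, sentence), oldest first

    def nearest(self, sentence: str, now: float):
        """Return (closest_sentence, distance) among sentences still in
        the window, or None if the window is empty, then store sentence.
        Assumes `now` never decreases between calls."""
        # Drop sentences that have fallen out of the window.
        while self._items and now - self._items[0][0] > self.window_seconds:
            self._items.popleft()
        # Brute-force nearest neighbor: one distance per stored sentence.
        scored = [(levenshtein(sentence, old), old) for _, old in self._items]
        self._items.append((now, sentence))
        if not scored:
            return None
        distance, closest = min(scored)
        return closest, distance
```

Each call costs one edit-distance computation per stored sentence, so this only scales to modest stream rates. For larger volumes you would probably want an index built for metric spaces or a cheaper pre-filter before computing exact distances.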
In general, these methods are only going to work for lexicographically similar strings: "Last night, I saw the cat" vs. "Last night, I saw the hat". If you're trying to detect strings which mean similar things ("I saw the cat last night" vs. "Last night, I saw a kitty"), then you're essentially tackling two or three unsolved problems :-) Get a good language processing textbook and good luck!
