
I have some open-ended survey data that I'm trying to recode, but the range of answers is very large (e.g. one question got responses of 'word', 'separate', 'mesabatainia', and 'abra cadabra alakazam'). For each question, I'm hoping to recode the data by clustering the responses using Levenshtein distance. Are there any clustering algorithms that would be considered common practice in situations like these, where the variation between the items being clustered is massive? Is there anything I should do differently because this particular situation is recoding? Thank you, and I appreciate any help!
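(Not part of the original question — a minimal sketch under my own assumptions.) One common pattern is to compute pairwise Levenshtein distances and then group responses with any clustering method that accepts precomputed distances (e.g. agglomerative clustering or DBSCAN with `metric="precomputed"` in scikit-learn). A pure-standard-library sketch of the distance plus a simple greedy single-link grouping, with a hypothetical threshold:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions
    needed to turn one string into the other (two-row dynamic program)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cluster(responses, threshold):
    """Greedy single-link grouping: a response joins the first cluster
    that already contains a member within `threshold` edits; otherwise
    it starts a new cluster. `threshold` is a free parameter you would
    tune for your data."""
    clusters = []
    for r in responses:
        for c in clusters:
            if any(levenshtein(r, m) <= threshold for m in c):
                c.append(r)
                break
        else:
            clusters.append([r])
    return clusters
```

For real data you would likely replace the greedy loop with hierarchical clustering on the full distance matrix, but the sketch shows the shape of the approach: string metric in, groups out.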

kjetil b halvorsen
Jodast
    Can you say more about what kind of structure clustering is meant to capture, and what you're planning to do with it? Levenshtein distance measures similarity between strings according to how easily you can transform one into the other by manipulating characters. This may be good for accounting for minor variations like plural vs. singular, typos, etc. But, it won't account for semantic similarity, which is sometimes what's needed. E.g. "world" would be similar to "word" (similar spelling but different meaning), and dissimilar to "globe" (similar meaning but different spelling) – user20160 Feb 14 '21 at 02:09
  • I'd probably say I'm trying to capture what type of answer is given, i.e. answers which are short, long, have lots of vowels or syllables, have multiple words, are gibberish vs. not, etc., and then from there cluster by similarity (like "world" and "word" would be in the same category). This particular question was "What's a word you always have a hard time spelling", for context. – Jodast Feb 14 '21 at 02:17

0 Answers