
I have a data set from a survey that asked participants which brands they can remember in the toys category.

Each participant could write one brand in each of 10 text boxes. The purpose is to reveal the brand with the greatest top-of-mind effect, without displaying any suggestions.

The problem is that the text answers are not consistent. For example: "Toy's are us" vs. "Toys r us'".

My question is: is there a plug-in for Stata that corrects similar answers based on a set of accepted answers? (Answers outside this set can be set to missing.)

All tips and answers are appreciated!

Steffen Moritz
  • See a [similar question](http://stats.stackexchange.com/q/3425/1036) for R. I wouldn't be surprised if those same routines for fuzzy matching strings are somewhere coded in Stata user contributed routines. – Andy W Oct 19 '12 at 13:59
  • Stata does have a [soundex](http://en.wikipedia.org/wiki/Soundex) function for help with stuff like this. In my experience, though, one's best friend is a string processing oriented tool like PERL or AWK, rather than the limited data cleaning capabilities built into statistical systems. (John Chambers, one of the originators of the `S` system, advocates PERL for such uses.) – whuber Oct 19 '12 at 14:30
  • Many thanks to both of you :) I'm looking for something that people with limited programming experience can use, so I think Google Refine might be the solution to this. Maybe I'll learn Perl after C++. The Stata programming language is based on Perl, right? – Ole Henrik Skogstrøm Oct 19 '12 at 19:53

1 Answer


In Stata, there's a user-written command called `strgroup` that's pretty good at this. It uses Levenshtein distances, and it's available from SSC. There is also Google Refine, which is a non-Stata solution but works very well and is free.
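To give a feel for what a Levenshtein distance measures, here is a minimal Python sketch (not Stata, and not `strgroup`'s actual code). The function name `levenshtein` is just an illustrative choice, and the example strings are the question's two spellings after lowercasing and removing the apostrophes by hand.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


# The two spellings differ only by the letters "a" and "e".
print(levenshtein("toys are us", "toys r us"))  # prints 2
```

A small distance relative to the string length suggests that two answers are variants of the same brand.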

dimitriy
  • Do you know of any good user guides? The one provided by the developer is... well... nonexistent. All these solutions (except Google Refine) seem to be focused on merging two variables. I just need to compare a cell with a group of accepted answers. Got any tips? :) And thank you! – Ole Henrik Skogstrøm Oct 19 '12 at 20:04
  • GR has a series of videos on reconciliation. Start with one data set that contains the variants and another dataset that has the accepted spellings. It may help if you remove punctuation. Apply strgroup, and fiddle with the thresholds. Use the results to recode your data. This will still require a lot of manual work. – dimitriy Oct 19 '12 at 21:49
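
The workflow in the last comment (strip punctuation, compare each answer with the accepted spellings, apply a threshold, recode) could be sketched roughly as follows in Python. This is only an illustration under assumptions: the `ACCEPTED` list, the helper names `normalize` and `recode`, and the 0.8 cutoff are all made up here, and the similarity score comes from Python's standard `difflib` module rather than from `strgroup` or a Levenshtein distance.

```python
import difflib
import string

# Hypothetical list of accepted spellings, already lowercased and unpunctuated.
ACCEPTED = ["toys r us", "lego", "mattel"]


def normalize(text):
    """Lowercase, remove ASCII punctuation, and collapse repeated spaces."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def recode(answer, accepted=ACCEPTED, cutoff=0.8):
    """Return the closest accepted spelling, or None (missing) if no
    accepted spelling reaches the similarity cutoff."""
    matches = difflib.get_close_matches(
        normalize(answer), accepted, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


for raw in ["Toy's are us", "Toys r us'", "banana"]:
    print(repr(raw), "->", recode(raw))
```

As the comment says, the cutoff needs fiddling: too low and different brands get merged, too high and real variants are set to missing, so the recoded values should still be checked by hand.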