
I have four databases of books that I have assembled from various sources, websites, etc. I would like to merge the databases, but I face a significant merging problem: there is no "perfect" match ID shared among the databases. Each database has the title and publication date, but neither is recorded consistently. For example, I might have the following entries for title and publication date:

  • (1) The Catcher and the Rye, 7/16/51
  • (2) The Catcher & the Rye, 7/16/51
  • (3) Catcher and the Rye, 1951
  • (4) The Catcher and the Rye (1951), [missing]

So I have tried things like removing common words, spaces, and other non-letter characters; using only the first 15 characters of the title; and using only the publication year (or month and year). But I don't think I have a comprehensive solution or approach to getting the best match.
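To make that concrete, here is roughly the kind of match key I have been trying to build, sketched in Python (the stopword list, the function names, and the 20th-century assumption for two-digit years are purely illustrative):

```python
import re

# Words dropped before building the key; this list is purely illustrative.
STOPWORDS = {"the", "a", "an", "and", "of"}

def extract_year(title, date):
    """Pull a four-digit year from the date field if possible, falling
    back to a trailing two-digit year or to a year inside the title."""
    m = re.search(r"(19|20)\d{2}", str(date))
    if m:
        return m.group(0)
    m = re.search(r"/(\d{2})\s*$", str(date))
    if m:
        return "19" + m.group(1)  # assumes a 20th-century date
    m = re.search(r"(19|20)\d{2}", title)
    return m.group(0) if m else None

def match_key(title, date=None):
    """Crude blocking key: lowercase, normalize '&', drop stopwords and
    non-letters, keep the first 15 characters plus the year."""
    words = re.findall(r"[a-z]+", title.lower().replace("&", " and "))
    core = "".join(w for w in words if w not in STOPWORDS)[:15]
    return (core, extract_year(title, date))
```

All four example entries above collapse to the same key, `("catcherrye", "1951")`, though a key this aggressive will also create some false matches.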

Does anybody have any suggestions on approaches, software, an algorithm I can follow or look up, etc., to help me get the best match possible? (The databases range from 9,000 to 15,000 observations, so doing it manually isn't really an option.)

I work primarily in Stata, but I have basic knowledge of R and Python if that guides any responses.

kjetil b halvorsen
user12798
  • A lot can be said about this, but to get you started, note that Stata has a `soundex` function. – whuber Jul 23 '12 at 14:29
  • 1
    Google has the best search engine in the world. I bet they came up with good solutions to your problem. But some of the best tricks may be proprietary. Now they don't have the exact same problem that you have because they can make long lists. But they do order them based on algorithms like the what you are looking for. With Google, Yahoo and so many other search engines out there I am sure there must be a vast literature in open journals that you could explore. – Michael R. Chernick Jul 23 '12 at 14:52
  • Some papers on this using SAS (but the methods may be transferable): [Jones and Zdeb](http://www.nesug.org/proceedings/nesug08/cc/cc07.pdf), [Heath et al.](http://analytics.ncsu.edu/sesug/2008/CC-028.pdf), [Sambidi et al.](http://www.scsug.org/SCSUGProceedings/2008/papers/app/Pramod_Sambidi.pdf) – Peter Flom Jul 23 '12 at 15:26
  • This strikes me as a text-mining issue, and CV also includes data mining among its appropriate topics. However, if you don't get answers that are usable for you after a while, you might ask the moderators if this question could be migrated to [Stack Overflow](http://stackoverflow.com/) (for general programming) or [dba](http://dba.stackexchange.com/) (database specific) for alternative sources of help. Good luck! – gung - Reinstate Monica Jul 23 '12 at 15:59
  • Thanks to all who provided answers and suggestions. I have a couple of routes that I am exploring based on these great responses. Thanks again. – user12798 Oct 17 '12 at 20:35

3 Answers


I would try two things. First, there is a user-written Stata command called `strgroup`, which you can install with `ssc install strgroup`. It uses Levenshtein distances to do fuzzy string matching.
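For reference, the Levenshtein distance that `strgroup` relies on is the classic dynamic-programming edit distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A minimal Python sketch of the measure itself (not of `strgroup`):

```python
def levenshtein(a, b):
    """Edit distance via the standard dynamic program,
    keeping only the previous row of the table to save memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute ca -> cb
        prev = cur
    return prev[-1]
```

For the question's examples, `levenshtein("The Catcher & the Rye", "The Catcher and the Rye")` is 3, so grouping titles whose distance is small relative to their length would pull those entries together.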

If that does not work, I would look into a free program called Google Refine. It's pretty easy to use once you watch the instructional video.

dimitriy

There is a user-written Stata command called `reclink`. It can perform record linkage between two files using a standard probabilistic matching scheme.

I've been wanting to rewrite the program in Mata (to speed it up) and to add various features, but it works OK for probabilistic merging. Given your example, it should work just fine after some standardization of your fields (e.g., convert the date variable to year/month/day, clean up "&", maybe add fields with phonetic encodings of the book title). Programs like Link King, LinkageWiz, and others can do a lot of this cleanup automatically.
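To illustrate the phonetic-field idea (Stata's `soundex` function, mentioned in the comments, does something similar), here is a sketch of the standard American Soundex algorithm in Python; this is my own illustration, not `reclink` code:

```python
def soundex(word):
    """American Soundex: first letter plus three digits coding the
    remaining consonants; H/W do not break a run of equal codes,
    while vowels do."""
    word = word.upper()
    groups = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3",
              "L": "4", "MN": "5", "R": "6"}
    digit = {c: d for letters, d in groups.items() for c in letters}
    first = word[0]
    out = []
    prev = digit.get(first, "")     # the first letter's code also counts
    for c in word[1:]:
        if c in "HW":
            continue                # H and W do not separate equal codes
        d = digit.get(c, "")
        if d and d != prev:
            out.append(d)
        prev = d                    # a vowel resets the previous code
    return first + ("".join(out) + "000")[:3]
```

Encoding each word of a title this way makes, say, "Catcher" and a misspelled "Catchar" map to the same code, which gives a probabilistic matcher a field that is robust to spelling noise.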

Keith

Despite the age of this post, I would like to share something that made my life easier.

I usually start by following Stata tip 64: Cleaning up user-entered string variables (see here) to eliminate small inconsistencies: removing leading, trailing, and repeated internal spaces (`trim` and `itrim`); removing uppercase/lowercase variation (`upper`, `lower`, `proper`); and eliminating other common differences that arise from hyphens, accented characters, and ampersands, as in "The Catcher and the Rye" versus "The Catcher & the Rye" (`subinstr`). This seems to increase the effectiveness of `strgroup`, which uses the aforementioned Levenshtein edit distance to match similar strings, since that distance is case sensitive.
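Since the OP also knows some Python, a rough analogue of those cleanup steps (strip-and-collapse whitespace for `trim`/`itrim`, lowercase, strip accents via Unicode decomposition, normalize hyphens and ampersands) could look like this; the function is my own sketch, not part of the Stata tip:

```python
import re
import unicodedata

def clean(s):
    """Apply the cleanup steps described above to one string."""
    s = unicodedata.normalize("NFKD", s)                       # split accents off base letters
    s = "".join(c for c in s if not unicodedata.combining(c))  # drop the accent marks
    s = s.lower().replace("&", " and ").replace("-", " ")
    return re.sub(r"\s+", " ", s).strip()                      # trim/itrim analogue
```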

Regards!

Robson