I have four databases of books that I have assembled from various sources (websites, etc.). I would like to merge them, but there is no shared ID that matches records exactly across the databases. Each database has a title and a publication date, but neither field is recorded consistently. For example, the same book might appear with the following title and publication date entries:
- (1) The Catcher and the Rye, 7/16/51
- (2) The Catcher & the Rye, 7/16/51
- (3) Catcher and the Rye, 1951
- (4) The Catcher and the Rye (1951), [missing]
So far I have tried removing common words, spaces, and other non-letter characters, then matching on only the first 15 characters of the title plus the publication year (or month and year). I don't think this amounts to a comprehensive approach, and I doubt it gives me the best possible match.
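To make that concrete, here is a rough Python sketch of the cleaning steps described above. It is illustrative only: the function names, the stop-word list, the handling of "&", and the assumption that two-digit years belong to the 1900s are all placeholders, not my actual code.

```python
import re

# Assumed list of "common words" to drop; the real list may differ.
STOPWORDS = {"the", "a", "an", "and", "of"}

def title_key(title, length=15):
    """Build a crude match key from a title."""
    t = title.lower().replace("&", " and ")
    t = re.sub(r"\(\d{4}\)", " ", t)       # drop a parenthesized year such as "(1951)"
    words = re.findall(r"[a-z]+", t)       # keep runs of letters only
    kept = [w for w in words if w not in STOPWORDS]
    return "".join(kept)[:length]          # no spaces, first `length` characters

def pub_year(date, century=1900):
    """Extract a year from strings like '7/16/51' or '1951'; None if absent."""
    if not date:
        return None
    m = re.search(r"\b(\d{4})\b", date)
    if m:
        return int(m.group(1))
    m = re.search(r"/(\d{2})$", date.strip())
    if m:
        return century + int(m.group(1))   # assumption: two-digit years are 19xx
    return None

rows = [
    ("The Catcher and the Rye", "7/16/51"),
    ("The Catcher & the Rye", "7/16/51"),
    ("Catcher and the Rye", "1951"),
    ("The Catcher and the Rye (1951)", None),
]
for title, date in rows:
    print(title_key(title), pub_year(date))
```

With this kind of key the four example titles collapse to the same string, but entry (4) still has no year to match on, and any spelling difference in the title would break the exact match.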
Does anybody have suggestions for approaches, software, or an algorithm I can follow or look up to get the best match possible? (The databases range from 9,000 to 15,000 observations each, so matching manually isn't really an option.)
I work primarily in Stata, but I have basic knowledge of R and Python if that guides any responses.