I am interested in linking records across two datasets by first name, last name, and birth year. Is this doable with the EM algorithm, and if so, how?
Consider the following record in the first dataset as an example: Carl McCarthy, 1967. I will search through all records in the second dataset and, for each one, compute a Jaro-Winkler score between its first name and Carl and a Jaro-Winkler score between its last name and McCarthy. I treat these scores as probabilities, and likewise a score for the difference between the birth years. We then combine those three probabilities (multiply? average?) into one.
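To make the multiply-versus-average question concrete, here is a minimal sketch. It is in Python only for illustration (I would still prefer Stata or Perl), the function names are my own, and the three scores are made-up numbers rather than real Jaro-Winkler output. As far as I understand, multiplying amounts to treating the three fields as independent.

```python
from math import prod

def combine_product(scores):
    """Multiply the per-field scores (treats the fields as independent)."""
    return prod(scores)

def combine_mean(scores):
    """Take the plain average of the per-field scores."""
    return sum(scores) / len(scores)

# Hypothetical scores for (first name, last name, birth year).
good_candidate = [0.96, 0.90, 1.0]
bad_birth_year = [0.96, 0.90, 0.2]

for label, scores in [("good", good_candidate), ("bad year", bad_birth_year)]:
    print(label, round(combine_product(scores), 4), round(combine_mean(scores), 4))
```

With these numbers, one poorly matching field pulls the product down much further than it pulls the average down, so the two rules can rank candidates differently.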
Now comes the decision rule. Rank the combined probabilities for all candidate records from highest to lowest. First, we want P(first hit is match) >= threshold. Second, if a second hit exists, we also want P(first hit is match) / P(second hit is match) >= threshold. Third, we want the first hit in the second dataset to be matched to no more than one person in the first dataset (here, Carl McCarthy, 1967).
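The three conditions above can be sketched as follows. This is only my reading of the rule, again in Python for illustration; the function names, the record IDs, the scores, and the threshold values 0.85 and 1.2 are all placeholders I made up, and choosing the thresholds is exactly what I do not know how to do.

```python
from collections import Counter

def pick_match(scores, min_score=0.85, min_ratio=1.2):
    """Apply conditions 1 and 2 to one record from the first dataset.

    scores maps candidate IDs in the second dataset to combined scores.
    Returns the best candidate ID, or None if a condition fails.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None
    best_id, best = ranked[0]
    if best < min_score:                      # condition 1
        return None
    if len(ranked) > 1:
        second = ranked[1][1]
        if second > 0 and best / second < min_ratio:   # condition 2
            return None
    return best_id

def link(all_scores, min_score=0.85, min_ratio=1.2):
    """Apply condition 3: keep a hit only if exactly one record claims it."""
    proposed = {r1: pick_match(c, min_score, min_ratio)
                for r1, c in all_scores.items()}
    claims = Counter(r2 for r2 in proposed.values() if r2 is not None)
    return {r1: r2 for r1, r2 in proposed.items()
            if r2 is not None and claims[r2] == 1}

# Hypothetical combined scores: first-dataset ID -> {second-dataset ID: score}.
all_scores = {
    "A1": {"B1": 0.95, "B2": 0.60},   # passes all three conditions
    "A2": {"B3": 0.90, "B4": 0.88},   # fails condition 2 (hits too close)
    "A3": {"B5": 0.70},               # fails condition 1 (score too low)
    "A4": {"B6": 0.97},               # A4 and A5 both claim B6,
    "A5": {"B6": 0.93},               # so both fail condition 3
}
print(link(all_scores))
```

Here condition 3 drops every record that shares a top hit; keeping the higher-scoring claimant instead would be another possible reading.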
How may these thresholds be determined?
I prefer approaches in Stata and/or Perl.
See, for example:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1479910/pdf/amia2003_0259.pdf
(Even after reading that paper, I still do not fully follow why or how the method works, what its inputs and outputs are, or what it assumes and how restrictive those assumptions are.)