10

I am interested in linking records across 2 datasets by first name, last name, and birth year. Might this be doable with the EM algorithm, and if so, how?

Consider the following record in the 1st as an example: Carl McCarthy,1967. I will search through all records in the 2nd dataset, and assign a jaro-winkler distance between the 1st name and Carl and a jaro-winkler distance between the last name and McCarthy. These distance are probabilistic as is the distance between the birth years. We combine those 3 probabilities (multiply? average?) into 1.

Now comes the decision rule part. Let us rank all of the probabilities from highest to lowest. First, we want P(first hit is match) >= threshold. Second, we also want P(first hit is match) / P(second hit is match) >= threshold if P(second hit is match) exists. Third, we want the first hit in this second dataset to match for no more than 1 person in the 1st dataset with Carl McCarthy,1967.

How may these thresholds be determined?

I prefer approaches in Stata and/or Perl.

See, for example:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1479910/pdf/amia2003_0259.pdf

(Although with that, I still do not fully follow the why or how, and what the inputs and outputs are, as well as the assumptions and how restrictive they are).

kjetil b halvorsen
  • 63,378
  • 26
  • 142
  • 467
user1690130
  • 755
  • 2
  • 11
  • 23

3 Answers3

4

Absolutely, the EM algorithm has been used for probabilistic linking. There are a lot of articles on the subject, the following by Winkler may be helpful regarding theoretical details:

http://www.census.gov.edgekey.net/srd/papers/pdf/rr2000-05.pdf

Also there is data linking software developed by Kevin Campbell already available here:

http://the-link-king.com/

The software can be freely downloaded & Kevin Campbell offers support for a fee. The code is written in SAS, so you'll need the base SAS package.

RobertF
  • 4,380
  • 6
  • 29
  • 46
  • Thank you! I have read 2 papers by Winkler but did not fully understand them. I gathered EM from that paper. Also, I do not know how to use SAS. I know perl has an EM module, which I would use, but I'm not sure why EM is appropriate or how to use it. Conceptually, how does EM answer the above questions? – user1690130 Feb 21 '13 at 16:41
  • My understanding is that the EM algorithm is useful for modeling the likelihood of a positive match because it takes into account the unknown (or "latent") probabilities of incorrectly linking two different records or incorrectly not linking two matching records. Estimates of these probabilities are refined during each step of the algorithm in order to maximize the likelihood function. – RobertF Feb 21 '13 at 17:26
  • What inputs do I provide? The univariate prob and a label? And it spits out the optimal match? – user1690130 Feb 21 '13 at 18:27
0

There is a software RELAIS that does record linkage with:

6) Probabilistic record linkage (Estimation of the Fellegi and Sunter model parameters via EM (Expectation-Maximization).

RELAIS has been implemented in Java and R and has a database architecture (MySQL).

There are some more documentation about record linkage available from the ESSnet Data Integration project.

djhurio
  • 658
  • 4
  • 16
0

We have recently published Splink, a Python/Spark library for record linkage, which includes an implementation of the EM algorithm.

You can try it in your browser here. There's more information on the underlying theory here

RobinL
  • 175
  • 7