Tamr (previously Data Tamer) does database deduplication at scale. Naive Bayes and graph clustering are involved.
I believe the algorithms are largely implemented in SQL, which is somewhat odd, but the primary author of their whitepaper is Michael Stonebraker, who helped lead the creation of PostgreSQL.
Check out the whitepaper here.
Edit: I've summarized the steps their paper takes below. Some of my words are nearly the same as in their paper.
Tamr's deduplication system has two main steps for dealing with a new data source:
(1) Attribute Identification and (2) Entity Consolidation. These are roughly equivalent to column deduplication and row deduplication.
1) Comparing a new data source to an existing one, the first step is Attribute Identification.
The attributes (columns) of the new source are mapped to the attributes of the existing source with four algorithms:
- Compare attribute names with fuzzy string comparison (trigram cosine similarity); a sketch of this follows the list.
- Treat an entire column as a document: tokenize it, then measure the term frequency/inverse document frequency (TF-IDF) cosine similarity between it and the existing columns.
- Minimum description length (MDL): compare two columns based on the sizes of their intersection and union under exact matching.
- For numeric columns, perform a t-test between the new column and existing numeric columns to determine whether they come from the same distribution.
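
To make the first comparison concrete, here is a minimal sketch of trigram cosine similarity over attribute names. The function names and the exact tokenization (lowercasing, counting overlapping 3-character substrings) are my own choices; the paper doesn't spell out these details.

    from collections import Counter
    import math

    def trigrams(name):
        """Count overlapping 3-character substrings of a lowercased attribute name."""
        name = name.lower()
        return Counter(name[i:i + 3] for i in range(len(name) - 2))

    def trigram_cosine(a, b):
        """Cosine similarity between the trigram count vectors of two attribute names."""
        ta, tb = trigrams(a), trigrams(b)
        dot = sum(ta[g] * tb[g] for g in ta)
        norm_a = math.sqrt(sum(v * v for v in ta.values()))
        norm_b = math.sqrt(sum(v * v for v in tb.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    print(trigram_cosine("base_elevation", "elevation_base"))  # high: many shared trigrams
    print(trigram_cosine("base_elevation", "snowfall"))        # near zero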
2) Entity Consolidation (Row Deduplication)
Once Attribute Identification has been performed, we want to deduplicate the rows (records).
Categorization with clustering
Records are first grouped into categories based on similarity, and then deduplication rules are learned at the category level. The example they give of categorization is for a database of ski resorts, where western ski resorts should be a different category from eastern ski resorts, since features such as base elevation are strongly separated by whether the resort is east or west. The categorization is done with a clustering algorithm, with k-means given as an example.
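
As a rough illustration of that categorization step (not the paper's actual code), here is what clustering ski-resort records into categories with k-means might look like. The features, the invented values, and the choice of k=2 are all mine.

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical numeric features per record: [base elevation (ft), annual snowfall (in)].
    # The values are made up purely to illustrate the east/west split.
    records = np.array([
        [9600, 300], [8100, 350], [10200, 280],   # high-elevation "western" resorts
        [1200, 180], [900, 150],  [2000, 200],    # low-elevation "eastern" resorts
    ])

    # k is chosen by hand here; the paper gives k-means only as an example and
    # doesn't say how the number of categories is picked.
    categories = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(records)
    print(categories)  # e.g. [1 1 1 0 0 0] -- records grouped into two categories

In practice the features would presumably be scaled before clustering; the point is only that records get grouped into categories before any deduplication rules are learned.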
Deduplicating with Naive Bayes
Once attributes are identified and records have been clustered into categories, we learn the deduplication rules for each category based on a training set of dupes and non-dupes.
There are two types of deduplication rules:
- Thresholds for attribute similarities with respect to a distance function that makes sense for the attribute. (The paper is not clear on how these thresholds are learned.)
- Probability distributions of attribute similarity for dupes and non-dupes, for each attribute,
e.g.
P("Title" values similar | duplicate) ~ 1
and
P("State" values are different | duplicate) ~ 0
For each pair of records, we compute the similarity of each of their attributes with respect to an appropriate distance metric. If any attribute's similarity exceeds its threshold, the pair of records is fed through a Naive Bayes classifier to be classified as dupe or non-dupe.
My assumption is that for records X1 = (a1, b1, c1, d1) and X2 = (a2, b2, c2, d2), they compute a similarity vector S = (s_a, s_b, s_c, s_d), where s_i is the similarity for that attribute with respect to the appropriate distance metric.
I assume their Naive Bayes classifier has this structure:
P(dupe|S) = P(dupe) P(s_a|dupe) P(s_b|dupe) P(s_c|dupe) P(s_d|dupe) / P(S)
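
Under that assumed structure, scoring a pair might look like the sketch below. The likelihood models, the threshold gate, and the prior are all placeholders of mine; the paper doesn't give this level of detail.

    import math

    def naive_bayes_dupe_probability(S, likelihoods, prior_dupe=0.5):
        """Compute P(dupe|S) for a similarity vector S = {attr: s_i}, assuming
        P(dupe|S) is proportional to P(dupe) * product_i P(s_i|dupe), normalized
        against the non-dupe class. likelihoods[label][attr] is a callable mapping
        a similarity score to a probability (how these are modeled is my guess)."""
        log_scores = {}
        for label, prior in ((True, prior_dupe), (False, 1.0 - prior_dupe)):
            total = math.log(prior)
            for attr, s in S.items():
                total += math.log(max(likelihoods[label][attr](s), 1e-12))
            log_scores[label] = total
        score_dupe = math.exp(log_scores[True])
        return score_dupe / (score_dupe + math.exp(log_scores[False]))

    # The threshold gate from above: only pairs where at least one attribute clears
    # its learned threshold get scored by the classifier, e.g.
    # if any(S[attr] >= thresholds[attr] for attr in S):
    #     p = naive_bayes_dupe_probability(S, likelihoods)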
Entity resolution with graph clustering
After the classification step, we have a subset of records from a given category that are believed to be pairwise duplicates. These now need to be resolved into distinct entities. This addresses a transitivity problem: if record t1 is a dupe of t2 and t2 is a dupe of t3, then t1 must also be a dupe of t3, which is to say t1, t2, and t3 represent the same entity.
A graph structure is used for this step. Within the category, each record that may be a dupe is a node. Nodes that are suspected of being dupes of each other have edges between them. Clusters are then discovered in the graph and merged together based on thresholds concerning how strongly one cluster is connected to another. Here are three examples of cluster pairs that might or might not be merged together based on their connectedness:

  c1        c2
x-x-x-----y-y-y
|\|/|     |\|/|
x-x-x-----y-y-y    Meets similarity threshold
|/|\|     |/|\|
x-x-x-----y-y-y

x-x-x     y-y-y
|\|/|     |\|/|
x-x-x-----y-y-y    Does not meet similarity threshold
|/|\|     |/|\|
x-x-x     y-y-y

x     y
|     |
x-----y    Meets similarity threshold
|     |
x     y
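
The simplest version of that resolution step, ignoring the threshold-based merging of clusters, is just to take connected components over the suspected-dupe edges. A minimal sketch (record ids and pairs are made up):

    import networkx as nx

    # Pairs the classifier flagged as likely dupes within one category.
    dupe_pairs = [("r1", "r2"), ("r2", "r3"), ("r7", "r8")]

    G = nx.Graph()
    G.add_edges_from(dupe_pairs)

    # Each connected component becomes one candidate entity. The paper's algorithm
    # goes further, merging clusters only when they are connected strongly enough,
    # as in the diagrams above.
    entities = [sorted(component) for component in nx.connected_components(G)]
    print(entities)  # [['r1', 'r2', 'r3'], ['r7', 'r8']]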
When the algorithm terminates, each cluster should represent a distinct entity within the category. To complete the process, the attributes of this entity must be determined from the attributes of the records within it. Nulls are discarded first, then methods including frequency, average, median, and longest are used.
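
A sketch of that final consolidation, picking a single value per attribute for the merged entity; the method names are my own labels for the strategies the paper lists:

    from collections import Counter
    from statistics import median

    def consolidate(values, method="most_frequent"):
        """Pick one attribute value for an entity from its member records.
        Nulls (None) are discarded first, as described above."""
        vals = [v for v in values if v is not None]
        if not vals:
            return None
        if method == "most_frequent":
            return Counter(vals).most_common(1)[0][0]
        if method == "average":
            return sum(vals) / len(vals)
        if method == "median":
            return median(vals)
        if method == "longest":
            return max(vals, key=len)
        raise ValueError(f"unknown method: {method}")

    print(consolidate(["Vail", "Vail", "Vail Resort", None]))   # 'Vail'
    print(consolidate([8120, 8150, None], method="median"))     # 8135.0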
The paper also develops methods for bringing in domain experts when the algorithms are unsure, and for combining the answers of multiple experts with differing levels of expertise.