Iterative record linkage for cleaning and integration
- 13 June 2004
- conference paper
- Published by Association for Computing Machinery (ACM)
Abstract
Record linkage, the problem of determining when two records refer to the same entity, has applications both in data cleaning (deduplication) and in integrating data from multiple sources. Traditional approaches use a similarity measure that compares tuples' attribute values; tuples with similarity scores above a certain threshold are declared to be matches. While this method can perform quite well in many domains, particularly those without much noise in the data, in some domains looking only at tuple values is not enough. By also examining the context of the tuple, i.e., the other tuples to which it is linked, we can reach a more accurate linkage decision. But this additional accuracy comes at a price: to find all duplicates correctly, we may need to make multiple passes over the data, because linkages that are discovered may in turn allow us to discover additional linkages. We present results that illustrate the power and feasibility of making use of join information when comparing records.
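The idea in the abstract can be illustrated with a small sketch. It is not the paper's algorithm: the additive score, the `context_weight` and `threshold` parameters, the use of `difflib.SequenceMatcher` as the attribute similarity, the union-find merging, and the toy records are all assumptions chosen for illustration. Each record has a name and a set of linked record ids; two records are matched when their name similarity plus a bonus for shared (already resolved) neighbours reaches the threshold, and passes repeat until no new match appears.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find(parent, x):
    """Return the cluster representative of x (union-find with path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def attribute_similarity(a, b):
    """Assumed attribute measure: case-insensitive string ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def context_similarity(links_a, links_b, parent):
    """Jaccard overlap of the clusters that the linked records currently belong to."""
    ca = {find(parent, r) for r in links_a}
    cb = {find(parent, r) for r in links_b}
    if not ca or not cb:
        return 0.0
    return len(ca & cb) / len(ca | cb)


def iterative_linkage(names, links, threshold=0.9, context_weight=0.3):
    """Repeat full passes over all record pairs until no new match is found.

    Returns (clusters, passes). The scoring rule is a hypothetical choice:
    score = attribute similarity + context_weight * context similarity.
    """
    parent = {r: r for r in names}
    passes = 0
    changed = True
    while changed:
        changed = False
        passes += 1
        for a, b in combinations(sorted(names), 2):
            ra, rb = find(parent, a), find(parent, b)
            if ra == rb:
                continue  # already in the same cluster
            score = (attribute_similarity(names[a], names[b])
                     + context_weight * context_similarity(links[a], links[b], parent))
            if score >= threshold:
                parent[rb] = ra  # merge the two clusters
                changed = True
    clusters = {}
    for r in names:
        clusters.setdefault(find(parent, r), set()).add(r)
    return sorted(sorted(c) for c in clusters.values()), passes


# Toy data (invented): two author records, each linked to one co-author record.
names = {"a1": "J. Smith", "a2": "John Smith", "b1": "Mary Jones", "b2": "Mary Jones"}
links = {"a1": {"b1"}, "a2": {"b2"}, "b1": {"a1"}, "b2": {"a2"}}

clusters, passes = iterative_linkage(names, links)
print(clusters)  # [['a1', 'a2'], ['b1', 'b2']]
```

In this toy data the two "Mary Jones" records match on attribute values alone in the first pass. Only then do "J. Smith" and "John Smith" share a resolved neighbour, so they match in the second pass; a third pass confirms that nothing else changes. With `context_weight=0.0` the two Smith records stay separate, which is the attribute-only behaviour the abstract describes as insufficient in noisy domains.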