Schema Matching Using Duplicates
- 19 April 2005
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
Abstract
Most data integration applications require a matching between the schemas of the respective data sets. We show how the existence of duplicates within these data sets can be exploited to automatically identify matching attributes. We describe an algorithm that first discovers duplicates among data sets with unaligned schemas and then uses these duplicates to perform schema matching between schemas with opaque column names. Discovering duplicates among data sets with unaligned schemas is more difficult than in the usual setting, because it is not clear which fields in one object should be compared with which fields in the other. We have developed a new algorithm that efficiently finds the most likely duplicates in such a setting. Our schema matching algorithm then identifies corresponding attributes by comparing data values within those duplicate records. An experimental study on real-world data shows the effectiveness of this approach.
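The second step described in the abstract, matching attributes by comparing values inside duplicate records, can be illustrated with a minimal sketch. This is not the authors' algorithm: the function names, the use of `difflib.SequenceMatcher` as the field similarity, and the exhaustive search over one-to-one assignments are all assumptions chosen for brevity. The sketch also assumes that duplicate pairs have already been found and that the left table has no more columns than the right one.

```python
from difflib import SequenceMatcher
from itertools import permutations


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two field values (illustrative choice)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_attributes(duplicate_pairs):
    """Map left column indexes to right column indexes.

    duplicate_pairs: list of (left_record, right_record) tuples of strings,
    where each pair is assumed to describe the same real-world object.
    Assumes len(left_record) <= len(right_record).
    """
    n_left = len(duplicate_pairs[0][0])
    n_right = len(duplicate_pairs[0][1])

    # Average field-to-field similarity over all duplicate pairs.
    scores = [[0.0] * n_right for _ in range(n_left)]
    for left, right in duplicate_pairs:
        for i, left_value in enumerate(left):
            for j, right_value in enumerate(right):
                scores[i][j] += (
                    field_similarity(left_value, right_value)
                    / len(duplicate_pairs)
                )

    # Exhaustive search for the best one-to-one assignment.
    # Only practical for a handful of columns.
    best = max(
        permutations(range(n_right), n_left),
        key=lambda perm: sum(scores[i][j] for i, j in enumerate(perm)),
    )
    return dict(enumerate(best))


# Two duplicate pairs drawn from tables whose column names are unknown.
pairs = [
    (("Anna Schmidt", "Berlin", "1975"), ("1975", "Schmidt, Anna", "Berlin")),
    (("John Miller", "Boston", "1982"), ("1982", "Miller, John", "Boston")),
]
print(match_attributes(pairs))  # {0: 1, 1: 2, 2: 0}
```

Averaging over several duplicates makes the mapping less sensitive to a single noisy record. For wider tables, the exhaustive search would need to be replaced by a proper maximum-weight bipartite matching routine.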