DogmatiX tracks down duplicates in XML
- 14 June 2005
- conference paper
- Published by Association for Computing Machinery (ACM)
- p. 431-442
- https://doi.org/10.1145/1066157.1066207
Abstract
Duplicate detection is the problem of detecting different entries in a data source representing the same real-world entity. While research abounds in the realm of duplicate detection in relational data, there is yet little work for duplicates in other, more complex data models, such as XML. In this paper, we present a generalized framework for duplicate detection, dividing the problem into three components: candidate definition defining which objects are to be compared, duplicate definition defining when two duplicate candidates are in fact duplicates, and duplicate detection specifying how to efficiently find those duplicates. Using this framework, we propose an XML duplicate detection method, DogmatiX, which compares XML elements based not only on their direct data values, but also on the similarity of their parents, children, structure, etc. We propose heuristics to determine which of these to choose, as well as a similarity measure specifically geared towards the XML data model. An evaluation of our algorithm using several heuristics validates our approach.
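The three components named in the abstract can be illustrated with a minimal Python sketch. It is not the DogmatiX algorithm: the sample XML, the function names (`candidates`, `tokens`, `similarity`, `is_duplicate`, `detect`), the 0.7 threshold, and the token-overlap (Jaccard) measure are all assumptions chosen for illustration. The sketch describes each element only by its own text and its children's text, and it compares every pair naively rather than efficiently.

```python
import re
import xml.etree.ElementTree as ET
from itertools import combinations

# Hypothetical sample data, not taken from the paper.
XML = """
<catalog>
  <movie><title>The Matrix</title><year>1999</year></movie>
  <movie><title>Matrix, The</title><year>1999</year></movie>
  <movie><title>Signs</title><year>2002</year></movie>
</catalog>
""".strip()


def candidates(root, tag):
    """Candidate definition: which elements are compared (here, all with one tag)."""
    return list(root.iter(tag))


def tokens(elem):
    """Describe an element by the lowercase words in its own and its children's text."""
    texts = [elem.text or ""] + [child.text or "" for child in elem]
    return set(re.findall(r"\w+", " ".join(texts).lower()))


def similarity(a, b):
    """Jaccard overlap of the two token sets (a stand-in similarity measure)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def is_duplicate(a, b, threshold=0.7):
    """Duplicate definition: two candidates are duplicates above an assumed threshold."""
    return similarity(a, b) >= threshold


def detect(root, tag, threshold=0.7):
    """Duplicate detection: naive all-pairs comparison, returning index pairs."""
    cands = candidates(root, tag)
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(cands), 2)
        if is_duplicate(a, b, threshold)
    ]


root = ET.fromstring(XML)
pairs = detect(root, "movie")
print(pairs)
```

Keeping the three functions separate mirrors the framework's division: the candidate selection, the similarity measure, and the comparison strategy can each be swapped out without touching the other two.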
This publication has 7 references indexed in Scilit:
- Efficient Similarity Search for Hierarchical Data in Large Databases. Lecture Notes in Computer Science, 2004
- Finding similar identities among objects from multiple web sources. Published by Association for Computing Machinery (ACM), 2003
- Adaptive duplicate detection using learnable string similarity measures. Published by Association for Computing Machinery (ACM), 2003
- Interactive deduplication using active learning. Published by Association for Computing Machinery (ACM), 2002
- Approximate XML joins. Published by Association for Computing Machinery (ACM), 2002
- Probabilistic linkage of large public health data files. Statistics in Medicine, 1995
- The merge/purge problem for large databases. Published by Association for Computing Machinery (ACM), 1995