DogmatiX tracks down duplicates in XML

14 June 2005

conference paper
Published by Association for Computing Machinery (ACM)

p. 431-442
https://doi.org/10.1145/1066157.1066207

Abstract

Duplicate detection is the problem of detecting different entries in a data source representing the same real-world entity. While research abounds in the realm of duplicate detection in relational data, there is yet little work for duplicates in other, more complex data models, such as XML. In this paper, we present a generalized framework for duplicate detection, dividing the problem into three components: candidate definition defining which objects are to be compared, duplicate definition defining when two duplicate candidates are in fact duplicates, and duplicate detection specifying how to efficiently find those duplicates. Using this framework, we propose an XML duplicate detection method, DogmatiX, which compares XML elements based not only on their direct data values, but also on the similarity of their parents, children, structure, etc. We propose heuristics to determine which of these to choose, as well as a similarity measure specifically geared towards the XML data model. An evaluation of our algorithm using several heuristics validates our approach.
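
The three components named in the abstract can be illustrated with a minimal sketch. This is not the DogmatiX algorithm or its similarity measure: the function names (`candidates`, `description`, `similarity`, `find_duplicates`), the thresholds 0.8 and 0.5, and the sample movie data are all hypothetical, and the string comparison from Python's `difflib` stands in for the XML-specific measure the paper proposes. The detection step here is a naive pairwise comparison, whereas the paper's third component is concerned with doing this efficiently.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher
from itertools import combinations


def candidates(root, tag):
    """Candidate definition: which elements are to be compared (here, by tag name)."""
    return list(root.iter(tag))


def description(elem):
    """Collect (tag, normalized text) pairs from the descendants of an element."""
    pairs = []
    for node in elem.iter():
        text = (node.text or "").strip().lower()
        if node is not elem and text:
            pairs.append((node.tag, text))
    return pairs


def similarity(a, b, value_threshold=0.8):
    """Duplicate definition, part 1: share of values in `a` with a close match in `b`.

    Assumption: difflib's ratio is used only as a placeholder string measure.
    """
    da, db = description(a), description(b)
    if not da or not db:
        return 0.0
    matched = 0
    for tag, value in da:
        best = max(
            (SequenceMatcher(None, value, other).ratio()
             for other_tag, other in db if other_tag == tag),
            default=0.0,
        )
        if best >= value_threshold:
            matched += 1
    return matched / max(len(da), len(db))


def find_duplicates(root, tag, threshold=0.5):
    """Duplicate detection: naive pairwise comparison of all candidates."""
    cands = candidates(root, tag)
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(cands), 2)
        if similarity(a, b) >= threshold  # duplicate definition, part 2
    ]


# Hypothetical sample data: the first two entries differ only by a typo.
SAMPLE = """
<movies>
  <movie><title>The Matrix</title><year>1999</year></movie>
  <movie><title>The Matrx</title><year>1999</year></movie>
  <movie><title>Signs</title><year>2002</year></movie>
</movies>
"""

if __name__ == "__main__":
    print(find_duplicates(ET.fromstring(SAMPLE), "movie"))  # → [(0, 1)]
```

The sketch compares only descendant values; the abstract notes that DogmatiX also considers parents, children, and structure, which this example does not attempt.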

This publication has 7 references indexed in Scilit:

Efficient Similarity Search for Hierarchical Data in Large Databases
Lecture Notes in Computer Science, 2004
Finding similar identities among objects from multiple web sources
Published by Association for Computing Machinery (ACM), 2003
Adaptive duplicate detection using learnable string similarity measures
Published by Association for Computing Machinery (ACM), 2003
Interactive deduplication using active learning
Published by Association for Computing Machinery (ACM), 2002
Approximate XML joins
Published by Association for Computing Machinery (ACM), 2002
Probabilistic linkage of large public health data files
Statistics in Medicine, 1995
The merge/purge problem for large databases
Published by Association for Computing Machinery (ACM), 1995

Cited by 69 articles