Collection statistics for fast duplicate document detection

1 April 2002

journal article
Published by Association for Computing Machinery (ACM) in ACM Transactions on Information Systems

Vol. 20 (2), 171-191
https://doi.org/10.1145/506309.506311

Abstract

We present a new algorithm for duplicate document detection that uses collection statistics. We compare our approach with the state-of-the-art approach using multiple collections. These collections include a 30 MB collection of 18,577 web documents developed by Excite@Home and three NIST collections. The first NIST collection consists of 18,232 LA-Times documents (100 MB), which is roughly similar in the number of documents to the Excite@Home collection. The other two collections are both 2 GB: a collection of 247,491 web documents and the 528,023-document collection from TREC disks 4 and 5. We show that our approach, called I-Match, scales in terms of the number of documents and works well for documents of all sizes. Compared with the state of the art, our approach improved detection accuracy and executed in roughly one-fifth the time.
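
The abstract does not describe the mechanics of I-Match, so the following is only an illustrative sketch of one way collection statistics could drive duplicate detection. It assumes that terms are filtered by inverse document frequency (idf) and that the surviving terms are hashed into one fingerprint per document. The function names, the `min_idf` threshold, and the whitespace tokenization are hypothetical choices made for this example, not details taken from the article.

```python
import hashlib
import math
from collections import defaultdict


def idf_table(docs):
    """Compute inverse document frequency for every term in the collection."""
    n = len(docs)
    df = defaultdict(int)
    for tokens in docs:
        for term in set(tokens):  # count each term once per document
            df[term] += 1
    return {term: math.log(n / count) for term, count in df.items()}


def fingerprint(tokens, idf, min_idf):
    """Hash the sorted set of terms whose idf meets the assumed threshold."""
    kept = sorted({t for t in tokens if idf.get(t, 0.0) >= min_idf})
    if not kept:
        return None  # nothing distinctive left to hash
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


def duplicate_groups(docs, min_idf=0.5):
    """Group document indices that share a fingerprint (assumed threshold)."""
    idf = idf_table(docs)
    buckets = defaultdict(list)
    for i, tokens in enumerate(docs):
        fp = fingerprint(tokens, idf, min_idf)
        if fp is not None:
            buckets[fp].append(i)
    return [ids for ids in buckets.values() if len(ids) > 1]


docs = [
    "the quick brown fox".split(),
    "the the quick fox brown".split(),  # same distinctive terms as document 0
    "the lazy dog sleeps".split(),
    "a completely different text the".split(),
]
groups = duplicate_groups(docs)
```

In this sketch each document is hashed once and looked up in a dictionary, so grouping takes a single pass over the collection after the idf table is built. Whether that matches the cost profile reported in the article is not something the abstract confirms.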
