Duplicate record elimination in large data files

Abstract
The issue of duplicate elimination for large data files in which many occurrences of the same record may appear is addressed. A comprehensive cost analysis of the duplicate elimination operation is presented. This analysis is based on a combinatorial model developed for estimating the size of intermediate runs produced by a modified merge-sort procedure. The performance of this modified merge-sort procedure is demonstrated to be significantly superior to the standard duplicate elimination technique of sorting followed by a sequential pass to locate duplicate records. The results can also be used to provide critical input to a query optimizer in a relational database system.
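The abstract gives no implementation details for the modified merge-sort; the following is a minimal Python sketch, under stated assumptions, of the general idea of folding duplicate elimination into the merge passes of an external merge-sort, contrasted with the sort-then-scan baseline the abstract names. The function names and the run_size / fan_in parameters are illustrative, not the authors'.

```python
import heapq

def merge_runs_dropping_duplicates(runs):
    """Merge already-sorted runs into one sorted run, discarding
    duplicate records as they appear (the modified merge step).
    Intermediate runs therefore shrink at every merge level."""
    merged = []
    last = object()  # sentinel: no record emitted yet
    for record in heapq.merge(*runs):
        if record != last:  # keep only the first occurrence
            merged.append(record)
            last = record
    return merged

def duplicate_eliminating_merge_sort(records, run_size=4, fan_in=2):
    """Sketch: sort and de-duplicate initial runs, then repeatedly
    merge groups of runs, dropping duplicates at each pass."""
    # Initial runs: sort each block and remove duplicates within it.
    runs = [sorted(set(records[i:i + run_size]))
            for i in range(0, len(records), run_size)]
    # Merge passes: combine fan_in runs at a time until one remains.
    while len(runs) > 1:
        runs = [merge_runs_dropping_duplicates(runs[i:i + fan_in])
                for i in range(0, len(runs), fan_in)]
    return runs[0] if runs else []

def sort_then_scan(records):
    """Baseline from the abstract: full sort, then one sequential
    pass that drops adjacent duplicate records."""
    out = []
    for record in sorted(records):
        if not out or record != out[-1]:
            out.append(record)
    return out

if __name__ == "__main__":
    data = [5, 3, 5, 1, 3, 5, 2, 2, 9, 1, 5]
    assert duplicate_eliminating_merge_sort(data) == sort_then_scan(data)
    print(duplicate_eliminating_merge_sort(data))  # [1, 2, 3, 5, 9]
```

Both routines produce the same duplicate-free output; the difference the paper analyzes is cost, since discarding duplicates early keeps the intermediate runs smaller at every merge level.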
