Efficient Merging and Filtering Algorithms for Approximate String Searches

Top Cited Papers

1 April 2008

conference paper
conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

p. 257-266
https://doi.org/10.1109/icde.2008.4497434

Abstract

We study the following problem: how to efficiently find in a collection of strings those similar to a given query string? Various similarity functions can be used, such as edit distance, Jaccard similarity, and cosine similarity. This problem is of great interests to a variety of applications that need a high real-time performance, such as data cleaning, query relaxation, and spellchecking. Several algorithms have been proposed based on the idea of merging inverted lists of grams generated from the strings. In this paper we make two contributions. First, we develop several algorithms that can greatly improve the performance of existing algorithms. Second, we study how to integrate existing filtering techniques with these algorithms, and show that they should be used together judiciously, since the way to do the integration can greatly affect the performance. We have conducted experiments on several real data sets to evaluate the proposed techniques.

Keywords

This publication has 6 references indexed in Scilit:

Scaling up all pairs similarity search
Published by Association for Computing Machinery (ACM) ,2007
A Primitive Operator for Similarity Joins in Data Cleaning
Published by Institute of Electrical and Electronics Engineers (IEEE) ,2006
Efficient set joins on similarity predicates
Published by Association for Computing Machinery (ACM) ,2004
Robust and efficient fuzzy match for online data cleaning
Published by Association for Computing Machinery (ACM) ,2003
A guided tour to approximate string matching
ACM Computing Surveys, 2001
Approximate string-matching with q-grams and maximal matches
Theoretical Computer Science, 1992

Cited by 184 articles