Space-Constrained Gram-Based Indexing for Efficient Approximate String Search
- 1 March 2009
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- p. 604-615
- https://doi.org/10.1109/icde.2009.32
Abstract
Answering approximate queries on string collections is important in applications such as data cleaning, query relaxation, and spell checking, where inconsistencies and errors exist in user queries as well as data. Many existing algorithms use gram-based inverted-list indexing structures to answer approximate string queries. These indexing structures are "notoriously" large compared to the size of their original string collection. In this paper, we study how to reduce the size of such an indexing structure to a given amount of space, while retaining efficient query processing. We first study how to adopt existing inverted-list compression techniques to solve our problem. Then, we propose two novel approaches for achieving the goal: one is based on discarding gram lists, and one is based on combining correlated lists. They are both orthogonal to existing compression techniques, exploit a unique property of our setting, and offer new opportunities for improving query performance. For each approach we analyze its effect on query performance and develop algorithms for wisely choosing lists to discard or combine. Our extensive experiments on real data sets show that our approaches provide applications the flexibility in deciding the tradeoff between query performance and indexing size, and can outperform existing compression techniques. An interesting and surprising finding is that while we can reduce the index size significantly (up to 60% reduction) with tolerable performance penalties, for 20-40% reductions we can even improve query performance compared to original indexes.
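As a rough illustration of the kind of structure the abstract refers to, the sketch below builds a small gram-based inverted-list index and answers an edit-distance query with a count filter followed by verification. It is not the paper's algorithm: all function names, the gram length `q=2`, and the `discarded` parameter are assumptions made for this example. The adjustment for discarded gram lists is a simple conservative bound (each query gram whose list is missing lowers the count threshold by one), not the list-selection method the authors develop.

```python
from collections import defaultdict


def grams(s, q=2):
    """Return the overlapping q-grams of s, duplicates kept."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q=2):
    """Map each q-gram to the sorted ids of the strings that contain it."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for g in grams(s, q):
            index[g].add(sid)
    return {g: sorted(ids) for g, ids in index.items()}


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def search(strings, index, query, k, q=2, discarded=frozenset()):
    """Return strings within edit distance k of query (assumed sketch).

    One edit destroys at most q of the query's grams, so an answer must
    contain at least len(grams(query)) - k*q of them. Query grams whose
    lists were discarded cannot be counted, so each lowers the bound by 1.
    """
    usable = [g for g in grams(query, q) if g not in discarded]
    threshold = len(usable) - k * q
    if threshold <= 0:
        # The count filter cannot prune anything: verify every string.
        candidates = range(len(strings))
    else:
        counts = defaultdict(int)
        for g in usable:
            for sid in index.get(g, ()):
                counts[sid] += 1
        candidates = [sid for sid, c in counts.items() if c >= threshold]
    return sorted(strings[sid] for sid in candidates
                  if edit_distance(strings[sid], query) <= k)


names = ["michael", "michelle", "mitchell", "micheal", "rachel"]
full_index = build_index(names)

# Hypothetical space saving: drop the inverted list of one gram.
dropped = frozenset({"ch"})
small_index = {g: ids for g, ids in full_index.items() if g not in dropped}

print(search(names, full_index, "michael", k=2))
print(search(names, small_index, "michael", k=2, discarded=dropped))
```

Both calls return the same answers, because the lowered threshold only admits more candidates to the verification step. The index shrinks at the cost of a weaker filter, which is the size versus query-performance tradeoff the abstract describes.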