A fast algorithm for exact sequence search in biological sequences using polyphase decomposition
Open Access
- 4 September 2010
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 26 (18), i414-i419
- https://doi.org/10.1093/bioinformatics/btq364
Abstract
Motivation: Exact sequence search allows a user to search for a specific DNA subsequence in a larger DNA sequence or database. It serves as a vital block in many areas such as Pharmacogenetics, Phylogenetics and Personal Genomics. As sequencing of genomic data becomes increasingly affordable, the amount of sequence data that must be processed will also increase exponentially. In this context, fast sequence search algorithms will play an important role in exploiting the information contained in the newly sequenced data. Many existing algorithms do not scale up well for large sequences or databases because of their high-computational costs. This article describes an efficient algorithm for performing fast searches on large DNA sequences. It makes use of hash tables of Q-grams that are constructed after downsampling the database, to enable efficient search and memory use. Time complexity for pattern search is reduced using beam pruning techniques. Theoretical complexity calculations and performance figures are presented to indicate the potential of the proposed algorithm. Contact: s.abhilash@samsung.com; ajit.b@samsung.comKeywords
This publication has 12 references indexed in Scilit:
- Fast and accurate long-read alignment with Burrows–Wheeler transformBioinformatics, 2010
- Fast exact string matching algorithmsInformation Processing Letters, 2007
- Versatile and open software for comparing large genomesGenome Biology, 2004
- PatternHunter: faster and more sensitive homology searchBioinformatics, 2002
- SSAHA: A Fast Search Method for Large DNA DatabasesGenome Research, 2001
- Basic Local Alignment Search ToolJournal of Molecular Biology, 1990
- Basic local alignment search toolJournal of Molecular Biology, 1990
- Improved tools for biological sequence comparison.Proceedings of the National Academy of Sciences of the United States of America, 1988
- Rapid and Sensitive Protein Similarity SearchesScience, 1985
- Fast Pattern Matching in StringsSIAM Journal on Computing, 1977