Fast Mapping of Short Sequences with Mismatches, Insertions and Deletions Using Index Structures

Open Access

11 September 2009

journal article
research article
Published by Public Library of Science (PLoS) in PLoS Computational Biology

Vol. 5 (9), e1000502
https://doi.org/10.1371/journal.pcbi.1000502

Abstract

With few exceptions, current methods for short read mapping use simple seed heuristics to speed up the search. Most of the underlying matching models allow mismatches but neglect insertions and deletions. Current evaluations indicate, however, that very different error models apply to the novel high-throughput sequencing methods. While mismatches are the most frequent error type in Illumina reads, reads produced by 454's GS FLX predominantly contain insertions and deletions (indels). Even though 454 sequencers are able to produce longer reads, the method is frequently applied to small RNA (miRNA and siRNA) sequencing. Fast and accurate matching, in particular of short reads with diverse errors, is therefore a pressing practical problem. We introduce a matching model for short reads that can cope with indels as well as mismatches and that addresses different error models. For example, it can handle leading and trailing contamination caused by primers and poly-A tails in transcriptomics, as well as the length-dependent increase of error rates. In these contexts, it thus simplifies the tedious and error-prone trimming step. For efficient searches, our method utilizes index structures in the form of enhanced suffix arrays. In a comparison with current methods for short read mapping, the presented approach shows significantly increased performance not only for 454 reads, but also for Illumina reads. Our approach is implemented in the software segemehl, available at http://www.bioinf.uni-leipzig.de/Software/segemehl/.

Author Summary

The successful mapping of high-throughput sequencing (HTS) reads to reference genomes largely depends on the accuracy of both the sequencing technologies and the reference genomes. Current mapping algorithms focus on mapping with mismatches but largely neglect insertions and deletions, regardless of whether they are caused by sequencing errors or genomic variation. Furthermore, trailing contamination by primers and declining read qualities can be cumbersome for programs that allow only a fixed maximum number of mismatches. We have developed and implemented a new approach for short read mapping that, in a first step, computes exact matches between the read and the reference genome. The exact matches are then modified by a limited number of mismatches, insertions and deletions. From the set of exact and inexact matches, we select those with minimum score-based E-values. This gives a set of regions in the reference genome, each of which is aligned to the read using Myers' bit-vector algorithm [1]. Our method utilizes enhanced suffix arrays [2] to quickly find the exact and inexact matches. It maps more reads and achieves higher recall rates than previous methods. This holds consistently for reads produced by 454 as well as Illumina sequencing technologies.
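
To make the final alignment step more concrete, the following is a minimal Python sketch of the search variant of Myers' bit-vector algorithm, which reports every position in a text where some substring ends within edit distance k of a pattern (mismatches, insertions and deletions all cost 1). It is an illustration only, not the segemehl implementation: the function name `myers_search`, its return format, and the use of Python's unbounded integers in place of machine words are assumptions made for this example, and the seed search with enhanced suffix arrays and the E-value filtering are not shown.

```python
def myers_search(pattern, text, k):
    """Return (end_index, distance) for each 0-based text position where a
    substring ending there is within edit distance k of the pattern.

    Hypothetical helper for illustration; unit costs are assumed.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")

    # peq[c] has bit i set exactly when pattern[i] == c.
    peq = {}
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)

    mask = (1 << m) - 1   # keeps all vectors at m bits (emulates a machine word)
    high = 1 << (m - 1)   # bit of the last pattern row

    pv = mask             # positive vertical deltas: first column is 0,1,...,m
    mv = 0                # negative vertical deltas
    score = m             # distance in the last row of the current column
    hits = []

    for j, c in enumerate(text):
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask   # positive horizontal deltas
        mh = pv & xh                    # negative horizontal deltas

        # Update the last-row score from its horizontal delta.
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1

        # Shift without setting bit 0: row 0 stays at distance 0, so a
        # match may start at any text position (search, not global distance).
        ph = (ph << 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv

        if score <= k:
            hits.append((j, score))
    return hits


print(myers_search("ACGT", "TTACGTTT", 1))  # → [(4, 1), (5, 0), (6, 1)]
```

Each text character is processed with a constant number of bitwise operations on m-bit vectors, which is why this kind of verification is cheap once an index has narrowed the search to a few candidate regions.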

This publication has 17 references indexed in Scilit:

SHRiMP: Accurate Mapping of Short Color-space Reads. PLoS Computational Biology, 2009.
Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 2009.
Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 2009.
Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research, 2008.
ZOOM! Zillions of oligos mapped. Bioinformatics, 2008.
Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Research, 2008.
PatMaN: rapid alignment of short sequences to large databases. Bioinformatics, 2008.
SOAP: short oligonucleotide alignment program. Bioinformatics, 2008.
Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology, 2007.
Sublinear approximate string matching and biological applications. Algorithmica, 1994.

Cited by 517 articles