Efficient construction of an assembly string graph using the FM-index
Open Access
- 1 June 2010
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 26 (12), i367-i373
- https://doi.org/10.1093/bioinformatics/btq217
Abstract
Motivation: Sequence assembly is a difficult problem whose importance has grown again recently as the cost of sequencing has dramatically dropped. Most new sequence assembly software has started by building a de Bruijn graph, avoiding the overlap-based methods used previously because of the computational cost and complexity of these with very large numbers of short reads. Here, we show how to use suffix array-based methods that have formed the basis of recent very fast sequence mapping algorithms to find overlaps and generate assembly string graphs asymptotically faster than previously described algorithms.
Results: Standard overlap assembly methods have time complexity O(N²), where N is the sum of the lengths of the reads. We use the Ferragina–Manzini index (FM-index) derived from the Burrows–Wheeler transform to find overlaps of length at least τ among a set of reads. As well as an approach that finds all overlaps then implements transitive reduction to produce a string graph, we show how to output directly only the irreducible overlaps, significantly shrinking memory requirements and reducing compute time to O(N), independent of depth. Overlap-based assembly methods naturally handle mixed length read sets, including capillary reads or long reads promised by the third generation sequencing technologies. The algorithms we present here pave the way for overlap-based assembly approaches to be developed that scale to whole vertebrate genome de novo assembly.
Contact: js18@sanger.ac.uk
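The overlap search described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names (build_fm_index, find_overlaps, all_overlaps) and the toy reads are assumptions made for illustration. The index is built by naively sorting all suffixes, which is far slower than the construction a real assembler would use, and the final step scans the interval for read starts instead of extending it in constant time. It only shows how backward search over the BWT of a read set finds exact suffix-prefix overlaps of length at least τ (tau in the code), ignoring reverse complements and sequencing errors.

```python
from collections import Counter

def build_fm_index(reads):
    """Naive FM-index over a read set; each read is terminated by '$'."""
    # Sort every suffix of every read; equal suffixes are ordered by read id.
    suffixes = sorted(
        (read[i:] + "$", r, i)
        for r, read in enumerate(reads)
        for i in range(len(read) + 1)
    )
    sa = [(r, i) for _, r, i in suffixes]
    # BWT symbol = character preceding each suffix ('$' marks a read start).
    bwt = [reads[r][i - 1] if i > 0 else "$" for r, i in sa]
    counts = Counter(bwt)
    # C[c] = number of suffixes starting with a symbol smaller than c.
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, bwt, C, occ

def find_overlaps(index, reads, x_id, tau):
    """Return (x_id, y_id, length) for each suffix of read x_id that equals a
    prefix of another read y_id with length >= tau."""
    sa, bwt, C, occ = index
    x = reads[x_id]
    lo, hi = 0, len(bwt)  # half-open interval of suffixes matching x[k:]
    found = []
    for k in range(len(x) - 1, -1, -1):
        c = x[k]
        if c not in C:
            break
        # Standard backward-search step: prepend c to the matched suffix.
        lo, hi = C[c] + occ[c][lo], C[c] + occ[c][hi]
        if lo >= hi:
            break
        length = len(x) - k
        if length >= tau:
            # A '$' in the BWT means the match starts at the beginning of a read.
            for j in range(lo, hi):
                if bwt[j] == "$" and sa[j][0] != x_id:
                    found.append((x_id, sa[j][0], length))
    return found

def all_overlaps(reads, tau):
    index = build_fm_index(reads)
    result = []
    for x_id in range(len(reads)):
        result.extend(find_overlaps(index, reads, x_id, tau))
    return sorted(result)

if __name__ == "__main__":
    toy_reads = ["ACGTAC", "GTACGG", "ACGGTT"]  # hypothetical example reads
    print(all_overlaps(toy_reads, 3))
```

With these toy reads and tau = 3, the sketch reports that read 0 overlaps read 1 and read 1 overlaps read 2, each by 4 bases. Lowering tau to 2 additionally reports a 2-base overlap from read 0 to read 2, which is implied by the other two; this is the kind of transitive edge that the reduction step mentioned in the abstract removes.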