LazyB: fast and cheap genome assembly
Open Access
- 1 June 2021
- journal article
- research article
- Published by Springer Science and Business Media LLC in Algorithms for Molecular Biology
- Vol. 16 (1), 1-23
- https://doi.org/10.1186/s13015-021-00186-5
Abstract
Background: Advances in genome sequencing over the last years have lead to a fundamental paradigm shift in the field. With steadily decreasing sequencing costs, genome projects are no longer limited by the cost of raw sequencing data, but rather by computational problems associated with genome assembly. There is an urgent demand for more efficient and and more accurate methods is particular with regard to the highly complex and often very large genomes of animals and plants. Most recently, “hybrid” methods that integrate short and long read data have been devised to address this need. Results: is such a hybrid genome assembler. It has been designed specificially with an emphasis on utilizing low-coverage short and long reads. starts from a bipartite overlap graph between long reads and restrictively filtered short-read unitigs. This graph is translated into a long-read overlap graph G. Instead of the more conventional approach of removing tips, bubbles, and other local features, stepwisely extracts subgraphs whose global properties approach a disjoint union of paths. First, a consistently oriented subgraph is extracted, which in a second step is reduced to a directed acyclic graph. In the next step, properties of proper interval graphs are used to extract contigs as maximum weight paths. These path are translated into genomic sequences only in the final step. A prototype implementation of , entirely written in python, not only yields significantly more accurate assemblies of the yeast and fruit fly genomes compared to state-of-the-art pipelines but also requires much less computational effort. Conclusions: is new low-cost genome assembler that copes well with large genomes and low coverage. It is based on a novel approach for reducing the overlap graph to a collection of paths, thus opening new avenues for future improvements. Availability: The prototype is available at https://github.com/TGatter/LazyB.Keywords
Funding Information
- German Research Foundation DFS (850/19-2 within SPP 1738)
- Bundesministerium für Bildung und Forschung (BMBF 031L0164C, de.NBI-RBC)
- RSF / Helmholtz Association programme (18-44-06201)
- Deutscher Akademischer Austausch Dienst Kairo (DAAD)
- Universität Leipzig
This publication has 65 references indexed in Scilit:
- How to apply de Bruijn graphs to genome assemblyNature Biotechnology, 2011
- A general approach for optimizing regular criteria in the job-shop scheduling problemEuropean Journal of Operational Research, 2011
- ABySS: A parallel assembler for short read sequence dataGenome Research, 2009
- The fragment assembly string graphBioinformatics, 2005
- A New Algorithm for DNA Sequence AssemblyJournal of Computational Biology, 1995
- Approximating maximum independent sets by excluding subgraphsBIT Numerical Mathematics, 1992
- An Algorithm for Finding a Minimum Equivalent Graph of a DigraphJournal of the ACM, 1969
- Topological sorting of large networksCommunications of the ACM, 1962
- On the shortest spanning subtree of a graph and the traveling salesman problemProceedings of the American Mathematical Society, 1956
- Ueber die Auflösung der Gleichungen, auf welche man bei der Untersuchung der linearen Vertheilung galvanischer Ströme geführt wirdAnnalen der Physik, 1847