A fast, lock-free approach for efficient parallel counting of occurrences of k-mers

Top Cited Papers

7 January 2011

journal article
Published by Oxford University Press (OUP) in Bioinformatics

Vol. 27 (6), 764-770
https://doi.org/10.1093/bioinformatics/btr011

Abstract

Counting the number of occurrences of every k-mer (substring of length k) in a long string is a central subproblem in many applications, including genome assembly, error correction of sequencing reads, fast multiple sequence alignment and repeat detection. Recently, the deep sequence coverage generated by next-generation sequencing technologies has caused the amount of sequence to be processed during a genome project to grow rapidly, and has rendered current k-mer counting tools too slow and memory intensive. At the same time, large multicore computers have become commonplace in research facilities allowing for a new parallel computational paradigm. We propose a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient. It is based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. Due to their flexibility, suffix arrays have been the data structure of choice for solving many string problems. For the task of k-mer counting, important in many biological applications, Jellyfish offers a much faster and more memory-efficient solution. The Jellyfish software is written in C++ and is GPL licensed. It is available for download at http://www.cbcb.umd.edu/software/jellyfish.

Keywords

This publication has 19 references indexed in Scilit:

Quake: quality-aware detection and correction of sequencing errors
Genome Biology, 2010
Multi-Platform Next-Generation Sequencing of the Domestic Turkey (Meleagris gallopavo): Genome Assembly and Analysis
PLoS Biology, 2010
The sequence and de novo assembly of the giant panda genome
Nature, 2009
A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes
BMC Genomics, 2008
MapReduce
Communications of the ACM, 2008
RAP: a new computer program for de novo identification of repeated sequences in whole genomes
Bioinformatics, 2004
MUSCLE: multiple sequence alignment with high accuracy and high throughput
Nucleic Acids Research, 2004
FORRepeats: detects repeats on entire chromosomes and between genomes
Bioinformatics, 2003
Whole-Genome Sequence Assembly for Mammalian Genomes: Arachne 2
Genome Research, 2003
Annotating Large Genomes With Exact Word Matches
Genome Research, 2003

Cited by 3027 articles