Efficient counting of k-mers in DNA sequences using a bloom filter

Open Access

10 August 2011

journal article
research article
Published by Springer Science and Business Media LLC in BMC Bioinformatics

Vol. 12 (1), 1-7
https://doi.org/10.1186/1471-2105-12-333

Abstract

Counting k-mers (substrings of length k in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and for error correction of sequence reads. Although simple in principle, counting k-mers in large modern sequence data sets can easily overwhelm the memory capacity of standard computers. In current data sets, a large fraction-often more than 50%-of the storage capacity may be spent on storing k-mers that contain sequencing errors and which are typically observed only a single time in the data. These singleton k-mers are uninformative for many algorithms without some kind of error correction. We present a new method that identifies all the k-mers that occur more than once in a DNA sequence data set. Our method does this using a Bloom filter, a probabilistic data structure that stores all the observed k-mers implicitly in memory with greatly reduced memory requirements. We then make a second sweep through the data to provide exact counts of all nonunique k-mers. For example data sets, we report up to 50% savings in memory usage compared to current software, with modest costs in computational speed. This approach may reduce memory requirements for any algorithm that starts by counting k-mers in sequence data with errors. A reference implementation for this methodology, BFCounter, is written in C++ and is GPL licensed. It is available for free download at http://pritch.bsd.uchicago.edu/bfcounter.html

Keywords

This publication has 22 references indexed in Scilit:

Multiplexed shotgun genotyping for rapid and efficient genetic mapping
Genome Research, 2011
A fast, lock-free approach for efficient parallel counting of occurrences of k-mers
Bioinformatics, 2011
High-quality draft assemblies of mammalian genomes from massively parallel sequence data
Proceedings of the National Academy of Sciences of the United States of America, 2010
A map of human genome variation from population-scale sequencing
Nature, 2010
Classification of DNA sequences using Bloom filters
Bioinformatics, 2010
A Parallel Algorithm for Error Correction in High-Throughput Short-Read Data on CUDA-Enabled Graphics Hardware
Journal of Computational Biology, 2010
De novo assembly of human genomes with massively parallel short read sequencing
Genome Research, 2009
ABySS: A parallel assembler for short read sequence data
Genome Research, 2009
Velvet: Algorithms for de novo short read assembly using de Bruijn graphs
Genome Research, 2008
ALLPATHS: De novo assembly of whole-genome shotgun microreads
Genome Research, 2008

Cited by 209 articles