Disk compression of k-mer sets
Open Access
- 21 June 2021
- journal article
- research article
- Published by Springer Science and Business Media LLC in Algorithms for Molecular Biology
- Vol. 16 (1), 1-14
- https://doi.org/10.1186/s13015-021-00192-7
Abstract
K-mer based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large multi-terabyte-sized datasets. Storing such large datasets is a detriment to tool developers, tool users, and reproducibility efforts. General purpose compressors like gzip, or those designed for read data, are sub-optimal because they do not take into account the specific redundancy pattern in k-mer sets. In our earlier work (Rahman and Medvedev, RECOMB 2020), we presented an algorithm UST-Compress that uses a spectrum-preserving string set representation to compress a set of k-mers to disk. In this paper, we present two improved methods for disk compression of k-mer sets, called ESS-Compress and ESS-Tip-Compress. They use a more relaxed notion of string set representation to further remove redundancy from the representation of UST-Compress. We explore their behavior both theoretically and on real data. We show that they improve the compression sizes achieved by UST-Compress by up to 27 percent, across a breadth of datasets. We also derive lower bounds on how well this type of compression strategy can hope to do.Keywords
Funding Information
- National Science Foundation (1453527, 1439057)
- National Science Foundation (1453527, 1439057)
- National Institutes of Health (CBIOS training grant)
- INCEPTION project
This publication has 38 references indexed in Scilit:
- Fast search of thousands of short-read sequencing experimentsNature Biotechnology, 2016
- Kraken: ultrafast metagenomic sequence classification using exact alignmentsGenome Biology, 2014
- On the Representation of de Bruijn GraphsLecture Notes in Computer Science, 2014
- MFCompress: a compression tool for FASTA and multi-FASTA dataBioinformatics, 2013
- DSK: k-mer counting with very low memory usageBioinformatics, 2013
- SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell SequencingJournal of Computational Biology, 2012
- Succinct de Bruijn GraphsLecture Notes in Computer Science, 2012
- GAGE: A critical evaluation of genome assemblies and assembly algorithmsGenome Research, 2011
- A fast, lock-free approach for efficient parallel counting of occurrences of k-mersBioinformatics, 2011
- Basic Terminology, Notation and ResultsPublished by Springer Science and Business Media LLC ,2009