Compression of next-generation sequencing reads aided by highly efficient de novo assembly
Open Access
- 13 August 2012
- journal article
- research article
- Published by Oxford University Press (OUP) in Nucleic Acids Research
- Vol. 40 (22), e171
- https://doi.org/10.1093/nar/gks754
Abstract
We present Quip, a lossless compression algorithm for next-generation sequencing data in the FASTQ and SAM/BAM formats. In addition to implementing reference-based compression, we have developed, to our knowledge, the first assembly-based compressor, using a novel de novo assembly algorithm. A probabilistic data structure is used to dramatically reduce the memory required by traditional de Bruijn graph assemblers, allowing millions of reads to be assembled very efficiently. Read sequences are then stored as positions within the assembled contigs. This is combined with statistical compression of read identifiers, quality scores, alignment information and sequences, effectively collapsing very large data sets to <15% of their original size with no loss of information.
Availability: Quip is freely available under the 3-clause BSD license from http://cs.washington.edu/homes/dcjones/quip.
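The assembly-based idea in the abstract can be illustrated with a toy sketch. This is not Quip's implementation: the abstract does not name the probabilistic data structure, so a plain Bloom filter is swapped in here as a stand-in, and every name below (`KmerBloomFilter`, `extend_contig`, `encode_read`), the k-mer length, and the example sequence are hypothetical. The sketch records the k-mers of a few reads in the filter, greedily walks the implicit de Bruijn graph to build one contig, and then represents each read as an offset into that contig.

```python
import hashlib


class KmerBloomFilter:
    """Approximate k-mer set: no false negatives, rare false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, kmer):
        # One salted hash per hash function, reduced to a bit index.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, kmer):
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7))
            for pos in self._positions(kmer)
        )


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def extend_contig(seed, bloom, max_len=1000):
    """Greedily extend a seed k-mer to the right through the implicit graph."""
    k = len(seed)
    contig = seed
    seen = {seed}
    while len(contig) < max_len:
        suffix = contig[-(k - 1):]
        successors = [suffix + base for base in "ACGT" if suffix + base in bloom]
        # Stop at dead ends, branches, and cycles.
        if len(successors) != 1 or successors[0] in seen:
            break
        seen.add(successors[0])
        contig += successors[0][-1]
    return contig


def encode_read(read, contig):
    """Return the read's offset in the contig, or None if it is not contained."""
    pos = contig.find(read)
    return pos if pos >= 0 else None


# Hypothetical toy data: four overlapping reads from one short sequence.
K = 5
source = "ACGTTGCATGGCCTAAGCT"
reads = [source[0:8], source[4:12], source[8:16], source[11:19]]

bloom = KmerBloomFilter()
for read in reads:
    for kmer in kmers(read, K):
        bloom.add(kmer)

contig = extend_contig(reads[0][:K], bloom)
offsets = [encode_read(read, contig) for read in reads]
```

In this sketch a read that maps to the contig costs one integer instead of its full sequence, which is the intuition behind storing reads as contig positions. A real compressor would also need to handle mismatches, reverse complements, and reads that match no contig, none of which are modeled here.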