Compression of next-generation sequencing reads aided by highly efficient de novo assembly

Open Access

13 August 2012

journal article
research article
Published by Oxford University Press (OUP) in Nucleic Acids Research

Vol. 40 (22), e171
https://doi.org/10.1093/nar/gks754

Abstract

We present Quip, a lossless compression algorithm for next-generation sequencing data in the FASTQ and SAM/BAM formats. In addition to implementing reference-based compression, we have developed, to our knowledge, the first assembly-based compressor, using a novel de novo assembly algorithm. A probabilistic data structure is used to dramatically reduce the memory required by traditional de Bruijn graph assemblers, allowing millions of reads to be assembled very efficiently. Read sequences are then stored as positions within the assembled contigs. This is combined with statistical compression of read identifiers, quality scores, alignment information and sequences, effectively collapsing very large data sets to <15% of their original size with no loss of information.

Availability: Quip is freely available under the 3-clause BSD license from http://cs.washington.edu/homes/dcjones/quip.
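The assembly-based idea in the abstract can be illustrated with a toy sketch. This is not Quip's implementation: it assumes a plain Bloom filter as the probabilistic k-mer set, a greedy right-extension in place of a real assembler, and exact-match placement of reads. All names (`KmerBloomFilter`, `extend_contig`, `encode_reads`, `decode_reads`) and parameters (`num_bits`, `num_hashes`, `k`) are hypothetical and chosen only for illustration.

```python
import hashlib


class KmerBloomFilter:
    """Approximate k-mer set: no false negatives, rare false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, kmer):
        # Derive several bit positions from salted BLAKE2b digests.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(kmer.encode(), digest_size=8,
                                     salt=bytes([seed])).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(kmer))


def extend_contig(seed, kmers, k, max_len=1000):
    """Greedily extend a seed k-mer rightwards while the successor is unique."""
    contig, seen = seed, {seed}
    while len(contig) < max_len:
        suffix = contig[-(k - 1):]
        successors = [suffix + base for base in "ACGT" if suffix + base in kmers]
        # Stop at dead ends, branches, or cycles.
        if len(successors) != 1 or successors[0] in seen:
            break
        seen.add(successors[0])
        contig += successors[0][-1]
    return contig


def encode_reads(reads, k=5):
    """Assemble one contig, then store each read as (offset, length) if it fits."""
    kmers = KmerBloomFilter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    contig = extend_contig(reads[0][:k], kmers, k)
    encoded = []
    for read in reads:
        offset = contig.find(read)
        # Reads that do not match the contig are kept verbatim (lossless).
        encoded.append((offset, len(read)) if offset >= 0 else read)
    return contig, encoded


def decode_reads(contig, encoded):
    """Invert encode_reads: positions become substrings, strings pass through."""
    return [item if isinstance(item, str) else contig[item[0]:item[0] + item[1]]
            for item in encoded]


# Toy data: overlapping 8-base reads from a 20-base sequence, plus one outlier.
genome = "ACGTTGCAAGGCTTACCGAT"
reads = [genome[i:i + 8] for i in range(0, 13, 3)] + ["TTTTTTTT"]
contig, encoded = encode_reads(reads)
assert decode_reads(contig, encoded) == reads
```

A Bloom filter can report a k-mer that was never inserted, which may introduce a spurious branch and end a contig early. In this sketch that only reduces compression, since unplaced reads are stored verbatim and decoding remains exact.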

Keywords

MASSIVELY-PARALLEL GENOME SEQUENCING
