An Introduction to MPEG-G: The First Open ISO/IEC Standard for the Compression and Exchange of Genomic Sequencing Data
Open Access
- 15 June 2021
- journal article
- research article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in Proceedings of the IEEE
- Vol. 109 (9), 1607-1622
- https://doi.org/10.1109/jproc.2021.3082027
Abstract
The development and progress of high-throughput sequencing technologies have transformed the sequencing of DNA from a scientific research challenge to practice. With the release of the latest generation of sequencing machines, the cost of sequencing a whole human genome has dropped to less than $600. Such achievements open the door to personalized medicine, where it is expected that the genomic information of patients will be analyzed as standard practice. However, the associated costs of storing, transmitting, and processing the large volumes of data are already comparable to the costs of sequencing. To support the design of new and interoperable solutions for the representation, compression, and management of genomic sequencing data, the Moving Picture Experts Group (MPEG), jointly with working group 5 of ISO/TC276 "Biotechnology", has started to produce the ISO/IEC 23092 series, known as MPEG-G. MPEG-G not only offers higher levels of compression than the state of the art but also provides new functionalities, such as built-in support for random access in the compressed domain, support for data protection mechanisms, flexible storage, and streaming capabilities. MPEG-G specifies only the decoding syntax of compressed bitstreams, as well as a file format and a transport format. This allows for the development of new encoding solutions with higher degrees of optimization while maintaining compatibility with any existing MPEG-G decoder.