Genomic dark matter: the reliability of short read mapping illustrated by the genome mappability score
Open Access
- 4 July 2012
- journal article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 28 (16), 2097-2105
- https://doi.org/10.1093/bioinformatics/bts330
Abstract
Motivation: Genome resequencing and short read mapping are two of the primary tools of genomics and are used for many important applications. The current state of the art in mapping uses the quality values and mapping quality scores to evaluate the reliability of the mapping. These attributes, however, are assigned to individual reads and do not directly measure the problematic repeats across the genome. Here, we present the Genome Mappability Score (GMS) as a novel measure of the complexity of resequencing a genome. The GMS is a weighted probability that any read could be unambiguously mapped to a given position and thus measures the overall composition of the genome itself.
Results: We have developed the Genome Mappability Analyzer to compute the GMS of every position in a genome. It leverages the parallelism of cloud computing to analyze large genomes, and enabled us to identify the 5–14% of the human, mouse, fly and yeast genomes that are difficult to analyze with short reads. We examined the accuracy of the widely used BWA/SAMtools polymorphism discovery pipeline in the context of the GMS, and found that discovery errors are dominated by false negatives, especially in regions with poor GMS. These errors are fundamental to the mapping process and cannot be overcome by increasing coverage. As such, the GMS should be considered in every resequencing project to pinpoint the ‘dark matter’ of the genome, including known clinically relevant variations in these regions.
Availability: The source code and profiles of several model organisms are available at http://gma-bio.sourceforge.net
Contact: hlee@cshl.edu
Supplementary Information: Supplementary data are available at Bioinformatics online.
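The abstract describes the GMS only in words, as a weighted probability that a read can be unambiguously mapped to a position. The Python sketch below illustrates one plausible reading of that idea and should not be taken as the paper's exact definition or as the Genome Mappability Analyzer's implementation. It assumes that each simulated read carries a Phred-scaled mapping quality (as reported by mappers such as BWA), converts that quality to a probability of correct placement, and averages the probabilities over all reads covering each position. The function names, the `(start, length, mapq)` read representation and the 0-100 scaling are assumptions made for illustration.

```python
def mapq_to_prob(mapq):
    """Convert a Phred-scaled mapping quality to the probability
    that the read is placed correctly: 1 - 10^(-Q/10)."""
    return 1.0 - 10.0 ** (-mapq / 10.0)


def mappability_scores(reads, genome_length):
    """Return a per-position score in [0, 100], or None where no read overlaps.

    reads: iterable of (start, length, mapq) tuples, with 0-based start
           (hypothetical representation chosen for this sketch).
    """
    total = [0.0] * genome_length   # summed correctness probabilities
    count = [0] * genome_length     # number of reads covering each position
    for start, length, mapq in reads:
        p = mapq_to_prob(mapq)
        for pos in range(max(start, 0), min(start + length, genome_length)):
            total[pos] += p
            count[pos] += 1
    # Average probability per position, scaled to a percentage.
    return [100.0 * t / c if c else None for t, c in zip(total, count)]


if __name__ == "__main__":
    # Two toy reads: one confidently mapped (MAPQ 30), one ambiguous (MAPQ 0).
    toy_reads = [(0, 3, 30), (2, 3, 0)]
    print(mappability_scores(toy_reads, 6))
```

Under this reading, a position covered only by reads with mapping quality 0 scores 0 no matter how many reads cover it, which is consistent with the abstract's point that such errors cannot be overcome by increasing coverage.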