The allele distribution in next-generation sequencing data sets is accurately described as the result of a stochastic branching process
Open Access
- 29 November 2011
- journal article
- research article
- Published by Oxford University Press (OUP) in Nucleic Acids Research
- Vol. 40 (6), 2426-2431
- https://doi.org/10.1093/nar/gkr1073
Abstract
With the availability of next-generation sequencing (NGS) technology, it is expected that sequence variants may be called on a genomic scale. Here, we demonstrate that a deeper understanding of the distribution of the variant call frequencies at heterozygous loci in NGS data sets is a prerequisite for sensitive variant detection. We model the crucial steps in an NGS protocol as a stochastic branching process and derive a mathematical framework for the expected distribution of alleles at heterozygous loci before measurement, that is, before sequencing. We confirm our theoretical results by analyzing technical replicates of human exome data and demonstrate that the variance of allele frequencies at heterozygous loci is higher than expected under a simple binomial distribution. Due to this high variance, mutation callers relying on binomially distributed priors are less sensitive for heterozygous variants that deviate strongly from the expected mean frequency. Our results also indicate that error rates can be reduced to a greater degree by technical replicates than by increasing sequencing depth.
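The abstract's central claim, that a branching step before sequencing inflates the variance of allele frequencies beyond the binomial expectation, can be illustrated with a small simulation. The sketch below is not the authors' model. It assumes, purely for illustration, that the branching step is an amplification in which each molecule is duplicated with a fixed probability per cycle, and that sequencing draws reads binomially from the amplified pool. All function names and parameter values (`start_copies`, `cycles`, `efficiency`, `depth`) are hypothetical.

```python
import random
import statistics


def amplify(count, cycles, efficiency, rng):
    """Assumed branching step: in every cycle each molecule is
    duplicated independently with probability `efficiency`."""
    for _ in range(cycles):
        count += sum(1 for _ in range(count) if rng.random() < efficiency)
    return count


def allele_fraction_after_amplification(start_copies, cycles, efficiency, rng):
    """Amplify both alleles of a heterozygous locus independently and
    return the fraction of allele A in the pool before sequencing."""
    a = amplify(start_copies, cycles, efficiency, rng)
    b = amplify(start_copies, cycles, efficiency, rng)
    return a / (a + b)


def sequence(fraction, depth, rng):
    """Binomial read sampling: observed allele frequency at `depth` reads."""
    reads_a = sum(1 for _ in range(depth) if rng.random() < fraction)
    return reads_a / depth


def simulate(n_loci=1000, start_copies=5, cycles=8, efficiency=0.6,
             depth=100, seed=1):
    """Return (variance with branching step, variance of pure binomial)."""
    rng = random.Random(seed)
    with_branching = []
    binomial_only = []
    for _ in range(n_loci):
        pool_fraction = allele_fraction_after_amplification(
            start_copies, cycles, efficiency, rng)
        with_branching.append(sequence(pool_fraction, depth, rng))
        # Reference case: reads drawn directly from an exact 50/50 pool.
        binomial_only.append(sequence(0.5, depth, rng))
    return (statistics.pvariance(with_branching),
            statistics.pvariance(binomial_only))


if __name__ == "__main__":
    var_branching, var_binomial = simulate()
    print("variance with branching step:", var_branching)
    print("variance, binomial only:     ", var_binomial)
```

Under these assumptions the pool fraction already varies from locus to locus before any read is drawn, so the observed frequencies spread more widely than binomial sampling alone would predict. Raising `depth` shrinks only the sampling component of the variance, which is consistent with the abstract's remark that technical replicates can help more than additional sequencing depth.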