A SNP discovery method to assess variant allele probability from next-generation resequencing data

Open Access

17 December 2009

journal article
Published by Cold Spring Harbor Laboratory in Genome Research

Vol. 20 (2), 273-280
https://doi.org/10.1101/gr.096388.109

Abstract

Accurate identification of genetic variants from next-generation sequencing (NGS) data is essential for immediate large-scale genomic endeavors such as the 1000 Genomes Project, and is crucial for further genetic analysis based on the discoveries. The key challenge in single nucleotide polymorphism (SNP) discovery is to distinguish true individual variants (occurring at a low frequency) from sequencing errors (often occurring at frequencies orders of magnitude higher). Therefore, knowledge of the error probabilities of base calls is essential. We have developed Atlas-SNP2, a computational tool that detects and accounts for systematic sequencing errors caused by context-related variables in a logistic regression model learned from training data sets. Subsequently, it estimates the posterior error probability for each substitution through a Bayesian formula that integrates prior knowledge of the overall sequencing error probability and the estimated SNP rate with the results from the logistic regression model for the given substitutions. The estimated posterior SNP probability can be used to distinguish true SNPs from sequencing errors. Validation results show that Atlas-SNP2 achieves a false-positive rate below 10%, with a false-negative rate of approximately 5% or lower.
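
The two-stage idea in the abstract (a logistic regression score for each substitution, then a Bayesian update using the overall error probability and the SNP rate) can be illustrated with a minimal sketch. This is not the Atlas-SNP2 implementation: the feature names, coefficients, training-set composition, and prior values below are hypothetical, and the odds-based prior correction is a generic technique assumed here for illustration.

```python
import math


def logistic(z):
    """Standard logistic function mapping a linear score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def model_error_probability(features, weights, intercept):
    """Hypothetical logistic regression: P(error | context features),
    as it would be estimated on a training set."""
    z = intercept + sum(weights[name] * value for name, value in features.items())
    return logistic(z)


def posterior_snp_probability(p_error_model, train_error_fraction,
                              prior_error, prior_snp):
    """Combine a classifier output with real-world priors via Bayes' rule.

    p_error_model        -- classifier's P(error | features) under training priors
    train_error_fraction -- fraction of training examples that were errors (assumed known)
    prior_error          -- assumed overall sequencing error probability
    prior_snp            -- assumed SNP rate
    """
    eps = 1e-12  # keep the odds finite when the classifier saturates at 0 or 1
    p = min(max(p_error_model, eps), 1.0 - eps)
    model_odds = p / (1.0 - p)
    train_odds = train_error_fraction / (1.0 - train_error_fraction)
    # Dividing out the training odds leaves P(features | error) / P(features | SNP).
    likelihood_ratio = model_odds / train_odds
    weight_error = likelihood_ratio * prior_error
    weight_snp = prior_snp
    return weight_snp / (weight_snp + weight_error)


# Hypothetical coefficients and features, chosen only to make the example run.
weights = {"base_quality": -0.2, "distance_to_read_end": -0.05}
features = {"base_quality": 30.0, "distance_to_read_end": 20.0}
p_err = model_error_probability(features, weights, intercept=2.0)
posterior = posterior_snp_probability(p_err, train_error_fraction=0.5,
                                      prior_error=0.01, prior_snp=0.001)
print(f"P(error | features) = {p_err:.4f}, posterior P(SNP) = {posterior:.4f}")
```

When the classifier is uninformative (probability 0.5 on a balanced training set), the posterior falls back to the prior ratio alone, which shows why a rare SNP needs strong evidence to outweigh a much more common sequencing error.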

Cited by 145 articles