A Systematic Evaluation of High-Throughput Sequencing Approaches to Identify Low-Frequency Single Nucleotide Variants in Viral Populations

Open Access

20 October 2020

journal article
research article
Published by MDPI AG in Viruses

Vol. 12 (10), 1187
https://doi.org/10.3390/v12101187

Abstract

High-throughput sequencing platforms, such as those provided by Illumina, are an efficient way to understand sequence variation within viral populations. However, it is difficult to distinguish process-introduced error from biological variance, which significantly limits our ability to identify sub-consensus single-nucleotide variants (SNVs). Here we have taken a systematic approach to evaluating laboratory and bioinformatic pipelines for the accurate identification of low-frequency SNVs in viral populations. Artificial DNA and RNA “populations” were created by introducing known SNVs at predetermined frequencies into template nucleic acid, which was then sequenced on an Illumina MiSeq platform. These populations were used to assess the effects of the abundance and type of starting input material, technical replicates, read length and quality, short-read aligner, and percentage frequency thresholds on the ability to call variants accurately. Analyses revealed that the abundance and type of input nucleic acid had the greatest impact on the accuracy of SNV calling, as measured by a micro-averaged Matthews correlation coefficient score, with DNA and high RNA inputs (10⁷ copies) allowing variants to be called at a 0.2% frequency. Reduced input RNA (10⁵ copies) required more technical replicates to maintain accuracy, while low RNA inputs (10³ copies) suffered from consensus-level errors. Base errors occurring at specific motifs were also identified in all technical replicates; these can be excluded to further increase SNV calling accuracy. These findings indicate that samples with low RNA inputs should be excluded from SNV calling and reinforce the importance of optimising the technical and bioinformatic steps in pipelines used to identify sequence variants accurately.
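
As an illustration of the accuracy measure named in the abstract, the following is a minimal Python sketch of a micro-averaged Matthews correlation coefficient: confusion counts from each replicate are pooled before a single score is computed. The abstract does not give the authors' exact procedure, so the function names, the site count, the positions, and the frequencies below are all hypothetical. Only the 0.2% frequency threshold is taken from the text.

```python
from math import sqrt


def call_variants(freqs, threshold):
    """Return positions whose observed variant frequency meets the threshold."""
    return {pos for pos, freq in freqs.items() if freq >= threshold}


def confusion_counts(called, truth, n_sites):
    """Count (TP, FP, TN, FN) for one replicate over n_sites genome positions."""
    tp = len(called & truth)
    fp = len(called - truth)
    fn = len(truth - called)
    tn = n_sites - tp - fp - fn
    return tp, fp, tn, fn


def micro_averaged_mcc(counts):
    """Pool the confusion counts across replicates, then compute one MCC."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    tn = sum(c[2] for c in counts)
    fn = sum(c[3] for c in counts)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the MCC is reported as 0 when the denominator vanishes.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical data: 1000 sites, three introduced SNVs, two replicates.
N_SITES = 1000
TRUTH = {100, 200, 300}
THRESHOLD = 0.002  # the 0.2% frequency threshold mentioned in the abstract

replicates = [
    {100: 0.004, 200: 0.003, 300: 0.0015, 450: 0.0025},  # one miss, one false call
    {100: 0.005, 200: 0.0021, 300: 0.0022, 700: 0.001},  # all three recovered
]

counts = [
    confusion_counts(call_variants(freqs, THRESHOLD), TRUTH, N_SITES)
    for freqs in replicates
]
score = micro_averaged_mcc(counts)
print(f"micro-averaged MCC: {score:.4f}")
```

Pooling the counts before scoring weights every site equally across replicates. Averaging one MCC per replicate (macro-averaging) would instead weight every replicate equally.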

Funding Information

Biotechnology and Biological Sciences Research Council (BB/E/I/00007035, BB/E/I/00007036, BBS/E/I/00007037, 1646570)
Defra (SE2944)
