SARS-CoV-2 sequence typing, evolution and signatures of selection using CoVa, a Python-based command-line utility
Preprint
- 10 June 2020
- Published by Cold Spring Harbor Laboratory
Abstract
The current global pandemic COVID-19, caused by SARS-CoV-2, has resulted in millions of infections worldwide in a few months. Global efforts to tackle this situation have produced a tremendous body of genomic data, which can be used for tracing transmission routes, characterizing isolates, and monitoring variants with potential for unusual virulence. Several groups have analyzed these genomes using different approaches. However, as new data become available, the research community needs a pipeline that performs a set of routine analyses and can quickly incorporate new genome sequences and update the analysis reports. We developed a programmatic tool, CoVa, with this objective. It is a fast, accurate and user-friendly utility for performing a variety of genome analyses on hundreds of SARS-CoV-2 sequences. Using CoVa, we define a modified sequence typing nomenclature and identify sites under positive selection. Further analysis identified some peptides and sites showing geographical patterns of selection. Specifically, we show differences in sequence type distribution between sequences from India and those from the rest of the world. We also show that several sites carry signatures of positive selection uniquely in sequences from India. Preliminary evolutionary analysis, using features that will be incorporated into CoVa in the near future, shows a mutation rate of 7.4 × 10⁻⁴ substitutions/site/year, confirms a temporal signal with a November 2019 origin of SARS-CoV-2, and indicates heterogeneity in the geographical distribution of Indian samples.
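To put the reported rate of 7.4 × 10⁻⁴ substitutions/site/year on a per-genome scale, the short sketch below multiplies it by a genome length. The length used (29,903 nt, the Wuhan-Hu-1 reference NC_045512.2) is an assumption that the abstract does not state, and the function name is hypothetical; this is an illustrative calculation, not part of CoVa.

```python
# Rate reported in the abstract (substitutions per site per year).
RATE_PER_SITE_PER_YEAR = 7.4e-4

# Assumption: length of the Wuhan-Hu-1 reference genome (NC_045512.2).
# The abstract does not say which length, if any, the authors used.
GENOME_LENGTH_NT = 29_903


def expected_substitutions(rate_per_site_per_year, genome_length, years=1.0):
    """Expected number of substitutions across the whole genome over `years`."""
    return rate_per_site_per_year * genome_length * years


per_year = expected_substitutions(RATE_PER_SITE_PER_YEAR, GENOME_LENGTH_NT)
per_month = expected_substitutions(
    RATE_PER_SITE_PER_YEAR, GENOME_LENGTH_NT, years=1 / 12
)

print(f"{per_year:.1f} substitutions/genome/year")    # 22.1
print(f"{per_month:.1f} substitutions/genome/month")  # 1.8
```

Under that assumed length, the rate corresponds to roughly 22 substitutions per genome per year, or about two per month.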