SARS-CoV-2 sequence typing, evolution and signatures of selection using CoVa, a Python-based command-line utility

Preprint
Published by Cold Spring Harbor Laboratory

https://doi.org/10.1101/2020.06.09.082834

Abstract

The ongoing global COVID-19 pandemic, caused by SARS-CoV-2, has resulted in millions of infections worldwide within a few months. Global efforts to tackle this situation have produced a tremendous body of genomic data, which can be used for tracing transmission routes, characterizing isolates, and monitoring variants with potential for unusual virulence. Several groups have analyzed these genomes using different approaches. However, as new data become available, the research community needs a pipeline that performs a set of routine analyses, quickly incorporates new genome sequences, and updates the analysis reports. We developed a programmatic tool, CoVa, with this objective. It is a fast, accurate, and user-friendly utility for performing a variety of genome analyses on hundreds of SARS-CoV-2 sequences. Using CoVa, we define a modified sequence typing nomenclature and identify sites under positive selection. Further analysis identified some peptides and sites showing geographical patterns of selection. Specifically, we show differences in sequence type distribution between sequences from India and those from the rest of the world. We also show that several sites show signatures of positive selection uniquely in sequences from India. Preliminary evolutionary analysis, using features that will be incorporated into CoVa in the near future, shows a mutation rate of 7.4 × 10⁻⁴ substitutions/site/year, confirms a temporal signal with a November 2019 origin of SARS-CoV-2, and reveals heterogeneity in the geographical distribution of Indian samples.
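To put the reported rate in per-genome terms, the arithmetic can be sketched as below. This is an illustrative back-of-the-envelope conversion, not part of CoVa: the genome length of 29,903 nt is an assumption taken from the Wuhan-Hu-1 reference (NC_045512.2), which the abstract does not state, and the function name is hypothetical.

```python
# Illustrative conversion of a per-site rate to a per-genome rate.
# ASSUMPTION: genome length from the Wuhan-Hu-1 reference (NC_045512.2);
# the abstract itself does not give a genome length.
GENOME_LENGTH_NT = 29_903

# Rate reported in the abstract (substitutions/site/year).
RATE_PER_SITE_PER_YEAR = 7.4e-4


def expected_substitutions(rate_per_site_per_year: float,
                           genome_length: int,
                           years: float = 1.0) -> float:
    """Expected number of substitutions per genome over `years`
    (hypothetical helper; simple rate x length x time product)."""
    return rate_per_site_per_year * genome_length * years


per_year = expected_substitutions(RATE_PER_SITE_PER_YEAR, GENOME_LENGTH_NT)
per_month = per_year / 12

print(f"{per_year:.1f} substitutions/genome/year")    # 22.1
print(f"{per_month:.2f} substitutions/genome/month")  # 1.84
```

Under these assumptions the rate corresponds to roughly 22 substitutions per genome per year, or a little under two per month.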

