Mapping genome variation of SARS-CoV-2 worldwide highlights the impact of COVID-19 super-spreaders
Open Access
- 2 September 2020
- journal article
- research article
- Published by Cold Spring Harbor Laboratory in Genome Research
- Vol. 30 (10), 1434-1448
- https://doi.org/10.1101/gr.266221.120
Abstract
The human pathogen severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is responsible for the major pandemic of the 21st century. We analyzed >4,700 SARS-CoV-2 genomes and associated metadata retrieved from public repositories. SARS-CoV-2 sequences have a high sequence identity (>99.9%), which drops to >96% when compared with the bat coronavirus genome. We built a mutation-annotated reference SARS-CoV-2 phylogeny with two main macro-haplogroups, A and B, both of Asian origin, and >160 sub-branches representing virus strains of variable geographical origins worldwide, revealing a rather uniform mutation occurrence along branches that could have implications for diagnostics and the design of future vaccines. Identification of the root of SARS-CoV-2 genomes is not without problems, owing to conflicting interpretations derived from either using the bat coronavirus genomes as an outgroup or relying on the sampling chronology of the SARS-CoV-2 genomes and TMRCA estimates; however, the overall scenario favors haplogroup A as the ancestral node. Phylogenetic analysis indicates a TMRCA for SARS-CoV-2 genomes dating to 12 November 2019, thus matching epidemiological records. Sub-haplogroup A2 most likely originated in Europe from an Asian ancestor and gave rise to sub-clade A2a, which represents the major non-Asian outbreak, especially in Africa and Europe. Multiple founder effect episodes, most likely associated with super-spreader hosts, might explain the COVID-19 pandemic to a large extent.
Funding Information
- Instituto de Salud Carlos III (ISCIII), co-financed by FEDER: PI16/01478, DTS19/00049, PI19/01039, PI16/01569, PI19/01090