Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study
Open Access
- 24 April 2020
- journal article
- research article
- Published by Public Library of Science (PLoS) in PLOS ONE
- Vol. 15 (4), e0232391
- https://doi.org/10.1371/journal.pone.0232391
Abstract
The 2019 novel coronavirus (renamed SARS-CoV-2, and generally referred to as the COVID-19 virus) has spread to 184 countries with over 1.5 million confirmed cases. Such major viral outbreaks demand early elucidation of taxonomic classification and origin of the virus genomic sequence, for strategic planning, containment, and treatment. This paper identifies an intrinsic COVID-19 virus genomic signature and uses it together with a machine learning-based alignment-free approach for an ultra-fast, scalable, and highly accurate classification of whole COVID-19 virus genomes. The proposed method combines supervised machine learning with digital signal processing (MLDSP) for genome analyses, augmented by a decision tree approach to the machine learning component, and a Spearman’s rank correlation coefficient analysis for result validation. These tools are used to analyze a large dataset of over 5000 unique viral genomic sequences, totalling 61.8 million bp, including the 29 COVID-19 virus sequences available on January 27, 2020. Our results support a hypothesis of a bat origin and classify the COVID-19 virus as Sarbecovirus, within Betacoronavirus. Our method achieves 100% accurate classification of the COVID-19 virus sequences, and discovers the most relevant relationships among over 5000 viral genomes within a few minutes, ab initio, using raw DNA sequence data alone, and without any specialized biological knowledge, training, gene or genome annotations. This suggests that, for novel viral and pathogen genome sequences, this alignment-free whole-genome machine-learning approach can provide a reliable real-time option for taxonomic classification.
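The abstract names the ingredients of the approach (alignment-free comparison, digital signal processing, supervised learning) without spelling out the pipeline. As a rough illustration only, the sketch below shows one way such a pipeline could look: each sequence is turned into a numerical signal, its discrete Fourier magnitude spectrum serves as the feature vector, and a query is assigned the label of the reference with the smallest Pearson-correlation distance. The integer nucleotide encoding, the zero-padding to a fixed length, the nearest-neighbour rule, and all function names here are assumptions made for this example, not details taken from the paper. The decision tree component and the Spearman's rank correlation validation are not covered.

```python
import numpy as np

# Hypothetical integer encoding, chosen only for illustration;
# the abstract does not say which numerical representation is used.
NUC_TO_NUM = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}


def magnitude_spectrum(seq: str, length: int) -> np.ndarray:
    """Encode a DNA string as numbers, pad or truncate to `length`, return |DFT|."""
    values = [NUC_TO_NUM.get(base, 0.0) for base in seq.upper()]
    signal = np.zeros(length)
    n = min(length, len(values))
    signal[:n] = values[:n]
    return np.abs(np.fft.fft(signal))


def pearson_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Map the Pearson correlation r in [-1, 1] to a distance in [0, 1]."""
    r = np.corrcoef(x, y)[0, 1]
    return float((1.0 - r) / 2.0)


def classify_nearest(query: str, labelled: list, length: int) -> str:
    """Return the label of the reference whose spectrum is closest to the query's."""
    q = magnitude_spectrum(query, length)
    distances = [
        (pearson_distance(q, magnitude_spectrum(seq, length)), label)
        for label, seq in labelled
    ]
    return min(distances)[1]


# Toy data: two synthetic references with different periodicity, and a query
# that differs from the first reference by a single substitution.
references = [("period-4", "ACGT" * 8), ("period-8", "AACCGGTT" * 4)]
query = "ACGTAAGT" + "ACGT" * 6
print(classify_nearest(query, references, length=32))
```

No alignment is computed at any point: sequences are compared only through their spectra, which is what makes this kind of approach cheap to run on thousands of genomes. Real genomes differ in length, so some length normalisation is needed before spectra can be compared; zero-padding is used here purely for simplicity.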
Funding Information
- Natural Sciences and Engineering Research Council of Canada (R2824A01)
- Natural Sciences and Engineering Research Council of Canada (R3511A12)