Mash-based analyses of Escherichia coli genomes reveal 14 distinct phylogroups
Open Access
- 26 January 2021
- journal article
- research article
- Published by Springer Science and Business Media LLC in Communications Biology
- Vol. 4 (1), 1-12
- https://doi.org/10.1038/s42003-020-01626-5
Abstract
In this study, more than one hundred thousand Escherichia coli and Shigella genomes were examined and classified. This is, to our knowledge, the largest E. coli genome dataset analyzed to date. A Mash-based analysis of a cleaned set of 10,667 E. coli genomes from GenBank revealed 14 distinct phylogroups. A representative genome or medoid identified for each phylogroup was used as a proxy to classify 95,525 unassembled genomes from the Sequence Read Archive (SRA). We find that most of the sequenced E. coli genomes belong to four phylogroups (A, C, B1 and E2(O157)). Authenticity of the 14 phylogroups is supported by several different lines of evidence: phylogroup-specific core genes, a phylogenetic tree constructed with 2613 single copy core genes, and differences in the rates of gene gain/loss/duplication. The methodology used in this work is able to reproduce known phylogroups, as well as to identify previously uncharacterized phylogroups in E. coli species.
Funding Information
- U.S. Department of Health & Human Services | NIH | National Institute of General Medical Sciences (1P20GM121293, P20 GM103429)
- Arkansas Research Alliance
- U.S. Department of Health & Human Services | NIH | National Institute of General Medical Sciences
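
The abstract describes choosing a medoid genome for each phylogroup and using it as a proxy to classify further genomes. The following is a minimal sketch of that idea, not the authors' pipeline: it assumes pairwise distances (for example, Mash distances) have already been computed, and every genome name, distance value, and function name below is hypothetical and chosen only for illustration.

```python
from typing import Dict, Sequence, Tuple


def symmetric(pairs: Dict[Tuple[str, str], float]) -> Dict[str, Dict[str, float]]:
    """Expand unordered pair distances into a symmetric lookup table."""
    dist: Dict[str, Dict[str, float]] = {}
    for (a, b), d in pairs.items():
        dist.setdefault(a, {a: 0.0})[b] = d
        dist.setdefault(b, {b: 0.0})[a] = d
    return dist


def find_medoid(members: Sequence[str], dist: Dict[str, Dict[str, float]]) -> str:
    """Return the member with the smallest total distance to its group."""
    return min(members, key=lambda g: sum(dist[g][h] for h in members))


def classify(query_dist: Dict[str, float], medoids: Dict[str, str]) -> str:
    """Assign a query genome to the phylogroup whose medoid is closest.

    query_dist maps each medoid genome name to the query's distance from it.
    """
    return min(medoids, key=lambda group: query_dist[medoids[group]])


# Toy within-group distances (made-up values, not taken from the paper).
dist = symmetric({
    ("g1", "g2"): 0.010, ("g1", "g3"): 0.012, ("g2", "g3"): 0.004,
    ("g4", "g5"): 0.006, ("g4", "g6"): 0.005, ("g5", "g6"): 0.011,
})

# Hypothetical group membership, e.g. obtained from a prior clustering step.
groups = {"A": ["g1", "g2", "g3"], "B1": ["g4", "g5", "g6"]}

# One representative (medoid) per phylogroup.
medoids = {name: find_medoid(members, dist) for name, members in groups.items()}

# A new genome only needs distances to the medoids, not to every genome.
query_to_medoids = {"g2": 0.031, "g4": 0.007}
assigned = classify(query_to_medoids, medoids)

print(medoids)
print(assigned)
```

The point of the design is cost: once the medoids are fixed, each additional genome needs one distance computation per phylogroup rather than one per reference genome, which is what makes classifying a very large read archive feasible in principle.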