A Standardized Framework for Better Understanding of Phenotypic Differences within Bacterial Phyla Based on Protein Domain

Abstract
We propose a standardized framework for classifying target species based on their protein domains, which can be applied to different groups of organisms, such as eukaryotes and prokaryotes. In this study, we applied the framework to the bacterial kingdom as an implementation example and compared the results with the current taxonomy standards at the phylum level. We conclude that the sequence of domains in a protein, rather than its domain content, and the presence of a domain, rather than its number of occurrences, play more important roles in determining bacterial phenotypes and in matching the current taxonomy. In addition, the comparison helps us to focus on the species that conflict with their current phylum assignment and to further investigate their phenotypic or genotypic differences.
IMPORTANCE We designed a 3-step framework that can be applied to clustering species based on their protein domains, and we propose different candidate models in each step so that the framework can be adapted to various scenarios. We show its implementation for the bacterial kingdom as an example, which helps us to find the model combination that best reflects the relationship between domains and phenotypes in this context. In addition, identifying species that are distant in the results but should be closely related phylogenetically helps us to focus on these mismatches and better understand their key phenotypic or genotypic differences.
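To make the two contrasts drawn above concrete (domain sequence versus domain content, and domain presence versus number of occurrences), the following is a minimal illustrative sketch and not the models used in the study. The species names, the domain labels, the helper functions, and the choice of Jaccard distance are all assumptions introduced only for illustration.

```python
from collections import Counter

# Hypothetical toy proteomes (made-up labels, not real data):
# each species is a list of proteins, and each protein is an
# ordered list of domain identifiers.
species = {
    "sp_A": [["D1", "D2"], ["D3"], ["D1", "D2"]],
    "sp_B": [["D2", "D1"], ["D3"]],
}


def domain_content_counts(proteome):
    """Content + occurrences: how many times each domain appears."""
    return Counter(d for protein in proteome for d in protein)


def domain_content_presence(proteome):
    """Content + presence: which domains appear at least once."""
    return set(domain_content_counts(proteome))


def domain_sequence_presence(proteome):
    """Sequence + presence: which ordered domain arrangements appear."""
    return {tuple(protein) for protein in proteome}


def jaccard_distance(a, b):
    """Jaccard distance between two sets (an assumed, generic choice)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


content_a = domain_content_presence(species["sp_A"])
content_b = domain_content_presence(species["sp_B"])
sequence_a = domain_sequence_presence(species["sp_A"])
sequence_b = domain_sequence_presence(species["sp_B"])

# Same domains in both species, so content-based presence cannot
# tell them apart, while the order of domains within proteins can.
content_distance = jaccard_distance(content_a, content_b)
sequence_distance = jaccard_distance(sequence_a, sequence_b)
print(content_distance, sequence_distance)
```

In this toy case the two species share exactly the same set of domains, so a content-based presence profile treats them as identical, whereas a sequence-based profile separates them because D1 and D2 occur in a different order. The count-based profile differs only in how often each domain appears, which is the kind of signal the abstract reports as less informative than presence.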
Funding Information
  • National Natural Science Foundation of China (32070025, 31800136, 82041019)
  • State Key Laboratory of Pathogen and Biosecurity (SKLPBS1807)