A systematic machine learning and data type comparison yields metagenomic predictors of infant age, sex, breastfeeding, antibiotic usage, country of origin, and delivery type
Open Access
- 1 May 2020
- journal article
- research article
- Published by Public Library of Science (PLoS) in PLoS Computational Biology
- Vol. 16 (5), e1007895
- https://doi.org/10.1371/journal.pcbi.1007895
Abstract
The microbiome is a new frontier for building predictors of human phenotypes. However, machine learning in the microbiome is fraught with issues of reproducibility, driven in large part by the wide range of analytic models and metagenomic data types available. We aimed to build robust metagenomic predictors of host phenotype by comparing prediction performances and biological interpretation across 8 machine learning methods and 4 different types of metagenomic data. Using 1,570 samples from 300 infants, we fit 7,865 models for 6 host phenotypes. We demonstrate the dependence of accuracy on algorithm choice and feature definition in microbiome data and propose a framework for building microbiome-derived indicators of host phenotype. We additionally identify biological features predictive of age, sex, breastfeeding status, historical antibiotic usage, country of origin, and delivery type. Our complete results can be viewed at http://apps.chiragjpgroup.org/ubiome_predictions/.
Author summary
The human microbiome is hypothesized to influence human phenotype. However, many published host-microbe associations may not be reproducible. Several factors could explain irreproducible results, including the wide array of methods for measuring the microbiome through genetic sequence, annotation pipelines, and analytical models and prediction approaches. Therefore, there is a need to compare different modeling strategies and microbiome data types (e.g., species abundance versus metabolic pathway abundance) to determine how to build robust and reproducible host-microbiome predictions. In this work, we executed a broad comparison of predictive methods as a function of microbiome data type to effectively predict host characteristics. Our pipeline was able to uncover robust microbial associations with phenotype. We additionally recommend considerations for developing reproducible microbiome-host association pipelines. We claim our work is a necessary stepping stone in increasing the utility of emerging cohort data and enabling the next generation of efficient microbiome association studies in human health.
Funding Information
- National Institute of Allergy and Infectious Diseases (R01AI127250)
- National Institute of Environmental Health Sciences (R00ES23504)
- National Institute of Diabetes and Digestive and Kidney Diseases (DK110919)
- National Science Foundation (1636870)
- ADA Foundation (1636870)
- Richard and Susan Smith Family Foundation