ADAGE-Based Integration of Publicly Available Pseudomonas aeruginosa Gene Expression Data with Denoising Autoencoders Illuminates Microbe-Host Interactions

Open Access

23 February 2016

journal article
Published by American Society for Microbiology in mSystems

Vol. 1 (1)
https://doi.org/10.1128/msystems.00025-15

Abstract

The increasing number of genome-wide assays of gene expression available from public databases presents opportunities for computational methods that facilitate hypothesis generation and biological interpretation of these data. We present an unsupervised machine learning approach, ADAGE (analysis using denoising autoencoders of gene expression), and apply it to the publicly available gene expression data compendium for Pseudomonas aeruginosa. In this approach, the machine-learned ADAGE model contained 50 nodes, which we predicted would correspond to gene expression patterns across the gene expression compendium. While no biological knowledge was used during model construction, co-operonic genes had similar weights across nodes, and genes with similar weights across nodes were significantly more likely to share KEGG pathways. By analyzing newly generated and previously published microarray and transcriptome sequencing data, the ADAGE model identified differences between strains, modeled the cellular response to low oxygen, and predicted the involvement of biological processes based on low-level gene expression differences. ADAGE compared favorably with traditional principal component analysis and independent component analysis approaches in its ability to extract validated patterns, and based on our analyses, we propose that these approaches differ in the types of patterns they preferentially identify. We provide the ADAGE model with analysis of all publicly available P. aeruginosa GeneChip experiments and open source code for use with other species and settings. Extraction of consistent patterns across large-scale collections of genomic data using methods like ADAGE provides the opportunity to identify general principles and biologically important patterns in microbial biology. This approach will be particularly useful in less-well-studied microbial species.
IMPORTANCE The quantity and breadth of genome-scale data sets that examine RNA expression in diverse bacterial and eukaryotic species are increasing more rapidly than curated knowledge. Our ADAGE method integrates such data without requiring gene function, gene pathway, or experiment labeling, which makes its application to any large gene expression compendium practical. We built a Pseudomonas aeruginosa ADAGE model from a diverse set of publicly available experiments without any prespecified biological knowledge, and this model was accurate and predictive. We provide ADAGE results for the complete P. aeruginosa GeneChip compendium for use by researchers studying P. aeruginosa and source code that facilitates ADAGE's application to other species and data types. Author Video: An author video summary of this article is available.
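
The idea of a 50-node denoising autoencoder over a gene expression compendium can be illustrated with a minimal NumPy sketch. This is not the authors' released code. The tied encoder/decoder weights, sigmoid activations, masking noise, cross-entropy loss, and all hyperparameters are assumptions made for illustration, and the helper make_demo_compendium and its synthetic data are hypothetical. Expression values are assumed to be scaled to the range 0 to 1, with samples as rows and genes as columns.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train_denoising_autoencoder(X, n_nodes=50, corruption=0.1, lr=0.1,
                                epochs=100, batch_size=10, seed=0):
    """Train a one-hidden-layer denoising autoencoder with tied weights.

    X is a (samples x genes) array scaled to [0, 1]. Returns the weight
    matrix W (genes x nodes) and the hidden and visible bias vectors.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_genes = X.shape
    W = rng.normal(0.0, 0.01, size=(n_genes, n_nodes))
    b_hidden = np.zeros(n_nodes)
    b_visible = np.zeros(n_genes)
    for _ in range(epochs):
        order = rng.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            batch = X[order[start:start + batch_size]]
            # Masking noise: set a random fraction of the inputs to zero.
            keep = rng.random(batch.shape) >= corruption
            noisy = batch * keep
            hidden = sigmoid(noisy @ W + b_hidden)
            recon = sigmoid(hidden @ W.T + b_visible)
            # Gradient of the mean cross-entropy between the reconstruction
            # and the uncorrupted batch, backpropagated through both layers.
            d_out = (recon - batch) / len(batch)
            d_hidden = (d_out @ W) * hidden * (1.0 - hidden)
            # Tied weights: W receives encoder and decoder contributions.
            W -= lr * (noisy.T @ d_hidden + d_out.T @ hidden)
            b_hidden -= lr * d_hidden.sum(axis=0)
            b_visible -= lr * d_out.sum(axis=0)
    return W, b_hidden, b_visible


def node_activities(X, W, b_hidden):
    """Activity of every node for every sample (samples x nodes)."""
    return sigmoid(X @ W + b_hidden)


def reconstruction_error(X, W, b_hidden, b_visible):
    """Mean squared error when reconstructing uncorrupted input."""
    recon = sigmoid(node_activities(X, W, b_hidden) @ W.T + b_visible)
    return float(np.mean((recon - X) ** 2))


def make_demo_compendium(n_samples=40, seed=1):
    """Hypothetical synthetic data: two opposing gene blocks plus a block
    of constitutively high genes, with a little noise."""
    rng = np.random.default_rng(seed)
    group = np.arange(n_samples) % 2 == 0
    X = np.full((n_samples, 30), 0.1)
    X[group, 0:10] = 0.9
    X[~group, 10:20] = 0.9
    X[:, 20:30] = 0.8
    return np.clip(X + rng.normal(0.0, 0.02, X.shape), 0.0, 1.0)


if __name__ == "__main__":
    X = make_demo_compendium()
    W, b_h, b_v = train_denoising_autoencoder(X, n_nodes=5)
    print("reconstruction MSE:", reconstruction_error(X, W, b_h, b_v))
    # Genes with similar weight vectors across nodes correlate highly.
    gene_similarity = np.corrcoef(W)
    print("similarity of gene 0 and gene 1:", gene_similarity[0, 1])
```

In a sketch like this, each row of W holds one gene's weights across the nodes, so comparing rows (here with np.corrcoef) is one simple way to ask whether two genes have similar weights across nodes, and node_activities gives a per-sample summary that could be compared between strains or conditions.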

Funding Information

William H. Neukom Institute for Computational Science
HHS | National Institutes of Health (AI091702)
HHS | National Institutes of Health (DK007301)
HHS | National Institutes of Health (CA023108)
HHS | National Institutes of Health (GM106394)
Gordon and Betty Moore Foundation (GBMF4552)
Cystic Fibrosis Foundation (STANTO07R0)
Cystic Fibrosis Foundation (STANTO15R0)

Cited by 113 articles