Predicting interpretability of metabolome models based on behavior, putative identity, and biological relevance of explanatory signals

3 October 2006

journal article
research article
Published by Proceedings of the National Academy of Sciences in Proceedings of the National Academy of Sciences of the United States of America

Vol. 103 (40), 14865-14870
https://doi.org/10.1073/pnas.0605152103

Abstract

Powerful algorithms are required to deal with the dimensionality of metabolomics data. Although many achieve high classification accuracy, the models they generate have limited value unless it can be demonstrated that they are reproducible and statistically relevant to the biological problem under investigation. Random forest (RF) generates models, without any requirement for dimensionality reduction or feature selection, in which individual variables are ranked for significance and displayed in an explicit manner. In metabolome fingerprinting by mass spectrometry, each metabolite can be represented by signals at several m/z. Exploiting a prior understanding of expected biochemical differences between sample classes, we aimed to develop meaningful metrics relevant to the significance both of the overall RF model and individual, potentially explanatory, signals. Pair-wise comparison of related plant genotypes with strong phenotypic differences demonstrated that robust models are not only reproducible but also logically structured, highlighting correlated m/z derived from just a small number of explanatory metabolites reflecting the biological differences between sample classes. RF models were also generated by using groupings of samples known to be increasingly phenotypically similar. Although classification accuracy was often reasonable, we demonstrated reproducibly in both Arabidopsis and potato a performance threshold based on margin statistics beyond which such models showed little structure indicative of either generalizability or further biological interpretability. In a multiclass problem using 25 Arabidopsis genotypes, despite the complicating effects of ecotype background and secondary metabolome perturbations common to several mutations, the ranking of metabolome signals by RF provided scope for deeper interpretability.
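
The abstract refers to margin statistics without stating how they were computed. As a minimal sketch, assuming the standard random forest margin (the fraction of trees voting for a sample's true class minus the largest fraction voting for any other class), the calculation could look like the following. The function names, class labels, and vote fractions are hypothetical illustrations and are not taken from the article.

```python
from statistics import mean


def rf_margin(vote_fractions, true_label):
    """Margin of one sample under the standard RF definition (assumed here):
    share of trees voting for the true class minus the largest share
    voting for any other class. Ranges from -1 to 1."""
    true_share = vote_fractions.get(true_label, 0.0)
    other_shares = [
        share for label, share in vote_fractions.items() if label != true_label
    ]
    return true_share - max(other_shares, default=0.0)


def mean_margin(samples):
    """Average margin over (vote_fractions, true_label) pairs."""
    return mean(rf_margin(votes, label) for votes, label in samples)


# Hypothetical vote fractions for three samples in a two-class comparison.
samples = [
    ({"wild_type": 0.875, "mutant": 0.125}, "wild_type"),  # confident, correct
    ({"wild_type": 0.25, "mutant": 0.75}, "mutant"),       # correct
    ({"wild_type": 0.625, "mutant": 0.375}, "mutant"),     # misclassified
]

margins = [rf_margin(votes, label) for votes, label in samples]
print(margins)               # [0.75, 0.5, -0.25]
print(mean_margin(samples))  # about 0.333
```

A positive margin means the forest voted for the correct class, and a value near 1 means it did so almost unanimously. Under this definition a model can reach reasonable accuracy while its margins stay close to zero, which is the kind of situation in which accuracy alone would overstate how much structure the model has captured.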
