Probabilistic principal component analysis for metabolomic data

Open Access

23 November 2010

journal article
Published by Springer Science and Business Media LLC in BMC Bioinformatics

Vol. 11 (1), 571
https://doi.org/10.1186/1471-2105-11-571

Abstract

Data from metabolomic studies are typically complex and high-dimensional. Principal component analysis (PCA) is currently the most widely used statistical technique for analyzing metabolomic data. However, PCA is limited by the fact that it is not based on a statistical model. Here, probabilistic principal component analysis (PPCA), which addresses some of the limitations of PCA, is reviewed and extended. A novel extension of PPCA, called probabilistic principal component and covariates analysis (PPCCA), is introduced; it provides a flexible approach to jointly modeling metabolomic data and additional covariate information. The use of a mixture of PPCA models for discovering the number of inherent groups in metabolomic data is demonstrated. The jackknife technique is employed throughout to construct confidence intervals for estimated model parameters. The optimal number of principal components is determined using the Bayesian Information Criterion model selection tool, which is modified to address the high dimensionality of the data. The methods presented are illustrated through an application to metabolomic data sets. Jointly modeling metabolomic data and covariates was successfully achieved and has the potential to provide deeper insight into the underlying data structure. Examination of confidence intervals for the model parameters, such as loadings, allows for principled and clear interpretation of the underlying data structure. A software package called MetabolAnalyze, freely available through the R statistical software, has been developed to facilitate implementation of the presented methods in the metabolomics field.
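
The abstract names three ingredients: a PPCA model, BIC-based selection of the number of components, and jackknife uncertainty for the loadings. The following is a minimal illustrative sketch of how these pieces can fit together. It rests on several assumptions that do not come from the abstract: it uses the standard PPCA model with isotropic noise and its closed-form maximum likelihood solution (Tipping and Bishop), the textbook BIC rather than the modified criterion the paper describes, and simulated data. The function names (`fit_ppca`, `bic`, `jackknife_loading_se`) are hypothetical and are not the MetabolAnalyze API.

```python
import numpy as np


def fit_ppca(X, q):
    """Closed-form ML fit of a standard PPCA model with q < d components.

    Assumed model (not taken from the abstract): x = W z + mu + noise,
    with z ~ N(0, I_q) and noise ~ N(0, sigma2 * I_d).
    """
    n, d = X.shape
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)       # ML covariance (divides by n)
    evals, evecs = np.linalg.eigh(S)             # ascending eigenvalues
    order = np.argsort(evals)[::-1]              # re-sort to descending
    evals = np.clip(evals[order], 0.0, None)     # guard tiny negative round-off
    evecs = evecs[:, order]
    sigma2 = evals[q:].mean()                    # average discarded variance
    scale = np.sqrt(np.maximum(evals[:q] - sigma2, 0.0))
    W = evecs[:, :q] * scale                     # loadings, up to rotation/sign
    # Maximized log-likelihood; at the ML solution tr(C^{-1} S) equals d.
    loglik = -0.5 * n * (d * np.log(2.0 * np.pi) + np.log(evals[:q]).sum()
                         + (d - q) * np.log(sigma2) + d)
    return mu, W, sigma2, loglik


def bic(loglik, n, d, q):
    """Textbook BIC (lower is better); NOT the paper's modified version."""
    k = d * q - q * (q - 1) // 2 + 1 + d         # loadings + sigma2 + mean
    return -2.0 * loglik + k * np.log(n)


def jackknife_loading_se(X, q):
    """Leave-one-out jackknife standard errors for the loadings."""
    n = X.shape[0]
    W_full = fit_ppca(X, q)[1]
    reps = np.empty((n,) + W_full.shape)
    for i in range(n):
        W_i = fit_ppca(np.delete(X, i, axis=0), q)[1]
        # Eigenvectors are defined only up to sign: align with the full fit.
        signs = np.sign((W_i * W_full).sum(axis=0))
        signs[signs == 0] = 1.0
        reps[i] = W_i * signs
    dev = reps - reps.mean(axis=0)
    return np.sqrt((n - 1) / n * (dev ** 2).sum(axis=0))


# Simulated stand-in for a spectral data matrix (n samples x d bins).
rng = np.random.default_rng(0)
n, d, q_true = 60, 8, 2
W_true = rng.normal(scale=3.0, size=(d, q_true))
X = rng.normal(size=(n, q_true)) @ W_true.T + rng.normal(scale=0.5, size=(n, d))

scores = {q: bic(fit_ppca(X, q)[3], n, d, q) for q in range(1, 5)}
best_q = min(scores, key=scores.get)
se = jackknife_loading_se(X, best_q)
print("selected q:", best_q, "| SE matrix shape:", se.shape)
```

Approximate confidence intervals for a loading could then be formed as the estimate plus or minus a multiple of its jackknife standard error. In this sketch, the sign alignment step matters because each refit may flip the sign of a component, which would otherwise inflate the standard errors.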

Cited by 123 articles