Mixture models with multiple levels, with application to the analysis of multifactor gene expression data

Open Access

5 February 2008

journal article
research article
Published by Oxford University Press (OUP) in Biostatistics

Vol. 9 (3), 540-554
https://doi.org/10.1093/biostatistics/kxm051

Abstract

Model-based clustering is a popular tool for summarizing high-dimensional data. With the number of high-throughput large-scale gene expression studies still on the rise, the need for effective data-summarizing tools has never been greater. By grouping genes according to a common experimental expression profile, we may gain new insight into the biological pathways that steer biological processes of interest. Clustering of gene profiles can also assist in assigning functions to genes that have not yet been functionally annotated. In this paper, we propose 2 model selection procedures for model-based clustering. Model selection in model-based clustering has to date focused on the identification of data dimensions that are relevant for clustering. However, in more complex data structures, with multiple experimental factors, such an approach does not provide easily interpreted clustering outcomes. We propose a mixture model with multiple levels that provides sparse representations both “within” and “between” cluster profiles. We explore various flexible “within-cluster” parameterizations and discuss how efficient parameterizations can greatly enhance the objective interpretability of the generated clusters. Moreover, we allow for a sparse “between-cluster” representation with a different number of clusters at different levels of an experimental factor of interest. This enhances interpretability of clusters generated in multiple-factor contexts. Interpretable cluster profiles can assist in detecting biologically relevant groups of genes that may be missed with less efficient parameterizations. We use our multilevel mixture model to mine a proliferating cell line expression data set for annotational context and regulatory motifs. We also investigate the performance of the multilevel clustering approach on several simulated data sets.
