Combining Mixture Components for Clustering

1 January 2010

journal article
Published by Taylor & Francis Ltd in Journal of Computational and Graphical Statistics

Vol. 19 (2), 332-353
https://doi.org/10.1198/jcgs.2010.08111

Abstract

Model-based clustering consists of fitting a mixture model to data and identifying each cluster with one of its components. Multivariate normal distributions are typically used. The number of clusters is usually determined from the data, often using BIC. In practice, however, individual clusters can be poorly fitted by Gaussian distributions, and in that case model-based clustering tends to represent one non-Gaussian cluster by a mixture of two or more Gaussian distributions. If the number of mixture components is interpreted as the number of clusters, this can lead to overestimation of the number of clusters. This is because BIC selects the number of mixture components needed to provide a good approximation to the density, rather than the number of clusters as such. We propose first selecting the total number of Gaussian mixture components, K, using BIC and then combining them hierarchically according to an entropy criterion. This yields a unique soft clustering for each number of clusters less than or equal to K. These clusterings can be compared on substantive grounds, and we also describe an automatic way of selecting the number of clusters via a piecewise linear regression fit to the rescaled entropy plot. We illustrate the method with simulated data and a flow cytometry dataset. Supplemental materials are available on the journal web site and described at the end of the article.
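
The hierarchical combining step described above can be illustrated with a minimal sketch. The abstract does not spell out the entropy criterion, so this sketch makes an assumption: at each step it merges the pair of components whose combination gives the lowest total classification entropy, where merging means summing the two components' posterior membership probabilities. The function names (`entropy`, `merge_columns`, `combine_hierarchically`) and the toy posterior matrix `tau` are hypothetical and are not taken from the article. The BIC-based choice of K and the mixture fit itself are assumed to have been done beforehand.

```python
import math

def entropy(tau):
    """Total classification entropy: -sum_i sum_k t_ik * log(t_ik)."""
    return -sum(t * math.log(t) for row in tau for t in row if t > 0.0)

def merge_columns(tau, a, b):
    """Return a new posterior matrix in which components a and b are summed.

    The merged component is appended as the last column.
    """
    merged = []
    for row in tau:
        new_row = [t for k, t in enumerate(row) if k not in (a, b)]
        new_row.append(row[a] + row[b])
        merged.append(new_row)
    return merged

def combine_hierarchically(tau):
    """Greedily merge component pairs, always picking the lowest-entropy result.

    Assumption (not from the abstract): the merge criterion is the total
    entropy of the soft clustering after merging.
    Returns a dict mapping each number of clusters (K down to 1) to a
    (posterior_matrix, entropy) pair.
    """
    solutions = {len(tau[0]): (tau, entropy(tau))}
    current = tau
    while len(current[0]) > 1:
        k = len(current[0])
        candidates = [
            merge_columns(current, a, b)
            for a in range(k)
            for b in range(a + 1, k)
        ]
        current = min(candidates, key=entropy)
        solutions[k - 1] = (current, entropy(current))
    return solutions

# Toy posterior probabilities for 5 observations and K = 3 components.
# Components 0 and 1 overlap heavily, as if they jointly represented
# one non-Gaussian cluster; component 2 is well separated.
tau = [
    [0.5, 0.5, 0.0],
    [0.6, 0.4, 0.0],
    [0.4, 0.6, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.1, 0.9],
]

solutions = combine_hierarchically(tau)
for n_clusters in sorted(solutions, reverse=True):
    print(n_clusters, round(solutions[n_clusters][1], 4))
```

In this toy case the first merge joins the two overlapping components, which removes most of the entropy. Plotting entropy against the number of clusters gives the kind of curve on which the abstract's piecewise linear fit would operate, although that fitting step is not sketched here.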

Cited by 203 articles