Nonparametric cluster significance testing with reference to a unimodal null distribution

1 December 2021

journal article
research article
Published by Oxford University Press (OUP) in Biometrics

Vol. 77 (4), 1215-1226
https://doi.org/10.1111/biom.13376

Abstract

Cluster analysis is an unsupervised learning strategy that is exceptionally useful for identifying homogeneous subgroups of observations in data sets of unknown structure. However, it is challenging to determine whether the identified clusters represent truly distinct subgroups rather than noise. Existing approaches for addressing this problem tend to define clusters based on distributional assumptions, ignore the inherent correlation structure in the data, or are not suited for high-dimension, low-sample-size (HDLSS) settings. In this paper, we propose a novel method to evaluate the significance of identified clusters by comparing the explained variation due to the clustering of the original data to that produced by clustering a unimodal reference distribution that preserves the covariance structure in the data. The reference distribution is generated using kernel density estimation and thus does not require that the data follow a particular distribution. By utilizing sparse covariance estimation, the method is adapted for the HDLSS setting. The approach can be used to test the null hypothesis that the data cannot be partitioned into clusters and to determine the optimal number of clusters. Simulation examples, theoretical evaluations, and applications to temporomandibular disorder research and cancer microarray data illustrate the utility of the proposed method.
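The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. For simplicity it draws the unimodal reference data from a multivariate normal with the sample mean and covariance, rather than from the kernel density estimate the abstract describes, and it uses the ordinary sample covariance rather than a sparse estimate. The function names (`kmeans_labels`, `explained_variation`, `cluster_significance`), the choice of k-means, and the definition of explained variation as 1 - WSS/TSS are all assumptions made for this example.

```python
import numpy as np


def kmeans_labels(X, k, rng, n_init=5, n_iter=50):
    """Plain Lloyd's k-means; returns the labels of the best of n_init restarts."""
    best_labels, best_wss = None, np.inf
    for _ in range(n_init):
        # Start from k distinct observations chosen at random.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            # Keep the old center if a cluster happens to be empty.
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        wss = d[np.arange(len(X)), labels].sum()
        if wss < best_wss:
            best_labels, best_wss = labels, wss
    return best_labels


def explained_variation(X, labels):
    """Fraction of the total sum of squares explained by the partition."""
    tss = ((X - X.mean(axis=0)) ** 2).sum()
    wss = sum(
        ((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
        for j in np.unique(labels)
    )
    return 1.0 - wss / tss


def cluster_significance(X, k=2, n_ref=100, seed=0):
    """Monte Carlo p-value for the k-cluster explained variation of X.

    Assumes X has shape (n, p) with p >= 2. The reference data are Gaussian
    (a simplification of the KDE-based unimodal reference) and share the
    sample mean and covariance of X.
    """
    rng = np.random.default_rng(seed)
    observed = explained_variation(X, kmeans_labels(X, k, rng))
    mean, cov = X.mean(axis=0), np.cov(X, rowvar=False)
    null = np.empty(n_ref)
    for b in range(n_ref):
        ref = rng.multivariate_normal(mean, cov, size=len(X))
        null[b] = explained_variation(ref, kmeans_labels(ref, k, rng))
    # Add-one correction keeps the p-value strictly positive.
    p_value = (1 + np.sum(null >= observed)) / (n_ref + 1)
    return observed, p_value
```

Calling `cluster_significance(X, k=2)` on an n-by-p array returns the observed explained variation and a Monte Carlo p-value. A small p-value suggests that the partition explains more variation than would be expected from unimodal data with the same covariance. Repeating the call over several values of `k` is one rough way to explore the number of clusters, though the paper's own selection procedure may differ.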

Other Versions

Version 2, 2016-10-05, preprints

Funding Information

National Institute of Dental and Craniofacial Research (R03DE023592)
National Institute of Environmental Health Sciences (P03ES010126)
National Center for Advancing Translational Sciences (UL1RR025747)
National Science Foundation (DGE‐1144081)

