Nonparametric cluster significance testing with reference to a unimodal null distribution
- 1 December 2021
- journal article
- research article
- Published by Oxford University Press (OUP) in Biometrics
- Vol. 77 (4), 1215-1226
- https://doi.org/10.1111/biom.13376
Abstract
Cluster analysis is an unsupervised learning strategy that is exceptionally useful for identifying homogeneous subgroups of observations in data sets of unknown structure. However, it is challenging to determine if the identified clusters represent truly distinct subgroups rather than noise. Existing approaches for addressing this problem tend to define clusters based on distributional assumptions, ignore the inherent correlation structure in the data, or are not suited for high-dimension low-sample size (HDLSS) settings. In this paper, we propose a novel method to evaluate the significance of identified clusters by comparing the explained variation due to the clustering from the original data to that produced by clustering a unimodal reference distribution that preserves the covariance structure in the data. The reference distribution is generated using kernel density estimation, and thus, does not require that the data follow a particular distribution. By utilizing sparse covariance estimation, the method is adapted for the HDLSS setting. The approach can be used to test the null hypothesis that the data cannot be partitioned into clusters and to determine the optimal number of clusters. Simulation examples, theoretical evaluations, and applications to temporomandibular disorder research and cancer microarray data illustrate the utility of the proposed method.
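The general idea in the abstract (compare the variation explained by clustering the observed data with the variation explained by clustering covariance-matched unimodal reference data) can be illustrated with a minimal Monte Carlo sketch. This is not the authors' procedure: it assumes a multivariate Gaussian as a stand-in for the kernel-density-based unimodal reference, uses plain k-means, and omits the sparse covariance estimation used for the HDLSS setting. The function names `kmeans_wss`, `explained_variation`, and `cluster_p_value` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def kmeans_wss(x, k, rng, n_init=5, n_iter=50):
    """Smallest within-cluster sum of squares over several k-means restarts."""
    best = np.inf
    for _ in range(n_init):
        # Start from k distinct observations chosen at random.
        centers = x[rng.choice(len(x), size=k, replace=False)]
        for _ in range(n_iter):
            d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # Keep the old center if a cluster becomes empty.
            new_centers = np.array([
                x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, d2.min(axis=1).sum())
    return best

def explained_variation(x, k, rng):
    """Fraction of total variation explained by a k-cluster partition."""
    total_ss = ((x - x.mean(axis=0)) ** 2).sum()
    return 1.0 - kmeans_wss(x, k, rng) / total_ss

def cluster_p_value(x, k=2, n_ref=100, seed=0):
    """Monte Carlo p-value for the null of no cluster structure.

    ASSUMPTION: the unimodal reference is a Gaussian with the sample mean
    and covariance of x, a simplification of the KDE-based reference
    described in the abstract.
    """
    rng = np.random.default_rng(seed)
    observed = explained_variation(x, k, rng)
    mean = x.mean(axis=0)
    cov = np.atleast_2d(np.cov(x, rowvar=False))
    reference = np.array([
        explained_variation(rng.multivariate_normal(mean, cov, size=len(x)), k, rng)
        for _ in range(n_ref)
    ])
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (1 + np.sum(reference >= observed)) / (n_ref + 1)
    return observed, p_value
```

A call such as `cluster_p_value(x, k=2)` on an `(n, p)` array returns the observed explained variation and the estimated p-value; a small p-value suggests the partition explains more variation than would be expected from unimodal data with the same covariance. Repeating the call over several values of `k` is one simple way to explore the number of clusters, although the paper's own selection rule may differ.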
Funding Information
- National Institute of Dental and Craniofacial Research (R03DE023592)
- National Institute of Environmental Health Sciences (P03ES010126)
- National Center for Advancing Translational Sciences (UL1RR025747)
- National Science Foundation (DGE‐1144081)