Estimating the Number of Clusters in a Data Set Via the Gap Statistic


1 July 2001

journal article
Published by Oxford University Press (OUP) in Journal of the Royal Statistical Society Series B: Statistical Methodology

Vol. 63 (2), 411-423
https://doi.org/10.1111/1467-9868.00293

Abstract

We propose a method (the ‘gap statistic’) for estimating the number of clusters (groups) in a set of data. The technique uses the output of any clustering algorithm (e.g. K‐means or hierarchical), comparing the change in within‐cluster dispersion with that expected under an appropriate reference null distribution. Some theory is developed for the proposal and a simulation study shows that the gap statistic usually outperforms other methods that have been proposed in the literature.
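The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' code, and the abstract itself does not spell out these details, so the following are assumptions: the clustering algorithm is a plain K-means written here with the standard library; the within-cluster dispersion W_k is the sum of squared distances to cluster means; the reference null distribution is uniform over the bounding box of the data; the gap is Gap(k) = mean_b log W*_kb - log W_k over B reference data sets; and the estimate is the smallest k with Gap(k) >= Gap(k+1) - s_{k+1}, where s_k = sd_k * sqrt(1 + 1/B). All function names (`kmeans`, `within_dispersion`, `log_dispersion`, `gap_statistic`, `estimate_k`) are hypothetical.

```python
import math
import random


def kmeans(points, k, rng, n_iter=50):
    """Plain Lloyd's K-means; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def within_dispersion(points, labels, k):
    """W_k: sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            continue
        mean = tuple(sum(c) / len(members) for c in zip(*members))
        total += sum(math.dist(p, mean) ** 2 for p in members)
    return total


def log_dispersion(points, k, rng, n_init=3):
    """log W_k, keeping the best of a few random K-means restarts."""
    best = min(within_dispersion(points, kmeans(points, k, rng), k)
               for _ in range(n_init))
    return math.log(best)


def gap_statistic(points, k_max, n_ref=5, seed=0):
    """Return (gaps, s) for k = 1..k_max; index 0 corresponds to k = 1."""
    rng = random.Random(seed)
    lows = [min(c) for c in zip(*points)]
    highs = [max(c) for c in zip(*points)]
    # Assumed null: uniform draws over the bounding box of the data.
    refs = [[tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
             for _ in points] for _ in range(n_ref)]
    gaps, s = [], []
    for k in range(1, k_max + 1):
        log_w = log_dispersion(points, k, rng)
        ref_logs = [log_dispersion(ref, k, rng) for ref in refs]
        mean_ref = sum(ref_logs) / n_ref
        sd = math.sqrt(sum((x - mean_ref) ** 2 for x in ref_logs) / n_ref)
        gaps.append(mean_ref - log_w)
        s.append(sd * math.sqrt(1 + 1 / n_ref))
    return gaps, s


def estimate_k(gaps, s):
    """Smallest k with Gap(k) >= Gap(k+1) - s_{k+1}; falls back to k_max."""
    for k in range(1, len(gaps)):
        if gaps[k - 1] >= gaps[k] - s[k]:
            return k
    return len(gaps)
```

To use the sketch, pass a list of equal-length numeric tuples to `gap_statistic(points, k_max)` and feed the two returned lists to `estimate_k`. The pure-Python K-means is only meant for small illustrative data sets; any other clustering routine that yields labels could be substituted in `log_dispersion`.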

Keywords

CLUSTERING
GROUPS
HIERARCHY
UNIFORM DISTRIBUTION

Cited by 3809 articles