Efficient disk-based K-means clustering for relational databases
- 2 August 2004
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Knowledge and Data Engineering
- Vol. 16 (8), 909-921
- https://doi.org/10.1109/tkde.2004.25
Abstract
K-means is one of the most popular clustering algorithms. We introduce an efficient disk-based implementation of K-means. The proposed algorithm is designed to work inside a relational database management system. It can cluster large data sets having very high dimensionality. In general, it only requires three scans over the data set. It is optimized to perform heavy disk I/O and its memory requirements are low. Its parameters are easy to set. An extensive experimental section evaluates quality of results and performance. The proposed algorithm is compared against the Standard K-means algorithm as well as the Scalable K-means algorithm.
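As an illustration of why a scan-oriented K-means can keep memory low, the following minimal sketch performs one pass over a stream of rows and keeps only per-cluster counts and coordinate sums. It is not the algorithm proposed in the paper, whose details the abstract does not give. It is a plain K-means update step written so that rows never need to be held in memory together. The function names `nearest` and `kmeans_scan` and the row format are hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple


def nearest(point: Sequence[float], centroids: List[List[float]]) -> int:
    """Return the index of the centroid closest to point (squared Euclidean)."""
    best, best_dist = 0, float("inf")
    for j, c in enumerate(centroids):
        dist = sum((p - q) ** 2 for p, q in zip(point, c))
        if dist < best_dist:
            best, best_dist = j, dist
    return best


def kmeans_scan(
    rows: Iterable[Sequence[float]], centroids: List[List[float]]
) -> Tuple[List[List[float]], List[int]]:
    """One sequential scan: assign each row, accumulate sums, recompute centroids.

    Memory use is O(k * d) regardless of how many rows are scanned, because
    only per-cluster counts and coordinate sums are kept.
    """
    k, d = len(centroids), len(centroids[0])
    counts = [0] * k
    sums = [[0.0] * d for _ in range(k)]
    for row in rows:  # rows may be a generator over a table cursor or a file
        j = nearest(row, centroids)
        counts[j] += 1
        for i, value in enumerate(row):
            sums[j][i] += value
    updated = []
    for j in range(k):
        if counts[j] == 0:
            updated.append(list(centroids[j]))  # keep an empty cluster's centroid
        else:
            updated.append([s / counts[j] for s in sums[j]])
    return updated, counts
```

Calling `kmeans_scan` repeatedly, each time over a fresh iterator on the same data, gives the standard iterative K-means. Reducing the number of such scans is the kind of optimization the abstract refers to.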