Efficient disk-based K-means clustering for relational databases

2 August 2004

journal article
Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Knowledge and Data Engineering

Vol. 16 (8), 909-921
https://doi.org/10.1109/tkde.2004.25

Abstract

K-means is one of the most popular clustering algorithms. We introduce an efficient disk-based implementation of K-means. The proposed algorithm is designed to work inside a relational database management system. It can cluster large data sets having very high dimensionality. In general, it only requires three scans over the data set. It is optimized to perform heavy disk I/O and its memory requirements are low. Its parameters are easy to set. An extensive experimental section evaluates quality of results and performance. The proposed algorithm is compared against the Standard K-means algorithm as well as the Scalable K-means algorithm.
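
The abstract does not spell out how the algorithm works, so the following is only a minimal sketch of the general idea behind scan-based K-means with low memory use, not the authors' method. It assumes the data arrives in chunks read sequentially (for example, blocks of rows fetched from a table) and keeps just a count and a coordinate sum per cluster. The names `nearest`, `scan_once`, `kmeans_scans`, and `make_chunks` are hypothetical, and the default of three scans merely echoes the figure quoted in the abstract; the paper's three scans may do different work than three plain K-means iterations.

```python
from typing import Callable, Iterable, List, Sequence

Point = Sequence[float]


def nearest(point: Point, centroids: List[List[float]]) -> int:
    """Return the index of the centroid closest to point (squared Euclidean)."""
    best_j, best_d = 0, float("inf")
    for j, c in enumerate(centroids):
        d = sum((x - y) ** 2 for x, y in zip(point, c))
        if d < best_d:
            best_j, best_d = j, d
    return best_j


def scan_once(chunks: Iterable[Iterable[Point]],
              centroids: List[List[float]]) -> List[List[float]]:
    """One sequential scan: accumulate per-cluster counts and sums,
    then recompute each centroid as sum / count."""
    k, dim = len(centroids), len(centroids[0])
    counts = [0] * k
    sums = [[0.0] * dim for _ in range(k)]
    for chunk in chunks:          # only one chunk is held in memory at a time
        for p in chunk:
            j = nearest(p, centroids)
            counts[j] += 1
            for i, x in enumerate(p):
                sums[j][i] += x
    # An empty cluster keeps its previous centroid (an assumption of this sketch).
    return [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
        for j in range(k)
    ]


def kmeans_scans(make_chunks: Callable[[], Iterable[Iterable[Point]]],
                 centroids: List[List[float]],
                 scans: int = 3) -> List[List[float]]:
    """Run a fixed number of scans; make_chunks() restarts the data stream."""
    for _ in range(scans):
        centroids = scan_once(make_chunks(), centroids)
    return centroids
```

Memory use in this sketch grows with the number of clusters times the dimensionality, plus one chunk, rather than with the number of rows, which is the property a disk-based implementation relies on.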

Cited by 63 articles