Fuzzy c-Means Algorithms for Very Large Data

25 May 2012

journal article
research article
Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Fuzzy Systems

Vol. 20 (6), 1130-1146
https://doi.org/10.1109/tfuzz.2012.2201485

Abstract

Very large (VL) data or big data are any data that you cannot load into your computer's working memory. This definition is not objective, but it is practical and easy to understand: for any computer you might use, there is a dataset too big to load, and that dataset is VL data for you. Clustering is one of the primary tasks used in the pattern recognition and data mining communities to search VL databases (including VL images) in various applications, so clustering algorithms that scale well to VL data are important and useful. This paper compares the efficacy of three different implementations of techniques aimed at extending fuzzy c-means (FCM) clustering to VL data. Specifically, we compare methods that are based on 1) sampling followed by noniterative extension; 2) incremental techniques that make one sequential pass through subsets of the data; and 3) kernelized versions of FCM that provide approximations based on sampling, including three proposed algorithms. We use both loadable and VL datasets to conduct numerical experiments that facilitate comparisons based on time and space complexity, speed, quality of approximations to batch FCM (for loadable data), and assessment of matches between partitions and ground truth. Empirical results show that random sampling plus extension FCM, bit-reduced FCM, and approximate kernel FCM are good choices to approximate FCM for VL data. We conclude by demonstrating the VL algorithms on a dataset with 5 billion objects and presenting a set of recommendations regarding the use of different VL FCM clustering schemes.
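
As an illustration of the first family named in the abstract (sampling followed by noniterative extension), the following is a minimal sketch, not the authors' implementation. It assumes Euclidean distance, fuzzifier m = 2, and a plain random sample; the function names (`fcm`, `sample_and_extend`, and the helpers) are hypothetical. Batch FCM runs only on the sample, and the resulting centers are then used once, without further iteration, to compute memberships for every object.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def memberships(points, centers, m=2.0, eps=1e-12):
    """Standard FCM membership update: u_ik is proportional to d_ik^(-2/(m-1))."""
    power = 1.0 / (m - 1.0)
    rows = []
    for p in points:
        # eps guards against division by zero when a point sits on a center.
        inv = [(1.0 / (sq_dist(p, c) + eps)) ** power for c in centers]
        total = sum(inv)
        rows.append([v / total for v in inv])
    return rows


def update_centers(points, u, m=2.0):
    """Standard FCM center update: mean of the points weighted by u_ik^m."""
    n_clusters, dim = len(u[0]), len(points[0])
    centers = []
    for i in range(n_clusters):
        w = [row[i] ** m for row in u]
        total = sum(w)
        centers.append(
            [sum(wk * p[j] for wk, p in zip(w, points)) / total for j in range(dim)]
        )
    return centers


def fcm(points, c, m=2.0, max_iter=100, tol=1e-6, rng=None):
    """Plain batch FCM (alternating optimization) on data that fits in memory."""
    rng = rng or random.Random(0)
    # Assumption: initialize centers with c distinct points drawn at random.
    centers = [list(p) for p in rng.sample(points, c)]
    for _ in range(max_iter):
        u = memberships(points, centers, m)
        new_centers = update_centers(points, u, m)
        shift = max(sq_dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # stop once no center moves appreciably
            break
    return centers


def sample_and_extend(points, c, sample_size, m=2.0, seed=0):
    """Cluster a random sample, then extend memberships to all points."""
    rng = random.Random(seed)
    sample = rng.sample(points, sample_size)
    centers = fcm(sample, c, m, rng=rng)
    # Noniterative extension: a single membership computation over the full data.
    return centers, memberships(points, centers, m)
```

For example, `centers, u = sample_and_extend(points, c=2, sample_size=40)` returns the sample-derived centers and one membership row per object in `points`. Here `points` is an in-memory list for simplicity; for data that cannot be loaded, the extension step would instead be applied chunk by chunk, since each membership row depends only on one object and the fixed centers.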

Cited by 373 articles