K-means Clustering in the Cloud -- A Mahout Test

1 March 2011

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

p. 514-519
https://doi.org/10.1109/waina.2011.136

Abstract

K-Means is a well-known clustering algorithm that has been successfully applied to a wide variety of problems. However, its application has usually been restricted to small datasets. Mahout provides a cloud computing implementation of K-Means that runs on Hadoop. Both Mahout and Hadoop are free and open source. Because they are inexpensive and scalable, these platforms are a promising technology for solving data-intensive problems that were previously impractical. In this work we studied the performance of Mahout on a large dataset. The tests ran on Amazon EC2 instances and allowed us to compare the runtime gain obtained when running on a multi-node cluster. This paper presents some results of ongoing research.

Cited by 45 articles