K-means Clustering in the Cloud -- A Mahout Test
- 1 March 2011
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- p. 514-519
- https://doi.org/10.1109/waina.2011.136
Abstract
K-Means is a well-known clustering algorithm that has been successfully applied to a wide variety of problems. However, its application has usually been restricted to small datasets. Mahout is a cloud computing approach to K-Means that runs on a Hadoop system. Both Mahout and Hadoop are free and open source. Because they are inexpensive and scalable, these platforms are a promising technology for solving data-intensive problems that were not trivial in the past. In this work we studied the performance of Mahout on a large dataset. The tests were run on Amazon EC2 instances and allowed us to compare the gain in runtime when running on a multi-node cluster. This paper presents some results of ongoing research.