Parallel K-Medoids clustering algorithm based on Hadoop

Abstract
The K-Medoids clustering algorithm overcomes the K-Means algorithm's sensitivity to outlier samples, but its time complexity makes it unable to process big data [1]. MapReduce is a parallel programming model for processing big data and has been implemented in Hadoop. To overcome this big-data limitation, a parallel K-Medoids algorithm based on Hadoop, HK-Medoids, is proposed. Each submitted job runs through iterative MapReduce procedures: in the map phase, each sample is assigned to the cluster whose center is most similar to it; in the combine phase, an intermediate center is calculated for each cluster; and in the reduce phase, the new center is calculated. The iteration stops when the new centers are sufficiently close to the old ones. Experimental results show that the HK-Medoids algorithm achieves good clustering quality and linear speedup on big data.
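As an illustration of the three phases described above, here is a minimal sketch in Java of one HK-Medoids iteration on Hadoop's MapReduce API. All class names, the configuration key used to ship the current medoids to mappers ("hkmedoids.centers"), and the concrete choice of intermediate center (the local sample closest to the local mean) are assumptions for illustration; the abstract only states that each phase computes a center, not how.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class HKMedoidsIteration {

    // Parse a comma-separated vector; used for both samples and medoids.
    static double[] parse(String line) {
        String[] parts = line.trim().split(",");
        double[] v = new double[parts.length];
        for (int i = 0; i < parts.length; i++) v[i] = Double.parseDouble(parts[i]);
        return v;
    }

    static String toCsv(double[] v) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < v.length; i++) sb.append(i > 0 ? "," : "").append(v[i]);
        return sb.toString();
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // Map phase: assign each sample to the nearest current medoid.
    public static class AssignMapper
            extends Mapper<LongWritable, Text, IntWritable, Text> {
        private double[][] medoids;

        @Override
        protected void setup(Context ctx) {
            // Assumption: medoids arrive as ';'-separated CSV vectors under a
            // hypothetical job-configuration key "hkmedoids.centers".
            String[] rows = ctx.getConfiguration().get("hkmedoids.centers").split(";");
            medoids = new double[rows.length][];
            for (int i = 0; i < rows.length; i++) medoids[i] = parse(rows[i]);
        }

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            double[] x = parse(value.toString());
            int best = 0;
            for (int k = 1; k < medoids.length; k++)
                if (dist(x, medoids[k]) < dist(x, medoids[best])) best = k;
            ctx.write(new IntWritable(best), value); // (clusterId, sample)
        }
    }

    // Combine phase: collapse each cluster's local samples into one
    // intermediate center -- here, the local sample nearest the local mean
    // (an assumed concrete choice; the abstract does not specify one).
    public static class LocalCenterCombiner
            extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            List<double[]> pts = new ArrayList<>();
            for (Text t : values) pts.add(parse(t.toString()));
            double[] mean = new double[pts.get(0).length];
            for (double[] p : pts)
                for (int i = 0; i < mean.length; i++) mean[i] += p[i] / pts.size();
            double[] best = pts.get(0);
            for (double[] p : pts) if (dist(p, mean) < dist(best, mean)) best = p;
            ctx.write(key, new Text(toCsv(best)));
        }
    }

    // Reduce phase: among a cluster's intermediate centers, pick the one
    // with the smallest total distance to the rest as the new medoid.
    public static class NewMedoidReducer
            extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            List<double[]> cands = new ArrayList<>();
            for (Text t : values) cands.add(parse(t.toString()));
            double[] best = cands.get(0);
            double bestCost = Double.MAX_VALUE;
            for (double[] c : cands) {
                double cost = 0;
                for (double[] o : cands) cost += dist(c, o);
                if (cost < bestCost) { bestCost = cost; best = c; }
            }
            ctx.write(key, new Text(toCsv(best))); // new medoid for this cluster
        }
    }
}
```

Under this reading, the combine phase is what makes the iteration scale: each map task forwards only one candidate center per cluster instead of all of its samples, so shuffle traffic grows with the number of tasks rather than the number of samples, which is consistent with the reported near-linear speedup.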
