Comparison of clustering techniques for measuring similarity in articles

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE) in 2017 3rd International Conference on Computational Intelligence & Communication Technology (CICT)

Abstract

Clustering groups objects so that the members of each cluster are similar to one another. This paper focuses on two clustering techniques, hierarchical clustering and k-means clustering, and compares several similarity measures to find the best one for each. The work starts by selecting different categories of textual content (articles). For each selected category, articles are collected from various news channels, and the search words most relevant to that category are identified. These words are then given as input to a program that creates a matrix of words. The matrix is processed in Matlab using the different similarity measures. The outcome is evaluated with the cophenetic correlation coefficient and the silhouette value to determine the best similarity measure. Five categories are analysed: "Business", "Education", "Election", "Entertainment" and "Game", with 28 news articles selected for each category from various news channels. The number of search words selected is 35, 49, 25, 30 and 35 for these categories, respectively. The work concludes that 'Cityblock' is the best measure for hierarchical clustering and 'Correlation' is the best for k-means clustering, with 'Cityblock' in second position for k-means.
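
The pipeline described in the abstract (search words, a matrix of words, a distance measure, then a quality score) can be illustrated with a minimal sketch. This is not the authors' Matlab program: it is a plain Python approximation, and the sample articles, the search words, the cluster labels and the helper names `word_matrix`, `cityblock`, `correlation` and `mean_silhouette` are all hypothetical. It assumes the matrix holds raw word counts, which the abstract does not state, and it covers only the silhouette value, not the cophenetic correlation coefficient.

```python
from math import sqrt

def word_matrix(articles, words):
    """One row per article, one column per search word (raw counts, an assumption)."""
    matrix = []
    for text in articles:
        tokens = text.lower().split()
        matrix.append([tokens.count(word) for word in words])
    return matrix

def cityblock(u, v):
    """Cityblock (Manhattan) distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def correlation(u, v):
    """Correlation distance: one minus the Pearson correlation.
    Raises ZeroDivisionError if either row is constant."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    num = sum(a * b for a, b in zip(du, dv))
    den = sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return 1.0 - num / den

def mean_silhouette(rows, labels, dist):
    """Average silhouette value of a labelling under the given distance."""
    scores = []
    for i, row in enumerate(rows):
        per_cluster = {}
        for j, other in enumerate(rows):
            if j != i:
                per_cluster.setdefault(labels[j], []).append(dist(row, other))
        own = per_cluster.pop(labels[i], None)
        if not own or not per_cluster:
            scores.append(0.0)  # convention for singleton or single-cluster cases
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(sum(d) / len(d) for d in per_cluster.values())  # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy data: two "Business"-like and two "Game"-like articles.
articles = [
    "market rallies as bank profit lifts market",
    "bank cuts rates and market falls",
    "team wins as late goal seals match",
    "goal after goal as team dominates",
]
words = ["market", "bank", "team", "goal"]
labels = [0, 0, 1, 1]  # assumed cluster assignment, e.g. from k-means

rows = word_matrix(articles, words)
for name, dist in (("cityblock", cityblock), ("correlation", correlation)):
    print(name, round(mean_silhouette(rows, labels, dist), 3))
```

A higher mean silhouette value indicates tighter, better separated clusters, so running the same labelling procedure under each distance and comparing the scores mirrors, in miniature, how the measures are ranked for k-means.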

Cited by 9 articles