Comparison of clustering techniques for measuring similarity in articles

Abstract
Clustering groups objects so that members of the same cluster are similar to one another. This paper compares two clustering techniques, hierarchical clustering and k-means clustering, across several similarity measures in order to identify the best-performing measure for each. The work begins by selecting categories of textual content and then collecting articles for each category from various news channels. For each category, the search words most relevant to it are identified and passed to a program that builds a word matrix. This matrix is then processed in Matlab using the different similarity measures. The measures are evaluated with the cophenetic correlation coefficient and the silhouette value. Five categories are analysed: “Business”, “Education”, “Election”, “Entertainment” and “Game”, with 28 news articles selected per category from various news channels. The numbers of search words used for these categories are 35, 49, 25, 30 and 35, respectively. The study concludes that “Cityblock” is the best measure for hierarchical clustering and “Correlation” is the best measure for k-means clustering, with “Cityblock” ranking second for k-means.
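
The pipeline described in the abstract (search words, word matrix, distance measure, silhouette value) can be illustrated with a minimal sketch. It is written in plain Python rather than the Matlab used in the paper, and it is not the authors' program: the function names, the four toy article texts, the search words and the cluster labels are all made up for illustration. The sketch assumes the word matrix holds raw counts of each search word per article, which the abstract does not state explicitly.

```python
import math
import re


def word_matrix(articles, search_words):
    """One row per article: how often each search word occurs (case-insensitive)."""
    rows = []
    for text in articles:
        tokens = re.findall(r"[a-z]+", text.lower())
        rows.append([tokens.count(word) for word in search_words])
    return rows


def cityblock(u, v):
    """Cityblock (Manhattan) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def correlation_distance(u, v):
    """One minus the Pearson correlation of the two rows.

    Undefined (division by zero) when either row is constant.
    """
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    num = sum(a * b for a, b in zip(du, dv))
    den = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return 1.0 - num / den


def mean_silhouette(rows, labels, dist):
    """Average silhouette value over all rows; needs at least two clusters."""
    scores = []
    for i, row in enumerate(rows):
        same = [dist(row, rows[j]) for j in range(len(rows))
                if j != i and labels[j] == labels[i]]
        if not same:  # a singleton cluster is scored 0 by convention
            scores.append(0.0)
            continue
        a = sum(same) / len(same)  # mean distance within the own cluster
        b = min(                   # mean distance to the nearest other cluster
            sum(dist(row, rows[j]) for j in range(len(rows)) if labels[j] == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy data, invented for this sketch (not taken from the study).
articles = [
    "Stock market rises as bank profit grows",
    "Bank shares fall, market reacts to bank news",
    "Team wins match after late goal",
    "Goal secures match for home team",
]
search_words = ["market", "bank", "team", "match", "goal"]
labels = [0, 0, 1, 1]  # hypothetical cluster assignment

matrix = word_matrix(articles, search_words)
print(matrix)
print("cityblock silhouette:", mean_silhouette(matrix, labels, cityblock))
print("correlation silhouette:", mean_silhouette(matrix, labels, correlation_distance))
```

Comparing the mean silhouette value obtained under each distance function, for the same data, is one simple way to rank similarity measures. The cophenetic correlation coefficient mentioned in the abstract would need a hierarchical linkage step, which this sketch leaves out.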
