Hybrid self-optimized clustering model based on citation links and textual features to detect research topics

Open Access

27 October 2017

journal article
research article
Published by Public Library of Science (PLoS) in PLOS ONE

Vol. 12 (10), e0187164
https://doi.org/10.1371/journal.pone.0187164

Abstract

Detecting research topics within a specific research field has attracted attention in the bibliometrics community. This study addresses two problems in clustering papers: the influence on similarity computation of differing distributions of citation links, and of the textual features involved. The authors propose a hybrid self-optimized clustering model that detects research topics by extending the hybrid clustering model used to identify “core documents”. First, the Amsler network, consisting of bibliographic coupling and co-citation links, is created, and the citation-based similarity of papers is calculated from the cosine of the angle between them. Second, cosine similarity is also used to compute the text-based similarity, which combines textual statistical and topological features. Then, the cosine of the linear combination of the citation- and text-based similarities is taken as the hybrid similarity. Finally, the Louvain method is applied to cluster papers, and clusters are labeled with terms selected by term frequency. To test the proposed model, a dataset from the data envelopment analysis field is used to compare and analyze clustering results. Using a benchmark built for this purpose, clustering methods with different citation links or textual features are compared according to evaluation measures. The results show that the proposed model obtains reasonable and effective clustering results, and the research topics of the data envelopment analysis field are also analyzed with the proposed model. Because the proposed model considers different features from previous hybrid clustering models, it may inspire further studies on topic identification.
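
The similarity steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each paper's Amsler links are approximated by the set of its cited references together with its citing papers, that the text features are plain term frequencies, and that the hybrid similarity combines the two angles linearly before taking the cosine. The weight `lam`, the function names, and the toy data are all hypothetical; the self-optimization of the weight and the Louvain clustering step are not shown.

```python
import math


def binary_cosine(a: set, b: set) -> float:
    """Cosine similarity of two binary link vectors stored as sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def tf_cosine(u: dict, v: dict) -> float:
    """Cosine similarity of two term-frequency dictionaries."""
    dot = sum(w * v.get(term, 0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def hybrid_similarity(s_cit: float, s_text: float, lam: float = 0.5) -> float:
    """Cosine of a linear combination of the two similarity angles.

    Assumption: `lam` weights the citation angle and (1 - lam) the text
    angle. The abstract does not state how the weight is chosen.
    """
    # Clamp to [0, 1] so that rounding error cannot break math.acos.
    angle_cit = math.acos(min(1.0, max(0.0, s_cit)))
    angle_text = math.acos(min(1.0, max(0.0, s_text)))
    return math.cos(lam * angle_cit + (1 - lam) * angle_text)


# Hypothetical toy data: link sets mix cited references (r*) and citing papers (c*).
links = {
    "p1": {"r1", "r2", "c1"},
    "p2": {"r1", "r2", "c2"},
    "p3": {"r9"},
}
terms = {
    "p1": {"dea": 2, "efficiency": 1},
    "p2": {"dea": 1, "efficiency": 2},
    "p3": {"network": 3},
}

for a, b in [("p1", "p2"), ("p1", "p3")]:
    s_cit = binary_cosine(links[a], links[b])
    s_text = tf_cosine(terms[a], terms[b])
    print(a, b, round(hybrid_similarity(s_cit, s_text), 4))
```

Under these assumptions the hybrid value for a pair always lies between its citation-based and text-based similarities, so a pair that is weakly linked by citations but textually close still receives a moderate score. The resulting pairwise values could serve as edge weights for a community detection method such as Louvain.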

Funding Information

National Natural Science Foundation of China (51375429)
National Natural Science Foundation of China (51475410)
Natural Science Foundation of Zhejiang Province (LY17E050010)
Natural Science Foundation of Zhejiang Province (LY17G010007)
Zhejiang Science & Technology Plan of China (2015C33024)

