Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches

Open Access

17 March 2011

journal article
research article
Published by Public Library of Science (PLoS) in PLOS ONE

Vol. 6 (3), e18029
https://doi.org/10.1371/journal.pone.0018029

Abstract

We investigate the accuracy of different similarity approaches for clustering over two million biomedical documents. Clustering large sets of text documents is important for a variety of information needs and applications, such as collection management and navigation, summarization, and analysis. The few comparisons of clustering results from different similarity approaches have focused on small literature sets and have given conflicting results. Our study was designed to seek a robust answer to the question of which similarity approach would generate the most coherent clusters of a biomedical literature set of over two million documents. We used a corpus of 2.15 million recent (2004-2008) records from MEDLINE and generated nine different document-document similarity matrices from information extracted from their bibliographic records, including titles, abstracts, and subject headings. The nine approaches combined five analytical techniques with two data sources. The five analytical techniques are cosine similarity using term frequency-inverse document frequency vectors (tf-idf cosine), latent semantic analysis (LSA), topic modeling, and two Poisson-based language models: BM25 and PMRA (PubMed Related Articles). The two data sources were (a) MeSH subject headings and (b) words from titles and abstracts. Each similarity matrix was filtered to keep the top-n highest similarities per document and then clustered using a combination of graph layout and average-link clustering. Cluster results from the nine similarity approaches were compared using (1) within-cluster textual coherence based on the Jensen-Shannon divergence and (2) two concentration measures based on grant-to-article linkages indexed in MEDLINE. PubMed's own related article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach using titles and abstracts. Approaches using only MeSH subject headings were not competitive with those based on titles and abstracts.
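
To make three of the steps named in the abstract concrete (tf-idf cosine similarity, top-n filtering of similarities per document, and the Jensen-Shannon divergence behind the coherence measure), the following is a minimal pure-Python sketch on a toy corpus. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the three example documents, the raw-count tf with log(N/df) idf weighting, and the base-2 logarithm are all assumptions, and the abstract does not specify how the divergence is aggregated into a cluster-level coherence score.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (term -> weight) per tokenized document.

    Assumed weighting: raw term count times log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append(
            {t: term_freq[t] * math.log(n_docs / doc_freq[t]) for t in term_freq}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_n_similar(vectors, n):
    """For each document, keep only its n most similar other documents."""
    result = []
    for i, u in enumerate(vectors):
        sims = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        # Highest similarity first; ties broken by document index.
        sims.sort(key=lambda pair: (-pair[0], pair[1]))
        result.append([(j, s) for s, j in sims[:n]])
    return result


def word_distribution(tokens):
    """Relative word frequencies of a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (base 2, range 0 to 1) of two distributions."""
    mix = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in set(p) | set(q)}

    def kl_to_mix(a):
        return sum(pa * math.log2(pa / mix[t]) for t, pa in a.items() if pa > 0)

    return 0.5 * kl_to_mix(p) + 0.5 * kl_to_mix(q)


# Toy corpus (hypothetical titles, already tokenized).
docs = [
    "gene expression in tumor cells".split(),
    "tumor gene expression profiling".split(),
    "hospital nursing staff survey".split(),
]
vectors = tfidf_vectors(docs)
neighbors = top_n_similar(vectors, n=1)

# Divergence between a document and a small "cluster" made of two documents.
cluster_dist = word_distribution(docs[0] + docs[1])
jsd_member = jensen_shannon_divergence(word_distribution(docs[0]), cluster_dist)
jsd_outsider = jensen_shannon_divergence(word_distribution(docs[2]), cluster_dist)
```

In this toy setting the two tumor-related documents select each other as nearest neighbors, and the unrelated document diverges more from the two-document cluster than a member does. At the scale described in the abstract, sparse matrix libraries would be needed in place of these dictionaries.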
