MeSHLabeler: improving the accuracy of large-scale MeSH indexing by integrating diverse evidence

Open Access

10 June 2015

journal article
research article
Published by Oxford University Press (OUP) in Bioinformatics

Vol. 31 (12), i339-i347
https://doi.org/10.1093/bioinformatics/btv237

Abstract

Motivation: Medical Subject Headings (MeSHs) are used by the National Library of Medicine (NLM) to index almost all citations in MEDLINE, which greatly facilitates biomedical information retrieval and text mining. To reduce the time and financial cost of manual annotation, NLM has developed a software package, Medical Text Indexer (MTI), to assist MeSH annotation; it uses k-nearest neighbors (KNN), pattern matching and indexing rules. Other types of information, such as predictions from separately trained MeSH classifiers, can also be used for automatic MeSH annotation. However, existing methods cannot effectively integrate multiple types of evidence for MeSH annotation.

Methods: We propose a novel framework, MeSHLabeler, which integrates multiple types of evidence for accurate MeSH annotation by using ‘learning to rank’. The evidence includes numerous predictions from MeSH classifiers, KNN, pattern matching, MTI and the correlation between different MeSH terms, among others. Each MeSH classifier is trained independently, so the prediction scores from different classifiers are not directly comparable. To address this issue, we have developed an effective score normalization procedure that improves prediction accuracy.

Results: MeSHLabeler won first place in Task 2A of the 2014 BioASQ challenge, achieving a Micro F-measure of 0.6248 on 9,040 citations provided by the challenge. This is about 9.15% higher, in relative terms, than the 0.5724 obtained by MTI.

Availability and implementation: The software is available upon request.

Contact: zhusf@fudan.edu.cn
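For readers unfamiliar with the metric reported in the Results, the following is a minimal sketch of a micro-averaged F-measure over per-citation label sets, together with the arithmetic behind the 9.15% relative improvement. It uses the standard definition of micro-averaged F1 and is not the BioASQ evaluation code; the function names and the toy MeSH assignments are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Set, Tuple


def micro_f_measure(pairs: Iterable[Tuple[Set[str], Set[str]]]) -> float:
    """Micro-averaged F1 over (gold, predicted) label sets, one pair per citation.

    Counts are pooled across all citations before precision and recall
    are computed, which is what distinguishes micro from macro averaging.
    """
    tp = fp = fn = 0
    for gold, predicted in pairs:
        tp += len(gold & predicted)   # correctly predicted headings
        fp += len(predicted - gold)   # predicted but not in the gold set
        fn += len(gold - predicted)   # gold headings that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline


# Hypothetical toy data: two citations with gold and predicted MeSH headings.
toy = [
    ({"Humans", "Algorithms"}, {"Humans", "Algorithms", "MEDLINE"}),
    ({"Humans", "Abstracting and Indexing"}, {"Humans"}),
]

print(micro_f_measure(toy))  # pooled tp=3, fp=1, fn=1 -> 0.75

# Scores quoted in the abstract: MeSHLabeler 0.6248 vs. MTI 0.5724.
print(round(100 * relative_improvement(0.6248, 0.5724), 2))  # 9.15
```

The absolute gap between the two scores is 0.0524; the 9.15% figure is that gap divided by the MTI baseline.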

