BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale

Open Access

1 April 2020

journal article
research article
Published by Public Library of Science (PLoS) in PLoS Computational Biology

Vol. 16 (4), e1007617
https://doi.org/10.1371/journal.pcbi.1007617

Abstract

A massive number of biological entities, such as genes and mutations, are mentioned in the biomedical literature. Capturing the semantic relatedness of these entities is vital to many biological applications, such as protein-protein interaction prediction and literature-based discovery. Concept embeddings, which are vector representations of concepts learned by machine learning models, have been employed to capture the semantics of concepts. To develop concept embeddings, named-entity recognition (NER) tools are first used to identify and normalize concepts in the literature, and then different machine learning models are used to train the embeddings. Despite multiple attempts, existing biomedical concept embeddings generally suffer from suboptimal NER tools, small-scale evaluation, and limited availability. In response, we employed high-performance machine learning-based NER tools for concept recognition and trained our concept embeddings, BioConceptVec, via four different machine learning models on 30 million PubMed abstracts. BioConceptVec covers over 400,000 biomedical concepts mentioned in the literature and is among the largest publicly available biomedical concept embeddings to date. To evaluate the validity and utility of BioConceptVec, we performed two intrinsic evaluations (identifying related concepts based on drug-gene and gene-gene interactions) and two extrinsic evaluations (protein-protein interaction prediction and drug-drug interaction extraction), collectively using over 25 million instances from nine independent datasets (17 million instances from six intrinsic evaluation tasks and 8 million instances from three extrinsic evaluation tasks). To the best of our knowledge, this is by far the most comprehensive evaluation to date. The intrinsic evaluation results demonstrate that BioConceptVec consistently outperforms existing concept embeddings, by a large margin, in identifying similar and related concepts. More importantly, the extrinsic evaluation results demonstrate that using BioConceptVec with advanced deep learning models can significantly improve performance in downstream bioinformatics studies and biomedical text-mining applications. Our BioConceptVec embeddings and benchmarking datasets are publicly available at https://github.com/ncbi-nlp/BioConceptVec.
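
As an illustration of how concept embeddings can be used to identify related concepts, the following is a minimal sketch that ranks concepts by cosine similarity to a query concept. The concept names and three-dimensional vectors are made-up toy values, and the helper functions `cosine_similarity` and `most_related` are hypothetical; none of this is taken from BioConceptVec or its evaluation code, and cosine similarity is assumed here only as a common choice of relatedness measure.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # Convention for zero vectors in this sketch.
    return dot / (norm_u * norm_v)


def most_related(query, embeddings, top_k=3):
    """Return the top_k (concept, similarity) pairs most similar to query."""
    query_vec = embeddings[query]
    scored = [
        (concept, cosine_similarity(query_vec, vec))
        for concept, vec in embeddings.items()
        if concept != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy data (assumption): invented identifiers and vectors, for illustration only.
TOY_EMBEDDINGS = {
    "Gene_A": [0.9, 0.1, 0.0],
    "Gene_B": [0.8, 0.2, 0.1],
    "Chemical_X": [0.1, 0.9, 0.2],
    "Disease_Y": [0.0, 0.2, 0.9],
}

ranked = most_related("Gene_A", TOY_EMBEDDINGS, top_k=2)
for concept, score in ranked:
    print(f"{concept}\t{score:.3f}")
```

With real embeddings, the dictionary would hold one high-dimensional vector per normalized concept identifier, and a ranking like this could be compared against known drug-gene or gene-gene interactions.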

Funding Information

the Intramural Research Program of the NIH, National Library of Medicine

Cited by 42 articles