BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale
Open Access
- 1 April 2020
- journal article
- research article
- Published by Public Library of Science (PLoS) in PLoS Computational Biology
- Vol. 16 (4), e1007617
- https://doi.org/10.1371/journal.pcbi.1007617
Abstract
A massive number of biological entities, such as genes and mutations, are mentioned in the biomedical literature. Capturing the semantic relatedness of biological entities is vital to many biological applications, such as protein-protein interaction prediction and literature-based discovery. Concept embeddings, which are vector representations of concepts learned by machine learning models, have been employed to capture the semantics of concepts. To develop concept embeddings, named-entity recognition (NER) tools are first used to identify and normalize concepts from the literature, and then different machine learning models are used to train the embeddings. Despite multiple attempts, existing biomedical concept embeddings generally suffer from suboptimal NER tools, small-scale evaluation, and limited availability. In response, we employed high-performance machine learning-based NER tools for concept recognition and trained our concept embeddings, BioConceptVec, via four different machine learning models on 30 million PubMed abstracts. BioConceptVec covers over 400,000 biomedical concepts mentioned in the literature and is one of the largest among the publicly available biomedical concept embeddings to date. To evaluate the validity and utility of BioConceptVec, we performed two intrinsic evaluations (identifying related concepts based on drug-gene and gene-gene interactions) and two extrinsic evaluations (protein-protein interaction prediction and drug-drug interaction extraction), collectively using over 25 million instances from nine independent datasets (17 million instances from six intrinsic evaluation tasks and 8 million instances from three extrinsic evaluation tasks). To the best of our knowledge, this is by far the most comprehensive evaluation of its kind. The intrinsic evaluation results demonstrate that BioConceptVec consistently outperforms existing concept embeddings, by a large margin, in identifying similar and related concepts.
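As a rough illustration of how related concepts can be identified from embeddings, the sketch below ranks concepts by cosine similarity to a query concept. It is a minimal example under stated assumptions, not the authors' evaluation code: the concept identifiers (`Gene_A`, `Chemical_X`, and so on), the three-dimensional vectors, and the helper names `cosine_similarity` and `rank_related` are all hypothetical and chosen only for readability.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_related(query_id, embeddings, top_k=3):
    """Return the top_k concepts most similar to query_id, excluding the query itself."""
    query_vec = embeddings[query_id]
    scored = [
        (concept_id, cosine_similarity(query_vec, vec))
        for concept_id, vec in embeddings.items()
        if concept_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors, invented for illustration only (real embeddings have far more dimensions).
toy_embeddings = {
    "Gene_A": [0.9, 0.1, 0.0],
    "Gene_B": [0.8, 0.2, 0.1],
    "Chemical_X": [0.1, 0.9, 0.2],
    "Disease_Y": [0.0, 0.2, 0.9],
}

ranking = rank_related("Gene_A", toy_embeddings, top_k=2)
for concept_id, score in ranking:
    print(f"{concept_id}\t{score:.3f}")
```

With these toy vectors, `Gene_B` ranks above `Chemical_X` for the query `Gene_A`, because its vector points in nearly the same direction. An intrinsic evaluation of the kind described above would compare such rankings against known interactions.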
More importantly, the extrinsic evaluation results demonstrate that using BioConceptVec with advanced deep learning models can significantly improve performance in downstream bioinformatics studies and biomedical text-mining applications. Our BioConceptVec embeddings and benchmarking datasets are publicly available at https://github.com/ncbi-nlp/BioConceptVec.
Funding Information
- the Intramural Research Program of the NIH, National Library of Medicine