Building a PubMed knowledge graph

Open Access

26 June 2020

journal article
research article
Published by Springer Science and Business Media LLC in Scientific Data

Vol. 7 (1), 1-15
https://doi.org/10.1038/s41597-020-0543-2

Abstract

PubMed® is an essential resource for the medical domain, but useful concepts are either difficult to extract or are ambiguous, which has significantly hindered knowledge discovery. To address this issue, we constructed a PubMed knowledge graph (PKG) by extracting bio-entities from 29 million PubMed abstracts, disambiguating author names, integrating funding data through the National Institutes of Health (NIH) ExPORTER, collecting affiliation history and educational background of authors from ORCID®, and identifying fine-grained affiliation data from MapAffil. Through the integration of these credible multi-source data, we could create connections among the bio-entities, authors, articles, affiliations, and funding. Data validation revealed that the BioBERT deep learning method of bio-entity extraction significantly outperformed the state-of-the-art models based on the F1 score (by 0.51%), with the author name disambiguation (AND) achieving an F1 score of 98.09%. PKG can trigger broader innovations, not only enabling us to measure scholarly impact, knowledge usage, and knowledge transfer, but also assisting us in profiling authors and organizations based on their connections with bio-entities.
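
As a rough illustration of the idea in the abstract, the sketch below links articles to authors, bio-entities, grants, and affiliations in a small undirected graph, then profiles an author by the bio-entities reachable through their articles. It is not the authors' pipeline: the record layout, the node-identifier prefixes, and the names `RECORDS`, `build_graph`, and `author_profile` are all assumptions made for this example, and the data is invented toy data.

```python
from collections import defaultdict

# Hypothetical toy records. Field names and identifiers are illustrative
# assumptions, not the PKG schema.
RECORDS = [
    {
        "pmid": "A1",
        "authors": ["author:1", "author:2"],
        "entities": ["entity:BRCA1"],
        "grants": ["grant:G1"],
    },
    {
        "pmid": "A2",
        "authors": ["author:1"],
        "entities": ["entity:BRCA1", "entity:TP53"],
        "affiliations": ["affil:U1"],
    },
]

LINK_FIELDS = ("authors", "entities", "grants", "affiliations")


def build_graph(records):
    """Return an undirected adjacency map with the article as the hub node."""
    graph = defaultdict(set)
    for rec in records:
        article = f"article:{rec['pmid']}"
        for field in LINK_FIELDS:
            for node in rec.get(field, []):
                graph[article].add(node)
                graph[node].add(article)
    return graph


def author_profile(graph, author):
    """Count bio-entities two hops from an author (author -> article -> entity)."""
    counts = defaultdict(int)
    for article in graph.get(author, ()):
        for node in graph.get(article, ()):
            if node.startswith("entity:"):
                counts[node] += 1
    return dict(counts)


graph = build_graph(RECORDS)
profile = author_profile(graph, "author:1")
```

Here `profile` maps each bio-entity to the number of the author's articles that mention it. Using the article as the hub keeps every other relation (author to entity, grant to affiliation, and so on) derivable by a two-hop traversal, which is one simple way to realize the "connections" the abstract describes.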
