NLM-Chem, a new resource for chemical entity recognition in PubMed full text literature
Open Access
- 25 March 2021
- journal article
- research article
- Published by Springer Science and Business Media LLC in Scientific Data
- Vol. 8 (1), 1-12
- https://doi.org/10.1038/s41597-021-00875-1
Abstract
Automatically identifying chemical and drug names in scientific publications advances information access for this important class of entities in a variety of biomedical disciplines by enabling improved retrieval and linkage to related concepts. While current methods for tagging chemical entities were developed for the article title and abstract, their performance in the full article text is substantially lower. However, the full text frequently contains more detailed chemical information, such as the properties of chemical compounds, their biological effects and interactions with diseases, genes and other chemicals. We therefore present the NLM-Chem corpus, a full-text resource to support the development and evaluation of automated chemical entity taggers. The NLM-Chem corpus consists of 150 full-text articles, doubly annotated by ten expert NLM indexers, with ~5000 unique chemical name annotations, mapped to ~2000 MeSH identifiers. We also describe a substantially improved chemical entity tagger, with automated annotations for all of PubMed and PMC freely accessible through the PubTator web-based interface and API. The NLM-Chem corpus is freely available.
Funding Information
- U.S. Department of Health & Human Services | NIH | U.S. National Library of Medicine (Intramural Research Program)
- U.S. Department of Health & Human Services | NIH | U.S. National Library of Medicine
- NIH Intramural Research Program, National Library of Medicine
This publication has 28 references indexed in Scilit:
- tmChem: a high performance approach for chemical named entity recognition and normalization. Journal of Cheminformatics, 2015
- An analysis on the entity annotations in biological corpora. F1000Research, 2014
- NCBI disease corpus: A resource for disease name recognition and concept normalization. Journal of Biomedical Informatics, 2014
- The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Research, 2012
- Concept annotation in the CRAFT corpus. BMC Bioinformatics, 2012
- Chemical Entity Recognition and Resolution to ChEBI. ISRN Bioinformatics, 2012
- Text mining for the biocuration workflow. Database: The Journal of Biological Databases and Curation, 2012
- Understanding PubMed(R) user search behavior through log analysis. Database: The Journal of Biological Databases and Curation, 2009
- Abbreviation definition identification based on automatic precision estimates. BMC Bioinformatics, 2008
- The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 2004