tmChem: a high performance approach for chemical named entity recognition and normalization

Open Access

19 January 2015

journal article
Published by Springer Science and Business Media LLC in Journal of Cheminformatics

Vol. 7 (S1), S3
https://doi.org/10.1186/1758-2946-7-s1-s3

Abstract

Chemical compounds and drugs are an important class of entities in biomedical research, with great potential in a wide range of applications, including clinical medicine. Locating chemical named entities in the literature is a useful step in chemical text mining pipelines for identifying chemical mentions, their properties, and their relationships as discussed in the literature. We introduce the tmChem system, a chemical named entity recognizer created by combining two independent machine learning models in an ensemble. We use the corpus released as part of the recent CHEMDNER task to develop and evaluate tmChem, achieving a micro-averaged F-measure of 0.8739 on the CEM subtask (mention-level evaluation) and 0.8745 on the CDI subtask (abstract-level evaluation). We also report a high-recall combination (recall of 0.9212 for CEM and 0.9224 for CDI). tmChem achieved the highest F-measure reported in the CHEMDNER task for the CEM subtask, and the high-recall variant achieved the highest recall on both the CEM and CDI subtasks. We report that tmChem is a state-of-the-art tool for chemical named entity recognition and that performance for chemical named entity recognition now matches (or exceeds) the performance previously reported for genes and diseases. Future research should focus on tighter integration between the named entity recognition and normalization steps for improved performance. The source code and a trained model for each of the two tmChem models are available at: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmChem. The results of running tmChem (Model 2) on PubMed are available in PubTator: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator
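
For readers unfamiliar with the metric, the following is a minimal sketch of how a micro-averaged F-measure is typically computed: true positives, false positives, and false negatives are pooled across all documents before precision and recall are calculated. The function name `micro_f_measure` and the per-document counts are hypothetical and purely illustrative; they are not taken from the CHEMDNER evaluation or from the tmChem code.

```python
from typing import Iterable, Tuple


def micro_f_measure(counts: Iterable[Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) from per-document (tp, fp, fn) counts.

    Micro-averaging pools the counts over all documents first, so documents
    with many mentions weigh more than documents with few.
    """
    tp = fp = fn = 0
    for doc_tp, doc_fp, doc_fn in counts:
        tp += doc_tp
        fp += doc_fp
        fn += doc_fn

    # Guard against empty denominators (no predictions or no gold mentions).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Illustrative counts for two made-up abstracts: (tp, fp, fn).
example_counts = [(8, 2, 1), (1, 0, 3)]
precision, recall, f_measure = micro_f_measure(example_counts)
print(f"P={precision:.4f} R={recall:.4f} F={f_measure:.4f}")
```

Under this scheme, a high-recall system variant such as the one mentioned in the abstract would generally trade additional false positives for fewer false negatives, raising recall at some cost in precision.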

Cited by 215 articles