Quantitative Assessment of Dictionary-based Protein Named Entity Tagging

Open Access

1 September 2006

journal article
Published by Oxford University Press (OUP) in Journal of the American Medical Informatics Association

Vol. 13 (5), 497-507
https://doi.org/10.1197/jamia.m2085

Abstract

Objective: Natural language processing (NLP) approaches have been explored to manage and mine information recorded in biological literature. A critical step for biological literature mining is biological named entity tagging (BNET), which identifies names mentioned in text and normalizes them with entries in biological databases. The aim of this study was to provide a quantitative assessment of the complexity of BNET on protein entities through BioThesaurus, a thesaurus of gene/protein names for UniProt knowledgebase (UniProtKB) entries that was acquired using online resources.

Methods: We evaluated the complexity from several perspectives: ambiguity (i.e., the number of genes/proteins represented by one name), synonymy (i.e., the number of names associated with the same gene/protein), and coverage (i.e., the percentage of gene/protein names in text included in the thesaurus). We also normalized the names in BioThesaurus and obtained each measure twice, once before normalization and once after.

Results: The current version of BioThesaurus has over 2.6 million names, or 2.1 million normalized names, covering more than 1.8 million UniProtKB entries. The average synonymy is 3.53 (2.86 after normalization), the average ambiguity is 2.31 before normalization and 2.32 after, and the coverage is 94.0% based on the BioCreAtIvE data set comprising MEDLINE abstracts containing genes/proteins.

Conclusion: The study indicated that names for genes/proteins are highly ambiguous and that there are usually multiple names for the same gene or protein. It also demonstrated that most gene/protein names appearing in text can be found in BioThesaurus.
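The three measures defined in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the name/entry pairs, the entry identifiers (`E1`, `E2`, `E3`), the mention list, and the `normalize` rule (lowercasing and dropping non-alphanumeric characters) are all assumptions made for illustration, and the paper may average or normalize differently.

```python
import re
from collections import defaultdict


def normalize(name):
    # Assumed normalization rule (not taken from the paper):
    # lowercase the name and drop every non-alphanumeric character.
    return re.sub(r"[^a-z0-9]", "", name.lower())


def build_indexes(pairs, norm=None):
    # Build name -> entries and entry -> names maps from (name, entry) pairs.
    name_to_entries = defaultdict(set)
    entry_to_names = defaultdict(set)
    for name, entry in pairs:
        key = norm(name) if norm else name
        name_to_entries[key].add(entry)
        entry_to_names[entry].add(key)
    return name_to_entries, entry_to_names


def average_ambiguity(name_to_entries):
    # Mean number of entries represented by one name.
    return sum(len(e) for e in name_to_entries.values()) / len(name_to_entries)


def average_synonymy(entry_to_names):
    # Mean number of distinct names associated with one entry.
    return sum(len(n) for n in entry_to_names.values()) / len(entry_to_names)


def coverage(mentions, name_to_entries, norm=None):
    # Fraction of names mentioned in text that appear in the thesaurus.
    found = sum(1 for m in mentions if (norm(m) if norm else m) in name_to_entries)
    return found / len(mentions)


# Hypothetical toy thesaurus and text mentions, for illustration only.
PAIRS = [
    ("CLIM1", "E1"),
    ("CLIM-1", "E1"),
    ("clim1", "E2"),
    ("LDB2", "E1"),
    ("p53", "E3"),
]
MENTIONS = ["CLIM1", "Clim-1", "p53", "BRCA1"]

raw_names, raw_entries = build_indexes(PAIRS)
norm_names, norm_entries = build_indexes(PAIRS, norm=normalize)

print("ambiguity:", average_ambiguity(raw_names), average_ambiguity(norm_names))
print("synonymy: ", average_synonymy(raw_entries), average_synonymy(norm_entries))
print("coverage: ", coverage(MENTIONS, raw_names),
      coverage(MENTIONS, norm_names, norm=normalize))
```

In this toy data, normalization merges spelling variants of the same name. That lowers synonymy (fewer distinct names per entry) and raises ambiguity and coverage (a merged name points to more entries and matches more mentions). The synonymy drop matches the direction reported in the abstract, though the magnitudes here carry no meaning.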


Cited by 28 articles