UMLS Concept Indexing for Production Databases: A Feasibility Study

Open Access

Abstract

Objectives: To explore the feasibility of using the National Library of Medicine's Unified Medical Language System (UMLS) Metathesaurus as the basis for a computational strategy to identify concepts in medical narrative text preparatory to indexing. To quantitatively evaluate this strategy in terms of true positives, false positives (spuriously identified concepts) and false negatives (concepts missed by the identification process).

Methods: Using the 1999 UMLS Metathesaurus, the authors processed a training set of 100 documents (50 discharge summaries, 50 surgical notes) with a concept-identification program, whose output was manually analyzed. They flagged concepts that were erroneously identified and added new concepts that were not identified by the program, recording the reason for failure in such cases. After several refinements to both their algorithm and the UMLS subset on which it operated, they deployed the program on a test set of 24 documents (12 of each kind).

Results: Of 8,745 matches in the training set, 7,227 (82.6 percent) were true positives, whereas of 1,701 matches in the test set, 1,298 (76.3 percent) were true positives. Matches other than true positive indicated potential problems in production-mode concept indexing. Examples of causes of problems were redundant concepts in the UMLS, homonyms, acronyms, abbreviations and elisions, concepts that were missing from the UMLS, proper names, and spelling errors.

Conclusions: The error rate was too high for concept indexing to be the only production-mode means of preprocessing medical narrative. Considerable curation needs to be performed to define a UMLS subset that is suitable for concept matching.
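
The percentages reported in the Results can be reproduced from the match counts given in the abstract. The following is a minimal sketch of that arithmetic only; the function name `true_positive_share` is hypothetical and is not taken from the authors' program. Because the abstract does not break the remaining matches down into false positives and false negatives, the sketch reports them only as "other" matches.

```python
def true_positive_share(true_positives: int, total_matches: int) -> float:
    """Return true positives as a percentage of all matches, rounded to one decimal.

    Hypothetical helper for illustration; it is not part of the study's software.
    """
    if total_matches <= 0:
        raise ValueError("total_matches must be positive")
    if not 0 <= true_positives <= total_matches:
        raise ValueError("true_positives must lie between 0 and total_matches")
    return round(100.0 * true_positives / total_matches, 1)


# Counts as reported in the abstract: (true positives, total matches).
reported = {
    "training": (7227, 8745),
    "test": (1298, 1701),
}

shares = {}
others = {}
for name, (tp, total) in reported.items():
    shares[name] = true_positive_share(tp, total)
    # Matches other than true positives; the abstract does not split these further.
    others[name] = total - tp
    print(f"{name}: {shares[name]} percent true positives, {others[name]} other matches")
```

Both computed shares agree with the figures stated in the abstract, and the drop between the training set and the test set is about six percentage points.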

