Unified Medical Language System term occurrences in clinical notes: a large-scale corpus analysis

Open Access

1 June 2012

journal article
Published by Oxford University Press (OUP) in Journal of the American Medical Informatics Association

Vol. 19 (e1), e149-e156
https://doi.org/10.1136/amiajnl-2011-000744

Abstract

Objective: To characterise empirical instances of Unified Medical Language System (UMLS) Metathesaurus term strings in a large clinical corpus, and to illustrate what types of term characteristics are generalisable across data sources.

Design: Based on the occurrences of UMLS terms in a 51 million document corpus of Mayo Clinic clinical notes, this study computes statistics about the terms' string attributes, source terminologies, semantic types and syntactic categories. Term occurrences in 2010 i2b2/VA text were also mapped; eight example filters were designed from the Mayo-based statistics and applied to i2b2/VA data.

Results: For the corpus analysis, negligible numbers of mapped terms in the Mayo corpus had over six words or 55 characters. Of source terminologies in the UMLS, the Consumer Health Vocabulary and Systematized Nomenclature of Medicine—Clinical Terms (SNOMED-CT) had the best coverage in Mayo clinical notes at 106 426 and 94 788 unique terms, respectively. Of 15 semantic groups in the UMLS, seven groups accounted for 92.08% of term occurrences in Mayo data. Syntactically, over 90% of matched terms were in noun phrases. For the cross-institutional analysis, using five example filters on i2b2/VA data reduces the actual lexicon to 19.13% of the size of the UMLS and only sees a 2% reduction in matched terms.

Conclusion: The corpus statistics presented here are instructive for building lexicons from the UMLS. Features intrinsic to Metathesaurus terms (well formedness, length and language) generalise easily across clinical institutions, but term frequencies should be adapted with caution. The semantic groups of mapped terms may differ slightly from institution to institution, but they differ greatly when moving to the biomedical literature domain.
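
The length finding in the abstract (negligible numbers of mapped terms over six words or 55 characters) suggests how a length-based lexicon filter might look. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the whitespace tokenisation, and the example strings are all hypothetical, and only the two thresholds come from the abstract.

```python
# Thresholds taken from the abstract; everything else here is illustrative.
MAX_WORDS = 6
MAX_CHARS = 55


def within_length_limits(term, max_words=MAX_WORDS, max_chars=MAX_CHARS):
    """Return True if a term string has at most max_words whitespace-separated
    words and at most max_chars characters."""
    return len(term.split()) <= max_words and len(term) <= max_chars


def filter_lexicon(terms, max_words=MAX_WORDS, max_chars=MAX_CHARS):
    """Keep only the term strings that satisfy both length limits."""
    return [t for t in terms if within_length_limits(t, max_words, max_chars)]


# Made-up example strings (not drawn from the Mayo or i2b2/VA data).
example_terms = [
    "myocardial infarction",
    "pain",
    "structure of anterior descending branch of left coronary artery",
]

kept = filter_lexicon(example_terms)
print(kept)
```

Splitting on whitespace is the simplest way to count words; a real pipeline would likely use the same tokeniser as its dictionary lookup so that the word limit matches how terms are actually matched in text.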
