Using machine learning for concept extraction on clinical documents from multiple data sources

Open Access

1 September 2011

journal article
Published by Oxford University Press (OUP) in Journal of the American Medical Informatics Association

Vol. 18 (5), 580-587
https://doi.org/10.1136/amiajnl-2011-000155

Abstract

Objective: Concept extraction is the process of identifying phrases that refer to concepts of interest in unstructured text. It is a critical component in automated text processing. We investigate the performance of machine learning taggers for clinical concept extraction, particularly the portability of taggers across documents from multiple data sources.

Methods: We trained machine learning taggers using BioTagger-GM, a system we originally developed for the detection of gene/protein names in the biology domain. Trained taggers were evaluated using the annotated clinical documents made available in the 2010 i2b2/VA Challenge workshop, consisting of documents from four data sources.

Results: As expected, the performance of a tagger trained on one data source degraded when evaluated on another source, but the extent of the degradation varied depending on the data sources. A tagger trained on multiple data sources was robust, and it achieved an F score as high as 0.890 on one data source. The results also suggest that the performance of machine learning taggers is likely to improve if more annotated documents are available for training.

Conclusion: Our study shows how the performance of machine learning taggers degrades when they are ported across clinical documents from different sources. The portability of taggers can be enhanced by training on datasets from multiple sources. The study also shows that BioTagger-GM can be easily extended to detect clinical concept mentions with good performance.
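For readers unfamiliar with the F score reported above, the following is a minimal sketch of how precision, recall, and F score are commonly computed for concept extraction. It assumes exact-match scoring, where a predicted concept counts as correct only if its document, boundaries, and type all agree with a gold annotation; the abstract does not state which matching criterion was used. The `Span` layout, the `f_score` helper, and the toy annotations are hypothetical and are not taken from the study or from BioTagger-GM.

```python
from typing import Set, Tuple

# Hypothetical span representation: (doc_id, start_token, end_token, concept_type)
Span = Tuple[str, int, int, str]


def f_score(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F) under exact-match scoring (an assumption)."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # F score: harmonic mean of precision and recall
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy annotations, invented for illustration only
gold = {
    ("doc1", 0, 2, "problem"),
    ("doc1", 5, 6, "treatment"),
    ("doc2", 3, 4, "test"),
}
predicted = {
    ("doc1", 0, 2, "problem"),
    ("doc1", 5, 7, "treatment"),  # boundary mismatch, so not an exact match
    ("doc2", 3, 4, "test"),
    ("doc2", 8, 9, "problem"),    # spurious prediction
}

precision, recall, f = f_score(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} F={f:.3f}")
```

Under this sketch, a portability experiment of the kind described would train a tagger on documents from one source, tag documents from another, and compare the resulting F score with the within-source F score.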

This publication has 32 references indexed in Scilit.

Cited by 104 articles