Identifying Smokers with a Medical Extraction System

Open Access

1 January 2008

journal article
Published by Oxford University Press (OUP) in Journal of the American Medical Informatics Association

Vol. 15 (1), 36-39
https://doi.org/10.1197/jamia.m2442

Abstract

The Clinical Language Understanding group at Nuance Communications has developed a medical information extraction system that combines a rule-based extraction engine with machine learning algorithms to identify and categorize references to patient smoking in clinical reports. The extraction engine identifies smoking references; documents that contain no smoking references are classified as UNKNOWN. For the remaining documents, the extraction engine uses linguistic analysis to associate features such as status and time with smoking mentions. Machine learning is then used to classify the documents based on these features. This approach achieves overall accuracy in the 90% range on all data sets used. Classification using both engine-generated and word-based features outperforms classification using only word-based features on all data sets, although the difference narrows as the data set size increases. These techniques could be applied to identify other risk factors, such as drug and alcohol use, or a family history of a disease.
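The two-stage flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' system: the regular expressions, the feature names (`negated`, `past`), and the labels other than UNKNOWN are hypothetical, and the final decision rules are a hand-written placeholder standing in for the trained machine learning classifier, whose details the abstract does not give.

```python
import re

# Hypothetical cue patterns; the real engine's rules are not given in the abstract.
SMOKING = re.compile(r"\b(smok\w*|tobacco|cigarettes?)\b", re.IGNORECASE)
NEGATION = re.compile(r"\b(no|not|never|denies|non)\b", re.IGNORECASE)
PAST = re.compile(r"\b(quit|former|stopped)\b", re.IGNORECASE)


def extract_features(text):
    """Return one feature dict per sentence that contains a smoking mention."""
    features = []
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if SMOKING.search(sentence):
            features.append({
                "negated": bool(NEGATION.search(sentence)),  # status cue
                "past": bool(PAST.search(sentence)),         # time cue
            })
    return features


def classify(text):
    """Assign a document-level label from the mention features."""
    features = extract_features(text)
    # Stage 1: documents with no smoking reference are UNKNOWN.
    if not features:
        return "UNKNOWN"
    # Stage 2: placeholder rules standing in for a trained classifier.
    # The label names below are assumptions, not taken from the abstract.
    if any(f["past"] for f in features):
        return "PAST SMOKER"
    if all(f["negated"] for f in features):
        return "NON-SMOKER"
    return "CURRENT SMOKER"
```

For example, `classify("Patient has a history of hypertension.")` takes the stage 1 shortcut because the text contains no smoking mention, while a sentence such as "She quit smoking ten years ago." reaches stage 2 with the `past` feature set. In a real system the stage 2 rules would be replaced by a classifier trained on these features, optionally combined with word-based features as the abstract describes.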
