Rapidly Retargetable Approaches to De-identification in Medical Records

Open Access

1 September 2007

journal article
Published by Oxford University Press (OUP) in Journal of the American Medical Informatics Association

Vol. 14 (5), 564-573
https://doi.org/10.1197/jamia.m2435

Abstract

Objective: This paper describes a successful approach to de-identification that was developed for participation in a recent AMIA-sponsored challenge evaluation. Method: Our approach focused on the rapid adaptation of two existing toolkits for named entity recognition, Carafe and LingPipe. Results: The “out of the box” Carafe system achieved a phrase F-measure of 0.9664 with only four hours of work to adapt it to the de-identification task. With further tuning, through task-specific feature engineering and the introduction of a lexicon, we reduced the token-level error term by over 36% and achieved a phrase F-measure of 0.9736. Conclusions: We achieved good performance on the de-identification task by rapidly retargeting existing toolkits. For the Carafe system, we developed a method for tuning the balance of recall vs. precision, as well as a confidence score that correlated well with the measured F-score.
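
For readers unfamiliar with the metrics quoted in the abstract, the following is a minimal sketch of how a phrase-level F-measure and a relative error reduction are commonly computed. It is not the scoring code used in the challenge. The function names, the `(start, end, type)` span representation, the exact-match criterion, and the treatment of `1 - F` as the "error" are all assumptions made for illustration.

```python
def phrase_f_measure(gold, predicted, beta=1.0):
    """Return (precision, recall, F) over exact-match phrase spans.

    Each phrase is a hashable tuple such as (start, end, type); this
    representation is an assumption, not the challenge's own format.
    beta > 1 weights recall more heavily, beta < 1 favors precision.
    """
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)  # spans matching on all fields
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


def relative_error_reduction(score_before, score_after):
    """Fraction of the error (1 - score) removed between two runs."""
    error_before = 1.0 - score_before
    error_after = 1.0 - score_after
    return (error_before - error_after) / error_before


# Hypothetical toy data: three gold phrases, four predicted phrases.
gold = [(0, 2, "PATIENT"), (5, 6, "DATE"), (9, 11, "DOCTOR")]
pred = [(0, 2, "PATIENT"), (5, 6, "DATE"), (9, 10, "DOCTOR"), (12, 13, "HOSPITAL")]
precision, recall, f = phrase_f_measure(gold, pred)
```

Applied to the two phrase F-measures in the abstract (0.9664 and 0.9736), `relative_error_reduction` gives roughly 21%. The 36% figure in the abstract refers to the token-level error term, which is a different quantity and cannot be recovered from the phrase-level scores alone.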
