Rapidly Retargetable Approaches to De-identification in Medical Records
Open Access
- 1 September 2007
- journal article
- Published by Oxford University Press (OUP) in Journal of the American Medical Informatics Association
- Vol. 14 (5), 564-573
- https://doi.org/10.1197/jamia.m2435
Abstract
Objective: This paper describes a successful approach to de-identification that was developed for participation in a recent AMIA-sponsored challenge evaluation. Method: Our approach focused on rapidly adapting two existing named entity recognition toolkits, Carafe and LingPipe. Results: The “out of the box” Carafe system achieved a very good score (phrase F-measure of 0.9664) with only four hours of work to adapt it to the de-identification task. With further tuning, through task-specific feature engineering and the introduction of a lexicon, we reduced the token-level error term by over 36% and achieved a phrase F-measure of 0.9736. Conclusions: We achieved good performance on the de-identification task by rapidly retargeting existing toolkits. For the Carafe system, we developed a method for tuning the balance of recall vs. precision, as well as a confidence score that correlated well with the measured F-score.
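The phrase F-measure quoted in the abstract is the harmonic mean of precision and recall over whole tagged phrases. The following is a minimal sketch of that computation, not the challenge's official scorer: it assumes exact-match scoring over hypothetical `(start, end, label)` spans, and the function names, label names, and `beta` parameter are illustrative assumptions rather than anything taken from Carafe or LingPipe.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F-beta).

    beta > 1 weights recall more heavily; beta < 1 favors precision.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def phrase_scores(gold: set, predicted: set) -> tuple:
    """Score predicted phrases against gold phrases by exact match.

    Each phrase is a hypothetical (start, end, label) tuple; a prediction
    counts as correct only if all three fields match a gold phrase.
    """
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


# Illustrative data only (not from the paper).
gold = {(0, 2, "PATIENT"), (5, 6, "DATE"), (9, 11, "HOSPITAL")}
predicted = {(0, 2, "PATIENT"), (5, 6, "DATE"), (9, 10, "HOSPITAL"), (12, 13, "DOCTOR")}
precision, recall, f1 = phrase_scores(gold, predicted)
```

In this toy case two of four predictions are correct and two of three gold phrases are found, so a phrase with a boundary error counts against both precision and recall. A recall-weighted score (for example `beta=2`) is one plausible way to express the preference for catching identifiers over avoiding false alarms in de-identification, though the abstract does not say which weighting the authors used.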