A comprehensive study of named entity recognition in Chinese clinical text

Open Access

1 September 2014

journal article
research article
Published by Oxford University Press (OUP) in Journal of the American Medical Informatics Association

Vol. 21 (5), 808-814
https://doi.org/10.1136/amiajnl-2013-002381

Abstract

Objective: Named entity recognition (NER) is one of the fundamental tasks in natural language processing. In the medical domain, there have been a number of studies on NER in English clinical notes; however, very limited NER research has been carried out on clinical notes written in Chinese. The goal of this study was to systematically investigate features and machine learning algorithms for NER in Chinese clinical text.

Materials and methods: We randomly selected 400 admission notes and 400 discharge summaries from Peking Union Medical College Hospital in China. For each note, four types of entity (clinical problems, procedures, laboratory tests, and medications) were annotated according to a predefined guideline. For each note type, two-thirds of the 400 notes were used to train the NER systems and one-third for testing. We investigated the effects of different types of features, including bag-of-characters, word segmentation, part-of-speech, and section information, and of different machine learning algorithms, including conditional random fields (CRF), support vector machines (SVM), maximum entropy (ME), and structural SVM (SSVM), on the Chinese clinical NER task. All classifiers were trained on the training dataset and evaluated on the test set, and micro-averaged precision, recall, and F-measure were reported.

Results: Our evaluation on the independent test set showed that most types of features were beneficial to Chinese NER systems, although the improvements were limited. The system achieved the highest performance by combining word segmentation and section information, indicating that these two types of features complement each other. When the same types of optimized features were used, CRF and SSVM outperformed SVM and ME. More specifically, SSVM achieved the highest performance of the four algorithms, with F-measures of 93.51% and 90.01% for admission notes and discharge summaries, respectively.
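
The abstract reports micro-averaged precision, recall, and F-measure without spelling out the computation. The sketch below shows one common way to compute them; it is not the authors' evaluation script. It assumes exact-match scoring, where a predicted entity counts as correct only if its note, span, and type all match a gold annotation, and the function name `micro_prf`, the tuple layout, and the toy annotations are hypothetical. Micro-averaging pools true positives, false positives, and false negatives across all entity types before computing the ratios.

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall, and F-measure.

    Both arguments are sets of (note_id, start, end, entity_type) tuples.
    Exact-match scoring is an assumption made for this sketch.
    """
    # Counts are pooled over every entity type, which is what makes
    # the average "micro" rather than "macro".
    tp = len(gold & predicted)   # entities found with the right span and type
    fp = len(predicted - gold)   # predicted entities with no gold match
    fn = len(gold - predicted)   # gold entities the system missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy annotations (invented for illustration, not data from the study).
gold = {
    ("note1", 0, 3, "problem"),
    ("note1", 10, 14, "medication"),
    ("note1", 20, 24, "test"),
    ("note2", 5, 9, "procedure"),
    ("note2", 15, 18, "problem"),
}
predicted = {
    ("note1", 0, 3, "problem"),
    ("note1", 10, 14, "medication"),
    ("note2", 5, 9, "procedure"),
    ("note2", 15, 18, "test"),  # right span, wrong type: counted as an error
}

p, r, f = micro_prf(gold, predicted)
print(f"precision={p:.4f} recall={r:.4f} F={f:.4f}")
```

With these toy sets, three of the four predictions match a gold entity and three of the five gold entities are recovered, so precision is 0.75 and recall is 0.60. Because the counts are pooled, frequent entity types weigh more heavily in the final score than rare ones.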

