Efficient and Accurate Extracting of Unstructured EHRs on Cancer Therapy Responses for the Development of RECIST Natural Language Processing Tools: Part I, the Corpus
Open Access
- 1 November 2020
- journal article
- research article
- Published by American Society of Clinical Oncology (ASCO) in JCO Clinical Cancer Informatics
- Vol. 4 (4), 383-391
- https://doi.org/10.1200/cci.19.00147
Abstract
Electronic health records (EHRs) are created primarily for nonresearch purposes; as a result, the data are voluminous, crude, heterogeneous, incomplete, and largely unstructured, which makes timely, reliable analysis difficult. In particular, research using clinical notes relevant to patient care and outcomes has seldom been conducted because of the complexity of data extraction and accurate annotation. RECIST (Response Evaluation Criteria in Solid Tumors) is a set of widely accepted research criteria for evaluating tumor response in patients undergoing antineoplastic therapy. The aim of this study was to identify textual sources of RECIST information in EHRs and to develop a corpus of pharmacotherapy and response entities for the development of natural language processing tools.
We focused on pharmacotherapies and patient responses, using 55,120 medical notes (n = 72 types) in Mayo Clinic’s EHRs from 622 randomly selected patients who signed authorization for research. Using the Multidocument Annotation Environment tool, we applied and evaluated predefined keywords, together with time-interval and note-type filters, for identifying RECIST information, and established a gold standard data set for patient outcome research.
Keywords reduced the clinical notes to 37,406, and restricting to four note types within 12 months postdiagnosis further reduced the number of notes to 5,005 that were manually annotated, covering 97.9% of all cases (n = 609 of 622). The resulting data set of 609 cases (n = 503 for training and n = 106 for validation) contains 736 fully annotated, deidentified clinical notes, with pharmacotherapies and four response end points: complete response, partial response, stable disease, and progressive disease. This resource is readily expandable to specific drugs, regimens, and most solid tumors.
We have established a gold standard data set to accommodate development of biomedical informatics tools in accelerating research into antineoplastic therapeutic response.
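The note-reduction steps described in the abstract (keyword match, note-type restriction, and a 12-month postdiagnosis window) can be pictured as a simple filter cascade. The following is only a minimal sketch and not the authors' implementation: the `Note` record, the keyword list, the four note-type names, and the reading of "12 months" as 365 days from the diagnosis date are all hypothetical placeholders, since the abstract does not specify them.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Note:
    """Hypothetical minimal record for one clinical note."""
    patient_id: str
    note_type: str
    note_date: date
    text: str


# Assumptions for illustration only: the abstract does not list the actual
# keywords or name the four note types that were used.
KEYWORDS = ("complete response", "partial response",
            "stable disease", "progressive disease")
NOTE_TYPES = {"progress note", "oncology consult",
              "radiology report", "hospital summary"}
WINDOW = timedelta(days=365)  # assumed reading of "12 months postdiagnosis"


def filter_notes(notes, diagnosis_dates, keywords=KEYWORDS,
                 note_types=NOTE_TYPES, window=WINDOW):
    """Keep notes that pass the keyword, note-type, and time-window filters."""
    kept = []
    for note in notes:
        dx = diagnosis_dates.get(note.patient_id)
        if dx is None:
            continue  # no diagnosis date, so the window cannot be applied
        text = note.text.lower()
        if not any(keyword in text for keyword in keywords):
            continue  # keyword filter
        if note.note_type not in note_types:
            continue  # note-type filter
        if not (dx <= note.note_date <= dx + window):
            continue  # time-interval filter
        kept.append(note)
    return kept


def case_coverage(kept_notes, all_patient_ids):
    """Fraction of patients who still have at least one note after filtering."""
    all_ids = set(all_patient_ids)
    covered = {note.patient_id for note in kept_notes} & all_ids
    return len(covered) / len(all_ids)
```

With the figures reported in the abstract, the same coverage calculation gives 609 / 622, or about 97.9% of cases retained after filtering.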