Claims‐Based Algorithms for Identifying Patients With Pulmonary Hypertension: A Comparison of Decision Rules and Machine‐Learning Approaches

Open Access

6 October 2020

journal article
research article
Published by Ovid Technologies (Wolters Kluwer Health) in Journal of the American Heart Association

Vol. 9 (19)
https://doi.org/10.1161/jaha.120.016648

Abstract

Background: Real‐world healthcare data are an important resource for epidemiologic research. However, accurate identification of patient cohorts, a crucial first step underpinning the validity of research results, remains a challenge. We developed and evaluated claims‐based case ascertainment algorithms for pulmonary hypertension (PH), comparing conventional decision rules with state‐of‐the‐art machine‐learning approaches.

Methods and Results: We analyzed an electronic health record‐Medicare linked database from two large academic tertiary care hospitals (years 2007–2013). Electronic health record charts were reviewed to form a gold standard cohort of patients with PH (n=386) and without PH (n=164). Using health encounter data captured in Medicare claims (including patients’ demographics, diagnoses, medications, and procedures), we developed and compared 2 approaches for identifying patients with PH: decision rules, and machine‐learning algorithms using penalized lasso regression, random forest, and gradient boosting machine. The best‐performing rule‐based algorithm, which required ≥3 PH‐related healthcare encounters and a right heart catheterization, attained an area under the receiver operating characteristic curve (AUC) of 0.64 (sensitivity, 0.75; specificity, 0.48). All 3 machine‐learning algorithms outperformed the best‐performing rule‐based algorithm (P<0.001). A model derived from the random forest algorithm achieved an AUC of 0.88 (sensitivity, 0.87; specificity, 0.70), and gradient boosting machine achieved comparable results (AUC, 0.85; sensitivity, 0.87; specificity, 0.70). Penalized lasso regression achieved an AUC of 0.73 (sensitivity, 0.70; specificity, 0.68).

Conclusions: Research‐grade case identification algorithms for PH can be derived and rigorously validated using machine‐learning algorithms. Simple decision rules commonly applied in published literature performed poorly; more complex rule‐based algorithms may address the limitations of this approach. PH research using claims data would be considerably strengthened through the use of validated algorithms for cohort ascertainment.
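
To make the rule‐based definition concrete, the following minimal Python sketch applies the decision rule described in the abstract (≥3 PH‐related encounters plus right heart catheterization) and scores it against chart‐review labels. The `PatientClaims` record, its field names, and the toy cohort are hypothetical illustrations, not the study's data model or data; the abstract does not describe how encounters were coded or counted.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class PatientClaims:
    """Hypothetical per-patient claims summary (field names are assumptions)."""
    ph_encounters: int        # number of PH-related healthcare encounters
    had_rhc: bool             # any claim for right heart catheterization
    chart_confirmed_ph: bool  # gold-standard label from chart review


def rule_flags_ph(patient: PatientClaims, min_encounters: int = 3) -> bool:
    """Decision rule from the abstract: >=3 PH-related encounters AND an RHC."""
    return patient.ph_encounters >= min_encounters and patient.had_rhc


def sensitivity_specificity(patients: Iterable[PatientClaims]) -> Tuple[float, float]:
    """Score the rule against the chart-review gold standard."""
    tp = fn = tn = fp = 0
    for p in patients:
        flagged = rule_flags_ph(p)
        if p.chart_confirmed_ph:
            if flagged:
                tp += 1
            else:
                fn += 1
        else:
            if flagged:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("cohort needs at least one case and one non-case")
    return tp / (tp + fn), tn / (tn + fp)


# Toy cohort invented for illustration only (not study data).
toy_cohort = [
    PatientClaims(5, True, True),    # flagged, true case
    PatientClaims(3, True, True),    # flagged, true case
    PatientClaims(2, True, True),    # missed: too few encounters
    PatientClaims(4, False, True),   # missed: no RHC
    PatientClaims(4, True, False),   # flagged, but not a case
    PatientClaims(1, False, False),  # correctly not flagged
    PatientClaims(0, False, False),  # correctly not flagged
    PatientClaims(6, False, False),  # correctly not flagged (no RHC)
]

sens, spec = sensitivity_specificity(toy_cohort)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

A rule of this form yields a single yes/no label per patient, so tightening it (for example, raising `min_encounters`) tends to raise specificity while lowering sensitivity. The machine‐learning models in the abstract instead produce a score that can be thresholded at different operating points.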

