Claims‐Based Algorithms for Identifying Patients With Pulmonary Hypertension: A Comparison of Decision Rules and Machine‐Learning Approaches
Open Access
- 6 October 2020
- journal article
- research article
- Published by Ovid Technologies (Wolters Kluwer Health) in Journal of the American Heart Association
- Vol. 9 (19)
- https://doi.org/10.1161/jaha.120.016648
Abstract
Background Real‐world healthcare data are an important resource for epidemiologic research. However, accurate identification of patient cohorts, a crucial first step underpinning the validity of research results, remains a challenge. We developed and evaluated claims‐based case ascertainment algorithms for pulmonary hypertension (PH), comparing conventional decision rules with state‐of‐the‐art machine‐learning approaches.
Methods and Results We analyzed an electronic health record‐Medicare linked database from two large academic tertiary care hospitals (years 2007–2013). Electronic health record charts were reviewed to form a gold standard cohort of patients with PH (n=386) and without PH (n=164). Using health encounter data captured in Medicare claims (including patients’ demographics, diagnoses, medications, and procedures), we developed and compared 2 approaches for identifying patients with PH: decision rules and machine‐learning algorithms using penalized lasso regression, random forest, and gradient boosting machine. The best‐performing rule‐based algorithm (having ≥3 PH‐related healthcare encounters and having undergone right heart catheterization) attained an area under the receiver operating characteristic curve of 0.64 (sensitivity, 0.75; specificity, 0.48). All 3 machine‐learning algorithms outperformed the best‐performing rule‐based algorithm (P<0.001). A model derived from the random forest algorithm achieved an area under the receiver operating characteristic curve of 0.88 (sensitivity, 0.87; specificity, 0.70), and gradient boosting machine achieved comparable results (area under the receiver operating characteristic curve, 0.85; sensitivity, 0.87; specificity, 0.70). Penalized lasso regression achieved an area under the receiver operating characteristic curve of 0.73 (sensitivity, 0.70; specificity, 0.68).
Conclusions Research‐grade case identification algorithms for PH can be derived and rigorously validated using machine‐learning algorithms. Simple decision rules commonly applied in published literature performed poorly; more complex rule‐based algorithms may address the limitations of this approach. PH research using claims data would be considerably strengthened through the use of validated algorithms for cohort ascertainment.
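As an illustration only, the rule‐based approach described in the abstract (≥3 PH‐related encounters plus right heart catheterization) and its evaluation against chart‐review labels could be sketched as follows. The record fields, function names, and toy patients are hypothetical and are not study data; the study's actual variable definitions and validation procedure are not given in the abstract.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # All field names are hypothetical, chosen for illustration.
    ph_encounters: int        # count of PH-related encounters in claims
    had_rhc: bool             # right heart catheterization found in claims
    chart_confirmed_ph: bool  # gold-standard label from chart review


def rule_flags_ph(p: Patient, min_encounters: int = 3) -> bool:
    """Decision rule: >= min_encounters PH-related encounters AND an RHC."""
    return p.ph_encounters >= min_encounters and p.had_rhc


def sensitivity_specificity(patients):
    """Compare rule output with chart-review labels."""
    tp = fn = tn = fp = 0
    for p in patients:
        predicted = rule_flags_ph(p)
        if p.chart_confirmed_ph:
            if predicted:
                tp += 1
            else:
                fn += 1
        else:
            if predicted:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy records, invented purely to exercise the functions.
toy_cohort = [
    Patient(5, True, True),
    Patient(3, True, True),
    Patient(4, False, True),
    Patient(1, True, True),
    Patient(0, False, False),
    Patient(6, True, False),
    Patient(2, False, False),
    Patient(3, False, False),
]

sens, spec = sensitivity_specificity(toy_cohort)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Requiring both conditions tends to trade sensitivity for specificity relative to either condition alone; the `min_encounters` threshold is exposed as a parameter so that trade‐off can be explored.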