Revisiting performance metrics for prediction with rare outcomes

Abstract
Machine learning algorithms are increasingly used in the clinical literature, often claiming advantages over logistic regression, yet they are generally designed to maximize the area under the receiver operating characteristic curve (AUC). While AUC and other measures of accuracy are commonly reported for evaluating binary prediction problems, these metrics can be misleading. We aim to give clinical and machine learning researchers a realistic medical example of the dangers of relying on a single measure of discriminatory performance to evaluate binary prediction problems. Predicting medical complications after surgery is a frequent but challenging task because many post-surgery outcomes are rare. We predicted post-surgery mortality among patients in a clinical registry who received at least one aortic valve replacement. Estimation incorporated multiple evaluation metrics and algorithms typically regarded as performing well with rare outcomes, as well as an ensemble and a new extension of the lasso for multiple unordered treatments. All algorithms achieved high accuracy with moderate cross-validated AUC. False positive rates were < 1%; however, true positive rates were < 7%, even when paired with a 100% positive predictive value, and graphical representations of calibration were poor. Simulations showed similar results, with the addition that high AUC (> 90%) accompanied low true positive rates. Clinical studies should not report only AUC or accuracy.
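
As a concrete illustration of the failure mode described above, the following minimal Python sketch uses fully synthetic data; the prevalence, risk-score model, and classification threshold are hypothetical choices for exposition, not the registry analysis or algorithms studied in the paper. It shows how, with a rare outcome, a classifier can attain high accuracy and a low false positive rate while detecting almost none of the true events.

```python
# Minimal synthetic sketch (assumed toy setup, not the paper's data):
# with a rare outcome, accuracy and false positive rate look excellent
# even when the true positive rate is very low and AUC is only moderate.
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

rng = np.random.default_rng(0)
n = 100_000
prevalence = 0.02                      # hypothetical rare outcome (~2%)
y = rng.binomial(1, prevalence, n)

# Weakly informative risk scores: events score slightly higher on average.
scores = rng.normal(loc=0.0, scale=1.0, size=n) + 0.5 * y

# Classify at a high threshold, so positive predictions are rare.
y_hat = (scores > 2.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y, y_hat).ravel()
accuracy = (tp + tn) / n
tpr = tp / (tp + fn)                   # sensitivity / recall
fpr = fp / (fp + tn)
ppv = tp / (tp + fp) if (tp + fp) else float("nan")
auc = roc_auc_score(y, scores)

print(f"accuracy={accuracy:.3f}  TPR={tpr:.3f}  FPR={fpr:.3f}  "
      f"PPV={ppv:.3f}  AUC={auc:.3f}")
# Typical output: accuracy near 0.98 and FPR below 0.01, yet TPR of only
# a few percent with AUC around 0.64 -- the pattern the abstract warns
# against when accuracy or a single discrimination metric is reported alone.
```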
Funding Information
  • National Institute of General Medical Sciences (NIH R01-GM111339)