Stress testing reveals gaps in clinic readiness of image-based diagnostic artificial intelligence models

Abstract
Artificial intelligence models match or exceed dermatologists in melanoma image classification. Less is known about their robustness against real-world variations, and clinicians may incorrectly assume that a model with an acceptable area under the receiver operating characteristic curve or related performance metric is ready for clinical use. Here, we systematically assessed the performance of dermatologist-level convolutional neural networks (CNNs) on real-world non-curated images by applying computational “stress tests”. Our goal was to create a proxy environment in which to comprehensively test the generalizability of off-the-shelf CNNs developed without training or evaluation protocols specific to individual clinics. We found inconsistent predictions on images captured repeatedly in the same setting or subjected to simple transformations (e.g., rotation). Such transformations resulted in false positive or negative predictions for 6.5–22% of skin lesions across test datasets. Our findings indicate that models meeting conventionally reported metrics need further validation with computational stress tests to assess clinic readiness.
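The stress-test idea described above — probing whether a classifier's prediction is stable under label-preserving transformations such as rotation — can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the CNN is replaced by a deliberately orientation-sensitive toy function, and the transformation set and threshold are assumptions for demonstration.

```python
import numpy as np

def stress_test(model, img, threshold=0.5):
    """Apply simple label-preserving transformations (90-degree rotations
    and a horizontal flip) and report whether the binary prediction flips."""
    variants = [np.rot90(img, k) for k in range(4)] + [np.fliplr(img)]
    scores = [model(v) for v in variants]
    labels = [s >= threshold for s in scores]
    return {
        "scores": scores,
        "consistent": len(set(labels)) == 1,  # do all variants agree?
    }

# Stand-in "model": mean intensity of the upper half of the image.
# A real melanoma CNN would replace this; the stand-in is intentionally
# orientation-sensitive so rotations can change its output.
def toy_model(img):
    half = img.shape[0] // 2
    return float(img[:half].mean())

rng = np.random.default_rng(0)
img = rng.random((8, 8))
report = stress_test(toy_model, img)
print("prediction stable under transformations:", report["consistent"])
```

In a clinic-readiness audit, a lesion whose `consistent` flag is False under such transformations would count toward the 6.5–22% of unstable predictions the study reports.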
Funding Information
  • Melanoma Research Alliance (622732)
  • UCSF Helen Diller Family Comprehensive Cancer Center
  • UCSF Summer Explore Fellowship, Marguerite Schoeneman Award, Alameda-Contra Costa Medical Association Summer Fellowship, UCSF/UCB Joint Medical Program Thesis Grant
  • Doris Duke Charitable Foundation
