ENRICHing Medical Imaging Training Sets Enables More Efficient Machine Learning

Preprint

25 May 2021

research article
Published by Cold Spring Harbor Laboratory

https://doi.org/10.1101/2021.05.22.21257645

Abstract

Objective: Deep learning (DL) has been applied in proofs of concept across biomedical imaging, including across modalities and medical specialties^1–17. Labeled data is critical to training and testing DL models, but human expert labelers are limited. In addition, DL traditionally requires copious training data, which is computationally expensive to process and iterate over. Consequently, it is useful to prioritize using those images that are most likely to improve a model’s performance, a practice known as instance selection. The challenge is determining how best to prioritize. It is natural to prefer straightforward, robust, quantitative metrics as the basis for prioritization for instance selection. However, in current practice such metrics are not tailored to, and almost never used for, image datasets.

Methods: To address this problem, we introduce ENRICH (Eliminate Noise and Redundancy for Imaging Challenges), a customizable method that prioritizes images based on how much diversity each image adds to the training set.

Results: First, we show that medical datasets are special in that in general each image adds less diversity than in non-medical datasets. Next, we demonstrate that ENRICH achieves nearly maximal performance on classification and segmentation tasks on several medical image datasets using only a fraction of the available images and outperforms random image selection, the negative control. Finally, we show that ENRICH can also be used to identify errors and outliers in imaging datasets.

Conclusion: ENRICH is a simple, computationally efficient method for prioritizing images for expert labeling and use in DL.
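To make the idea of diversity-based instance selection concrete, the following is a minimal sketch and not the ENRICH algorithm itself, whose details the abstract does not give. It swaps in a generic technique, greedy farthest-point sampling, and assumes each image has already been reduced to a feature vector (for example flattened pixels or an embedding) and that Euclidean distance is an acceptable dissimilarity measure. The function name `greedy_diverse_selection` and its parameters are hypothetical.

```python
import numpy as np


def greedy_diverse_selection(features, k, first=0):
    """Return indices of k rows chosen by greedy farthest-point sampling.

    Illustrative stand-in for diversity-based instance selection; this is
    NOT the authors' ENRICH procedure. Euclidean distance between feature
    vectors is an assumption made for this sketch.
    """
    features = np.asarray(features, dtype=float)
    n = features.shape[0]
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of images")

    selected = [first]
    # Distance from every image to its nearest already-selected image.
    min_dist = np.linalg.norm(features - features[first], axis=1)

    while len(selected) < k:
        # The image farthest from everything chosen so far adds the most
        # diversity under this (assumed) distance measure.
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)

    return selected


# Toy data: two pairs of near-duplicate "images" plus one distinct image.
toy_features = [[0, 0], [0, 0.1], [10, 10], [10, 10.1], [5, 5]]
chosen = greedy_diverse_selection(toy_features, k=3)
print(chosen)
```

On the toy data, the selection takes one image from each near-duplicate pair and the distinct image, so redundant images are the ones left unlabeled. A random draw of three images, the negative control mentioned in the abstract, carries no such guarantee.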

This publication has 27 references indexed in Scilit:

Fast and accurate view classification of echocardiograms using deep learning
npj Digital Medicine, 2018
Dermatologist-level classification of skin cancer with deep neural networks
Nature, 2017
Learning how to Active Learn: A Deep Reinforcement Learning Approach
Published by Association for Computational Linguistics (ACL), 2017
Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs
JAMA, 2016
Cost-Effective Active Learning for Deep Image Classification
IEEE Transactions on Circuits and Systems for Video Technology, 2016
Robust estimates of overall immune-repertoire diversity from high-throughput measurements on samples
Nature Communications, 2016
The U.S. Radiologist Workforce: An Analysis of Temporal and Geographic Variation by Using Large National Datasets
Radiology, 2016
A review of instance selection methods
Artificial Intelligence Review, 2010
Multi-class active learning for image classification
Published by Institute of Electrical and Electronics Engineers (IEEE), 2009
Improving generalization with active learning
Machine Learning, 1994