A Survey on Classifying Big Data with Label Noise

Open Access

23 November 2022

journal article
research article
Published by Association for Computing Machinery (ACM) in Journal of Data and Information Quality

Vol. 14 (4), 1-43
https://doi.org/10.1145/3492546

Abstract

Class label noise is a critical component of data quality that directly inhibits the predictive performance of machine learning algorithms. While many data-level and algorithm-level methods exist for treating label noise, the challenges associated with big data call for new and improved methods. This survey addresses these concerns by providing an extensive literature review on treating label noise within big data. We begin with an introduction to the class label noise problem and traditional methods for treating label noise. Next, we present 30 methods for treating class label noise in a range of big data contexts, i.e., high-volume, high-variety, and high-velocity problems. The surveyed works include distributed solutions capable of operating on data sets of arbitrary size, deep learning techniques for large-scale data sets with limited clean labels, and streaming techniques for detecting class noise in the presence of concept drift. Common trends and best practices are identified in each of these areas, implementation details are reviewed, empirical results are compared across studies when applicable, and references to 17 open-source projects and programming packages are provided. An emphasis on label noise challenges, solutions, and empirical results as they relate to big data distinguishes this work as a unique contribution that will inspire future research and guide machine learning practitioners.
