A Survey on Classifying Big Data with Label Noise
Open Access
- 23 November 2022
- journal article
- research article
- Published by Association for Computing Machinery (ACM) in Journal of Data and Information Quality
- Vol. 14 (4), 1-43
- https://doi.org/10.1145/3492546
Abstract
Class label noise is a critical component of data quality that directly inhibits the predictive performance of machine learning algorithms. While many data-level and algorithm-level methods exist for treating label noise, the challenges associated with big data call for new and improved methods. This survey addresses these concerns by providing an extensive literature review on treating label noise within big data. We begin with an introduction to the class label noise problem and traditional methods for treating label noise. Next, we present 30 methods for treating class label noise in a range of big data contexts, i.e., high-volume, high-variety, and high-velocity problems. The surveyed works include distributed solutions capable of operating on data sets of arbitrary sizes, deep learning techniques for large-scale data sets with limited clean labels, and streaming techniques for detecting class noise in the presence of concept drift. Common trends and best practices are identified in each of these areas, implementation details are reviewed, empirical results are compared across studies when applicable, and references to 17 open-source projects and programming packages are provided. An emphasis on label noise challenges, solutions, and empirical results as they relate to big data distinguishes this work as a unique contribution that will inspire future research and guide machine learning practitioners.