To Combat Multi-Class Imbalanced Problems by Means of Over-Sampling Techniques
- 21 July 2015
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Knowledge and Data Engineering
- Vol. 28 (1), 238-251
- https://doi.org/10.1109/tkde.2015.2458858
Abstract
The class imbalance problem is pervasive in practice. It refers to skewness in the underlying data distribution, which in turn creates many difficulties for typical machine learning algorithms. To deal with the issues arising from multi-class skewed distributions, existing efforts fall mainly into two categories: model-oriented solutions and data-oriented techniques. Focusing on the latter, this paper presents a new over-sampling technique inspired by the Mahalanobis distance. The technique, called MDO (Mahalanobis Distance-based Over-sampling), generates synthetic samples that have the same Mahalanobis distance from the considered class mean as other minority class examples. By preserving the covariance structure of the minority class instances and generating synthetic samples along the probability contours, new minority class instances are modelled better for learning algorithms. Moreover, MDO can reduce the risk of overlap between different class regions, which is considered a serious challenge in multi-class problems. Our theoretical analyses and empirical observations across a wide spectrum of multi-class imbalanced benchmarks indicate that MDO is the method of choice, offering statistically superior MAUC and precision compared to popular over-sampling techniques.
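The core idea in the abstract, placing a synthetic point at the same Mahalanobis distance from the class mean as an existing minority example, can be illustrated with a minimal NumPy sketch. This is not the authors' published algorithm: the function names `mahalanobis_sq` and `oversample_same_distance`, the uniform choice of direction in whitened space, and the assumption of a full-rank sample covariance are all illustrative assumptions.

```python
import numpy as np


def mahalanobis_sq(X, mean, cov_inv):
    """Squared Mahalanobis distance of each row of X from `mean`."""
    diff = X - mean
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)


def oversample_same_distance(X_min, n_new, rng=None):
    """Hypothetical helper (not the paper's procedure).

    Each synthetic point lies on the same covariance ellipsoid (same
    Mahalanobis distance from the class mean) as a randomly chosen
    minority example. Assumes the sample covariance is full rank.
    """
    rng = np.random.default_rng(rng)
    mean = X_min.mean(axis=0)
    cov = np.cov(X_min, rowvar=False)

    # Eigendecomposition of the covariance: cov = V diag(eigvals) V^T.
    eigvals, eigvecs = np.linalg.eigh(cov)
    scale = np.sqrt(eigvals)

    # Pick seed examples and express them in whitened coordinates,
    # where the Mahalanobis distance becomes the Euclidean norm.
    seeds = X_min[rng.integers(0, len(X_min), size=n_new)]
    z = (seeds - mean) @ eigvecs / scale
    radii = np.linalg.norm(z, axis=1, keepdims=True)

    # Draw a random direction on the unit sphere and keep the seed's radius.
    u = rng.normal(size=z.shape)
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    z_new = radii * u

    # Map back to the original feature space.
    return mean + (z_new * scale) @ eigvecs.T
```

Because the new points are drawn in the whitened space and mapped back through the class covariance, they follow the elliptical contours of the minority class rather than the straight line segments between neighbours used by interpolation-based over-samplers.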
This publication has 34 references indexed in Scilit:
- Analysing the classification of imbalanced data-sets with multiple classes: Binarization techniques and ad-hoc approaches. Knowledge-Based Systems, 2013
- Hellinger distance decision trees are robust and skew-insensitive. Data Mining and Knowledge Discovery, 2011
- Protein classification with imbalanced data. Proteins, 2007
- A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter, 2004
- Editorial. ACM SIGKDD Explorations Newsletter, 2004
- Classification by pairwise coupling. The Annals of Statistics, 1998
- The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 1997
- The condensed nearest neighbor rule (Corresp.). IEEE Transactions on Information Theory, 1968
- Multiple Comparisons among Means. Journal of the American Statistical Association, 1961
- The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance. Journal of the American Statistical Association, 1937