To Combat Multi-Class Imbalanced Problems by Means of Over-Sampling Techniques

Abstract

The class imbalance problem is pervasive in practice. It refers to skewness in the underlying data distribution, which in turn creates difficulties for typical machine learning algorithms. Existing efforts to address multi-class skewed distributions fall mainly into two categories: model-oriented solutions and data-oriented techniques. Focusing on the latter, this paper presents a new over-sampling technique inspired by the Mahalanobis distance. The technique, called MDO (Mahalanobis Distance-based Over-sampling), generates synthetic samples that have the same Mahalanobis distance from the considered class mean as other minority class examples. By preserving the covariance structure of the minority class instances and generating synthetic samples along the probability contours, MDO models new minority class instances better for learning algorithms. Moreover, MDO can reduce the risk of overlap between different class regions, which is considered a serious challenge in multi-class problems. Our theoretical analyses and empirical observations across a wide spectrum of multi-class imbalanced benchmarks indicate that MDO is the method of choice, offering statistically superior MAUC and precision compared with popular over-sampling techniques.
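
The core idea, placing a synthetic sample at the same Mahalanobis distance from the class mean as an existing minority example, can be illustrated with a minimal Python sketch. This is not the authors' MDO procedure, whose details the abstract does not give. It is one possible realization under stated assumptions: the minority covariance matrix is full rank, directions are drawn uniformly in the whitened space, and the names `mahalanobis_sq` and `oversample_same_mahalanobis` are hypothetical.

```python
import numpy as np


def mahalanobis_sq(X, mean, cov):
    """Squared Mahalanobis distance of each row of X from `mean` under `cov`."""
    diff = X - mean
    return np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)


def oversample_same_mahalanobis(X_min, n_new, rng=None):
    """Illustrative sketch only (not the published MDO algorithm).

    For each randomly chosen minority "seed" sample, draw a new point that
    lies on the same Mahalanobis contour around the minority class mean.
    Assumes the sample covariance of X_min is full rank.
    """
    rng = np.random.default_rng(rng)
    mean = X_min.mean(axis=0)
    cov = np.cov(X_min, rowvar=False)

    # Eigendecomposition cov = V diag(eigvals) V^T defines the whitened space.
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Pick seed samples and record their Mahalanobis distances.
    seeds = X_min[rng.integers(0, len(X_min), size=n_new)]
    dist = np.sqrt(mahalanobis_sq(seeds, mean, cov))

    # Random unit directions in the whitened space, scaled to each seed's distance.
    directions = rng.standard_normal((n_new, X_min.shape[1]))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    z = directions * dist[:, None]

    # Map back to the original feature space: x = mean + V diag(sqrt(eigvals)) z.
    X_new = mean + (z * np.sqrt(eigvals)) @ eigvecs.T
    return X_new, seeds
```

Because the new points are built in the whitened space and mapped back through the class covariance, each one shares its seed's distance from the class mean while following the covariance structure of the minority class.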

