Class-imbalanced classifiers for high-dimensional data

Open Access

9 February 2012

journal article
research article
Published by Oxford University Press (OUP) in Briefings in Bioinformatics

Vol. 14 (1), 13-26
https://doi.org/10.1093/bib/bbs006

Abstract

A class-imbalanced classifier is a decision rule for predicting the class membership of new samples from an available data set in which the class sizes differ considerably. When the class sizes are very different, most standard classification algorithms tend to favor the larger (majority) class, resulting in poor prediction accuracy for the minority class. A class-imbalanced classifier typically modifies a standard classifier, either by applying a correction strategy or by incorporating a new strategy in the training phase, to account for the differing class sizes. This article reviews and evaluates some of the most important methods for class prediction of high-dimensional imbalanced data. The evaluation addresses the fundamental issues of the class-imbalanced classification problem: imbalance ratio, small disjuncts and overlap complexity, lack of data and feature selection. Four class-imbalanced classifiers are considered: three standard classification algorithms, each coupled with an ensemble correction strategy, and one support vector machine (SVM)-based correction classifier. The three algorithms are (i) diagonal linear discriminant analysis (DLDA), (ii) random forests (RFs) and (iii) SVMs. The SVM-based correction classifier is SVM threshold adjustment (SVM-THR). A Monte Carlo simulation and five genomic data sets were used to illustrate the analysis and address the issues. The SVM-ensemble classifier appears to perform best when the class imbalance is not too severe. SVM-THR performs well if the imbalance is severe and the predictors are highly correlated. DLDA with feature selection can perform well without the ensemble correction.
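The idea behind threshold adjustment can be illustrated with a minimal Python sketch. This is not the article's SVM-THR procedure, whose details the abstract does not give. It assumes that a classifier has already produced real-valued decision scores and that the cut-off is chosen to maximize balanced accuracy (the mean of sensitivity and specificity). The function names, the selection criterion and the toy scores are all hypothetical.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Return (sensitivity, specificity); a score above threshold predicts class 1 (minority)."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s <= threshold)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return tp / n_pos, tn / n_neg


def adjust_threshold(scores, labels):
    """Pick the cut-off that maximizes balanced accuracy (an assumed criterion)."""
    ordered = sorted(set(scores))
    # Candidate cut-offs: one below the minimum score, then midpoints
    # between consecutive distinct scores.
    candidates = [ordered[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]

    def balanced_accuracy(t):
        sens, spec = sensitivity_specificity(scores, labels, t)
        return (sens + spec) / 2

    return max(candidates, key=balanced_accuracy)


# Hypothetical decision scores: 8 majority (0) and 2 minority (1) samples.
scores = [-2.1, -1.8, -1.5, -1.2, -1.0, -0.9, -0.7, -0.3, -0.5, 0.4]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

default = sensitivity_specificity(scores, labels, 0.0)   # cut-off at 0
threshold = adjust_threshold(scores, labels)             # shifted cut-off
adjusted = sensitivity_specificity(scores, labels, threshold)
```

With these toy scores, the default cut-off at 0 misses one of the two minority samples, while the shifted cut-off recovers both at the cost of one misclassified majority sample. In practice the cut-off would be tuned on held-out or cross-validated scores rather than on the data used for evaluation.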
