Class-imbalanced classifiers for high-dimensional data
Open Access
- 9 February 2012
- journal article
- research article
- Published by Oxford University Press (OUP) in Briefings in Bioinformatics
- Vol. 14 (1), 13-26
- https://doi.org/10.1093/bib/bbs006
Abstract
A class-imbalanced classifier is a decision rule to predict the class membership of new samples from an available data set where the class sizes differ considerably. When the class sizes are very different, most standard classification algorithms may favor the larger (majority) class, resulting in poor accuracy in predicting the minority class. A class-imbalanced classifier typically modifies a standard classifier by a correction strategy or by incorporating a new strategy in the training phase to account for differential class sizes. This article reviews and evaluates some of the most important methods for class prediction of high-dimensional imbalanced data. The evaluation addresses the fundamental issues of the class-imbalanced classification problem: imbalance ratio, small disjuncts and overlap complexity, lack of data and feature selection. Four class-imbalanced classifiers are considered: three standard classification algorithms, each coupled with an ensemble correction strategy, and one support vector machine (SVM)-based correction classifier. The three algorithms are (i) diagonal linear discriminant analysis (DLDA), (ii) random forests (RFs) and (iii) SVMs. The SVM-based correction classifier is SVM threshold adjustment (SVM-THR). A Monte Carlo simulation and five genomic data sets were used to illustrate the analysis and address the issues. The SVM-ensemble classifier appears to perform the best when the class imbalance is not too severe. The SVM-THR performs well if the imbalance is severe and predictors are highly correlated. DLDA with feature selection can perform well without using the ensemble correction.
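The abstract does not spell out how the ensemble correction is built. As a rough illustration only, the sketch below assumes one common form: repeatedly under-sample the majority class down to the minority class size, fit a base classifier on each balanced subsample, and combine the predictions by majority vote. The base classifier here is a plain DLDA rule (class means plus a pooled per-feature variance, equal priors). The function names `fit_dlda`, `predict_dlda`, `fit_undersampled_ensemble` and `predict_ensemble`, and the defaults `n_models=25` and `seed=0`, are hypothetical and not taken from the article.

```python
import random
from statistics import mean


def fit_dlda(X, y):
    """Fit DLDA: per-class feature means and one pooled variance per feature."""
    classes = sorted(set(y))
    p = len(X[0])
    means = {}
    pooled = [0.0] * p
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        means[c] = [mean(col) for col in zip(*rows)]
        # Accumulate within-class squared deviations for the pooled variance.
        for x in rows:
            for j in range(p):
                pooled[j] += (x[j] - means[c][j]) ** 2
    dof = max(len(X) - len(classes), 1)
    # Small floor avoids division by zero for constant features.
    var = [max(s / dof, 1e-12) for s in pooled]
    return means, var


def predict_dlda(model, x):
    """Assign x to the class with the smallest variance-scaled distance."""
    means, var = model

    def dist(c):
        return sum((xj - mj) ** 2 / vj for xj, mj, vj in zip(x, means[c], var))

    return min(means, key=dist)


def fit_undersampled_ensemble(X, y, n_models=25, seed=0):
    """Assumed ensemble correction: one DLDA per balanced under-sample."""
    rng = random.Random(seed)
    by_class = {}
    for x, label in zip(X, y):
        by_class.setdefault(label, []).append(x)
    n_min = min(len(rows) for rows in by_class.values())
    models = []
    for _ in range(n_models):
        Xb, yb = [], []
        for label, rows in by_class.items():
            # Draw n_min samples per class without replacement, so every
            # subsample is balanced and keeps all minority samples.
            for x in rng.sample(rows, n_min):
                Xb.append(x)
                yb.append(label)
        models.append(fit_dlda(Xb, yb))
    return models


def predict_ensemble(models, x):
    """Majority vote over the ensemble (ties go to the smaller label)."""
    votes = [predict_dlda(m, x) for m in models]
    return max(sorted(set(votes)), key=votes.count)
```

Calling `fit_undersampled_ensemble(X, y)` on a list of feature vectors `X` and labels `y`, then `predict_ensemble(models, x_new)`, gives a prediction that is not dominated by the majority class size. An odd `n_models` avoids tied votes in the two-class case.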