SMOTE for high-dimensional class-imbalanced data

Open Access

22 March 2013

journal article
research article
Published by Springer Science and Business Media LLC in BMC Bioinformatics

Vol. 14 (1), 106
https://doi.org/10.1186/1471-2105-14-106

Abstract

Classification using class-imbalanced data is biased in favor of the majority class. The bias is even larger for high-dimensional data, where the number of variables greatly exceeds the number of samples. The problem can be attenuated by undersampling or oversampling, both of which produce class-balanced data. Generally, undersampling is helpful, while random oversampling is not. Synthetic Minority Oversampling TEchnique (SMOTE) is a very popular oversampling method that was proposed to improve on random oversampling, but its behavior on high-dimensional data has not been thoroughly investigated. In this paper, we investigate the properties of SMOTE from a theoretical and empirical point of view, using simulated and real high-dimensional data.
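
As an illustration of the two resampling strategies named in the abstract, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the commonly described form of SMOTE: each synthetic point is placed at a random position on the segment between a minority sample and one of its k nearest minority-class neighbours, using Euclidean distance. As a simplification, the seed sample is drawn at random for every synthetic point. The function names `smote` and `random_undersample`, their parameters, and the toy data are all hypothetical.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Sketch of SMOTE: interpolate between minority samples and their
    k nearest minority-class neighbours (Euclidean distance)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        # Simplification: pick the seed sample at random each time.
        i = rng.randrange(len(minority))
        x = minority[i]
        # Rank the other minority samples by distance to x.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(x, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # The new point lies on the segment between x and the neighbour.
        u = rng.random()
        synthetic.append([a + u * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


def random_undersample(majority, n_keep, seed=0):
    """Keep a random subset of the majority class, without replacement."""
    return random.Random(seed).sample(majority, n_keep)


# Toy data (hypothetical): 3 minority and 10 majority samples in 2 dimensions.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
majority = [[5.0 + i, 5.0] for i in range(10)]

# Oversampling route: add 7 synthetic minority samples to reach 10 vs 10.
new_points = smote(minority, n_synthetic=7, k=2)
balanced_minority = minority + new_points

# Undersampling route: shrink the majority class to 3 samples instead.
reduced_majority = random_undersample(majority, n_keep=len(minority))
```

Because every synthetic point is a convex combination of two existing minority samples, SMOTE never generates values outside the range already covered by the minority class.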

Cited by 585 articles