SMOTE for high-dimensional class-imbalanced data
Open Access
- 22 March 2013
- journal article
- research article
- Published by Springer Science and Business Media LLC in BMC Bioinformatics
- Vol. 14 (1), 106
- https://doi.org/10.1186/1471-2105-14-106
Abstract
Classification using class-imbalanced data is biased in favor of the majority class. The bias is even larger for high-dimensional data, where the number of variables greatly exceeds the number of samples. The problem can be attenuated by undersampling or oversampling, which produce class-balanced data. Generally, undersampling is helpful, while random oversampling is not. Synthetic Minority Oversampling TEchnique (SMOTE) is a very popular oversampling method that was proposed to improve on random oversampling, but its behavior on high-dimensional data has not been thoroughly investigated. In this paper we investigate the properties of SMOTE from a theoretical and empirical point of view, using simulated and real high-dimensional data.
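The abstract names SMOTE without describing its mechanics. As the method is commonly described, each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its k nearest minority-class neighbors. The following is a minimal NumPy sketch of that idea, not the authors' implementation; the function name `smote`, its parameters, and the default `k=5` are assumptions chosen for illustration.

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=0):
    """Return n_new synthetic samples built from minority-class rows X_min.

    Illustrative sketch only: names and defaults are hypothetical.
    """
    X_min = np.asarray(X_min, dtype=float)
    n = X_min.shape[0]
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)  # cannot have more neighbors than other samples
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances among minority samples.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor

    # Indices of the k nearest minority neighbors of each sample.
    nn = np.argsort(dist, axis=1)[:, :k]

    # Pick a base sample and one of its neighbors for each synthetic point.
    base = rng.integers(0, n, size=n_new)
    neigh = nn[base, rng.integers(0, k, size=n_new)]

    # Interpolate at a uniform random position along the segment.
    u = rng.random((n_new, 1))
    return X_min[base] + u * (X_min[neigh] - X_min[base])
```

With, say, 10 minority samples measured on 1000 variables, `smote(X_min, 40)` would return a 40 x 1000 array. Because every synthetic point is a convex combination of two existing minority samples, it never leaves the coordinate-wise range of the original minority data.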