SVMs Modeling for Highly Imbalanced Classification

Abstract

Traditional classification algorithms can perform poorly on highly imbalanced data sets. A popular line of work for countering class imbalance has been the application of a variety of sampling strategies. In this paper, we focus on designing modifications to support vector machines (SVMs) to appropriately tackle the problem of class imbalance. We incorporate different "rebalance" heuristics in SVM modeling, including cost-sensitive learning, oversampling, and undersampling. These SVM-based strategies are compared with various state-of-the-art approaches on a variety of data sets by using several metrics, including G-mean, area under the receiver operating characteristic curve, F-measure, and area under the precision/recall curve. We show that we are able to surpass or match the previously known best algorithms on each data set. In particular, of the four SVM variations considered in this paper, the novel granular SVMs-repetitive undersampling algorithm (GSVM-RU) is the best in terms of both effectiveness and efficiency. GSVM-RU is effective because it can minimize the negative effect of information loss while maximizing the positive effect of data cleaning in the undersampling process. GSVM-RU is efficient because it extracts far fewer support vectors and, hence, greatly speeds up SVM prediction.
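To illustrate why the abstract names metrics such as G-mean and F-measure rather than plain accuracy, here is a minimal sketch in Python. It assumes the common definitions (G-mean as the geometric mean of sensitivity and specificity, F-measure as the harmonic mean of precision and recall); the paper itself may weight them differently. The function name `imbalance_metrics` and the confusion counts are hypothetical and are not taken from the paper.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Compute accuracy, G-mean, and F-measure from confusion-matrix counts.

    Assumed definitions (not confirmed by the abstract):
      G-mean    = sqrt(sensitivity * specificity)
      F-measure = 2 * precision * recall / (precision + recall)
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Sensitivity is recall on the positive (minority) class.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity is recall on the negative (majority) class.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    g_mean = math.sqrt(sensitivity * specificity)
    denom = precision + sensitivity
    f_measure = 2 * precision * sensitivity / denom if denom else 0.0
    return {"accuracy": accuracy, "g_mean": g_mean, "f_measure": f_measure}


# Hypothetical data set: 10 positives, 990 negatives.
# A classifier that labels everything negative looks accurate but is useless.
trivial = imbalance_metrics(tp=0, fp=0, tn=990, fn=10)

# A classifier that finds 8 of the 10 positives at the cost of 40 false alarms.
useful = imbalance_metrics(tp=8, fp=40, tn=950, fn=2)

print(trivial)
print(useful)
```

In this made-up example the all-negative classifier has the higher accuracy, yet its G-mean and F-measure are both zero, which is the kind of failure these metrics are meant to expose.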

