Classification of imbalanced data sets using Multi Objective Genetic Programming

Abstract
Classification of imbalanced data sets is a challenging problem because it is very difficult to achieve good classification accuracy for every class. The problem arises in many real-world applications, such as medical diagnosis of rare diseases, fraud detection in the financial domain, and faulty area detection in network troubleshooting. An imbalanced data set consists of a small number of instances of the minority classes and a large number of instances of the majority classes. Overall classification accuracy is computed as the ratio of correctly classified instances to the total number of instances in the data set. For imbalanced data sets, correct classification of minority class instances therefore contributes far less to overall classification accuracy than correct classification of majority class instances. Conventional classification techniques such as Artificial Neural Network (ANN), Support Vector Machine (SVM), and Naïve Bayes (NB) consider only the overall classification accuracy of the classifier and thus produce biased classifiers on imbalanced data sets. However, in many real-world data sets the minority class instances may carry rare but important information. Thus, a classification technique that provides good classification accuracy on both minority and majority classes is needed. This paper proposes a combination of Multi Objective Genetic Programming (MOGP) and a probability-based Gaussian classifier for classification of imbalanced data sets. MOGP treats the classification accuracy of each class as a separate objective, rather than treating overall accuracy as a single objective. The Gaussian classifier is a generative classifier in which the distribution of one class does not affect the classification of instances of other classes. The proposed methodology is applied to the classification of imbalanced data sets from the medical, life science, automobile, and space science domains. The results suggest that the MOGP classifier outperformed the conventional classifiers (ANN, SVM, and NB) on the tested imbalanced data sets.
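
The gap between overall accuracy and per-class accuracy described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the class labels, and the 95/5 class split are hypothetical and chosen only for illustration. It computes overall accuracy as the abstract defines it, computes one accuracy per class, and uses a standard Pareto dominance test to compare two classifiers when each class accuracy is treated as a separate objective.

```python
from collections import Counter


def overall_accuracy(y_true, y_pred):
    """Ratio of correctly classified instances to total instances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_accuracy(y_true, y_pred):
    """Accuracy computed separately for each class (one objective per class)."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}


def dominates(a, b):
    """True if objective vector a Pareto-dominates b (higher is better)."""
    no_worse = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return no_worse and strictly_better


# Hypothetical imbalanced data: 95 majority instances, 5 minority instances.
y_true = ["maj"] * 95 + ["min"] * 5

# A biased classifier that always predicts the majority class.
biased = ["maj"] * 100

# A classifier that gives up some majority accuracy to recover the minority.
balanced = ["maj"] * 85 + ["min"] * 10 + ["min"] * 4 + ["maj"] * 1

for name, pred in [("biased", biased), ("balanced", balanced)]:
    print(name, overall_accuracy(y_true, pred), per_class_accuracy(y_true, pred))

classes = ["maj", "min"]
obj_biased = [per_class_accuracy(y_true, biased)[c] for c in classes]
obj_balanced = [per_class_accuracy(y_true, balanced)[c] for c in classes]
print(dominates(obj_biased, obj_balanced), dominates(obj_balanced, obj_biased))
```

In this hypothetical case the biased classifier has the higher overall accuracy even though it never identifies a minority instance. With one objective per class, neither classifier dominates the other, so a multi-objective search would keep both as candidate trade-offs instead of discarding the one that serves the minority class.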
