A Large-Scale Study of the Impact of Feature Selection Techniques on Defect Classification Models
- 1 May 2017
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE) in 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR)
- p. 146-157
- https://doi.org/10.1109/msr.2017.18
Abstract
The performance of a defect classification model depends on the features that are used to train it. Feature redundancy, correlation, and irrelevance can hinder the performance of a classification model. To mitigate this risk, researchers often use feature selection techniques, which transform or select a subset of the features in order to improve the performance of a classification model. Recent studies compare the impact of different feature selection techniques on the performance of defect classification models. However, these studies compare a limited number of classification techniques and have arrived at contradictory conclusions about the impact of feature selection techniques. To address this limitation, we study 30 feature selection techniques (11 filter-based ranking techniques, six filter-based subset techniques, 12 wrapper-based subset techniques, and a no feature selection configuration) and 21 classification techniques when applied to 18 datasets from the NASA and PROMISE corpora. Our results show that a correlation-based filter-subset feature selection technique with a BestFirst search method outperforms other feature selection techniques across the studied datasets (it outperforms in 70%-87% of the PROMISE-NASA datasets) and across the studied classification techniques (it outperforms for 90% of the techniques). Hence, we recommend the application of such a selection technique when building defect classification models.
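To illustrate what a correlation-based filter-subset technique evaluates, here is a minimal sketch. It is not the authors' implementation. It assumes the commonly cited CFS merit heuristic, k·r_cf / sqrt(k + k(k−1)·r_ff), with absolute Pearson correlation as the correlation measure. It also substitutes a plain greedy forward search for BestFirst, which additionally backtracks. The function names `cfs_merit` and `greedy_cfs` are hypothetical.

```python
import numpy as np


def cfs_merit(X, y, subset):
    """CFS-style merit of a feature subset, using |Pearson r| (an assumption).

    Assumes every column of X and y is non-constant; otherwise the
    correlation is undefined (NaN).
    """
    k = len(subset)
    if k == 0:
        return 0.0
    # Mean absolute feature-class correlation.
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    # Mean absolute feature-feature correlation over distinct pairs.
    if k == 1:
        r_ff = 0.0
    else:
        pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
        r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1]) for a, b in pairs])
    # High relevance raises the merit; redundancy among features lowers it.
    return float(k * r_cf / np.sqrt(k + k * (k - 1) * r_ff))


def greedy_cfs(X, y):
    """Greedy forward search (a simplified stand-in for BestFirst).

    Repeatedly adds the feature that most improves the merit and stops
    as soon as no remaining feature improves it.
    """
    selected, best = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        score, j = max((cfs_merit(X, y, selected + [j]), j) for j in remaining)
        if score <= best:
            break
        selected.append(j)
        remaining.remove(j)
        best = score
    return selected, best
```

Under this heuristic, an exact copy of an already selected feature leaves the merit unchanged, and an irrelevant feature lowers it. That is the sense in which the technique filters out redundant and irrelevant features.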