Abstract
Simulation was used to evaluate the performance of several methods of variable selection in regression modeling: stepwise regression based on partial F-tests, stepwise minimization of Mallows’ Cp statistic and of Schwarz’s Bayesian Information Criterion (BIC), and regression trees constructed with two kinds of pruning. Five to 25 covariates were generated in multivariate clusters, and responses were obtained from an ordinary linear regression model involving three of the covariates; each data set had 50 observations. The regression-tree approaches were markedly inferior to the other methods in discriminating between informative and noninformative covariates, and their predictions of responses in “new” data sets were much more variable and less accurate than those of the other methods. The F-test, Cp, and BIC approaches were similar in their overall frequencies of “correct” decisions about inclusion or exclusion of covariates, with the Cp method leading to the largest models and the BIC method to the smallest. The three methods were also comparable in their ability to predict “new” observations, with perhaps a tendency for the Cp approach to perform relatively poorly for large covariate pools. The ability of all methods to discriminate between informative and noninformative covariates and to predict “new” observations decreased with increasing size of the covariate pool.
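For reference, the two stepwise selection criteria named above take their standard forms; the following is a sketch of the definitions under the usual notation (n observations, p parameters in the candidate model, SSE_p its residual sum of squares, and σ̂² the error-variance estimate from the full model):

$$C_p = \frac{\mathrm{SSE}_p}{\hat{\sigma}^2} - n + 2p, \qquad \mathrm{BIC} = n \ln\!\left(\frac{\mathrm{SSE}_p}{n}\right) + p \ln n.$$

A model is favored when Cp is small (or close to p) and when BIC is minimal. Roughly speaking, Cp carries an effective penalty of 2p per model, while BIC’s penalty of p ln n is heavier whenever n ≥ 8 (here n = 50), which is consistent with the finding that BIC selected the smallest models.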