Prediction error estimation: a comparison of resampling methods
Open Access
- 19 May 2005
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 21 (15), 3301-3307
- https://doi.org/10.1093/bioinformatics/bti499
Abstract
Motivation: In genomic studies, thousands of features are collected on relatively few samples. One of the goals of these studies is to build classifiers to predict the outcome of future observations. There are three inherent steps to this process: feature selection, model selection and prediction assessment. With a focus on prediction assessment, we compare several methods for estimating the ‘true’ prediction error of a prediction model in the presence of feature selection.
Results: For small studies where features are selected from thousands of candidates, the resubstitution and simple split-sample estimates are seriously biased. In these small samples, leave-one-out cross-validation (LOOCV), 10-fold cross-validation (CV) and the .632+ bootstrap have the smallest bias for diagonal discriminant analysis, nearest neighbor and classification trees. LOOCV and 10-fold CV have the smallest bias for linear discriminant analysis. Additionally, LOOCV, 5- and 10-fold CV, and the .632+ bootstrap have the lowest mean square error. The .632+ bootstrap is quite biased in small sample sizes with strong signal-to-noise ratios. Differences in performance among resampling methods are reduced as the number of specimens available increases.
Contact: annette.molinaro@yale.edu
Supplementary Information: A complete compilation of results and R code for simulations and analyses is available in Molinaro et al. (2005) (http://linus.nci.nih.gov/brb/TechReport.htm).
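The contrast the Results draw between resubstitution and cross-validation under feature selection can be illustrated with a small simulation. The following is a minimal sketch, not the authors' R code: it assumes pure-noise data (so the true error of any classifier is 0.5), a simple mean-difference feature filter, and a nearest-centroid classifier as a simplified stand-in for diagonal discriminant analysis. All function names and the settings (20 samples, 500 features, 10 selected) are hypothetical choices made for illustration.

```python
import random


def make_null_data(n, p, seed):
    """Simulate n samples with p pure-noise features and balanced labels.

    The labels are unrelated to the features, so the true error is 0.5.
    """
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]
    return X, y


def class_means(X, y, features):
    """Return, for each class, the mean of every listed feature."""
    means = {}
    for label in (0, 1):
        rows = [x for x, lab in zip(X, y) if lab == label]
        means[label] = [sum(r[j] for r in rows) / len(rows) for j in features]
    return means


def select_features(X, y, k):
    """Keep the k features with the largest absolute class-mean difference."""
    p = len(X[0])
    m = class_means(X, y, range(p))
    ranked = sorted(range(p), key=lambda j: abs(m[0][j] - m[1][j]), reverse=True)
    return ranked[:k]


def fit(X, y, k):
    """Select features, then compute the class centroids on those features."""
    features = select_features(X, y, k)
    return features, class_means(X, y, features)


def predict(model, x):
    """Assign x to the class with the nearer centroid (squared distance)."""
    features, means = model
    dist = {
        lab: sum((x[j] - c) ** 2 for j, c in zip(features, means[lab]))
        for lab in (0, 1)
    }
    return 0 if dist[0] <= dist[1] else 1


def resubstitution_error(X, y, k):
    """Train and test on the same samples, feature selection included."""
    model = fit(X, y, k)
    return sum(predict(model, x) != lab for x, lab in zip(X, y)) / len(y)


def loocv_error(X, y, k):
    """Leave-one-out CV, repeating feature selection inside every fold."""
    errors = 0
    for i in range(len(y)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train, k)  # the held-out sample is never seen
        errors += predict(model, X[i]) != y[i]
    return errors / len(y)


if __name__ == "__main__":
    X, y = make_null_data(n=20, p=500, seed=1)
    print("resubstitution error:", resubstitution_error(X, y, k=10))
    print("LOOCV error:", loocv_error(X, y, k=10))
```

Because the features carry no signal, any estimate well below 0.5 reflects optimism rather than predictive ability. The key design choice in this sketch is that `loocv_error` repeats the feature selection inside each fold, so the held-out sample plays no part in choosing features.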