Artificial Skill due to Predictor Screening

15 January 2009

journal article
Published by American Meteorological Society in Journal of Climate

Vol. 22 (2), 331-345
https://doi.org/10.1175/2008jcli2414.1

Abstract

This paper shows that if predictors are selected preferentially because of their strong correlation with a prediction variable, then standard methods for validating prediction models derived from these predictors will be biased. This bias is demonstrated by screening random numbers and showing that regression models derived from these random numbers have apparent skill, in a cross-validation sense, even though the predictors cannot possibly have the slightest predictive usefulness. This result seemingly implies that random numbers can give useful predictions, since the sample being predicted is separate from the sample used to estimate the regression model. The resolution of this paradox is that, prior to cross validation, all of the data had been used to evaluate correlations for selecting predictors. This situation differs from real-time forecasts in that the future sample is not available for screening. These results clarify the fallacy in assuming that if a model performs well in cross-validation mode, then it will perform well in real-time forecasts. This bias appears to afflict several forecast schemes that have been proposed in the literature, including operational forecasts of Indian monsoon rainfall and number of Atlantic hurricanes. The cross-validated skill of these models probably would not be distinguishable from that of a no-skill model if prior screening were taken into account.
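The screening experiment described in the abstract can be illustrated with a small simulation. The following is a minimal sketch and not the authors' code: the function names (`top_k_by_correlation`, `loo_cv_skill`, `run_experiment`), the sample size, the number of candidate predictors, and the number of retained predictors are all assumptions chosen for illustration. It compares leave-one-out cross validation when predictors are screened once on all of the data with the same procedure when screening is repeated inside each fold, so that the withheld sample never influences predictor selection.

```python
import numpy as np


def top_k_by_correlation(X, y, k):
    """Return indices of the k columns of X with the largest |correlation| with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(corr))[:k]


def loo_cv_skill(X, y, k, screen_inside_fold):
    """Correlation between leave-one-out predictions and y for a k-predictor regression.

    If screen_inside_fold is False, predictors are selected once using ALL samples
    (the biased procedure). If True, selection is redone on each training subset.
    """
    n = len(y)
    preds = np.empty(n)
    global_idx = top_k_by_correlation(X, y, k)  # screening that sees every sample
    for i in range(n):
        train = np.arange(n) != i
        if screen_inside_fold:
            idx = top_k_by_correlation(X[train], y[train], k)
        else:
            idx = global_idx
        # Ordinary least squares with an intercept, fitted on the training subset only.
        A = np.column_stack([np.ones(train.sum()), X[train][:, idx]])
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        preds[i] = np.concatenate(([1.0], X[i, idx])) @ coef
    return np.corrcoef(preds, y)[0, 1]


def run_experiment(seed, n_samples=40, n_candidates=1000, k=5):
    """Screen pure random numbers and return (prior-screening skill, in-fold-screening skill)."""
    rng = np.random.default_rng(seed)
    y = rng.standard_normal(n_samples)                   # "predictand": pure noise
    X = rng.standard_normal((n_samples, n_candidates))   # candidate predictors: pure noise
    biased = loo_cv_skill(X, y, k, screen_inside_fold=False)
    honest = loo_cv_skill(X, y, k, screen_inside_fold=True)
    return biased, honest


biased, honest = run_experiment(seed=0)
print(f"cross-validated correlation, screening on all data: {biased:.2f}")
print(f"cross-validated correlation, screening inside folds: {honest:.2f}")
```

Because every predictor here is random, any positive cross-validated correlation in the first case comes from the prior screening alone. Repeating the selection inside each fold mimics a real-time forecast, in which the future sample is not available for screening.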

Keywords

FORECASTING TECHNIQUES

