Abstract
Studies on forecast evaluation often rely on estimating the limiting observed frequencies conditioned on specific forecast probabilities (the reliability diagram or calibration function). Statistical estimates of the calibration function are necessarily based on limited amounts of data and therefore contain residual errors. Although errors and variations of calibration function estimates have been studied previously, they are often either assumed to be small or unimportant, or ignored altogether. It is demonstrated how these errors can be described in terms of bias and variance, two concepts well known in the statistics literature. Bias and variance adversely affect estimates of the reliability and sharpness terms of the Brier score, the recalibration of forecasts, and the assessment of forecast reliability through reliability diagram plots. Ways to communicate and appreciate these errors are presented. It is argued that these errors can become quite substantial if individual sample points have too large an influence on the estimate, which can be avoided by using regularization techniques. As an illustration, it is discussed how to choose an appropriate bin size in the binning and counting method, and an appropriate bandwidth parameter for kernel estimates.
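To make the two estimators named above concrete, the following is a minimal sketch, not the paper's own implementation. It assumes binary outcomes, equal-width bins on [0, 1], and a Gaussian-kernel Nadaraya-Watson smoother as one possible kernel estimate. The function names `binned_calibration` and `kernel_calibration` and the simulated data are hypothetical and chosen only for illustration.

```python
import numpy as np


def binned_calibration(p, y, n_bins=10):
    """Binning and counting: observed frequency of y within each forecast bin.

    Assumes equal-width bins on [0, 1]; empty bins are reported as NaN.
    """
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so indices run 0..n_bins-1 and p == 1.0 lands in the last bin.
    idx = np.digitize(p, edges[1:-1])
    counts = np.bincount(idx, minlength=n_bins)
    sums = np.bincount(idx, weights=y, minlength=n_bins)
    with np.errstate(invalid="ignore", divide="ignore"):
        freq = np.where(counts > 0, sums / counts, np.nan)
    return edges, counts, freq


def kernel_calibration(p, y, grid, bandwidth=0.05):
    """Kernel estimate (assumed form: Nadaraya-Watson with a Gaussian kernel).

    Note: if no forecast lies near a grid point, the weights can underflow
    to zero and the result there is NaN.
    """
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Weight of every sample point (columns) at every grid point (rows).
    w = np.exp(-0.5 * ((grid[:, None] - p[None, :]) / bandwidth) ** 2)
    return (w @ y) / w.sum(axis=1)


# Illustrative data: perfectly reliable forecasts, so the true calibration
# function is the diagonal and any deviation is estimation error.
rng = np.random.default_rng(0)
p_sim = rng.uniform(0.0, 1.0, size=500)
y_sim = (rng.uniform(0.0, 1.0, size=500) < p_sim).astype(float)

# Few wide bins: each estimate averages many points (lower variance, more bias).
_, counts_coarse, freq_coarse = binned_calibration(p_sim, y_sim, n_bins=5)
# Many narrow bins: each estimate rests on few points (higher variance).
_, counts_fine, freq_fine = binned_calibration(p_sim, y_sim, n_bins=100)

grid = np.linspace(0.05, 0.95, 10)
smooth = kernel_calibration(p_sim, y_sim, grid, bandwidth=0.05)
print(counts_coarse.min(), counts_fine.min())
print(np.round(smooth, 2))
```

In this sketch the number of bins and the bandwidth play the role of the regularization parameter: narrower bins or a smaller bandwidth give each sample point more influence on the estimate, while wider ones average over more points and smooth out real structure.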
