A Debiased Ranked Probability Skill Score to Evaluate Probabilistic Ensemble Forecasts with Small Ensemble Sizes

Abstract
The ranked probability skill score (RPSS) is a widely used measure to quantify the skill of ensemble forecasts. The underlying score is defined by the quadratic norm and is comparable to the mean squared error (mse) but it is applied in probability space. It is sensitive to the shape and the shift of the predicted probability distributions. However, the RPSS shows a negative bias for ensemble systems with small ensemble size, as recently shown. Here, two strategies are explored to tackle this flaw of the RPSS. First, the RPSS is examined for different norms L (RPSSL). It is shown that the RPSSL=1 based on the absolute rather than the squared difference between forecasted and observed cumulative probability distribution is unbiased; RPSSL defined with higher-order norms show a negative bias. However, the RPSSL=1 is not strictly proper in a statistical sense. A second approach is then investigated, which is based on the quadratic norm but with sampling errors in climatological probabilities considered in the reference forecasts. This technique is based on strictly proper scores and results in an unbiased skill score, which is denoted as the debiased ranked probability skill score (RPSSD) hereafter. Both newly defined skill scores are independent of the ensemble size, whereas the associated confidence intervals are a function of the ensemble size and the number of forecasts. The RPSSL=1 and the RPSSD are then applied to the winter mean [December–January–February (DJF)] near-surface temperature predictions of the ECMWF Seasonal Forecast System 2. The overall structures of the RPSSL=1 and the RPSSD are more consistent and largely independent of the ensemble size, unlike the RPSSL=2. Furthermore, the minimum ensemble size required to predict a climate anomaly given a known signal-to-noise ratio is determined by employing the new skill scores. For a hypothetical setup comparable to the ECMWF hindcast system (40 members and 15 hindcast years), statistically significant skill scores were only found for a signal-to-noise ratio larger than ∼0.3.