I’m pleased to have had a chance to review this new paper just published in the Journal of Climate:
An Evaluation of Decadal Probability Forecasts from State-of-the-Art Climate Models
Suckling, Emma B., and Leonard A. Smith, 2013: An Evaluation of Decadal Probability Forecasts from State-of-the-Art Climate Models. J. Climate, 26, 9334–9347. doi: http://dx.doi.org/10.1175/JCLI-D-12-00485.1
The lead author, Emma Suckling, was kind enough to provide me with a copy for reading. The paper evaluates the models of the EU-based ENSEMBLES project by running hindcasts and measuring their errors. I was struck by the fact that in figure 2 below, there was broad disagreement between four models, with one having errors as large as 4.5 a decade out.
The conclusion rather says it all: these models just don’t have the physical processes of the dynamic and complex Earth captured yet, hence the photo I included above.
While state-of-the-art models of Earth’s climate system have improved tremendously over the last 20 years, nontrivial structural flaws still hinder their ability to forecast the decadal dynamics of the Earth system realistically. Contrasting the skill of these models not only with each other but also with empirical models can reveal the space and time scales on which simulation models exploit their physical basis effectively and quantify their ability to add information to operational forecasts. The skill of decadal probabilistic hindcasts for annual global-mean and regional-mean temperatures from the EU Ensemble-Based Predictions of Climate Changes and Their Impacts (ENSEMBLES) project is contrasted with several empirical models. Both the ENSEMBLES models and a “dynamic climatology” empirical model show probabilistic skill above that of a static climatology for global-mean temperature. The dynamic climatology model, however, often outperforms the ENSEMBLES models. The fact that empirical models display skill similar to that of today’s state-of-the-art simulation models suggests that empirical forecasts can improve decadal forecasts for climate services, just as in weather, medium-range, and seasonal forecasting. It is suggested that the direct comparison of simulation models with empirical models becomes a regular component of large model forecast evaluations. Doing so would clarify the extent to which state-of-the-art simulation models provide information beyond that available from simpler empirical models and clarify current limitations in using simulation forecasting for decision support. Ultimately, the skill of simulation models based on physical principles is expected to surpass that of empirical models in a changing climate; their direct comparison provides information on progress toward that goal, which is not available in model–model intercomparisons.
State-of-the-art dynamical simulation models of Earth’s climate system are often used to make probabilistic predictions about the future climate and related phenomena with the aim of providing useful information for decision support (Anderson et al. 1999; Met Office 2011; Weigel and Bowler 2009; Alessandri et al. 2011; Hagedorn et al. 2005; Hagedorn and Smith 2009; Meehl et al. 2009; Doblas-Reyes et al. 2010, 2011; Solomon et al. 2007; Reifen and Toumi 2009). Evaluating the performance of such predictions from a model or set of models is crucial not only in terms of making scientific progress but also in determining how much information may be available to decision makers via climate services. It is desirable to establish a robust and transparent approach to forecast evaluation, for the purpose of examining the extent to which today’s best available models are adequate over the spatial and temporal scales of interest for the task at hand. A useful reality check is provided by comparing the simulation models not only with other simulation models but also with empirical models that do not include direct physical simulation.
Decadal prediction brings several challenges for the design of ensemble experiments and their evaluation (Meehl et al. 2009; van Oldenborgh et al. 2012; Doblas-Reyes et al. 2010; Fildes and Kourentzes 2011; Doblas-Reyes et al. 2011); the analysis of decadal prediction systems will form a significant focus of the Intergovernmental Panel on Climate Change (IPCC) Fifth Assessment Report (AR5). Decadal forecasts are of particular interest both for information on the impacts over the next 10 years, as well as from the perspective of climate model evaluation. Hindcast experiments over an archive of historical observations allow approaches from empirical forecasting to be used for model evaluation. Such approaches can aid in the evaluation of forecasts from simulation models (Fildes and Kourentzes 2011; van Oldenborgh et al. 2012) and potentially increase the practical value of such forecasts through blending forecasts from simulation models with forecasts from empirical models that do not include direct physical simulation (Bröcker and Smith 2008).
This paper contrasts the performance of decadal probability forecasts from simulation models with that of empirical models constructed from the record of available observations. Empirical models are unlikely to yield realistic forecasts for the future once climate change moves the Earth system away from the conditions observed in the past. A simulation model, which aims to capture the relevant physical processes and feedbacks, is expected to be at least competitive with the empirical model. If this is not the case in the recent past, then it is reasonable to demand evidence that those particular simulation models are likely to be more informative than empirical models in forecasting the near future.
A set of decadal simulations from the Ensemble-Based Predictions of Climate Changes and Their Impacts (ENSEMBLES) experiment (Hewitt and Griggs 2004; Doblas-Reyes et al. 2010), a precursor to phase 5 of the Coupled Model Intercomparison Project (CMIP5) decadal simulations (Taylor et al. 2009), is considered. The ENSEMBLES probability hindcasts are contrasted with forecasts from empirical models of the static climatology, persistence, and a “dynamic climatology” model developed for evaluating other dynamical systems (Smith 1997; Binter 2012). Ensemble members are transformed into probabilistic forecasts via kernel dressing (Bröcker and Smith 2008); their quality is quantified according to several proper scoring rules (Bröcker and Smith 2006). The ENSEMBLES models do not demonstrate significantly greater skill than that of an empirical dynamic climatology model either for global-mean temperature or for the land-based Giorgi region temperatures (Giorgi 2002).
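To make the idea of kernel dressing concrete, here is a minimal sketch, not the paper's implementation. It assumes equal-weight Gaussian kernels centred on each ensemble member, with a hand-picked width `sigma` and optional `offset`; in practice such parameters would be fitted to the hindcast archive. The function names and the three-member ensemble values are made up for illustration. The ignorance score is the negative base-2 logarithm of the forecast density at the observed outcome, so lower is better.

```python
import math

def kernel_dressed_pdf(y, ensemble, sigma, offset=0.0):
    """Density at y of an equal-weight Gaussian mixture, one kernel per member.

    Assumption: Gaussian kernels with a common width `sigma` and shift `offset`.
    """
    norm = sigma * math.sqrt(2.0 * math.pi)
    total = 0.0
    for member in ensemble:
        z = (y - (member + offset)) / sigma
        total += math.exp(-0.5 * z * z) / norm
    return total / len(ensemble)

def ignorance(y_obs, ensemble, sigma, offset=0.0):
    """Ignorance score in bits: -log2 of the density placed on the outcome."""
    return -math.log2(kernel_dressed_pdf(y_obs, ensemble, sigma, offset))

# Hypothetical three-member ensemble of temperature anomalies (made-up values).
ensemble = [0.30, 0.42, 0.55]

# An outcome inside the ensemble spread scores better (lower ignorance)
# than an outcome far outside it.
score_inside = ignorance(0.40, ensemble, sigma=0.1)
score_outside = ignorance(1.00, ensemble, sigma=0.1)
print(score_inside, score_outside)
```

With only three members, the choice of kernel width largely determines how much probability is left for outcomes outside the ensemble spread.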
It is suggested that the direct comparison of simulation models with empirical models become a regular component of large model forecast evaluations. The methodology is easily adapted to other climate forecasting experiments and can provide a useful guide to decision makers about whether state-of-the-art forecasts from simulation models provide additional information to that available from easily constructed empirical models.
An overview of the ENSEMBLES models used for decadal probabilistic forecasting is discussed in section 2. The appropriate choice of empirical model for probabilistic decadal predictions forms the basis of section 3, while section 4 contains details of the evaluation framework and the transformation of ensembles into probabilistic forecast distributions. The performance of the ENSEMBLES decadal hindcast simulations is presented in section 5 and compared to that of the empirical models. Section 6 then provides a summary of conclusions and a discussion of their implications. The supplementary material includes graphics for models not shown in the main text, comparisons with alternative empirical models, results for regional forecasts, and the application of alternative (proper) skill scores. The basic conclusion is relatively robust: the empirical dynamic climatology (DC) model often outperforms the simulation models in terms of probability forecasting of temperature.
The quality of decadal probability forecasts from the ENSEMBLES simulation models has been compared with that of reference forecasts from several empirical models. In general, the stream 2 ENSEMBLES simulation models demonstrate less skill than the empirical DC model across the range of lead times from 1 to 10 years. The result holds for a variety of proper scoring rules including ignorance (Good 1952), the proper linear score (PL) (Jolliffe and Stephenson 2003), and the continuous ranked probability score (CRPS) (Bröcker and Smith 2006). A similar result holds on smaller spatial scales for the Giorgi regions (see supplementary material). These new results for probability forecasts are consistent with evaluations of root-mean-square errors of decadal simulation models with other reference point forecasts (Fildes and Kourentzes 2011; van Oldenborgh et al. 2012; Weisheimer et al. 2009). The DC probability forecasts often place up to 4 bits more information (or 2^4 = 16 times more probability mass) on the observed outcome than the ENSEMBLES simulation models.
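The link between bits and probability mass follows from the definition of the ignorance score. In the notation below (mine, not the paper's), p_DC and p_ENS are the forecast densities that the two models place on the observed outcome Y:

```latex
\mathrm{IGN}(p, Y) = -\log_2 p(Y), \qquad
\mathrm{IGN}_{\mathrm{ENS}} - \mathrm{IGN}_{\mathrm{DC}}
  = \log_2 \frac{p_{\mathrm{DC}}(Y)}{p_{\mathrm{ENS}}(Y)} = 4
\;\Longrightarrow\;
\frac{p_{\mathrm{DC}}(Y)}{p_{\mathrm{ENS}}(Y)} = 2^{4} = 16 .
```

Each bit of ignorance difference is a factor of two in the probability placed on what actually happened, so a 4-bit advantage means sixteen times as much.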
In the context of climate services, the comparable skill of simulation models and empirical models suggests that the empirical models will be of value for blending with simulation model ensembles; this is already done in ensemble forecasts for the medium range and on seasonal lead times. It also calls into question the extent to which current simulation models successfully capture the physics required for realistic simulation of the Earth system and can thereby be expected to provide robust, reliable predictions (and, of course, to outperform empirical models) on longer time scales.
The evaluation and comparison of decadal forecasts will always be hindered by the relatively small samples involved when contrasted with the case of weather forecasts; the decadal forecast–outcome archive currently considered is only half a century in duration. Advances both in modeling and in observation, as well as changes in Earth’s climate, are likely to mean the relevant forecast–outcome archive will remain small. One improvement that could be made to clarify the skill of the simulation models is to improve the experimental design of hindcasts: in particular, to increase the ensemble size used. For the ENSEMBLES models, each simulation ensemble consisted of only three members launched at 5-year intervals. Larger ensembles and more frequent forecast launch dates can ease the evaluation of skill without waiting for the forecast–outcome archive to grow larger.
The analysis of hindcasts can never be interpreted as an out-of-sample evaluation. The mathematical structure of simulation models, as well as parameterizations and parameter values, has been developed with knowledge of the historical data. Empirical models with a simple mathematical structure suffer less from this effect. Prelaunch empirical models based on the DC structure and using only observations before the forecast launch date also outperform the ENSEMBLES simulation models. This result is robust over a range of ensemble interpretation parameters (i.e., variations in the kernel width used). Both prelaunch trend models and persistence models are less skillful than the DC models considered.
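The excerpt does not spell out how the DC model is built, so the following is only one plausible reading, offered as an assumption: start from the observed value at launch and add each historical change over the same lead time, using only data available before launch. The function name `prelaunch_dc_ensemble` and the observation values are hypothetical.

```python
def prelaunch_dc_ensemble(obs, launch_index, lead):
    """Sketch of a 'prelaunch' dynamic-climatology ensemble (assumed construction).

    obs          : list of annual values, oldest first
    launch_index : index of the last observation known at launch
    lead         : forecast lead time in years

    Each member is the launch value plus one historical `lead`-year change,
    restricted to changes that were fully observed by the launch date.
    """
    last_start = launch_index - lead
    if last_start < 0:
        raise ValueError("not enough history before launch for this lead time")
    diffs = [obs[i + lead] - obs[i] for i in range(last_start + 1)]
    return [obs[launch_index] + d for d in diffs]

# Made-up annual temperature anomalies, for illustration only.
obs = [0.0, 0.1, 0.3, 0.2, 0.5, 0.6]

# Forecast two years ahead from the fifth observation (index 4).
members = prelaunch_dc_ensemble(obs, launch_index=4, lead=2)
print(members)
```

Under this reading the ensemble grows with the length of the record rather than with computing budget, which is one reason such a model is cheap to use as a benchmark.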
The comparison of near-term climate probability forecasts from Earth simulation models with those from dynamic climatology empirical models provides a useful benchmark as the simulation models improve in the future. The blending (Bröcker and Smith 2008) of simulation models and empirical models is likely to provide more skillful probability forecasts in climate services, for both policy and adaptation decisions. In addition, clear communication of the (limited) expectations for skillful decadal forecasts can avoid casting doubt on well-founded physical understanding of the radiative response to increasing carbon dioxide concentration in Earth’s atmosphere. Finally, these comparisons cast a sharp light on distinguishing whether current limitations in estimating the skill of a model arise from external factors like the size of the forecast–outcome archive or from the experimental design. Such insights are a valuable product of ENSEMBLES and will contribute to the experimental design of future ensemble decadal prediction systems.
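As a rough illustration of what blending could look like, the sketch below mixes two forecast densities linearly and picks the weight that minimises mean ignorance over an archive. The linear mixture, the grid search, and all numbers are assumptions made for illustration; they are not taken from the paper or from Bröcker and Smith (2008).

```python
import math

def blend(p_sim, p_emp, alpha):
    """Linear mixture of two forecast densities evaluated at the same outcome."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * p_sim + (1.0 - alpha) * p_emp

def mean_ignorance(densities):
    """Average ignorance in bits over a forecast-outcome archive."""
    return -sum(math.log2(p) for p in densities) / len(densities)

def best_alpha(sim, emp, steps=100):
    """Grid search for the mixture weight that minimises mean ignorance."""
    def score(alpha):
        return mean_ignorance([blend(s, e, alpha) for s, e in zip(sim, emp)])
    return min((i / steps for i in range(steps + 1)), key=score)

# Made-up densities that each model placed on four observed outcomes.
sim_density = [2.0, 0.05, 1.5, 0.10]   # sharp, but sometimes badly wrong
emp_density = [0.8, 0.90, 0.7, 1.00]   # broader, more consistent

alpha = best_alpha(sim_density, emp_density)
blended_score = mean_ignorance(
    [blend(s, e, alpha) for s, e in zip(sim_density, emp_density)]
)
print(alpha, blended_score)
```

Because the weights 0 and 1 are both in the search grid, the blend can never score worse in-sample than either model alone. Whether that advantage survives out of sample depends on the size of the archive, which is the limitation the authors stress.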