Guest Essay by Geoffrey H Sherrington
See Part One of June 20th at
In Part One, there was a challenge to find missing values for a graph that vaguely resembled the spaghetti graph of comparisons of GCMs. Here is the final result.
See Sherrington_PartTwo (PDF) for pictures & tables.
Each series follows the score of a football team over a season of 23 weeks. There are 18 teams (16 would have made a neater schedule) and each team plays all of the others. The insertion of some rest weeks around time slots 11, 12 and 13 leads to a smaller average climb over those few weeks, since fewer points were allocated. Win = 4 points, draw = 2 points, loss = 0 points. There has to be a result, so there are no artificial interpolations in this number set. The data are not strictly independent, because in a given week pairs of teams play in a related way, so that each pair's combined outcome can only be 4 points. Because there is some system in the numbers, there are ways to attack a solution that would not be available with random numbers.
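The pairing constraint can be sketched in a few lines of Python. The team count and scoring rules are from the text; the `play_round` helper and the particular round draw are invented for illustration:

```python
import random

TEAMS = 18  # from the example: 18 teams; win = 4, draw = 2 each, loss = 0

def play_round(scores, pairs):
    """Allocate one week's points; each pair's combined outcome is always 4."""
    for a, b in pairs:
        pts_a, pts_b = random.choice([(4, 0), (2, 2), (0, 4)])
        scores[a] += pts_a
        scores[b] += pts_b

scores = [0] * TEAMS
pairs = [(i, i + TEAMS // 2) for i in range(TEAMS // 2)]  # one invented round draw
play_round(scores, pairs)

# However the games go, the points allocated this round are fixed: 9 games * 4 = 36
assert sum(scores) == 4 * len(pairs)
```

Whatever randomness is in the individual results, the weekly total is fully determined in advance, which is the "constrained, not random" property the example relies on.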
This set was chosen because the numbers were established in advance by a process that is not random, but merely constrained. (There are related sets from other years).
The exercise was done to see the responses of WUWT readers. It’s not easy to choose a number set to demonstrate what follows, without giving the game away. At first I thought that the solution, going from weeks 20 to 23, was difficult to impossible. However, the correct solution was cracked by Arnost, June 20 at 4.42 am. Congrats on a fine job.
Was the solution simply numeric, or was probability involved? I think the latter, as there appears to be no way to assign a 2-point draw in the last time slot, to team 14 rather than to team 15. Arnost’s explanation at June 20 6.56 pm mentions distributions, so probability was involved. Either that, or the scant information in his reply is a cover-up for chat with a moderator who knew the answer or Arnost guessed the data source, which is on the Internet (just joking).
Now to the hard part. How to compare a sporting score layout with an ensemble of GCMs? We will assume that for the GCM comparison, X-axis is time and Y-axis is temperature anomaly, as shown on the spaghetti graph.
I’m mainly interested in error, which by training I split into two types, “bias” and “precision”. Bias measures the displacement of a number from the average of comparable estimates. Precision relates to the statistical probability that the number is firm and does not float about on replication. In the football example, the numbers have no bias or precision errors; they are as they are. Yet, on the other hand, knowing the rules in advance, one could say that the outcome of the top team playing the lowest team is likely to be +4 for the former. This invokes probability, and probability invokes statistics. People bet on these games. So how to analyse?
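As a minimal sketch of the bias/precision distinction, with invented replicate measurements of a quantity whose true value is taken as 10.0:

```python
from statistics import mean, stdev

def bias_and_precision(estimates, true_value):
    """Bias: displacement of the average estimate from the known value.
    Precision: how much the estimates float about on replication."""
    return mean(estimates) - true_value, stdev(estimates)

# Invented replicates: the method reads a little high, with modest scatter
b, p = bias_and_precision([10.4, 10.6, 10.5, 10.3], 10.0)
# b is about 0.45 (bias); p is about 0.13 (precision, as scatter)
```

Note that this needs a known true value for the bias term, a point the essay returns to later: an ensemble average alone cannot supply it.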
Take Series 1 and fit a linear least-squares line to its progress over 23 weeks. Constrain the Y intercept to 0 and derive y = 2.75x with R^2 = 0.9793. Do these numbers mean anything physical? Probably not. The goodness of fit, and hence R^2, here depends on the sequence in which the games are played. If all the weaker teams were met before the stronger teams, we might have a curved response, but fitting a polynomial would not show much of interest either. I have seen it asserted (but do not think it likely) that comparisons of GCMs have sometimes involved just these approaches. They are not valid.
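A fit of that kind can be reproduced directly. The zero-intercept slope is sum(xy)/sum(x^2); the weekly results below are invented for illustration, not Series 1 itself, so the slope and R^2 will differ from the 2.75 and 0.9793 quoted above:

```python
def fit_through_origin(x, y):
    """Least-squares fit constrained to y = b*x: b = sum(xy) / sum(x^2)."""
    b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    ss_res = sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum(yi * yi for yi in y)  # uncentred total, consistent with zero intercept
    return b, 1 - ss_res / ss_tot

# Invented weekly results (4 = win, 2 = draw, 0 = loss or rest), accumulated
results = [4, 2, 4, 4, 0, 4, 2, 4, 4, 4, 0, 0, 2, 4, 4, 4, 2, 4, 4, 4, 2, 4, 4]
score, cum = [], 0
for r in results:
    cum += r
    score.append(cum)
weeks = list(range(1, len(results) + 1))
slope, r2 = fit_through_origin(weeks, score)
```

Any cumulative score series of this kind will give a high R^2 almost automatically, which is the point: the fit statistic reflects the accumulation, not anything physical.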
So, imagine that the football scores are GCM results. How does one compare the results of an ensemble? The football scores have an overall average that was plotted on the demo graph. But that average was affected by the variable on the X-axis when rest weeks were inserted. Is there a parallel in GCMs? Maybe random or unpredictable future events, like a large volcanic eruption, provide one. Different modellers might program such an event to happen at different times, so the relation between the results of runs is disturbed on the X-axis. Or, if effects such as volcanoes are not incorporated, the times at which step changes or inflexions pass through the computation might differ from modeller to modeller, with a similar outcome. We are more used to studying Y-axis values than X-axis values.
For the football example, the average curve can be computed, a priori – without a game being played.
One needs to know only the points available each week divided by the number of teams playing. These are in the pre-printed games schedule. The curve of the average is constrained. Similarly, if the global temperature increase was constrained by modellers – by unconscious choice, by deliberate choice, or because of guidelines – to range between 0 and 5 deg C/century, average 2.5, the average curve could be found without doing runs by dividing the sum of constrained temperatures at any time by the number of modellers inputting it and reporting it.
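A minimal sketch of that a-priori average, assuming 9 games in a full week (18 teams, 4 points per game) and an invented lighter schedule of 6 games in each rest week:

```python
TEAMS = 18

def average_curve(games_per_week):
    """Average cumulative score per team, known before a single game is played:
    each game releases exactly 4 points, shared among all TEAMS on average."""
    avg, total = [], 0.0
    for games in games_per_week:
        total += 4 * games / TEAMS  # points released that week, per team
        avg.append(total)
    return avg

# Full rounds, then a hypothetical lighter rest period at weeks 11-13
schedule = [9] * 10 + [6, 6, 6] + [9] * 10
curve = average_curve(schedule)
# A full week lifts every team's average by 4 * 9 / 18 = 2 points;
# a rest week lifts it by only 4 * 6 / 18, giving the smaller climb noted earlier
```

No game result enters the calculation at all; only the schedule and the scoring rule do. That is the sense in which the average is constrained rather than discovered.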
Do such constraints enter GCM comparisons? On the Y-axis, the points available each period can be compared to the net energy balance of the globe (energy in – energy out) from time to time, which is translated to a temperature anomaly. On the X-axis, time is time unless the modellers place discontinuities at different times as discussed. The analogy would be valid, for example, if the net energy balance was constant; and differences between GCMs were the result of noise induced by the timing of the modelling assumptions. With thanks to Bob Tisdale, we can quote from an old and simple set of GIS FAQs –
Control runs establish the basic climate of the model. Control runs are long integrations where the model input forcings (solar irradiance, sulfates, ozone, greenhouse gases) are held constant and are not allowed to evolve with time. Usually the input forcings are held fixed either at present-day values (i.e., for year 2000, or 2000 Control Run) or at pre-industrial values (i.e., for 1870, or 1870 Control Run). Note that in this context, “fixed” can have two different meanings. The solar forcing values are held fixed as a constant, non-varying number. The sulfate, ozone and greenhouse gas values, however, are fixed to continually cycle over the same 12-month input dataset every year. The CCSM is then run for an extended period of model time, 100s of years, up to about 1000 years, until the system is close to equilibrium (i.e., with only minor drifts in deep ocean temperature, surface temperature, top-of-the-atmosphere fluxes, etc).
Climate models are an imperfect representation of the earth’s climate system and climate modelers employ a technique called ensembling to capture the range of possible climate states. A climate model run ensemble consists of two or more climate model runs made with the exact same climate model, using the exact same boundary forcings, where the only difference between the runs is the initial conditions. An individual simulation within a climate model run ensemble is referred to as an ensemble member. The different initial conditions result in different simulations for each of the ensemble members due to the nonlinearity of the climate model system. Essentially, the earth’s climate can be considered to be a special ensemble that consists of only one member. Averaging over a multi-member ensemble of model climate runs gives a measure of the average model response to the forcings imposed on the model. http://web.archive.org/web/20090901051609/http://www.gisclimatechange.org/runSetsHelp.html
I stress that these control runs are constrained, as by the requirement to approach equilibrium. I don’t know if global climate is ever at equilibrium.
So, we can pose this question: given that GCMs are constrained to a degree, does that constraint carry enough weight to influence the estimation of an ensemble average? Put another way, can we estimate the average curve of an ensemble from its constraints without even doing a run? Readers’ views are welcomed.
Steven Mosher has already commented “It matters little whether the ensemble mean has statistical meaning because it has practical skill, a skill that is better than any individual model. The reason why the ensemble has more skill is simple. The ensemble of models reduces weather noise and structural uncertainty.” Here, I have not even got into statistics. I’m just playing with numbers and wondering aloud.
Here’s how I’d prefer to approach some of the statistics for GCM modellers. Keep in mind that I have never done a run, and so am likely to be naïve. First, let’s consider one group of modellers. The group will (presumably) do many model runs. Some of these will fail for known reasons, like typos or wrong factors in the inputs. We can exclude runs where the results are obviously wrong, even to an uninformed observer. After a while, though, a single model lab will have acquired a number of runs that look OK. These are what should be submitted for ensemble comparison: all of them, not a single run picked for its conformity with other modellers’ results or for any other subjective reason. Do modellers swap notes? I’ve not seen this denied. In any event, there is an abundance of pressure to produce a result that is in line with past predictions and future wishes.
In my football example, the series were numbered 1-18. This was deliberate: they were numbered from highest final score to lowest, so that ambiguities like the one Arnost encountered above were given a helping hand. Some solvers might have been influenced by assuming (correctly) that the numbering was a leak to help them to the right result, much as swapping results would be.
If a modeller submitted all plausible run results, then a within-lab variance could be calculated via normal statistical methods. If all modellers contributed a number of runs, then the variance of each modeller could be combined by the usual statistical methods of calculating propagation of errors for the types of distributions derived from them. This would give a between-modeller variance, from which a form of bias can be derived. This conventional approach would remove some subjectivity that must be involved, to its detriment, if the modeller chooses but one run. It is likely that it would also broaden the final error bands. Both precision and bias are addressed this way.
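One conventional way to split the spread, sketched with invented numbers (three hypothetical labs, three plausible runs each), in the spirit of a one-way analysis of variance:

```python
from statistics import mean, pvariance

def spread_components(runs_by_lab):
    """Split ensemble spread into a within-lab component (precision-like)
    and a between-lab component (from which a form of bias can be read)."""
    lab_means = [mean(runs) for runs in runs_by_lab]
    within = mean(pvariance(runs) for runs in runs_by_lab)
    between = pvariance(lab_means)
    return within, between

# Invented anomaly projections (deg C) from three hypothetical labs
labs = [[2.1, 2.3, 2.2], [2.8, 2.9, 2.7], [1.9, 2.0, 2.1]]
w, b = spread_components(labs)
# Here the labs disagree with each other far more than their own runs scatter
```

With only one submitted run per lab, the within-lab term is invisible and the subjective choice of which run to submit is baked into the between-lab term, which is the objection raised above.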
However, the true measure of bias requires a known correct answer, not just an averaged answer, so that has to be obtained either by hindcasting or by waiting a few years after a comparison to re-examine the results against recent knowns.
The football example does not have bias, so I can’t use it here to show the point. But the statistical part I’ve just discussed has to be viewed in terms of the simple outcomes that the football model produced, especially the ability to give an average without even doing a run. If there is a component of that type of numerical outcome in the ensemble comparisons, then the average is a meaningless entity, skill or no skill.
Finally, on to projecting. The exercise was to take 20 points in a series and project to 23 points. The projection is constrained, as you can deduce. In terms of GCMs, I do not know whether they advance step by step, but if they do, then they are somewhat similar to the football exercise: each new point you calculate is pegged to the one before it, as it would be in a hypothetical serial GCM. (Here, I confess to not having read enough of the background to the spaghetti graph.)
I’m used to the lines being normalised at a particular origin, so they fan out from a point at the start, or somewhere. If they don’t come together at some stage, it is hard to know how to compare them. A bias in an initial condition will affect bias of subsequent points and their relation to other runs. This places much stress on the initial condition. The football analogy starts all runs at zero and avoids this problem. It is possible to project forward from a point by deriving a statistical probability from previous points, as shown by some reader solutions of the football example. It is also possible to project forward by doing a numerical analysis of the parameters that are used to create a point. The methods of error analysis need not be similar for these two approaches.
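A step-by-step projection pegged to the previous point might look like the sketch below; the history series and the set of allowed increments are hypothetical, standing in for the constraint that a cumulative score can only grow by 0, 2 or 4 each week:

```python
import random

def project(series, steps, allowed_increments=(0, 2, 4)):
    """Serial projection: each new point is pegged to the previous one,
    constrained to the increments the scoring rules permit."""
    out = list(series)
    for _ in range(steps):
        out.append(out[-1] + random.choice(allowed_increments))
    return out

random.seed(0)  # for a repeatable illustration
history = [0, 4, 6, 10, 12, 16, 20, 22]  # invented 8-week cumulative score
full = project(history, 3)  # extend by three more weeks
```

A projection built this way inherits every constraint of the series that feeds it, which is why its error analysis need not resemble that of a projection built from the underlying parameters.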
In concluding, it is possible to find number sets generated in different ways that are quite dissimilar in the ways in which error estimation, averaging and projection are applicable. It can be difficult to discern the correct choice of method. With GCM ensembles, there is an evident problem of divergence of models from measured values over the past decade or more. Maybe some of the divergence comes from processes shown by the football example. There is no refuge in the Central Limit theorem or the so-called Law of Large Numbers. They are not applicable in the cases I have seen.
It has long been known that a constrained input can lead to constrained answers.
What I am missing is whether the constraints used in GCM models constrain the answers excessively. I have seen no mention of this in my limited reading. Therefore, I might well be wrong to raise these matters in this way. I hope that readers will show that I am wrong, because if I am right, then there are substantial problems.