Comparing Climate Models – Part Two

Guest Essay by Geoffrey H Sherrington

See Part One of June 20th at

http://wattsupwiththat.com/2013/06/20/comparing-climate-models-part-one/

In Part One, there was a challenge to find missing values for a graph that vaguely resembled the spaghetti graph of comparisons of GCMs. Here is the final result.

RESULTS

See Sherrington_PartTwo (PDF) for pictures & tables.

Each series follows the score of a football team over a season of 23 weeks. There are 18 teams (16 would make the schedule neater) and each team plays all the other teams. The insertion of some rest weeks around time slots 11, 12 and 13 leads to a smaller average climb for those few weeks, since fewer points were allocated. Win = 4 points, draw = 2 points, loss = 0 points. There has to be a result, so there are no artificial interpolations in this number set. The data are not strictly independent, because in a given week pairs of teams play in a related way, so that the combined points for any pair can only be 4. Because there is some system in the numbers, there are ways to attack a solution that would not be available from random numbers.
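To make that last constraint concrete, here is a minimal sketch in Python. The pairings and results below are invented for illustration, not taken from the actual season; the point is only that every game hands out exactly 4 points, so the league total for a full round is fixed before a ball is kicked.

```python
import random

N_TEAMS = 18          # as in the essay
POINTS_PER_GAME = 4   # win = 4, draw = 2 + 2, loss = 0, so always 4 in total

def play_round(scores, pairs):
    """Update cumulative scores for one round of hypothetical fixtures."""
    for a, b in pairs:
        outcome = random.choice(["a_wins", "b_wins", "draw"])
        if outcome == "a_wins":
            scores[a] += 4
        elif outcome == "b_wins":
            scores[b] += 4
        else:
            scores[a] += 2
            scores[b] += 2

scores = {t: 0 for t in range(N_TEAMS)}
teams = list(range(N_TEAMS))
random.shuffle(teams)
pairs = list(zip(teams[::2], teams[1::2]))   # 9 games, every team plays once

play_round(scores, pairs)

# The weekly total is fixed by the schedule, not by the results:
assert sum(scores.values()) == len(pairs) * POINTS_PER_GAME
print(sum(scores.values()) / N_TEAMS)        # average climb this week = 2.0
```

Whatever the individual results, the script always reports an average climb of 2 points for a full round.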

This set was chosen because the numbers were established in advance by a process that is not random, but merely constrained. (There are related sets from other years).

The exercise was done to see the responses of WUWT readers. It’s not easy to choose a number set that demonstrates what follows without giving the game away. At first I thought that the solution, going from weeks 20 to 23, was difficult to impossible. However, the correct solution was cracked by Arnost on June 20 at 4:42 am. Congratulations on a fine job.

Was the solution simply numeric, or was probability involved? I think the latter, as there appears to be no way to assign a 2-point draw in the last time slot to team 14 rather than to team 15. Arnost’s explanation of June 20 at 6:56 pm mentions distributions, so probability was involved. Either that, or the scant information in his reply is a cover-up for a chat with a moderator who knew the answer, or Arnost guessed the data source, which is on the Internet (just joking).

……………………………..

ERROR

Now to the hard part. How to compare a sporting score layout with an ensemble of GCMs? We will assume that for the GCM comparison, X-axis is time and Y-axis is temperature anomaly, as shown on the spaghetti graph.

I’m mainly interested in error, which by training I split into two types, “bias” and “precision”. Bias measures the displacement of a number from the average of comparable estimates. Precision relates to the statistical probability that the number is firm and does not float about on replication. In the football example, the numbers have no bias or precision errors; they are as they are. On the other hand, knowing the rules in advance, one could say that the outcome of the top team playing the lowest team is likely to be a +4 for the former. This invokes probability, and probability invokes statistics. People bet on these games. So how to analyse?

Take Series 1 and fit a linear least-squares line to its progress over 23 weeks. Constrain the Y intercept to 0 and derive y = 2.75x with R^2 = 0.9793. Do these numbers mean anything physical? Probably not. The goodness of fit, and hence R^2, depends here on the sequence in which the games are played. If all the weaker teams were met before the stronger teams, we might have a curved response, but fitting a polynomial would not show much of interest either. I have seen it asserted (but do not think it likely) that comparisons of GCMs have sometimes involved just these approaches. They are not valid.
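For readers who want to reproduce this kind of fit, here is a minimal sketch in Python. The weekly scores are generated at random as stand-ins for Series 1 (the actual series is not reproduced here), so the printed numbers will not match the ones quoted above; the point is the mechanics of a zero-intercept least-squares fit and why its R^2 tends to look flattering.

```python
import numpy as np

weeks = np.arange(1, 24)                              # x = 1 .. 23
scores = 2.75 * weeks + np.random.normal(0, 3, 23)    # stand-in cumulative scores

# Least squares through the origin: minimise sum((y - m*x)^2) over the slope m.
slope = np.sum(weeks * scores) / np.sum(weeks ** 2)

# One common convention for R^2 in a zero-intercept fit measures variation
# about zero rather than about the mean, which is one reason it looks so high.
ss_res = np.sum((scores - slope * weeks) ** 2)
ss_tot = np.sum(scores ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"y = {slope:.2f}x, R^2 = {r_squared:.4f}")
```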

So, imagine that the football scores are GCM results. How does one compare the results of an ensemble? The football scores have an overall average that was plotted on the demo graph, but that average was affected by the variable on the X-axis when rest days were inserted. Is there a parallel in GCMs? Maybe random or unpredictable future events, like a large volcanic eruption, provide one. Different modellers might program such an event to happen at different times, so the relation between the results of runs is disturbed on the X-axis. Or, if effects such as volcanoes are not incorporated, the times at which step changes or inflexions pass through the computation might differ from modeller to modeller, with a similar outcome. We are used to studying Y-axis values more often than X-axis values.

CONSTRAINTS

For the football example, the average curve can be computed a priori – without a game being played.

One needs to know only the points available each week, divided by the number of teams whose scores are being averaged. These are in the pre-printed games schedule. The curve of the average is constrained. Similarly, if the global temperature increase were constrained by modellers – by unconscious choice, by deliberate choice, or because of guidelines – to range between 0 and 5 deg C per century, averaging 2.5, the average curve could be found without doing any runs, by dividing the sum of the constrained temperatures at any time by the number of modellers reporting them.
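A short sketch of that a priori calculation, with a hypothetical schedule (the rest weeks around slots 11–13 are mimicked rather than copied from the real fixture list):

```python
N_TEAMS = 18
POINTS_PER_GAME = 4

# Hypothetical schedule: 9 games in most weeks, fewer during the rest weeks
# around time slots 11-13 (not the real fixture list).
games_per_week = [9] * 10 + [5, 4, 5] + [9] * 10     # 23 time slots

avg_curve = []
running_total = 0.0
for games in games_per_week:
    # points allocated this week, spread over the 18 series being averaged
    running_total += games * POINTS_PER_GAME / N_TEAMS
    avg_curve.append(running_total)

print(avg_curve[-1])   # final league-average score, known before kick-off
```

Nothing about any individual result enters the calculation; the schedule alone fixes the average curve.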

Do such constraints enter GCM comparisons? On the Y-axis, the points available each period can be compared to the net energy balance of the globe (energy in minus energy out) from time to time, which is translated to a temperature anomaly. On the X-axis, time is time, unless the modellers place discontinuities at different times as discussed. The analogy would be valid, for example, if the net energy balance were constant and the differences between GCMs were the result of noise induced by the timing of the modelling assumptions. With thanks to Bob Tisdale, we can quote from an old and simple set of GIS FAQs –

Control runs establish the basic climate of the model. Control runs are long integrations where the model input forcings (solar irradiance, sulfates, ozone, greenhouse gases) are held constant and are not allowed to evolve with time. Usually the input forcings are held fixed either at present day values (i.e., for year 2000 or 2000 Control Run) or a pre-industrial values (i.e., for 1870 or 1870 Control Run). Note that in this context, “fixed” can have two different meanings. The solar forcing values are held fixed a constant, non varying number. The sulfate, ozone and greenhouse gases values, however, are fixed to continually cycle over the same 12-month input dataset every year. The CCSM is then run for an extended period of model time, 100s of years, up to about 1000 years, until the system is close to equilibrium (i.e., with only minor drifts in deep ocean temperature, surface temperature, top-of-the-atmosphere fluxes, etc).

Climate models are an imperfect representation of the earth’s climate system and climate modelers employ a technique called ensembling to capture the range of possible climate states. A climate model run ensemble consists of two or more climate model runs made with the exact same climate model, using the exact same boundary forcings, where the only difference between the runs is the initial conditions. An individual simulation within a climate model run ensemble is referred to as an ensemble member. The different initial conditions result in different simulations for each of the ensemble members due to the nonlinearity of the climate model system. Essentially, the earth’s climate can be considered to be a special ensemble that consists of only one member. Averaging over a multi-member ensemble of model climate runs gives a measure of the average model response to the forcings imposed on the model. http://web.archive.org/web/20090901051609/http://www.gisclimatechange.org/runSetsHelp.html

I stress that these control runs are constrained, for example by the requirement to approach equilibrium. I don’t know if global climate is ever at equilibrium.

So, we can pose this question: given that GCMs are constrained to a degree, does that constraint have enough weight to influence the estimation of an ensemble average? Put another way, can we estimate the average curve of an ensemble from its constraints without even doing a run? Readers’ views are welcome.

Steven Mosher has already commented: “It matters little whether the ensemble mean has statistical meaning because it has practical skill, a skill that is better than any individual model. The reason why the ensemble has more skill is simple. The ensemble of models reduces weather noise and structural uncertainty.” Here, I have not even got into statistics. I’m just playing with numbers and wondering aloud.

Here’s how I’d prefer to approach some of the statistics for GCM modellers. Keep in mind that I have never done a run and so am likely to be naïve. First, let’s consider one group of modellers. The group will (presumably) do many model runs. Some of these will fail for known reasons, like typos or wrong factors in the inputs. We can exclude runs whose results are obviously wrong, even to an uninformed observer. However, after a while, a single model lab will have acquired a number of runs that look OK. These are what should be submitted for ensemble comparison – all of them, not a single run picked for its conformity with other modellers’ results or any other subjective reason. Do modellers swap notes? I’ve not seen this denied. In any event, there is an abundance of pressure to produce a result that is in line with past predictions and future wishes.

In my football example, the series were numbered 1-18. This was on purpose; they were numbered in order of final score, from highest to lowest, so that some ambiguities like those Arnost encountered above were given a helping hand. Some might have been influenced by assuming (correctly) that the numbering was a leak to help them to the right result. Like swapping results would be.

If a modeller submitted all plausible run results, then a within-lab variance could be calculated via normal statistical methods. If all modellers contributed a number of runs, then the variance of each modeller could be combined by the usual statistical methods of calculating propagation of errors for the types of distributions derived from them. This would give a between-modeller variance, from which a form of bias can be derived. This conventional approach would remove some subjectivity that must be involved, to its detriment, if the modeller chooses but one run. It is likely that it would also broaden the final error bands. Both precision and bias are addressed this way.
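As a sketch of what that calculation might look like, here is a minimal Python example using invented numbers for three hypothetical labs. It is essentially a one-way analysis of variance, not a claim about how any modelling centre actually reports its runs.

```python
import numpy as np

# Invented "plausible run" results for three hypothetical labs, e.g. warming
# in deg C per century. These numbers are illustrative only.
runs_by_lab = {
    "lab_A": np.array([2.1, 2.4, 2.2, 2.6]),
    "lab_B": np.array([3.0, 2.8, 3.3]),
    "lab_C": np.array([1.6, 1.9, 1.7, 1.8, 1.5]),
}

lab_means  = {lab: runs.mean()      for lab, runs in runs_by_lab.items()}
within_var = {lab: runs.var(ddof=1) for lab, runs in runs_by_lab.items()}  # precision

grand_mean  = np.mean(list(lab_means.values()))          # unweighted ensemble average
between_var = np.var(list(lab_means.values()), ddof=1)   # spread of the lab means

# "Bias" in the essay's sense: displacement of each lab's mean from the average
bias = {lab: m - grand_mean for lab, m in lab_means.items()}

print(within_var, between_var, bias)
```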

However, the true measure of bias requires a known correct answer, not just an averaged answer, so that has to be obtained either by hindcasting or by waiting a few years after a comparison to re-examine the results against recent knowns.

The football example does not have bias, so I can’t use it here to show the point. But the statistical part I’ve just discussed has to be viewed in terms of the simple outcomes that the football model produced, especially the ability to give an average without even doing a run. If there is a component of that type of numerical outcome in the ensemble comparisons, then the average is a meaningless entity, skill or no skill.

………………………………………………………

Finally, on to projecting. The exercise was to take 20 points in a series and project to 23 points. The projection is constrained, as you can deduce. In terms of GCMs, I do not know if they advance step by step, but if they do, then they are somewhat similar to the football exercise. Each new point you calculate is pegged to the one before it, as it would be in a hypothetical serial GCM. (Here, I confess to not having read enough of the background to the spaghetti graph.)

I’m used to the lines being normalised at a particular origin, so they fan out from a point at the start, or somewhere. If they don’t come together at some stage, it is hard to know how to compare them. A bias in an initial condition will affect the bias of subsequent points and their relation to other runs. This places much stress on the initial condition. The football analogy starts all runs at zero and avoids this problem. It is possible to project forward from a point by deriving a statistical probability from previous points, as shown by some reader solutions of the football example. It is also possible to project forward by doing a numerical analysis of the parameters that are used to create a point. The methods of error analysis need not be similar for these two approaches.
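To illustrate the first of those two approaches, here is a minimal sketch in Python. The series is invented (it is not one of the actual football series); the idea is simply to resample the increments seen in the first 20 points in order to project to point 23.

```python
import random
random.seed(1)

# Invented 20-point series: cumulative score built from weekly gains of 0, 2 or 4
increments = [random.choice([0, 2, 4]) for _ in range(20)]
series = [sum(increments[:i + 1]) for i in range(20)]

def project(series, n_ahead, n_sims=10000):
    """Bootstrap projection: resample past increments to extend the series."""
    past_steps = [b - a for a, b in zip(series, series[1:])]
    finals = []
    for _ in range(n_sims):
        value = series[-1]
        for _ in range(n_ahead):
            value += random.choice(past_steps)
        finals.append(value)
    return sum(finals) / len(finals)     # expected value after n_ahead more steps

print(project(series, n_ahead=3))        # projected score at week 23
```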

In concluding, it is possible to find number sets, generated in different ways, for which error estimation, averaging and projection apply quite differently. It can be difficult to discern the correct choice of method. With GCM ensembles, there is an evident problem of divergence of the models from measured values over the past decade or more. Maybe some of that divergence comes from processes like those shown by the football example. There is no refuge in the Central Limit Theorem or the so-called Law of Large Numbers; they are not applicable in the cases I have seen.

……………………………………………………

IN CONCLUSION

It has long been known that a constrained input can lead to constrained answers.

What I am missing is whether the constraints used in GCM models constrain the answers excessively. I have seen no mention of this in my limited reading. Therefore, I might well be wrong to raise these matters in this way. I hope that readers will show that I am wrong, because if I am right, then there are substantial problems.

………………………………………………….

COMMENTS
Nick Stokes
June 24, 2013 8:59 pm

Geoff,
“On the other hand, the outcome might be a result of what might be termed ‘pal-assisted’.”
There is a range where it is close to average, but also a large range where it isn’t. And it’s not obvious why palship should be so limited. There is a lot of autocorrelation between levels; I think one good match will be seen as quite a spread.
If you look at the whole table, there is a lot of scatter. CSIRO would have reported without knowledge of these other results, and I don’t see how anyone could possibly have predicted the ensemble mean.
But as I say, you just can’t rig a program like that. If you distort to get one thing right, ten others will go wrong.
I don’t think the results are meant to be iid 🙂 maybe i

June 24, 2013 11:12 pm

Nick,
Maybe there is a program that can be downloaded, that students use to solve the Nick-Stokes equations. Why should I download it if it accidentally shows that 1+1 = 3? Good science, as you well know, is about delivering the goods. Mosh’s animation fits this category. Without a measure of how correct it is, even if 1+1 does = 2, it’s just graphics like I can make here, starting with very little knowledge of inputs.
I’m searching for why the goods are not being delivered by GCMs, starting with simple concepts and analogies. As is customary, all assumptions are held valid until shown otherwise.
Remember that if the models are wrong in certain ways, the case for CO2 to explain the gap between model and actual disappears and it might be quite wrong to advocate cessation of fossil fuel burning.

Philip Bradley
June 24, 2013 11:31 pm

These are what should be submitted for ensemble comparison, all of them, not a single run that is picked for its conformity with other modellers’ results or any other subjective reason. Do modellers swap notes? I’ve not seen this denied. In any event, there is an abundance of pressure to produce a result that is in line with past predictions and future wishes.
There is no reason to invoke deliberate or conscious bias. Unconscious confirmation bias is sufficient to explain the fact that most model outputs that make their way into the IPCC ensemble are around the claimed consensus for predicted CO2-driven warming.
I’ve called climate models, confirmation bias on steroids. There is also selection bias by the IPCC.

X Anomaly
June 25, 2013 12:46 am

Here is gfdl CM 2.6, via Issac Held
http://www.gfdl.noaa.gov/flash-video?vid=cm26_v5_sst&w=940
For all its binary finery… it’s BS. Give me next year’s SST mean for the Nino 3.4 region, a mere 1.8% of the total sea surface area.

RichardLH
June 25, 2013 1:35 am

Geoff: I don’t doubt that the models COULD be correct. The probable timescale required to PROVE they are (or not) is only a few years now.
There are good reasons to suggest that one of the reasons they are wrong is that they are modeling based on figures produced with a 365-day Scytale (the rod around which you wrap the record) rather than the correct 1461-day one of a true solar year.
This destroys any 4-year pattern in the record (and yes, there is a pattern, in the CET at least).
That is what I was trying to point out.

RichardLH
June 25, 2013 2:08 am

X Anomaly says:
June 25, 2013 at 12:46 am
“Give me next years SST mean for the Nino 3.4 region. A mere 1.8 % of the total sea surface area.”
How about a prediction of UAH Global for the next 18 months instead?
http://s1291.photobucket.com/user/RichardLH/story/70051

Frank K.
June 25, 2013 5:36 am

“A lot of people are running models now. Model E, CCSM, facilitate download and use…”
Model E is a piece of junk! Nobody knows what equations it’s solving…

June 25, 2013 5:53 am

Geoff
Re Climate sensitivity
Willis, in his latest post The Thousand-Year Model, notes:

[4] One curious aspect of this result is that it is also well known [Houghton et al., 2001] that the same models that agree in simulating the anomaly in surface air temperature differ significantly in their predicted climate sensitivity. The cited range in climate sensitivity from a wide collection of models is usually 1.5 to 4.5C for a doubling of CO2, where most global climate models used for climate change studies vary by at least a factor of two in equilibrium sensitivity. . . .
Sensitivity (transient or equilibrium) is directly proportional to the ratio of the trend of the temperature to the trend of the forcing. . . .
Strange but true, functionally it turns out that all that the climate models do to forecast the global average surface temperature is to lag and resize the forcing. . . .
They’ve left out the most important part, the control mechanism composed of the emergent thermoregulatory phenomena like thunderstorms, so their models don’t work anything like the real climate, but the core physics is right. . . .
Over the last decades, the modelers will tell you that they’ve gotten better and better at replicating the historical world. And they have, because of evolutionary tuning.

From Fig. 2, the 1000 year Crowley model has much lower climate sensitivity than most of the other models.
So run tests for 1000 years instead of 100 years to weed out the worst performers?
Or expose the presuppositions!

David L. Hagen
June 25, 2013 6:01 am

Nick Stokes
Re: “If they don’t conserve mass, they may well collapse.”
CO2 Mass Conservation
Are there any models that accurately replicate the variations in CO2 correlating with surface temperature (2-year moving average), with CO2 mass conservation, as shown by Murry Salby (links above)?
Are there any models that show the increasing CO2 variation and differences in phase from South Pole to North Pole as identified by Fred H. Haynie in The Future of Global Climate Change?

Robany Bigjobz
June 25, 2013 9:12 am

From the article: “The group will (presumably) do many model runs. Some of these will fail for known reasons, like typos or wrong factors in the inputs. We can exclude runs where the results are obviously wrong, even to an uniformed observer.”
First off, excluding “obviously wrong” runs is cherry picking data. If the model is accurately implementing the physics then it won’t produce any “obviously wrong” runs. “Obviously wrong” runs indicate the model is wrong and can be discounted.
Secondly, Roy Spencer demonstrated that all of those 73 models he examined are “obviously wrong” because they don’t match the data and thus are all wrong and can all be discounted.

Nick Stokes
June 25, 2013 1:03 pm

David L. Hagen says: June 25, 2013 at 6:01 am
“Are there any models that accurately replicate the variations in CO2 correlating…”

I think the answer is, I don’t know. But more precise links would help. I don’t try much with Salby’s stuff because he doesn’t seem to be able to write a straightforward document explaining it. And the other is a very long ramble.

eyesonu
June 25, 2013 5:31 pm

Nick Stokes says:
June 24, 2013 at 7:28 pm
===================
The link at the end of your comment models the sinking of the Titanic. What relevance does that have to the thread? Any observations there? Forgot to turn out the lights as the ship went down. I would guess this is about the caliber of the GCMs. The video of ocean temps you linked to is cute. Observations of assumptions? Please don’t bother to respond to this and disrupt the thread.

eyesonu
June 26, 2013 4:16 pm

In my post above I should have noted that the link/graphic on ocean temps would have been provided by Steven Mosher. e.g. …”The video of ocean temps you linked to is cute. Observations of assumptions? …”