One Model, One Vote

Guest Post by Willis Eschenbach

The IPCC, that charming bunch of United Nations intergovernmental bureaucrats masquerading as a scientific organization, views the world of climate models as a democracy. It seems that as long as your model is big enough, they will include it in their confabulations. This has always seemed strange to me, that they don’t even have the most simple of tests to weed out the losers.

Through the good offices of Nic Lewis and Piers Forster, who have my thanks, I’ve gotten a set of 20 matched model forcing inputs and corresponding surface temperature outputs, as used by the IPCC. These are the individual models whose average I discussed in my post called Model Climate Sensitivity Calculated Directly From Model Results. I thought I’d investigate the temperatures first, and compare the model results to the HadCRUT and other observational surface temperature datasets. I start by comparing the datasets themselves. One of my favorite tools for comparing datasets is the “violin plot”. Figure 1 shows a violin plot of a random (Gaussian normal) dataset.

Figure 1. Violin plot of 10,000 random datapoints, with mean of zero and standard deviation of 0.12.

You can see that the “violin” shape, the orange area, is composed of two familiar “bell curves” placed vertically back-to-back. In the middle there is a “boxplot”, which is the box with the whiskers extending out the top and bottom. In a boxplot, half of the data points have a value in the range between the top and the bottom of the box. The height of the box is known as the “interquartile range”, because the box runs from the first quartile to the third quartile of the data. The “whiskers” extending above and below the box reach out to the most extreme data points lying within one and a half times that interquartile range. The heavy black line shows, not the mean (average) of the data, but the median of the data. The median is the value in the middle of the dataset if you sort the dataset by size. As a result, it is less affected by outliers than is the average (mean) of the same dataset.

So in short, a violin plot is a pair of mirror-image density plots showing how the data is distributed, overlaid with a boxplot. With that as prologue, let’s see what violin plots can show us about the global surface temperature outputs of the twenty climate models.
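For anyone who wants to play along at home, a plot like Figure 1 takes only a few lines of R using the “vioplot” package. This is a sketch rather than my actual script (that’s linked in the notes at the end of the post), and the seed and labels are purely illustrative:

# Sketch only: a violin plot of random normal data, as in Figure 1.
# install.packages("vioplot")   # if the package isn't already installed
library(vioplot)

set.seed(42)                              # illustrative seed
x <- rnorm(10000, mean = 0, sd = 0.12)    # 10,000 points, mean 0, sd 0.12

vioplot(x, col = "orange", names = "random normal")
title(main = "Violin plot of a random (Gaussian) dataset", ylab = "value")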

For me, one of the important metrics of any dataset is the “first difference”. This is the change in the measured value from one measurement to the next. In an annual dataset such as the model temperature outputs, the first difference of the dataset is a new dataset that shows the annual CHANGE in temperature.  In other words, how much warmer or cooler is a given year’s temperature compared to that of the previous year? In the real world and in the models, do we see big changes, or small changes?

This change in some value is often abbreviated by the symbol delta, “∆”, which means the difference in some measurement compared to the previous value. For example, the change in temperature would be called “∆T”.
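In R, taking the first difference is a one-liner with the diff() function. Here is a toy example; the temperatures are invented purely to show the mechanics:

# First difference (∆T) of an annual temperature series. The values
# below are made up solely to illustrate what diff() does.
temps  <- c(14.10, 14.25, 14.18, 14.40, 14.32)   # annual temperatures, degrees C
deltaT <- diff(temps)                            # change from each year to the next
print(deltaT)
# roughly: 0.15 -0.07 0.22 -0.08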

So let’s begin by looking at the first differences of the modeled temperatures, ∆T. Figure 2 shows a violin plot of the first difference ∆T of each of the 20 model datasets, as numbers 1:20, plus the HadCRUT and random normal datasets.

Figure 2. Violin plots of 20 climate models (tan), plus the HadCRUT observational dataset (red) and a normal Gaussian dataset (orange) for comparison. Horizontal dotted lines in each case show the total range of the HadCRUT observational dataset. Click any graphic to embiggen.

Well … the first thing we can say is that we are looking at very, very different distributions here. I mean, look at GFDL [11] and GISS [12], as compared with the observations …

Now, what do these differences between, say, GFDL and GISS mean when we look at a timeline of their modeled temperatures? Figure 3 shows a look at the two datasets, GFDL and GISS, along with my emulation of each result.

Figure 3. Modeled temperatures (dotted gray lines) and emulations of two of the models, GFDL-ESM2M and GISS-E2-R. The emulation method is explained in the first link at the top of the post. Dates of major volcanoes are shown as vertical lines.

The difference between the two model outputs is quite visible. There is little year-to-year variation in the GISS results, half or less of what we see in the real world. On the other hand, there is very large year-to-year variation in the GFDL results, up to twice the size of the largest annual changes ever seen in the observational record …

Now, it’s obvious that the distribution of any given model’s result will not be identical to that of the observations. But how much difference can we expect? To answer that, Figure 4 shows a set of 24 violin plots of random distributions, with the same number of datapoints (140 years of ∆T) as the model outputs.

Figure 4. Violin plots of different random datasets with a sample size of N = 140, and the same standard deviation as the HadCRUT ∆T dataset.

As you can see, with a small sample size of only 140 data points, we can get a variety of shapes. That’s one of the problems in interpreting results from small datasets: it’s hard to be sure what you’re looking at. However, some things don’t change much. The interquartile distance (the height of the box) does not vary a lot. Nor do the locations of the ends of the whiskers. Now, if you re-examine the GFDL (11) and GISS (12) modeled temperatures (as redisplayed in Figure 5 below for convenience), you can see that they are nothing like any of these examples of normal datasets.
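For the curious, a panel like Figure 4 can be knocked together in a few lines of R. Again, this is a sketch rather than my actual script; sd_hadcrut below is a placeholder that would be replaced by the standard deviation of the real HadCRUT ∆T series:

# Sketch: 24 random normal series, each with N = 140 and a standard
# deviation matching the HadCRUT delta-T series, plotted as violins.
library(vioplot)

set.seed(1)
n_series   <- 24
n_points   <- 140     # 140 years of delta-T
sd_hadcrut <- 0.12    # placeholder; use sd(diff(hadcrut_annual)) with the real data

random_sets <- replicate(n_series, rnorm(n_points, mean = 0, sd = sd_hadcrut),
                         simplify = FALSE)

# Hand the list of series to vioplot as separate arguments
do.call(vioplot, c(random_sets, list(col = "orange", names = as.character(1:n_series))))
title(ylab = "simulated annual change (degrees C)")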

Here’s a couple of final oddities. Figure 5 includes three other observational datasets—the GISS global temperature index (LOTI), and the BEST and CRU land-only datasets.

Figure 5. As in Figure 2, but including the GISS, BEST, and CRUTEM temperature datasets at lower right. Horizontal dotted lines show the total range of the HadCRUT observational dataset.

Here, we can see a curious consequence of the tuning of the models. I’d never seen how much the chosen target affects the results. You see, you get different results depending on what temperature dataset you choose to tune your climate model to … and the GISS model [12] has obviously been tuned to replicate the GISS temperature record [22]. Looks like they’ve tuned it quite well to match that record, actually. And CSIRO [7] may have done the same. In any case, they are the only two that have anything like the distinctive shape of the GISS global temperature record.

Finally, the two land-only datasets [23, 24 at lower right of Fig. 5] are fairly similar. However, note the differences between the two global temperature datasets (HadCRUT [21] and GISS LOTI [22]), and the two land-only datasets (BEST [23] and CRUTEM [24]). Recall that the land both warms and cools much more rapidly than the ocean. So as we would expect, there are larger annual swings in both of those land-only datasets, as is reflected in the size of the boxplot box and the position of the ends of the whiskers.

However, a number of the models (e.g. 6, 9, and 11) resemble the land-only datasets much more than they do the global temperature datasets. This would indicate problems with the representation of the ocean in those models.

Conclusions? Well, the maximum year-to-year change in the earth’s temperature over the last 140 years has been 0.3°C, for both rising and falling temperatures.

So should we trust a model whose maximum year-to-year change is twice that, like GFDL [11]? What is the value of a model whose results are half that of the observations, like GISS [12] or CSIRO [7]?
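The kind of test I have in mind is dead simple. Here is a sketch in R, where model_temps and hadcrut_temps stand in for the actual annual temperature series:

# Largest single-year change (warming or cooling) in an annual series.
max_annual_change <- function(temps) max(abs(diff(temps)))

# The comparison described above, with placeholder series names:
# max_annual_change(hadcrut_temps)   # about 0.3 degrees C in the observations
# max_annual_change(model_temps)     # roughly double that for GFDL, half for GISS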

My main conclusion is that at some point we need to get over the idea of climate model democracy, and start heaving overboard those models that are not lifelike, that don’t even vaguely resemble the observations.

My final observation is an odd one. It concerns the curious fact that an ensemble mean (a fancy term for the average of a group of models) generally performs better than any individual model selected at random. Here’s how I’m coming to understand it.

Suppose you have a bunch of young kids who can’t throw all that well. You paint a target on the side of a barn, and the kids start throwing mudballs at the target.

Now, which one is likely to be closer to the center of the target—the average of all of the kids’ throws, or a randomly picked individual throw?

It seems clear that the average of all of the bad throws will be your better bet. A corollary is that the more throws, the more accurate your average is likely to be. So perhaps this is the justification in the minds of the IPCC folks for the inclusion of models that are quite unlike reality … they are included in the hope that they’ll balance out an equally bad model on the other side.

HOWEVER … there are problems with this assumption. One is that if all or most of the errors are in the same direction, then the average won’t be any better than a random result. In my example, suppose the target is painted high on the barn, and most of the kids miss below the target … the average won’t do any better than a random individual result.
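You can see both halves of that claim in a two-minute toy simulation. This is mudballs, not climate, and every number in it is purely illustrative:

# Toy simulation of the mudball argument: unbiased versus shared-bias throws.
set.seed(123)
n_kids   <- 20
n_trials <- 10000
target   <- 0

# Case 1: errors scattered evenly around the target (no shared bias)
err_unbiased <- matrix(rnorm(n_kids * n_trials, mean = 0, sd = 1), nrow = n_trials)
# Case 2: everyone tends to miss low (a shared bias of -2)
err_biased   <- matrix(rnorm(n_kids * n_trials, mean = -2, sd = 1), nrow = n_trials)

miss <- function(errs) {
  avg_miss    <- abs(rowMeans(errs) - target)   # the "average of all the throws"
  single_miss <- abs(errs[, 1] - target)        # one kid's throw, picked arbitrarily
  c(mean_of_average = mean(avg_miss), mean_of_single = mean(single_miss))
}

miss(err_unbiased)   # the average throw beats the single throw handily
miss(err_biased)     # both miss by about the same amount; the shared bias dominates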

Another problem is that many models share large segments of code, and more importantly they share a range of theoretical (and often unexamined) assumptions about how the climate operates that may or may not be true.

A deeper problem in this case is that the increased accuracy only applies to the hindcasts of the models … and they are already carefully tuned to create those results. Not the “twist the knobs” kind of tuning, of course, but lots and lots of evolutionary tuning. As a result, they are all pretty good at hindcasting the past temperature variations, and the average is even better at hindcasting … it’s that dang forecasting that is always the problem.

Or as the US stock brokerage ads are required to say, “Past performance is no guarantee of future success”. No matter how well an individual model or group of models can hindcast the past, it means absolutely nothing about their ability to forecast the future.

Best to all,

w.

NOTES:

DATA SOURCE: The model temperature data is from the study entitled Evaluating adjusted forcing and model spread for historical and future scenarios in the CMIP5 generation of climate models, by Forster, P. M., T. Andrews, P. Good, J. M. Gregory, L. S. Jackson, and M. Zelinka, 2013, Journal of Geophysical Research, 118, 1139–1150, provided courtesy of Piers Forster. Available as submitted here, and worth reading.

DATA AND CODE: As usual, my R code is a snarl, but for what it’s worth it’s here, and the data is in an Excel spreadsheet here.

88 Comments
Alan Smersh
November 21, 2013 11:27 pm

e. smith
With the violin plot, my take is that, for the Human observer, discrepancies are immediately made most obvious to the naked eye at a glance. Whereas with the usual graphical plot of a curve or series of curves, they all begin to look very similar. The Human brain, being a pattern recognition machine, in a sense, is very adept at seeing slight differences in a pattern, such as the Violin Type of Plot, but not so good at seeing differences in the very similar curves, or sets of curves.
I could be wrong, but this is my take.

tty
November 21, 2013 11:44 pm

Willis, have you done a plot of all the models together? The reason I ask is that the “ensemble” is usually shown with a two-sigma envelope that is said to be equal to 95% probability.
Now this is only true if the underlying data is normally distributed, and it would be interesting to know how true this assumption is, particularly as climate data usually does not follow a normal distribution.
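(Checking that assumption takes only a couple of lines of R. A sketch, with hadcrut_annual standing in for the annual observational series:)

# Quick check of the normality assumption for a delta-T series (sketch only;
# hadcrut_annual is a placeholder for the actual annual temperature vector).
dT <- diff(hadcrut_annual)

qqnorm(dT); qqline(dT)   # visual check against a normal distribution
shapiro.test(dT)         # formal test; a small p-value argues against normality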

November 21, 2013 11:53 pm

Sorry to correct you, but the “whiskers” extend to 1.5 times the height of the box. Data beyond that are classified as outliers.

tonyb
Editor
November 22, 2013 12:16 am

Volcanic activity may cause warming OR cooling depending on the location of the volcano and the distance from it where the temperature effect is observed. See figure 1
http://www.pages.unibe.ch/products/scientific_foci/qsr_pages/zielinski.pdf
How long the effect is going to last is the subject of continued debate. In most cases the effect appears to be pretty short-lived.
tonyb

Jean Parisot
November 22, 2013 12:57 am

“DATA AND CODE: As usual, my R code is a snarl, but for what it’s worth it’s here , and the data is in an Excel spreadsheet here.”
Thanks, that sentence looks like something “professional” scientists should write more often.

Jimbo
November 22, 2013 1:33 am

Would it not be a good idea for the IPCC to look at say 5 of the models that came closest to temperature observations, then compare them to the 5 models that most diverged with temperature observations; then look under the hood for the differences in code / assumptions / inputs etc. Might this not indicate why most of the models fail? Has this already been done by the IPCC? If yes what was the result?

EternalOptimist
November 22, 2013 1:38 am

I always feared this would happen. The CAGW debate has erupted into Violins

knr
November 22, 2013 1:47 am

that they don’t even have the most simple of tests to weed out the losers.
Well, they have to publish something, even if it’s rubbish, so to be fair, if they were to do this they would have nothing left. So you can see why they don’t.

climatereason
Editor
November 22, 2013 1:47 am

Further to my 12.16 above.
Am I reading the two graphs correctly that show major eruptions and temperatures?
Surely there was already a temperature downturn BEFORE the eruptions?
I did some research previously on the 1258 ‘super volcano’ which was the subject of much discussion by Michael Mann and is said to have precipitated the LIA.
However, the temperature/weather had already deteriorated in the decade prior to the eruption but warmed up again a year after.
tonyb

jimmi_the_dalek
November 22, 2013 2:42 am

you get different results depending on what temperature dataset you choose to tune your climate model to
In another recent thread Nick Stokes said that the models were not fitted to the temperature record,
GCMs are first principle models working from forcings. However, they have empirical models for things like clouds, updrafts etc which the basic grid-based fluid mechanics can’t do properly. The parameters are established by observation. I very much doubt that they fit to the temperature record; that would be very indirect. Cloud models are fit to cloud observations etc.
so which is it?

David A
November 22, 2013 4:20 am

Jimbo says:
November 22, 2013 at 1:33 am
Would it not be a good idea for the IPCC to look at say 5 of the models that came closest to temperature observations, then compare them to the 5 models that most diverged with temperature observations; then look under the hood for the differences in code / assumptions / inputs etc. Might this not indicate why most of the models fail? Has this already been done by the IPCC? If yes what was the result?
============================================
Your suggestion makes way to much since. RGB has posted on the inane practice of using the ensemble model mean for the IPCC predictions, when all the models run wrong in the warm direction. In that since they are informative of something very basic they have wrong. If they follow your suggestion they will likely find that by tuning way down “climate sensitivity” to CO2, they can produce far more accurate predictions.

November 22, 2013 4:25 am

I always liked calling them manta plots myself, but I’m curious why anyone thinks statistical examinations have any place in modern climate science?

David A
November 22, 2013 4:36 am

I hate it when a typo ruins a commonsense comment. “Your suggestion makes way to much since”
Drat- sense. David A says:
November 22, 2013 at 4:20 am

Billy Liar
November 22, 2013 4:38 am

In Figure 2 it looks like HadCRUT is a normal distribution by design – could that be true?

November 22, 2013 4:47 am

jimmi_the_dalek says: November 22, 2013 at 2:42 am
“In another recent thread Nick Stokes said that the models were not fitted to the temperature record,…
so which is it?”

Well, I’d invite people who think they are so fitted, or tuned, to say how they think it is done, and offer their evidence.
Here’s my perspective. I’ve never been involved in writing a GCM. But I’ve spent a lot of time trying to get my CFD programs to do the right thing. Here’s how it goes:
1. Half the battle is just getting them to run to completion without going haywire. They are very complicated, and really the only thing you have going for you is the physics. It has to be consistent. Not, at that stage, necessarily totally right. But there’s a very narrow path to thread; if it runs, then you’ve surely got a lot of things right.
2. Then you check to see if various conservation relations hold. You’ve probably been doing this anyway. Again, if you’ve got that right, that’s reassuring about lots of physics. And if you haven’t, you probably haven’t got this far.
3. Then you check all boundary conditions to see if they are doing the right things. You’ll explicitly calculate stresses and fluxes etc to see if they are a, reasonable, and b, satisfy the equations you meant them to.
4. You check all the sub-models. That includes things like clouds, updrafts, gas exchange, precipitation. That’s when you’ll vary their parameters – not in response to something diffuse like average temperature, but something specifically responsive – ie cloud properties etc.
5. Then you might look at how they match global averages etc. But there’s little chance of tuning. There are not many parameters left – maybe things like thermal conductivity. Diffusive properties. And there’s not much you can tinker with without risking going back into collapse mode. Remember, the physics is the only thing keeping your program functioning. You’ve either got it right or not.
CFD programs have a mind of their own. You end up just trying to negotiate with them.

Brian H
November 22, 2013 4:47 am

My eyes aren’t so good, but I should be able to find these “horizontal dotted lines” you keep referring to. But I can’t. Where are they? To help you search, this is what I’m expecting to see: – – – – – – or . . . . . . or ………

Speed
November 22, 2013 4:48 am

Two pyramids, two masses for the dead, are twice as good as one; but not so two railways from London to York.
John Maynard Keynes
I get his point but he ignores the value of competition. With railroads, competition produces faster, more frequent and cheaper trips. With models, competition to produce the best model should improve the products but for competition to work it needs to produce winners and losers. The field is only improved when the losers are sent home.
ferdberple wrote, “a dart board shows the same pattern. Throw enough darts at the board and the average will be a bulls-eye.”
Isn’t this what the model aggregators are doing — throwing models at the board and claiming that the average is the bulls-eye? Or close enough? Superficially, this is a compelling argument and it doesn’t force anyone or any organization to make a decision or withdraw funding or work hard or quickly or take risks to get better. No one gets sent home.

David L. Hagen
November 22, 2013 4:55 am

Thanks Willis
Always insightful.
Digging deeper, Prof. Demetris Koutsoyiannis finds that natural climate persistence results in statistics that are dramatically different from random fluctuations: e.g., modeling using Hurst-Kolmogorov dynamics gives a Hurst coefficient of 0.89 in a log climacogram, compared to a coefficient of 0.50 for random fluctuations. See e.g. Fig. 9 in Hurst-Kolmogorov dynamics and uncertainty 2011
or Fig. 9-11 in Climatic variability over time scales spanning nine orders of magnitude: Connecting Milankovitch cycles with Hurst-Kolmogorov dynamics
Or in further publications & presentations by Koutsoyiannis on Hurst-Kolmogorov dynamics.
Best

Just an engineer
November 22, 2013 5:20 am

Hi Willis,
In your analogy, I wonder if it would not be more accurate that the boys were blindfolded?

David
November 22, 2013 6:17 am

Once upon a time, as an engineer at a big company, I was asked to evaluate a model someone had made. They had taken 40 measurements from the line during the manufacturing process and regressed them against 20 resultant yields. They then took the best *11* of the results, and did a multi-linear regression against the same 20 resultant yields, and came up with what they called an 80% correlation. To prove this was nonsense, I took 40 random number generators, regressed them against the same 20 yields, took the 11 best results, did a multi-linear regression and came up with an 83% “correlation”.
When I presented my data to management, they wanted to know what distribution I took my random numbers from. I told them “uniform”. *They then wanted me to see if I could improve my correlation by picking a different distribution.* I came really close to resigning.
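(That exercise is easy to reproduce in R. A rough sketch of the idea, with the dimensions taken from the story above and everything else made up:)

# Regress a response on the "best" few of many pure-noise predictors
# and watch a high R-squared appear.
set.seed(7)
n_obs   <- 20    # 20 resultant yields
n_noise <- 40    # 40 random "measurements"
yields  <- rnorm(n_obs)
noise   <- matrix(runif(n_obs * n_noise), nrow = n_obs)   # uniform random predictors

# Pick the 11 noise columns most correlated with the yields...
best11 <- order(abs(cor(noise, yields)), decreasing = TRUE)[1:11]

# ...and fit a multiple linear regression on just those columns.
fit <- lm(yields ~ noise[, best11])
summary(fit)$r.squared   # typically quite high, even though the predictors are noise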

ferdberple
November 22, 2013 6:32 am

David L. Hagen says:
November 22, 2013 at 4:55 am
Hurst-Kolmogorov dynamics and uncertainty 2011
===========
A very interesting paper, pointing out that Climate Change is a redundant and thus misleading term. Climate = Change.
The paper goes on to demonstrate why the treatment of climate as a non-stationary deterministic process leads to an underestimate of climate uncertainty. It goes on to argue that the problem is the underlying statistical assumption that climate (the future) is deterministic.
19th century Victorian-era physics considered the future to be deterministic; the universe was a clockwork. Wind it up and the future is predetermined by the past. However, since that time physics has come to understand the future much differently.
Consider a planet orbiting a star. An electron orbiting the hydrogen nucleus, if you will. This is a deterministic system. It is the lowest order of chaos – no chaos. Now add another planet in orbit. Another electron around the nucleus. Suddenly you have the three-body problem. You cannot be sure where the orbits will take the planets/electrons. Instead, you are left with a probability function. The Schrodinger equation, the Uncertainty Principle, Chaos.
The clockwork breaks down, our nice orderly view of the future does not exist. You cannot average the future and arrive at a meaningful result. On average if you have one foot in the oven and the other in the freezer you are comfortable.

ferdberple
November 22, 2013 6:46 am

The spaghetti graphs the IPCC publishes for the climate models are much more revealing than the ensemble mean. The spaghetti graphs tell us that the models already know that the future is uncertain. That many different futures are possible from a single set of forcings.
This is a critical issue because many scientists and policy makers are unaware of the problem inherent in time series analysis. They believe that for a single set of forcings only a single outcome is possible. Thus they talk about “climate sensitivity” as a single value. They argue that it is 1.5 or 2.5 or 3.5, when in fact it is all of those values and none of those values at the same time.
If it wasn’t, if climate sensitivity was indeed a single value, then a single model would always deliver the same prediction for a given set of forcings. But the models do not do this. Every time you run a model, if it is in the least bit realistic, it will deliver a different prediction for future temperatures, without any change in the forcings. How then can climate sensitivity be a single value?
The climate models are telling us what we don’t want to hear. They are screaming at us that climate sensitivity is not and cannot exist as a single value.

Resourceguy
November 22, 2013 6:59 am

The frequency of model bias on the high side would be better connected to real-world modeling issues if it were linked to the problem of extending such projections off the growth phase of multidecadal cycles. A classification system might make it possible to sort out which models are most prone to the last-cycle take-off effect that produces ridiculous runaway projections.