One Model, One Vote

Guest Post by Willis Eschenbach

The IPCC, that charming bunch of United Nations intergovernmental bureaucrats masquerading as a scientific organization, views the world of climate models as a democracy. It seems that as long as your model is big enough, they will include it in their confabulations. This has always seemed strange to me: they don't apply even the simplest of tests to weed out the losers.

Through the good offices of Nic Lewis and Piers Forster, who have my thanks, I've gotten a set of 20 matched model forcing inputs and corresponding surface temperature outputs, as used by the IPCC. These are the individual models whose average I discussed in my post called Model Climate Sensitivity Calculated Directly From Model Results. I thought I'd investigate the temperatures first, and compare the model results to the HadCRUT and other observational surface temperature datasets. I start by comparing the datasets themselves. One of my favorite tools for comparing datasets is the "violin plot". Figure 1 shows a violin plot of a random (Gaussian normal) dataset.

Figure 1. Violin plot of 10,000 random datapoints, with mean of zero and standard deviation of 0.12

You can see that the "violin" shape, the orange area, is composed of two familiar "bell curves" placed vertically back-to-back. In the middle there is a "boxplot", the box with whiskers extending out the top and bottom. In a boxplot, half of the data points have values between the top and the bottom of the box; that height is known as the "interquartile range" because the box runs from the first quartile to the third quartile of the data. The "whiskers" extending above and below the box reach the most extreme data points lying within a fixed multiple of that range (one and a half times it, in R's default). The heavy black line shows, not the mean (average) of the data, but the median. The median is the value in the middle of the dataset if you sort the dataset by size. As a result, it is less affected by outliers than is the mean of the same dataset.
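If you want to reproduce something like Figure 1 yourself, here's a minimal sketch in R using the "vioplot" package. The numbers simply follow the caption, so treat it as an illustration of the idea rather than my exact code.

```r
# Minimal sketch of Figure 1: a violin plot of random normal data.
# Sample size, mean, and standard deviation follow the caption.
library(vioplot)

set.seed(42)                               # reproducible random draws
x <- rnorm(10000, mean = 0, sd = 0.12)     # random normal data, per the caption

vioplot(x, col = "orange", names = "random normal")
abline(h = median(x), lty = 3)             # dotted reference line at the median
```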

So in short, a violin plot is a pair of mirror-image density plots showing how the data is distributed, overlaid with a boxplot. With that as prologue, let’s see what violin plots can show us about the global surface temperature outputs of the twenty climate models.

For me, one of the important metrics of any dataset is the “first difference”. This is the change in the measured value from one measurement to the next. In an annual dataset such as the model temperature outputs, the first difference of the dataset is a new dataset that shows the annual CHANGE in temperature.  In other words, how much warmer or cooler is a given year’s temperature compared to that of the previous year? In the real world and in the models, do we see big changes, or small changes?

This change in a value is often denoted by the Greek letter delta, "∆", meaning the difference between a measurement and the previous one. For example, the change in temperature is written "∆T".
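In R, the first difference is just the diff() function. Here's a toy example with made-up numbers, purely to show the mechanics:

```r
# First differences (∆T): the year-on-year change in an annual series.
temps  <- c(0.10, 0.15, 0.05, 0.20, 0.18)   # made-up annual temperatures, in degrees C
deltaT <- diff(temps)                       # ∆T = each year minus the year before
deltaT
# [1]  0.05 -0.10  0.15 -0.02
```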

So let’s begin by looking at the first differences of the modeled temperatures, ∆T. Figure 2 shows a violin plot of the first difference ∆T of each of the 20 model datasets, as numbers 1:20, plus the HadCRUT and random normal datasets.

Figure 2. Violin plots of the ∆T of 20 climate models (tan), plus the HadCRUT observational dataset (red) and a normal Gaussian dataset (orange) for comparison. Horizontal dotted lines in each case show the total range of the HadCRUT observational dataset. Click any graphic to embiggen.
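For anyone who wants to roll their own version of Figure 2, here's a bare-bones sketch of the idea. It assumes the model temperatures sit in a data frame called models, one column per model, and the observations in a vector called hadcrut; the actual names in my spreadsheet will differ, so this is a sketch of the approach, not working code for the archive.

```r
# Sketch of Figure 2: ∆T violins for every model column plus HadCRUT.
# "models" is assumed to be a data frame of annual model temperatures,
# one column per model; "hadcrut" is the annual observational vector.
library(vioplot)

model_deltas <- lapply(models, diff)                     # ∆T for each model
all_deltas   <- c(model_deltas, list(HadCRUT = diff(hadcrut)))

do.call(vioplot, c(unname(all_deltas),
                   list(names = names(all_deltas), col = "tan")))

abline(h = range(diff(hadcrut)), lty = 3)                # dotted lines at the HadCRUT total range
```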

Well … the first thing we can say is that we are looking at very, very different distributions here. I mean, look at GFDL [11] and GISS [12], as compared with the observations …

Now, what do these differences between, say, GFDL and GISS mean when we look at a timeline of their modeled temperatures? Figure 3 shows the two datasets, GFDL and GISS, along with my emulation of each result.

Figure 3. Modeled temperatures (dotted gray lines) and emulations of two of the models, GFDL-ESM2M and GISS-E2-R. The emulation method is explained in the first link at the top of the post. Dates of major volcanoes are shown as vertical lines.

The difference between the two model outputs is quite visible. There is little year-to-year variation in the GISS results, half or less of what we see in the real world. On the other hand, there is very large year-to-year variation in the GFDL results, up to twice the size of the largest annual changes ever seen in the observational record …

Now, it’s obvious that the distribution of any given model’s result will not be identical to that of the observations. But how much difference can we expect? To answer that, Figure 4 shows a set of 24 violin plots of random distributions, with the same number of datapoints (140 years of ∆T) as the model outputs.

Figure 4. Violin plots of different random datasets with a sample size of N = 140, and the same standard deviation as the HadCRUT ∆T dataset.

As you can see, with a small sample size of only 140 data points, we can get a variety of shapes. That's one of the problems in interpreting results from small datasets: it's hard to be sure what you're looking at. However, some things don't change much. The interquartile distance (the height of the box) does not vary a lot, nor do the locations of the ends of the whiskers. Now, if you re-examine the GFDL (11) and GISS (12) modeled temperatures (as redisplayed in Figure 5 below for convenience), you can see that they are nothing like any of these examples of normal datasets.
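Here's a bare-bones sketch of the Figure 4 exercise, assuming obs_delta holds the HadCRUT ∆T series from the earlier sketch; any series with the right standard deviation would do.

```r
# Sketch of Figure 4: 24 synthetic ∆T series, each N = 140 points,
# with the same standard deviation as the HadCRUT ∆T ("obs_delta").
library(vioplot)

set.seed(1)
obs_sd <- sd(obs_delta)                                  # spread of the real ∆T
fake   <- replicate(24, rnorm(140, mean = 0, sd = obs_sd), simplify = FALSE)

do.call(vioplot, c(fake, list(names = as.character(1:24), col = "orange")))
```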

Here are a couple of final oddities. Figure 5 includes three other observational datasets: the GISS global temperature index (LOTI), and the BEST and CRU land-only datasets.

Figure 5. As in Figure 2, but including the GISS, BEST, and CRUTEM temperature datasets at lower right. Horizontal dotted lines show the total range of the HadCRUT observational dataset.

Here, we can see a curious consequence of the tuning of the models. I’d never seen how much the chosen target affects the results. You see, you get different results depending on what temperature dataset you choose to tune your climate model to … and the GISS model [12] has obviously been tuned to replicate the GISS temperature record [22]. Looks like they’ve tuned it quite well to match that record, actually. And CSIRO [7] may have done the same. In any case, they are the only two that have anything like the distinctive shape of the GISS global temperature record.

Finally, the two land-only datasets [23, 24 at lower right of Fig. 5] are fairly similar. However, note the differences between the two global temperature datasets (HadCRUT [21] and GISS LOTI [22]), and the two land-only datasets (BEST [23] and CRUTEM [24]). Recall that the land both warms and cools much more rapidly than the ocean. So as we would expect, there are larger annual swings in both of those land-only datasets, as is reflected in the size of the boxplot box and the position of the ends of the whiskers.
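One quick way to put a number on that visual impression is to compare the interquartile range of ∆T across the four observational datasets. A rough sketch, assuming each series lives in an annual temperature vector with the obvious name:

```r
# Rough check: interquartile range of ∆T for each observational dataset.
# A bigger IQR means bigger typical annual swings (land vs. global).
# hadcrut, giss, best, and crutem are assumed annual temperature vectors.
obs <- list(HadCRUT = hadcrut, GISS = giss, BEST = best, CRUTEM = crutem)
sapply(obs, function(x) IQR(diff(x), na.rm = TRUE))
```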

However, a number of the models (e.g., 6, 9, and 11) resemble the land-only datasets much more than they do the global temperature datasets. This suggests problems with the representation of the ocean in those models.

Conclusions? Well, the maximum year-to-year change in the earth’s temperature over the last 140 years has been 0.3°C, for both rising and falling temperatures.
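That 0.3°C figure drops straight out of the first differences; something like this, again assuming hadcrut holds the annual observations:

```r
# Largest single-year rise and fall in the observations (sketch).
obs_delta <- diff(hadcrut)
max(obs_delta, na.rm = TRUE)   # biggest one-year warming
min(obs_delta, na.rm = TRUE)   # biggest one-year cooling
```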

So should we trust a model whose maximum year-to-year change is twice that, like GFDL [11]? What is the value of a model whose results are half that of the observations, like GISS [12] or CSIRO [7]?

My main conclusion is that at some point we need to get over the idea of climate model democracy, and start heaving overboard those models that are not lifelike, that don’t even vaguely resemble the observations.

My final observation is an odd one. It concerns the curious fact that the average of an ensemble (a fancy term for a collection) of climate models generally performs better than any single model selected at random. Here's how I'm coming to understand it.

Suppose you have a bunch of young kids who can’t throw all that well. You paint a target on the side of a barn, and the kids start throwing mudballs at the target.

Now, which one is likely to be closer to the center of the target—the average of all of the kids’ throws, or a randomly picked individual throw?

It seems clear that the average of all of the bad throws will be your better bet. A corollary is that the more throws, the more accurate your average is likely to be. So perhaps this is the justification in the minds of the IPCC folks for the inclusion of models that are quite unlike reality … they are included in the hope that they’ll balance out an equally bad model on the other side.

HOWEVER … there are problems with this assumption. One is that if all or most of the errors are in the same direction, then the average won’t be any better than a random result. In my example, suppose the target is painted high on the barn, and most of the kids miss below the target … the average won’t do any better than a random individual result.
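Here's a toy simulation of the mudball story, just to make the two cases concrete. The target is at zero and every throw is the target plus a random miss; in the unbiased case the average of the throws beats a randomly picked single throw, but give every kid the same downward bias and the average simply inherits that bias.

```r
# Toy mudball simulation: ensemble average vs. one random throw.
set.seed(7)
n_kids   <- 20
n_trials <- 10000

one_trial <- function(bias = 0) {
  throws <- rnorm(n_kids, mean = bias, sd = 1)   # each kid's miss from the target at zero
  c(avg    = abs(mean(throws)),                  # error of the averaged throw
    single = abs(sample(throws, 1)))             # error of one randomly picked throw
}

unbiased <- replicate(n_trials, one_trial(bias = 0))
biased   <- replicate(n_trials, one_trial(bias = -1))  # every kid throws low

rowMeans(unbiased)   # the average lands much closer to the target than a single throw
rowMeans(biased)     # with a shared bias, the average is stuck near the bias itself
```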

Another problem is that many models share large segments of code, and more importantly they share a range of theoretical (and often unexamined) assumptions that may or may not be true about how the climate operates.

A deeper problem in this case is that the increased accuracy only applies to the hindcasts of the models … and they are already carefully tuned to create those results. Not the “twist the knobs” kind of tuning, of course, but lots and lots of evolutionary tuning. As a result, they are all pretty good at hindcasting the past temperature variations, and the average is even better at hindcasting … it’s that dang forecasting that is always the problem.

Or as the US stock brokerage ads are required to say, “Past performance is no guarantee of future success”. No matter how well an individual model or group of models can hindcast the past, it means absolutely nothing about their ability to forecast the future.

Best to all,

w.

NOTES:

DATA SOURCE: The model temperature data is from the study entitled Evaluating adjusted forcing and model spread for historical and future scenarios in the CMIP5 generation of climate models, by Forster, P. M., T. Andrews, P. Good, J. M. Gregory, L. S. Jackson, and M. Zelinka, 2013, Journal of Geophysical Research, 118, 1139–1150, provided courtesy of Piers Forster. Available as submitted here, and worth reading.

DATA AND CODE: As usual, my R code is a snarl, but for what it's worth it's here, and the data is in an Excel spreadsheet here.

88 Comments
David A
November 22, 2013 11:06 pm

Jimini says, “Davis A above says ‘one factor that would improve all the models is tuning down climate sensitivity to CO2,’ but I thought that the sensitivity was one of the results, not one of the inputs.”
——————————————————————————————————————-
I am fairly certain CO2 is the dominant forcing. Many of the feedbacks are based on the claimed, but not observed, effects of additional CO2, i.e., increased water vapor, reduced albedo at the poles, etc. When ALL the models run wrong in a uniform direction, it is likely that something fundamental is wrong. CO2 forcing via direct radiative effects, and feedbacks, is the common thread from which all the WRONG models are woven.
When, without so much as a “How do you do”, the Mannian hockey stick overturned decades of scientific thought and research, the catastrophists created a dilemma for themselves. They claimed a flat past with flat CO2. It was necessary to ensure that the historical record matched the newly revealed flatness. Hansen & company went to work on the historical record to support Mann. See here… http://stevengoddard.wordpress.com/hansen-the-climate-chiropractor/ and here… http://stevengoddard.wordpress.com/2013/10/09/hansen-1986-five-degrees-warming-by-2010-2/
Jimbo, ya, (-; I knew what you were getting at, I just wanted to see if I could draw Mosher out of the trees to see if he could maybe acknowledge the big picture of the forest. No luck; once again his Royalness did not acquiesce to an actual conversation after his condescending lecture to the proletariat. Alas, Mr Mosher remains stuck in a world of numbers he knows far better than most, but of the foundation they rest on he is unaware.

David A
November 23, 2013 6:03 am

Tim the Tool Man: Why? Inconvenient tree rings are, after all, a real-world observation. The models are a computer simulation of an opinion. There are wrong answers in the models; with tree rings there are only wrong interpretations of how they were formed and what they mean. There is a dramatic difference between a model and a real-world observation. We must reject the models that do not match our real-world observations.

Wes
November 23, 2013 6:07 am

Jim G says:
November 22, 2013 at 9:12 am:
I agree with Jim. I really don't understand why they use “anomalies”, as there is so little intelligence in this metric. I don't ever recall using this in my 40 year career as an engineer and physicist.
WP.

Jim G
November 23, 2013 7:59 am

Wes says:
November 23, 2013 at 6:07 am
Jim G says:
November 22, 2013 at 9:12 am:
“I agree with Jim. I really don't understand why they use ‘anomalies’, as there is so little intelligence in this metric. I don't ever recall using this in my 40 year career as an engineer and physicist.
WP.”
Try to even find any historical plots of actual temps. Everything is in “anomalies”. We are playing into the hands of the warmers when we use their methods without, at least, showing a few plots of actual temps to show how little change is actually occurring.

November 23, 2013 12:04 pm

As expected, an excellent post. One more graph would round it out, perhaps: actually put all the data together and plot a composite violin (or stingray?) plot to illustrate the effect of averaging them all.

Tom van der Hoeven
November 23, 2013 12:20 pm

Hadcrut annual.csv
Willis, can you please give us the Hadcrut annual.csv file as well?
Tom

November 23, 2013 2:16 pm

David A writes “Why?”
How does your reasoning stand up if none of the models are modelling climate? After all, you did say, “When ALL the models run wrong in a uniform direction, it is likely that something fundamental is wrong.”

Crispin in Waterloo but really in Ulaanbaatar
November 23, 2013 5:53 pm

@AndyG55 and Just an engineer
This is my verbal ‘Like’ for both your comments.
Andy sez: Any model that hindcasts to fit pre-1979 Giss or HadCrud.. will ALWAYS create an overestimate of future temperatures.
This is kinda obvious, n'est-ce pas? If a model is built or trained or compared or correlated or checked against a temperature set that has been fiddled to make the past look colder than it was, then even if it is doing ‘really well’ on the basis of its internal mechanisms, it is going to over-estimate future temperatures both near and far. It has been created to model a non-reality. And future reality bites hard.
That the divergence starts immediately is a really good indication that the whole of the past needs to be tipped up until the dart-throwing mechanism shoots at lower targets. That tilting might best be accomplished by reducing the net forcing effect of additional CO2 and elevating something else. The internal mechanisms of a functionally useful model don't have to be correct for short-term predictions. Save for one or two models, the rest seem not to meet even this puny standard.

David A
November 23, 2013 10:15 pm

TimTheToolMan says:
November 23, 2013 at 2:16 pm
David A writes “Why?”
How does your reasoning stand up if none of the models are modelling climate? After all, you did say, “When ALL the models run wrong in a uniform direction, it is likely that something fundamental is wrong.”
———————————————————————————
Sorry, but I do not see the contradiction. I am simply asserting the simple common sense approach that the projections which least match real world observations be discarded, and the GCMs closest to the observations be analyzed to see what makes them better. I have yet to find a discussion on why certain models run closer to R.W.O. AndyG55 and Crispin above do make an excellent observation.
My links previously posted in this thread begin to address how the historic surface record has been adjusted to support the Mannian hockey stick. See here… http://stevengoddard.wordpress.com/hansen-the-climate-chiropractor/ and here… http://stevengoddard.wordpress.com/2013/10/09/hansen-1986-five-degrees-warming-by-2010-2/
Of course that is just a beginning, and over the last decade plus, RSS has diverged from GISS more than ever. (Steven Mosher does not like to talk about this. Actually, he does not like to talk much at all; he prefers to lecture, something I find a bit sad as he has a great deal of detailed expertise.)

mbur
November 24, 2013 9:25 am

One Model, One Vote … and my vote goes for the normal Gaussian dataset (orange) in Figure 2,
because, IMO, it most closely resembles the HadCRUT observational dataset (red), also in Figure 2.
Don't those violin plots kinda show averages anyway? An average of an average is just an average to me.
Thanks for the interesting articles and comments.

dscott
November 24, 2013 10:52 am

The IPCC game is based upon exclusion of data and methods. Why not beat them at their own game? Produce 20 climate models, each a minor iteration of the next, and then independently submit each result. Presto, you have outvoted their closed voting bloc and, more importantly, introduced a negative confidence factor against the established models. They will no longer be able to claim unanimous agreement or a 95% confidence level in the models when half of them are showing cyclical or decreasing trends.
Wait for liberal shrieking to begin in 3…., 2….., 1…..

Brian H
November 25, 2013 2:03 am

Wes says:
November 23, 2013 at 6:07 am
Jim G says:
November 22, 2013 at 9:12 am:
I agree with Jim. I really don't understand why they use “anomalies”, as there is so little intelligence in this metric. I don't ever recall using this in my 40 year career as an engineer and physicist.
WP.

Yes, and another unique-to-CS term is “forcings”. With forcings of anomalies, CS has created a mental playground where anything goes.

November 25, 2013 9:18 am

Good work identifying a serious limitation of the models, and of climate science more broadly. Climatology was of little interest until it was chosen as a political vehicle. Prior to that it was all about averages, not the concept of change.
In the 1970s trends became important in society in general and in climatology as global cooling posed a threat to world food production. Trends, especially simple linear trends, became the foundation of very simplistic computer models and politically fashionable because of their use in the Club of Rome work “Limits to Growth”.
In the 1980s the trend became global warming, especially after Hansen's politically contrived 1988 hearing. It was just another simplistic trend, but now with a full political agenda, exemplified by the comment of Senator Wirth, who arranged Hansen's appearance: “We've got to ride the global warming issue. Even if the theory of global warming is wrong, we will be doing the right thing…”
Just as the IPCC kept all the focus on CO2, so it kept the focus on averages and trends. It diverted attention from the other important statistic in any data set, namely the variation. As the climate transitions from one trend to another, the variability changes. This is primarily due to the changing wave pattern in the circumpolar vortex, and it is accentuated in middle-latitude records, which dominate the data sets used for the climate models.
In my opinion Willis's article serves to accentuate this failure to consider variation, but also the failure of the models, because they are built on completely inadequate data sets in space and time.
I wrote about the broader implications of a limited, simplistic application of statistics to climatology in particular and society in general.
http://drtimball.com/2011/statistics-impact-on-modern-society-and-climate/