One Model, One Vote

Guest Post by Willis Eschenbach

The IPCC, that charming bunch of United Nations intergovernmental bureaucrats masquerading as a scientific organization, views the world of climate models as a democracy. It seems that as long as your model is big enough, they will include your model in their confabulations. It has always seemed strange to me that they don’t even have the simplest of tests to weed out the losers.

Through the good offices of Nic Lewis and Piers Forster, who have my thanks, I’ve gotten a set of 20 matched model forcing inputs and corresponding surface temperature outputs, as used by the IPCC. These are the individual models whose average I discussed in my post called Model Climate Sensitivity Calculated Directly From Model Results. I thought I’d investigate the temperatures first, and compare the model results to the HadCRUT and other observational surface temperature datasets. I start by comparing the datasets themselves. One of my favorite tools for comparing datasets is the “violin plot”. Figure 1 shows a violin plot of a random (Gaussian normal) dataset.

Figure 1. Violin plot of 10,000 random datapoints, with mean of zero and standard deviation of 0.12

You can see that the “violin” shape, the orange area, is composed of two familiar “bell curves” placed vertically back-to-back. In the middle there is a “boxplot”, which is the box with the whiskers extending out top and bottom. In a boxplot, half of the data points have a value in the range between the top and the bottom of the box. The height of the box is known as the “interquartile range” because it runs from the first quartile (the 25th percentile) to the third quartile (the 75th percentile) of the data. The “whiskers” extending above and below the box are each the same height as the box. The heavy black line shows, not the mean (average) of the data, but the median of the data. The median is the value in the middle of the dataset if you sort the dataset by size. As a result, it is less affected by outliers than is the average (mean) of the same dataset.

So in short, a violin plot is a pair of mirror-image density plots showing how the data is distributed, overlaid with a boxplot. With that as prologue, let’s see what violin plots can show us about the global surface temperature outputs of the twenty climate models.
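For anyone who wants to see the numbers behind the boxplot part of the picture, here is a minimal Python sketch. It is an illustration only, not the R code used for the figures. It draws 10,000 Gaussian points with a mean of zero and a standard deviation of 0.12, as in Figure 1, then computes the median, the box edges, and whiskers that extend one box-height beyond the box, as described above. The function name `box_stats` and the seed are made up for the example.

```python
import random
import statistics

def box_stats(data):
    """Return the median, box edges (quartiles), and whisker ends of a dataset.

    The whiskers follow the description in the text: each one extends
    one interquartile range beyond the box. (Many plotting packages
    default to 1.5 x IQR instead.)
    """
    q1, median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    return {
        "median": median,
        "box_bottom": q1,
        "box_top": q3,
        "iqr": iqr,
        "whisker_low": q1 - iqr,
        "whisker_high": q3 + iqr,
    }

random.seed(42)  # arbitrary seed, only so the run is repeatable
sample = [random.gauss(0.0, 0.12) for _ in range(10_000)]
stats = box_stats(sample)

# Check the claim that half of the points fall inside the box
inside_box = sum(stats["box_bottom"] <= x <= stats["box_top"] for x in sample)
fraction_in_box = inside_box / len(sample)
print(f"median = {stats['median']:.3f}, IQR = {stats['iqr']:.3f}, "
      f"fraction inside box = {fraction_in_box:.3f}")
```

For a normal distribution the interquartile range is about 1.35 standard deviations, so with a standard deviation of 0.12 the box should come out roughly 0.16 tall.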

For me, one of the important metrics of any dataset is the “first difference”. This is the change in the measured value from one measurement to the next. In an annual dataset such as the model temperature outputs, the first difference of the dataset is a new dataset that shows the annual CHANGE in temperature. In other words, how much warmer or cooler is a given year’s temperature compared to that of the previous year? In the real world and in the models, do we see big changes, or small changes?

This change in some value is often abbreviated by the symbol delta, “∆”, which means the difference in some measurement compared to the previous value. For example, the change in temperature would be called “∆T”.
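The first difference is simple enough to show in a few lines. This is a minimal Python sketch, not the code behind the figures; the function name `first_difference` and the temperature values are invented for illustration.

```python
def first_difference(series):
    """Return the change from each value to the next (one item shorter than the input)."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

# Hypothetical annual temperature anomalies in deg C, for illustration only
temps = [0.10, 0.25, 0.05, 0.05, 0.30]
delta_t = first_difference(temps)

for year, change in enumerate(delta_t, start=1):
    print(f"year {year} -> year {year + 1}: {change:+.2f} deg C")
```

Five years of temperatures give four values of ∆T, which is why 141 years of annual data yield 140 first differences.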

So let’s begin by looking at the first differences of the modeled temperatures, ∆T. Figure 2 shows a violin plot of the first difference ∆T of each of the 20 model datasets, numbered 1 through 20, plus the HadCRUT and random normal datasets.

Figure 2. Violin plots of 20 climate models (tan), plus the HadCRUT observational dataset (red), and a normal gaussian dataset (orange) for comparison. Horizontal dotted lines in each case show the total range of the HadCRUT observational dataset. Click any graphic to embiggen.

Well … the first thing we can say is that we are looking at very, very different distributions here. I mean, look at GFDL [11] and GISS [12], as compared with the observations …

Now, what do these differences between, say, GFDL and GISS mean when we look at a timeline of their modeled temperatures? Figure 3 shows a look at the two datasets, GFDL and GISS, along with my emulation of each result.

Figure 3. Modeled temperatures (dotted gray lines) and emulations of two of the models, GFDL-ESM2M and GISS-E2-R. The emulation method is explained in the first link at the top of the post. Dates of major volcanoes are shown as vertical lines.

The difference between the two model outputs is quite visible. There is little year-to-year variation in the GISS results, half or less of what we see in the real world. On the other hand, there is very large year-to-year variation in the GFDL results, up to twice the size of the largest annual changes ever seen in the observational record …

Now, it’s obvious that the distribution of any given model’s result will not be identical to that of the observations. But how much difference can we expect? To answer that, Figure 4 shows a set of 24 violin plots of random distributions, with the same number of datapoints (140 years of ∆T) as the model outputs.

Figure 4. Violin plots of different random datasets with a sample size of N = 140, and the same standard deviation as the HadCRUT ∆T dataset.

As you can see, with a small sample size of only 140 data points, we can get a variety of shapes. It’s one of the problems in interpreting results with small datasets: it’s hard to be sure what you’re looking at. However, some things don’t change much. The interquartile distance (the height of the box) does not vary a lot. Nor do the locations of the ends of the whiskers. Now, if you re-examine the GFDL (11) and GISS (12) modeled temperatures (as redisplayed in Figure 5 below for convenience), you can see that they are nothing like any of these examples of normal datasets.
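The stability of the box height is easy to check numerically. Here is a minimal Python sketch, again an illustration rather than the code used for Figure 4. It generates 24 Gaussian samples of 140 points each and reports how much the interquartile range moves around. The standard deviation of 0.12 is an assumption borrowed from Figure 1 as a stand-in for the HadCRUT ∆T value, and the seed and names are arbitrary.

```python
import random
import statistics

def iqr(data):
    """Interquartile range: the height of the boxplot box."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return q3 - q1

random.seed(1)     # arbitrary, for repeatability
SD = 0.12          # assumed stand-in, borrowed from Figure 1
N_YEARS = 140      # sample size quoted in the text
N_DATASETS = 24    # number of panels in Figure 4

iqrs = []
for _ in range(N_DATASETS):
    sample = [random.gauss(0.0, SD) for _ in range(N_YEARS)]
    iqrs.append(iqr(sample))

smallest, largest = min(iqrs), max(iqrs)
print(f"IQR ranges from {smallest:.3f} to {largest:.3f} "
      f"across {N_DATASETS} samples of {N_YEARS} points")
```

The expected box height for this standard deviation is about 0.16. A sample of 140 points will scatter around that, but a box half or double that size would be very unusual for a normal dataset.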

Here’s a couple of final oddities. Figure 5 includes three other observational datasets—the GISS global temperature index (LOTI), and the BEST and CRU land-only datasets.

Figure 5. As in Figure 2, but including the GISS, BEST, and CRUTEM temperature datasets at lower right. Horizontal dotted lines show the total range of the HadCRUT observational dataset.

Here, we can see a curious consequence of the tuning of the models. I’d never seen how much the chosen target affects the results. You see, you get different results depending on what temperature dataset you choose to tune your climate model to … and the GISS model [12] has obviously been tuned to replicate the GISS temperature record [22]. Looks like they’ve tuned it quite well to match that record, actually. And CSIRO [7] may have done the same. In any case, they are the only two that have anything like the distinctive shape of the GISS global temperature record.

Finally, the two land-only datasets [23, 24 at lower right of Fig. 5] are fairly similar. However, note the differences between the two global temperature datasets (HadCRUT [21] and GISS LOTI [22]), and the two land-only datasets (BEST [23] and CRUTEM [24]). Recall that the land both warms and cools much more rapidly than the ocean. So as we would expect, there are larger annual swings in both of those land-only datasets, as is reflected in the size of the boxplot box and the position of the ends of the whiskers.

However, a number of the models (e.g. 6, 9, and 11) resemble the land-only datasets much more than they do the global temperature datasets. This would indicate problems with the representation of the ocean in those models.

Conclusions? Well, the maximum year-to-year change in the earth’s temperature over the last 140 years has been 0.3°C, for both rising and falling temperatures.

So should we trust a model whose maximum year-to-year change is twice that, like GFDL [11]? What is the value of a model whose results are half that of the observations, like GISS [12] or CSIRO [7]?
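The comparison in those two questions can be written as a one-number check. The Python sketch below is only an illustration of the idea, not a test anyone has used on these models: the function names are made up, and the three short series are invented purely to exercise the functions. It divides the largest year-to-year change in a model series by the largest year-to-year change in the observations.

```python
def max_abs_change(series):
    """Largest absolute year-to-year change in an annual series."""
    return max(abs(b - a) for a, b in zip(series, series[1:]))

def swing_ratio(model, observed):
    """Model's largest annual swing divided by the observed largest swing."""
    return max_abs_change(model) / max_abs_change(observed)

# Hypothetical short series (deg C), invented for illustration only
observed = [0.00, 0.20, -0.10, 0.05, 0.15]
jumpy_model = [0.00, 0.50, -0.10, 0.30, 0.10]
smooth_model = [0.00, 0.15, 0.00, 0.05, 0.15]

jumpy_ratio = swing_ratio(jumpy_model, observed)    # about 2: swings twice as large
smooth_ratio = swing_ratio(smooth_model, observed)  # about 0.5: swings half as large

print(f"jumpy model ratio:  {jumpy_ratio:.2f}")
print(f"smooth model ratio: {smooth_ratio:.2f}")
```

A ratio near 1 means the model's biggest annual swings are about the size of the real ones. Where to draw the line for "not lifelike" is a judgment call that this sketch does not settle.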

My main conclusion is that at some point we need to get over the idea of climate model democracy, and start heaving overboard those models that are not lifelike, that don’t even vaguely resemble the observations.

My final observation is an odd one. It concerns the curious fact that an ensemble (a fancy term for an average) of climate models generally performs better than any model selected at random. Here’s how I’m coming to understand it.

Suppose you have a bunch of young kids who can’t throw all that well. You paint a target on the side of a barn, and the kids start throwing mudballs at the target.

Now, which one is likely to be closer to the center of the target—the average of all of the kids’ throws, or a randomly picked individual throw?

It seems clear that the average of all of the bad throws will be your better bet. A corollary is that the more throws, the more accurate your average is likely to be. So perhaps this is the justification in the minds of the IPCC folks for the inclusion of models that are quite unlike reality … they are included in the hope that they’ll balance out an equally bad model on the other side.

HOWEVER … there are problems with this assumption. One is that if all or most of the errors are in the same direction, then averaging only removes the scatter, not the shared error, and the average won’t be much better than a random result. In my example, suppose the target is painted high on the barn, and most of the kids miss below the target … the average won’t do much better than a random individual result.
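The mudball example can be simulated in a few lines. This Python sketch is a toy under stated assumptions, not a model of any climate ensemble: throws are one-dimensional, the target is at zero, each throw is the shared bias plus Gaussian scatter, and the function name, the 20 kids, and the bias of -3 are all invented for illustration.

```python
import random
import statistics

def average_vs_single(bias, spread, n_kids=20, n_trials=2000, seed=0):
    """Compare the miss distance of the group average with one random throw.

    Each throw lands at bias + Gaussian noise (1-D, target at zero).
    Returns (mean |error| of the average, mean |error| of a random throw).
    """
    rng = random.Random(seed)
    avg_err, single_err = [], []
    for _ in range(n_trials):
        throws = [rng.gauss(bias, spread) for _ in range(n_kids)]
        avg_err.append(abs(statistics.fmean(throws)))
        single_err.append(abs(rng.choice(throws)))
    return statistics.fmean(avg_err), statistics.fmean(single_err)

# Misses scattered evenly around the target: the average wins easily
unbiased_avg, unbiased_single = average_vs_single(bias=0.0, spread=1.0)

# Most throws land low: the average is stuck near the shared bias
biased_avg, biased_single = average_vs_single(bias=-3.0, spread=1.0)

print(f"no bias:     average misses by {unbiased_avg:.2f}, single throw by {unbiased_single:.2f}")
print(f"shared bias: average misses by {biased_avg:.2f}, single throw by {biased_single:.2f}")
```

With no shared bias the average lands several times closer than a typical throw. With a shared bias much larger than the scatter, the average and the single throw both miss by about the size of the bias, so averaging buys almost nothing.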

Another problem is that many models share large segments of code, and more importantly they share a range of theoretical (and often unexamined) assumptions that may or may not be true about how the climate operates.

A deeper problem in this case is that the increased accuracy only applies to the hindcasts of the models … and they are already carefully tuned to create those results. Not the “twist the knobs” kind of tuning, of course, but lots and lots of evolutionary tuning. As a result, they are all pretty good at hindcasting the past temperature variations, and the average is even better at hindcasting … it’s that dang forecasting that is always the problem.

Or as the US stock brokerage ads are required to say, “Past performance is no guarantee of future success”. No matter how well an individual model or group of models can hindcast the past, it means absolutely nothing about their ability to forecast the future.

Best to all,

NOTES:

DATA SOURCE: The model temperature data is from the study entitled Evaluating adjusted forcing and model spread for historical and future scenarios in the CMIP5 generation of climate models, by Forster, P. M., T. Andrews, P. Good, J. M. Gregory, L. S. Jackson, and M. Zelinka, 2013, Journal of Geophysical Research, 118, 1139–1150, provided courtesy of Piers Forster. Available as submitted here, and worth reading.

DATA AND CODE: As usual, my R code is a snarl, but for what it’s worth it’s here, and the data is in an Excel spreadsheet here.

88 Comments

cnxtim

November 21, 2013 6:10 pm

Oh dear, the hypocrisy is rife, these graphs are so out of kilter with each other as to make any findings from using them collectively totally absurd, and surely the only thing that matters is accuracy, proven over time – anything else is just plain old GiGo – thanks great post…

DocMartyn

November 21, 2013 6:33 pm

What graphics package did you use for the violin plots?
They look lovely BTW

OssQss

November 21, 2013 6:34 pm

Willis, remember those analog kids will learn and get better as time goes by.
By comparison, our digital models can’t.
They currently have a fundamental CO2 issue that their analog masters cannot overcome with adding code. Stuck in the mud, if you will.
Just my take.
As always, thanks for the good read.
PS= I will never look at old Christmas decorations the same way again. Tis the season >

Luke Warmist

November 21, 2013 6:35 pm

Thanks Willis. I always enjoy your take on data sets, and despair at my own lack of imagination.

Jquip

November 21, 2013 6:39 pm

OssQss: “By comparison, our digital models can’t.”
Exactly the reason underlying the self-correction of science. Given two things, or more, toss out the worst and try again. If we aren’t doing that, then there’s no point to the endeavour.

jorgekafkazar

November 21, 2013 6:40 pm

Ja, another great post. Well done. The models don’t simulate climate; they only emulate it, in that they wiggle up and down, just like climate. They don’t add to our knowledge; they take away. And they cost millions, so far, trillions, ultimately.

Willis Eschenbach

Author

November 21, 2013 6:44 pm

DocMartyn says:
November 21, 2013 at 6:33 pm

What graphics package did you use for the violin plots?
They look lovely BTW

It’s all done in R on my Mac …
w.

ferdberple

November 21, 2013 7:05 pm

lots and lots of evolutionary tuning
================
exactly. Any model that wants to survive must tell the model builder what the model builder expects to hear, or the model will be replaced by a modified model. Over time this evolutionary (survival of the fittest) results in models that are very good at predicting what the model builder expects to hear. however, they have no more ability to tell the future than the model builder.

ferdberple

November 21, 2013 7:21 pm

increased accuracy only applies to the hindcasts of the models … and they are already carefully tuned to create those results
==========================
the climate models confuse hindcasting with training. When you let the model see the past you are training. Of course the model can memorize the past and repeat it. A parrot can do the same.
Hindcasting occurs when you don’t show the model the past and it can predict it anyways. This proves that the model likely has some skill. To date no climate model has demonstrated this ability. Let me repeat, no climate model to date has demonstrated any skill at hindcasting.
For example, given the current position of a planet and its current motion, gravitational models can predict with some accuracy its past position. We can verify this using historical records, which gives us great confidence in the accuracy of gravitational models. We don’t need to wait to see if the model is accurate in the future, because it has correctly predicted the past without knowing the past. This gives us confidence the models can predict the future.
However, what if we trained the models by telling them the past position of the planet. Would this give us any confidence in the ability of the model to make predictions about the past or future? No, because all the model need do is parrot what it has learned. This requires no ability to predict, it requires the ability to mimic. That is what we see with climate models, they mimic the builders, they don’t have any skill to predict.

old engineer

November 21, 2013 7:41 pm

Willis, thought provoking post as usual. I enjoyed the education on violin plots.
Your comment about the kids throwing mud balls at target on barn reminded me of the poem “Hiawatha Designs an Experiment” by Maurice G. Kendall, quoted in part below:
“Hiawatha, mighty hunter
He could shoot ten arrows upwards
Shoot them with such strength and swiftness
That the last had left the bowstring
Ere the first to earth descended.
This was commonly regarded
As a feat of skill and cunning.
One or two sarcastic spirits
Pointed out to him, however,
That it might be much more useful
If he sometimes hit the target.
Why not shoot a little straighter
And employ a smaller sample?
Hiawatha, who in college
Majored in applied statistics
Consequently felt entitled
To instruct his fellow men on
Any subject whatsoever,
Waxed exceedingly indignant
Talked about the law of error,
Talked about truncated normals,
Talked of loss of information,
Talked about his lack of bias
Pointed out that in the long run
Independent observations
Even though they missed the target
Had an average point of impact
Very near the spot he aimed at
(with possible exception
of a set of measure zero)…..”
Could it be that the IPCC has hired Hiawatha?

FrankK

November 21, 2013 7:42 pm

Interesting post W.
As with most models the “tuning” (as some suggest – fudging) means very little as the result is not unique and a good hindcast “fit” doesn’t mean the model is valid. With climate models it’s worse because they are all based on the premise that CO2 is the prime driver of temperature. Hence their predictions are no better than a guess.

Steven Mosher

November 21, 2013 7:57 pm

the violin plots are done with vioplot package ( assuming willis uses what I use )
Willis:
“So should we trust a model whose maximum year-to-year change is twice that, like GFDL [11]? What is the value of a model whose results are half that of the observations, like GISS [12] or CSIRO [7]?”
I’m glad you plotted the variability.
Now folks need to go re look at Santer and the 17 year goal posts.

Tim

November 21, 2013 8:00 pm

FEA was also originally assumed to be an unreliable tool, but now everybody and his dog uses it when it comes to product design.
The problem with climate models (and often FEA too) is the inputs and the constraints. I don’t believe we know of half of them and without that knowledge there’s no point in doing the analysis. It’s like designing a bridge without knowing what material you are building it out of, how long it needs to be and what weight it needs to support.

Nick Stokes

November 21, 2013 8:08 pm

“The difference between the two model outputs is quite visible. There is little year-to-year variation in the GISS results, half or less than what we see in the real world. On the other hand, there very large year-to-year variation in the the GDFL results, up to twice the size of the largest annual changes ever seen in the observational record …”
GISS Model E wasn’t in the Forster et al data. I see that when you have used it before, you got results from this site. That’s an ensemble mean of five runs. So variation is down.

don

November 21, 2013 8:12 pm

Interesting. I don’t see a violin plot (the classical full figured woman as it were). I see diamonds. I see a few flying saucers. I even see one dumpy pear and a few spinning tops. It’s a veritable Rorschach test and too absurd to really exist. I must be projecting. If it were not for humans, the absurd wouldn’t exist.

Willis Eschenbach

Author

November 21, 2013 8:25 pm

Nick Stokes says:
November 21, 2013 at 8:08 pm

“The difference between the two model outputs is quite visible. There is little year-to-year variation in the GISS results, half or less than what we see in the real world. On the other hand, there very large year-to-year variation in the the GDFL results, up to twice the size of the largest annual changes ever seen in the observational record …”

GISS Model E wasn’t in the Forster et al data. I see that when you have used it before, you got results from this site. That’s an ensemble mean of five runs. So variation is down.

Thanks, Nick. Actually, while one GISS model wasn’t included in the Forster et al data, the GISS result displayed above was just another dataset in the group of forcing/result pairs that I got from Piers Forster. And the reason for the lack of variation in the result is the lack of variation in the forcing used by GISS. I’m gonna write about this when I get to it, but here’s the money graph …

This shows the forcing (in W/m2) of all of the models except one (inmcm4, which doesn’t include volcanoes). You can see the lack of variation in the GISS forcing and the large variation in the GFDL-ESM2M forcing, each of which is faithfully reflected in their respective temperature results.
w.

Willis Eschenbach

Author

November 21, 2013 8:29 pm

Steven Mosher says:
November 21, 2013 at 7:57 pm

the violin plots are done with vioplot package ( assuming willis uses what I use )

Hey, Steven, always good to hear from you. Indeed I use vioplot, but I don’t use their boxplot overlay. Instead I overlay my own boxplot, which shows the relationships more clearly (to my eye at least).
w.

wbrozek

November 21, 2013 8:34 pm

Well, the maximum year-to-year change in the earth’s temperature over the last 140 years has been 0.3°C, for both rising and falling temperatures.
Is it possible some models tried to model the satellite data?
With RSS for example, 1997 was 0.103, 1998 was 0.549 and 1999 was 0.103 again. So it rose and dropped 0.446.

ferdberple

November 21, 2013 8:36 pm

old engineer says:
November 21, 2013 at 7:41 pm
Your comment about the kids throwing mud balls at target on barn reminded me of the poem “Hiawatha Designs an Experiment”
================
a dart board shows the same pattern. Throw enough darts at the board and the average will be a bulls-eye. Throw enough and some may even hit the bulls eye. It doesn’t mean you have any skill at throwing bulls-eyes.
throw enough climate models at the bulls eye and on occasion some will accidentally hit the bulls eye, and the rest will be scattered about. and all the models will demonstrate the same skill as a dart board at predicting future climate.
There is a 1/3 chance the future will be hotter, 1/3 chance it will be colder, and 1/3 chance it will be unchanged. Randomly pick any set of points in the past and this will be true. Thus, if you were to forecast any time in the future, this would also hold true and it would be a foolish bet to forecast otherwise.
To argue that humans are “different” is a nonsense. The earth has suffered much worse catastrophes, yet the 1/3 rule holds. The problem is that we believe the future to be deterministic, but it isn’t. God has a sense of humor and is the biggest practical joker in the Universe.

Steven Mosher

November 21, 2013 9:30 pm

cool willis a couple of folks have posted mods to the vioplot package, you might consider it

Steven Mosher

November 21, 2013 9:35 pm

Willis, as I recall FGOALS actually includes volcanic forcing summed into its TSI forcing. Did you get each component or just the sum of forcings?

dalyplanet

November 21, 2013 10:09 pm

I always enjoy your graphical posts Willis, thank you again.

Alan Smersh

November 21, 2013 10:23 pm

For the unenlightened the full
distribution packages are here
for MAC, Windows, & Linux …..
http://cran.r-project.org/
But “R” is quite complex, and so we maybe need a tool to learn “R”
and capable of operating a remote “R” server to do complex plots.
R Instructor is an Android and iPhone, iPad and iPod Touch
application that uses plain, non-technical language and over
30 videos to explain how to make and modify plots, manage data
and conduct both parametric and non-parametric statistical tests.
(other instructional packages available) Costs less than 5 Bucks !
http://www.rinstructor.com/
Or Read The Manuals here, but there’s over 3,500 pages …
http://cran.r-project.org/manuals.html
Have fun with all that, people !

AndyG55

November 21, 2013 10:31 pm

Again I will say..
Any model that hindcasts to fit pre-1979 Giss or HadCrud.. will ALWAYS create an overestimate of future temperatures.
They are stuck in a Catch 22 situation.
Either get rid of all the pre-1979 adjustments and have some sort of hope of some sort of realistic projection (but abolishing the warming trend that the climate bletheren rely on) … or
Leave it as it is and keep producing models that greatly overestimate.

george e. smith

November 21, 2013 11:02 pm

I didn’t catch the reason for the “violin” plots. Isn’t all of the information contained in either half of the drawing, and in the usual probability graph orientation ?