Testing … testing … is this model powered up?

Guest Post by Willis Eschenbach

Over at Judith Curry’s excellent blog she has a post on how to test the climate models. In response I wrote a bit about some model testing I did four years ago, and I thought I should expand it into a full post for WUWT. We are being asked to bet billions of dollars on computer model forecasts of future climate catastrophe. These global climate models, known as GCMs, forecast that the globe will warm extensively over the next century. In this context, it is prudent to take a look at how well the models have done at “hindcasting” historical temperatures, when presented with actual data from historical records.

I analysed the hindcasts of the models that were used in Amplification of Surface Temperature Trends and Variability in the Tropical Atmosphere, (PDF, 3.1Mb) by B. D. Santer et al. (including Gavin Schmidt), Science, 2005 [hereinafter Santer05].

In that study, results were presented for the first time showing two sets of observational data plus 9 separate GCM temperature “hindcasts” for the temperatures at the surface, troposphere, and stratosphere of the tropical region (20°N to 20°S) from 1979 to 2000. These models were given the actual 1979-2000 data for a variety of forcings (e.g., volcanic eruptions, ozone levels, see below for a complete list). When fed with all of these forcings for 1979-2000, the GCMs produced their calculated temperature hindcasts. I have used the same observational data and the same model results used by Santer. Here’s what their results look like:

Results from Santer05 Analysis. Red and orange (overlapping) are observational data (NOAA and HadCRUT2v). Data digitized from Santer05. See below for data availability.

The first question that people generally ask about GCM results like this is “what temperature trend did the models predict?”. This, however, is the wrong initial question.

The proper question is “are the model results life-like?” By lifelike, I mean do the models generally act like the real world that they are supposedly modeling? Are their results similar to the observations? Do they move and turn in natural patterns? In other words, does it walk like a duck and quack like a duck?

To answer this question, we can look at how the models stand, how they move, and how they turn. By how the models stand, I mean the actual month by month temperatures that the GCMs hindcast. How the models move, on the other hand, means the monthly changes in those same hindcast temperatures. This is the month-to-month movement of the temperature.

And how the models turn means the monthly variation in how much the temperatures are changing, in other words how fast they can turn from warming to cooling, or cooling to warming.

In mathematical terms, these are the hindcast surface temperature (ST), the monthly change in temperature [written as ∆ST/month, where the “∆” is the Greek letter delta, meaning “change in”], and the monthly change in ∆ST [∆(∆ST)/month]. These are all calculated from the detrended temperatures, in order to remove the variations caused by the trend. In the same manner as presented in the Santer paper, these are all reduced anomalies (anomalies less average monthly anomalies) which have been low-pass filtered to average out slight monthly variations.
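
To make the three measures concrete, here is a minimal sketch of how the series might be computed from a monthly temperature record. This is not the author’s actual workflow (the analysis was done from the digitized Santer05 data); the array name `st`, the January start month, and the simple moving-average filter are assumptions for illustration, since the exact low-pass filter is not specified here.

```python
import numpy as np

def prepare_series(st, window=3):
    """Return detrended reduced anomalies (ST), their month-to-month change
    (dST/month), and the change in that change (d(dST)/month).

    `st` is assumed to be a 1-D array of monthly tropical surface
    temperatures starting in January (e.g. Jan 1979 - Dec 2000)."""
    months = np.arange(len(st))

    # Reduced anomalies: subtract each calendar month's average
    # (removes the seasonal cycle)
    monthly_means = np.array([st[m::12].mean() for m in range(12)])
    anom = st - monthly_means[months % 12]

    # Detrend: remove the least-squares linear trend
    slope, intercept = np.polyfit(months, anom, 1)
    detrended = anom - (slope * months + intercept)

    # Illustrative low-pass filter: a short centred moving average
    kernel = np.ones(window) / window
    smoothed = np.convolve(detrended, kernel, mode="same")

    d_st = np.diff(smoothed)   # how the temperature "moves"
    dd_st = np.diff(d_st)      # how it "turns"
    return smoothed, d_st, dd_st
```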

How The Models Stand

How the models stand means the actual temperatures they hindcast. The best way to see this is a “boxplot”. The interquartile “box” of the boxplot represents the central half of the data (first to third quartiles). In other words, half the time the surface temperature is somewhere in the range delineated by the “box”. The “whiskers” at the top and bottom show the range of the rest of the data out to a maximum of 1.0 times the box height. “Outliers”, data points which are outside of the range of the whiskers, are shown as circles above or below the whiskers. Here are the observational data (orange and red for NOAA and HadCRUT2v surface temperatures), and the model results, for the hindcast temperatures. A list of the models and the abbreviations used is appended.

Figure 1. Santer Surface Temperature Observational Data and Model Hindcasts. Colored boxes show the range from the first (lower) quartile to the third (upper) quartile. NOAA and HadCRUT (red and orange) are observational data, the rest are model hindcasts. Notches show 95% confidence interval for the median. “Whiskers” (dotted lines going up and down from colored boxes) show the range of data out to the size of the Inter Quartile Range (IQR, shown by box height). Circles show “outliers”, points which are further from the quartile than the size of the IQR (length of the whiskers). Gray rectangles at top and bottom of colored boxes show 95% confidence intervals for quartiles. Hatched horizontal strips show 95% confidence intervals for quartiles and median of HadCRUT observational data. See References for list of models and data used.

Fig. 1 shows what is called a “notched” boxplot. The heavy dark horizontal lines show the median of each dataset. The notches on each side of each median show a 95% confidence interval for that median. If the notches of two datasets do not overlap vertically, we can say with 95% confidence that the two medians are significantly different. The same is true of the gray rectangles at the top and bottom of each colored box. These are 95% confidence intervals on the quartiles. If these do not overlap, once again we have 95% confidence that the quartiles are significantly different. The three confidence ranges of the HadCRUT data are shown as hatched bands behind the boxplots, so we can compare models to the 95% confidence level of the data.
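
For readers who want to reproduce this kind of figure, here is a hedged sketch of how a notched boxplot with 1.0 × IQR whiskers and bootstrapped quartile confidence intervals might be drawn. The dictionary `series`, the dataset names, and the bootstrap settings are assumptions for illustration, not the plotting code actually used for Figure 1.

```python
import numpy as np
import matplotlib.pyplot as plt

def quartile_ci(x, q, n_boot=2000, alpha=0.05, seed=0):
    """Bootstrap a 95% confidence interval for the q-th quantile of x."""
    rng = np.random.default_rng(seed)
    boots = [np.quantile(rng.choice(x, size=len(x), replace=True), q)
             for _ in range(n_boot)]
    return np.quantile(boots, [alpha / 2, 1 - alpha / 2])

def notched_boxplot(series):
    """series: dict mapping a dataset name (e.g. 'HadCRUT', 'UKMO') to a
    1-D array of detrended monthly anomalies."""
    names = list(series)
    data = [np.asarray(series[n]) for n in names]

    fig, ax = plt.subplots()
    # notch=True draws the 95% confidence interval for the median;
    # whis=1.0 limits the whiskers to 1.0 x IQR, as in the figures above
    ax.boxplot(data, notch=True, whis=1.0, labels=names)
    ax.set_ylabel("Temperature anomaly (°C)")

    # Bootstrapped 95% CIs for the quartiles (the gray rectangles)
    for n, x in zip(names, data):
        print(n, quartile_ci(x, 0.25), quartile_ci(x, 0.75))
    plt.show()
```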

Now before we get to the numbers and confidence levels, which of these model hindcasts look “lifelike” and which don’t? It’s like one of those tests we used to hate to take in high school, “which of the boxplots on the right belong to the group on the left?”

I’d say the UKMO model is really the only “lifelike” one. The real world observational data (NOAA and HadCRUT) has a peculiar and distinctive shape. The colored boxes showing the interquartile range of the data are short. There are numerous widely spread outliers at the top, and a few outliers bunched up close to the bottom. This shows that the tropical ocean often gets anomalously hot, but it rarely gets anomalously cold. UKMO reproduces all of these aspects of the observations pretty well. M_medres is a distant second, and none of the others are even close. CCSM3, GISS-EH, and PCM often plunge low, way lower than anything in the observations. CM2.1 is all over the place, with no outliers. CM2.0 is only slightly better, with an oversize range and no cold outliers. GISS-ER has a high median, and only a couple outliers on the cold side.

Let me digress for a moment here and talk about one of the underlying assumptions of the climate modellers. In a widely-quoted paper explaining why climate models work (Thorpe 2005, see References), the author states (emphasis mine):

On both empirical and theoretical grounds it is thought that skilful weather forecasts are possible perhaps up to about 14 days ahead. At first sight the prospect for climate prediction, which aims to predict the average weather over timescales of hundreds of years into the future, if not more does not look good!

However the key is that climate predictions only require the average and statistics of the weather states to be described correctly and not their particular sequencing. It turns out that the way the average weather can be constrained on regional-to-global scales is to use a climate model that has at its core a weather forecast model. This is because climate is constrained by factors such as the incoming solar radiation, the atmospheric composition and the reflective and other properties of the atmosphere and the underlying surface. Some of these factors are external whilst others are determined by the climate itself and also by human activities. But the overall radiative budget is a powerful constraint on the climate possibilities. So whilst a climate forecast model could fail to describe the detailed sequence of the weather in any place, its climate is constrained by these factors if they are accurately represented in the model.

Well, that all sounds good, and if it worked, it would be good. But the huge differences between the model hindcasts and actual observations clearly demonstrate that in all except perhaps one of these models the average and statistics are not described correctly …

But I digress … the first thing I note about Fig. 1 is that the actual tropical temperatures (NOAA and HadCRUT) stay within a very narrow range, as shown by the height of the coloured interquartile boxes (red and orange).

Remember that the boxplot means that half of the time, the actual tropical surface temperature stayed in the box, which for the observations shows a +/- 0.1° temperature range. Much of the time the tropical temperature is quite stable. The models, on the other hand, generally show a very different pattern. They reflect much more unstable systems, with the temperatures moving in a much wider range.

The second thing I note is that the model errors tend to be on the hot side rather than the cold side. The PCM, GISS-EH, and CCSM3 models, for example, all agree with the observations at the first (cooler) quartile. But they are too hot at the median and the third (upper) quartile. This is evidence that upwards temperature feedbacks are being overestimated in the models, so that when the models heat up, they heat too far, and they don’t cool down either as fast or as far as the real tropics do. Again, of the nine models, only the UKMO model reproduces the observed behaviour. All the rest show a pattern that is too hot.

Third, I note that all of the model interquartile boxes (except UKMO) are taller than the actual data, regardless of the range of each model’s hindcast. Even models with smaller ranges have taller boxes. This suggests again that the models have either too much positive feedback, or too little negative feedback. Negative feedback tends to keep data bunched up around the median (short box compared to range, like the observational data), positive feedback pushes it away from the median (tall box, e.g. PCM, with range similar to data, much taller box).

Mathematically, this can be expressed as an index of total data range/IQR (Inter Quartile Range). For the two actual temperature datasets, this index is about 5.0 and 5.3, meaning the data are spread over a range about five times the IQR. All the models have indices in the range of 2.7-3.6 except UKMO, which has an index of 4.7.
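
A minimal sketch of that index, assuming a 1-D array of detrended monthly anomalies; the function name is mine, for illustration, not from the post’s own calculations.

```python
import numpy as np

def spread_index(x):
    """Total data range divided by the interquartile range (IQR).
    Per the text, the observations score roughly 5.0-5.3, UKMO about 4.7,
    and the other models roughly 2.7-3.6."""
    q1, q3 = np.percentile(x, [25, 75])
    return (np.max(x) - np.min(x)) / (q3 - q1)
```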

Some of the models are so different from the data that one wonders why these are considered “state of the art” models. The CM models, both 2.0 and 2.1, give hindcast results that go both way hotter and way colder than the observational data. And all of the models but two (GISS-ER and UKMO) give hindcast results that go colder than the data.

How the Models Stand – Summary

  • UKMO is the most lifelike. It matched all three quartile confidence intervals, as well as having a similar outlier distribution.
  • M_medres did a distant second best. It only matched the lower quartile confidence interval, with both the median and upper quartile being too hot.
  • The rest of the models were all well out of the running, showing distributions which are strikingly different from the observational data.
  • Least lifelike I’d judge to be the two CM models, CM2.0 and 2.1, and the PCM model.

Since we’ve seen the boxplot, let’s take a look at the temperature data for two lifelike and two un-lifelike models, compared with the observational data. As mentioned above, there is no trend because the data is detrended so we can measure how it is distributed.

Figure 2. Surface Temperature Observational Data and Model Hindcasts. Two of the best on the left, two of the worst on the right. See References for list of models and data used.

Note how for long periods (1979-82, 1990-97) the actual tropical surface temperature hardly varied from zero. This is why the box in the boxplot is so short, with half the data within +/- 0.1°C of the average.

The most lifelike of the models (UKMO and M_medres), while not quite reproducing this behaviour, came close. Their hindcasts at least look possible. The CM2.1 and PCM models, on the other hand, are wildly unstable. They hindcast extreme temperatures, and spend hardly any of their time in the +/- 0.1° C range of the actual temperature.

How The Models Move

How the models move means the month-to-month changes in temperature. The tropical ocean has a huge thermal mass, and it doesn’t change temperature very fast. Here are the boxplots of the movement of the temperature from month to month:

Figure 3. Surface Temperature Month-to-Month Changes (∆ST/month), showing Data and Model Hindcasts. Boxes show the range from the first quartile to the third quartile (IQR, or inter-quartile range). Notches show 95% confidence interval for the median. Circles show “outliers”, points which are further from the quartile than the size of the IQR (length of the whiskers). See References for list of models and data used.

Here in the month-to-month temperature changes, we see the same unusual pattern we saw in Fig. 1 of the temperatures. The observations have a short interquartile box compared to the range of the data, and a number of outliers. In this case there are about equal numbers of outliers above and below the box. The inter quartile range (IQR, the box height) of tropical temperature change is about +/- 0.015°C per month, indicating that half of the time the temperature changes that little or less. The total range of the temperature change is about +/- 0.07°C per month. It is worth noting that in the 21 year record, the tropical surface never warmed or cooled faster than 0.07°C per month, so the models predicting faster warming or cooling than that must be viewed with great suspicion.
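
As a hedged illustration of the statistics quoted in this paragraph, a few lines like the following would recover the IQR, the total range, and the fastest warming and cooling rates from a monthly-change series (for example the `d_st` array from the earlier sketch). The values in the comments are the ones quoted above, not output I have verified.

```python
import numpy as np

def movement_stats(d_st):
    """Summary statistics for the month-to-month temperature changes."""
    q1, q3 = np.percentile(d_st, [25, 75])
    return {
        "IQR (degC/month)": q3 - q1,               # text: about +/- 0.015 for obs
        "total range (degC/month)": np.ptp(d_st),
        "max warming (degC/month)": np.max(d_st),  # text: obs never exceed ~0.07
        "max cooling (degC/month)": np.min(d_st),  # text: obs never fall below ~-0.07
    }
```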

Although all but two of the models (CM2.0 and CM2.1) matched all three confidence intervals, there are still significant differences in the distribution of the hindcasts and the observations. The most lifelike is M_medres, with UKMO second. GISS-ER (purple) is curious, in that the month-to-month movements are all very small, never more than +/- 0.03 degrees per month. It never hindcasts anything like the larger monthly changes that we see in the actual data.

Next, consider the speed at which the ocean heats and cools. In the real world, as shown by the data, the heating and cooling rates are about the same. This makes sense, as we would expect the tropical ocean to radiate heat away at something like the same rate it gains it. It has to lose the heat it gains during the day by the next morning for the temperature to stay the same over several days.

Now look at the data distribution for GISS-EH, CM2.0 or CM2.1. They rarely heat up fast, but they cool down very fast (short whisker on top, long whisker plus outliers on bottom). Slow heating and fast cooling, that doesn’t make physical sense. The maximum heating rate for GISS-EH (0.03°C/mo.) is less than half the maximum heating rate of the actual tropics. PCM has the same problem, but in the other direction, heating up much faster than it cools down.

How the Models Move – Summary

  • M_medres is the most lifelike. It matched all three quartile confidence intervals, as well as having a similar outlier distribution.
  • UKMO did a credible second best. However, the ranges of UKMO and M_medres were too small.
  • The rest of the models were well out of the running, showing distributions which are strikingly different from the observational data.
  • The least lifelike? I’d judge that to be the two CM models, CM2.0 and 2.1, and the CCSM3 model.

Let’s take a look at these winners and losers at reproducing the changes in temperature (∆ST). As mentioned above, the data is detrended so we can see how it is distributed.

Figure 4. Surface Temperature Observational Data and Model Hindcast Delta ST. Shows monthly changes in the temperature. Two of the best, two of the worst. See References for list of models and data used.

Once again, the boxplots correctly distinguish between lifelike and un-lifelike models. The large and extremely rapid temperature drops of the CM2.1 model are clearly unnatural. The GISS-ER model, on the other hand, hardly moves from month to month and is unnatural in the other direction.

How the Models Turn

Acceleration is the rate of change of speed. In this context, speed is the rate at which the tropical temperatures warm and cool. Acceleration is how fast the warming or cooling rate changes. It measures how fast a rising temperature can turn to fall again, or how fast a falling temperature can turn into a rising temperature. Since acceleration is the rate of change (∆) of the change in temperature (∆ST), it is notated as ∆(∆ST). Here are the results.

Figure 5. Surface Temperature Month-to-Month Changes in ∆ST, showing Data and Model Hindcasts. Boxes show the range from the first quartile to the third quartile. Notches show 95% confidence interval for the median. Whiskers show the range of data out to the interquartile box height. Circles show outliers. See References for list of models and data used.

Using the 95% confidence interval of the median and the quartiles, we would reject CM2.1, CM2.0, CCSM3, GISS-ER, and UKMO. PCM and M_medres are the most lifelike of the models. UKMO and GISS-ER are the first results we have seen which have significantly smaller interquartile boxes than the observations.

CONCLUSIONS

The overall conclusion from looking at how the models stand, move, and turn is that the models give results that are quite different from the observational data. None of the models were within all three 95% confidence intervals (median and two quartiles) of all of the data (surface temperatures ST, change in surface temps ∆ST, and acceleration in surface temps ∆∆ST). UKMO and M_medres were within 95% confidence intervals for two of the three datasets.

A number of the models show results which are way too large, entirely outside the historical range of the observational data. Others show results that are much less than the range of observational data. Most show results which have a very different distribution from the observations.

These differences are extremely important. As the Thorpe quote above says, before we can trust a model to give us future results, it first needs to be able to give hindcasts that resemble the “average and statistics of the weather states”. None of these models are skillful at that. UKMO does the best job, and M_medres comes in third best, with nobody in second place. The rest of the models are radically different from the reality.

The claim of the modellers has been that, although their models are totally unable to predict the year-by-year temperature, they are able to predict the temperature trend over a number of years. And it is true that for their results to be believable, they don’t need to hindcast the actual temperatures ST, monthly temperature changes ∆ST, and monthly acceleration ∆∆ST.

However, they do need to hindcast believable temperatures, changes, and accelerations. Of these models, only UKMO, and to a much lesser extent M_medres, give results that by this very preliminary set of measures are at all lifelike. It is not believable that the tropics will cool as fast as hindcast by the CM2.1 model (Fig. 3). CM2.1 hindcasts the temperature cooling at three times the maximum observed rate. On the other hand, the GISS-ER model is not believable because it hindcasts the temperature changing at only half the range of changes shown by observation. Using these models in the IPCC projections is extremely bad scientific practice.

There is an ongoing project to collect satellite based spectrally resolved radiances as a common measure between models and data. Unfortunately, we will need a quarter century of records to even start analysing, so that doesn’t help us now.

What we need now is an agreed upon set of specifications that constitute the mathematical definition of “lifelike”. Certainly, at a first level, the model results should resemble the data and the derivatives of the data. As a minimum standard for the models, the hindcast temperature itself should be similar in quartiles, median, and distribution of outliers to the observational data. Before we look at more sophisticated measures such as the derivatives of the temperature, or the autocorrelation, or the Hurst exponent, or the amplification, before anything else the models need to match the “average and statistics” of the actual temperature data itself.

By the standards I have adopted here (overlap of the 95% confidence notches of the medians, overlap of the 95% confidence boxes of the quartiles, similar outlier distribution), only the UKMO model passed two of the three tests. Now you can say the test is too strict, that we should go for the 90% confidence intervals and include more models. But as we all saw when we first looked at Figure 1, before any of the numbers and percentages, the only model that looked lifelike was the UKMO model. That suggests to me that the 95% standard might be a good one.
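
For concreteness, here is one possible sketch of such a screen under the standards just listed, reusing the bootstrap `quartile_ci()` from the boxplot sketch earlier. The outlier-distribution comparison is deliberately left out, since no numeric measure for it is defined above, and the function is illustrative rather than the test actually applied in this post.

```python
def intervals_overlap(a, b):
    """True if two (low, high) intervals overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

def is_lifelike(obs, model):
    """True if the model's 95% CIs for the first quartile, median, and third
    quartile all overlap the corresponding CIs of the observations."""
    return all(
        intervals_overlap(quartile_ci(obs, q), quartile_ci(model, q))
        for q in (0.25, 0.50, 0.75)
    )
```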

But I’m not insisting that this be the test. My main interest is that there be some test, some way of separating the wheat from the chaff. Some, indeed most of these models are clearly not ready for prime time — their output looks nothing like the real world. To make billion dollar decisions on an untested, unranked suite of un-lifelike models seems to me to be the height of foolishness.

OK, so you don’t like my tests. Then go ahead and propose your own, but we need some suite of tests to make sure that, whatever the accuracy of the models’ output, it is at least lifelike … because remember, being lifelike is a necessary but not sufficient condition for the accurate forecasting of temperature trends.

My best to everyone,

w.

DATA

The data used in this analysis is available here as an Excel workbook.

REFERENCES

B. D. Santer et al., 2005, September 2, “Amplification of Surface Temperature Trends and Variability in the Tropical Atmosphere”, Science

Thorpe, Alan J., 2005, “Climate Change Prediction — A challenging scientific problem”, Institute of Physics, 76 Portland Place, London W1B 1NT

MODELS USED IN THE STUDY

National Center for Atmospheric Research in Boulder (CCSM3, PCM)

Institute for Atmospheric Physics in China (FGOALS-g1.0)

Geophysical Fluid Dynamics Laboratory in Princeton (GFDL-CM2.0, GFDL-CM2.1)

Goddard Institute for Space Studies in New York (GISS-AOM, GISS-EH, GISS-ER)

Center for Climate System Research, National Institute for Environmental Studies, and Frontier Research Center for Global Change in Japan (MIROC-CGCM2.3.2(medres), MIROC-CGCM2.3.2(hires))

Meteorological Research Institute in Japan (MRI-CGCM2.3.2).

Canadian Centre for Climate Modelling and Analysis (CCCma-CGCM3.1(T47))

Meteo-France/Centre National de Recherches Meteorologiques (CNRM-CM3)

Institute for Numerical Mathematics in Russia (INM-CM3.0)

Institute Pierre Simon Laplace in France (IPSL-CM4)

Hadley Centre for Climate Prediction and Research in the U.K. (UKMO-HadCM3 and UKMO-HadGEM1).

FORCINGS USED BY THE MODELS

December 2, 2010 5:54 am

Murray Grainger says: “Apples and Oranges folks. The fact that UKMO is abysmally bad at predicting UK weather cannot be used as evidence that their model is bad.”
But Murray, the fact is that they are equally bad at predicting global temperature … sorry, that’s an overstatement, they get the weather right some of the time!
Nine out of nine years they predicted a warming of temperature and 8 out of those nine years the temperature was lower than the 50% confidence interval and something like 6 of those years it was lower than the 75% confidence interval.
And year after year after year they would make a grand press release about their great climate predictions for the next year … without so much as a hint how useless they had been in all the previous years.
It’s not so much that they got the forecast wrong, it is that they were so outspoken about their ability to forecast and even had the audacity to say they were “accurate” when it wasn’t significantly better than just saying: “next year will be the same as this”.
They weren’t forecasting, they were just taking last year’s temperature and adding 0.06C each and every year, getting it high every year and then issuing press releases saying: “our forecasts are wonderful because they are accurate to 0.06C”.
And we all know that sooner or later by pure statistical fluke, it would have been a warm year and then they’d have crowed to everyone how good their forecast was!
They are nothing more than charlatans when it comes to global temperature forecasts.

Baa Humbug
December 2, 2010 5:59 am

Richard M says:
December 2, 2010 at 5:24 am

Any reason this test was limited to 30 years? Why not start in 1880?

Richard, Vincent Gray had something to say about this…

“Validation of climate models based on the greenhouse effect requires that they be compared with a past temperature sequence assumed to be free of natural variability. The annual global surface temperature anomaly series before 1940 does not comply with this requirement as greenhouse gas emissions were unimportant over this period. “

Bernie
December 2, 2010 6:03 am

gnarf:
The name of this family of models, GCM, suggests to me that your first assumption that the models are designed to predict temperatures is incorrect.

December 2, 2010 6:04 am

Great post, Willis, but, as they say, you can take the boy out of the country but you can’t take the country out of the boy. Your ability to communicate abstract ideas is impressive and always reasonably clear to follow for a layman such as I.
But a comment made to me in 1955 by an old agricultural worker came back to me as I read your post:
“Gonna be a hard winter, Mate. The ducks are nesting real high right now”
told me more about wisdom based on observations collected cumulatively over the lifetimes of generations than all the computer modeling that smart blokes in air-conditioned labs playing with expensive electronic devices can. I know that modeling has a place in science, but ignoring information built up over generations seems pretty silly to me. I guess if this accumulated wisdom was codified in a data base, that would make it more sorta scienc-y?

Lance Wallace
December 2, 2010 6:05 am

Your table shows two UKMO models, one with only 4 “forcings” and the other with 9. Which one was in the graphs? As you know, with 9 free parameters one can fit up to a 9th-order polynomial perfectly.

Bill Conley
December 2, 2010 6:06 am

I build trading systems for a living. These systems attempt to “predict” future market prices and execute trades based on those predictions in order to make profits for their clients. The first question potential users of my systems ask is how they have performed in the past (in real and/or simulated trading). If the answer is poorly, you can bet they take their money and move on.
Here we are asking all the world to bet vast sums of money and make untold sacrifices based on models that have predicted very poorly on a historical basis. I’m sorry, but the world should take its money and “move on.”

December 2, 2010 6:06 am

Mods, small omission – I left out ‘it’ before ‘more’ in the last line – can you fix, PLEASE?
Sorry, unused to deep thoughts after good luncheon!
[OK…. done… bl57~mod]

NormD
December 2, 2010 6:10 am

I used to do modeling of transistors and circuits back in the day.
I always wonder how climate models work in conditions that we know occurred in the Earth’s history: ice ages, MWP, very high O2 and CO2, etc. Do they become unstable or do they properly show that the climate stabilized and returned to a norm? In other words do they contain all the correct variables and relationships?
It was easy to construct circuit models that modeled performance when inputs were only slightly tweaked (what happens if we changed the 10K resistor to 11K) but much harder to model what happened if we made major changes (10K to 100K), which paradoxically was exactly what users wanted to do and did. I cannot count all the design failures we had when users pushed models beyond their operational areas.
If we have only calibrated climate models over short times and relatively stable data, it seems improper to push the model well outside this range.

Bill Illis
December 2, 2010 6:16 am

Tropics temperatures are not actually stable, they vary by a large margin.
But they are dominated by the ENSO – which has an impact of +/-0.6C. After one accounts for this and the lesser impact from the AMO, there is NO global warming signal left (well, 0.2C by 2100). So, you can’t model the Tropics without having a dominant ENSO module.
Here is my latest reconstruction of UAH Tropics temperatures (and the forecast going out 14 months).
http://img338.imageshack.us/img338/1889/uahtropicsmodeloct10.png

December 2, 2010 6:34 am

In order to simulate anything you need a proven formula, an equation. However, does anybody know how climate work?
However, the model used by the UN’s FAO organization, practically applied to fish catches, has been proved to be a successful tool for such an economic and real (not imaginary) activity:
ftp://ftp.fao.org/docrep/fao/005/y2787e/
See the document:
Archive: y2787e08.pdf

Robert
December 2, 2010 6:41 am

I’m sure glad we aren’t betting real money on these horserace prediction systems. Oh wait. We are. Not as much money as requested, but we are indeed betting real money, as a society.
And as mentioned, averaging them all is obviously lunacy.
Have these people no standards?? Hard work is not the same thing as accurate, useful work. Or are these just the prototype models, to make a pitch for the real funding for a real model?
Excellent post on an excellent topic. Many thanks for taking these models out for a clear spin, with clear results

Pamela Gray
December 2, 2010 6:43 am

Willis, I quote the compliment from the President in the movie Independence Day: “Not bad, not bad at all!”

John from CA
December 2, 2010 6:45 am

Thanks, great post Willis
IMO, there appear to be a couple of very fundamental flaws to all the models.
The first, the models aren’t modular (module = 1 or more Peer Reviewed aspects of the Model). This makes the Models inefficient, expensive, and difficult to evaluate.
The second issue, the use of the term “Global Average” is a fallacy in a system as dynamic and regional as the Earth and its climate system.
It seems like the best way to fix the models is to fix Climate Science.

Robert R. Clough - Thorncraft
December 2, 2010 6:51 am

Interesting post, but if the models do not show what has actually happened, of what use are they? As a layman (Historian, retired), I want to know what the weather will be, hopefully, for the next 5 + days, and what the trend is forecast for the next 20 years.
From studies I have read (yes I can and do read scientific stuff), I’d say the quality of the climate models is less than poor, perhaps at the same level as social science or economics models!

Lance
December 2, 2010 6:51 am

Can’t see models working. When they have changed the temperatures in the past so many times, they don’t know what actually happened anymore, so their history is corrupted. How can they expect GI (garbage in) to create precise projections for the future?

Hoser
December 2, 2010 7:02 am

“OK, so you don’t like my tests. Then go ahead and propose your own, but we need some suite of tests….”
No. We don’t. These models are toys. Expensive toys.
We should not base policy on any of these models. Sorry, but climate is fundamentally unpredictable, mainly because we have no ability to measure or predict basic drivers of climate, such as the cosmic ray flux. Even if we can understand the processes causing solar flux variations, we still won’t be able to predict variations in galactic cosmic ray flux at the heliosphere. Why should we assume that is a constant? Silly humans.
Therefore, adaptation is the answer, not mitigation. Mitigation assumes you know what is happening and can do something to improve the outcome by doing something ahead of time. Bureaucracies love controlling people and their money, so mitigation fits right into their modus operandi.
Humans are adaptable, that is our strength. Adaptation is a test, and is best accomplished on a small scale. There is no guarantee of success. When governments force common behavior across large populations, adaptation tests succeed or fail for that whole group, and there are fewer opportunities to test different adaptation approaches. We may not be able to discover what works in time if government keeps taking more and more control.
Population geneticists know that diversity is the strength of any population. That is because the optimal individual type for a given set of conditions may not be the optimal type when the conditions change. Change is not necessarily predictable. A small subpopulation can become the winner. Without that subpopulation, the species could go extinct. By extension, that means a diversity of businesses, regulations, and strategies are more likely to provide answers than top-down mandates. When government gets it wrong, we all suffer. Look at our economies now.
The worst case scenario is one-world-government. Under these conditions, there is the least amount of social and technical diversity. Humans should beware becoming a monoculture. Our strength is adaptation, and fundamentally, it is an individual decision. As our freedom is more limited, and businesses become more regulated, our species becomes more endangered. Some people like that idea, but mass sui/homicide is a subject for another post.
Climate models are more a tool of government policy, than good science. That is because they are funded because they produce results the regulators need to support their agendas. We can be sure the models and modelers are selected by funding based on political need, not necessarily scientific merit.

GregO
December 2, 2010 7:06 am

Willis,
Thank you for your time and effort examining these climate models. Your passion for truth is an inspiration.

Alex the skeptic
December 2, 2010 7:08 am

Willis Eschenbach has, at least, started the ball rolling for the eventual creation of a benchmark for testing GCMs. This could even be the benchmark, but I could be wrong. What comes out clearly is that GCMs are mostly weak or very weak, if not totally wrong, in their predictive powers.
Considering this, can one produce a graph showing the billions of dollars or euros spent per 0.01C of error from and due to these models? I mean the cost of producing these super computers and software added to the cost that humanity has paid in trying to fight/mitigate a non existent enemy.
I would predict something in the region of $-€10 to 100 billion per 0.01C error.

Richard S Courtney
December 2, 2010 7:08 am

When considering ensemble model results it should always be remembered that
(a) there is only one Earth
(b) so at most only one of the models is right
(c) and average wrong is wrong.
Richard

December 2, 2010 7:14 am

What would an engineer do … reset the models to a known, already measured time, and run the predictions for that. Say predict the 2009 temperature, starting in 1970. That way we have a known start and end. So how they fare in 2009 would be the calibration. Otherwise, the models are useless, if not calibrated.

Alexander
December 2, 2010 7:19 am

A good summary paper but it fails to answer a key issue, the incorrect optical physics used in the models to predict cloud albedo from optical depth. Whilst the ‘two stream’ approximations originally from Sagan appear to give the right relationship, when used to predict change of albedo caused by aerosol pollution, the results go the wrong way for thicker clouds. It’s because the assumption of constant Mie asymmetry factor is wrong, also direct backscattering at the upper cloud boundary isn’t taken account of.
So, ‘cloud albedo effect’ cooling, 175% of the raw median net present AGW in AR4 is imaginary. Without it you have to reduce the IPCC’s predictions of future CO2-AGW by at least a factor of three. Furthermore, because aerosol pollution probably reduces direct backscattering, ‘cloud albedo effect’ cooling becomes heating, another AGW.
That’s a game changer because it’s self-limiting, possibly why ocean heat content has stopped rising, implying most AGW was the increase of low level tropical cloud albedo from Asian aerosol pollution, and it saturated in 2003.
So, I believe the models are physically wrong in key areas and the fit to real air temperatures is illusory. Hence, until they’re fixed, they can’t predict the future.

Roger Andrews
December 2, 2010 7:21 am

A common problem with analyses of this type is the assumption that the HadCRUT3 and NOAA records faithfully reflect actual temperature observations. They don’t. They are “corrected” records that often look nothing like the raw records they are derived from, and in some cases the “corrections” used to construct them are demonstrably invalid (viz. the 0.4-0.5C WWII stair-step cooling adjustment applied to HadSST2, which shifts HadCRUT3 artificially downwards by about 0.3C after 1946).
It would be nice to see a comparison of model output against unadjusted temperature data.

Ian W
December 2, 2010 7:23 am

Thanks Willis it is nice to see someone talking about ‘validation’ in climatology. Validation seems to be avoided by climate ‘scientists’ to the extent that when the real world observations don’t match the models it is the real world observations that are questioned.
One would have thought that with trillions of dollars and supposedly the survival of the human race at stake, that someone somewhere would have predeclared the validation tests for GCMs and those that failed them would have all funding terminated. This is what happens in other areas of science and engineering with any safety implications.
The lack of validation and acceptance of poor results on its own shows that no-one really believes that there is a real threat from ‘climate change’. Put this in another context – if the threat was a collision between the Earth and a large asteroid in ten years time and some modeling groups were producing models that couldn’t hindcast/forecast the trajectory of the asteroid within 95% accuracy – would they still be believed and funded?
Some issues I have are more climate related – I feel that the use of ‘de-trended’ averaging of anomalies hides a multitude of errors. As others have commented another test would be to have regional metrics perhaps each degree of latitude and longitude with a set of actual forecast values for each month – surface, mid-troposphere, tropopause, temperature, humidity ambient wind. The reason for this is that the ‘statistical-weather is climate’ argument seems to depend on the weather having some kind of Markov property – and it does not. The actual values for each degree of latitude and longitude could easily be checked against the analyzed atmosphere and provide a detailed level of model validation. This would also allow the modelers to see where their models were going awry.
Unfortunately, the efforts appear to be in the other direction to use validation metrics that are trends based on averaged anomalies of coarse low granularity data and hide model inaccuracies; despite the model output using spurious precisions of hundredths of a degree. This seems to be aimed more at receiving further funding than at validatable accuracy.

Baa Humbug
December 2, 2010 7:31 am

Alexander K says:
December 2, 2010 at 6:04 am

I guess if this accumulated wisdom was codified in a data base, that would make it more sorta scienc-y?

Not quite what you’re looking for but….. it’s called a “Farmers’ Almanac”
Judging by the one I saw some years back, their predictive capability is orders of magnitude higher than UKMetO or Aussie BoM

Steve Keohane
December 2, 2010 7:37 am

Another great post Willis. I like what Alexander K says: December 2, 2010 at 6:04 am regarding observational wisdom. I noticed in July that a copper-colored hummingbird that migrates through annually, came and left two weeks early. Then in August, the ruby-throated and other hummingbird species packed up two weeks early as well, this in the heat of August. In November, three and four months later, we have snow cover two to three weeks early. WUWT?