Testing … testing … is this model powered up?

Guest Post by Willis Eschenbach

Over at Judith Curry’s excellent blog she has a post on how to test the climate models. In response I wrote a bit about some model testing I did four years ago, and I thought I should expand it into a full post for WUWT. We are being asked to bet billions of dollars on computer model forecasts of future climate catastrophe. These global climate models, known as GCMs, forecast that the globe will warm extensively over the next century. In this context, it is prudent to take a look at how well the models have done at “hindcasting” historical temperatures, when presented with actual data from historical records.

I analysed the hindcasts of the models that were used in Amplification of Surface Temperature Trends and Variability in the Tropical Atmosphere, (PDF, 3.1Mb) by B. D. Santer et al. (including Gavin Schmidt), Science, 2005 [hereinafter Santer05].

In that study, results were presented for the first time showing two sets of observational data plus 9 separate GCM temperature “hindcasts” for the temperatures at the surface, troposphere, and stratosphere of the tropical region (20°N to 20°S) from 1979 to 2000. These models were given the actual 1979-2000 data for a variety of forcings (e.g., volcanic eruptions, ozone levels, see below for a complete list). When fed with all of these forcings for 1979-2000, the GCMs produced their calculated temperature hindcasts. I have used the same observational data and the same model results used by Santer. Here’s what their results look like:

Results from Santer05 Analysis. Red and orange (overlapping) are observational data (NOAA and HadCRUT2v). Data digitized from Santer05. See below for data availability.

The first question that people generally ask about GCM results like this is “what temperature trend did the models predict?”. This, however, is the wrong initial question.

The proper question is “are the model results lifelike?” By lifelike, I mean do the models generally act like the real world that they are supposedly modeling? Are their results similar to the observations? Do they move and turn in natural patterns? In other words, does it walk like a duck and quack like a duck?

To answer this question, we can look at how the models stand, how they move, and how they turn. By how the models stand, I mean the actual month by month temperatures that the GCMs hindcast. How the models move, on the other hand, means the monthly changes in those same hindcast temperatures. This is the month-to-month movement of the temperature.

And how the models turn means the monthly variation in how much the temperatures are changing, in other words how fast they can turn from warming to cooling, or cooling to warming.

In mathematical terms, these are the hindcast surface temperature (ST), the monthly change in temperature [written as ∆ST/month, where the “∆” is the Greek letter delta, meaning “change in”], and the monthly change in ∆ST [ ∆(∆ST)/month ]. These are all calculated from the detrended temperatures, in order to remove the variations caused by the trend. In the same manner as presented in the Santer paper, these are all reduced anomalies (anomalies less average monthly anomalies) which have been low-pass filtered to smooth out slight monthly variations.
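To make those three quantities concrete, here is a minimal sketch in Python (an illustration only: the exact detrending and low-pass filter used in the original analysis are not spelled out above, so a linear detrend and a centred 3-month running mean stand in for them).

```python
import numpy as np

def prepare_series(st, months_per_year=12, window=3):
    """Return detrended, low-pass-filtered reduced anomalies (ST) plus
    their first and second monthly differences (dST and d(dST)).

    st : 1-D array of monthly surface temperatures, deg C.
    The linear detrend and 3-month running mean are stand-ins for the
    processing described (but not fully specified) in the text.
    """
    st = np.asarray(st, dtype=float)
    t = np.arange(len(st))

    # Remove the linear trend so only the variability remains.
    slope, intercept = np.polyfit(t, st, 1)
    detrended = st - (slope * t + intercept)

    # Reduced anomalies: subtract each calendar month's mean anomaly.
    monthly_means = np.array([detrended[m::months_per_year].mean()
                              for m in range(months_per_year)])
    reduced = detrended - monthly_means[t % months_per_year]

    # Simple low-pass filter: centred running mean over `window` months.
    smoothed = np.convolve(reduced, np.ones(window) / window, mode="valid")

    d_st = np.diff(smoothed)    # how the series "moves": dST per month
    dd_st = np.diff(d_st)       # how it "turns": d(dST) per month
    return smoothed, d_st, dd_st
```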

How The Models Stand

How the models stand means the actual temperatures they hindcast. The best way to see this is a “boxplot”. The interquartile “box” of the boxplot represents the central half of the data (first to third quartiles). In other words, half the time the surface temperature is somewhere in the range delineated by the “box”. The “whiskers” at the top and bottom show the range of the rest of the data out to a maximum of 1.0 times the box height. “Outliers”, data points which are outside the range of the whiskers, are shown as circles above or below the whiskers. Here are the observational data (orange and red for NOAA and HadCRUT2v surface temperatures), and the model results, for the hindcast temperatures. A list of the models and the abbreviations used is appended.
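In code, those boxplot ingredients are just order statistics. A minimal sketch, using the 1.0 × IQR whisker convention described above (many packages default to 1.5 × IQR):

```python
import numpy as np

def box_stats(x):
    """Quartiles, whiskers and outliers for one boxplot.

    Whiskers reach to the most extreme points within 1.0 * IQR of the
    box, matching the convention used in these figures.
    """
    x = np.asarray(x, dtype=float)
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - iqr, q3 + iqr
    inside = x[(x >= lo_fence) & (x <= hi_fence)]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_lo": inside.min(), "whisker_hi": inside.max(),
        "outliers": x[(x < lo_fence) | (x > hi_fence)],
    }
```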

Figure 1. Santer Surface Temperature Observational Data and Model Hindcasts. Colored boxes show the range from the first (lower) quartile to the third (upper) quartile. NOAA and HadCRUT (red and orange) are observational data, the rest are model hindcasts. Notches show 95% confidence interval for the median. “Whiskers” (dotted lines going up and down from colored boxes) show the range of data out to the size of the Inter Quartile Range (IQR, shown by box height). Circles show “outliers”, points which are further from the quartile than the size of the IQR (length of the whiskers). Gray rectangles at top and bottom of colored boxes show 95% confidence intervals for quartiles. Hatched horizontal strips show 95% confidence intervals for quartiles and median of HadCRUT observational data. See References for list of models and data used.

Fig. 1 shows what is called a “notched” boxplot. The heavy dark horizontal lines show the median of each dataset. The notches on each side of each median show a 95% confidence interval for that median. If the notches of two datasets do not overlap vertically, we can say with 95% confidence that the two medians are significantly different. The same is true of the gray rectangles at the top and bottom of each colored box. These are 95% confidence intervals on the quartiles. If these do not overlap, we can likewise say with 95% confidence that the quartiles are significantly different. The three confidence ranges of the HadCRUT data are shown as hatched bands behind the boxplots, so we can compare the models to the 95% confidence levels of the data.
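As an illustration of the notch test, here is a sketch using one common notch formula, median ± 1.58 × IQR / √n (other packages may compute the notch differently, so treat this as an assumption, not a description of the figures above):

```python
import numpy as np

def median_notch(x):
    """Approximate 95% confidence interval for the median, using the
    common notched-boxplot formula median +/- 1.58 * IQR / sqrt(n).
    (An assumed formula, for illustration only.)"""
    x = np.asarray(x, dtype=float)
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    half = 1.58 * (q3 - q1) / np.sqrt(len(x))
    return med - half, med + half

def medians_differ(obs, model):
    """True if the two notches do not overlap, i.e. the medians are
    judged different at roughly the 95% level."""
    (lo1, hi1), (lo2, hi2) = median_notch(obs), median_notch(model)
    return hi1 < lo2 or hi2 < lo1
```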

Now before we get to the numbers and confidence levels, which of these model hindcasts look “lifelike” and which don’t? It’s like one of those tests we used to hate to take in high school, “which of the boxplots on the right belong to the group on the left?”

I’d say the UKMO model is really the only “lifelike” one. The real-world observational data (NOAA and HadCRUT) have a peculiar and distinctive shape. The colored boxes showing the interquartile range of the data are short. There are numerous widely spread outliers at the top, and a few outliers bunched up close to the bottom. This shows that the tropical ocean often gets anomalously hot, but it rarely gets anomalously cold. UKMO reproduces all of these aspects of the observations pretty well. M_medres is a distant second, and none of the others are even close. CCSM3, GISS-EH, and PCM often plunge low, way lower than anything in the observations. CM2.1 is all over the place, with no outliers. CM2.0 is only slightly better, with an oversize range and no cold outliers. GISS-ER has a high median, and only a couple of outliers on the cold side.

Let me digress for a moment here and talk about one of the underlying assumptions of the climate modellers. In a widely-quoted paper explaining why climate models work (Thorpe 2005, see References), the author states (emphasis mine):

On both empirical and theoretical grounds it is thought that skilful weather forecasts are possible perhaps up to about 14 days ahead. At first sight the prospect for climate prediction, which aims to predict the average weather over timescales of hundreds of years into the future, if not more does not look good!

However the key is that climate predictions only require the average and statistics of the weather states to be described correctly and not their particular sequencing. It turns out that the way the average weather can be constrained on regional-to-global scales is to use a climate model that has at its core a weather forecast model. This is because climate is constrained by factors such as the incoming solar radiation, the atmospheric composition and the reflective and other properties of the atmosphere and the underlying surface. Some of these factors are external whilst others are determined by the climate itself and also by human activities. But the overall radiative budget is a powerful constraint on the climate possibilities. So whilst a climate forecast model could fail to describe the detailed sequence of the weather in any place, its climate is constrained by these factors if they are accurately represented in the model.

Well, that all sounds good, and if it worked, it would be good. But the huge differences between the model hindcasts and actual observations clearly demonstrate that in all except perhaps one of these models the average and statistics are not described correctly …

But I digress … the first thing I note about Fig. 1 is that the actual tropical temperatures (NOAA and HadCRUT) stay within a very narrow range, as shown by the height of the coloured interquartile boxes (red and orange).

Remember that the boxplot means that half of the time, the actual tropical surface temperature stayed in the box, which for the observations shows a +/- 0.1° temperature range. Much of the time the tropical temperature is quite stable. The models, on the other hand, generally show a very different pattern. They reflect much more unstable systems, with the temperatures moving in a much wider range.

The second thing I note is that the model errors tend to be on the hot side rather than the cold side. The PCM, GISS-EH, and CCSM3 models, for example, all agree with the observations at the first (cooler) quartile. But they are too hot at the median and the third (upper) quartile. This is evidence that upwards temperature feedbacks are being overestimated in the models, so that when the models heat up, they heat too far, and they don’t cool down either as fast or as far as the real tropics do. Again, of the nine models, only the UKMO model reproduces the observed behaviour. All the rest show a pattern that is too hot.

Third, I note that all of the model interquartile boxes (except UKMO) are taller than the actual data’s, regardless of the range of each model’s hindcast. Even models with smaller ranges have taller boxes. This suggests again that the models have either too much positive feedback, or too little negative feedback. Negative feedback tends to keep data bunched up around the median (a short box compared to the range, like the observational data), while positive feedback pushes it away from the median (a tall box; e.g., PCM has a range similar to the data but a much taller box).

Mathematically, this can be expressed as an index of total data range divided by the IQR (Inter Quartile Range). For the two actual temperature datasets, this index is about 5.0 and 5.3 respectively, meaning the data is spread over a range about five times the IQR. All the models have indices in the range of 2.7-3.6, except UKMO, which has an index of 4.7.
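For anyone who wants to compute this index themselves, a minimal sketch, applied to the same detrended anomaly series used above:

```python
import numpy as np

def range_iqr_index(x):
    """Total data range divided by the interquartile range.

    The observations score about 5, i.e. the data spread over roughly
    five times the IQR; most of the models score nearer 3, meaning
    their central box is fat relative to their full range.
    """
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    return (x.max() - x.min()) / (q3 - q1)
```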

Some of the models are so different from the data that one wonders why these are considered “state of the art” models. The CM models, both 2.0 and 2.1, give hindcast results that go both way hotter and way colder than the observational data. And all of the models but two (GISS-ER and UKMO) give hindcast results that go colder than the data.

How the Models Stand – Summary

  • UKMO is the most lifelike. It matched all three quartile confidence intervals, as well as having a similar outlier distribution.
  • M_medres did a distant second best. It only matched the lower quartile confidence interval, with both the median and upper quartile being too hot.
  • The rest of the models were all well out of the running, showing distributions which are strikingly different from the observational data.
  • Least lifelike I’d judge to be the two CM models, CM2.0 and 2.1, and the PCM model.

Since we’ve seen the boxplot, let’s take a look at the temperature data for two lifelike and two un-lifelike models, compared with the observational data. As mentioned above, there is no trend because the data is detrended so we can measure how it is distributed.

Figure 2. Surface Temperature Observational Data and Model Hindcasts. Two of the best on the left, two of the worst on the right. See References for list of models and data used.

Note how for long periods (1979-82, 1990-97) the actual tropical surface temperature hardly varied from zero. This is why the box in the boxplot is so short, with half the data within +/- 0.1°C of the average.

The most lifelike of the models (UKMO and M_medres), while not quite reproducing this behaviour, came close. Their hindcasts at least look possible. The CM2.1 and PCM models, on the other hand, are wildly unstable. They hindcast extreme temperatures, and spend hardly any of their time in the +/- 0.1° C range of the actual temperature.

How The Models Move

How the models move means the month-to-month changes in temperature. The tropical ocean has a huge thermal mass, and it doesn’t change temperature very fast. Here are the boxplots of the movement of the temperature from month to month:

Figure 3. Surface Temperature Month-to-Month Changes (∆ST/month), showing Data and Model Hindcasts. Boxes show the range from the first quartile to the third quartile (IQR, or inter-quartile range). Notches show 95% confidence interval for the median. Circles show “outliers”, points which are further from the quartile than the size of the IQR (length of the whiskers). See References for list of models and data used.

Here in the month-to-month temperature changes, we see the same unusual pattern we saw in Fig. 1 for the temperatures. The observations have a short interquartile box compared to the range of the data, and a number of outliers. In this case there are about equal numbers of outliers above and below the box. The interquartile range (IQR, the box height) of tropical temperature change is about +/- 0.015°C per month, indicating that half of the time the temperature changes that little or less. The total range of the temperature change is about +/- 0.07°C per month. It is worth noting that in the 21-year record, the tropical surface never warmed or cooled faster than 0.07°C per month, so the models predicting faster warming or cooling than that must be viewed with great suspicion.
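That last point suggests a simple screening rule, sketched below: flag any model whose fastest hindcast monthly warming or cooling exceeds the fastest rate seen anywhere in the observations. (The argument names are placeholders for the monthly-change series computed earlier.)

```python
import numpy as np

def exceeds_observed_rate(model_d_st, obs_d_st, margin=1.0):
    """True if the model's largest monthly temperature change (either
    sign) is bigger than anything in the observed record.

    model_d_st, obs_d_st : arrays of monthly changes, deg C / month
    margin               : values > 1.0 would allow the model some slack
    """
    obs_max = np.abs(obs_d_st).max()      # about 0.07 deg C/month here
    return np.abs(model_d_st).max() > margin * obs_max
```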

Although all but two of the models (CM2.0 and CM2.1) matched all three confidence intervals, there are still significant differences in the distribution of the hindcasts and the observations. The most lifelike is M_medres, with UKMO second. GISS-ER (purple) is curious, in that the month-to-month movements are all very small, never more than +/- 0.03 degrees per month. It never hindcasts anything like the larger monthly changes that we see in the actual data.

Next, consider the speed at which the ocean heats and cools. In the real world, as shown by the data, the heating and cooling rates are about the same. This makes sense, as we would expect the tropical ocean to radiate heat away at something like the same rate it gains it. It has to lose at night the heat it gained during the day, by the next morning, for the temperature to stay the same over several days.

Now look at the data distribution for GISS-EH, CM2.0, or CM2.1. They rarely heat up fast, but they cool down very fast (short whisker on top, long whisker plus outliers on bottom). Slow heating and fast cooling: that doesn’t make physical sense. The maximum heating rate for GISS-EH (0.03°C/mo.) is less than half the maximum heating rate of the actual tropics. PCM has the same problem, but in the other direction, heating up much faster than it cools down.

How the Models Move – Summary

  • M_medres is the most lifelike. It matched all three quartile confidence intervals, as well as having a similar outlier distribution.
  • UKMO did a credible second best. However, the ranges of UKMO and M_medres were too small.
  • The rest of the models were well out of the running, showing distributions which are strikingly different from the observational data.
  • The least lifelike? I’d judge that to be the two CM models, CM2.0 and 2.1, and the CCSM3 model.

Let’s take a look at these winners and losers at reproducing the changes in temperature (∆ST). As mentioned above, the data is detrended so we can see how it is distributed.

Figure 4. Surface Temperature Observational Data and Model Hindcast Delta ST. Shows monthly changes in the temperature. Two of the best, two of the worst. See References for list of models and data used.

Once again, the boxplots correctly distinguish between lifelike and un-lifelike models. The large and extremely rapid temperature drops of the CM2.1 model are clearly unnatural. The GISS-ER model, on the other hand, hardly moves from month to month, and is unnatural in the other direction.

How the Models Turn

Acceleration is the rate of change of speed. In this context, speed is the rate at which the tropical temperatures warm and cool. Acceleration is how fast the warming or cooling rate changes. It measures how fast a rising temperature can turn to fall again, or how fast a falling temperature can turn into a rising temperature. Since acceleration is the rate of change (∆) of the change in temperature (∆ST), it is notated as ∆(∆ST). Here are the results.

Figure 5. Surface Temperature Month-to-Month Changes in ∆ST, showing Data and Model Hindcasts. Boxes show the range from the first quartile to the third quartile. Notches show 95% confidence interval for the median. Whiskers show the range of data out to the interquartile box height. Circles show outliers. See References for list of models and data used.

Using the 95% confidence interval of the median and the quartiles, we would reject CM2.1, CM2.0, CCSM3, GISS-ER, and UKMO. PCM and M_medres are the most lifelike of the models. UKMO and GISS-ER are the first results we have seen which have significantly smaller interquartile boxes than the observations.
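For readers who want to reproduce the “turning” statistics, here is a minimal sketch: the acceleration is just the second difference of the detrended, filtered temperature, and a sign change in ∆ST marks a turn from warming to cooling or back again.

```python
import numpy as np

def turning_stats(d_st):
    """Summarise how sharply and how often a series turns.

    d_st : array of monthly temperature changes (deg C / month).
    Returns the acceleration series d(dST) and the number of months
    in which the series switched between warming and cooling.
    """
    d_st = np.asarray(d_st, dtype=float)
    dd_st = np.diff(d_st)                                   # acceleration
    turns = int(np.sum(np.sign(d_st[:-1]) != np.sign(d_st[1:])))
    return dd_st, turns
```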

CONCLUSIONS

The overall conclusion from looking at how the models stand, move, and turn is that the models give results that are quite different from the observational data. None of the models were within all three 95% confidence intervals (median and two quartiles) of all of the data (surface temperatures ST, change in surface temps ∆ST, and acceleration in surface temps ∆∆ST). UKMO and M_medres were within 95% confidence intervals for two of the three datasets.

A number of the models show results which are way too large, entirely outside the historical range of the observational data. Others show results that are much less than the range of observational data. Most show results which have a very different distribution from the observations.

These differences are extremely important. As the Thorpe quote above says, before we can trust a model to give us future results, it first needs to be able to give hindcasts that resemble the “average and statistics of the weather states”. None of these models are skillful at that. UKMO does the best job, and M_medres comes in third best, with nobody in second place. The rest of the models are radically different from reality.

The claim of the modellers has been that, although their models are totally unable to predict the year-by-year temperature, they are able to predict the temperature trend over a number of years. And it is true that for their results to be believable, they don’t need to hindcast the actual temperatures ST, monthly temperature changes ∆ST, and monthly acceleration ∆∆ST.

However, they do need to hindcast believable temperatures, changes, and accelerations. Of these models, only UKMO, and to a much lesser extent M_medres, give results that by this very preliminary set of measures are at all lifelike. It is not believable that the tropics will cool as fast as hindcast by the CM2.1 model (Fig. 3). CM2.1 hindcasts the temperature cooling at three times the maximum observed rate. On the other hand, the GISS-ER model is not believable because it hindcasts the temperature changing at only half the range of changes shown by observation. Using these models in the IPCC projections is extremely bad scientific practice.

There is an ongoing project to collect satellite-based, spectrally resolved radiances as a common measure between models and data. Unfortunately, we will need a quarter century of records before we can even start analysing them, so that doesn’t help us now.

What we need now is an agreed upon set of specifications that constitute the mathematical definition of “lifelike”. Certainly, at a first level, the model results should resemble the data and the derivatives of the data. As a minimum standard for the models, the hindcast temperature itself should be similar in quartiles, median, and distribution of outliers to the observational data. Before we look at more sophisticated measures such as the derivatives of the temperature, or the autocorrelation, or the Hurst exponent, or the amplification, before anything else the models need to match the “average and statistics” of the actual temperature data itself.

By the standards I have adopted here (overlap of the 95% confidence notches of the medians, overlap of the 95% confidence boxes of the quartiles, similar outlier distribution), only the UKMO model passed two of the three tests. Now you can say the test is too strict, that we should go for the 90% confidence intervals and include more models. But as we all know, before all the numbers and the percentages when we first looked at Figure 1, the only model that looked lifelike was the UKMO model. That suggests to me that the 95% standard might be a good one.
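For anyone who wants to apply (or argue with) that standard, here is a sketch of the mechanical part of the test: estimate 95% intervals for the median and both quartiles (a bootstrap is used here as a stand-in, since I have not spelled out above exactly how the quartile intervals were computed) and require every model interval to overlap the corresponding observational interval. The outlier-distribution comparison is left to the eye.

```python
import numpy as np

def bootstrap_ci(x, pct, n_boot=2000, alpha=0.05, seed=0):
    """Bootstrap confidence interval for a given percentile of x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    boots = [np.percentile(rng.choice(x, size=len(x), replace=True), pct)
             for _ in range(n_boot)]
    return np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])

def overlaps(a, b):
    """True if two intervals (lo, hi) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

def passes_quartile_test(obs, model):
    """True if the model's 95% intervals for the first quartile, the
    median and the third quartile all overlap the observations' intervals."""
    return all(overlaps(bootstrap_ci(obs, p), bootstrap_ci(model, p))
               for p in (25, 50, 75))
```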

But I’m not insisting that this be the test. My main interest is that there be some test, some way of separating the wheat from the chaff. Some, indeed most of these models are clearly not ready for prime time — their output looks nothing like the real world. To make billion dollar decisions on an untested, unranked suite of un-lifelike models seems to me to be the height of foolishness.

OK, so you don’t like my tests. Then go ahead and propose your own, but we need some suite of tests to make sure that, whether or not the output of the models is accurate, it is at least lifelike … because remember, being lifelike is a necessary but not sufficient condition for the accurate forecasting of temperature trends.

My best to everyone,

w.

DATA

The data used in this analysis is available here as an Excel workbook.

REFERENCES

Santer, B. D., et al., 2005, “Amplification of Surface Temperature Trends and Variability in the Tropical Atmosphere”, Science, 2 September 2005

Thorpe, Alan J., 2005, “Climate Change Prediction — A challenging scientific problem”, Institute of Physics, 76 Portland Place, London W1B 1NT

MODELS USED IN THE STUDY

National Center for Atmospheric Research in Boulder (CCSM3, PCM)

Institute for Atmospheric Physics in China (FGOALS-g1.0)

Geophysical Fluid Dynamics Laboratory in Princeton (GFDL-CM2.0, GFDL-CM2.1)

Goddard Institute for Space Studies in New York (GISS-AOM, GISS-EH, GISS-ER)

Center for Climate System Research, National Institute for Environmental Studies, and Frontier Research Center for Global Change in Japan (MIROC-CGCM2.3.2(medres), MIROC-CGCM2.3.2(hires))

Meteorological Research Institute in Japan (MRI-CGCM2.3.2).

Canadian Centre for Climate Modelling and Analysis (CCCma-CGCM3.1(T47))

Meteo-France/Centre National de Recherches Meteorologiques (CNRM-CM3)

Institute for Numerical Mathematics in Russia (INM-CM3.0)

Institute Pierre Simon Laplace in France (IPSL-CM4)

Hadley Centre for Climate Prediction and Research in the U.K. (UKMO-HadCM3 and UKMO-HadGEM1).

FORCINGS USED BY THE MODELS


136 Comments
Rob Z
December 2, 2010 10:24 pm

Steven Mosher replying to Willis’ comment:
W.E.“But if a model gives unphysical results in the tests I used above, if it shows monthly ocean temperatures changing by some huge never-before-seen amount in a single month … what does that say about the quality of their “fundamental physics”?”
S.M. “it says nothing on its face. Could be lots of issues.”
WRONG! It means the results from your model are crap! Only a modeler would somehow think it’s ok.
No one should be in favor of making life and death decisions using models giving “seemingly” accurate results even though the physical attributes to get the results are unrealistic. It may not be good science but it’s good policy! If a GCM model predicts an accurate temperature trend or sea level rise but calculated it using a negative CO2 concentration… or ANY OTHER physically impossible or never before observed situation, I would never use it. Why would anyone think it’s ok to trust that? Because if you’re wrong in predicting the weather, you just may get fired? If you’re wrong in predicting the climate? You just may get another grant.

anna v
December 2, 2010 10:27 pm

Willis,
your plots are anomaly comparisons. I agree that the climate modelers have chosen that as the playing field, but considering that energy goes according to T^4, not anomalies^4, the discrepancies are even worse as a measure of what is truly going on with the system.
Have a look at the disagreement with data when temperatures are plotted, not for SST, by Lucia.
Also, the numerical approximations used in the models for the solutions of coupled nonlinear equations introduce discrepancies with reality due to the higher order terms that are excluded, once the time step gets large enough. One would expect a butterfly plot of disagreement in time, backwards and forwards, centred on the time when the averages and the parameters were taken from the data to initialize the models.

anna v
December 3, 2010 1:16 am

steven mosher :
December 2, 2010 at 10:11 pm
I went to the link you provided. They are on the right track, but it is not only mathematicians that they need; they need solid theoretical physicists who can evaluate whether the mathematics is physically logical or not.

johanna
December 3, 2010 1:30 am

anna, thank you for your post, which explains why models of complex dynamic systems necessarily have a short shelf life, and why even tiny errors distort both predictions and backcasts (see my post about economic modelling).

December 3, 2010 2:17 am

Are climate modelers aware of chaos theory and sensitive dependence on initial conditions?
Now I have to go dig through my library for the book that describes an early discovery of chaotic systems and sensitivity to initial conditions, and if I recall correctly, the discoverer was running weather simulations (!).

Chris Wright
December 3, 2010 3:22 am

A very interesting post. However, I’m concerned that this is all based on hindcasting. It seems UKMO comes out as the best. But it might simply be because they put more effort into ‘adjusting’ their model to better match various aspects of historical data. To put it bluntly, I could write a computer program that performs far better than any of these models on historical data. In the extreme I would simply load the historical data into the program, thus ensuring a perfect match!
The only real test is how well the models predict future climate, though unfortunately we have to wait many years before the results are in. But it does seem that models, since around 1980, all predicted warming that has not occurred.
One question. Do climate modellers do repeated runs with slightly different initial conditions? If the predictions are significantly different with slightly different initial conditions, then they would clearly be worthless.
Chris

DirkH
December 3, 2010 4:05 am

The people who write the GCM software wouldn’t stand a week in an industrial software project. They found themselves a nice cosy place where they can indulge in a mess of their own making years on end, patting each other on the back in peer reviews about how nice it all works out, and as long as the public is scared enough, maybe some more billions will be thrown their way. As scientists, they would at best fill mediocre roles; as coders, they are entirely worthless. Their product is never validated, never reviewed, they have no deadlines nor requirements to meet, as long as they can churn out one of their worthless papers about an “experiment” and its results, where experiment means model run, and it’s all tautological – the model will just show what’s been built into it plus the effects of a few tiny bugs. The bugs will make sure “scientists” will be “baffled” and “surprised” – oops, France melts down into a puddle of red hot lava in 2053, it’s worse than we thought! And they didn’t even have methane clathrate meltdown! My, my! Quick, write a paper, get a Nobel.
Has anyone ever seen an errata sheet of an older GCM version? Or a list of bugs fixed with a newer release?

Ron Cram
December 3, 2010 5:00 am

steven mosher says:
December 2, 2010 at 10:11 pm (Edit)
here Willis.
http://sms.cam.ac.uk/media/871991;jsessionid=6D22F432FAB6DF564481D7B3332FB58D?format=flv&quality=high&fetch_type=stream
Mosh, based on the text alone, this appears to be a pretty broad-based rejection of current climate models. I liked this bit:
“Firstly, climate model biases are still substantial, and may well be systemically related to the use of deterministic bulk-formula closure – this is an area where a much better basic understanding is needed. Secondly, deterministically formulated climate models are incapable of predicting the uncertainty in their predictions; and yet this is a crucially important prognostic variable for societal applications. Stochastic-dynamic closures can in principle provide this. Finally, the need to maintain worldwide a pool of quasi-independent deterministic models purely in order to have an ad hoc multi-model estimate of uncertainty, does not make efficient use of the limited human and computer resources available worldwide for climate model developement.”
It seems pretty clear the author thinks all of the current models should be scrapped so people can focus on building something that works. I think it is a fool’s errand. The goal Cambridge is pursuing is higher resolution models, so the models are less deterministic. A good goal, if it were realistic. According to Hank Tennekes, the models will never match nature at higher resolutions. Nature is too chaotic. Orrin Pilkey agrees.

Ron Cram
December 3, 2010 5:05 am

Willis,
I appreciate your effort in writing this blog post. And I think you may be on to something. Your approach seems reasonable to me. I would love to see it repeated at a higher latitude band.
I also think Mosh had a good idea of looking at precipitation and sea ice.
And where is Judith Curry? I would be interested in seeing her comments on this post.

LazyTeenager
December 3, 2010 5:40 am

Tim says:
December 2, 2010 at 5:01 am
Would a simple way to describe GCM’s be: a technique that is easily manipulated to produce results that would achieve a desired
—————–
Tim needs to think about this more clearly.
If the results of GCMs are being manipulated then there would be no point in bench marking them for accuracy or skill would there?

LazyTeenager
December 3, 2010 5:44 am

This looks like a good article to me, Willis. The models need good tests for their skill. The more the better.

Ron Cram
December 3, 2010 6:23 am

LazyTeenager says:
December 3, 2010 at 5:40 am
Tim says:
December 2, 2010 at 5:01 am
Would a simple way to describe GCM’s be: a technique that is easily manipulated to produce results that would achieve a desired
—————–
Tim needs to think about this more clearly.
If the results of GCMs are being manipulated then there would be no point in bench marking them for accuracy or skill would there?
———-
Actually, Lazy Teenager, if the models are being manipulated (or constrained), testing their skill or validity is important because it will show they are not reliable. Science is supposed to be self-correcting. If a claim is not tested or replicated, it is not science. Benchmarking does not mean you expect future models to be any better. You may just be disproving the current model.

DirkH
December 3, 2010 7:14 am

LazyTeenager says:
December 3, 2010 at 5:40 am
“If the results of GCMs are being manipulated then there would be no point in bench marking them for accuracy or skill would there?”
The manipulation happens on the input end through parametrization, other than that, you are very right – there is absolutely no need to benchmark them. Deleting them all would be the only sane action.
And firing all the parasites in the climate ivory towers.

Charlie A
December 3, 2010 10:08 am

anna v says: December 2, 2010 at 10:27 pm:

… the climate modelers have chosen that as the playing field, but considering that energy goes according to T^4, not anomalies^4, the discrepancies are even worse as a measure of what is truly going on with the system.

I’ve always felt that a more reasonable one-number metric for “Global Average Temp” would be the 4th root of the area-averaged T^4.
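A minimal sketch of the metric Charlie A describes, assuming hypothetical arrays of absolute grid-cell temperatures (in kelvin) and their areas:

```python
import numpy as np

def t4_mean_temperature(temps_k, areas):
    """Fourth root of the area-weighted mean of T^4.

    temps_k : grid-cell temperatures in kelvin (hypothetical array)
    areas   : matching grid-cell areas (any consistent units)
    This weights warm cells the way outgoing radiation does, rather
    than averaging temperatures linearly.
    """
    temps_k = np.asarray(temps_k, dtype=float)
    weights = np.asarray(areas, dtype=float) / np.sum(areas)
    return (np.sum(weights * temps_k ** 4)) ** 0.25
```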

December 3, 2010 11:47 am

I apologize ahead of time for my ignorance, but the way I see it, if the model cannot “hindcast” (post-dict) the very data it was built with, something is wrong:
“My model predicts that the Colts will win the 2010 Super Bowl,” is laughable, because they didn’t.
Secondly, just because a model fits the data it was built with is no guarantee that the model is any good. The WHOLE POINT of the GCM’s is to predict the future; that is their ONLY utility and the ONLY real test of their “validity”. So if the model CANNOT predict the future with any skill, then it is a DUD model.
Oh ho, you say. The models predict the future 100 years hence and so cannot be validated according to the strict rules above, at least not for 100 years. Catch-22 and all that.
Sorry, but that sneaky little clause in the science contract means that the GCM CANNOT BE VALIDATED. It’s stupid to talk of validation when the thing is impossible to validate according to the Catch-22 limitations conveniently proffered by the model builders.
But logic never had a role in GCM building. It’s all a plot by nefarious irrationalists to drain the Treasuries of the world. And I can validate and verify that statement, in case anyone needs the hard, cold proof.

December 3, 2010 12:33 pm

Ron Cram
I think Willis and others would be interested in the convective cloud examples.
There is a much more powerful and interesting skeptical position WITHIN the science of AGW than outside it. The cranks who deny the radiative effects of CO2, the sunspot chasers, and the “it’s natural variation” shoulder-shruggers are missing the best skeptical argument.

December 3, 2010 12:39 pm

Rob Z.
Do you fly in airplanes?

Ben of Houston
December 3, 2010 1:12 pm

What I find most interesting is all the Brits who read this blog and rail against the Met Office. Think about this point – the UKMO is the only model they are confident enough in to put out quarterly predictions, and it is legendarily wrong. Yet this post shows that it is one of the best at showing lifelike behavior.
Think what would happen if they made these predictions with one of the worse models?
______
Tim says:
December 2, 2010 at 5:01 am
Would a simple way to describe GCM’s be: a technique that is easily manipulated to produce results that would achieve a desired
______
You just described mathematical modelling as a whole. As a Chemical Engineer, I had an entire class that boiled down to “your model is wrong and will always be wrong, here’s how you make it useful”. We call it control systems. It involves constantly taking real information and putting it into your model to correct it. Without the feedback, your model wouldn’t be able to control a coffee maker.

BlueIce2HotSea
December 3, 2010 1:30 pm

Mike D. December 3, 2010 at 11:47 am
“It’s all a plot by nefarious irrationalists to drain the Treasuries of the world.”
And sometimes serious scientists just come to believe in their models. Anthony Watts has noted that four years ago, scientists at NASA’s Marshall Space Flight Center announced their computer model had a 94% correlation coefficient with hindcasting. They confidently predicted that Solar Cycle 24 would have the greatest sunspot maximum in 400 years! Now the same scientists are talking about a possible Dalton Minimum: the lowest sunspot maximum in two centuries!
The irony is that by predicting both extremes and everything in-between, the averaged predictions could turn out to be true. But I doubt they will. And the lesson is: we don’t yet have a model that reliably predicts good model performance.

December 3, 2010 2:11 pm

Mike D.
You don’t understand how model validation works.
Let’s take a simple model, F=MA. Yes, all physical laws are MODELS.
This model predicts that if we know the mass and the acceleration we can predict the force. Of course, we have to specify some things, like how accurately we can make the prediction, and we have to know how accurately we can measure the variables. Of course, in the lab F is never exactly equal to M*A. We call this residual “error”; that is, for all practical purposes we accept F=MA as giving good answers.
( there are cases where it may not give the best answer)
So when we set the validation criteria for a model we do not set perfection as a goal.
Let’s take dropping bombs. I have a model for how a bomb drops. A simple gravity model. Now I could include all sorts of effects: Coriolis, a complex drag model, an atmospheric winds model. In the end, I’ll choose a simple model because:
A. my answer doesn’t have to be good to the last millimeter. It’s a bomb.
B. I have to drop the sucker in a split second, so I want a fast model.
So, for sea level increase I may decide:
A. it has to get the sign correct per region
B. a positive bias is preferred over a negative bias.
C. it has to be good to +- some number of mm per year.
D. it has to be globally correct
Valid doesn’t mean “true” or perfect. It means good enough to do the job for which it was intended.
here is another way to think about it.
For example. We may decide that for planning we want to know if a house will be safe in a 100 year flood. Well, we have empirical stats on 100 year floods. Those stats are always wrong. A 100 year model will predict that a flood is say 25% probable in the next 30 years. In reality the 100 year flood will happen or not. so the prediction that it is 25% probable is wrong one way or the other. But we use them.
Taking 1961 to 1990 as a base period, do you think the future will be warmer or colder?
Why? I think it will be warmer. Our best science, limited as it is, says that warming of 2C over the next hundred years is more likely than not. So, I would plan for it being warmer. If you have a model that says it’s likely to be colder (a math model, not words) then let’s put it to the test. What would that model predict from 1850 to today?

Frank
December 3, 2010 2:20 pm

Very interesting/devastating post. However, I’d like to propose another criterion for climate model utility. We need climate models that can accurately predict catastrophic climate change due to changing forcings (be they anthropogenic or natural). In theory, we have an excellent idea of how forcings were different during the early to mid Holocene: the biggest known change is that the earth was closest to the sun during summer in the Northern Hemisphere (unlike the present). The ice caps and therefore sea level were about the same. Ice cores give us a good idea of GHGs and aerosols. We know that summer warmth allowed forests to reach the shores of the Arctic Ocean. Since that time, the Earth has experienced the kind of catastrophic climate change that truly useful climate models should be able to predict: the development of the Sahara, the largest desert on the planet. Since no one has reported that their model is capable of showing monsoon rains penetrating north to the Mediterranean during this period, climate models flunk this relatively unchallenging test.
I recommend some of Stainforth’s papers on how much the parameters in climate models can be varied and still produce models that work about as well as the IPCC’s “ensemble of opportunity” (better described as an “ensemble shaped by convergent evolution under pressure from natural/political selection”). http://www.cabrillo.edu/~ncrane/nature%20climate.pdf Even Stainforth hasn’t varied the parameters that control thermal diffusion in the oceans, parameters that Lindzen claims are grossly inconsistent with laboratory measurements. Until the Argo network, we had really poor information about heat transfer in the oceans and no way to judge whether our models are correct. The IPCC’s climate models aren’t going to be found to be consistent with the last five years of data from the Argo network.

December 3, 2010 4:37 pm

Valid doesn’t mean true? It does in my dictionary.
I guess in post-modern science “valid” means it looks like a duck. I am not that familiar with post-modern science, so I cannot dispute that contention. But if your model says the seas are going to boil away into outer space shortly after they become as sour as battery acid, then I would say your model doesn’t even look like a duck.
Seriously, if the GCMs are wrong in EVERY prediction, as they are, then they lack validity, and nobody but a duck should believe them. The “best” science in this arena may be a pig in a poke, and wrong as wrong could be, and I think it is. I don’t think the globe is going to warm 2 deg C in the next 100 years. My model says the opposite — that the globe is going to COOL 2 deg C. You may not like my model, but it looks like a duck from where I sit, and if that’s the only criterion for validity, then my model is eminently valid according to the new definition of that word.
