Testing … testing … is this model powered up?

Guest Post by Willis Eschenbach

Over at Judith Curry’s excellent blog she has a post on how to test the climate models. In response I wrote a bit about some model testing I did four years ago, and I thought I should expand it into a full post for WUWT. We are being asked to bet billions of dollars on computer model forecasts of future climate catastrophe. These global climate models, known as GCMs, forecast that the globe will warm extensively over the next century. In this context, it is prudent to take a look at how well the models have done at “hindcasting” historical temperatures, when presented with actual data from historical records.

I analysed the hindcasts of the models that were used in Amplification of Surface Temperature Trends and Variability in the Tropical Atmosphere, (PDF, 3.1Mb) by B. D. Santer et al. (including Gavin Schmidt), Science, 2005 [hereinafter Santer05].

In that study, results were presented for the first time showing two sets of observational data plus 9 separate GCM temperature “hindcasts” for the temperatures at the surface, troposphere, and stratosphere of the tropical region (20°N to 20°S) from 1979 to 2000. These models were given the actual 1979-2000 data for a variety of forcings (e.g., volcanic eruptions, ozone levels, see below for a complete list). When fed with all of these forcings for 1979-2000, the GCMs produced their calculated temperature hindcasts. I have used the same observational data and the same model results used by Santer. Here’s what their results look like:

Results from Santer05 Analysis. Red and orange (overlapping) are observational data (NOAA and HadCRUT2v). Data digitized from Santer05. See below for data availability.

The first question that people generally ask about GCM results like this is “what temperature trend did the models predict?”. This, however, is the wrong initial question.

The proper question is “are the model results life-like?” By lifelike, I mean do the models generally act like the real world that they are supposedly modeling? Are their results similar to the observations? Do they move and turn in natural patterns? In other words, does it walk like a duck and quack like a duck?

To answer this question, we can look at how the models stand, how they move, and how they turn. By how the models stand, I mean the actual month by month temperatures that the GCMs hindcast. How the models move, on the other hand, means the monthly changes in those same hindcast temperatures. This is the month-to-month movement of the temperature.

And how the models turn means the monthly variation in how much the temperatures are changing, in other words how fast they can turn from warming to cooling, or cooling to warming.

In mathematical terms, these are the hindcast surface temperature (ST), the monthly change in temperature [written as ∆ST/month, where the “∆” is the Greek letter delta, meaning “change in”], and the monthly change in ∆ST [∆(∆ST)/month]. These are all calculated from the detrended temperatures, in order to remove the variations caused by the trend. In the same manner as presented in the Santer paper, these are all reduced anomalies (anomalies less average monthly anomalies) which have been low-pass filtered to average slight monthly variations.
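For readers who want to try the detrending step themselves, here is a minimal Python sketch. It assumes a plain least-squares straight line is what gets removed; the post does not spell out the exact procedure, and the function name `detrend` and the sample numbers are made up for illustration. The reduced-anomaly and low-pass filtering steps are left out.

```python
from statistics import mean

def detrend(series):
    """Subtract the least-squares straight line from an evenly spaced series."""
    n = len(series)
    t_bar = (n - 1) / 2                      # mean of the time index 0..n-1
    y_bar = mean(series)
    sxx = sum((t - t_bar) ** 2 for t in range(n))
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in enumerate(series))
    slope = sxy / sxx                        # fitted trend per time step
    return [y - (y_bar + slope * (t - t_bar)) for t, y in enumerate(series)]

# Hypothetical monthly anomalies (deg C): a 0.01 deg C/month rise plus a wiggle.
wiggle = [0.05, -0.02, 0.00, 0.03, -0.06, 0.00]
raw = [0.01 * i + w for i, w in enumerate(wiggle)]
residual = detrend(raw)                      # zero mean, zero linear trend

# A series that is nothing but trend should detrend to (numerically) zero.
pure_line = detrend([0.5 + 0.01 * i for i in range(12)])
```

What is left in `residual` is the part of the series that the distribution comparisons look at: the ups and downs around the fitted line rather than the line itself.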

How The Models Stand

How the models stand means the actual temperatures they hindcast. The best way to see this is a “boxplot”. The interquartile “box” of the boxplot represents the central half of the data (first to third quartiles). In other words, half the time the surface temperature is somewhere in the range delineated by the “box”. The “whiskers” at the top and bottom show the range of the rest of the data out to a maximum of 1.0 times the box height. “Outliers”, data points which are outside of the range of the whiskers, are shown as circles above or below the whiskers. Here are the observational data (orange and red for NOAA and HadCRUT2v surface temperatures), and the model results, for the hindcast temperatures. A list of the models and the abbreviations used is appended.

Figure 1. Santer Surface Temperature Observational Data and Model Hindcasts. Colored boxes show the range from the first (lower) quartile to the third (upper) quartile. NOAA and HadCRUT (red and orange) are observational data, the rest are model hindcasts. Notches show 95% confidence interval for the median. “Whiskers” (dotted lines going up and down from colored boxes) show the range of data out to the size of the Inter Quartile Range (IQR, shown by box height). Circles show “outliers”, points which are further from the quartile than the size of the IQR (length of the whiskers). Gray rectangles at top and bottom of colored boxes show 95% confidence intervals for quartiles. Hatched horizontal strips show 95% confidence intervals for quartiles and median of HadCRUT observational data. See References for list of models and data used.
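The numbers behind a boxplot like this can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used to draw Figure 1: the quartile rule is Python's "inclusive" method (plotting packages differ slightly in how they interpolate quartiles), the whisker length is the 1.0 × IQR described above, and the name `box_stats` and the sample data are hypothetical.

```python
from statistics import quantiles

def box_stats(data, whisker=1.0):
    """Quartiles, whisker ends and outliers, with whiskers capped at whisker * IQR."""
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence = q1 - whisker * iqr
    hi_fence = q3 + whisker * iqr
    inside = [x for x in data if lo_fence <= x <= hi_fence]
    outliers = sorted(x for x in data if x < lo_fence or x > hi_fence)
    return {
        "q1": q1, "median": med, "q3": q3, "iqr": iqr,
        "whisker_low": min(inside),    # lowest point still within reach of the whisker
        "whisker_high": max(inside),   # highest point still within reach of the whisker
        "outliers": outliers,          # drawn as circles
    }

# Hypothetical anomalies (deg C): tightly bunched, with one hot excursion.
sample = [-0.15, -0.1, -0.05, 0.0, 0.0, 0.05, 0.1, 0.1, 0.6]
stats = box_stats(sample)
```

With this sample the 0.6 value falls beyond the upper whisker and would be plotted as a circle, which is the kind of hot-side outlier the observational boxes show.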

Fig. 1 shows what is called a “notched” boxplot. The heavy dark horizontal lines show the median of each dataset. The notches on each side of each median show a 95% confidence interval for that median. If the notches of two datasets do not overlap vertically, we can say with 95% confidence that the two medians are significantly different. The same is true of the gray rectangles at the top and bottom of each colored box. These are 95% confidence intervals on the quartiles. If these do not overlap, once again we have 95% confidence that the quartile is significantly different. The three confidence ranges of the HadCRUT data are shown as hatched bands behind the boxplots, so we can compare models to the 95% confidence level of the data.
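The notch comparison can be sketched the same way. A common approximation for the notch (from McGill, Tukey and Larsen) is median ± 1.57 × IQR / √n; R's boxplot uses 1.58 in place of 1.57. Whether the figures here use exactly that formula is an assumption on my part, and the function names and sample series below are invented for illustration. The confidence intervals for the quartiles (the gray rectangles) are not covered by this sketch.

```python
from math import sqrt
from statistics import quantiles

def median_notch(data, factor=1.57):
    """Approximate 95% interval for the median: median +/- factor * IQR / sqrt(n)."""
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    half_width = factor * (q3 - q1) / sqrt(len(data))
    return med - half_width, med + half_width

def intervals_overlap(a, b):
    """True if the closed intervals a = (lo, hi) and b = (lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical detrended anomalies (deg C).
obs = [-0.04, -0.03, -0.02, -0.01, 0.0, 0.01, 0.02, 0.03, 0.04]
warm = [x + 0.10 for x in obs]   # same shape, shifted well to the hot side
near = [x + 0.01 for x in obs]   # same shape, shifted only slightly

warm_differs = not intervals_overlap(median_notch(obs), median_notch(warm))
near_differs = not intervals_overlap(median_notch(obs), median_notch(near))
```

Here the notches of `obs` and `warm` do not overlap, so their medians would be called significantly different, while `obs` and `near` overlap and would not.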

Now before we get to the numbers and confidence levels, which of these model hindcasts look “lifelike” and which don’t? It’s like one of those tests we used to hate to take in high school, “which of the boxplots on the right belong to the group on the left?”

I’d say the UKMO model is really the only “lifelike” one. The real world observational data (NOAA and HadCRUT) has a peculiar and distinctive shape. The colored boxes showing the interquartile range of the data are short. There are numerous widely spread outliers at the top, and a few outliers bunched up close to the bottom. This shows that the tropical ocean often gets anomalously hot, but it rarely gets anomalously cold. UKMO reproduces all of these aspects of the observations pretty well. M_medres is a distant second, and none of the others are even close. CCSM3, GISS-EH, and PCM often plunge low, way lower than anything in the observations. CM2.1 is all over the place, with no outliers. CM2.0 is only slightly better, with an oversize range and no cold outliers. GISS-ER has a high median, and only a couple outliers on the cold side.

Let me digress for a moment here and talk about one of the underlying assumptions of the climate modellers. In a widely-quoted paper explaining why climate models work, the author states (emphasis mine):

On both empirical and theoretical grounds it is thought that skilful weather forecasts are possible perhaps up to about 14 days ahead. At first sight the prospect for climate prediction, which aims to predict the average weather over timescales of hundreds of years into the future, if not more does not look good!

However the key is that climate predictions only require the average and statistics of the weather states to be described correctly and not their particular sequencing. It turns out that the way the average weather can be constrained on regional-to-global scales is to use a climate model that has at its core a weather forecast model. This is because climate is constrained by factors such as the incoming solar radiation, the atmospheric composition and the reflective and other properties of the atmosphere and the underlying surface. Some of these factors are external whilst others are determined by the climate itself and also by human activities. But the overall radiative budget is a powerful constraint on the climate possibilities. So whilst a climate forecast model could fail to describe the detailed sequence of the weather in any place, its climate is constrained by these factors if they are accurately represented in the model.

Well, that all sounds good, and if it worked, it would be good. But the huge differences between the model hindcasts and actual observations clearly demonstrate that in all except perhaps one of these models the average and statistics are not described correctly …

But I digress … the first thing I note about Fig. 1 is that the actual tropical temperatures (NOAA and HadCRUT) stay within a very narrow range, as shown by the height of the coloured interquartile boxes (red and orange).

Remember that the boxplot means that half of the time, the actual tropical surface temperature stayed in the box, which for the observations shows a +/- 0.1° temperature range. Much of the time the tropical temperature is quite stable. The models, on the other hand, generally show a very different pattern. They reflect much more unstable systems, with the temperatures moving in a much wider range.

The second thing I note is that the model errors tend to be on the hot side rather than the cold side. The PCM, GISS-EH, and CCSM3 models, for example, all agree with the observations at the first (cooler) quartile. But they are too hot at the median and the third (upper) quartile. This is evidence that upwards temperature feedbacks are being overestimated in the models, so that when the models heat up, they heat too far, and they don’t cool down either as fast or as far as the real tropics do. Again, of the nine models, only the UKMO model reproduces the observed behaviour. All the rest show a pattern that is too hot.

Third, I note that all of the model interquartile boxes (except UKMO) are taller than the actual data, regardless of the range of each model’s hindcast. Even models with smaller ranges have taller boxes. This suggests again that the models have either too much positive feedback, or too little negative feedback. Negative feedback tends to keep data bunched up around the median (short box compared to range, like the observational data), positive feedback pushes it away from the median (tall box, e.g. PCM, with range similar to data, much taller box).

Mathematically, this can be expressed as an index of total data range/IQR (Inter Quartile Range). For the two actual temperature datasets, this index is about 5 and 5.3, meaning the data is spread over a range about five times the IQR. All the models have indices in the range of 2.7-3.6 except UKMO, which has an index of 4.7.
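The index is simple enough to compute by hand, but a short Python sketch makes the definition unambiguous. The quartile rule (Python's "inclusive" method), the function name `range_to_iqr` and both sample series are assumptions for illustration; they are not the Santer05 data.

```python
from statistics import quantiles

def range_to_iqr(data):
    """Total data range divided by the interquartile range."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return (max(data) - min(data)) / (q3 - q1)

# Hypothetical series: one bunched around the median with a single hot excursion,
# one spread evenly across its whole range.
bunched = [-0.15, -0.1, -0.05, 0.0, 0.0, 0.05, 0.1, 0.1, 0.6]
spread = [-0.4, -0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 0.4]

bunched_index = range_to_iqr(bunched)   # about 5: short box, long reach
spread_index = range_to_iqr(spread)     # about 2: tall box relative to range
```

A high index means a short box relative to the full range, which is the pattern in the observations; a low index means the data spend much of their time far from the median.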

Some of the models are so different from the data that one wonders why these are considered “state of the art” models. The CM models, both 2.0 and 2.1, give hindcast results that go both way hotter and way colder than the observational data. And all of the models but two (GISS-ER and UKMO) give hindcast results that go colder than the data.

How the Models Stand – Summary

  • UKMO is the most lifelike. It matched all three quartile confidence intervals, as well as having a similar outlier distribution.
  • M_medres did a distant second best. It only matched the lower quartile confidence interval, with both the median and upper quartile being too hot.
  • The rest of the models were all well out of the running, showing distributions which are strikingly different from the observational data.
  • Least lifelike I’d judge to be the two CM models, CM2.0 and 2.1, and the PCM model.

Since we’ve seen the boxplot, let’s take a look at the temperature data for two lifelike and two un-lifelike models, compared with the observational data. As mentioned above, there is no trend because the data is detrended so we can measure how it is distributed.

Figure 2. Surface Temperature Observational Data and Model Hindcasts. Two of the best on the left, two of the worst on the right. See References for list of models and data used.

Note how for long periods (1979-82, 1990-97) the actual tropical surface temperature hardly varied from zero. This is why the box in the boxplot is so short, with half the data within +/- 0.1°C of the average.

The most lifelike of the models (UKMO and M_medres), while not quite reproducing this behaviour, came close. Their hindcasts at least look possible. The CM2.1 and PCM models, on the other hand, are wildly unstable. They hindcast extreme temperatures, and spend hardly any of their time in the +/- 0.1° C range of the actual temperature.
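One way to put a number on "spends hardly any of its time in the band" is to count the share of months inside ±0.1°C. The sketch below does that; the function name and both series are hypothetical and are only there to show the calculation.

```python
def fraction_within(series, band=0.1):
    """Share of values whose absolute size is at most `band` (deg C)."""
    return sum(1 for x in series if abs(x) <= band) / len(series)

# Hypothetical detrended anomalies (deg C).
calm = [0.02, -0.05, 0.08, 0.0, -0.03, 0.25, 0.04, -0.09]
wild = [0.3, -0.4, 0.05, 0.5, -0.2, 0.35, -0.6, 0.15]

calm_share = fraction_within(calm)   # 7 of 8 months inside the band
wild_share = fraction_within(wild)   # 1 of 8 months inside the band
```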

How The Models Move

How the models move means the month-to-month changes in temperature. The tropical ocean has a huge thermal mass, and it doesn’t change temperature very fast. Here are the boxplots of the movement of the temperature from month to month:

Figure 3. Surface Temperature Month-to-Month Changes (∆ST/month), showing Data and Model Hindcasts. Boxes show the range from the first quartile to the third quartile (IQR, or inter-quartile range). Notches show 95% confidence interval for the median. Circles show “outliers”, points which are further from the quartile than the size of the IQR (length of the whiskers). See References for list of models and data used.

Here in the month-to-month temperature changes, we see the same unusual pattern we saw in Fig. 1 of the temperatures. The observations have a short interquartile box compared to the range of the data, and a number of outliers. In this case there are about equal numbers of outliers above and below the box. The inter quartile range (IQR, the box height) of tropical temperature change is about +/- 0.015°C per month, indicating that half of the time the temperature changes that little or less. The total range of the temperature change is about +/- 0.07°C per month. It is worth noting that in the 21 year record, the tropical surface never warmed or cooled faster than 0.07°C per month, so the models predicting faster warming or cooling than that must be viewed with great suspicion.

Although all but two of the models (CM2.0 and CM2.1) matched all three confidence intervals, there are still significant differences in the distribution of the hindcasts and the observations. The most lifelike is M_medres, with UKMO second. GISS-ER (purple) is curious, in that the month-to-month movements are all very small, never more than +/- 0.03 degrees per month. It never hindcasts anything like the larger monthly changes that we see in the actual data.

Next, consider the speed at which the ocean heats and cools. In the real world, as shown by the data, the heating and cooling rates are about the same. This makes sense, as we would expect the tropical ocean to radiate heat at something like the same rate it gains it. Whatever heat it gains during the day it has to lose overnight, by the next morning, for the temperature to stay the same over several days.

Now look at the data distribution for GISS-EH, CM2.0 or CM2.1. They rarely heat up fast, but they cool down very fast (short whisker on top, long whisker plus outliers on bottom). Slow heating and fast cooling, that doesn’t make physical sense. The maximum heating rate for GISS-EH (0.03°C/mo.) is less than half the maximum heating rate of the actual tropics. PCM has the same problem, but in the other direction, heating up much faster than it cools down.
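The heating-versus-cooling comparison comes down to the largest positive and largest negative month-to-month change. A minimal Python sketch, with a made-up sawtooth series standing in for a model that warms slowly and cools abruptly (the function name is hypothetical too):

```python
def heating_cooling_extremes(series):
    """Return (fastest warming, fastest cooling) between consecutive values."""
    changes = [b - a for a, b in zip(series, series[1:])]
    return max(changes), min(changes)

# Hypothetical detrended temperatures (deg C): slow climb, one sharp drop.
sawtooth = [0.0, 0.02, 0.04, 0.06, 0.08, -0.02, 0.0, 0.02]

fastest_warming, fastest_cooling = heating_cooling_extremes(sawtooth)
asymmetry = abs(fastest_cooling) / fastest_warming   # about 5 for this series
```

For the observations this ratio would be close to 1, since the heating and cooling rates are about the same; a ratio far from 1 in either direction is the lopsided behaviour described above.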

How the Models Move – Summary

  • M_medres is the most lifelike. It matched all three quartile confidence intervals, as well as having a similar outlier distribution.
  • UKMO did a credible second best. However, the ranges of UKMO and M_medres were too small.
  • The rest of the models were well out of the running, showing distributions which are strikingly different from the observational data.
  • The least lifelike? I’d judge that to be the two CM models, CM2.0 and 2.1, and the CCSM3 model.

Let’s take a look at these winners and losers at reproducing the changes in temperature (∆ST). As mentioned above, the data is detrended so we can see how it is distributed.

Figure 4. Surface Temperature Observational Data and Model Hindcast Delta ST. Shows monthly changes in the temperature. Two of the best, two of the worst. See References for list of models and data used.

Once again, the boxplots correctly distinguish between lifelike and un-lifelike models. The large and extremely rapid temperature drops of the CM2.1 model are clearly unnatural. The GISS-ER model, on the other hand, hardly moves from month to month and is unnatural in the other direction.

How the Models Turn

Acceleration is the rate of change of speed. In this context, speed is the rate at which the tropical temperatures warm and cool. Acceleration is how fast the warming or cooling rate changes. It measures how fast a rising temperature can turn to fall again, or how fast a falling temperature can turn into a rising temperature. Since acceleration is the rate of change (∆) of the change in temperature (∆ST), it is notated as ∆(∆ST). Here are the results.
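In code, speed and acceleration are just first and second differences of the monthly series. The sketch below uses a made-up six-month run of detrended temperatures; the helper name `diff` is my own.

```python
def diff(series):
    """Change between each value and the one before it."""
    return [b - a for a, b in zip(series, series[1:])]

# Hypothetical detrended surface temperatures (deg C), one value per month.
st = [0.00, 0.02, 0.05, 0.06, 0.04, 0.01]

d_st = diff(st)      # ∆ST: month-to-month change, 5 values
dd_st = diff(d_st)   # ∆(∆ST): change in that change, 4 values

# ∆ST flips from positive to negative where warming turns to cooling;
# ∆(∆ST) at that point measures how sharply the series turned.
turn_index = next(i for i in range(1, len(d_st)) if d_st[i - 1] > 0 > d_st[i])
```

Each round of differencing shortens the series by one value, so a record of N months gives N-1 values of ∆ST and N-2 values of ∆(∆ST).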

Figure 5. Surface Temperature Month-to-Month Changes in ∆ST, showing Data and Model Hindcasts. Boxes show the range from the first quartile to the third quartile. Notches show 95% confidence interval for the median. Whiskers show the range of data out to the interquartile box height. Circles show outliers. See References for list of models and data used.

Using the 95% confidence interval of the median and the quartiles, we would reject CM2.1, CM2.0, CCSM3, GISS-ER, and UKMO. PCM and M_medres are the most lifelike of the models. UKMO and GISS-ER are the first results we have seen which have significantly smaller interquartile boxes than the observations.

CONCLUSIONS

The overall conclusion from looking at how the models stand, move, and turn is that the models give results that are quite different from the observational data. None of the models were within all three 95% confidence intervals (median and two quartiles) of all of the data (surface temperatures ST, change in surface temps ∆ST, and acceleration in surface temps ∆∆ST). UKMO and M_medres were within 95% confidence intervals for two of the three datasets.

A number of the models show results which are way too large, entirely outside the historical range of the observational data. Others show results that are much less than the range of observational data. Most show results which have a very different distribution from the observations.

These differences are extremely important. As the Thorpe quote above says, before we can trust a model to give us future results, it first needs to be able to give hindcasts that resemble the “average and statistics of the weather states”. None of these models are skillful at that. UKMO does the best job, and M_medres comes in third best, with nobody in second place. The rest of the models are radically different from the reality.

The claim of the modellers has been that, although their models are totally unable to predict the year-by-year temperature, they are able to predict the temperature trend over a number of years. And it is true that for their results to be believable, they don’t need to hindcast the actual temperatures ST, monthly temperature changes ∆ST, and monthly acceleration ∆∆ST.

However, they do need to hindcast believable temperatures, changes, and accelerations. Of these models, only UKMO, and to a much lesser extent M_medres, give results that by this very preliminary set of measures are at all lifelike. It is not believable that the tropics will cool as fast as hindcast by the CM2.1 model (Fig. 3). CM2.1 hindcasts the temperature cooling at three times the maximum observed rate. On the other hand, the GISS-ER model is not believable because it hindcasts the temperature changing at only half the range of changes shown by observation. Using these models in the IPCC projections is extremely bad scientific practice.

There is an ongoing project to collect satellite based spectrally resolved radiances as a common measure between models and data. Unfortunately, we will need a quarter century of records to even start analysing, so that doesn’t help us now.

What we need now is an agreed upon set of specifications that constitute the mathematical definition of “lifelike”. Certainly, at a first level, the model results should resemble the data and the derivatives of the data. As a minimum standard for the models, the hindcast temperature itself should be similar in quartiles, median, and distribution of outliers to the observational data. Before we look at more sophisticated measures such as the derivatives of the temperature, or the autocorrelation, or the Hurst exponent, or the amplification, before anything else the models need to match the “average and statistics” of the actual temperature data itself.

By the standards I have adopted here (overlap of the 95% confidence notches of the medians, overlap of the 95% confidence boxes of the quartiles, similar outlier distribution), only the UKMO model passed two of the three tests. Now you can say the test is too strict, that we should go for the 90% confidence intervals and include more models. But as we all know, before all the numbers and the percentages when we first looked at Figure 1, the only model that looked lifelike was the UKMO model. That suggests to me that the 95% standard might be a good one.

But I’m not insisting that this be the test. My main interest is that there be some test, some way of separating the wheat from the chaff. Some, indeed most of these models are clearly not ready for prime time — their output looks nothing like the real world. To make billion dollar decisions on an untested, unranked suite of un-lifelike models seems to me to be the height of foolishness.

OK, so you don’t like my tests. Then go ahead and propose your own, but we need some suite of tests to make sure that whether or not the output of the models is inaccurate, it is at least lifelike … because remember, being lifelike is a necessary but not sufficient condition for the accurate forecasting of temperature trends.

My best to everyone,

w.

DATA

The data used in this analysis is available here as an Excel workbook.

REFERENCES

B. D. Santer et al., 2005, September 2, “Amplification of Surface Temperature Trends and Variability in the Tropical Atmosphere”, Science Magazine

Thorpe, Alan J., 2005, “Climate Change Prediction — A challenging scientific problem”, Institute for Physics, 76 Portland Place London W1B 1NT

MODELS USED IN THE STUDY

National Center for Atmospheric Research in Boulder (CCSM3, PCM)

Institute for Atmospheric Physics in China (FGOALS-g1.0)

Geophysical Fluid Dynamics Laboratory in Princeton (GFDL-CM2.0, GFDL-CM2.1)

Goddard Institute for Space Studies in New York (GISS-AOM, GISS-EH, GISS-ER)

Center for Climate System Research, National Institute for Environmental Studies, and Frontier Research Center for Global Change in Japan (MIROC-CGCM2.3.2(medres), MIROC-CGCM2.3.2(hires))

Meteorological Research Institute in Japan (MRI-CGCM2.3.2).

Canadian Centre for Climate Modelling and Analysis (CCCma-CGCM3.1(T47))

Meteo-France/Centre National de Recherches Meteorologiques (CNRM-CM3)

Institute for Numerical Mathematics in Russia (INM-CM3.0)

Institute Pierre Simon Laplace in France (IPSL-CM4)

Hadley Centre for Climate Prediction and Research in the U.K. (UKMO-HadCM3 and UKMO-HadGEM1).

FORCINGS USED BY THE MODELS

136 thoughts on “Testing … testing … is this model powered up?”

  1. With the Kyoto Protocol running out at the end of 2012 and the frantic rush by some countries to provide a replacement, I would really like somebody such as Willis Eschenbach to provide a review of how the model predictions used to promote and facilitate the agreement of the Protocol have fared since 1998. A comparison of their end 2010 global temperature predictions based on 385ppm CO2 with the actual end 2010 global temperature. That in my opinion would be a worthwhile exercise. Furthermore, what global level of CO2 was the Kyoto Protocol supposed to achieve? Billions of £,$,EUR have been spent on implementation of this Protocol and to what purpose?

  2. The simple way to check models is: how well do they predict.
    The Met Office used to publicise their yearly global temperature prediction. I compared that with “my” prediction that the temperature each year would be the same as the last and, annoyingly, they were just a fractional bit better than pure chance (about 0.005C or something stupid).
    But even so they never had the honesty to admit their forecasts were useless.
    And the fact they will not admit when they are utterly useless and have the same arrogance about their ability to forecast in 100 years is why no one should ever believe such people.

  3. Always with the anomalies. When I want a weather forecast I don’t want to be told, “1.5 degrees warmer than this time twenty years ago, at least on average” I want degrees Celsius. And there are no anomalies in the real world. What do these models look like with actual temperatures rather than anomalies? True, if you want to know whether the world is warming or cooling or neither, anomalies are the numbers that serve you best but a model models the whole thing and should be compared to the whole thing.

  4. I do not think models are as good as presented. Weather forecasts over 5 days are invariably wrong in my experience. Mine come from the UKMO. So a forecast of 15 days being said to be correct is not true. The current cold spell in the UK, which has reached record lows, was not forecast by the UKMO who stated that this winter would be warmer than the norm. We have yet to reach winter, which starts on 21 Dec., and we have record low temperatures in autumn. So in spite of your claim above, the UKMO is poor at a forecast of over 5 days and certainly not for a couple of months.
    I do not hold out much hope for the 100 year forecast.

  5. Modeling the tropics basically means being able to forecast ENSO and no one can do this yet (well, J. Bastardi can do half a year forward).
    I compared the Arctic temperature model runs with actual record backwards to 1900.
    http://i43.tinypic.com/14ihncp.jpg
    Alas, the models, except for mimicking the Keeling curve, do not have much in common with real life.

  6. Is the UK MO model the same one that keeps predicting mild winters in the UK seasonal forecasts? It may be the best in your analysis but I bet a few snow bound Brits might have a different take on how well it predicts future weather.

  7. The UKMO model has a warming bias of 0.05C per annum or 5C per century so any predictions from this model should be adjusted accordingly.

  8. Willis,
    Very interesting stuff. Well done you finding a way to condense a very difficult assessment down to a couple of graphs that are easy to understand and assess by eyeball.
    One concern is that you’ve only used tropical temperature data, not global temperature data. My worry here is that the differing spreads of the models are not necessarily due to incorrect feedbacks; they might be due to differences in spatial heat transport between the models.
    Suppose that a model has relatively fast transfer of heat away from the tropics; as temperature builds up, the heat ‘escapes’ to other, cooler, neighbouring regions and the temperature is relatively stable. The same applies as a region cools; heat transfers in from neighbouring regions, limiting the temperature dip. But a model with relatively slow heat transfer will tend to over-emphasise local peaks and troughs in temperature, because it takes longer for heat to transfer in and out of neighbouring regions.
    This is of course a local negative feedback mechanism with a large impact on (eg) tropical temperatures, but its impact on global temperatures will be much smaller; even though the model with slow heat transfer might show big dips in tropical temperature, this is just because more heat is stored elsewhere and so the large-scale average is the same.
    I don’t know a lot about climate modelling, but I imagine the question of heat transfer between the tropics and sub-tropics is a fairly controversial and difficult one, as it is closely linked to tropical storm / hurricane / cyclone modelling, this being one of the main mechanisms for global heat transfer. It’d be interesting to see how the UKMO model hindcast the number of tropical storms; if it is much higher than reality, this would tell us quite a bit about why their temperatures are comparatively stable and also nullify their apparent advantage in forecasting.
    So I think your method is a very good idea for assessing the performance of climate models, but needs to be applied globally rather than to one region.

  9. Willis:
    Congratulations on a very clear exposition of the characteristics of GCMs – Tufte would be impressed.

  10. Willis, great analysis and presentation. Thanks.
    Would you be able to find the time to test a surface temperature reconstruction for me if I sent you some data?

  11. Would a simple way to describe GCM’s be: a technique that is easily manipulated to produce results that would achieve a desired outcome? The obvious question would then be: Who benefits from a pre-programmed outcome and did they have a hand in the research?
    Big Pharma know all about this, and they are very good at it.

  12. Apples and Oranges folks. The fact that UKMO is abysmally bad at predicting UK weather cannot be used as evidence that their model is bad. The model is for the tropics “(20°N to 20°S)” and last time I checked the UK wasn’t in that band, London being 51°N.
    So, clearly the UKMO model is superb for the tropics (otherwise they would not have spent all that money and made all those brave predictions) but, like the rest of the UK infrastructure, it fails (ever so slightly) in winter and at higher latitudes.
    Simples!

  13. Even an accurate model (if it exists) will fail your tests. These models are supposed to forecast temperature trend-> they should pass your first test.
    But the only way to pass your 2nd and 3rd test is to have a statistically accurate temperature derivative and second derivative… very complex (increased chaos every time you differentiate), not sure it can be done at all…and it should have very limited influence on the trend so I understand perfectly that most models don’t even try to do such a thing and focus on forecasting the trend.
    For example imagine we use a model to calculate the flow of a river in various weather conditions. We can get very good results which allow to forecast floods, know how the flood spreads, the flow trend…but the model will give statistically different detrended results from measured ones…because it does not calculate local things like vortices, changes in river side configuration… derivatives of the model should be smoother than derivatives of the measures.

  14. We are being asked to bet billions of dollars on computer model forecasts
    =====================================
    Willis, we’ve already spent billions
    Try Trillions…………

  15. From a layman’s point of view (mine – I’m a retired mechanical engineer so what do I know) – these tests look pretty valid to me.
    Obvious question – which models are our ‘learned friends’ at UEA using – or are they using all of them and averaging the results..? Either way, it looks as if the models could be producing far greater variance (and implying far greater warming) than reality.

  16. “The first question that people generally ask about GCM results like this is “what temperature trend did the models predict?”.”
    Actually, the first questions I ask as someone who has worked professionally in Computational Fluid Dynamics for over 20 years are “(1) what equations were solved?, (2) what boundary conditions were used for those equations?, and (3) what numerical methods were employed to solve the equations?”.
    Unfortunately, many of the code development groups cited above fail miserably in this respect.
    “The claim of the modellers has been that, although their models are totally unable to predict the year-by-year temperature, they are able to predict the temperature trend over a number of years.”
    There’s a lot wrong with this statement (it’s the old “climate is a boundary value problem” BS), but one of the ramifications, if true, is that it is useless to try to run fine mesh (aka “high resolution”) climate models and expect improved results. What they are saying is that the short-term “details” don’t matter, only the long term influences (e.g. radiation balances and various forcings). Accordingly, they should be able to get along with coarse-mesh models and simply “tune” them with hindcasts. Instead, we have modeling groups crying for more and more expensive parallel computing facilities so they can run “high resolution” models. I’d like to know what they expect to gain from such model runs (besides taxpayer funded government contracts).

  17. Fascinating, thank you Willis.
    A couple of questions if I may.
    * In the list of forcings, G well mixed GHGs, what sensitivity do the models use? Is each different, or are they all the same? 1.5 deg C per doubling, or 4.5, or more?
    * One would think, once these models have been tested, only the best would do. But does anybody else recall (in the gate emails I think) a directive to the chapter authors that ALL models must be given equal weight due to political considerations?
    If I’m correct in the above, good quality top notch science to base global decisions on ha?

  18. One other observation.
    The tropical climate is relatively stable; one would think it is therefore easier to predict than middle and high latitudes.
    How do these models go predicting those higher latitudes?

  19. Willis
    Thank you for the hard work and great analysis. It is amazing that these thoughts and effort appear here on WUWT while the academic world is still circling the wagons trying to defend their grants.

  20. Willis,
    Thank you for this and your other articles which demonstrate that cyberspace is replacing or, at least complementing, the print journals for scholarly monographs. I have some general questions about modelling which I have not seen addressed.
    First, have any of the models been successful at hindcasting the 30 years of slight cooling that preceded the 1978-1998 warming? It seems to me that within a trend, it’s facile to predict its continuation. As you know, this is called persistence forecasting and is little more than extrapolation, no matter how many forcing variables are considered, or am I wrong? Wouldn’t the real test of a model be to predict changes in trend? Second, aren’t the tropics less variable by month, season, and year than any other area and therefore the lowest hanging fruit for modellers? Third, looking at the three century temperature trend line from the Little Ice Age, hasn’t it been around 0.7 C per century, and don’t the AGW folk believe, with little evidence, that we’ve broken out of that trendline on the upside because of CO2 forcing? And don’t most of the skeptics believe, with little evidence, that we’re still within that 0.7 trendline requiring, as Judith Curry says, at least 30 more years to figure out who’s right?

  21. Re: Murray Grainger

    Apples and Oranges folks. The fact that UKMO is abysmally bad at predicting UK weather cannot be used as evidence that their model is bad.

    Yes it can. From Professor Slingo’s testimony to the House of Commons

    At least for the UK the codes that underpin our climate change projections are the same codes that we use to make our daily weather forecasts, so we test those codes twice a day for robustness.
    Q210 Graham Stringer: You do not always get it right though, do you?
    Professor Slingo: No, but that is not an error in the code; that is to do with the nature of the chaotic system that we are trying to forecast. Let us not confuse those. We test the code twice a day every day. We also share our code with the academic sector, so the model that we use for our climate prediction work and our weather forecasts, the unified model, is given out to academic institutions around the UK, and increasingly we licence it to several international met services: Australia, South Africa, South Korea and India. So these codes are being tested day in, day out, by a wide variety of users

    So the code is being used around the world and is supposedly the same for both the forecasting and for the climate model. Therefore if one is wrong then so is the other.
    From a BBC article we get the following quote from the Met Office

    The Met Office has now admitted to BBC News that its annual global mean forecast predicted temperatures higher than actual temperatures for nine years out of the last 10.
    This “warming bias” is very small – just 0.05C.

    So the code used for forecasting is same as for the models and is tested twice a day. The code for forecasting has a warming bias of 0.05C therefore, logically, the models have a warming bias of 0.05C per annum or 5C per century.

  22. Murray Grainger says: “Apples and Oranges folks. The fact that UKMO is abysmally bad at predicting UK weather cannot be used as evidence that their model is bad.”
    But Murray, the fact is that they are equally bad at predicting global temperature … sorry that’s an overstatement, they get the weather right some time!
    Nine out of nine years they predicted a warming of temperature and 8 out of those nine years the temperature was lower than the 50% confidence interval and something like 6 of those years it was lower than the 75% confidence interval.
    And year after year after year they would make a grand press release about their great climate predictions for the next year … without so much as a hint how useless they had been in all the previous years.
    It’s not so much that they got the forecast wrong, it is that they were so outspoken about their ability to forecast and even had the audacity to say they were “accurate” when it wasn’t significantly better than just saying: “next year will be the same as this”.
    They weren’t forecasting, they were just taking last year’s temperature and adding 0.06C each and every year, getting it high every year and then issuing press releases saying: “our forecasts are wonderful because they are accurate to 0.06C”.
    And we all know that sooner or later by pure statistical fluke, it would have been a warm year and then they’d have crowed to everyone how good their forecast was!
    They are nothing more than charlatans when it comes to global temperature forecasts.

  23. Richard M says:
    December 2, 2010 at 5:24 am

    Any reason this test was limited to 30 years? Why not start in 1880?

    Richard, Vincent Gray had something to say about this…

    “Validation of climate models based on the greenhouse effect requires that they be compared with a past temperature sequence assumed to be free of natural variability. The annual global surface temperature anomaly series before 1940 does not comply with this requirement as greenhouse gas emissions were unimportant over this period.”

  24. gnarf:
    The name of this family of models, GCM, suggests to me that your first assumption that the models are designed to predict temperatures is incorrect.

  25. Great post, Willis, but, as they say, you can take the boy out of the country but you can’t take the country out of the boy. Your ability to communicate abstract ideas is impressive and always reasonably clear to follow for a layman such as I.
    But a comment made to me in 1955 by an old agricultural worker came back to me as I read your post;
    “Gonna be a hard winter, Mate. The ducks are nesting real high right now”
    told me more about wisdom based on observations collected cumulatively over the lifetimes of generations than all the computer modeling that smart blokes in air-conditioned labs playing with expensive electronic devices can. I know that modeling has a place in science, but ignoring information built up over generations seems pretty silly to me. I guess if this accumulated wisdom was codified in a data base, that would make it more sorta scienc-y?

  26. Your table shows two UKMO models, one with only 4 “forcings” and the other with 9. Which one was in the graphs? As you know, with 9 free parameters one can fit up to a 9th-order polynomial perfectly.
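The curve-fitting worry can be made concrete with a small sketch. A polynomial with nine coefficients (that is, degree eight) passes exactly through any nine points with distinct x values, so a perfect in-sample fit on its own says little about skill. The code below uses Lagrange interpolation in plain Python; the nine "anomaly" values are invented for illustration and are not taken from any model or observational record, and nothing here is meant to describe how forcings actually enter a GCM.

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate at x the unique polynomial of degree len(xs) - 1
    that passes through every point (xs[i], ys[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


# Nine made-up data points (hypothetical values, for illustration only).
years = list(range(9))
anomalies = [0.10, -0.05, 0.22, 0.31, -0.12, 0.08, 0.40, 0.15, 0.27]

# In-sample: the nine-coefficient polynomial reproduces every point.
fitted = [lagrange_interpolate(years, anomalies, t) for t in years]
max_in_sample_error = max(abs(f - a) for f, a in zip(fitted, anomalies))

# Out-of-sample: evaluate the same polynomial one step past the data.
one_step_ahead = lagrange_interpolate(years, anomalies, 9)

print(f"largest in-sample error: {max_in_sample_error:.2e}")
print(f"value one step beyond the data: {one_step_ahead:.2f}")
```

The in-sample error is essentially zero, while the value one step beyond the data lands far outside the range of the fitted points. That is the general hazard of judging a heavily parameterised fit by its hindcast alone.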

  27. I build trading systems for a living. These systems attempt to “predict” future market prices and execute trades based on those predictions in order to make profits for their clients. The first question potential users of my systems ask is how they have performed in the past (in real and/or simulated trading). If the answer is poorly, you can bet they take their money and move on.
    Here we are asking all the world to bet vast sums of money and make untold sacrifices based on models that have predicted very poorly on a historical basis. I’m sorry, but the world should take its money and “move on.”

  28. Mods, small omission – I left out ‘it’ before ‘more’ in the last line – can you fix, PLEASE?
    Sorry, unused to deep thoughts after good luncheon!
    [OK…. done… bl57~mod]

  29. I used to do modeling of transistors and circuits back in the day.
    I always wonder how climate models work in conditions that we know occurred in the Earth’s history: ice ages, MWP, very high O2 and CO2, etc. Do they become unstable or do they properly show that the climate stabilized and returned to a norm? In other words do they contain all the correct variables and relationships?
    It was easy to construct circuit models that modeled performance when inputs were only slightly tweaked (what happens if we changed the 10K resistor to 11K) but much harder to model what happened if we made major changes (10K to 100K), which paradoxically was exactly what users wanted to do and did. I cannot count all the design failures we had when users pushed models beyond their operational areas.
    If we have only calibrated climate models over short times and relatively stable data, it seems improper to push the model well outside this range.

  30. Tropics temperatures are not actually stable, they vary by a large margin.
    But they are dominated by the ENSO – which has an impact of +/-0.6C. After one accounts for this and the lesser impact from the AMO, there is NO global warming signal left (well, 0.2C by 2100). So, you can’t model the Tropics without having a dominant ENSO module.
    Here is my latest reconstruction of UAH Tropics temperatures (and the forecast going out 14 months).
    http://img338.imageshack.us/img338/1889/uahtropicsmodeloct10.png

  31. In order to simulate anything you need a proven formula, an equation. However, does anybody know how climate works?
    However the model used by the UN’s FAO organization, practically applied to fish catches has been proved to be a successful tool for such an economic and real (not imaginary) activity:
    ftp://ftp.fao.org/docrep/fao/005/y2787e/
    See the document:
    Archive: y2787e08.pdf

  32. I’m sure glad we aren’t betting real money on these horserace prediction systems. Oh wait. We are. Not as much money as requested, but we are indeed betting real money, as a society.
    And as mentioned, averaging them all is obviously lunacy.
    Have these people no standards?? Hard work is not the same thing as accurate, useful work. Or are these just the prototype models, to make a pitch for the real funding for a real model?
    Excellent post on an excellent topic. Many thanks for taking these models out for a clear spin, with clear results

  33. Willis, I quote the compliment from the President in the movie Independence Day: “Not bad, not bad at all!”

  34. Thanks, great post Willis
    IMO, there appear to be a couple of very fundamental flaws to all the models.
    The first, the models aren’t modular (module = 1 or more Peer Reviewed aspects of the Model). This makes the Models inefficient, expensive, and difficult to evaluate.
    The second issue, the use of the term “Global Average” is a fallacy in a system as dynamic and regional as the Earth and its climate system.
    It seems like the best way to fix the models is to fix Climate Science.

  35. Interesting post, but if the models do not show what has actually happened, of what use are they? As a layman (Historian, retired), I want to know what the weather will be, hopefully, for the next 5 + days, and what the trend is forecast for the next 20 years.
    From studies I have read (yes I can and do read scientific stuff), I’d say the quality of the climate models is less than poor, perhaps at the same level as social science or economics models!

  36. Can’t see models working. When they have changed the temperatures in the past so many times, they don’t know what actually happened anymore, so their history is corrupted. How can they expect GI (garbage in) to create precise projections for the future?

  37. “OK, so you don’t like my tests. Then go ahead and propose your own, but we need some suite of tests….”
    No. We don’t. These models are toys. Expensive toys.
    We should not base policy on any of these models. Sorry, but climate is fundamentally unpredictable, mainly because we have no ability to measure or predict basic drivers of climate, such as the cosmic ray flux. Even if we can understand the processes causing solar flux variations, we still won’t be able to predict variations in galactic cosmic ray flux at the heliosphere. Why should we assume that is a constant? Silly humans.
    Therefore, adaptation is the answer, not mitigation. Mitigation assumes you know what is happening and can do something to improve the outcome by doing something ahead of time. Bureaucracies love controlling people and their money, so mitigation fits right into their modus operandi.
    Humans are adaptable, that is our strength. Adaptation is a test, and is best accomplished on a small scale. There is no guarantee of success. When governments force common behavior across large populations, adaptation tests succeed or fail for that whole group. When government forces common behavior in large populations, there are fewer opportunities to test different adaptation approaches. We may not be able to discover what works in time if government keeps taking more and more control.
    Population geneticists know that diversity is the strength of any population. That is because the optimal individual type for a given set of conditions may not be the optimal type when the conditions change. Change is not necessarily predictable. A small subpopulation can become the winner. Without that subpopulation, the species could go extinct. By extension, that means a diversity of businesses, regulations, and strategies are more likely to provide answers than top-down mandates. When government gets it wrong, we all suffer. Look at our economies now.
    The worst case scenario is one-world-government. Under these conditions, there is the least amount of social and technical diversity. Humans should beware becoming a monoculture. Our strength is adaptation, and fundamentally, it is an individual decision. As our freedom is more limited, and businesses become more regulated, our species becomes more endangered. Some people like that idea, but mass sui/homicide is a subject for another post.
    Climate models are more a tool of government policy, than good science. That is because they are funded because they produce results the regulators need to support their agendas. We can be sure the models and modelers are selected by funding based on political need, not necessarily scientific merit.

  38. Willis,
    Thank you for your time and effort examining these climate models. Your passion for truth is an inspiration.

  39. Willis Eschenbach has, at least, started the ball rolling for the eventual creation of a benchmark for testing GCMs. This could even be the benchmark, but I could be wrong. What comes out clearly is that GCMs are mostly weak or very weak, if not totally wrong, in their predictive powers.
    Considering this, can one produce a graph showing the billions of dollars or euros spent per 0.01C of error from and due to these models? I mean the cost of producing these supercomputers and software added to the cost that humanity has paid in trying to fight/mitigate a non-existent enemy.
    I would predict something in the region of $-€10 to 100 billion per 0.01C error.

  40. When considering ensemble model results it should always be remembered that
    (a) there is only one Earth
    (b) so at most only one of the models is right
    (c) and average wrong is wrong.
    Richard

  41. What would an engineer do … reset the models for a known, already measured period, and run the predictions for that. Say, predict the 2009 temperature, starting in 1970. That way we have a known start and end. How they fare in 2009 would be the calibration. Otherwise the models are useless, if not calibrated.

  42. A good summary paper but it fails to answer a key issue, the incorrect optical physics used in the models to predict cloud albedo from optical depth. Whilst the ‘two stream’ approximations originally from Sagan appear to give the right relationship, when used to predict change of albedo caused by aerosol pollution, the results go the wrong way for thicker clouds. It’s because the assumption of constant Mie asymmetry factor is wrong, also direct backscattering at the upper cloud boundary isn’t taken account of.
    So, ‘cloud albedo effect’ cooling, 175% of the raw median net present AGW in AR4 is imaginary. Without it you have to reduce the IPCC’s predictions of future CO2-AGW by at least a factor of three. Furthermore, because aerosol pollution probably reduces direct backscattering, ‘cloud albedo effect’ cooling becomes heating, another AGW.
    That’s a game changer because it’s self-limiting, possibly why ocean heat content has stopped rising, implying most AGW was the increase of low level tropical cloud albedo from Asian aerosol pollution, and it saturated in 2003.
    So, I believe the models are physically wrong in key areas and the fit to real air temperatures is illusory. Hence, until they’re fixed, they can’t predict the future.

  43. A common problem with analyses of this type is the assumption that the HadCRUT3 and NOAA records faithfully reflect actual temperature observations. They don’t. They are “corrected” records that often look nothing like the raw records they are derived from, and in some cases the “corrections” used to construct them are demonstrably invalid (viz. the 0.4-0.5C WWII stair-step cooling adjustment applied to HadSST2, which shifts HadCRUT3 artificially downwards by about 0.3C after 1946).
    It would be nice to see a comparison of model output against unadjusted temperature data.

  44. Thanks Willis it is nice to see someone talking about ‘validation’ in climatology. Validation seems to be avoided by climate ‘scientists’ to the extent that when the real world observations don’t match the models it is the real world observations that are questioned.
    One would have thought that with trillions of dollars and supposedly the survival of the human race at stake, that someone somewhere would have predeclared the validation tests for GCMs and those that failed them would have all funding terminated. This is what happens in other areas of science and engineering with any safety implications.
    The lack of validation and acceptance of poor results on its own shows that no-one really believes that there is a real threat from ‘climate change’. Put this in another context – if the threat was a collision between the Earth and a large asteroid in ten years time and some modeling groups were producing models that couldn’t hindcast/forecast the trajectory of the asteroid within 95% accuracy – would they still be believed and funded?
    Some issues I have are more climate related – I feel that the use of ‘de-trended’ averaging of anomalies hides a multitude of errors. As others have commented another test would be to have regional metrics perhaps each degree of latitude and longitude with a set of actual forecast values for each month – surface, mid-troposphere, tropopause, temperature, humidity ambient wind. The reason for this is that the ‘statistical-weather is climate’ argument seems to depend on the weather having some kind of Markov property – and it does not. The actual values for each degree of latitude and longitude could easily be checked against the analyzed atmosphere and provide a detailed level of model validation. This would also allow the modelers to see where their models were going awry.
    Unfortunately, the efforts appear to be in the other direction to use validation metrics that are trends based on averaged anomalies of coarse low granularity data and hide model inaccuracies; despite the model output using spurious precisions of hundredths of a degree. This seems to be aimed more at receiving further funding than at validatable accuracy.

  45. Alexander K says:
    December 2, 2010 at 6:04 am

    I guess if this accumulated wisdom was codified in a data base, that would make it more sorta scienc-y?

    Not quite what you’re looking for but… it’s called a “Farmers’ Almanac”
    Judging by the one I saw some years back, their predictive capability is orders of magnitude higher than UKMetO or Aussie BoM

  46. Another great post Willis. I like what Alexander K says: December 2, 2010 at 6:04 am regarding observational wisdom. I noticed in July that a copper-colored hummingbird that migrates through annually, came and left two weeks early. Then in August, the ruby-throated and other hummingbird species packed up two weeks early as well, this in the heat of August. In November, three and four months later, we have snow cover two to three weeks early. WUWT?

  47. Good work, Willis, hats off to you! This post deserves to be placed in Rick Werme’s “WUWT Classics”.

  48. At this stage in their development it is premature to call these programs “Models”; they do not meet the basic requirements to be so classified or cited.
    PS: I doubt that anyone would object to their being called billion dollar “Gigos”. Perhaps we’ll develop a first, true “Model” by the mid-Century mark in 40 years. We have been progressing rather well of late. (Or were, before The Great Recession.)

  49. One single number representing the temperature of a whole planet?
    Meaningless.
    Predicting a meaningless number 100 years from now?
    With feedback-loops and couplings no one has the full picture of?
    Convection? Jet streams? Sea-currents? Clouds?
    Even more meaningless.

  50. The confidence levels referenced, whether 50%, 75%, 90%, or 95%, are ludicrously low for such crucial predictions. Given the numerous known and un-counted unknown sources of bias and contamination, a far more robust standard is required. That models of this type can almost never meet such standards is not an excuse: it’s a reflection of their nature and utility.

  51. Nature Admits Scientists are Computer Illiterate (Nature)

    Researchers are spending more and more time writing computer software to model biological structures, simulate the early evolution of the Universe and analyse past climate data, among other topics. But programming experts have little faith that most scientists are up to the task. […]
    …as computers and programming tools have grown more complex, scientists have hit a “steep learning curve”, says James Hack, director of the US National Center for Computational Sciences at Oak Ridge National Laboratory in Tennessee. “The level of effort and skills needed to keep up aren’t in the wheelhouse of the average scientist.”
    As a general rule, researchers do not test or document their programs rigorously, and they rarely release their codes, making it almost impossible to reproduce and verify published results generated by scientific software, say computer scientists.
    […]
    Greg Wilson, a computer scientist in Toronto, Canada, who heads Software Carpentry — an online course aimed at improving the computing skills of scientists — says that he woke up to the problem in the 1980s, when he was working at a physics supercomputing facility at the University of Edinburgh, UK. After a series of small mishaps, he realized that, without formal training in programming, it was easy for scientists trying to address some of the Universe’s biggest questions to inadvertently introduce errors into their codes, potentially “doing more harm than good”. […]
    “There are terrifying statistics showing that almost all of what scientists know about coding is self-taught,” says Wilson. “They just don’t know how bad they are.”
    As a result, codes may be riddled with tiny errors that do not cause the program to break down, but may drastically change the scientific results that it spits out.

    The fact that every single climate model is different is irrefutable proof that the science is not settled.

  52. Standards? A litmus test of sorts? Willis, you’re suggesting climate scientists adhere to standards, when you and I both know they have none. There is no ice extent too low to predict, no temperature too high to forecast, no alarm too loud to sound. They could no more adopt self-restraining standards than a zebra could change his stripes or a leopard his spots. A good thought though.

  53. Backcasting tells you nothing about the utility of any model, except that if the model can’t backcast, it is demonstrably wrong. This is the garden path financial modelers walk down every day: if you add enough corrections, adjustments, smoothing, etc., you can replicate *any* complex historical wave form. If a model was completely useless at backcasting, it would never be published. The thing is, data mining and heuristics can backcast as well as deterministic physics.
    When you think of backcasting, you have to ask ‘how accurate is the historical data and how many accurate data points do we have for that data at that time?’ Financial and economic models have an enormous amount of long-term, exactly accurate data, and they have never been shown to have any predictive value, despite billions of dollars spent on the task every year.
    If there were a model which showed, for example, an anomaly in the past that wasn’t known, then further research showed that, indeed, that anomaly actually occurred, that would be interesting, but not proof. If I recall correctly, when historic climate data is refined, the models are ‘tweaked’ and rerun to backcast correctly – if nothing else, this shows they are heuristic, not deterministic.
    Regardless, the only way of testing a model is whether the model is capable of making accurate (within uncertainty bounds) predictions of the future. Most natural systems, and clearly the climate, are complex, non-linear, chaotic systems. (By the way, chaotic does not actually mean random.) The climate has a huge number of unknowns, even if you take as gospel all the ‘true facts’ of the AGW hypothesis. Even the knowns have limited precision.
    The nice thing about a climate model is that it is a long, long term prediction. Which means any discrepancy between the model and reality can be explained away as confounding weather with climate. Nice – remember the Jehovah’s Witnesses predicted the end of the world in 1976. Didn’t happen, but it didn’t hurt business. If you are going to make predictions, try to make sure they don’t happen within a human lifetime. This would not matter, if this was simply some interesting scientific theory. Climate models drive policy despite the fact there is no reason whatsoever to assume they have any predictive ability whatsoever. I’ve been told that ‘it’s the best information we’ve got’. This misses the point: bad information is worse than no information.
    By the way – I do not understand the point of running statistics (average, median, etc.) on a group of climate models. I think I understand the analysis presented above, but the first graph shows a ‘median’ prediction. I don’t understand the mathematical relevance of these figures any more than averaging the number of times a chicken clucks with random noise. It’s a bit like averaging the guesses (sorry – estimates) of economists regarding unemployment statistics.
    That being said, if there was science behind climate models, then I would expect they would all predict (and backcast) the same thing. ‘Models’ of gravity do not arrive at divergent conclusions.

  54. Thanks Willis for your excellent interpretation of a difficult topic. From your analysis it seems that the current generation of computer climate models don’t even get to first base by passing your ‘reasonableness’ tests for the bounds of even a possible Earth climate.
    Another interesting question is ‘Does a Global Temperature Exist?’, and this quote from an article published in the Journal of Non-Equilibrium Thermodynamics by Christopher Essex, Ross McKitrick, and Bjarne Andresen indicates that the topic is still open to debate:-
    “There is no global temperature. The reasons lie in the properties of the equation of state governing local thermodynamic equilibrium, and the implications cannot be avoided by substituting statistics for physics.
    Since temperature is an intensive variable, the total temperature is meaningless in terms of the system being measured, and hence any one simple average has no necessary meaning. Neither does temperature have a constant proportional relationship with energy or other extensive thermodynamic properties…”
    Full paper here:-
    http://www.uoguelph.ca/~rmckitri/research/globaltemp/GlobTemp.JNET.pdf
    Perhaps no surprise the results of all the global climate models are so poor!
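The statement that "any one simple average has no necessary meaning" can be illustrated with a toy calculation. The sketch below is an assumption-laden example, not a reproduction of anything in the quoted paper: the two-point temperature fields are invented, and the fourth-power mean is chosen here only because thermal emission scales with the fourth power of absolute temperature. It shows that two equally simple averaging rules can disagree about which of two states is "warmer".

```python
def power_mean(values, p):
    """Generalised mean: p = 1 gives the arithmetic mean,
    p = 4 weights values by their fourth power."""
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)


# Two hypothetical two-point temperature fields, in kelvin.
state_a = [270.0, 300.0]   # large contrast between the two locations
state_b = [286.0, 286.0]   # uniform

arithmetic_a = power_mean(state_a, 1)
arithmetic_b = power_mean(state_b, 1)
fourth_a = power_mean(state_a, 4)
fourth_b = power_mean(state_b, 4)

print(f"arithmetic mean:   A = {arithmetic_a:.2f} K, B = {arithmetic_b:.2f} K")
print(f"fourth-power mean: A = {fourth_a:.2f} K, B = {fourth_b:.2f} K")
```

By the arithmetic mean, state B is the warmer of the two; by the fourth-power mean, state A is. The example says nothing about how large such effects are for real temperature fields, only that the choice of averaging rule is not neutral in principle.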

  55. Willis is too diplomatic. When you take a model, algorithm or whatever, and tune the parameters until you get the best possible hindcast, what you have done is known as curve fitting. A complex enough system can be made to hindcast the stock market, but will have very little predictive power. The fact that the best that could be achieved at ‘predicting the past’ is an uninspiring ‘ok’ from one single model, is all I need to know about their (lack of) predictive skill.

  56. Surely the accuracy of these models is even worse than has been suggested above. The baseline they are being measured against has been demonstrated to be radically flawed. It has been ‘fixed’. The numbers for the early 20th century and before have been statistically adjusted downwards, with the recent past statistically adjusted upwards. This is being done to create an enhanced sense of ‘warming’.
    Surely this will affect the accuracy of this model’s comparison even more.

  57. Willis,
    Has this been used to evaluate models:
    Model outputs should be evaluated on how parallel they run to measurements.
    Please show the model outputs and actual measurements integrated (delta T)
    This will help comparison of when the model is trending opposite the actual temperature.
    What might appear to be the better model with one evaluation, might not be the best with another.

  58. This looks like a nice, careful, piece of work but it has, I think, some conceptual and practical problems.
    On the conceptual side we need to be concerned that these models are highly incestuous with respect to the data. Specifically, they’re continually adjusted to better fit the data – so checking their reliability through hindcasting merely tells you directly how good the various maintainers are at tinkering and indirectly which data set, with which adjustments, were preferred by the people involved for each period during which they worked on it.
    On the practical side the reality is that we don’t have reliable retrospective data – so a model which hindcast some average with near perfect precision over a period of some years is obviously inaccurate because we know the data is wrong. Since we don’t have good data, we don’t know how far off the data we do have is, but in this context that doesn’t really matter: whether x is 0.0001 degrees f per acre/year or plus or minus 2 degrees C per continent/hour the point is that the better the model does at predicting bad data, the worse we should think it to be.

  59. “”””” “Amplification of Surface Temperature Trends and Variability in the Tropical Atmosphere“, “””””
    So I haven’t read the paper yet; or any of the posts, but I was wondering just what the blazes this title means.
    It would appear the paper deals with two different subjects:- Tropical Atmosphere Variability; and Surface Temperature Trend Amplification.
    Do all Universities have departments that teach how to write gobbledegook titles for the publish or perish papers; because you only have to read any weekly issue of SCIENCE to see that that skill is very widespread.
    So I wonder just what aspects of the Tropical Atmosphere are varying enough to bother studying. Is the N2/O2 proportion changing enough to comment on or maybe the isotopic ratios of those constituents of the tropical atmosphere? It would seem to me that global mixing works well enough that monitoring how the atmosphere in just one region varies from time to time, would not be of much use.
    And then there’s that Surface Temperature Trend and its amplification. Well do they really mean “changes” in the surface temperature trend; it seems odd to talk of amplification of a trend; well unless it is simply that the trend line slope changes.
    Well it’s all very curious; so I guess I’ll take a chance and read the paper, and see what comments other readers had to say.

  60. TerryS says:
    December 2, 2010 at 5:52 am
    Re: Murray Grainger
    Mike Haseler says:
    December 2, 2010 at 5:54 am
    Murray Grainger says: “Apples and Oranges folks.”
    Sorry chaps, I shall make my sarcasm more obvious in future.

  61. The claim of the modellers has been that, although their models are totally unable to predict the year-by-year temperature, they are able to predict the temperature trend over a number of years.

    For this to be true, the errors over time would have to average out to zero. In other words, the cumulative effects of the errors in the model would have to cancel themselves out.
    It is far more likely that the cumulative effect of the model errors over time is to amplify them. The errors are compounded, not averaged out.
    A model makes its run for a year. Then it takes that output and uses it as input for the next run. Each time, the errors get larger.
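    The difference between errors that average out and errors that compound can be sketched numerically. This is only an illustration under stated assumptions, none of which come from any GCM: each step has a hypothetical zero-mean normal error of size `sigma`, the "averaged" case takes the mean of independent errors, and the "accumulated" case is a simple running sum in which each step inherits the error of the step before it. Whether real model errors behave like either case is exactly what is in dispute here.

```python
import random
import statistics


def averaged_error(n_steps, sigma, rng):
    """Mean of n_steps independent zero-mean errors (the 'errors cancel' case)."""
    return statistics.fmean([rng.gauss(0.0, sigma) for _ in range(n_steps)])


def accumulated_error(n_steps, sigma, rng):
    """Running sum of errors (each step inherits the previous step's error)."""
    total = 0.0
    for _ in range(n_steps):
        total += rng.gauss(0.0, sigma)
    return total


def spread(error_fn, n_steps, sigma=0.1, n_runs=2000, seed=0):
    """Standard deviation of the final error across many hypothetical runs."""
    rng = random.Random(seed)
    return statistics.pstdev([error_fn(n_steps, sigma, rng) for _ in range(n_runs)])


if __name__ == "__main__":
    # Columns: number of steps, spread if averaged, spread if accumulated.
    for n in (10, 100):
        print(n, round(spread(averaged_error, n), 4), round(spread(accumulated_error, n), 4))
```

    Under these assumptions the spread of the averaged error shrinks roughly as sigma divided by the square root of the number of steps, while the spread of the accumulated error grows roughly as sigma times that square root.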

  62. very nice work, Willis …
    may I suggest you and some of the more statistical guys here go over it for errors/changes/clarifications, and you submit it to some AGW minded journal … ?
    that might help the discussion, not with the fanatics, but with the reasonable …

  63. A serious issue with hindcasting is the quality and methodology of the historical surface data.
    The surface station and satellite data needs to be cross-calibrated on a station-by-station basis to widen the period of high-quality data. Doing a correlation study on the end product (GMST) is nowhere near the same thing as figuring out exactly the mathematical relationship between the HCN Station #123456 and the corresponding satellite gridcell.
    It is nice when one can say the surface stations and the satellite data agree in general on the averaged data. It is far better when one can say: “Station #123456 reads 2.4±0.03C higher than the satellite data over 1978-now.” This examination would also allow the methodical evaluation of UHI effects and the other adjustments that occur during the satellite period.

  64. “”””” Tenuc says:
    December 2, 2010 at 8:25 am
    …………………………
    Another interesting question is ‘Does a Global Temperature Exist?’, and this quote from an article published in the Journal of Non-Equilibrium Thermodynamics by Christopher Essex, Ross McKitrick, and Bjarne Andresen, indicates that the topic is still open to debate:-
    “There is no global temperature. The reasons lie in the properties of the equation of state governing local thermodynamic equilibrium, and the implications cannot be avoided by substituting statistics for physics. “””””
    Well this is an often posed question, usually drawing the answer above; in fact no less a luminary than Prof Richard Lindzen recently stated that in his five minutes of commentary to that pitiful lame duck Congressional committee hearing that insulted all of us.
    Now I agree with the sentiment; but not pedantically with the answer.
    To me it is quite obvious. Mother Gaia has a thermometer in each and every single atom or molecule on the planet; or shall we say just the near surface area. So she knows what each thermometer says, and she can read every one of them instantaneously; every attosecond if she wants to. Then it is trivial for her to add them all up, and divide by the number of thermometers, and, voilà! she has the average temperature of that limited near surface region at that time.
    So it exists; but of course, the real issue, is can WE measure it, since MG is never going to tell us the answer.
    And of course we can’t measure it; and it’s a waste of time anyway; because as you say Tenuc it has no connection to energies or anything else we would want to know.
    By the way; for the legal disclaimer:- I sure hope Y’alls don’t figgah me as one of those Gaia kooks. I just subjected myself last night to watching my son’s Blu-ray copy of AVATAR; he’s a video/movie student, so he studies such films from their artistic, and technique points of view. It was certainly a marvel of science fiction out of the box writing, and special effects wonderment; and I was impressed with all of that.
    Apart from that it is the most blatant piece of political propaganda crap I have ever seen; and of course the whole GAIA concept of an integrated network of interdependent cells of a single organism is central to the totally nude message. Unobtainium is of course the evil oil under the desert sands of Arabia; and the Arabs are the set upon innocents trying to eke out a peaceful living by just eating sand, and minding their own business; which is hard to do given the constant assaults by the evil Americans.
    So Cameron is evidently a spokesperson for the one world movement; and I’m surprised that he is able to get away with such blatant plagiarism; without giving credit to the Gaia worshipers.
    No; my Mother Gaia, is just a Super Maxwell’s Demon; that is able to observe and note that which the laws of physics don’t really allow US to observe and note; and of course she can never tell us her findings; but we can take comfort in the knowledge that the actual state of the planet, or say its climate, is ALWAYS exactly that, which we could (in principle) compute if we were so fortunate (or maybe unfortunate), to have all that information that Mother Gaia has but we can never know.
    So don’t bother with cleaning out a room at the funny farm for me; I am quite sane; I just use a different toolbox from some others.
    But back to the subject; the global mean temperature if we could measure it, carries no more scientific knowledge or significance, than does the mean telephone number contained in the latest edition of the Manhattan Telephone directory; so Lindzen and others are correct in saying there’s no such thing; and it would add nothing to our knowledge of energy flows in our climate system, if we DID know such a number. And if you don’t like the number; what would you change it to, if we had that power; which fortunately we do not.

  65. gnarf says:
    December 2, 2010 at 5:15 am
    Even an accurate model (if it exists) will fail your tests. These models are supposed to forecast temperature trend-> they should pass your first test.
    But the only way to pass your 2nd and 3rd test is to have a statistically accurate temperature derivative and second derivative… very complex (increased chaos every time you differentiate), not sure it can be done at all…and it should have very limited influence on the trend so I understand perfectly that most models don’t even try to do such thing and focus on forecasting the trend.
    ###################
    Agreed. However as Willis said if we want to set our own list of tests we should.
    1. There should be a suite of tests.
    2. These tests should be related to the USE of the models.
    3. Criteria need to be established prior to the testing.
    The concept is simple. When we look at the damage mechanisms of climate change we can, for example, call out these three: increased temperature, drought and floods, increased sea level. Simply, temperature trend, precipitation extremes, and sea level rise.
    It would of course be nice to get the variables correct that Willis chose. But one can get the trend correct and miss on all the variables that Willis selected, especially in a 21 year period, especially with some actual climate cycles being longer than the observation period.
    It’s silly to continue to use models that perform badly. It’s also weird that the models don’t use the same inputs.
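    The point that a series can get the trend right while missing the month-to-month behaviour can be shown with a small sketch. It assumes that "derivative" and "second derivative" are taken as first and second differences of a time series; the two series and all function names are hypothetical and are not data from Santer05 or from any model.

```python
import statistics


def diff(series):
    """First differences: the change from one time step to the next."""
    return [b - a for a, b in zip(series, series[1:])]


def trend(series):
    """Ordinary least-squares slope per time step."""
    n = len(series)
    x_mean = (n - 1) / 2
    y_mean = statistics.fmean(series)
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(series))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def wiggle_stats(series):
    """Spread of the first and second differences of a series."""
    d1 = diff(series)
    d2 = diff(d1)
    return {
        "sd_first_diff": statistics.pstdev(d1),
        "sd_second_diff": statistics.pstdev(d2),
    }


# Two hypothetical series with the same underlying slope of 0.01 per step.
smooth = [0.01 * i for i in range(24)]
pattern = [0.5, -0.5, -0.5, 0.5]  # symmetric wiggle that adds no net slope
jumpy = [0.01 * i + pattern[i % 4] for i in range(24)]

if __name__ == "__main__":
    for name, series in (("smooth", smooth), ("jumpy", jumpy)):
        print(name, round(trend(series), 4), wiggle_stats(series))
```

    Both series have the same fitted trend, but the spread of their differences is very different, which is why a trend test and a "wiggle" test are separate entries in a suite of tests.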

  66. Poptech says:
    December 2, 2010 at 8:14 am

    As a general rule, researchers do not test or document their programs rigorously, and they rarely release their codes, making it almost impossible to reproduce and verify published results generated by scientific software, say computer scientists. […]

    Thanks Poptech – this observation has been my view for some time. Code documentation is, by and large, a foreign concept to many of these GCM development groups. Some do make an effort though (like NCAR)…
    The whole idea of “reproducibility” brings up a good question in light of Willis’ present article. Each of the codes cited above purports to solve the basic thermo-fluid dynamics equations and related submodels required to “accurately” simulate “climate”. Each starts with the SAME initial conditions, uses the SAME boundary conditions, and the SAME forcings. Yet, the solutions plotted in the first graph show wildly different time histories (different amplitudes and phase). Why is that? Well, more than likely each code has different assumptions for how it assimilates data for initial conditions, different submodels for everything from turbulence to sea ice, different ways of discretizing and solving the basic equations (e.g. finite difference versus spectral methods), different ways of handling the boundary conditions and forcings, etc. Moreover, the codes are so logically complex (and likely full of bugs unique to a given code) that it is very unlikely that an independent model development group could take the methods described by another group and “reproduce” their results precisely.

  67. Great Post.
    The underlying issue is that the whole concept of radiative forcing is invalid. The so called ‘radiative forcing constant’ for CO2, 2/3 C increase in ‘surface temperature’ per each increase of 1 W in the downward LWIR flux is just nonsense. It is simply the effect of ocean surface temperatures, urban heat islands and a lot of temperature data ‘fixing’ on the meteorological surface temperature record.
    Until these models are upgraded to include realistic solar ocean heating effects and radiative forcing is removed, they will be incapable of predicting any kind of climate change.
    Garbage in, Gospel out.

  68. @Mingy
    > Backcasting tells you nothing about the utility of any model, except if the
    > model can’t backcast your model is demonstrably wrong. This is the
    > garden path financial modelers walk down every day: if you add
    > enough corrections, adjustments, smoothing, etc., you can
    > replicate *any* complex historical wave form.
    You’re throwing out the baby with the bath. Backcasting is very useful for validating a model, because you don’t have to “wait for the future to happen” to do the validation.
    It’s still “blind testing” in the sense that no modelers can possibly train their models on every piece of existing data. In any case, most of us set aside some “blind data” for testing anyway. And those corrections you mention apply to the future too, which we’re also “blind” to.
    They’re using a small fraction of the data to predict the rest. That’s utility in my book.

  69. Steve Mosher:
    Well said. Willis has identified and explained some very reasonable and “necessary” tests for evaluating models – tests which most GCMs seem to fail. By their very nature these types of numerical tests are not “sufficient” to evaluate a model because they say little about the actual physical processes embedded in the model. In other words, Willis makes no statement as to the scientific reasonableness of the UKMET model and only states that its hindcast compares favorably with actual observations. The UKMET model may, for example, boil down to an extrapolation of past temperatures plus some highly constrained noise. Without a rigorous assessment of the model any analysis is incomplete. However, models that fail to meet Willis’ tests are inherently suspect.
    It seems to me to be perfectly reasonable to require that any proposed model state the results of Willis’ type of tests in a clear way so that those using its outputs should be aware of its limitations.

  70. This is just an example of an element of the validation suite that should be applied to each model before it is allowed to be used.

  71. As a modeller (hydrological / hydraulic) my work has to be calibrated against observed data (numerous locations) and must be within stated tolerances. The model must then be tested for discrete events not used for calibration (to check it hasn’t been force fitted), run with historical events and compared with known observations / photos etc. I must then run a long period of Time Series Rainfall (TSR) to check that the model can account for seasonal changes such as Soil Moisture Deficit (SMD) or evapotranspiration changes and correctly predict the right response for each rainfall event during this period.
    This still doesn’t ensure a completely “robust” model, but it gives us reasonable confidence that it’s fit for purpose. The model is then used to calculate flood extents for a design rainfall event, or develop solutions, sometimes costing millions of $.
    As I do this commercially and millions of $ can often ride on the model results, we have to carefully test and demonstrate the model is of use.
    The main things we test when “hind casting” is peak flow rates match, peak depths match, volumes match, the timing of the peaks is correct and they must have a good VISUAL fit. I would expect the same criteria to be applied to GCMs given what’s riding on them. Unfortunately, govt funded scientific modellers are often only interested in knocking out a paper rather than gathering months of real life data and spending months calibrating and testing a model; they don’t have the same commercial pressures and I think that’s the issue with GCMs.
    I would expect them to select at least 1000 temperature stations (all well sited with long records) for calibration, and I would expect most of them to match the timing of ups and downs, the peaks, the dips, the overall increase or decrease in temp / energy and provide a good visual fit. The same should also be done for SST based on a number of grid boxes, and for rainfall and incoming radiation. If a model cannot replicate the current or known temperature history on a regional scale, it is of little use to anyone and I will remain unconvinced.
    It’s like me telling my client that the peak of the flood, the depth and the duration don’t match observed at all, but it’s OK as the overall volumes match, so it’s “robust”, despite the fact it would lead to poorly designed defences and the deaths of innocent people!
    I would be happy to write an article for WUWT on hydrological / hydraulic model calibration if of interest to anyone – for comparison of the GCM approach with an alternative well established modelling field.
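    The hindcast checks listed in this comment (peak match, timing of the peak, volume match) can be sketched in a few lines. This is an illustration only: the function name, the flow values and the hourly time step are hypothetical, flows are assumed to be in cubic metres per second, and no acceptance tolerances are set because none are stated above.

```python
def hindcast_checks(observed, modelled, dt_hours=1.0):
    """Compare peak size, peak timing and total volume of two flow series."""
    if len(observed) != len(modelled):
        raise ValueError("series must have the same length")
    obs_peak = max(observed)
    mod_peak = max(modelled)
    # Time of the (first) peak, in hours from the start of the event.
    obs_peak_time = observed.index(obs_peak) * dt_hours
    mod_peak_time = modelled.index(mod_peak) * dt_hours
    # Volume is flow (m3/s) summed over each time step (seconds).
    obs_volume = sum(observed) * dt_hours * 3600.0
    mod_volume = sum(modelled) * dt_hours * 3600.0
    return {
        "peak_error_pct": 100.0 * (mod_peak - obs_peak) / obs_peak,
        "peak_timing_error_h": mod_peak_time - obs_peak_time,
        "volume_error_pct": 100.0 * (mod_volume - obs_volume) / obs_volume,
    }


# Hypothetical event: same total volume, but the modelled peak is lower and late.
observed = [1, 2, 5, 9, 6, 3, 2, 1]
modelled = [1, 1, 2, 4, 8, 7, 4, 2]

if __name__ == "__main__":
    print(hindcast_checks(observed, modelled))
```

    In this made-up event the volumes agree exactly while the peak is too low and an hour late, which is the situation described above where matching the overall volume alone would wrongly suggest a “robust” model.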

  72. @John Day
    “You’re throwing out the baby with the bath. Backcasting is very useful for validating a model, because you don’t have to “wait for the future to happen” to do the validation.
    It’s still “blind testing” in the sense that no modelers can possibly train their models on every piece of existing data. In any case, most of us set aside some “blind data” for testing anyway. And those corrections you mention apply to the future too, which we’re also “blind” to.
    They’re using a small fraction of the data to predict the rest. That’s utility in my book.”
    I don’t know how much experience you have with models, but there are models and there are models.
    All ‘models’ of the financial system work well in back casting. All of them which get published, in any event. But they have no ability to predict the future – they diverge from the moment you start predicting.
    When you look at, let’s say, planetary orbits, you have a deterministic model with high resolution inputs. When you look at something like climate, you have a chaotic system – like water flow. I know people don’t think it’s chaotic (and most folks don’t know what the word means in the context of modeling) but can we agree that water flow, air flow, and clouds have an impact? It’s a matter of scientific fact that life modified (and modifies) the climate and vice versa. Can we confidently model the biosphere and carbon cycle?
    What about backcasting gives you any confidence about a model’s predictive ability? The only thing you can say is that if it can’t backcast it can’t forecast!
    Of course, the fact no two models seem to agree is another vote of confidence …

  73. Take any seven factors correlatable to time (i.e., ketchup sales in Tennessee, the number of bald eagles, etc.), and one can write a model (in fact, an infinite number of models) that fits past temperature data, whether good data or highly corrected data based on many arbitrary assumptions. There’s really no difficulty in that. A model must not only stand to reason, but also be testable by reliable data under controlled conditions. So far, science has come nowhere near achieving this.
    Assumptions can be reasonable but biased, if they are chosen from among multiple equally reasonable assumptions. Today’s “climate science” comprises little more than creating mathematical models based on biased and untestable assumptions, testing the models using poorly collected data that has been adjusted by biased and untestable assumptions, and predicting disasters based on many more arbitrary and untestable assumptions. Equally reasonable, biased and untestable assumptions would predict future cooling and global prosperity. But this activity would not be good science either.
    Physics can be used to calculate the heat that an increase of X ppm of certain gases would absorb or reflect (before heat removal from the earth by convection is accounted for), which predicts a tiny undetectable temperature increase. But in most models, positive (amplifying) feedbacks of CO2 are biasedly emphasized over negative (dampening) feedbacks, and other model factors are used to explain why CO2 has not yet created the chaos that it will soon create. And it’s all “validated” by poorly collected and poorly controlled data.
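    The remark that enough free factors will fit any past record can be illustrated with a stand-in technique, named plainly: polynomial interpolation rather than a regression on seven real-world factors. A polynomial with seven coefficients (degree six) passes exactly through any seven points with distinct x values. The data values below are hypothetical, not temperatures from any record.

```python
def lagrange(xs, ys):
    """Return the unique polynomial through the points (xs[i], ys[i])."""
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return p


# Seven hypothetical "past" values, indexed by year 0..6.
years = [0, 1, 2, 3, 4, 5, 6]
values = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, 0.1]
fit = lagrange(years, values)

if __name__ == "__main__":
    # In-sample: the fit reproduces every past value.
    print([round(fit(x), 6) for x in years])
    # Out-of-sample: evaluate the same curve a few years beyond the data.
    print(fit(10))
```

    The in-sample fit is perfect, yet the value a few steps beyond the data lies far outside the range of anything observed, so the quality of the fit to the past says nothing by itself about the forecast.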

  74. Lance Wallace says:
    December 2, 2010 at 6:05 am

    Your table shows two UKMO models, one with only 4 “forcings” and the other with 9. Which one was in the graphs? As you know, with 9 free parameters one can fit up to a 9th-order polynomial perfectly.

    I believe that it is the one with nine forcings …

  75. Can you, Willis, or someone else point me to a reference to the radiance of the sun? That is, for a solar cell, how many watts per square meter do you get at the earth’s surface as a maximum? Activists play with figures, saying things like you can get 600 watts or 1400 watts and that it is only inefficiencies of the collectors that are the problem. Currently I think it is 370 watts maximum and cells collect about 100 watts, but I cannot find an adequate reference. Is it a conspiracy? Could WikiLeaks help?

  76. steven mosher says:
    December 2, 2010 at 9:42 am

    gnarf says:
    December 2, 2010 at 5:15 am

    Even an accurate model (if it exists) will fail your tests. These models are supposed to forecast temperature trend-> they should pass your first test.
    But the only way to pass your 2nd and 3rd test is to have a statistically accurate temperature derivative and second derivative… very complex (increased chaos every time you differentiate), not sure it can be done at all…and it should have very limited influence on the trend so I understand perfectly that most models don’t even try to do such thing and focus on forecasting the trend.

    ###################
    Agreed. However as Willis said if we want to set our own list of tests we should.
    1. There should be a suite of tests.
    2. These tests should be related to the USE of the models.
    3. Criteria need to be established prior to the testing.
    The concept is simple. When we look at the damage mechanisms of climate change we can, for example, call out these three: increased temperature, drought and floods, increased sea level. Simply, temperature trend, precipitation extremes, and sea level rise.
    It would of course be nice to get the variables correct that Willis chose. But one can get the trend correct and miss on all the variables that Willis selected, especially in a 21 year period, especially with some actual climate cycles being longer than the observation period.
    It’s silly to continue to use models that perform badly. It’s also weird that the models don’t use the same inputs.

    Hey, Mosh, as always, good to hear from you.
    First, let me say that it is not just silly to use models that “perform badly” (whatever that means). It is the antithesis of the scientific method. And I agree that comparing models that use different inputs, as do the IPCC models, is meaningless.
    Next, yes, it is theoretically possible to forecast the future trend of the climate without getting the small stuff (e.g. daily, monthly, and annual temperatures and swings) right. For example, the CM2.1 model (purple) in Fig. 1 shows huge inter-annual swings that do not appear in either the observations or the other models. But it still does a passable job of hindcasting the trend … sorry, that don’t impress me much. A model that claimed that the daily temperature varied over a 60°C range and that the globe varied 30° from summer to winter could get the trend right too … but that doesn’t mean I would trust that model. Would you?
    This is particularly true of “iterative” models like GCMs. These take the output for one instant of model time, and use that output as input for the next instant of model time. Since the big stuff (years and centuries) is literally built up hour by hour from the small stuff, we can’t “skip steps” as we can do in more theoretical, physics-based, non-iterative models.
    Next, we are supposed to trust these iterative models because they claim to be based on “fundamental physics”. And unfortunately our testing options are limited, since it takes fifty years to actually test a fifty-year prediction.
    But if a model gives unphysical results in the tests I used above, if it shows monthly ocean temperatures changing by some huge never-before-seen amount in a single month … what does that say about the quality of their “fundamental physics”?
    Mosh, I was interested in your choice of things to test (temperature trend, precipitation extremes, and sea level rise). I say that because I haven’t seen any observations of an increase in precipitation extremes. Nor is there any observed increase in the rate of sea level rise. In fact, the rate of rise has decreased slightly since about 2006.
    Given that, it’s not clear to me exactly how you would use those to test the models.
    Finally, gnarf, I totally disagree with your claim that “Even an accurate model (if it exists) will fail your tests.” The UKMO model did a reasonable job, and that’s only looking at the nine models used by Santer. When one of the first nine models grabbed at random does all right, I’m not buying the ‘but it’s toooo hard’ excuse.
    Thanks for your thoughts,
    w.

  77. John Day says:
    December 2, 2010 at 10:03 am

    @Mingy
    > Backcasting tells you nothing about the utility of any model, except if the
    > model can’t backcast your model is demonstrably wrong. This is the
    > garden path financial modelers walk down every day: if you add
    > enough corrections, adjustments, smoothing, etc., you can
    > replicate *any* complex historical wave form.
    You’re throwing out the baby with the bath. Backcasting is very useful for validating a model, because you don’t have to “wait for the future to happen” to do the validation.
    It’s still “blind testing” in the sense that no modelers can possibly train their models on every piece of existing data. In any case, most of us set aside some “blind data” for testing anyway. And those corrections you mention apply to the future too, which we’re also “blind” to.
    They’re using a small fraction of the data to predict the rest. That’s utility in my book.

    John, as you indicate, only “out of sample” testing is of use for a tuned model. Testing the model against the data used to tune it means nothing.
    Unfortunately, you assume that climate science works like other fields when you say “They’re using a small fraction of the data to predict the rest. That’s utility in my book.” I have found no evidence that this is the case for any of the climate models. They are trained on all available data, leaving nothing for out of sample testing. I’d be happy to be shown wrong on that, but I’ve never seen a climate model that was built using a “calibration/verification” method involving half the available data and then the other half.
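    For readers unfamiliar with the “calibration/verification” method mentioned here, a minimal sketch follows. It assumes the simplest possible model (a straight line fitted by least squares) and made-up data; the function names are hypothetical and nothing in it describes how any climate model is actually built or tuned.

```python
import math
import statistics


def fit_line(xs, ys):
    """Least-squares slope and intercept."""
    x_mean = statistics.fmean(xs)
    y_mean = statistics.fmean(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def rmse(xs, ys, slope, intercept):
    """Root-mean-square error of the line against the given points."""
    return math.sqrt(statistics.fmean([(slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)]))


def split_sample_test(xs, ys):
    """Tune on the first half of the record, then score on the untouched second half."""
    half = len(xs) // 2
    slope, intercept = fit_line(xs[:half], ys[:half])
    return {
        "calibration_rmse": rmse(xs[:half], ys[:half], slope, intercept),
        "verification_rmse": rmse(xs[half:], ys[half:], slope, intercept),
    }


# Hypothetical record: rises for ten steps, then goes flat.
xs = list(range(20))
ys = [0.02 * x for x in range(10)] + [0.18] * 10

if __name__ == "__main__":
    print(split_sample_test(xs, ys))
```

    The line fits the calibration half essentially perfectly and still does poorly on the verification half, which is the information that is lost when a model is tuned on all of the available data.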

  78. A very interesting post.
    Willis — you compared the statistical characteristics of 1979-2000 data to models and found the models to have very different statistics.
    Would it be possible for you to do the same sort of comparison between the 1979-2000 data and the 2000-2010 or the 1950 to 1980 data?
    It would be interesting to see whether the statistical measures you chose to look at are relatively constant in the temperature data record.
    Charlie

  79. stumpy says:
    December 2, 2010 at 10:45 am

    The main things we test when “hind casting” is peak flow rates match, peak depths match, volumes match, the timing of the peaks is correct and they must have a good VISUAL fit.

    Stumpy, thank you for your excellent post from the point of view of someone who builds and tests models for a living. I was particularly struck by your quote above regarding a good “visual fit”. This is the quality that I referred to several times above as “lifelike”.

  80. @Willis Eschenbach
    Out of sample data might be useful for testing in certain contexts – however, if the in sample and out of sample data happen to be correlated (which I would imagine would apply to climate data), not so much.
    The challenge remains: how can we ensure a model has predictive value? Well, if it is a deterministic model with known inputs and a robustly demonstrated theoretical framework that works on past data, then it might work on future data as well. We don’t know whether it has predictive value, but it might. Then we look at prediction vs. reality (think tree-rings). If reality disagrees with prediction, then the model is obviously useless.
    Then again, if climate models were deterministic with known inputs and a robustly demonstrated theoretical framework they would agree with one another.
    There is a great quote from M&M’s book, something along the lines of “A model of a mouse, no matter how good, tells you nothing about a mouse”. That perfectly aligns with the courses on modeling I took: don’t believe your model tells you anything about nature, nature tells you what’s wrong with your model.

  81. Excellent analysis Willis. Thanks.
    It would be interesting to know more about the details and assumptions of how each alleged climate model works, er to be more accurate, doesn’t work.
    As an aside, I looked up “climate forcing” and found two definitions using a Google search:
    (1) “Climate Forcing: The Earth’s climate changes when the amount of energy stored by the climate system is varied. The most significant changes occur when the global energy balance between incoming energy from the Sun and outgoing heat from the Earth is upset. There are a number of natural mechanisms that can upset this balance, for example fluctuations in the Earth’s orbit, variations in ocean circulation and changes in the composition of the Earth’s atmosphere. In recent times, the latter has been evident as a consequence not of natural processes but of man-made pollution, through emissions of greenhouse gases. By altering the global energy balance, such mechanisms “force” the climate to change. Consequently, scientists call them “climate forcing” mechanisms.”
    http://www.ace.mmu.ac.uk/eae/climate_change/older/Climate_Forcing.html
    That seems like a clear definition and talks about the actual planet.
    The second definition is kind of different.
    (2) “Forcings: Forcings in the climate sense are external boundary conditions or inputs to a climate model. Obviously changes to the sun’s radiation are external, and so that is always a forcing. The same is true for changes to the Earth’s orbit (“Milankovitch cycles”). Things get a little more ambiguous as you get closer to the surface. In models that do not contain a carbon cycle (and that is most of them), the level of CO2 is set externally, and so that can be considered a forcing too. However, in models that contain a carbon cycle, changes in CO2 concentrations will occur as a function of the climate itself and in changes in emissions from industrial activity. In that case, CO2 levels will be a feedback, and not a forcing. Almost all of the elements that make up the atmosphere can be considered feedbacks on some timescale, and so defining the forcing is really a function of what feedbacks you allow in the model and for what purpose you are using it. A good discussion of recent forcings can be found in Hansen et al (2002) and in Schmidt et al (2004).
    Filed under: * Glossary — group @ 28 November 2004 -”
    http://www.realclimate.org/index.php/archives/2004/11/forcings/
    This second definition isn’t clear, certainly not as clear as the first, but that’s ok, not everyone writes with the same level of clarity, which is why I generally look for more than one source for definitions.
    The main point that occurred to me as I read the second definition is that they are defining it in terms of “climate models” rather than the actual atmosphere. I find this very peculiar. Aren’t they being paid to study the ACTUAL atmosphere and climate? I gather not for their definition clearly shows them to be defining their terms in terms of climate models rather than the actual planet.
    Now maybe I’m just splitting hairs, but also maybe it reveals a profound difference in the mind set of the two sets of scientists? One focused on the actual planet and the others focused on models to the extent that they define their world in terms of computer models! Strange indeed, and I’m a computer scientist!
    Anyway, thanks for your illuminating findings Willis.

  82. Willis:
    First, let me say that it is not just silly to use models that “perform badly” (whatever that means). It is the antithesis of the scientific method. And I agree that comparing models that use different inputs, as do the IPCC models, is meaningless.
    ############
    Before we had accurate models of RCS we used models we knew were wrong. We picked the least wrong. That’s far better than letting the perfect be the enemy of the good enough. This is not really about the scientific method.
    “Next, yes, it is theoretically possible to forecast the future trend of the climate without getting the small stuff (e.g. daily, monthly, and annual temperatures and swings) right. For example, the CM2.1 model (purple) in Fig. 1 shows huge inter-annual swings that do not appear in either the observations or the other models. But it still does a passable job of hindcasting the trend … sorry, that don’t impress me much. A model that claimed that the daily temperature varied over a 60°C range and that the globe varied 30° from summer to winter could get the trend right too … but that doesn’t mean I would trust that model. Would you?”
    Of course one has a cascade of tests. first order MOM, second order MOM.
    getting the trend correct is important. if two get the trend correct and one gets the wiggles right, you of course pick the best model.
    Next, we are supposed to trust these iterative models because they claim to be based on “fundamental physics”. And unfortunately our testing options are limited, since it takes fifty years to actually test a fifty-year prediction.
    # we can test a 50 year prediction at any time. its just math.
    “But if a model gives unphysical results in the tests I used above, if it shows monthly ocean temperatures changing by some huge never-before-seen amount in a single month … what does that say about the quality of their “fundamental physics”?”
    it says nothing on its face. Could be lots of issues.
    “Mosh, I was interested in your choice of things to test (temperature trend, precipitation extremes, and sea level rise). I say that because I haven’t seen any observations of an increase in precipitation extremes. Nor is there any observed increase in the rate of sea level rise. In fact, the rate of rise has decreased slightly since about 2006.
    Given that, it’s not clear to me exactly how you would use those to test the models.”
    easy. if the model fails a test of sea level rise, you dont use its predictions to drive policy. Same for droughts and floods or hurricanes.

  83. I openly confess to not really understanding the methodology of producing GCMs. What I find somewhat perplexing – is the fact that, with all the up-to-date real data, of many different parameters, and the best computer programs, etc – local (i.e. countrywise) weather forecasts cannot accurately predict more than a few days ahead.
    Now, I’m sure someone will correct me if I am wrong – but if the above is not demonstrated to be possible, with ALL the available carefully obtained meteo data and climate knowledge – how the heck can anyone expect to produce a model (no matter how general and simplistic) that looks back over time (the last 150 yrs) from a load of assumptions, and without knowing detailed climate variabilities and sensitivities (which forcings, etc!) or detailed actual observations? – and even expect to get ‘close’?
    I know most people accept that weather forecasting is difficult because real-life actual prediction of the ‘chaotic’ system cannot be achieved. So why does anyone think hindcasts from GCM models will ever show anything other than possible ‘general’ indications?
    I am not saying that trying to model past climate won’t help to understand forcings and sensitivity, etc – but they can never really be fully verified! As a really simplistic example, what if the model uses a given forcing for a given parameter and a given feedback – and then gives a reasonable ‘fit’. That does not mean the model is correct – what it really means is that the modeller chose values that seem to fit. (or in the back analysis process, the values were altered to GET it to fit!) In practice, a completely different forcing and sensitivity (from a non-considered or indeed unknown source) could have caused the same effect? Or, put another way, a combination of said forcings and sensitivities – it only takes half a dozen ifs – and (IMHO) you are wasting your time!
    Perhaps more simply – I suggest the following thought experiment –
    Imagine standing at one side of a see-saw sticking out of a wall, but you don’t know it’s a see-saw! (it could be a simple balance beam, or a cranked beam, or some form of geared system, etc – it doesn’t matter – the point is you do not know!) – You see only a seat which inexplicably moves up and down and you cannot see behind the wall to the other side of the see-saw. You get on the seat – it goes down, you get off it goes up – but sometimes an imaginary bloke around the other side gets on and off at random, or his child does – but all you ever SEE, is the effect at YOUR side, i.e. sometimes it doesn’t go down, or stays down when you get off, or it goes down or up slowly! You don’t know when this will happen, you don’t know why it happens, so you invent explanations, how many kids he has, how obese he or his kids are, or how long his side of the ‘beam’ is, etc, etc.
    Now, how the heck are you going to work out how the system works?? You have to invent all the possibilities for the fact that sometimes your seat goes up or down, but you can never model it or use your model to predict what will happen the next time you get on or off the seat!
    I cannot see that trying to model the massive climate system and biosphere we call home could ever be understood to a level where significant predictive capabilities are possible. IMO, the biosphere/ecosphere/climate system is the equivalent of an awful lot of see-saws, but unlike my example, they are inextricably intertwined together.
    I don’t wish to be defeatist as such – just realistic about the scale of the problem regarding GCMs!

  84. steven mosher says:
    December 2, 2010 at 1:43 pm
    Willis:

    First, let me say that it is not just silly to use models that “perform badly” (whatever that means). It is the antithesis of the scientific method. And I agree that comparing models that use different inputs, as do the IPCC models, is meaningless.

    ############
    before we had accurate models of RCS we used models we knew were wrong. We picked the least wrong. thats far better than letting the perfect be the enemy of the good enough. This is not really about the scientific method

    The scientific method involves testing. Using all of the models without testing them and eliminating those that don’t work, what you have called “model democracy”, is anti-scientific.
    Next, people always drag out the old saw about “don’t let the perfect be the enemy of good enough”. I’m not asking for perfection. I’m specifically asking if the models are good enough.
    And what is RCS when it is at home?

    “Next, yes, it is theoretically possible to forecast the future trend of the climate without getting the small stuff (e.g. daily, monthly, and annual temperatures and swings) right. For example, the CM2.1 model (purple) in Fig. 1 shows huge inter-annual swings that do not appear in either the observations or the other models. But it still does a passable job of hindcasting the trend … sorry, that don’t impress me much. A model that claimed that the daily temperature varied over a 60°C range and that the globe varied 30° from summer to winter could get the trend right too … but that doesn’t mean I would trust that model. Would you?”

    Of course one has a cascade of tests. first order MOM, second order MOM.
    getting the trend correct is important. if two get the trend correct and one gets the wiggles right, you of course pick the best model.

    Exactly. I am just suggesting that the first order of testing be the comparison against data I detailed in my post.

    Next, we are supposed to trust these iterative models because they claim to be based on “fundamental physics”. And unfortunately our testing options are limited, since it takes fifty years to actually test a fifty-year prediction.

    # we can test a 50 year prediction at any time. its just math.

    Huh? What am I missing here? I predict that the world will disappear in 2060. Please tell me how I can test that prediction in less than fifty years. Use all the math you want.

    “But if a model gives unphysical results in the tests I used above, if it shows monthly ocean temperatures changing by some huge never-before-seen amount in a single month … what does that say about the quality of their “fundamental physics”?”

    it says nothing on its face. Could be lots of issues.

    Please don’t play word games. Their claim is that the “fundamental physics” are correctly installed and represented in such a way as to enable them to do century-long forecasts. Any issues that prevent that happening are part of their claim.

    “Mosh, I was interested in your choice of things to test (temperature trend, precipitation extremes, and sea level rise). I say that because I haven’t seen any observations of an increase in precipitation extremes. Nor is there any observed increase in the rate of sea level rise. In fact, the rate of rise has decreased slightly since about 2006.
    Given that, it’s not clear to me exactly how you would use those to test the models.”

    easy. if the model fails a test of sea level rise, you dont use its predictions to drive policy. Same for droughts and floods or hurricanes.

    Ah. Ok, that could work. Are you aware of specific predictions of temperature trends, weather extremes, or sea level rise by a given model that we could use to test the models?
    And how long will we have to wait to get the answers? Your three proposed phenomena are long-term, slow-changing, and noisy. So how soon would your method be able to give us answers?
    That’s why I’ve proposed comparing model results to data and its derivatives. That at least we have a lot of data on, and can test now rather than in decades.
    All the best,
    w.
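
    For readers who want to try the "data and its derivatives" comparison themselves, here is a minimal sketch. It is only an illustration, not the analysis from the post: both series are made-up numbers, and the helper names `diffs` and `five_number_summary` are hypothetical. The idea is simply to difference each series once (month-to-month change) and twice (change of the change), then compare the spread of the results, which is what a boxplot summarises.

```python
import statistics


def diffs(series):
    """First differences: the change from each value to the next."""
    return [b - a for a, b in zip(series, series[1:])]


def five_number_summary(values):
    """Min, lower quartile, median, upper quartile, max (what a boxplot draws)."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))


# Made-up monthly anomalies in deg C, for illustration only (assumption).
observed = [0.10, 0.15, 0.05, 0.20, 0.25, 0.15, 0.30, 0.20, 0.10, 0.25]
modelled = [0.10, 0.60, -0.30, 0.70, 0.00, 0.80, -0.20, 0.60, 0.10, 0.50]

for name, series in (("observed", observed), ("modelled", modelled)):
    d1 = diffs(series)  # first derivative: monthly change
    d2 = diffs(d1)      # second derivative: change of the monthly change
    print(name, "d1:", five_number_summary(d1))
    print(name, "d2:", five_number_summary(d2))
```

    A model whose monthly changes span a far wider range than anything in the observations would show up immediately in such a summary, with no need to wait decades.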

  85. I would like to discuss another utilization and results of some GCMs (by the way, I’m probably showing my ignorance here, but doesn’t the “C” in GCM actually stand for Circulation?).
    During the recent oil spill in the Gulf of Mexico, a model was displayed that showed the oil being entrained into the loop current, into and out of the Florida Straits, into the Gulf Stream, up the East Coast, and finally dispersing over England. I believe this model was produced by the NCAR folks although it was only identified as coming from Boulder, Colorado. It was very smooth and detailed so somebody with a lot of computing horsepower produced it.
    Unfortunately, I never saw a legend to actually know what the colors meant but I assume that it was meant to convey different concentration of oil on the surface.
    The reason I brought this up is that I believe it is another example of how poorly a GCM model can perform when supplied with bad variables and forcing parameters.
    The cell sizes were minute as were the step sizes and the resulting model was impressive and to many shockingly believable. The only problem was that it was just plain “wrong”. The oil, to my knowledge, never actually made it into the loop current, much less into (in order) the Florida Straits, the Gulf Stream, and the waters off coastal Great Britain.

  86. I have a question for Willis – apologies if this appears simple, but the GCM models you describe; how are they actually ‘run’? I mean, do they start at time X and run forwards to time Y – or do they start at time Y and work back to time X?
    For my money – surely the best model would ‘start’ at some point in the middle of available observational data, and then be able to accurately hindcast the known data – and then, without any adjustment or tuning, could be run forward and accurately predict the ‘later’ observed data. Would this be a good test of a model? Similarly, dropping a ‘shortened time sequence’ of data ‘into’ a model should enable a robust model to work forward or backwards and still come up with reasonable results when compared to the longer observational values?

  87. >> Finally, gnarf, I totally disagree with your claim that “Even an accurate model (if it exists) will fail your tests.” The UKMO model did a reasonable job, and that’s only looking at the nine models used by Santer. When one of the first nine models grabbed at random does all right, I’m not buying the ‘but it’s toooo hard’ excuse.
    I would agree more with your tests if they used as input some climate-relevant data, like a detrended 30-year moving average over a significant period of time, but here detrended temperature is more like weather, and its derivatives even more so.
    To make another comparison, I can make an iterative model which will predict quite reliably how the smoke climbs in a complex system of pipes and what the average flow (climate) is at each corner, but the model will give results with a very narrow distribution, while measurements have a wide distribution…because the model does not try to model short-term irregularities like vortices in the pipes (weather). I can add a random series to simulate the chaotic part of the flow, but what is the point?
    I totally agree with you that model results have to be compared with what they are supposed to model…but I am not sure comparing the distributions of derivatives is the right thing to do. Weather is chaotic and weather equations can’t be integrated, so no weather forecast is possible beyond a few days; climate is less chaotic.
    Maybe comparing some self-correlation of the temperature series, Fourier decomposition or some fractal measure. With Fourier you can maybe spot the too-high-amplitude waves some models have, with self-correlation you may spot the feedbacks, when they come and how long they last….I am not a specialist at all of course.
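
    The two measures suggested here, self-correlation (autocorrelation) and Fourier decomposition, can be sketched with nothing but the standard library. This is a rough illustration under stated assumptions: the function names are hypothetical, the test series is a pure 12-step sine wave rather than real temperature data, and the Fourier part is a naive O(n^2) discrete Fourier transform rather than an FFT.

```python
import cmath
import math


def autocorrelation(series, lag):
    """Sample autocorrelation at the given lag (biased estimator)."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return cov / var


def dft_amplitudes(series):
    """Amplitude of each frequency k = 1 .. below Nyquist, via a naive DFT."""
    n = len(series)
    mean = sum(series) / n
    amps = []
    for k in range(1, (n - 1) // 2 + 1):
        coeff = sum(
            (x - mean) * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(series)
        )
        amps.append(2 * abs(coeff) / n)
    return amps


# Synthetic series: four full cycles of a 12-step sine wave (assumption).
series = [math.sin(2 * math.pi * t / 12) for t in range(48)]

amps = dft_amplitudes(series)
strongest = max(range(len(amps)), key=amps.__getitem__) + 1  # k is 1-based
print("strongest frequency index:", strongest)
print("autocorrelation at lag 6:", autocorrelation(series, 6))
```

    Run on both observations and model output, a model with oversized oscillations would show a peak in `amps` that the observations lack, and the autocorrelation at successive lags gives a crude picture of how long a disturbance persists.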

  88. When I wrote that an accurate model would fail your tests, I meant that it will pass the first one if the measurement frequency is not too high, but when you compare first and second derivatives, results should get worse and worse.

  89. My main interest is that there be some test, some way of separating the wheat from the chaff.
    On the other hand, comparing model output to actual data would certainly not be in the interests of any ‘quality’ IPCC Climate Science Propaganda Operation. Seriously, once an observer starts to see IPCC Climate Science as only a giant Propaganda Op., its “method” makes total sense – provided that we should also keep M. Stanton Evans’ “law of inadequate paranoia” very firmly in mind: ~”no matter how bad things look, it just gets worse.”

  90. “”””” MikeO says:
    December 2, 2010 at 12:14 pm
    Can you Willis or someone else point me to a reference to the radiance of the sun. That is for a solar cell how many watts per square meter do you get at the earth’s surface as a maximum. Activists play with figures saying things like you can get 600 watts or 1400 watts and that it is only inefficiencies of the collectors that is the problem. Currently I think it is 370 watts maximum and cells collect about 100 watts but cannot find an adequate reference. Is it a conspiracy? Could WikiLeaks help? “””””
    Well MikeO, first let’s get the units correct. “Radiance” is the radiant energy equivalent of “Luminance”, which relates only to human eye response to light. The (very) loose colloquial term would be “Brightness”, which should be avoided like the plague in scientific writings.
    But it is the wrong unit to use anyway since the units of Radiance are Watts per steradian per square metre; and it applies only to sources.
    The unit I am sure you were meaning is “Irradiance” and that truly does refer to the energy falling on a target surface in Watts per square metre, and as you can see it has no angular factor. The visual equivalent would be lumens per square metre.
    So for the sun, the most often cited quantity is usually referred to simply as TSI, which is Total Solar Irradiance; and its value, averaged over all the earth orbit locations, is about 1366 W/m^2, based on the best satellite measurements over about three sunspot cycles. That is the value that solar cells would react to in earth orbit.
    On the earth’s surface with the sun directly overhead (zenith), atmospheric absorption reduces that number down to something pretty close to 1000 W/m^2, and that is what earth-bound solar cells would be limited to. That is often referred to as the “air mass one” irradiance, since one atmospheric mass of air stands in the path. If the sun were 60 degrees from the zenith, or 30 degrees above the horizon, the slant range air path is twice as long so we would call that air mass two, and the solar cell output will be even less.
    In addition the ground level sunlight is susceptible to water vapor in the atmosphere, which can absorb sunlight in the long visible to near infrared range from about 0.75 microns to about 4.0 microns. About 45% of the solar energy resides in that range, and water may be capable of absorbing about half of that range or about 20% of total sunlight with high tropical humidities.
    That 1000 W/m^2 number is a good one to hang onto; but remember that is per unit of area perpendicular to the sun beam; so it assumes that solar arrays will be pointed with their normal towards the sun, and hopefully track that in some way.
    So 1400 W/m^2 is not real; but as a matter of course, a lot of radiation engineers do use 1400 as a number for just rough calculations. I don’t know why, because computers can deal with 1366 just as easily as 1400. When I went to school, the value used was 1353 W/m^2 but that was pre-Sputnik days, so based on balloon or rocket borne data.
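
    The air mass arithmetic above is easy to reproduce. The sketch below assumes the simple flat-atmosphere (plane-parallel) relation, air mass = 1 / cos(zenith angle), which is what the "path is twice as long at 60 degrees" statement amounts to; it breaks down near the horizon. The function names are hypothetical, and the two constants are simply the figures quoted in the comment.

```python
import math

TSI = 1366.0          # W/m^2 at the top of the atmosphere (figure from the comment)
SURFACE_AM1 = 1000.0  # W/m^2 at the surface with the sun at zenith (from the comment)


def air_mass(zenith_deg):
    """Relative path length through a flat atmosphere: 1 / cos(zenith)."""
    return 1.0 / math.cos(math.radians(zenith_deg))


def flat_panel_irradiance(beam_w_m2, zenith_deg):
    """Irradiance on a horizontal, non-tracking panel: beam times cos(zenith)."""
    return beam_w_m2 * math.cos(math.radians(zenith_deg))


for zenith in (0, 30, 60):
    print(zenith, "deg from zenith: air mass", round(air_mass(zenith), 3))

# A panel lying flat collects less than one facing the sun, even before
# the extra atmospheric absorption at higher air mass is counted.
print(flat_panel_irradiance(SURFACE_AM1, 60), "W/m^2 on a flat panel at 60 deg")
```

    The second function shows why the "per unit of area perpendicular to the beam" caveat matters: a fixed horizontal panel sees the beam value scaled by the cosine of the zenith angle.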

  91. Huh? What am I missing here? I predict that the world will disappear in 2060. Please tell me how I can test that prediction in less than fifty years. Use all the math you want.
    ##### what I mean is simply this. A gcm makes billions of predictions. we may predict that temps will be 2C higher in 100 years, but we need not wait 100 years to evaluate that prediction. Your prediction of the world ending in 2060, could be disconfirmed by the world disappearing next week. It couldn’t of course be confirmed.
    no math required.
    “Please don’t play word games. Their claim is that the “fundamental physics” are correctly installed and represented in such a way as to enable them to do century-long forecasts. Any issues that prevent that happening are part of their claim.”
    It’s not word games. The fundamental physics could be correct but incomplete. Nobody has said the physics is complete. Just that GCMs are based on fundamental physics. Further they could be wrong but fundamental. What is meant is this. It’s a physics simulation. So, with our best understanding of physics constrained to run on the best available hardware in a constrained amount of time we get answers that are roughly consistent with observations. As a decision maker I would weigh all this. I would acknowledge the flaws, the uncertainty, and weigh the evidence accordingly. I’d give it more weight than a blog post.
    “Ah. Ok, that could work. Are you aware of specific predictions of temperature trends, weather extremes, or sea level rise by a given model that we could use to test the models?”
    Models that get precipitation correct often get temperature wrong. Google Taylor diagrams. You might find sea levels in the outputs.. not sure
    And how long will we have to wait to get the answers? Your three proposed phenomena are long-term, slow-changing, and noisy. So how soon would your method be able to give us answers?
    That’s why I’ve proposed comparing model results to data and its derivatives. That at least we have a lot of data on, and can test now rather than in decades.
    All the best,
    w.

  92. I know a bit about economic forecasting models, which others have referred to. I grant that there are significant differences in terms of the subject matter – but where there are important similarities is in the inherent limitations of forecasting modelling of complex systems.
    Paradoxically, testing against past events is not as useful as it intuitively seems. As others have pointed out, it is not difficult to construct a model that perfectly backcasts, say, interest rates for the last 30 years, but that proves nothing about its predictive value. It is just an artifact of statistics which happens to spit out the right results. While this is an over-simplification of how economic (and climate) models work and are constructed, I trust that the point is clear.
    In fact, the better interest rate predictor model might not work very well in retrospect, because it includes variables that were not measured (or measurable) in the past and therefore cannot be used for backcasting. No doubt this is an issue for climate modelling as well. That is leaving aside the whole issue of data integrity, which is even more of an issue in climate modelling than it is in economics.
    The other point I would like to make is that the reliability of every other kind of forecasting model I have ever come across degenerates rapidly as the time horizon gets longer. As others have pointed out, tiny errors magnify quickly over time. And there are always (at a minimum) tiny errors. I am amazed at the hubris of climate modellers with regard to this issue.
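
    The point about perfect backcasts can be made concrete with a toy example. The sketch below is an illustration only: the eight "interest rates" are invented, and the "model" is nothing more than the unique degree-7 polynomial through those eight points (Lagrange interpolation), chosen because it reproduces the history exactly by construction. The name `lagrange_predict` is hypothetical.

```python
def lagrange_predict(xs, ys, x):
    """Evaluate at x the polynomial that passes through every (xs[i], ys[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


# Invented history: eight years of rates hovering around 5% (assumption).
years = list(range(8))
rates = [5.0, 5.2, 4.9, 5.1, 5.0, 5.3, 4.8, 5.1]

# Backcast: the model reproduces every historical value.
worst_miss = max(abs(lagrange_predict(years, rates, y) - r) for y, r in zip(years, rates))
print("worst backcast error:", worst_miss)

# Forecast: one step beyond the data the same model goes badly astray.
print("forecast for year 8:", lagrange_predict(years, rates, 8))
```

    A flawless fit to the past and a forecast nowhere near the historical range, from the same curve. Fit to history says how flexible the model is, not how well it predicts.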

  93. Bill Illis
    Your model seems to mimic the history very accurately.
    However, I do not see any allowance for UHI which is a major factor in inhabited regions.
    Have I missed something?
    Or is there no appreciable UHI in the tropics?

  94. Murray Grainger says:
    December 2, 2010 at 8:58 am

    TerryS says:
    December 2, 2010 at 5:52 am
    Re: Murray Grainger
    Mike Haseler says:
    December 2, 2010 at 5:54 am
    Murray Grainger says: “Apples and Oranges folks.
    Sorry chaps, I shall make my sarcasm more obvious in future.

    Please don’t. I enjoy reading the responses because they didn’t get it LOL
    DaveE.

  95. Gut reactions are often wrong, but my gut reaction is that the above assessment might be complicated by smoothing concepts. For example, if a sea temperature can change from normal to abnormal and back again within a week, it is hard to capture this event faithfully in monthly or longer-period data.
    Similarly with confidence bars. The daily variation might be smaller than the weekly, smaller than the monthly, smaller than the annual, smaller than the decadal. You have to compare horses with courses. All of which says, if we have hourly data, then use it as the input of the model unless the demands on computer power become too large. If so, go to daily, etc.
    Are we confident that the statistical data distributions at each time scale have been studied to confirm if either traditional or non-customary methods of confidence calculation are applicable? If, as you note, there is a difference between ocean water cooling and heating rates, then one might not validly apply Poisson statistics to rates of change or rates of rates of change.
    Hopefully these factors have been considered and my misapprehension is misplaced. But then I wonder how they all missed that 1998 was a hot year globally.

  96. Steven Mosher replying to Willis’ comment:
    W.E.“But if a model gives unphysical results in the tests I used above, if it shows monthly ocean temperatures changing by some huge never-before-seen amount in a single month … what does that say about the quality of their “fundamental physics”?”
    S.M. “it says nothing on its face. Could be lots of issues.”
    WRONG! It means the results from your model are crap! Only a modeler would somehow think it’s ok.
    No one should be in favor of making life and death decisions using models giving “seemingly” accurate results even though the physical attributes to get the results are unrealistic. It may not be good science but it’s good policy! If a GCM model predicts an accurate temperature trend or sea level rise but calculated it using a negative CO2 concentration… or ANY OTHER physically impossible or never before observed situation, I would never use it. Why would anyone think it’s ok to trust that? Because if you’re wrong in predicting the weather, you just may get fired? If you’re wrong in predicting the climate? You just may get another grant.

  97. Willis,
    your plots are anomalies comparisons. I agree that the climate modelers have chosen that as the playing field, but considering that energy goes according to T^4, not anomalies^4 the discrepancies are even worse in measure of what is truly going on with the system.
    Have a look at the disagreement with data when temperatures are plotted, not for SST, by Lucia.
    Also the numerical approximations used in the models on the solutions of coupled non linear equations, introduce discrepancies with reality due to the higher order terms that are excluded, once the time step gets large enough. One would expect a butterfly plot of disagreement in time, backwards and forwards, with center the time when the averages and the parameters were taken from the data to initialize the models.

  98. steven mosher :
    December 2, 2010 at 10:11 pm
    I went to the link you provide. They are on the right track, but it is not only mathematicians that they need; they need solid theoretical physicists who can evaluate whether the mathematics is physically logical or not.

  99. anna, thank you for your post, which explains why models of complex dynamic systems necessarily have a short shelf life, and why even tiny errors distort both predictions and backcasts (see my post about economic modelling).

  100. anna v says:
    December 2, 2010 at 10:27 pm
    Willis,
    your plots are anomalies comparisons. I agree that the climate modelers have chosen that as the playing field, but considering that energy goes according to T^4, not anomalies^4 the discrepancies are even worse in measure of what is truly going on with the system.

    anna v, good to hear from you as always. You are, of course, correct. However, for the small temperature displacements we are looking at here (±0.5K at a temperature of about 300K), T^4 is very nearly linear in T. So the boxplots and other plots would look nearly the same.
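
    The size of the nonlinearity over such a small range can be checked with a few lines of arithmetic. This is just the first-order expansion (T0 + dT)^4 ≈ T0^4 (1 + 4 dT/T0) compared with the exact value; the 300 K baseline and ±0.5 K range are the figures from the reply, and the function names are hypothetical.

```python
T0 = 300.0  # K, baseline temperature from the reply


def exact_ratio(dT):
    """(T0 + dT)^4 / T0^4, the exact fractional change in T^4."""
    return ((T0 + dT) / T0) ** 4


def linear_ratio(dT):
    """First-order (linear) approximation: 1 + 4 * dT / T0."""
    return 1.0 + 4.0 * dT / T0


for dT in (-0.5, 0.5):
    exact, linear = exact_ratio(dT), linear_ratio(dT)
    # Error of the linear approximation as a fraction of the change itself.
    relative_error = abs(exact - linear) / abs(exact - 1.0)
    print(dT, exact, linear, relative_error)
```

    The linear approximation is off by a fraction of a percent of the change itself, which is why plots of T and of T^4 over this range would be hard to tell apart.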

    Have a look at the disagreement with data when temperatures are plotted, not for SST, by Lucia.

    Hadn’t seen that, and I hadn’t realized it was that bad. That’s pathetic.

    Also the numerical approximations used in the models on the solutions of coupled non linear equations, introduce discrepancies with reality due to the higher order terms that are excluded, once the time step gets large enough. One would expect a butterfly plot of disagreement in time, backwards and forwards, with center the time when the averages and the parameters were taken from the data to initialize the models.

    This is an issue with iterative models in general, not just climate models. I would think that the way to tackle it would be to do a multi-period analysis of climate observations. By that I mean that I would look at the boxplots of one year trends of say temperature, and two year trends, and three year trends, and so on for all possible time periods contained in the record.
    Then I’d plot them out. At that point, we have a plot of the universe of all known trends at all timescales.
    Then I’d do the same for the models and see how that turned out. That would give us a basis for comparison … if I can find time I’ll give it a try.
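
    A rough sketch of that multi-period analysis, under stated assumptions: the trend of each window is taken as an ordinary least-squares slope against the time index, every contiguous window of every length is used, and the names `ols_slope` and `all_trends` are hypothetical. The input series here is a made-up placeholder; the same function would be run once on observations and once on each model.

```python
def ols_slope(ys):
    """Least-squares slope of ys against 0, 1, 2, ... (units per time step)."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


def all_trends(series, min_len=2):
    """Map each window length to the slopes of every window of that length."""
    return {
        length: [
            ols_slope(series[start:start + length])
            for start in range(len(series) - length + 1)
        ]
        for length in range(min_len, len(series) + 1)
    }


# Made-up annual anomalies, for illustration only (assumption).
series = [0.10, 0.25, 0.05, 0.30, 0.20, 0.35, 0.15, 0.40]

for length, slopes in all_trends(series).items():
    print(length, "year trends: min", round(min(slopes), 3), "max", round(max(slopes), 3))
```

    The lists under each window length are what would be fed to the boxplots: one box per timescale for the observations, and a matching set for each model.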

  101. Are climate modelers aware of chaos theory and sensitive dependence on initial conditions?
    Now I have to go dig through my library for the book that describes an early discovery of chaotic systems and sensitivity to initial conditions, and if I recall correctly, the discoverer was running weather simulations (!).

    A very interesting post. However, I’m concerned that this is all based on hindcasting. It seems UKMO comes out as the best. But it might simply be because they put more effort into ‘adjusting’ their model to better match various aspects of historical data. To put it bluntly, I could write a computer program that performs far better than any of these models on historical data. In the extreme I would simply load the historical data into the program, thus ensuring a perfect match!
    The only real test is how well the models predict future climate, though unfortunately we have to wait many years before the results are in. But it does seem that models, since around 1980, all predicted warming that has not occurred.
    One question. Do climate modellers do repeated runs with slightly different initial conditions? If the predictions are significantly different with slightly different initial conditions, then they would clearly be worthless.
    Chris
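
    What "slightly different initial conditions" can do in a chaotic system is easy to demonstrate with a toy. The sketch below uses the logistic map x -> 4x(1 - x), a standard textbook example of chaos, as a stand-in; it is not a climate model and says nothing about how any particular GCM behaves. Two runs start one part in ten billion apart.

```python
def logistic_trajectory(x0, steps, r=4.0):
    """Iterate the logistic map x -> r * x * (1 - x) from x0."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


run_a = logistic_trajectory(0.2, 60)
run_b = logistic_trajectory(0.2 + 1e-10, 60)  # perturbed start (assumption)

# Gap between the two runs at each step.
gaps = [abs(a - b) for a, b in zip(run_a, run_b)]

print("gap after 5 steps: ", gaps[5])
print("largest gap overall:", max(gaps))
```

    The runs track each other closely at first and then bear no resemblance to one another. Whether the long-run statistics of such runs still agree is a separate question from whether the individual trajectories do, and it is the one that matters for the question asked here.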

  103. The people who write the GCM software wouldn’t stand a week in an industrial software project. They found themselves a nice cosy place where they can indulge in a mess of their own making years on end, patting each other on the back in peer reviews about how nice it all works out, and as long as the public is scared enough, maybe some more billions will be thrown their way. As scientists, they would at best fill mediocre roles; as coders, they are entirely worthless. Their product is never validated, never reviewed, they have no deadlines nor requirements to meet, as long as they can churn out one of their worthless papers about an “experiment” and its results, where experiment means model run, and it’s all tautological – the model will just show what’s been built into it plus the effects of a few tiny bugs. The bugs will make sure “scientists” will be “baffled” and “surprised” – oops, France melts down into a puddle of red hot lava in 2053, it’s worse than we thought! And they didn’t even have methane clathrate meltdown! My, my! Quick, write a paper, get a Nobel.
    Has anyone ever seen an errata sheet of an older GCM version? Or a list of bugs fixed with a newer release?

  104. steven mosher says:
    December 2, 2010 at 10:11 pm
    here Willis.
    http://sms.cam.ac.uk/media/871991;jsessionid=6D22F432FAB6DF564481D7B3332FB58D?format=flv&quality=high&fetch_type=stream
    Mosh, based on the text alone, this appears to be a pretty broad-based rejection of current climate models. I liked this bit:
    “Firstly, climate model biases are still substantial, and may well be systemically related to the use of deterministic bulk-formula closure – this is an area where a much better basic understanding is needed. Secondly, deterministically formulated climate models are incapable of predicting the uncertainty in their predictions; and yet this is a crucially important prognostic variable for societal applications. Stochastic-dynamic closures can in principle provide this. Finally, the need to maintain worldwide a pool of quasi-independent deterministic models purely in order to have an ad hoc multi-model estimate of uncertainty, does not make efficient use of the limited human and computer resources available worldwide for climate model development.”
    It seems pretty clear, the author thinks all of the current models should be scrapped so people can focus on building something that works. I think it is a fool’s errand. The goal Cambridge is pursuing is higher resolution models so the models are less deterministic. A good goal if it were realistic. According to Hank Tennekes, the models will never match nature at higher resolutions. Nature is too chaotic. Orrin Pilkey agrees.

  105. Willis,
    I appreciate your effort in writing this blog post. And I think you may be on to something. Your approach seems reasonable to me. I would love to see it repeated at a higher latitude band.
    I also think Mosh had a good idea of looking at precipitation and sea ice.
    And where is Judith Curry? I would be interested in seeing her comments on this post.

  106. Tim says:
    December 2, 2010 at 5:01 am
    Would a simple way to describe GCM’s be: a technique that is easily manipulated to produce results that would achieve a desired
    —————–
    Tim needs to think about this more clearly.
    If the results of GCMs are being manipulated then there would be no point in bench marking them for accuracy or skill would there?

  107. This looks like a good article to me Willis. The models need good tests for their skill. The more the better.

  108. LazyTeenager says:
    December 3, 2010 at 5:40 am
    Tim says:
    December 2, 2010 at 5:01 am
    Would a simple way to describe GCM’s be: a technique that is easily manipulated to produce results that would achieve a desired
    —————–
    Tim needs to think about this more clearly.
    If the results of GCMs are being manipulated then there would be no point in bench marking them for accuracy or skill would there?
    ———-
    Actually, Lazy Teenager, if the models are being manipulated (or constrained), testing their skill or validity is important because it will show they are not reliable. Science is supposed to be self-correcting. If a claim is not tested or replicated, it is not science. Benchmarking does not mean you expect future models to be any better. You may just be disproving the current model.

  109. LazyTeenager says:
    December 3, 2010 at 5:40 am
    “If the results of GCMs are being manipulated then there would be no point in benchmarking them for accuracy or skill, would there?”
    The manipulation happens on the input end through parametrization. Other than that, you are very right – there is absolutely no need to benchmark them. Deleting them all would be the only sane action.
    And firing all the parasites in the climate ivory towers.

  110. anna v says: December 2, 2010 at 10:27 pm:

    … the climate modelers have chosen that as the playing field, but considering that energy goes according to T^4, not anomalies^4, the discrepancies are even worse in measure of what is truly going on with the system.

    I’ve always felt that a more reasonable one-number metric for “Global Average Temp” would be the 4th root of the area-averaged T^4.
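
    The metric suggested above can be sketched in a few lines. This is only an illustration under stated assumptions: the latitude bands and temperatures are made-up values, the function names are hypothetical, and cosine-of-latitude weighting is an approximation to the true band areas.

```python
import math


def area_weights(latitudes_deg):
    """Approximate area weights for equal-width latitude bands (cosine of band centre)."""
    raw = [math.cos(math.radians(lat)) for lat in latitudes_deg]
    total = sum(raw)
    return [w / total for w in raw]


def mean_temperature(temps_k, weights):
    """Plain area-weighted mean temperature, in kelvin."""
    return sum(w * t for w, t in zip(weights, temps_k))


def radiative_mean_temperature(temps_k, weights):
    """Fourth root of the area-weighted mean of T^4, in kelvin."""
    return sum(w * t ** 4 for w, t in zip(weights, temps_k)) ** 0.25


# Hypothetical band-centre latitudes and absolute temperatures (illustrative only).
lats = [-75, -45, -15, 15, 45, 75]
temps = [240.0, 280.0, 299.0, 300.0, 281.0, 250.0]

weights = area_weights(lats)
t_avg = mean_temperature(temps, weights)
t_rad = radiative_mean_temperature(temps, weights)
print(f"area-weighted mean: {t_avg:.2f} K, 4th-root-of-T^4 mean: {t_rad:.2f} K")
```

    Because T^4 is convex, the fourth-root figure can never fall below the plain area-weighted mean, and the gap widens as the temperature field becomes more uneven.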

  111. I apologize ahead of time for my ignorance, but the way I see it, if the model cannot “hindcast” (post-dict) the very data it was built with, something is wrong:
    “My model predicts that the Colts will win the 2010 Super Bowl,” is laughable, because they didn’t.
    Secondly, just because a model fits the data it was built with is no guarantee that the model is any good. The WHOLE POINT of the GCM’s is to predict the future; that is their ONLY utility and the ONLY real test of their “validity”. So if the model CANNOT predict the future with any skill, then it is a DUD model.
    Oh ho, you say. The models predict the future 100 years hence and so cannot be validated according to the strict rules above, at least not for 100 years. Catch-22 and all that.
    Sorry, but that sneaky little clause in the science contract means that the GCM CANNOT BE VALIDATED. It’s stupid to talk of validation when the thing is impossible to validate according to the Catch-22 limitations conveniently proffered by the model builders.
    But logic never had a role in GCM building. It’s all a plot by nefarious irrationalists to drain the Treasuries of the world. And I can validate and verify that statement, in case anyone needs the hard, cold proof.

  112. Ron Cram
    I think Willis and others would be interested in the convective cloud examples.
    There is a much more powerful and interesting skeptical position WITHIN the science of AGW than outside it. The cranks who deny the radiative effects of CO2 or sun spot chasers or “it’s natural variation” shoulder shruggers, are missing the best skeptical argument.

  113. What I find most interesting is all the Brits who read this blog and then rail against the Met Office. Think about this point – the UKMO model is the only one they are confident enough in to put out quarterly predictions with, and it is legendarily wrong. This shows that it is one of the best at showing lifelike behavior.
    Think what would happen if they made these predictions with one of the worse models?
    ______
    Tim says:
    December 2, 2010 at 5:01 am
    Would a simple way to describe GCM’s be: a technique that is easily manipulated to produce results that would achieve a desired
    ______
    You just described mathematical modelling as a whole. As a Chemical Engineer, I had an entire class that boiled down to “your model is wrong and will always be wrong, here’s how you make it useful”. We call it control systems. It involves constantly taking real information and putting it into your model to correct it. Without the feedback, your model wouldn’t be able to control a coffee maker.
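
    The feedback idea described above can be illustrated with a toy example. This is a sketch under loud assumptions, not a real control-system design: the “true” process, the deliberately wrong model, the perfect sensor, and the gain of 0.5 are all invented for illustration.

```python
def true_step(x):
    """The real process (normally unknown to the modeller); settles at 10.0."""
    return 0.9 * x + 1.0


def model_step(x):
    """A deliberately wrong model of the same process; settles at 12.0."""
    return 0.9 * x + 1.2


def simulate(steps=200, gain=0.5):
    """Run the model with and without measurement feedback.

    Returns (truth, open_loop_estimate, corrected_estimate) after `steps` steps.
    """
    truth = 0.0
    open_loop = 0.0   # model left to run on its own
    corrected = 0.0   # model nudged toward each new measurement
    for _ in range(steps):
        truth = true_step(truth)
        open_loop = model_step(open_loop)
        predicted = model_step(corrected)
        measurement = truth  # assumption: a perfect, noise-free sensor
        corrected = predicted + gain * (measurement - predicted)
    return truth, open_loop, corrected


truth, open_loop, corrected = simulate()
print(f"truth={truth:.3f} open-loop={open_loop:.3f} with feedback={corrected:.3f}")
```

    The model is equally wrong in both runs; only the run that keeps folding measurements back in stays close to the real value.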

  114. Mike D. December 3, 2010 at 11:47 am
    “It’s all a plot by nefarious irrationalists to drain the Treasuries of the world.”
    And sometimes serious scientists just come to believe in their models. Anthony Watts has noted that four years ago, scientists at NASA’s Marshall Space Flight Center announced their computer model had a 94% correlation coefficient with hindcasting. They confidently predicted that Solar Cycle 24 would have the greatest sunspot maximum in 400 years! Now the same scientists are talking about a possible Dalton Minimum, the lowest sunspot maximum in 2 centuries!
    The irony is that by predicting both extremes and everything in-between, the averaged predictions could turn out to be true. But I doubt they will. And the lesson is: we don’t yet have a model that reliably predicts good model performance.

  115. Mike D.
    You don’t understand how model validation works.
    Let’s take a simple model, F=MA. Yes, all physical laws are MODELS.
    This model predicts that if we know the mass and the acceleration we can predict the force. Of course, we have to specify some things, like how accurately we can make the prediction, and we have to know how accurately we can measure the variables. Of course, in the lab F is never exactly equal to M*A. We call this residual “error”; that is, for all practical purposes we accept F=MA as giving good answers.
    (There are cases where it may not give the best answer.)
    So when we set the validation criteria for a model we do not set perfection as a goal.
    Let’s take dropping bombs. I have a model for how a bomb drops. A simple gravity model. Now I could include all sorts of effects: Coriolis, a complex drag model, an atmospheric winds model. In the end, I’ll choose a simple model because:
    A. my answer doesn’t have to be good to the last millimeter. It’s a bomb.
    B. I have to drop the sucker in a split second, so I want a fast model.
    So, for sea level increase I may decide:
    A. it has to get the sign correct per region
    B. a positive bias is preferred over a negative bias.
    C. it has to be good to ± some number of mm per year.
    D. it has to be globally correct
    Valid doesn’t mean “true” or perfect. It means good enough to do the job for which it was intended.
    Here is another way to think about it.
    For example, we may decide that for planning we want to know if a house will be safe in a 100 year flood. Well, we have empirical stats on 100 year floods. Those stats are always wrong. A 100 year model will predict that a flood is, say, 25% probable in the next 30 years. In reality the 100 year flood will happen or not, so the prediction that it is 25% probable is wrong one way or the other. But we use them.
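
    The rough 25% figure above can be checked with a short calculation. This sketch assumes each year is independent and carries the same 1-in-100 chance of a flood, which is the usual textbook reading of “100 year flood” and not necessarily what any particular flood model does; the function name is hypothetical.

```python
def prob_at_least_one(return_period_years, horizon_years):
    """Chance of at least one event within the horizon, assuming independent years."""
    annual_p = 1.0 / return_period_years
    return 1.0 - (1.0 - annual_p) ** horizon_years


p_30 = prob_at_least_one(100, 30)
print(f"Chance of a 100 year flood within 30 years: {p_30:.1%}")
```

    Under those assumptions the answer comes out at about 26%, close to the 25% quoted.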
    Taking 1961 to 1990 as a base period, do you think the future will be warmer or colder?
    Why? I think it will be warmer. Our best science, limited as it is, says that warming of 2C over the next hundred years is more likely than not. So, I would plan for it being warmer. If you have a model that says it’s likely to be colder (a math model, not words) then let’s put it to the test. What would that model predict from 1850 to today?

  116. Very interesting/devastating post. However, I’d like to propose another criterion for climate model utility. We need climate models that can accurately predict catastrophic climate change due to changing forcings (be they anthropogenic or natural). In theory, we have an excellent idea of how forcings were different during the early to mid Holocene: The biggest known change is that the earth was closest to the sun during summer in the Northern Hemisphere (unlike the present). The ice caps and therefore sea level were about the same. Ice cores give us a good idea of GHG’s and aerosols. We know that summer warmth allowed forests to reach the shores of the Arctic Ocean. Since that time, the Earth has experienced the kind of catastrophic climate change that truly useful climate models should be able to predict: the development of the Sahara, the largest desert on the planet. Since no one has reported that their model is capable of showing monsoon rains penetrating north to the Mediterranean during this period, climate models flunk this relatively unchallenging test.
    I recommend some of Stainforth’s papers on how much parameters in climate models can be varied and still produce models that work about as well as the IPCC’s “ensemble of opportunity” (better described as an “ensemble shaped by convergent evolution under the pressure from natural/political selection”). http://www.cabrillo.edu/~ncrane/nature%20climate.pdf Even Stainforth hasn’t varied the parameters that control thermal diffusion in the oceans, parameters that Lindzen claims are grossly inconsistent with laboratory measurements. Until the Argo network, we had really poor information about heat transfer in the oceans and no way to judge whether our models are correct. The IPCC’s climate models aren’t going to be found to be consistent with the last five years of data from the Argo network.

  117. Valid doesn’t mean true? It does in my dictionary.
    I guess in post-modern science “valid” means it looks like a duck. I am not that familiar with post-modern science, so I cannot dispute that contention. But if your model says the seas are going to boil away into outer space shortly after they become as sour as battery acid, then I would say your model doesn’t even look like a duck.
    Seriously, if the GCMs are wrong in EVERY prediction, as they are, then they lack validity, and nobody but a duck should believe them. The “best” science in this arena may be a pig in a poke, and wrong as wrong could be, and I think it is. I don’t think the globe is going to warm 2 deg C in the next 100 years. My model says the opposite: that the globe is going to COOL 2 deg C. You may not like my model, but it looks like a duck from where I sit, and if that’s the only criterion for validity, then my model is eminently valid according to the new definition of that word.

  118. And by the way, the model that says heavy objects like bombs will fall earthward due to gravity has ENORMOUS validity, based on all of human experience, and is not comparable in any way to GCM’s, which have never predicted the future with the slightest amount of accuracy or precision.

  119. Chris Wright says:
    December 3, 2010 at 3:22 am
    A very interesting post. However, I’m concerned that this is all based on hindcasting. It seems UKMO comes out as the best. But it might simply be because they put more effort into ‘adjusting’ their model to better match various aspects of historical data. To put it bluntly, I could write a computer program that performs far better than any of these models on historical data. In the extreme I would simply load the historical data into the program, thus ensuring a perfect match!

    Thanks, Chris. One advantage of the method I have used is that it doesn’t matter how well the models fit the general shape of the historical data.
    I’m just looking at how the models do month by month, rather than looking at a time series over many years.
    As a result, my method offers the possibility of testing a model which has been tuned to a historical dataset, despite the tuning.
    w.

  120. Willis,
    Are you sure you’re not bouncing into some sort of reverse Nyquist-type problem where the model’s useful working resolution is meaningful only (e.g.) annually and you’re looking at monthly detail?
    Mosher,
    Are you referring to the JC website’s GHG thread?

  121. steven mosher says:
    December 3, 2010 at 2:11 pm
    … So when we set the validation criteria for a model we do not set perfection as a goal.
    Let’s take dropping bombs. I have a model for how a bomb drops. A simple gravity model. Now I could include all sorts of effects: Coriolis, a complex drag model, an atmospheric winds model. In the end, I’ll choose a simple model because:
    A. my answer doesn’t have to be good to the last millimeter. It’s a bomb.
    B. I have to drop the sucker in a split second, so I want a fast model.
    So, for sea level increase I may decide:
    A. it has to get the sign correct per region
    B. a positive bias is preferred over a negative bias.
    C. it has to be good to ± some number of mm per year.
    D. it has to be globally correct
    Valid doesn’t mean “true” or perfect. It means good enough to do the job for which it was intended.

    Mosh, I don’t know why you keep returning to the concepts of “perfect” and “perfection”. Recall that I didn’t ask for the models to be either of those. Instead, and for the reasons you list, I ask that the models be “lifelike”. It’s the same thing stumpy was talking about above as “visual fit”. Doesn’t have anything to do with perfection. Just has to be lifelike.

  122. Ben of Houston said:
    You just described mathematical modelling as a whole. As a Chemical Engineer, I had an entire class that boiled down to “your model is wrong and will always be wrong, here’s how you make it useful”. We call it control systems. It involves constantly taking real information and putting it into your model to correct it. Without the feedback, your model wouldn’t be able to control a coffee maker.
    ——————————————————————————-
    Thank you Ben. I wish that more modellers took your class.
    ModelMania is out of control, not just in climatology but in many other fields as well. In the areas I work in (economics, health policy and a few others) the notion that models are anything more than fallible tools is firmly entrenched – especially since anyone with an axe to grind can find a model or modeller that can give them the results they want.
    Good models exist, but their limitations need to be understood. Bad models are a dime a dozen.

  123. G.L. Alston says:
    December 3, 2010 at 9:20 pm
    Willis,
    Are you sure you’re not bouncing into some sort of reverse Nyquist-type problem where the model’s useful working resolution is meaningful only (e.g.) annually and you’re looking at monthly detail?

    Possible I suppose, altho’ I’m not sure what a “reverse nyquist problem” might be … sample window too narrow, I suppose. But in any case, the models are iterative. That means that their days are assembled hour after hour, their months are created day after day, and years the same. This means that it is very likely that it’s kinda all or nothing, that if the hour to hour calculations don’t work, nothing works.

  124. John Murphy says:
    December 4, 2010 at 3:26 pm
    Willis
    What trends do the models hindcast over that interval of time?

    See the very first figure in my head post. Mild warming.
    w.

  125. The only value that ANY model has; if it has any at all, is in its ability to explain with some level of accuracy, the outcome of ANY related experiment; whether already performed, or yet to be performed.
    The latter of course amounts to future prediction; since to this point in time the contemplated experiment has not taken place.
    And the rest of course is already recorded history.
    Experiments yet to happen could be interpolations between already known results, or extrapolations beyond some known point. If the expected results fail to emerge when the new experiment is performed, then that model is of no value; since any known interval of a reasonably well behaved function can be approximated to almost any desired level of accuracy with simple functions (curve fitting); none of which establishes ANY cause and effect relationship.
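
    The curve-fitting point above can be demonstrated with a small sketch. The choice of sin(x) as the “well behaved function”, the six sample points on [0, π], and the helper name are all assumptions made purely for illustration; a polynomial through the known points tracks the function closely inside the sampled interval and goes badly wrong beyond it.

```python
import math


def lagrange_interpolate(xs, ys, x):
    """Evaluate the unique polynomial through the points (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


# Six "known results": sin(x) sampled at equal spacing on [0, pi].
n_nodes = 6
xs = [math.pi * k / (n_nodes - 1) for k in range(n_nodes)]
ys = [math.sin(x) for x in xs]

inside_x = math.pi / 10   # between the first two known points
outside_x = 2 * math.pi   # well beyond the known interval

inside_error = abs(lagrange_interpolate(xs, ys, inside_x) - math.sin(inside_x))
outside_error = abs(lagrange_interpolate(xs, ys, outside_x) - math.sin(outside_x))
print(f"error inside the fitted interval:  {inside_error:.5f}")
print(f"error outside the fitted interval: {outside_error:.2f}")
```

    The fitted curve carries no knowledge of why the data behave as they do, so a good match over the known interval says little about what happens outside it.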
    And you rigorous mathematicians know that I am being a little bit colloquial with my terminology; in order to not scare off the lay folks that aren’t rigorous mathematicians. Many real world systems are well behaved in the sense referred to here.
    Climate (or weather) is NOT among those.
    Models are especially worthless, when they not only fail to predict; excuse me, that’s project, the outcome of as yet unperformed experiments; but they don’t even postdict the already known outcomes of experiments already performed.
    Climate and weather DO fit into that category.
    The sun/earth weather/climate system is not a system in equilibrium; nor is it in a steady state; nor do the principal elements of the system respond to the average values of the variables; the system only responds to the actual instantaneous values of all the variables; so any model that fails to run in real time (analogue) mode is of little value.
    For the earth climate system; I believe it is true to say, that the entire data base of historically measured and recorded observational data about that system, is the BEST representation of the earth climate system that we have.
    And finally; that recorded data base, is itself corrupted by improper sampling regimens; and plain simple recording of the wrong things.
    For example the 150 year old, or older data of somewhat arbitrary measurements of randomly chosen ocean water samples, being used as an identity to lower tropospheric atmospheric Temperatures; in light of subsequent investigations that show the two are NOT identical; and they are not even correlated to each other.
    So if you have a model that doesn’t even approximately represent the actual physical behavior of any real planet; and you input data that is itself total BS, then it is not surprising, that your projections (got it right this time) about the future, have so far not panned out very well.
