Testing … testing … is this model powered up?

Guest Post by Willis Eschenbach

Over at Judith Curry’s excellent blog she has a post on how to test the climate models. In response I wrote a bit about some model testing I did four years ago, and I thought I should expand it into a full post for WUWT. We are being asked to bet billions of dollars on computer model forecasts of future climate catastrophe. These global climate models, known as GCMs, forecast that the globe will warm extensively over the next century. In this context, it is prudent to take a look at how well the models have done at “hindcasting” historical temperatures, when presented with actual data from historical records.

I analysed the hindcasts of the models that were used in Amplification of Surface Temperature Trends and Variability in the Tropical Atmosphere, (PDF, 3.1Mb) by B. D. Santer et al. (including Gavin Schmidt), Science, 2005 [hereinafter Santer05].

In that study, results were presented for the first time showing two sets of observational data plus 9 separate GCM temperature “hindcasts” for the temperatures at the surface, troposphere, and stratosphere of the tropical region (20°N to 20°S) from 1979 to 2000. These models were given the actual 1979-2000 data for a variety of forcings (e.g., volcanic eruptions, ozone levels, see below for a complete list). When fed with all of these forcings for 1979-2000, the GCMs produced their calculated temperature hindcasts. I have used the same observational data and the same model results used by Santer. Here’s what their results look like:

Results from Santer05 Analysis. Red and orange (overlapping) are observational data (NOAA and HadCRUT2v). Data digitized from Santer05. See below for data availability.

The first question that people generally ask about GCM results like this is “what temperature trend did the models predict?”. This, however, is the wrong initial question.

The proper question is “are the model results life-like?” By lifelike, I mean do the models generally act like the real world that they are supposedly modeling? Are their results similar to the observations? Do they move and turn in natural patterns? In other words, does it walk like a duck and quack like a duck?

To answer this question, we can look at how the models stand, how they move, and how they turn. By how the models stand, I mean the actual month-by-month temperatures that the GCMs hindcast. How the models move, on the other hand, means the monthly changes in those same hindcast temperatures. This is the month-to-month movement of the temperature.

And how the models turn means the monthly variation in how much the temperatures are changing, in other words how fast they can turn from warming to cooling, or cooling to warming.

In mathematical terms, these are the hindcast surface temperature (ST), the monthly change in temperature [written as ∆ST/month, where “∆” is the Greek letter delta, meaning “change in”], and the monthly change in ∆ST [∆(∆ST)/month]. These are all calculated from the detrended temperatures, in order to remove the variations caused by the trend. In the same manner as presented in the Santer paper, these are all reduced anomalies (anomalies less average monthly anomalies) which have been low-pass filtered to smooth out slight monthly variations.
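For readers who want to follow along, here is a minimal sketch (in Python, assuming monthly data) of that pre-processing: detrend, remove the average monthly anomaly, low-pass filter, then take the first and second monthly differences. The exact low-pass filter used in Santer05 is not specified here, so the simple running mean below is only a stand-in.

```python
import numpy as np
import pandas as pd

def prepare_series(monthly_temps, dates, window=3):
    """Sketch of the pre-processing described above (filter choice is an assumption)."""
    ts = pd.Series(monthly_temps, index=pd.DatetimeIndex(dates))

    # 1. Detrend: remove the linear least-squares trend so only the
    #    variation about the trend remains.
    t = np.arange(len(ts))
    slope, intercept = np.polyfit(t, ts.values, 1)
    detrended = ts - (slope * t + intercept)

    # 2. Reduced anomalies: subtract each calendar month's mean anomaly
    #    to take out the seasonal cycle.
    reduced = detrended.groupby(detrended.index.month).transform(lambda x: x - x.mean())

    # 3. Low-pass filter: a simple centred running mean stands in for the
    #    (unspecified) filter used in the paper.
    st = reduced.rolling(window, center=True).mean()

    # 4. First and second monthly differences: how the temperature "moves"
    #    (delta-ST per month) and "turns" (delta(delta-ST) per month).
    d_st = st.diff()
    dd_st = d_st.diff()

    return st, d_st, dd_st
```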

How The Models Stand

How the models stand means the actual temperatures they hindcast. The best way to see this is a “boxplot”. The interquartile “box” of the boxplot represents the central half of the data (first to third quartiles). In other words, half the time the surface temperature is somewhere in the range delineated by the “box”. The “whiskers” at the top and bottom show the range of the rest of the data out to a maximum of 1.0 times the box height. “Outliers”, data points which are outside the range of the whiskers, are shown as circles above or below the whiskers. Here are the observational data (orange and red for NOAA and HadCRUT2v surface temperatures), and the model results, for the hindcast temperatures. A list of the models and the abbreviations used is appended.

Figure 1. Santer Surface Temperature Observational Data and Model Hindcasts. Colored boxes show the range from the first (lower) quartile to the third (upper) quartile. NOAA and HadCRUT (red and orange) are observational data, the rest are model hindcasts. Notches show 95% confidence interval for the median. “Whiskers” (dotted lines going up and down from colored boxes) show the range of data out to the size of the Inter Quartile Range (IQR, shown by box height). Circles show “outliers”, points which are further from the quartile than the size of the IQR (length of the whiskers). Gray rectangles at top and bottom of colored boxes show 95% confidence intervals for quartiles. Hatched horizontal strips show 95% confidence intervals for quartiles and median of HadCRUT observational data. See References for list of models and data used.

Fig. 1 shows what is called a “notched” boxplot. The heavy dark horizontal lines show the median of each dataset. The notches on each side of each median show a 95% confidence interval for that median. If the notches of two datasets do not overlap vertically, we can say with 95% confidence that the two medians are significantly different. The same is true of the gray rectangles at the top and bottom of each colored box. These are 95% confidence intervals on the quartiles. If these do not overlap, once again we have 95% confidence that the quartiles are significantly different. The three confidence ranges of the HadCRUT data are shown as hatched bands behind the boxplots, so we can compare models to the 95% confidence level of the data.
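As a quick illustration of how such a plot can be generated, here is a minimal sketch using matplotlib’s notched boxplots. The notch width follows a common rule of thumb (median ± 1.58 × IQR / √n); the post does not say exactly how the gray quartile confidence rectangles were computed, so treat that part as a separate exercise (a bootstrap would be one option).

```python
import numpy as np
import matplotlib.pyplot as plt

def notch_interval(x):
    """Approximate 95% confidence interval for the median, using the common
    notched-boxplot rule of thumb: median +/- 1.58 * IQR / sqrt(n)."""
    x = np.asarray(x)
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    half = 1.58 * (q3 - q1) / np.sqrt(len(x))
    return med - half, med + half

def plot_boxes(series):
    """Draw notched boxplots for a dict of name -> detrended anomaly array
    (whiskers at 1.0 * IQR, matching the convention used in this post)."""
    fig, ax = plt.subplots()
    ax.boxplot(list(series.values()), notch=True, whis=1.0)
    ax.set_xticklabels(list(series.keys()), rotation=45)
    ax.set_ylabel("Temperature anomaly (°C)")
    fig.tight_layout()
    plt.show()
```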

Now before we get to the numbers and confidence levels, which of these model hindcasts look “lifelike” and which don’t? It’s like one of those tests we used to hate to take in high school, “which of the boxplots on the right belong to the group on the left?”

I’d say the UKMO model is really the only “lifelike” one. The real world observational data (NOAA and HadCRUT) has a peculiar and distinctive shape. The colored boxes showing the interquartile range of the data are short. There are numerous widely spread outliers at the top, and a few outliers bunched up close to the bottom. This shows that the tropical ocean often gets anomalously hot, but it rarely gets anomalously cold. UKMO reproduces all of these aspects of the observations pretty well. M_medres is a distant second, and none of the others are even close. CCSM3, GISS-EH, and PCM often plunge low, way lower than anything in the observations. CM2.1 is all over the place, with no outliers. CM2.0 is only slightly better, with an oversize range and no cold outliers. GISS-ER has a high median, and only a couple outliers on the cold side.

Let me digress for a moment here and talk about one of the underlying assumptions of the climate modellers. In a widely-quoted paper explaining why climate models work, Thorpe (2005) states (emphasis mine):

On both empirical and theoretical grounds it is thought that skilful weather forecasts are possible perhaps up to about 14 days ahead. At first sight the prospect for climate prediction, which aims to predict the average weather over timescales of hundreds of years into the future, if not more does not look good!

However the key is that climate predictions only require the average and statistics of the weather states to be described correctly and not their particular sequencing. It turns out that the way the average weather can be constrained on regional-to-global scales is to use a climate model that has at its core a weather forecast model. This is because climate is constrained by factors such as the incoming solar radiation, the atmospheric composition and the reflective and other properties of the atmosphere and the underlying surface. Some of these factors are external whilst others are determined by the climate itself and also by human activities. But the overall radiative budget is a powerful constraint on the climate possibilities. So whilst a climate forecast model could fail to describe the detailed sequence of the weather in any place, its climate is constrained by these factors if they are accurately represented in the model.

Well, that all sounds good, and if it worked, it would be good. But the huge differences between the model hindcasts and actual observations clearly demonstrate that in all except perhaps one of these models the average and statistics are not described correctly …

But I digress … the first thing I note about Fig. 1 is that the actual tropical temperatures (NOAA and HadCRUT) stay within a very narrow range, as shown by the height of the coloured interquartile boxes (red and orange).

Remember that the boxplot means that half of the time, the actual tropical surface temperature stayed in the box, which for the observations shows a +/- 0.1° temperature range. Much of the time the tropical temperature is quite stable. The models, on the other hand, generally show a very different pattern. They reflect much more unstable systems, with the temperatures moving in a much wider range.

The second thing I note is that the model errors tend to be on the hot side rather than the cold side. The PCM, GISS-EH, and CCSM3 models, for example, all agree with the observations at the first (cooler) quartile. But they are too hot at the median and the third (upper) quartile. This is evidence that upwards temperature feedbacks are being overestimated in the models, so that when the models heat up, they heat too far, and they don’t cool down either as fast or as far as the real tropics does. Again, of the nine models, only the UKMO model reproduces the observed behaviour. All the rest show a pattern that is too hot.

Third, I note that all of the model interquartile boxes (except UKMO) are taller than the actual data, regardless of the range of each model’s hindcast. Even models with smaller ranges have taller boxes. This suggests again that the models have either too much positive feedback, or too little negative feedback. Negative feedback tends to keep data bunched up around the median (short box compared to range, like the observational data), positive feedback pushes it away from the median (tall box, e.g. PCM, with range similar to data, much taller box).

Mathematically, this can be expressed as an index of total data range/IQR (Inter-Quartile Range). For the two actual temperature datasets, this index is about 5.0 and 5.3, meaning the data is spread over a range about five times the IQR. All the models have indices in the range of 2.7-3.6 except UKMO, which has an index of 4.7.
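Computing that index is straightforward; here is a minimal sketch (the function name is mine, not from the post):

```python
import numpy as np

def range_to_iqr(x):
    """Total data range divided by the interquartile range (the index above)."""
    x = np.asarray(x)
    q1, q3 = np.percentile(x, [25, 75])
    return (np.max(x) - np.min(x)) / (q3 - q1)

# An index near 5 means the data spread over about five times the IQR, as in the
# NOAA and HadCRUT observations; most of the model hindcasts score nearer 3.
```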

Some of the models are so different from the data that one wonders why these are considered “state of the art” models. The CM models, both 2.0 and 2.1, give hindcast results that go both way hotter and way colder than the observational data. And all of the models but two (GISS-ER and UKMO) give hindcast results that go colder than the data.

How the Models Stand – Summary

  • UKMO is the most lifelike. It matched all three quartile confidence intervals, as well as having a similar outlier distribution.
  • M_medres did a distant second best. It only matched the lower quartile confidence interval, with both the median and upper quartile being too hot.
  • The rest of the models were all well out of the running, showing distributions which are strikingly different from the observational data.
  • Least lifelike I’d judge to be the two CM models, CM2.0 and 2.1, and the PCM model.

Since we’ve seen the boxplot, let’s take a look at the temperature data for two lifelike and two un-lifelike models, compared with the observational data. As mentioned above, there is no trend because the data is detrended so we can measure how it is distributed.

Figure 2. Surface Temperature Observational Data and Model Hindcasts. Two of the best on the left, two of the worst on the right. See References for list of models and data used.

Note how for long periods (1979-82, 1990-97) the actual tropical surface temperature hardly varied from zero. This is why the box in the boxplot is so short, with half the data within +/- 0.1°C of the average.

The most lifelike of the models (UKMO and M_medres), while not quite reproducing this behaviour, came close. Their hindcasts at least look possible. The CM2.1 and PCM models, on the other hand, are wildly unstable. They hindcast extreme temperatures, and spend hardly any of their time in the +/- 0.1° C range of the actual temperature.

How The Models Move

How the models move means the month-to-month changes in temperature. The tropical ocean has a huge thermal mass, and it doesn’t change temperature very fast. Here are the boxplots of the movement of the temperature from month to month:

Figure 3. Surface Temperature Month-to-Month Changes (∆ST/month), showing Data and Model Hindcasts. Boxes show the range from the first quartile to the third quartile (IQR, or inter-quartile range). Notches show 95% confidence interval for the median. Circles show “outliers”, points which are further from the quartile than the size of the IQR (length of the whiskers). See References for list of models and data used.

Here in the month-to-month temperature changes, we see the same unusual pattern we saw in Fig. 1 of the temperatures. The observations have a short interquartile box compared to the range of the data, and a number of outliers. In this case there are about equal numbers of outliers above and below the box. The inter-quartile range (IQR, the box height) of tropical temperature change is about +/- 0.015°C per month, indicating that half of the time the temperature changes that little or less. The total range of the temperature change is about +/- 0.07°C per month. It is worth noting that in the 21-year record, the tropical surface never warmed or cooled faster than 0.07°C per month, so the models predicting faster warming or cooling than that must be viewed with great suspicion.
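A crude screen based on that observation might simply flag any model whose fastest monthly warming or cooling lies outside anything seen in the observed record; a minimal sketch (a rough check, not a formal significance test):

```python
import numpy as np

def exceeds_observed_rate(model_dst, obs_dst):
    """Compare the largest monthly temperature change in a model hindcast with
    the largest observed change (about 0.07 °C/month in the 1979-2000 tropics)."""
    obs_max = np.nanmax(np.abs(obs_dst))
    mdl_max = np.nanmax(np.abs(model_dst))
    return mdl_max > obs_max, mdl_max, obs_max
```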

Although all but two of the models (CM2.0 and CM2.1) matched all three confidence intervals, there are still significant differences in the distribution of the hindcasts and the observations. The most lifelike is M_medres, with UKMO second. GISS-ER (purple) is curious, in that the month-to-month movements are all very small, never more than +/- 0.03 degrees per month. It never hindcasts anything like the larger monthly changes that we see in the actual data.

Next, consider the speed at which the ocean heats and cools. In the real world, as shown by the data, the heating and cooling rates are about the same. This makes sense, as we would expect the tropical ocean to radiate heat at something like the same rate it gains it. It has to lose the heat it gains at night by the next morning for the temperature to stay the same over several days.

Now look at the data distribution for GISS-EH, CM2.0 or CM2.1. They rarely heat up fast, but they cool down very fast (short whisker on top, long whisker plus outliers on bottom). Slow heating and fast cooling, that doesn’t make physical sense. The maximum heating rate for GISS-EH (0.03°C/mo.) is less than half the maximum heating rate of the actual tropics. PCM has the same problem, but in the other direction, heating up much faster than it cools down.

How the Models Move – Summary

  • M_medres is the most lifelike. It matched all three quartile confidence intervals, as well as having a similar outlier distribution.
  • UKMO did a credible second best. However, the ranges of UKMO and M_medres were too small.
  • The rest of the models were well out of the running, showing distributions which are strikingly different from the observational data.
  • The least lifelike? I’d judge that to be the two CM models, CM2.0 and 2.1, and the CCSM3 model.

Let’s take a look at these winners and losers at reproducing the changes in temperature (∆ST). As mentioned above, the data is detrended so we can see how it is distributed.

Figure 4. Surface Temperature Observational Data and Model Hindcast Delta ST. Shows monthly changes in the temperature. Two of the best, two of the worst. See References for list of models and data used.

Once again, the boxplots correctly distinguish between lifelike and un-lifelike models. The large and extremely rapid temperature drops of the CM2.1 model are clearly unnatural. The GISS-ER model, on the other hand, hardly moves from month to month and is unnatural in the other direction.

How the Models Turn

Acceleration is the rate of change of speed. In this context, speed is the rate at which the tropical temperatures warm and cool. Acceleration is how fast the warming or cooling rate changes. It measures how fast a rising temperature can turn to fall again, or how fast a falling temperature can turn into a rising temperature. Since acceleration is the rate of change (∆) of the change in temperature (∆ST), it is notated as ∆(∆ST). Here are the results.

Figure 5. Surface Temperature Month-to-Month Changes in ∆ST, showing Data and Model Hindcasts. Boxes show the range from the first quartile to the third quartile. Notches show 95% confidence interval for the median. Whiskers show the range of data out to the interquartile box height. Circles show outliers. See References for list of models and data used.

Using the 95% confidence interval of the median and the quartiles, we would reject CM2.1, CM2.0, CCSM3, GISS-ER, and UKMO. PCM and M_medres are the most lifelike of the models. UKMO and GISS-ER are the first results we have seen which have significantly smaller interquartile boxes than the observations.

CONCLUSIONS

The overall conclusion from looking at how the models stand, move, and turn is that the models give results that are quite different from the observational data. None of the models were within all three 95% confidence intervals (median and two quartiles) of all of the data (surface temperatures ST, change in surface temps ∆ST, and acceleration in surface temps ∆∆ST). UKMO and M_medres were within 95% confidence intervals for two of the three datasets.

A number of the models show results which are way too large, entirely outside the historical range of the observational data. Others show results that are much less than the range of observational data. Most show results which have a very different distribution from the observations.

These differences are extremely important. As the Thorpe quote above says, before we can trust a model to give us future results, it first needs to be able to give hindcasts that resemble the “average and statistics of the weather states”. None of these models are skillful at that. UKMO does the best job, and M_medres comes in third best, with nobody in second place. The rest of the models are radically different from the reality.

The claim of the modellers has been that, although their models are totally unable to predict the year-by-year temperature, they are able to predict the temperature trend over a number of years. And it is true that for their results to be believable, they don’t need to hindcast the actual temperatures ST, monthly temperature changes ∆ST, and monthly acceleration ∆∆ST.

However, they do need to hindcast believable temperatures, changes, and accelerations. Of these models, only UKMO, and to a much lesser extent M_medres, give results that by this very preliminary set of measures are at all lifelike. It is not believable that the tropics will cool as fast as hindcast by the CM2.1 model (Fig. 3). CM2.1 hindcasts the temperature cooling at three times the maximum observed rate. On the other hand, the GISS-ER model is not believable because it hindcasts the temperature changing at only half the range of changes shown by observation. Using these models in the IPCC projections is extremely bad scientific practice.

There is an ongoing project to collect satellite-based spectrally resolved radiances as a common measure between models and data. Unfortunately, we will need a quarter century of records to even start analysing, so that doesn’t help us now.

What we need now is an agreed upon set of specifications that constitute the mathematical definition of “lifelike”. Certainly, at a first level, the model results should resemble the data and the derivatives of the data. As a minimum standard for the models, the hindcast temperature itself should be similar in quartiles, median, and distribution of outliers to the observational data. Before we look at more sophisticated measures such as the derivatives of the temperature, or the autocorrelation, or the Hurst exponent, or the amplification, before anything else the models need to match the “average and statistics” of the actual temperature data itself.

By the standards I have adopted here (overlap of the 95% confidence notches of the medians, overlap of the 95% confidence boxes of the quartiles, similar outlier distribution), only the UKMO model passed two of the three tests. Now you can say the test is too strict, that we should go for the 90% confidence intervals and include more models. But as we all know, before all the numbers and the percentages when we first looked at Figure 1, the only model that looked lifelike was the UKMO model. That suggests to me that the 95% standard might be a good one.
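To make the test concrete, here is one possible sketch of those checks (notch overlap for the median, quartile confidence-interval overlap; the outlier comparison is left as a visual step, as in the post). The post does not say how the quartile confidence intervals in Figure 1 were computed, so a simple bootstrap is used here as a stand-in.

```python
import numpy as np

def interval_overlap(a, b):
    """True if two (low, high) intervals overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

def boot_ci(x, stat, n_boot=2000, seed=0):
    """Bootstrap 95% confidence interval for a statistic of one sample."""
    rng = np.random.default_rng(seed)
    vals = [stat(rng.choice(x, size=len(x), replace=True)) for _ in range(n_boot)]
    return np.percentile(vals, [2.5, 97.5])

def lifelike(model, obs):
    """Do the model's median and quartile confidence intervals overlap those
    of the observations?  Returns (overall pass, individual checks)."""
    stats = [np.median,
             lambda v: np.percentile(v, 25),
             lambda v: np.percentile(v, 75)]
    checks = [interval_overlap(boot_ci(model, s), boot_ci(obs, s)) for s in stats]
    return all(checks), checks
```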

But I’m not insisting that this be the test. My main interest is that there be some test, some way of separating the wheat from the chaff. Some, indeed most of these models are clearly not ready for prime time — their output looks nothing like the real world. To make billion dollar decisions on an untested, unranked suite of un-lifelike models seems to me to be the height of foolishness.

OK, so you don’t like my tests. Then go ahead and propose your own, but we need some suite of tests to make sure that, accurate or not, the output of the models is at least lifelike … because remember, being lifelike is a necessary but not sufficient condition for the accurate forecasting of temperature trends.

My best to everyone,

w.

DATA

The data used in this analysis is available here as an Excel workbook.

REFERENCES

B. D. Santer et. al., 2005, September 2, “Amplification of Surface Temperature Trends and Variability in the Tropical Atmosphere”, Science Magazine

Thorpe, Alan J., 2005, “Climate Change Prediction — A challenging scientific problem”, Institute of Physics, 76 Portland Place, London W1B 1NT

MODELS USED IN THE STUDY

National Center for Atmospheric Research in Boulder (CCSM3, PCM)

Institute for Atmospheric Physics in China (FGOALS-g1.0)

Geophysical Fluid Dynamics Laboratory in Princeton (GFDL-CM2.0, GFDL-CM2.1)

Goddard Institute for Space Studies in New York (GISS-AOM, GISS-EH, GISS-ER)

Center for Climate System Research, National Institute for Environmental Studies, and Frontier Research Center for Global Change in Japan (MIROC-CGCM2.3.2(medres), MIROC-CGCM2.3.2(hires))

Meteorological Research Institute in Japan (MRI-CGCM2.3.2).

Canadian Centre for Climate Modelling and Analysis (CCCma-CGCM3.1(T47))

Meteo-France/Centre National de Recherches Meteorologiques (CNRM-CM3)

Institute for Numerical Mathematics in Russia (INM-CM3.0)

Institute Pierre Simon Laplace in France (IPSL-CM4)

Hadley Centre for Climate Prediction and Research in the U.K. (UKMO-HadCM3 and UKMO-HadGEM1).

FORCINGS USED BY THE MODELS

COMMENTS
December 2, 2010 11:25 am

I would have thought that it would be important for the models to get the ‘absolute’ temperature of the earth about right, too.
Some time ago, Lucia plotted the IPCC models ‘absolute’ temperatures :
http://rankexploits.com/musings/2009/fact-6a-model-simulations-dont-match-average-surface-temperature-of-the-earth/
Quite an eye opener.

Mingy
December 2, 2010 11:27 am

Day
“You’re throwing out the baby with the bath. Backcasting is very useful for validating a model, because you don’t have to “wait for the future to happen” to do the validation.
It’s still “blind testing” in the sense that no modelers can possibly train their models on every piece of existing data. In any case, most of us set aside some “blind data” for testing anyway. And those corrections you mention apply to the future too, which we’re also “blind” to.
They’re using a small fraction of the data to predict the rest. That’s utility in my book.”
I don’t know how much experience you have with models, but there are models and there are models.
All ‘models’ of the financial system work well in backcasting. All of them which get published, in any event. But they have no ability to predict the future – they diverge from the moment you start predicting.
When you look at, let’s say, planetary orbits, you have a deterministic model with high-resolution inputs. When you look at something like climate, you have a chaotic system – like water flow. I know people don’t think it’s chaotic (and most folks don’t know what the word means in the context of modeling) but can we agree that water flow, air flow, and clouds have an impact? It’s a matter of scientific fact that life modified (and modifies) the climate and vice versa. Can we confidently model the biosphere and carbon cycle?
What about backcasting gives you any confidence about a model’s predictive ability? The only thing you can say is that if it can’t backcast it can’t forecast!
Of course, the fact no two models seem to agree is another vote of confidence …

Gina
December 2, 2010 11:52 am

Take any seven factors correlatable to time (e.g., ketchup sales in Tennessee, the number of bald eagles, etc.), and one can write a model (in fact, an infinite number of models) that fits past temperature data, whether good data or highly corrected data based on many arbitrary assumptions. There’s really no difficulty in that. A model must not only stand to reason, but also be testable by reliable data under controlled conditions. So far, science has come nowhere near achieving this.
Assumptions can be reasonable but biased, if they are chosen from among multiple equally reasonable assumptions. Today’s “climate science” comprises little more than creating mathematical models based on biased and untestable assumptions, testing the models using poorly collected data that has been adjusted by biased and untestable assumptions, and predicting disasters based on many more arbitrary and untestable assumptions. Equally reasonable, biased and untestable assumptions would predict future cooling and global prosperity. But this activity would not be good science either.
Physics can be used to calculate the heat that an increase of X ppm of certain gases would absorb or reflect (before heat removal from the earth by convection is accounted for), which predicts a tiny undetectable temperature increase. But in most models, positive (amplifying) feedbacks of CO2 are biasedly emphasized over negative (dampening) feedbacks, and other model factors are used to explain why CO2 has not yet created the chaos that it will soon create. And it’s all “validated” by poorly collected and poorly controlled data.
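Gina’s point about curve-fitting is easy to demonstrate with a toy example (entirely hypothetical data, nothing to do with any actual climate model): an over-fitted curve can reproduce the “past” almost perfectly and still be useless for the “future”.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(120)                              # ten years of monthly steps
y = 0.002 * t + 0.1 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.05, t.size)

tt = (t - t.mean()) / t.std()                   # rescaled time keeps the fit well-conditioned
fit, hold = slice(0, 96), slice(96, None)       # "hindcast" on 8 years, "forecast" the last 2
coeffs = np.polyfit(tt[fit], y[fit], 12)        # deliberately over-fitted 12th-order polynomial

rmse = lambda idx: np.sqrt(np.mean((np.polyval(coeffs, tt[idx]) - y[idx]) ** 2))
print("hindcast RMSE:", rmse(fit), " forecast RMSE:", rmse(hold))
# The in-sample (hindcast) error is tiny; the out-of-sample error is far larger.
```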

MikeO
December 2, 2010 12:14 pm

Can you, Willis, or someone else point me to a reference for the radiance of the sun? That is, for a solar cell, how many watts per square meter do you get at the earth’s surface as a maximum? Activists play with figures, saying things like you can get 600 watts or 1400 watts and that it is only inefficiencies of the collectors that are the problem. Currently I think it is 370 watts maximum and cells collect about 100 watts, but I cannot find an adequate reference. Is it a conspiracy? Could WikiLeaks help?

Charlie A
December 2, 2010 1:11 pm

A very interesting post.
Willis — you compared the statistical characteristics of 1979-2000 data to models and found the models to have very different statistics.
Would it be possible for you to do the same sort of comparison between the 1979-2000 data and the 2000-2010 or the 1950 to 1980 data?
It would be interesting to see whether the statistical measures you chose to look at are relatively constant in the temperature data record.
Charlie

Mingy
December 2, 2010 1:15 pm

@Willis Eschenbach
Out of sample data might be useful for testing in certain contexts – however, if the in sample and out of sample data happen to be correlated (which I would imagine would apply to climate data), not so much.
The challenge remains: how can we ensure a model has predictive value? Well, if it is a deterministic model with known inputs and a robustly demonstrated theoretical framework, and it works on past data, then it might work on future data as well. We don’t know whether it has predictive value, but it might. Then we look at prediction vs. reality (think tree rings). If reality disagrees with prediction, then the model is obviously useless.
Then again, if climate models were deterministic with known inputs and a robustly demonstrated theoretical framework they would agree with one another.
There is a great quote from M&M’s book, something along the lines of “A model of a mouse, no matter how good, tells you nothing about a mouse”. That perfectly aligns with the courses on modeling I took: don’t believe your model tells you anything about nature; nature tells you what’s wrong with your model.

pwl
December 2, 2010 1:27 pm

Excellent analysis Willis. Thanks.
It would be interesting to know more about the details and assumptions of how each alleged climate model works, er to be more accurate, doesn’t work.
As an aside, I looked up “climate forcing” and found two definitions using a Google search:
(1) “Climate Forcing: The Earth’s climate changes when the amount of energy stored by the climate system is varied. The most significant changes occur when the global energy balance between incoming energy from the Sun and outgoing heat from the Earth is upset. There are a number of natural mechanisms that can upset this balance, for example fluctuations in the Earth’s orbit, variations in ocean circulation and changes in the composition of the Earth’s atmosphere. In recent times, the latter has been evident as a consequence not of natural processes but of man-made pollution, through emissions of greenhouse gases. By altering the global energy balance, such mechanisms “force” the climate to change. Consequently, scientists call them “climate forcing” mechanisms.”
http://www.ace.mmu.ac.uk/eae/climate_change/older/Climate_Forcing.html
That seems like a clear definition and talks about the actual planet.
The second definition is kind of different.
(2) “Forcings: Forcings in the climate sense are external boundary conditions or inputs to a climate model. Obviously changes to the sun’s radiation are external, and so that is always a forcing. The same is true for changes to the Earth’s orbit (“Milankovitch cycles”). Things get a little more ambigous as you get closer to the surface. In models that do not contain a carbon cycle (and that is most of them), the level of CO2 is set externally, and so that can be considered a forcing too. However, in models that contain a carbon cycle, changes in CO2 concentrations will occur as a function of the climate itself and in changes in emissions from industrial activity. In that case, CO2 levels will be a feedback, and not a forcing. Almost all of the elements that make up the atmosphere can be considered feedbacks on some timescale, and so defining the forcing is really a function of what feedbacks you allow in the model and for what purpose you are using it. A good discussion of recent forcings can be found in Hansen et al (2002) and in Schmidt et al (2004).
Filed under: * Glossary — group 28 November 2004 -”
http://www.realclimate.org/index.php/archives/2004/11/forcings/
This second definition isn’t clear, certainly not as clear as the first, but that’s ok, not everyone writes with the same level of clarity, which is why I generally look for more than one source for definitions.
The main point that occurred to me as I read the second definition is that they are defining it in terms of “climate models” rather than the actual atmosphere. I find this very peculiar. Aren’t they being paid to study the ACTUAL atmosphere and climate? I gather not, for their definition clearly shows them to be defining their terms in terms of climate models rather than the actual planet.
Now maybe I’m just splitting hairs, but maybe it also reveals a profound difference in the mindset of the two sets of scientists? One focused on the actual planet and the other focused on models, to the extent that they define their world in terms of computer models! Strange indeed, and I’m a computer scientist!
Anyway, thanks for your illuminating findings Willis.

December 2, 2010 1:43 pm

Willis:
First, let me say that it is not just silly to use models that “perform badly” (whatever that means). It is the antithesis of the scientific method. And I agree that comparing models that use different inputs, as do the IPCC models, is meaningless.
############
before we had accurate models of RCS we used models we knew were wrong. We picked the least wrong. That’s far better than letting the perfect be the enemy of the good enough. This is not really about the scientific method.
“Next, yes, it is theoretically possible to forecast the future trend of the climate without getting the small stuff (e.g. daily, monthly, and annual temperatures and swings) right. For example, the CM2.1 model (purple) in Fig. 1 shows huge inter-annual swings that do not appear in either the observations or the other models. But it still does a passable job of hindcasting the trend … sorry, that don’t impress me much. A model that claimed that the daily temperature varied over a 60°C range and that the globe varied 30° from summer to winter could get the trend right too … but that doesn’t mean I would trust that model. Would you?”
Of course one has a cascade of tests: first-order MOM, second-order MOM.
Getting the trend correct is important. If two models get the trend correct and one also gets the wiggles right, you of course pick the best model.
Next, we are supposed to trust these iterative models because they claim to be based on “fundamental physics”. And unfortunately our testing options are limited, since it takes fifty years to actually test a fifty-year prediction.
# We can test a 50-year prediction at any time. It’s just math.
“But if a model gives unphysical results in the tests I used above, if it shows monthly ocean temperatures changing by some huge never-before-seen amount in a single month … what does that say about the quality of their “fundamental physics”?”
it says nothing on its face. Could be lots of issues.
“Mosh, I was interested in your choice of things to test (temperature trend, precipitation extremes, and sea level rise). I say that because I haven’t seen any observations of an increase in precipitation extremes. Nor is there any observed increase in the rate of sea level rise. In fact, the rate of rise has decreased slightly since about 2006.
Given that, it’s not clear to me exactly how you would use those to test the models.”
Easy. If the model fails a test of sea level rise, you don’t use its predictions to drive policy. Same for droughts and floods or hurricanes.

Kev-in-UK
December 2, 2010 1:59 pm

I openly confess to not really understanding the methodology of producing GCMs. What I find somewhat perplexing is the fact that, with all the up-to-date real data, of many different parameters, and the best computer programs, etc., local (i.e. country-wise) weather forecasts cannot accurately predict more than a few days ahead.
Now, I’m sure someone will correct me if I am wrong – but if the above is not demonstrated to be possible, with ALL the available carefully obtained meteo data and climate knowledge – how the heck can anyone expect to produce a model (no matter how general and simplistic) that looks back over time (the last 150 yrs) from a load of assumptions, and without knowing detailed climate variabilities and sensitivities (which forcings, etc!) or detailed actual observations? – and even expect to get ‘close’?
I know most people accept that weather forecasting is difficult because real-life actual prediction of the ‘chaotic’ system cannot be achieved. So why does anyone think hindcasts from GCM models will ever show anything other than possible ‘general’ indications?
I am not saying that trying to model past climate won’t help to understand forcings and sensitivity, etc – but they can never really be fully verified! As a really simplistic example, what if the model uses a given forcing for a given parameter and a given feedback – and then gives a reasonable ‘fit’? That does not mean the model is correct – what it really means is that the modeller chose values that seem to fit (or, in the back-analysis process, the values were altered to GET it to fit!). In practice, a completely different forcing and sensitivity (from a non-considered or indeed unknown source) could have caused the same effect. Or, put another way, a combination of said forcings and sensitivities – it only takes half a dozen ifs – and (IMHO) you are wasting your time!
Perhaps more simply – I suggest the following thought experiment –
Imagine standing at one side of a see-saw sticking out of a wall, but you don’t know it’s a see-saw! (it could be a simple balance beam, or a cranked beam, or some form of geared system, etc – it doesn’t matter – the point is you do not know!) – You see only a seat which inexplicably moves up and down and you cannot see behind the wall to the other side of the see-saw. You get on the seat – it goes down, you get off it goes up – but sometimes an imaginary bloke around the other side gets on and off at random, or his child does – but all you ever SEE, is the effect at YOUR side, i.e. sometimes it doesn’t go down, or stays down when you get off, or its goes down or up slowly! You don’t know when this will happen, you don’t know why it happens, so you invent explanations, how many kids he has, how obese he or his kids are, or how long his side of the ‘beam’ is, etc, etc.
Now, how the heck are you going to work out how the system works? You have to invent all the possibilities for the fact that sometimes your seat goes up or down, but you can never model it or use your model to predict what will happen the next time you get on or off the seat!
I cannot see how the massive climate system and biosphere we call home could ever be understood to a level where significant predictive capabilities are possible. IMO, the biosphere/ecosphere/climate system is the equivalent of an awful lot of see-saws, but unlike my example, they are inextricably intertwined.
I don’t wish to be defeatist as such – just realistic about the scale of the problem regarding GCMs!

Mark Betts
December 2, 2010 2:51 pm

I would like to discuss another utilization and results of some GCMs (by the way, I’m probably showing my ignorance here, but doesn’t the “C” in GCM actually stand for Circulation?).
During the recent oil spill in the Gulf of Mexico, a model was displayed that showed the oil being entrained into the loop current, into and out of the Florida Straits, into the Gulf Stream, up the East Coast, and finally dispersing over England. I believe this model was produced by the NCAR folks although it was only identified as coming from Boulder, Colorado. It was very smooth and detailed so somebody with a lot of computing horsepower produced it.
Unfortunately, I never saw a legend to actually know what the colors meant but I assume that it was meant to convey different concentration of oil on the surface.
The reason I brought this up is that I believe it is another example of how poorly a GCM model can perform when supplied with bad variables and forcing parameters.
The cell sizes were minute, as were the step sizes, and the resulting model was impressive and to many shockingly believable. The only problem was that it was just plain “wrong”. The oil, to my knowledge, never actually made it into the loop current, much less into (in order) the Florida Strait, the Gulf Stream, and the waters off of coastal Great Britain.

Kev-in-UK
December 2, 2010 2:54 pm

I have a question for Willis – apologies if this appears simple – but the GCM models you describe: how are they actually ‘run’? I mean, do they start at time X and run forwards to time Y, or do they start at time Y and work back to time X?
For my money – surely the best model would ‘start’ at some point in the middle of available observational data, and then be able to accurately hindcast the known data – and then, without any adjustment or tuning, could be run forward and accurately predict the ‘later’ observed data. Would this be a good test of a model? Similarly, dropping a ‘shortened time sequence’ of data ‘into’ a model should enable a robust model to work forward or backwards and still come up with reasonable results when compared to the longer observational values?
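For what it’s worth, the test Kev-in-UK proposes can be written down generically. The sketch below uses a stand-in statistical model (any function that fits on one segment and returns a predictor), since a full GCM obviously cannot be re-run in a few lines; everything here is illustrative only.

```python
import numpy as np

def split_sample_skill(series, fit_model, start, stop):
    """Fit only on series[start:stop], then score the fit period and the
    held-out periods on either side without any re-tuning.
    `fit_model(t, y)` is assumed to return a callable predict(t).
    Choose start/stop so data remains on both sides of the fit window."""
    series = np.asarray(series)
    t = np.arange(len(series))
    predict = fit_model(t[start:stop], series[start:stop])

    def rmse(idx):
        return np.sqrt(np.mean((predict(t[idx]) - series[idx]) ** 2))

    return {"fit period": rmse(slice(start, stop)),
            "hindcast (before)": rmse(slice(0, start)),
            "forecast (after)": rmse(slice(stop, None))}
```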

gnarf
December 2, 2010 3:04 pm

>> Finally, gnarf, I totally disagree with your claim that “Even an accurate model (if it exists) will fail your tests.” The UKMO model did a reasonable job, and that’s only looking at the nine models used by Santer. When one of the first nine models grabbed at random does all right, I’m not buying the ‘but it’s toooo hard’ excuse.
I would agree more with your tests if they used as input some climate-relevant data, like a 30-year moving average, detrended, over a significant period of time; but here the detrended temp is more like weather, and its derivatives even more so.
To make another comparison, I can make an iterative model which will predict quite reliably how the smoke climbs in a complex system of pipes and what the average flow (climate) is at each corner, but the model will give results with a very narrow distribution, while measurements have a wide distribution … because the model does not try to model short-term irregularities like vortices in the pipes (weather). I can add a random series to simulate the chaotic part of the flow, but what is the point?
I totally agree with you that model results have to be compared with what they are supposed to model … but I am not sure comparing the distribution of the derivatives is the right thing to do. Weather is chaotic and the weather equations can’t be integrated; no weather forecast is possible after a few days. Climate is less chaotic.
Maybe compare some autocorrelation of the temperature series, a Fourier decomposition, or some fractal measure. With Fourier you can maybe spot the too-high-amplitude waves some models have; with autocorrelation you may spot the feedbacks, when they come and how long they last … I am not a specialist at all, of course.
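Those comparisons are also easy to set up; a minimal sketch of the autocorrelation and Fourier-amplitude checks gnarf suggests (monthly data assumed):

```python
import numpy as np

def autocorrelation(x, max_lag=36):
    """Sample autocorrelation out to max_lag months."""
    x = np.asarray(x) - np.mean(x)
    var = np.dot(x, x)
    return np.array([np.dot(x[:-k] if k else x, x[k:]) / var
                     for k in range(max_lag + 1)])

def amplitude_spectrum(x):
    """Fourier amplitude spectrum of a monthly series; over-energetic
    low-frequency 'waves' in a model hindcast would show up here."""
    x = np.asarray(x) - np.mean(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0)   # cycles per month
    amps = np.abs(np.fft.rfft(x)) / len(x)
    return freqs, amps
```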

gnarf
December 2, 2010 3:10 pm

When I wrote that an accurate model would fail your tests, I meant that it will pass the first one if the measurement frequency is not too high, but when you compare first and second derivatives, the results should get worse and worse.

JPeden
December 2, 2010 3:24 pm

My main interest is that there be some test, some way of separating the wheat from the chaff.
On the other hand, comparing Model output to actual data would certainly not be in the interests of any ‘quality’ ipcc Climate Science Propaganda Operation. Seriously, once an observer starts to see ipcc Climate Science as only a giant Propaganda Op., its “method” makes total sense – provided that we should also keep M. Stanton Evans’ “law of inadequate paranoia” very firmly in mind: ~”no matter how bad things look, it just gets worse.”

George E. Smith
December 2, 2010 4:14 pm

“”””” MikeO says:
December 2, 2010 at 12:14 pm
Can you Willis or someone else point me to a reference to the radiance of the sun. That is for a solar cell how many watts per square meter do you get at the earth’s surface as a maximum. Activists play with figures saying thing like you can get 600 watts or 1400 watts and that it is only ineffiencies of the collectors that is the problem. Currently I think it is 370 watts maximum and cells collect about 100 watts but cannot find an adequate reference. Is it a conspiracy could wikileaks help? “””””
Well MikeO, first let’s get the units correct. “Radiance” is the radiant energy equivalent of “Luminance”, which relates only to human eye response to light. The (very) loose colloquial term would be “Brightness”; which should be avoided like the plague in scientific writings.
But it is the wrong unit to use anyway since the units of Radiance are Watts per steradian per square metre; and it applies only to sources.
The unit I am sure you were meaning is “Irradiance” and that truly does refer to the energy falling on a target surface in Watts per square metre, and as you can see it has no angular factor. The visual equivalent would be lumens per Square metre.
So for the sun, the most often cited value is referred to simply as TSI, which is Total Solar Irradiance; and its value, averaged over all the earth orbit locations, is about 1366 W/m^2, based on the best satellite measurements over about three sunspot cycles. That is the value that solar cells would react to in earth orbit.
On the earth’s surface with the sun directly overhead (zenith), atmospheric absorption reduces that number down to something pretty close to 1000 W/m^2, and that is what earth-bound solar cells would be limited to. That is often referred to as the “air mass one” irradiance, since one atmospheric mass of air stands in the path. If the sun were 60 degrees from the zenith, or 30 degrees above the horizon, the slant-range air path is twice as long, so we would call that air mass two, and the solar cell output will be even less.
In addition, the ground-level sunlight is susceptible to water vapor in the atmosphere, which can absorb sunlight in the long visible to near-infrared range from about 0.75 microns to about 4.0 microns. About 45% of the solar energy resides in that range, and water may be capable of absorbing about half of that range, or about 20% of total sunlight, with high tropical humidities.
That 1000 W/m^2 number is a good one to hang onto; but remember that is per unit of area perpendicular to the sun beam; so it assumes that solar arrays will be pointed with their normal towards the sun, and hopefully track that in some way.
So 1400 W/m^2 is not real; but as a matter of course, a lot of radiation engineers do use 1400 as a number for just rough calculations. I don’t know why, because computers can deal with 1366 just as easily as 1400. When I went to school, the value used was 1353 W/m^2, but that was pre-Sputnik days, so based on balloon- or rocket-borne data.
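George’s numbers can be put into a few lines of code. The simple per-air-mass attenuation below is only a rough assumption for illustration (real attenuation depends on wavelength, aerosols, and humidity), but it reproduces the round figures he quotes:

```python
import numpy as np

TSI = 1366.0              # top-of-atmosphere solar irradiance, W/m^2
SURFACE_AM1 = 1000.0      # approximate clear-sky irradiance at air mass 1, W/m^2

def air_mass(zenith_deg):
    """Plane-parallel approximation: path length relative to the zenith path."""
    return 1.0 / np.cos(np.radians(zenith_deg))

def surface_irradiance(zenith_deg):
    """Very crude clear-sky estimate for a surface tracking the sun: scale the
    air-mass-1 value assuming the same transmittance (~1000/1366) per air mass."""
    transmittance = SURFACE_AM1 / TSI
    return TSI * transmittance ** air_mass(zenith_deg)

# Example: sun 60 degrees from zenith (air mass ~2) gives roughly 730 W/m^2.
print(round(surface_irradiance(60.0)))
```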

December 2, 2010 4:33 pm

Huh? What am I missing here? I predict that the world will disappear in 2060. Please tell me how I can test that prediction in less than fifty years. Use all the math you want.
##### What I mean is simply this: a GCM makes billions of predictions. We may predict that temps will be 2C higher in 100 years, but we need not wait 100 years to evaluate that prediction. Your prediction of the world ending in 2060 could be disconfirmed by the world disappearing next week. It couldn’t, of course, be confirmed.
No math required.
“Please don’t play word games. Their claim is that the “fundamental physics” are correctly installed and represented in such a way as to enable them to do century-long forecasts. Any issues that prevent that happening are part of their claim.”
It’s not word games. The fundamental physics could be correct but incomplete. Nobody has said the physics is complete. Just that GCMs are based on fundamental physics. Further, they could be wrong but fundamental. What is meant is this: it’s a physics simulation. So, with our best understanding of physics, constrained to run on the best available hardware in a constrained amount of time, we get answers that are roughly consistent with observations. As a decision maker I would weigh all this. I would acknowledge the flaws, the uncertainty, and weigh the evidence accordingly. I’d give it more weight than a blog post.
“Ah. Ok, that could work. Are you aware of specific predictions of sea level rise, weather extremes, or sea level rise by a given model that we could use to test the models?”
Models that get precipitation correct often get temperature wrong. Google Taylor diagrams. You might find sea levels in the outputs … not sure.
And how long will we have to wait to get the answers? Your three proposed phenomena are long-term, slow-changing, and noisy. So how soon would your method be able to give us answers?
That’s why I’ve proposed comparing model results to data and its derivatives. That at least we have a lot of data on, and can test now rather than in decades.
All the best,
w.

johanna
December 2, 2010 6:24 pm

I know a bit about economic forecasting models, which others have referred to. I grant that there are significant differences in terms of the subject matter – but where there are important similarities is in the inherent limitations of forecasting modelling of complex systems.
Paradoxically, testing against past events is not as useful as it intuitively seems. As others have pointed out, it is not difficult to construct a model that perfectly backcasts, say, interest rates for the last 30 years, but that proves nothing about its predictive value. It is just an artifact of statistics which happens to spit out the right results. While this is an over-simplification of how economic (and climate) models work and are constructed, I trust that the point is clear.
In fact, the better interest rate predictor model might not work very well in retrospect, because it includes variables that were not measured (or measurable) in the past and therefore cannot be used for backcasting. No doubt this is an issue for climate modelling as well. That is leaving aside the whole issue of data integrity, which is even more of an issue in climate modelling than it is in economics.
The other point I would like to make is that the reliability of every other kind of forecasting model I have ever come across degenerates rapidly as the time horizon gets longer. As others have pointed out, tiny errors magnify quickly over time. And there are always (at a minimum) tiny errors. I am amazed at the hubris of climate modellers with regard to this issue.

AusieDan
December 2, 2010 6:38 pm

Bill ILLIS
Your model seems to mimic the history very accurately.
However, I do not see any allowance for UHI which is a major factor in inhabited regions.
Have I missed something?
Or is there no appreciable UHI in the tropics?

David A. Evans
December 2, 2010 7:25 pm

Murray Grainger says:
December 2, 2010 at 8:58 am

TerryS says:
December 2, 2010 at 5:52 am
Re: Murray Grainger
Mike Haseler says:
December 2, 2010 at 5:54 am
Murray Grainger says: “Apples and Oranges folks.
Sorry chaps, I shall make my sarcasm more obvious in future.

Please don’t. I enjoy reading the responses because they didn’t get it LOL
DaveE.

Geoff Sherrington
December 2, 2010 8:08 pm

Gut reactions are often wrong, but my gut reaction is that the above assessment might be complicated by smoothing concepts. For example, if a sea temperature can change from normal to abnormal and back again within a week, it is hard to capture this event faithfully in monthly or longer-period data.
Similarly with confidence bars. The daily variation might be smaller than the weekly, smaller than the monthly, smaller than the annual, smaller than the decadal. You have to compare horses with courses. All of which says, if we have hourly data, then use it as the input of the model unless the demands on computer power become too large. If so, go to daily, etc.
Are we confident that the statistical data distributions at each time scale have been studied to confirm if either traditional or non-customary methods of confidence calculation are applicable? If, as you note, there is a difference between ocean water cooling and heating rates, then one might not validly apply Poisson statistics to rates of change or rates of rates of change.
Hopefully these factors have been considered and my apprehension is misplaced. But then I wonder how they all missed that 1998 was a hot year globally.