Guest Post by Willis Eschenbach
Over at Judith Curry’s excellent blog she has a post on how to test the climate models. In response I wrote a bit about some model testing I did four years ago, and I thought I should expand it into a full post for WUWT. We are being asked to bet billions of dollars on computer model forecasts of future climate catastrophe. These global climate models, known as GCMs, forecast that the globe will warm extensively over the next century. In this context, it is prudent to take a look at how well the models have done at “hindcasting” historical temperatures, when presented with actual data from historical records.
I analysed the hindcasts of the models that were used in “Amplification of Surface Temperature Trends and Variability in the Tropical Atmosphere“, (PDF, 3.1Mb) by B. D. Santer et al. (including Gavin Schmidt), Science, 2005 [hereinafter Santer05].
In that study, results were presented for the first time showing two sets of observational data plus 9 separate GCM temperature “hindcasts” for the temperatures at the surface, troposphere, and stratosphere of the tropical region (20°N to 20°S) from 1979 to 2000. These models were given the actual 1979-2000 data for a variety of forcings (e.g., volcanic eruptions, ozone levels, see below for a complete list). When fed with all of these forcings for 1979-2000, the GCMs produced their calculated temperature hindcasts. I have used the same observational data and the same model results used by Santer. Here’s what their results look like:
Results from Santer05 Analysis. Red and orange (overlapping) are observational data (NOAA and HadCRUT2v). Data digitized from Santer05. See below for data availability.
The first question that people generally ask about GCM results like this is “what temperature trend did the models predict?”. This, however, is the wrong initial question.
The proper question is “are the model results life-like?” By lifelike, I mean do the models generally act like the real world that they are supposedly modeling? Are their results similar to the observations? Do they move and turn in natural patterns? In other words, does it walk like a duck and quack like a duck?
To answer this question, we can look at how the models stand, how they move, and how they turn. By how the models stand, I mean the actual month by month temperatures that the GCMs hindcast. How the models move, on the other hand, means the monthly changes in those same hindcast temperatures. This is the month-to-month movement of the temperature.
And how the models turn means the monthly variation in how much the temperatures are changing, in other words how fast they can turn from warming to cooling, or cooling to warming.
In mathematical terms, these are the hindcast surface temperature (ST), the monthly change in temperature [written as ∆ST/month, where the “∆” is the Greek letter delta, meaning “change in”], and the monthly change in ∆ST [∆(∆ST)/month]. These are all calculated from the detrended temperatures, in order to remove the variations caused by the trend. In the same manner as presented in the Santer paper, these are all reduced anomalies (anomalies less the average monthly anomalies) which have been low-pass filtered to smooth out slight monthly variations.
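For concreteness, here is a minimal sketch (not the code used for this analysis) of how those three quantities can be computed, assuming `st` is a monthly tropical surface temperature series with a pandas DatetimeIndex; the three-month running mean stands in for whatever low-pass filter was actually applied:

```python
import numpy as np
import pandas as pd

def prepare_series(st: pd.Series) -> dict:
    # Reduced anomalies: subtract each calendar month's average anomaly
    reduced = st - st.groupby(st.index.month).transform("mean")

    # Detrend: remove the linear least-squares trend over the whole period
    t = np.arange(len(reduced))
    slope, intercept = np.polyfit(t, reduced.values, 1)
    detrended = reduced - (slope * t + intercept)

    # Low-pass filter: a 3-month centred running mean (one possible choice)
    smoothed = detrended.rolling(3, center=True).mean().dropna()

    return {
        "ST": smoothed,                            # how the models "stand"
        "dST": smoothed.diff().dropna(),           # how they "move": delta-ST per month
        "ddST": smoothed.diff().diff().dropna(),   # how they "turn": delta(delta-ST) per month
    }
```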
How The Models Stand
How the models stand means the actual temperatures they hindcast. The best way to see this is a “boxplot”. The interquartile “box” of the boxplot represents the central half of the data (first to third quartiles). In other words, half the time the surface temperature is somewhere in the range delineated by the “box”. The “whiskers” at the top and bottom show the range of the rest of the data out to a maximum of 1.0 times the box height. “Outliers”, data points which are outside of the range of the whiskers, are shown as circles above or below the whiskers. Here are the observational data (orange and red for NOAA and HadCRUT2v surface temperatures), and the model results, for the hindcast temperatures. A list of the models and the abbreviations used is appended.
Figure 1. Santer Surface Temperature Observational Data and Model Hindcasts. Colored boxes show the range from the first (lower) quartile to the third (upper) quartile. NOAA and HadCRUT (red and orange) are observational data, the rest are model hindcasts. Notches show 95% confidence interval for the median. “Whiskers” (dotted lines going up and down from colored boxes) show the range of data out to the size of the Inter Quartile Range (IQR, shown by box height). Circles show “outliers”, points which are further from the quartile than the size of the IQR (length of the whiskers). Gray rectangles at top and bottom of colored boxes show 95% confidence intervals for quartiles. Hatched horizontal strips show 95% confidence intervals for quartiles and median of HadCRUT observational data. See References for list of models and data used.
Fig. 1 shows what is called a “notched” boxplot. The heavy dark horizontal lines show the median of each dataset. The notches on each side of each median show a 95% confidence interval for that median. If the notches of two datasets do not overlap vertically, we can say with 95% confidence that the two medians are significantly different. The same is true of the gray rectangles at the top and bottom of each colored box. These are 95% confidence intervals on the quartiles. If these do not overlap, once again we have 95% confidence that the quartiles are significantly different. The three confidence ranges of the HadCRUT data are shown as hatched bands behind the boxplots, so we can compare the models to the 95% confidence level of the data.
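In code, the notch comparison works something like this rough sketch (using the conventional 1.58 × IQR / √n notch formula from standard notched boxplots; the confidence intervals in the figures may have been derived somewhat differently):

```python
import numpy as np

def box_stats(x):
    # Median, quartiles, and the ~95% "notch" around the median
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    half_notch = 1.58 * (q3 - q1) / np.sqrt(len(x))
    return {"q1": q1, "median": med, "q3": q3,
            "median_ci": (med - half_notch, med + half_notch)}

def medians_differ(a, b):
    # Two medians differ at ~95% confidence if their notches do not overlap
    ca, cb = box_stats(a)["median_ci"], box_stats(b)["median_ci"]
    return ca[1] < cb[0] or cb[1] < ca[0]
```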
Now before we get to the numbers and confidence levels, which of these model hindcasts look “lifelike” and which don’t? It’s like one of those tests we used to hate to take in high school, “which of the boxplots on the right belong to the group on the left?”
I’d say the UKMO model is really the only “lifelike” one. The real world observational data (NOAA and HadCRUT) has a peculiar and distinctive shape. The colored boxes showing the interquartile range of the data are short. There are numerous widely spread outliers at the top, and a few outliers bunched up close to the bottom. This shows that the tropical ocean often gets anomalously hot, but it rarely gets anomalously cold. UKMO reproduces all of these aspects of the observations pretty well. M_medres is a distant second, and none of the others are even close. CCSM3, GISS-EH, and PCM often plunge low, way lower than anything in the observations. CM2.1 is all over the place, with no outliers. CM2.0 is only slightly better, with an oversize range and no cold outliers. GISS-ER has a high median, and only a couple outliers on the cold side.
Let me digress for a moment here and talk about one of the underlying assumptions of the climate modellers. In a widely quoted paper explaining why climate models work (Thorpe 2005, see References), the author states (emphasis mine):
On both empirical and theoretical grounds it is thought that skilful weather forecasts are possible perhaps up to about 14 days ahead. At first sight the prospect for climate prediction, which aims to predict the average weather over timescales of hundreds of years into the future, if not more, does not look good!
However the key is that climate predictions only require the average and statistics of the weather states to be described correctly and not their particular sequencing. It turns out that the way the average weather can be constrained on regional-to-global scales is to use a climate model that has at its core a weather forecast model. This is because climate is constrained by factors such as the incoming solar radiation, the atmospheric composition and the reflective and other properties of the atmosphere and the underlying surface. Some of these factors are external whilst others are determined by the climate itself and also by human activities. But the overall radiative budget is a powerful constraint on the climate possibilities. So whilst a climate forecast model could fail to describe the detailed sequence of the weather in any place, its climate is constrained by these factors if they are accurately represented in the model.
Well, that all sounds good, and if it worked, it would be good. But the huge differences between the model hindcasts and actual observations clearly demonstrate that in all except perhaps one of these models the average and statistics are not described correctly …
But I digress … the first thing I note about Fig. 1 is that the actual tropical temperatures (NOAA and HadCRUT) stay within a very narrow range, as shown by the height of the coloured interquartile boxes (red and orange).
Remember that the boxplot means that half of the time, the actual tropical surface temperature stayed in the box, which for the observations shows a +/- 0.1° temperature range. Much of the time the tropical temperature is quite stable. The models, on the other hand, generally show a very different pattern. They reflect much more unstable systems, with the temperatures moving in a much wider range.
The second thing I note is that the model errors tend to be on the hot side rather than the cold side. The PCM, GISS-EH, and CCSM3 models, for example, all agree with the observations at the first (cooler) quartile. But they are too hot at the median and the third (upper) quartile. This is evidence that upwards temperature feedbacks are being overestimated in the models, so that when the models heat up, they heat too far, and they don’t cool down either as fast or as far as the real tropics does. Again, of the nine models, only the UKMO model reproduces the observed behaviour. All the rest show a pattern that is too hot.
Third, I note that all of the model interquartile boxes (except UKMO) are taller than the actual data, regardless of the range of each model’s hindcast. Even models with smaller ranges have taller boxes. This suggests again that the models have either too much positive feedback, or too little negative feedback. Negative feedback tends to keep data bunched up around the median (short box compared to range, like the observational data), positive feedback pushes it away from the median (tall box, e.g. PCM, with range similar to data, much taller box).
Mathematically, this can be expressed as an index of total data range/IQR (Inter Quartile Range). For the two actual temperature datasets, this index is about 5.0 and 5.3, meaning the data is spread over a range about five times the IQR. All the models have indices in the range of 2.7-3.6, except UKMO, which has an index of 4.7.
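A sketch of that index, on the assumption that it is simply the total range of the detrended series divided by its interquartile range:

```python
import numpy as np

def range_iqr_index(x):
    # Total data range divided by the interquartile range (IQR)
    q1, q3 = np.percentile(x, [25, 75])
    return (np.max(x) - np.min(x)) / (q3 - q1)

# Observations score roughly 5.0 and 5.3; most models 2.7-3.6; UKMO about 4.7.
```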
Some of the models are so different from the data that one wonders why these are considered “state of the art” models. The CM models, both 2.0 and 2.1, give hindcast results that go both way hotter and way colder than the observational data. And all of the models but two (GISS-ER and UKMO) give hindcast results that go colder than the data.
How the Models Stand – Summary
- UKMO is the most lifelike. It matched all three quartile confidence intervals, as well as having a similar outlier distribution.
- M_medres did a distant second best. It only matched the lower quartile confidence interval, with both the median and upper quartile being too hot.
- The rest of the models were all well out of the running, showing distributions which are strikingly different from the observational data.
- Least lifelike I’d judge to be the two CM models, CM2.0 and 2.1, and the PCM model.
Since we’ve seen the boxplot, let’s take a look at the temperature data for two lifelike and two un-lifelike models, compared with the observational data. As mentioned above, there is no trend because the data is detrended so we can measure how it is distributed.
Figure 2. Surface Temperature Observational Data and Model Hindcasts. Two of the best on the left, two of the worst on the right. See References for list of models and data used.
Note how for long periods (1979-82, 1990-97) the actual tropical surface temperature hardly varied from zero. This is why the box in the boxplot is so short, with half the data within +/- 0.1°C of the average.
The most lifelike of the models (UKMO and M_medres), while not quite reproducing this behaviour, came close. Their hindcasts at least look possible. The CM2.1 and PCM models, on the other hand, are wildly unstable. They hindcast extreme temperatures, and spend hardly any of their time in the +/- 0.1° C range of the actual temperature.
How The Models Move
How the models move means the month-to-month changes in temperature. The tropical ocean has a huge thermal mass, and it doesn’t change temperature very fast. Here are the boxplots of the movement of the temperature from month to month:
Figure 3. Surface Temperature Month-to-Month Changes (∆ST/month), showing Data and Model Hindcasts. Boxes show the range from the first quartile to the third quartile (IQR, or inter-quartile range). Notches show 95% confidence interval for the median. Circles show “outliers”, points which are further from the quartile than the size of the IQR (length of the whiskers). See References for list of models and data used.
Here in the month-to-month temperature changes, we see the same unusual pattern we saw in Fig. 1 of the temperatures. The observations have a short interquartile box compared to the range of the data, and a number of outliers. In this case there are about equal numbers of outliers above and below the box. The interquartile range (IQR, the box height) of tropical temperature change is about +/- 0.015°C per month, indicating that half of the time the temperature changes that little or less. The total range of the temperature change is about +/- 0.07°C per month. It is worth noting that in the 21-year record, the tropical surface never warmed or cooled faster than 0.07°C per month, so models predicting faster warming or cooling than that must be viewed with great suspicion.
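That observation suggests a very simple screen, sketched here with `model_dst` and `obs_dst` standing for the detrended monthly changes described above:

```python
import numpy as np

def exceeds_observed_rate(model_dst, obs_dst):
    # Flag any model whose fastest monthly warming or cooling exceeds anything
    # in the observed record (about 0.07 deg C/month over 1979-2000)
    return np.max(np.abs(model_dst)) > np.max(np.abs(obs_dst))
```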
Although all but two of the models (CM2.0 and CM2.1) matched all three confidence intervals, there are still significant differences in the distribution of the hindcasts and the observations. The most lifelike is M_medres, with UKMO second. GISS-ER (purple) is curious, in that the month-to-month movements are all very small, never more than +/- 0.03 degrees per month. It never hindcasts anything like the larger monthly changes that we see in the actual data.
Next, consider the speed at which the ocean heats and cools. In the real world, as shown by the data, the heating and cooling rates are about the same. This makes sense, as we would expect the tropical ocean to radiate heat at something like the same rate it gains it. It has to lose the heat it gains at night by the next morning for the temperature to stay the same over several days.
Now look at the data distribution for GISS-EH, CM2.0 or CM2.1. They rarely heat up fast, but they cool down very fast (short whisker on top, long whisker plus outliers on bottom). Slow heating and fast cooling, that doesn’t make physical sense. The maximum heating rate for GISS-EH (0.03°C/mo.) is less than half the maximum heating rate of the actual tropics. PCM has the same problem, but in the other direction, heating up much faster than it cools down.
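The asymmetry can be put as a single number, sketched here for any ∆ST series:

```python
import numpy as np

def warm_cool_asymmetry(dst):
    # Ratio of the fastest monthly warming to the fastest monthly cooling:
    # roughly 1 for the observations, well below 1 for GISS-EH and the CM models,
    # well above 1 for PCM (per the boxplots above)
    dst = np.asarray(dst)
    return dst[dst > 0].max() / -dst[dst < 0].min()
```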
How the Models Move – Summary
- M_medres is the most lifelike. It matched all three quartile confidence intervals, as well as having a similar outlier distribution.
- UKMO did a credible second best. However, the ranges of UKMO and M_medres were too small.
- The rest of the models were well out of the running, showing distributions which are strikingly different from the observational data.
- The least lifelike? I’d judge that to be the two CM models, CM2.0 and 2.1, and the CCSM3 model.
Let’s take a look at these winners and losers at reproducing the changes in temperature (∆ST). As mentioned above, the data is detrended so we can see how it is distributed.
Figure 4. Surface Temperature Observational Data and Model Hindcast Delta ST. Shows monthly changes in the temperature. Two of the best, two of the worst. See References for list of models and data used.
Once again, the boxplots correctly distinguish between lifelike and un-lifelike models. The large and extremely rapid temperature drops of the CM2.1 model are clearly unnatural. The GISS-ER model, on the other hand, hardly moves from month to month and is unnatural in the other direction.
How the Models Turn
Acceleration is the rate of change of speed. In this context, speed is the rate at which the tropical temperatures warm and cool. Acceleration is how fast the warming or cooling rate changes. It measures how fast a rising temperature can turn to fall again, or how fast a falling temperature can turn into a rising temperature. Since acceleration is the rate of change (∆) of the change in temperature (∆ST), it is notated as ∆(∆ST). Here are the results.
Figure 5. Surface Temperature Month-to-Month Changes in ∆ST, showing Data and Model Hindcasts. Boxes show the range from the first quartile to the third quartile. Notches show 95% confidence interval for the median. Whiskers show the range of data out to the interquartile box height. Circles show outliers. See References for list of models and data used.
Using the 95% confidence interval of the median and the quartiles, we would reject CM2.1, CM2.0, CCSM3, GISS-ER, and UKMO. PCM and M_medres are the most lifelike of the models. UKMO and GISS-ER are the first results we have seen which have significantly smaller interquartile boxes than the observations.
CONCLUSIONS
The overall conclusion from looking at how the models stand, move, and turn is that the models give results that are quite different from the observational data. None of the models were within all three 95% confidence intervals (median and two quartiles) of all of the data (surface temperatures ST, change in surface temps ∆ST, and acceleration in surface temps ∆∆ST). UKMO and M_medres were within 95% confidence intervals for two of the three datasets.
A number of the models show results which are way too large, entirely outside the historical range of the observational data. Others show results that are much less than the range of observational data. Most show results which have a very different distribution from the observations.
These differences are extremely important. As the Thorpe quote above says, before we can trust a model to give us future results, it first needs to be able to give hindcasts that resemble the “average and statistics of the weather states”. None of these models are skillful at that. UKMO does the best job, and M_medres comes in third best, with nobody in second place. The rest of the models are radically different from the reality.
The claim of the modellers has been that, although their models are totally unable to predict the year-by-year temperature, they are able to predict the temperature trend over a number of years. And it is true that for their results to be believable, they don’t need to hindcast the actual temperatures ST, monthly temperature changes ∆ST, and monthly acceleration ∆∆ST.
However, they do need to hindcast believable temperatures, changes, and accelerations. Of these models, only UKMO, and to a much lesser extent M_medres, give results that by this very preliminary set of measures are at all lifelike. It is not believable that the tropics will cool as fast as hindcast by the CM2.1 model (Fig. 3). CM2.1 hindcasts the temperature cooling at three times the maximum observed rate. On the other hand, the GISS-ER model is not believable because it hindcasts the temperature changing at only half the range of changes shown by observation. Using these models in the IPCC projections is extremely bad scientific practice.
There is an ongoing project to collect satellite-based spectrally resolved radiances as a common measure between models and data. Unfortunately, we will need a quarter century of records before we can even start analysing, so that doesn’t help us now.
What we need now is an agreed upon set of specifications that constitute the mathematical definition of “lifelike”. Certainly, at a first level, the model results should resemble the data and the derivatives of the data. As a minimum standard for the models, the hindcast temperature itself should be similar in quartiles, median, and distribution of outliers to the observational data. Before we look at more sophisticated measures such as the derivatives of the temperature, or the autocorrelation, or the Hurst exponent, or the amplification, before anything else the models need to match the “average and statistics” of the actual temperature data itself.
By the standards I have adopted here (overlap of the 95% confidence notches of the medians, overlap of the 95% confidence boxes of the quartiles, similar outlier distribution), only the UKMO model passed two of the three tests. Now you can say the test is too strict, that we should go for the 90% confidence intervals and include more models. But as we all know, before all the numbers and the percentages when we first looked at Figure 1, the only model that looked lifelike was the UKMO model. That suggests to me that the 95% standard might be a good one.
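Spelled out as an explicit pass/fail check, the standard looks roughly like this (the notch uses the conventional 1.58 × IQR / √n formula and the quartile intervals are bootstrapped here, which may not match exactly how the figures were produced; the outlier-distribution criterion is left as a visual judgment):

```python
import numpy as np

def intervals_overlap(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

def notch(x):
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    h = 1.58 * (q3 - q1) / np.sqrt(len(x))
    return (med - h, med + h)

def quartile_ci(x, q, n_boot=2000, seed=0):
    # Bootstrap 95% confidence interval for a quartile
    rng = np.random.default_rng(seed)
    boots = [np.percentile(rng.choice(x, size=len(x)), q) for _ in range(n_boot)]
    return tuple(np.percentile(boots, [2.5, 97.5]))

def passes_standard(model, obs):
    # Median notches must overlap AND both quartile confidence intervals must overlap
    return (intervals_overlap(notch(model), notch(obs))
            and intervals_overlap(quartile_ci(model, 25), quartile_ci(obs, 25))
            and intervals_overlap(quartile_ci(model, 75), quartile_ci(obs, 75)))
```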
But I’m not insisting that this be the test. My main interest is that there be some test, some way of separating the wheat from the chaff. Some, indeed most of these models are clearly not ready for prime time — their output looks nothing like the real world. To make billion dollar decisions on an untested, unranked suite of un-lifelike models seems to me to be the height of foolishness.
OK, so you don’t like my tests. Then go ahead and propose your own, but we need some suite of tests to make sure that whether or not the output of the models is inaccurate, it is at least lifelike … because remember, being lifelike is a necessary but not sufficient condition for the accurate forecasting of temperature trends.
My best to everyone,
w.
DATA
The data used in this analysis is available here as an Excel workbook.
REFERENCES
Santer, B. D., et al., 2005 (September 2), “Amplification of Surface Temperature Trends and Variability in the Tropical Atmosphere”, Science.
Thorpe, Alan J., 2005, “Climate Change Prediction - A challenging scientific problem”, Institute of Physics, 76 Portland Place, London W1B 1NT.
MODELS USED IN THE STUDY
National Center for Atmospheric Research in Boulder (CCSM3, PCM)
Institute for Atmospheric Physics in China (FGOALS-g1.0)
Geophysical Fluid Dynamics Laboratory in Princeton (GFDL-CM2.0, GFDL-CM2.1)
Goddard Institute for Space Studies in New York (GISS-AOM, GISS-EH, GISS-ER)
Center for Climate System Research, National Institute for Environmental Studies, and Frontier Research Center for Global Change in Japan (MIROC-CGCM2.3.2(medres), MIROC-CGCM2.3.2(hires))
Meteorological Research Institute in Japan (MRI-CGCM2.3.2).
Canadian Centre for Climate Modelling and Analysis (CCCma-CGCM3.1(T47))
Meteo-France/Centre National de Recherches Meteorologiques (CNRM-CM3)
Institute for Numerical Mathematics in Russia (INM-CM3.0)
Institute Pierre Simon Laplace in France (IPSL-CM4)
Hadley Centre for Climate Prediction and Research in the U.K. (UKMO-HadCM3 and UKMO-HadGEM1).
FORCINGS USED BY THE MODELS
Good work, Willis, hats off to you! This post deserves to be placed in Rick Werme’s “WUWT Classics”.
At this stage in their development it is premature to call these programs “Models”; they do not meet the basic requirements to be so classified or cited.
PS: I doubt that anyone would object to their being called billion dollar “Gigos”. Perhaps we’ll develop a first, true “Model” by the mid-Century mark in 40 years. We have been progressing rather well of late. (Or were, before The Great Recession.)
One single number representing the temperature of a whole planet?
Meaningless.
Predicting a meaningless number 100 years from now?
With feedback-loops and couplings no one has the full picture of?
Convection? Jet streams? Sea-currents? Clouds?
Even more meaningless.
The confidence levels referenced, whether 50%, 75%, 90%, or 95%, are ludicrously low for such crucial predictions. Given the numerous known and un-counted unknown sources of bias and contamination, a far more robust standard is required. That models of this type can almost never meet such standards is not an excuse: it’s a reflection of their nature and utility.
Nature Admits Scientists are Computer Illiterate (Nature)
The fact that every single climate model is different is irrefutable proof that the science is not settled.
Standards? A litmus test of sorts? Willis, you’re suggesting climate scientists adhere to standards, when you and I both know they have none. There is no ice extent too low to predict, no temperature too high to forecast, no alarm too loud to sound. They could no more adopt self-restraining standards than a zebra could change his stripes or a leopard his spots. A good thought though.
Backcasting tells you nothing about the utility of any model, except that if the model can’t backcast, it is demonstrably wrong. This is the garden path financial modelers walk down every day: if you add enough corrections, adjustments, smoothing, etc., you can replicate *any* complex historical wave form. If a model was completely useless at backcasting, it would never be published. The thing is, data mining and heuristics can backcast as well as deterministic physics.
When you think of backcasting, you have to ask ‘how accurate is the historical data and how many accurate data points do we have for that data at that time?’ Financial and economic models have an enormous amount of long-term, exactly accurate data, and they have never been shown to have any predictive value, despite billions of dollars spent on the task every year.
If there were a model which showed, for example, an anomaly in the past that wasn’t known, then further research showed that, indeed, that anomaly actually occurred, that would be interesting, but not proof. If I recall correctly, when historic climate data is refined, the models are ‘tweaked’ and rerun to backcast correctly – if nothing else, this shows they are heuristic, not deterministic.
Regardless, the only way of testing a model is whether the model is capable of making accurate (within uncertainty bounds) predictions of the future. Most natural systems, and clearly the climate, are complex, non-linear, chaotic systems. (By the way, chaotic does not actually mean random.) The climate has a huge number of unknowns, even if you take as gospel all the ‘true facts’ of the AGW hypothesis. Even the knowns have limited precision.
The nice thing about a climate model is that it is a long, long term prediction. Which means any discrepancy between the model and reality can be explained away as confounding weather with climate. Nice – remember the Jehovah’s Witnesses predicted the end of the world in 1976. Didn’t happen, but it didn’t hurt business. If you are going to make predictions, try to make sure they don’t happen within a human lifetime. This would not matter if this was simply some interesting scientific theory. Climate models drive policy despite the fact there is no reason whatsoever to assume they have any predictive ability whatsoever. I’ve been told that ‘it’s the best information we’ve got’. This misses the point: bad information is worse than no information.
By the way – I do not understand the point of running statistics (average, mean, etc.) on a group of climate models. I think I understand the analysis presented above, but the first graph shows a ‘median’ prediction. I don’t understand the mathematical relevance of these figures any more than averaging the number of times a chicken clucks with random noise. It’s a bit like averaging the guesses (sorry – estimates) of economists regarding unemployment statistics.
That being said, if there was science behind climate models, then I would expect they would all predict (and backcast) the same thing. ‘Models’ of gravity do not arrive at divergent conclusions.
Thanks Willis for your excellent interpretation of a difficult topic. From your analysis it seems that the current generation of computer climate models don’t even get to first base by passing your ‘reasonableness’ tests for the bounds of even a possible Earth climate.
Another interesting question is ‘Does a Global Temperature Exist?’, and this quote from an article published in the Journal of Non-Equilibrium Thermodynamics by Christopher Essex, Ross McKitrick, and Bjarne Andresen indicates that the topic is still open to debate:-
“There is no global temperature. The reasons lie in the properties of the equation of state governing local thermodynamic equilibrium, and the implications cannot be avoided by substituting statistics for physics.
Since temperature is an intensive variable, the total temperature is meaningless in terms of the system being measured, and hence any one simple average has no necessary meaning. Neither does temperature have a constant proportional relationship with energy or other extensive thermodynamic properties…”
Full paper here:-
http://www.uoguelph.ca/~rmckitri/research/globaltemp/GlobTemp.JNET.pdf
Perhaps no surprise the results of all the global climate models are so poor!
Willis is too diplomatic. When you take a model, algorithm or whatever, and tune the parameters until you get the best possible hindcast, what you have done is known as curve fitting. A complex enough system can be made to hindcast the stock market, but will have very little predictive power. The fact that the best that could be achieved at ‘predicting the past’ is an uninspiring ‘ok’ from one single model, is all I need to know about their (lack of) predictive skill.
Surely the accuracy of these models is even worse than has been suggested above. The baseline they are being measured against has been demonstrated to be radically flawed. It has been ‘fixed’. The numbers for the early 20th century and before have been statistically adjusted downwards, with the recent past statistically adjusted upwards. This is being done to create an enhanced sense of ‘warming’.
Surely this will affect the accuracy of these models’ comparisons even more.
Willis,
Has this been used to evaluate models:
Model outputs should be evaluated on how parallel they run to measurements.
Please show the model outputs and actual measurements integrated (delta T)
This will help comparison of when the model is trending opposite the actual temperature.
What might appear to be the better model with one evaluation, might not be the best with another.
This looks like a nice, careful, piece of work but it has, I think, some conceptual and practical problems.
On the conceptual side we need to be concerned that these models are highly incestuous with respect to the data. Specifically, they’re continually adjusted to better fit the data – so checking their reliability through hindcasting merely tells you directly how good the various maintainers are at tinkering and indirectly which data set, with which adjustments, were preferred by the people involved for each period during which they worked on it.
On the practical side the reality is that we don’t have reliable retrospective data – so a model which hindcasts some average with near perfect precision over a period of some years is obviously inaccurate, because we know the data is wrong. Since we don’t have good data, we don’t know how far off the data we do have is, but in this context that doesn’t really matter: whether x is 0.0001 degrees F per acre/year or plus or minus 2 degrees C per continent/hour, the point is that the better the model does at predicting bad data, the worse we should think it to be.
“”””” “Amplification of Surface Temperature Trends and Variability in the Tropical Atmosphere“, “””””
So I haven’t read the paper yet; or any of the posts, but I was wondering just what the blazes this title means.
It would appear the paper deals with two different subjects:- Tropical Atmosphere Variability; and Surface Temperature Trend Amplification.
Do all Universities have departments that teach how to write gobbledegook titles for the publish or perish papers; because you only have to read any weekly issue of SCIENCE to see that that skill is very widespread.
So I wonder just what aspects of the Tropical Atmosphere are varying enough to bother studying. Is the N2/O2 proportion changing enough to comment on or maybe the isotopic ratios of those constituents of the tropical atmosphere? It would seem to me that global mixing works well enough that monitoring how the atmosphere in just one region varies from time to time, would not be of much use.
And then there’s that Surface Temperature Trend and its amplification. Well do they really mean “changes” in the surface temperature trend; it seems odd to talk of amplification of a trend; well unless it is simply that the trend line slope changes.
Well it’s all very curious; so I guess I’ll take a chance and read the paper, and see what comments other readers had to say.
TerryS says:
December 2, 2010 at 5:52 am
Re: Murray Grainger
Mike Haseler says:
December 2, 2010 at 5:54 am
Murray Grainger says: “Apples and Oranges folks.
Sorry chaps, I shall make my sarcasm more obvious in future.
For this to be true, the errors over time would have to average out to zero. In other words, the cumulative effects of the errors in the model would have to cancel themselves out.
It is far more likely that the cumulative effects of the model errors over time are to amplify the errors. The errors are compounded, not averaged out.
A model makes its run for a year. Then it takes that output and uses it as input for the next run. Each time, the errors get larger.
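A toy illustration of that point (this is the chaotic logistic map, not a GCM, and is only meant to show how a tiny error in an iterated calculation grows rather than averaging out):

```python
def logistic(x, r=3.9):
    # A simple chaotic iterated system standing in for any model run fed back on itself
    return r * x * (1 - x)

x_true, x_model = 0.500000, 0.500001   # a one-in-a-million initial error
for step in range(1, 31):
    x_true, x_model = logistic(x_true), logistic(x_model)
    if step % 10 == 0:
        print(f"step {step:2d}: error = {abs(x_model - x_true):.4f}")
# The error grows with each iteration and reaches the size of the signal itself
# within a few dozen steps.
```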
very nice work, Willis …
may I suggest you and some of the more statistical guys here go over it for errors/changes/clarifications, and you submit it to some AGW minded journal … ?
that might help the discussion, not with the fanatics, but with the reasonable …
A serious issue with hindcasting is the quality and methodology of the historical surface data.
The surface station and satellite data need to be cross-calibrated on a station-by-station basis to widen the period of high-quality data. Doing a correlation study on the end product (GMST) is nowhere near the same thing as figuring out exactly the mathematical relationship between HCN Station #123456 and the corresponding satellite gridcell.
It is nice when one can say the surface stations and the satellite data agree in general on the averaged data. It is far better when one can say: “Station #123456 reads 2.4±0.03C higher than the satellite data over 1978-now.” This examination would also allow the methodical evaluation of UHI effects and the other adjustments that occur during the satellite period.
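A sketch of that per-station comparison, with `station` and `satellite` as aligned monthly series (°C) for one station and its matching gridcell (all names and numbers here are illustrative):

```python
import numpy as np

def station_offset(station, satellite):
    # Mean offset of the station relative to the satellite, and its standard error
    diff = np.asarray(station) - np.asarray(satellite)
    return diff.mean(), diff.std(ddof=1) / np.sqrt(len(diff))   # e.g. (2.4, 0.03)
```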
“”””” Tenuc says:
December 2, 2010 at 8:25 am
…………………………
Another interesting question is ‘Does a Global Temperature Exist?’, and this quote from an article published article in Journal of Non-Equilibrium Thermodynamics by Christopher Essex, Ross McKitrick, and Bjarne Andresen, indicates that the topic is still open to debate:-
“There is no global temperature. The reasons lie in the properties of the equation of state governing local thermodynamic equilibrium, and the implications cannot be avoided by substituting statistics for physics. “””””
Well this is an often posed question; usually drawing the answer above most often; in fact no less a luminary than Prof Richard Lindzen recently stated that in his five minutes of commentary to that pitiful lame duck Congressional committee hearing; that insulted all of us.
Now I agree with the sentiment; but not pedantically with the answer.
To me it is quite obvious. Mother Gaia has a thermometer in each and every single atom or molecule on the planet; or shall we say just the near surface area. So she knows what each thermometer says, and she can read every one of them instantaneously; every attosecond if she wants to. Then it is trivial for her to add them all up, and divide by the number of thermometers, and, voilà! she has the average temperature of that limited near surface region at that time.
So it exists; but of course, the real issue, is can WE measure it, since MG is never going to tell us the answer.
And of course we can’t measure it; and it’s a waste of time anyway; because as you say Tenuc it has no connection to energies or anything else we would want to know.
By the way; for the legal disclaimer:- I sure hope Y’alls don’t figgah me as one of those Gaia kooks. I just subjected myself last night to watching my son’s Blu-ray copy of AVATAR; he’s a video/movie student, so he studies such films from their artistic and technique points of view. It was certainly a marvel of out-of-the-box science fiction writing and special effects wonderment; and I was impressed with all of that.
Apart from that it is the most blatant piece of political propaganda crap I have ever seen; and of course the whole GAIA concept of an integrated network of interdependent cells of a single organism is central to the totally nude message. Unobtainium is of course the evil oil under the desert sands of Arabia; and the Arabs are the set upon innocents trying to eke out a peaceful living by just eating sand, and minding their own business; which is hard to do given the constant assaults by the evil Americans.
So Cameron is evidently a spokesperson for the one world movement; and I’m surprised that he is able to get away with such blatant plagiarism; without giving credit to the Gaia worshipers.
No; my Mother Gaia, is just a Super Maxwell’s Demon; that is able to observe and note that which the laws of physics don’t really allow US to observe and note; and of course she can never tell us her findings; but we can take comfort in the knowledge that the actual state of the planet, or say its climate, is ALWAYS exactly that, which we could (in principle) compute if we were so fortunate (or maybe unfortunate), to have all that information that Mother Gaia has but we can never know.
So don’t bother with cleaning out a room at the funny farm for me; I am quite sane; I just use a different toolbox from some others.
But back to the subject; the global mean temperature, if we could measure it, carries no more scientific knowledge or significance than does the mean telephone number contained in the latest edition of the Manhattan Telephone directory; so Lindzen and others are correct in saying there’s no such thing; and it would add nothing to our knowledge of energy flows in our climate system if we DID know such a number. And if you don’t like the number, what would you change it to, if we had that power; which fortunately we do not.
gnarf says:
December 2, 2010 at 5:15 am
Even an accurate model (if it exists) will fail your tests. These models are supposed to forecast temperature trend-> they should pass your first test.
But the only way to pass your 2nd and 3rd tests is to have a statistically accurate temperature derivative and second derivative… very complex (increased chaos every time you differentiate), not sure it can be done at all…and it should have very limited influence on the trend, so I understand perfectly that most models don’t even try to do such a thing and focus on forecasting the trend.
###################
Agreed. However as Willis said if we want to set our own list of tests we should.
1. There should be a suite of tests.
2. These tests should be related to the USE of the models.
3. Criteria need to be established prior to the testing.
The concept is simple. When we look at the damage mechanisms of climate change we can, for example, call out these three: increased temperature, drought and floods, and increased sea level. Simply: temperature trend, precipitation extremes, and sea level rise.
It would of course be nice to get the variables correct that Willis chose. But one can get the trend correct and miss on all the variables that Willis selected, especially in a 21-year period, especially with some actual climate cycles being longer than the observation period.
It’s silly to continue to use models that perform badly. It’s also weird that the models don’t use the same inputs.
Poptech says:
December 2, 2010 at 8:14 am
As a general rule, researchers do not test or document their programs rigorously, and they rarely release their codes, making it almost impossible to reproduce and verify published results generated by scientific software, say computer scientists. […]
Thanks Poptech – this observation has been my view for some time. Code documentation is, by and large, a foreign concept to many of these GCM development groups. Some do make an effort though (like NCAR)…
The whole idea of “reproducibility” brings up a good question in light of Willis’ present article. Each of the codes cited above purports to solve the basic thermo-fluid dynamics equations and related submodels required to “accurately” simulate “climate”. Each starts with the SAME initial conditions, uses the SAME boundary conditions, and the SAME forcings. Yet, the solutions plotted in the first graph show wildly different time histories (different amplitudes and phase). Why is that? Well, more than likely each code has different assumptions for how they assimilate data for initial conditions, different submodels for everything from turbulence to sea ice, different ways of discretizing and solving the basic equations (e.g. finite difference versus spectral methods), different ways of handling the boundary conditions and forcings, etc. Moreover, the codes are so logically complex (and likely full of bugs unique to a given code) that it is very unlikely that an independent model development group could take the methods described by another group and “reproduce” their results precisely.
Great Post.
The underlying issue is that the whole concept of radiative forcing is invalid. The so-called ‘radiative forcing constant’ for CO2, a 2/3 C increase in ‘surface temperature’ for each increase of 1 W in the downward LWIR flux, is just nonsense. It is simply the effect of ocean surface temperatures, urban heat islands and a lot of temperature data ‘fixing’ on the meteorological surface temperature record.
Until these models are upgraded to include realistic solar ocean heating effects and radiative forcing is removed, they will be incapable of predicting any kind of climate change.
Garbage in, Gospel out.
@Mingy
> Backcasting tells you nothing about the utility of any model, except if the
> model can’t backcast your model is demonstrably wrong. This is the
> garden path financial modelers walk down every day: if you add
> enough corrections, adjustments, smoothing, etc., you can
> replicate *any* complex historical wave form.
You’re throwing out the baby with the bath. Backcasting is very useful for validating a model, because you don’t have to “wait for the future to happen” to do the validation.
It’s still “blind testing” in the sense that no modelers can possibly train their models on every piece of existing data. In any case, most of us set aside some “blind data” for testing anyway. And those corrections you mention apply to the future too, which we’re also “blind” to.
They’re using a small fraction of the data to predict the rest. That’s utility in my book.
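In code, the “set aside blind data” idea might look like this rough sketch, where the split fraction and the RMSE score are arbitrary illustrative choices:

```python
import numpy as np

def blind_test(series, fit, predict, train_frac=0.8):
    # Tune only on the early part of the record, then score the forecast
    # against the withheld later part
    series = np.asarray(series)
    n_train = int(len(series) * train_frac)
    params = fit(series[:n_train])                     # tuning never sees the holdout
    forecast = predict(params, len(series) - n_train)  # forecast the withheld span
    return float(np.sqrt(np.mean((forecast - series[n_train:]) ** 2)))
```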
Steve Mosher:
Well said. Willis has identified and explained some very reasonable and “necessary” tests for evaluating models – tests which most GCMs seem to fail. By their very nature these types of numerical tests are not “sufficient” to evaluate a model because they say little about the actual physical processes embedded in the model. In other words, Willis makes no statement as to the scientific reasonableness of the UKMET model and only states that its hindcast compares favorably with actual observations. The UKMET model may, for example, boil down to an extrapolation of past temperatures plus some highly constrained noise. Without a rigorous assessment of the model any analysis is incomplete. However, models that fail to meet Willis’ tests are inherently suspect.
It seems to me to be perfectly reasonable to require that any proposed model state the results of Willis’ type of tests in a clear way so that those using its outputs should be aware of its limitations.
This is just an example of an element of the validation suite that should be applied to each model before it is allowed to be used.
As a modeller (hydrological / hydraulic), my work has to be calibrated against observed data (numerous locations) and must be within stated tolerances. The model must then be tested against discrete events not used for calibration (to check it hasn’t been force-fitted), run with historical events and compared with known observations / photos etc., and I must then run a long period of Time Series Rainfall (TSR) to check that the model can account for seasonal changes such as Soil Moisture Deficit (SMD) or Evapotranspiration changes and correctly predict the right response for each rainfall event during this period.
This still doesn’t ensure a completely “robust” model, but it gives us reasonable confidence that it’s fit for purpose. The model is then used to calculate flood extents for a design rainfall event, or to develop solutions, sometimes costing millions of $.
As I do this commercially, and often millions of $ can ride on the model results, we have to carefully test and demonstrate that the model is of use.
The main things we test when “hindcasting” are that peak flow rates match, peak depths match, volumes match, the timing of the peaks is correct, and there must be a good VISUAL fit. I would expect the same criteria to be applied to GCMs given what’s riding on them; unfortunately, govt-funded scientific modellers are often only interested in knocking out a paper rather than gathering months of real-life data and spending months calibrating and testing a model. They don’t have the same commercial pressures, and I think that’s the issue with GCMs.
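For comparison, the kind of hindcast acceptance checks described above can be written down very compactly (an illustrative sketch; real project specifications set the tolerances, and the visual fit is still judged by eye):

```python
import numpy as np

def hydrograph_checks(model, obs, dt_hours=1.0):
    # Peak magnitude, total volume, and peak timing, model vs. observed
    model, obs = np.asarray(model), np.asarray(obs)
    return {
        "peak_error_pct": 100 * (model.max() - obs.max()) / obs.max(),
        "volume_error_pct": 100 * (model.sum() - obs.sum()) / obs.sum(),
        "peak_timing_error_hours": dt_hours * (int(model.argmax()) - int(obs.argmax())),
    }
```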
I would expect them to select at least 1000 temperature stations (all well sited with long records) for calibration, and I would expect most of them to match the timing of ups and downs, the peaks, the dips, the overall increase or decrease in temp / energy and provide a good visual fit. The same should also be done for SST based on a number of grid boxes, and for rainfall and incoming radiation. If a model cannot replicate the current or known temperature history on a regional scale, they are of little use to anyone and I will remain unconvinced.
It’s like me telling my client that the peak of the flood, the depth and the duration don’t match observed at all, but it’s OK as the overall volumes match, so it’s “robust”, despite the fact it would lead to poorly designed defences and the deaths of innocent people!!!
I would be happy to write an article for WUWT on hydrological / hydraulic model calibration if of interest to anyone – for comparison of the GCM approach with an alternative well established modelling field.