Guest Post By Willis Eschenbach
Back in 2007, Jeffrey Kiehl first pointed out a curious puzzle about the climate models, viz: (emphasis mine)
 A review of the published literature on climate simulations of the 20th century indicates that a large number of fully coupled three dimensional climate models are able to simulate the global surface air temperature anomaly with a good degree of accuracy [Houghton et al., 2001]. For example all models simulate a global warming of 0.5 to 0.7C over this time period to within 25% accuracy. This is viewed as a reassuring confirmation that models to first order capture the behavior of the physical climate system and lends credence to applying the models to projecting future climates.
 One curious aspect of this result is that it is also well known [Houghton et al., 2001] that the same models that agree in simulating the anomaly in surface air temperature differ significantly in their predicted climate sensitivity. The cited range in climate sensitivity from a wide collection of models is usually 1.5 to 4.5C for a doubling of CO2, where most global climate models used for climate change studies vary by at least a factor of two in equilibrium sensitivity.
The question is: if climate models differ by a factor of 2 to 3 in their climate sensitivity, how can they all simulate the global temperature record with a reasonable degree of accuracy? Kerr and S. E. Schwartz et al. (Quantifying climate change–too rosy a picture?, available at www.nature.com/reports/climatechange, 2007) recently pointed out the importance of understanding the answer to this question.
Kiehl posed what I thought at the time was a very interesting and important question in a paper called Twentieth century climate model response and climate sensitivity, and he got partway to the answer. He saw that as the forcings went up, the sensitivity went down, as shown in his Figure 1. He thought that the critical variable was the total amount of forcing used by the model, and that the sensitivity was inversely and non-linearly proportional to that total forcing.
Figure 1, with original caption, from Kiehl 2007.
However, my findings show that the models' climate sensitivity can be directly derived from the model forcings and the model results. Sensitivity (transient or equilibrium) is directly proportional to the ratio of the trend of the temperature to the trend of the forcing. This makes intuitive sense: the smaller the trend of the forcing, the greater the trend ratio, and the smaller the forcing, the more you'll have to amplify it to match the 20th-century temperature trend, so you need greater sensitivity. I have added two new models to my previous results: the MIROC model from Japan, and a most curious and informative dataset, the Crowley thousand-year hindcast (paywalled here, data here). The Crowley study used what they describe as a linear upwelling/diffusion energy balance model. As Figure 2 shows, just as with my previous findings, the climate sensitivity of both the MIROC model and the much simpler Crowley model is given by the same simple function of the ratio of the trends.
Figure 2. Equilibrium climate sensitivity versus trend ratio (trend of results/trend of forcings). Equilibrium climate sensitivity is calculated in all cases as being 40% higher than transient climate response, per the average of the results of Otto, which cover the last four decades of observations.
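To make the trend-ratio calculation concrete, here is a minimal sketch in Python (rather than the R I use for the actual analysis). The forcing and temperature series are made-up illustrative numbers; the sketch assumes the conventional ~3.7 W/m2 of forcing for a doubling of CO2, plus the 40% Otto-based markup from transient to equilibrium sensitivity mentioned in the caption above:

```python
import numpy as np

# Hypothetical annual series: forcing (W/m2) and modeled temperature anomaly (deg C).
# These numbers are illustrative, not any model's actual output.
years = np.arange(1900, 2001)
forcing = 0.02 * (years - 1900)   # made-up: +2.0 W/m2 over the century
temps = 0.008 * (years - 1900)    # made-up: +0.8 deg C over the century

# Least-squares linear trends (per year)
f_trend = np.polyfit(years, forcing, 1)[0]
t_trend = np.polyfit(years, temps, 1)[0]

# Warming per W/m2, scaled to a CO2 doubling (~3.7 W/m2 by convention)
lam = t_trend / f_trend   # deg C per W/m2
tcr = lam * 3.7           # transient climate response for 2xCO2
ecs = tcr * 1.4           # ECS taken as ~40% above TCR, per Otto
```

With these made-up trends, the trend ratio alone pins down the sensitivity; swap in any model's forcing and temperature columns and the same three lines give that model's TCR and ECS.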
This conclusion about the relationship of the forcing trend to the climate sensitivity is one outcome of the discovery that there is a one-line equation with which we can replicate the global average temperature results from any climate model. Strange but true: functionally, all that the climate models do to forecast the global average surface temperature is lag and resize the forcing. That's it. Their output is a simple lagged linear transformation of their input. This is true of individual climate models as well as the average of "ensembles" of models. Their output can be replicated, with a correlation of 0.99 or so, by a simple one-line equation.
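The general shape of such a "lag and resize" emulator is easy to show. Here is a minimal sketch in Python of a one-box exponential lag of the forcing, scaled by lambda; treat it as illustrative of the idea rather than as my exact one-line equation:

```python
import math

def lag_and_scale(forcing, lam, tau):
    """Lag and resize a forcing series (W/m2) into a temperature
    response (deg C): a one-box exponential lag with time constant
    tau (years), scaled by lam (deg C per W/m2).

    Illustrative sketch only, not the exact one-line equation."""
    a = math.exp(-1.0 / tau)   # year-to-year persistence
    t, out = 0.0, []
    for f in forcing:
        # each year, temperature relaxes toward the scaled forcing lam*f
        t = t * a + lam * f * (1.0 - a)
        out.append(t)
    return out
```

Fed a step forcing, this rises exponentially toward lam times the forcing with e-folding time tau, which is exactly the lagged, rescaled behavior the models exhibit.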
I have shown that this is true both for individual models and for the average of a 19-model "ensemble" of models. Modelers call groups of models "ensembles"; I assume the term is borrowed from music because each model is playing a different tune on a different instrument. But regardless of the etymology, the average of the ensemble results shows exactly the same thing as the results from individual models. They all simply lag and scale the forcing, and call it temperature.
In my last report on this subject, I mentioned that I was about to shift the platform for my investigations from Excel to the computer language “R”. I’ve done that now, with some interesting results. Here’s a confirmation that my shift to R has been successful. This shows the results from the average of the 19 models used in the Forster analysis.
Figure 3. Average forcings and average modeled temperatures of the 19 Forster models, along with my emulation of the modeled temperatures. The emulation (red line) is calculated using the one-line equation. a) Average of modeled temperatures from 19 global climate models (gray line with circles), along with the emulation given by the one-line equation (red line). b) Same information as in 3a, with the addition of the forcing data. "Lambda" is the scaling factor; if the forcings are purely radiative, as in this case, lambda is the transient climate response. "Tau" is the time constant of the lagging process, also known as the "e-folding time". "C" is the heat capacity of the upper layer of the ocean, showing the size of the thermal reservoir; it is calculated from the given tau and lambda. "TCR" is the transient climate response to a doubling of CO2. Equilibrium climate sensitivity (ECS) is about 40% larger than the transient response.
Figure 3b shows the average inputs (blue line, 20th century "forcings" from CO2, volcanoes, the sun, aerosols, and the like) and outputs (gray line with circles, modeled temperatures for the 20th century) of 19 models used in the IPCC reports. You can see how the model outputs (global average temperatures) are merely a lagged and rescaled version of the inputs ("forcings"). Note that the correlation of the emulation (red line) and the actual model results is 0.99.
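The caption's remark that C is calculated from the given tau and lambda follows from the one-box energy balance behind the emulation: if C dT/dt = F − T/lambda, then tau = C × lambda, so C = tau/lambda. A sketch with made-up example values, converting C into an equivalent depth of well-mixed ocean (the seawater heat capacity is a standard approximation, not a figure from the models):

```python
SECONDS_PER_YEAR = 3.156e7
RHO_CP = 4.0e6   # volumetric heat capacity of seawater, J m^-3 K^-1 (approx)

tau = 3.0        # made-up time constant, years
lam = 0.5        # made-up scaling factor, deg C per W/m2

# one-box model: tau = C * lam, so the thermal reservoir is
C = tau / lam    # heat capacity, W yr m^-2 K^-1

# express that reservoir as metres of well-mixed ocean
depth_m = C * SECONDS_PER_YEAR / RHO_CP
```

With these example values the implied mixed layer is a few tens of metres deep, which is the physically sensible ballpark for the upper ocean.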
So … what are some implications of the finding that the hugely complex climate model global average temperature results are simply a lagged version of the inputs? Well, after thinking about that question for a while, I find that they are not exactly what I imagined them to be at first glance.
For me, the first implication of the finding that the models' global temperature output is just lagged and resized forcings is that the models are all operating as designed. By that I mean they have all faithfully and exactly reproduced the misconceptions of the programmers, without error. This is good news, as it means they are working the way the modelers wanted. It doesn't mean that they are right, just that they are working as intended by the modelers. The modelers' claim all along has rested on what I see as the fundamental misconception of climate science: the incorrect idea that the earth's temperature is a linear function of the forcing, and that everything else averages out. And that is exactly what the models do.
In some ways this unanimity is surprising, because there is a reasonable amount of variation in the complexity, the assumptions, and the internal workings of the individual models.
I ascribe this widespread similarity to two things. One is that the core physics is roughly correct. No surprise there. They’ve left out the most important part, the control mechanism composed of the emergent thermoregulatory phenomena like thunderstorms, so their models don’t work anything like the real climate, but the core physics is right. This makes the climate models similar in function.
The other thing driving the functional similarity is that the modelers all have one and only one way to test their models—by comparing them with the historical reality.
This, of course, means that they are all tuned to reproduce the historical temperature record. Now, people often bridle when the word “tuned” is used, so let me replace the word “tuned” with the word “tested”, and try to explain the difficulty that the modelers face, and to explain how testing turns into tuning.
Here’s the only way you can build a climate model. You put together your best effort, and you test it against the variable of interest, global temperature. The only data on that is the historical global average temperature record. If your model is abysmal and looks nothing at all like the real historical temperature record, you throw it away and start over again, until you have a model that produces some results that kinda look like the historical record.
Then you take that initial success, and you start adding in the details and improving the sub-systems. Step by step, you see if you can make it a "better" model, with "better" meaning more lifelike, more realistic, more like the real world's history. For example, you have to deal with the ocean-atmosphere exchange, which is a bitch to get right. So you mess with that, and you test it again. Good news! That removed the problems you'd been having replicating some part of the historical record. So you keep those changes that resulted from the testing.
Or maybe the changes you made to the ocean-atmosphere interface didn’t do what you expected. Maybe when you look at the results they’re worse. So you throw out that section of code, or modify it, and you try again. People say the climate models haven’t been tested? They’ve been tested over and over by comparing them to the real vagaries of the historical temperature record, every one of them, and the models and the parts of models that didn’t work were gotten rid of, and the parts that did work were kept.
This is why I have avoided the word “tuning”, because that doesn’t really describe the process of developing a model. It is one of testing, not tuning. Be clear that I’m not saying that someone sat down and said “we’re gonna tune the ice threshold level down a little bit to better match the historical record”. That would be seen as cheating by most modelers. Instead, they do things like this, reported for the GISS climate model:
The model is tuned (using the threshold relative humidity U00 for the initiation of ice and water clouds) to be in global radiative balance (i.e., net radiation at TOA within 0.5 W m-2 of zero) and a reasonable planetary albedo (between 29% and 31%) for the control run simulations. SOURCE
So the modelers are right when they say their model is not directly tuned to the historical record, because it’s not tuned, it’s tested. But nonetheless, the tuning to the historical record is still very real. It just wasn’t the “twist the knobs” kind of tuning—it was evolutionary in nature. Over the last decades, the modelers will tell you that they’ve gotten better and better at replicating the historical world. And they have, because of evolutionary tuning. All you have to do is what evolution does—constantly toss out the stuff that doesn’t pass the test, and replace it with stuff that does better on the test. What test? Why, replicating the historical record! That’s the only test we have.
And through that process of constant testing and refinement which is not called tuning but ends up with a tuned system, we arrive at a very curious result. Functionally, all of the various climate models end up doing nothing more than a simple lag and resizing of the forcing inputs to give the global temperature outputs. The internal details of the model don’t seem to matter. The various model parameters and settings don’t seem to matter. The way the model handles the ocean-atmosphere doesn’t seem to matter. They’ve all been smoothed out by the evolutionary process, and all that’s output by every model that I’ve tested is a simple lagging and resizing of the inputs.
The second implication is that for hindcasts or forecasts of global temperatures, climate models are useless. There is no way to judge whether GISS or CM2.1 or the average of the nineteen models is “correct”. All the models do is lag a given set of forcings, and get an answer—but a different set of forcings gives a very different answer, and we have no means to distinguish between them.
The third implication of the finding that the models just lag and resize the forcings is the most practical: this highly simplified one-line version of the models should be very useful, not for figuring out what the climate is doing, but for figuring out where, in both time and space, the climate is NOT acting the way the modelers think it operates. Following up on this one is definitely on my list.
The fourth implication is that once the forcings are chosen, the die is cast. If you are looking to hindcast the historical temperatures, your model output must have a trend similar to the historical temperatures. But once the forcings are chosen, the trends of the forcing and of the model output are both known, and thus the climate sensitivity is fixed: it's simply some constant times the temperature trend divided by the forcing trend.
The fifth implication is that this dependence of possible outcomes on the size and shape of the forcings raises the possibility that like the models themselves, the forcings have undergone a similar evolutionary tuning. There is no agreement on the size of several of the elements that make up the forcing datasets, or indeed which elements are included for a given model run. If a modeler adds a set of forcings and they make his model work worse when tested against the real world, he figures that the forcing figures are likely wrong, and so he chooses an alternate dataset, or perhaps uses an alternate calculation that does better with his model.
The sixth implication is that given a sufficiently detailed set of forcings and modeled temperatures, we can use this technique to probe more deeply into the internal workings of the models themselves. Finally, with that as context, we come to the Crowley thousand-year model runs.
The Crowley dataset is very valuable because he has included the results of a simplified model which was run on just the volcanic forcings. These volcanic forcings are a series of very short interruptions in sunlight with nothing in between, so it’s an ideal situation to see exactly how the model responds in the longer term. Crowley reports the details of the model as follows:
A linear upwelling/diffusion energy balance model (EBM) was used to calculate the mean annual temperature response to estimated forcing changes. This model calculates the temperature of a vertically averaged mixed-layer ocean/atmosphere that is a function of forcing changes and radiative damping. The mixed layer is coupled to the deep ocean with an upwelling/diffusion equation in order to allow for heat storage in the ocean interior.
The radiative damping term can be adjusted to embrace the standard range of IPCC sensitivities for a doubling of CO2. The EBM is similar to that used in many IPCC assessments and has been validated against both the Wigley-Raper EBM (40) and two different coupled ocean-atmosphere general circulation model (GCM) simulations.
All forcings for the model runs were set to an equilibrium sensitivity of 2°C for a doubling of CO2. This is on the lower end of the IPCC range of 1.5° to 4.5°C for a doubling of CO2 and is slightly less than the IPCC “best guess” sensitivity of 2.5°C [the inclusion of solar variability in model calculations can decrease the best fit sensitivity (9)]. For both the solar and volcanism runs, the calculated temperature response is based on net radiative forcing after adjusting for the 30% albedo of the Earth-atmosphere system over visible wavelengths.
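The albedo adjustment Crowley describes is simple arithmetic. Assuming the standard conversion of a solar irradiance change to a globally averaged forcing (divide by 4 for the disk-to-sphere geometry, then remove the reflected fraction; the divide-by-4 is my assumption of the usual convention, as Crowley mentions only the albedo step), a sketch:

```python
ALBEDO = 0.30   # Earth-atmosphere albedo over visible wavelengths, per Crowley

def net_solar_forcing(delta_tsi):
    """Convert a change in total solar irradiance (W/m2 at the top of the
    atmosphere) into a globally averaged net radiative forcing: divide by
    4 for the disk-to-sphere geometry (my assumed convention), then
    discount the 30% reflected by albedo."""
    return delta_tsi / 4.0 * (1.0 - ALBEDO)
```

So a 1 W/m2 swing in total solar irradiance nets out to well under 0.2 W/m2 of forcing, which is why the solar line in Figure 4 is so small.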
So that’s the model. However, bearing in mind the question of the evolutionary tuning of the forcings as well as of the models, as well as the total dependence of the output on the forcings chosen for the input, I first took a look at the forcings. Figure 4 shows those results:
Figure 4. Forcings used in the Crowley 1000 year model run. (As an aside, the volcanic forcings (black downwards lines) show a natural phenomenon called the “Noah Effect”. The hydrological event called Noah’s flood was allegedly very much larger than any other flood in history. Similarly, in a natural dataset we often find that the largest occurrence is much larger than the next largest occurrence. You can see the “Noah Effect” in the eruption of 1259 … but I digress.)
Bearing in mind that what will happen is that these forcings will simply be lagged and scaled, we can see what will make the differences in the final Crowley temperature hindcast.
First, I note that the Crowley volcanic forcings are larger than those in any other model I've seen. Of the models I've looked at, the GISS model has the largest volcanic forcings. Here's the comparison for the overlap period:
In addition to the overall difference in peak amplitude, you can see that the GISS data include many more small volcanic eruptions. Another oddity is that while some of the events happen in the same year in both datasets, others don't.
Next, the solar variations in Figure 4 are so small that they don’t really count for much. So we’re left with volcanic (black), aerosol (red), and GHG forcings (orange).
I have to say that the Crowley aerosol forcing in Figure 4 looks totally bogus. The post-1890 correlation of aerosol forcing (once it is no longer zero) with the GHG forcing is -0.97, and I’m not buying that at all. Why should aerosol forcing be the scaled inverse of the GHG forcing? The only function of the aerosol forcing seems to be to reduce the effect of the GHG forcing.
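That correlation is easy to check for yourself once you have the two forcing columns. Here's a sketch using made-up stand-ins for the post-1890 aerosol and GHG series (the real columns are in the linked Crowley data file):

```python
import numpy as np

# Made-up stand-ins for the post-1890 Crowley forcings (W/m2):
# a rising GHG forcing, and aerosol as a noisy, scaled mirror image of it.
rng = np.random.default_rng(0)
ghg = np.linspace(0.0, 2.0, 111)                      # 1890-2000, annual
aerosol = -0.35 * ghg + rng.normal(0.0, 0.02, ghg.size)

# Pearson correlation between the two series
r = np.corrcoef(ghg, aerosol)[0, 1]
# a near-perfect mirror gives r close to -1, as in the Crowley forcings
```

When an "independent" forcing tracks the GHG forcing that closely, it is functionally just a negative rescaling of it, which is the point at issue.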
This is how you generate a model result that fits your needs, you just adjust the forcings. You pick a big size for the volcanic forcings, because that gives you the dip you need for the Little Ice Age. Then you adjust the GHG forcing by using an assumed aerosol forcing that is a smaller mirror reflection of the GHG forcing, and you have it made … and what is it that you do end up with?
This is the best part. You end up with a model that does a pretty good job of replicating the long-discredited Mann “hockey stick”, as proudly exhibited by Crowley …
Well, this dang post is getting too long. I still haven’t gotten to what I started to talk about, the question of probing the internal workings of the Crowley volcano model, so I’ll put that in the next post. It’s midnight. I’m tired. My job today was picking out and chiseling loose and carrying away pounds and pounds of wood-rat droppings out of an old pump-house in the rain. Ah, well, at least spending the day cleaning up some other creature’s sh*t puts food on the table … and casts a valuable and revealing light on my usual delusions of self-importance …
More to come, as always, as time and the tides of work and family allow.
Best to you all, I’d wish you a better day than mine, but that’s a low bar. Be kind to my typos and the like, it’s late.