The “ensemble” of models is completely meaningless, statistically

This comment is from rgbatduke, who is Robert G. Brown of the Duke University Physics Department, posting on the “No significant warming for 17 years 4 months” thread. It has gained quite a bit of attention because it speaks clearly to the truth. So that all readers can benefit, I’m elevating it to a full post.

rgbatduke says:

June 13, 2013 at 7:20 am

Saying that we need to wait for a certain interval in order to conclude that “the models are wrong” is dangerous and incorrect for two reasons. First — and this is a point that is stunningly ignored — there are a lot of different models out there, all supposedly built on top of physics, and yet no two of them give anywhere near the same results!

This is reflected in the graphs Monckton publishes above, where the AR5 trend line is the average over all of these models and in spite of the number of contributors the variance of the models is huge. It is also clearly evident if one publishes a “spaghetti graph” of the individual model projections (as Roy Spencer recently did in another thread) — it looks like the frayed end of a rope, not like a coherent spread around some physics supported result.

Note the implicit swindle in this graph — by forming a mean and standard deviation over model projections and then using the mean as a “most likely” projection and the variance as representative of the range of the error, one is treating the differences between the models as if they are uncorrelated random variates causing deviation around a true mean!

Say what?

This is such a horrendous abuse of statistics that it is difficult to know how to begin to address it. One simply wishes to bitch-slap whoever it was that assembled the graph and ensure that they never work or publish in the field of science or statistics ever again. One cannot generate an ensemble of independent and identically distributed models that have different code. One might, possibly, generate a single model that generates an ensemble of predictions by using uniform deviates (random numbers) to seed “noise” (representing uncertainty) in the inputs.

What I’m trying to say is that the variance and mean of the “ensemble” of models is completely meaningless, statistically, because the inputs do not possess the most basic properties required for a meaningful interpretation. They are not independent, their differences are not based on a random distribution of errors, and there is no reason whatsoever to believe that the errors or differences are unbiased (given that the only way humans can generate anything unbiased is through the use of, e.g., dice or other objectively random instruments).
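To make the distinction concrete, here is a minimal Python sketch (toy numbers and a made-up "toy_model", not any actual model output): perturbing the inputs of a single model yields spread with a defensible statistical interpretation, while the mean and standard deviation of structurally different models summarizes only the spread of the modelers' choices.

```python
# Toy illustration only; "toy_model" and its numbers are invented for this sketch.
import numpy as np

rng = np.random.default_rng(42)

def toy_model(forcing, sensitivity):
    """Hypothetical stand-in for a model: response proportional to forcing."""
    return sensitivity * forcing

forcing = np.linspace(0.0, 2.0, 50)   # an arbitrary forcing path

# (a) One model, many runs with randomly perturbed inputs: the spread has a
#     clear interpretation as uncertainty propagated from the inputs.
runs = np.array([toy_model(forcing + rng.normal(0.0, 0.1, forcing.size), 1.0)
                 for _ in range(200)])
print("perturbed-input ensemble, final value: mean %.2f, std %.2f"
      % (runs[:, -1].mean(), runs[:, -1].std()))

# (b) The practice criticized above: average structurally different "models"
#     (here, different assumed sensitivities) as if they were random draws
#     from some well-defined distribution.
models = np.array([toy_model(forcing, s) for s in (0.8, 1.5, 2.2, 3.0)])
print("multi-model 'ensemble', final value: mean %.2f, std %.2f"
      % (models[:, -1].mean(), models[:, -1].std()))
# The second mean/std describes the spread of the modelers' choices, not the
# probability distribution of any physical outcome.
```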

So why buy into this nonsense by doing linear fits to a function — global temperature — that has never in its entire history been linear, although of course it has always been approximately smooth so one can always do a Taylor series expansion in some sufficiently small interval and get a linear term that — by the nature of Taylor series fits to nonlinear functions — is guaranteed to fail if extrapolated as higher order nonlinear terms kick in and ultimately dominate? Why even pay lip service to the notion that R^2 or p for a linear fit, or for a Kolmogorov-Smirnov comparison of the real temperature record and the extrapolated model prediction, has some meaning? It has none.
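A minimal Python sketch of the Taylor-series point (an arbitrary smooth nonlinear function, purely illustrative): the linear fit looks excellent inside the fitting window and fails badly when extrapolated.

```python
# Purely illustrative; exp(x/2) stands in for any smooth nonlinear record.
import numpy as np

x_fit = np.linspace(0.0, 1.0, 100)   # short fitting interval
y_fit = np.exp(0.5 * x_fit)          # smooth, nonlinear "truth"

slope, intercept = np.polyfit(x_fit, y_fit, 1)
residuals = slope * x_fit + intercept - y_fit

print("in-sample RMS error of the linear fit: %.4f"
      % np.sqrt(np.mean(residuals ** 2)))

x_far = 6.0                          # extrapolate well outside the interval
print("extrapolated linear prediction at x = 6: %.2f, actual value: %.2f"
      % (slope * x_far + intercept, np.exp(0.5 * x_far)))
# A near-perfect fit (high R^2) inside the window says nothing about what the
# higher-order terms do once you leave it.
```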

Let me repeat this. It has no meaning! It is indefensible within the theory and practice of statistical analysis. You might as well use a ouija board as the basis of claims about the future climate history as the ensemble average of different computational physical models that do not differ by truly random variations and are subject to all sorts of omitted variable, selected variable, implementation, and initialization bias. The board might give you the right answer, might not, but good luck justifying the answer it gives on some sort of rational basis.

Let’s invert this process and actually apply statistical analysis to the distribution of model results Re: the claim that they all correctly implement well-known physics. For example, if I attempt to do an a priori computation of the quantum structure of, say, a carbon atom, I might begin by solving a single electron model, treating the electron-electron interaction using the probability distribution from the single electron model to generate a spherically symmetric “density” of electrons around the nucleus, and then performing a self-consistent field theory iteration (resolving the single electron model for the new potential) until it converges. (This is known as the Hartree approximation.)
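For readers unfamiliar with the pattern, here is a minimal Python sketch of the "iterate to self-consistency" idea (a toy fixed-point iteration only; the two functions are hypothetical stand-ins, not an actual Hartree solver):

```python
# Toy fixed-point iteration only; nothing here is real quantum mechanics.
def effective_field(density):
    """Stand-in for 'the potential generated by the current electron density'."""
    return 1.0 / (1.0 + density)

def solve_single_particle(field):
    """Stand-in for 're-solving the single-electron problem in that potential'."""
    return 0.5 + 0.3 * field

density = 1.0                        # initial guess
for iteration in range(1, 101):
    updated = solve_single_particle(effective_field(density))
    if abs(updated - density) < 1e-12:
        break
    density = updated

print("self-consistent after %d iterations, density = %.9f" % (iteration, density))
```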

Somebody else could say, “Wait, this ignores the Pauli exclusion principle and the requirement that the electron wavefunction be fully antisymmetric.” One could then make the (still single-electron) model more complicated and construct a Slater determinant to use as a fully antisymmetric representation of the electron wavefunctions, generate the density, and perform the self-consistent field computation to convergence. (This is Hartree-Fock.)

A third party could then note that this still underestimates what is called the “correlation energy” of the system, because treating the electron cloud as a continuous distribution through which electrons move ignores the fact that individual electrons strongly repel and hence do not like to get near one another. Both of the former approaches underestimate the size of the electron hole, and hence they make the atom “too small” and “too tightly bound”. A variety of schemes have been proposed to overcome this problem — using a semi-empirical local density functional being probably the most successful.

A fourth party might then observe that the Universe is really relativistic, and that by ignoring relativity theory and doing a classical computation we introduce an error into all of the above (although it might be included in the semi-empirical LDF approach heuristically).

In the end, one might well have an “ensemble” of models, all of which are based on physics. In fact, the differences are also based on physics — the physics omitted from one try to another, or the means used to approximate and try to include physics we cannot include in a first-principles computation (note how I sneaked a semi-empirical note in with the LDF; although one can derive some density functionals from first principles (e.g. the Thomas-Fermi approximation), they usually don’t do particularly well because they aren’t valid across the full range of densities observed in actual atoms). Note well, doing the precise computation is not an option. We cannot solve the many-body atomic state problem in quantum theory exactly any more than we can solve the many-body problem exactly in classical theory, or the set of open, nonlinear, coupled, damped, driven, chaotic Navier-Stokes equations in a non-inertial reference frame that represent the climate system.

Note well that solving for the exact, fully correlated nonlinear many electron wavefunction of the humble carbon atom — or the far more complex Uranium atom — is trivially simple (in computational terms) compared to the climate problem. We can’t compute either one, but we can come a damn sight closer to consistently approximating the solution to the former compared to the latter.

So, should we take the mean of the ensemble of “physics based” models for the quantum electronic structure of atomic carbon and treat it as the best prediction of carbon’s quantum structure? Only if we are very stupid or insane or want to sell something. If you read what I said carefully (and you may not have — eyes tend to glaze over when one reviews a year or so of graduate quantum theory applied to electronics in a few paragraphs, even though I left out perturbation theory, Feynman diagrams, and ever so much more :-) you will note that I cheated — I ran in a semi-empirical method.

Which of these is going to be the winner? LDF, of course. Why? Because the parameters are adjusted to give the best fit to the actual empirical spectrum of Carbon. All of the others are going to underestimate the correlation hole, and their errors will be systematically deviant from the correct spectrum. Their mean will be systematically deviant, and by weighting Hartree (the dumbest reasonable “physics based approach”) the same as LDF in the “ensemble” average, you guarantee that the error in this “mean” will be significant.

Suppose one did not know (as, at one time, we did not know) which of the models gave the best result. Suppose that nobody had actually measured the spectrum of Carbon, so its empirical quantum structure was unknown. Would the ensemble mean be reasonable then? Of course not. I presented the models in the way physics itself predicts improvement — adding back details that ought to be important that are omitted in Hartree. One cannot be certain that adding back these details will actually improve things, by the way, because it is always possible that the corrections are not monotonic (and eventually, at higher orders in perturbation theory, they most certainly are not!). Still, nobody would pretend that the average of a theory with an improved theory is “likely” to be better than the improved theory itself, because that would make no sense. Nor would anyone claim that diagrammatic perturbation theory results (for which there is a clear a priori derived justification) are necessarily going to beat semi-heuristic methods like LDF, because in fact they often do not.

What one would do in the real world is measure the spectrum of Carbon, compare it to the predictions of the models, and then hand out the ribbons to the winners! Not the other way around. And since none of the winners is going to be exact — indeed, for decades and decades of work, none of the winners was even particularly close to observed/measured spectra in spite of using supercomputers (admittedly, supercomputers that were slower than your cell phone is today) to do the computations — one would then return to the drawing board and code entry console to try to do better.

Can we apply this sort of thoughtful reasoning to the spaghetti snarl of GCMs and their highly divergent results? You bet we can! First of all, we could stop pretending that “ensemble” mean and variance have any meaning whatsoever by not computing them. Why compute a number that has no meaning? Second, we could take the actual climate record from some “epoch starting point” — one that does not matter in the long run, and we’ll have to continue the comparison for the long run because in any short run from any starting point noise of a variety of sorts will obscure systematic errors — and we can just compare reality to the models. We can then sort out the models by putting (say) all but the top five or so into a “failed” bin and stop including them in any sort of analysis or policy decisioning whatsoever unless or until they start to actually agree with reality.
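A minimal Python sketch of that triage procedure, using made-up placeholder "model" trajectories and a synthetic observed record rather than actual CMIP output or instrumental data:

```python
# Placeholder trajectories and a synthetic "observed" record; nothing here is real data.
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(1979, 2014)
observed = 0.010 * (years - years[0]) + rng.normal(0.0, 0.08, years.size)

# Twenty hypothetical "models" with different trends plus internal noise.
model_runs = {
    "model_%02d" % i: trend * (years - years[0]) + rng.normal(0.0, 0.08, years.size)
    for i, trend in enumerate(np.linspace(0.005, 0.045, 20))
}

def rms_error(run, obs):
    return float(np.sqrt(np.mean((run - obs) ** 2)))

ranked = sorted(model_runs.items(), key=lambda kv: rms_error(kv[1], observed))

keepers = ranked[:5]   # the handful worth studying and improving
binned = ranked[5:]    # the "failed" bin: out of the analysis until they agree with reality

for name, run in keepers:
    print("KEEP %s  RMS error vs observations: %.3f" % (name, rms_error(run, observed)))
print("binned %d models" % len(binned))
```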

Then real scientists might contemplate sitting down with those five winners and meditate upon what makes them winners — what makes them come out the closest to reality — and see if they could figure out ways of making them work even better. For example, if they are egregiously high and diverging from the empirical data, one might consider adding previously omitted physics, semi-empirical or heuristic corrections, or adjusting input parameters to improve the fit.

Then comes the hard part. Waiting. The climate is not as simple as a Carbon atom. The latter’s spectrum never changes, it is a fixed target. The former is never the same. Either one’s dynamical model is never the same and mirrors the variation of reality or one has to conclude that the problem is unsolved and the implementation of the physics is wrong, however “well-known” that physics is. So one has to wait and see if one’s model, adjusted and improved to better fit the past up to the present, actually has any predictive value.

Worst of all, one cannot easily use statistics to determine when or if one’s predictions are failing, because damn, climate is nonlinear, non-Markovian, chaotic, and is apparently influenced in nontrivial ways by a world-sized bucket of competing, occasionally cancelling, poorly understood factors. Soot. Aerosols. GHGs. Clouds. Ice. Decadal oscillations. Defects spun off from the chaotic process that cause global, persistent changes in atmospheric circulation on a local basis (e.g. blocking highs that sit out on the Atlantic for half a year) that have a huge impact on annual or monthly temperatures and rainfall and so on. Orbital factors. Solar factors. Changes in the composition of the troposphere, the stratosphere, the thermosphere. Volcanoes. Land use changes. Algae blooms.

And somewhere, that damn butterfly. Somebody needs to squash the damn thing, because trying to ensemble average a small sample from a chaotic system is so stupid that I cannot begin to describe it. Everything works just fine as long as you average over an interval short enough that you are bound to a given attractor, oscillating away; things look predictable and then — damn, you change attractors. Everything changes! All the precious parameters you empirically tuned to balance out this and that for the old attractor suddenly require new values to work.
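To illustrate the attractor point, here is a minimal Python sketch using the classic Lorenz ’63 toy system as a stand-in for a chaotic climate (purely illustrative, not a climate model): the mean over any short window depends on which part of the attractor the trajectory happens to be visiting.

```python
# Lorenz '63 is a textbook chaotic toy, used here only to show that short-sample
# averages are not a stable property of a chaotic system.
import numpy as np
from scipy.integrate import solve_ivp

def lorenz(t, state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

sol = solve_ivp(lorenz, (0.0, 60.0), [1.0, 1.0, 1.0], max_step=0.01)
t, x = sol.t, sol.y[0]

# Mean of x over consecutive 10-unit windows: each window samples whichever
# lobe(s) of the attractor the trajectory happens to visit at the time.
for start in range(0, 60, 10):
    window = x[(t >= start) & (t < start + 10)]
    print("mean of x over t in [%2d, %2d): %+6.2f" % (start, start + 10, window.mean()))
```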

This is why it is actually wrong-headed to acquiesce in the notion that any sort of p-value or R-squared derived from an AR5 mean has any meaning. It gives up the high ground (even though one is using it for a good purpose, trying to argue that this “ensemble” fails elementary statistical tests). But statistical testing is a shaky enough theory as it is, open to data dredging and horrendous error alike, and that’s when it really is governed by underlying IID processes (see “Green Jelly Beans Cause Acne”). One cannot naively apply a criterion like rejection if p < 0.05, and all that means under the best of circumstances is that the current observations are improbable given the null hypothesis at 19 to 1. People win and lose bets at this level all the time. One time in 20, in fact. We make a lot of bets!
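For readers who want to see the “one time in 20” point concretely, here is a minimal Python sketch (synthetic data, purely illustrative): run many tests where the null hypothesis is true by construction, and roughly 5% of them still come out “significant” at p < 0.05.

```python
# Synthetic data only: every "experiment" below has no real effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_tests, false_positives = 2000, 0

for _ in range(n_tests):
    a = rng.normal(0.0, 1.0, 30)   # both samples drawn from the SAME
    b = rng.normal(0.0, 1.0, 30)   # distribution: the null hypothesis is true
    _, p_value = stats.ttest_ind(a, b)
    if p_value < 0.05:
        false_positives += 1

print("fraction flagged 'significant': %.3f (expected about 0.05)"
      % (false_positives / n_tests))
```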

So I would recommend — modestly — that skeptics try very hard not to buy into this and redirect all such discussions to questions such as why the models are in such terrible disagreement with each other, even when applied to identical toy problems that are far simpler than the actual Earth, and why we aren’t using empirical evidence (as it accumulates) to reject failing models and concentrate on the ones that come closest to working, while also not using the models that are obviously not working in any sort of “average” claim for future warming. Maybe they could hire themselves a Bayesian or two and get them to recompute the AR curves, I dunno.

It would take me, in my comparative ignorance, around five minutes to throw out all but the best 10% of the GCMs (which are still diverging from the empirical data, but arguably are well within the expected fluctuation range on the DATA side), sort the remainder into top-half models that should probably be kept around and possibly improved, and bottom half models whose continued use I would defund as a waste of time. That wouldn’t make them actually disappear, of course, only mothball them. If the future climate ever magically popped back up to agree with them, it is a matter of a few seconds to retrieve them from the archives and put them back into use.

Of course if one does this, the GCM-predicted climate sensitivity plunges from the totally statistically fraudulent 2.5 C/century to a far more plausible and still possibly wrong ~1 C/century, which — surprise — more or less continues the post-LIA warming trend with a small possible anthropogenic contribution. This large a change would bring out pitchforks and torches as people realize just how badly they’ve been used by a small group of scientists and politicians, how much they are the victims of indefensible abuse of statistics to average in the terrible with the merely poor as if they are all equally likely to be true with randomly distributed differences.

rgb



323 Comments
KitemanSA
June 18, 2013 5:17 pm

Might it be a valid mean of random stupidity?

Ian W
June 18, 2013 5:24 pm

An excellent post – it would be assisted if it had Viscount Monckton’s and Roy Spencer’s graphs displayed with references.

June 18, 2013 5:28 pm

This assertion (wrong GCM’s should be ignored not averaged) is so clearly explained and justified, I am amazed no statistician made the point earlier, like sometime in the last 10 years as the climate change hysteria became so detached from reality, as all bad weather is now blamed on climate change.

OK S.
June 18, 2013 5:34 pm

The Bishop has something to say regarding this comment over at his place:
http://bishophill.squarespace.com/blog/2013/6/14/on-the-meaning-of-ensemble-means.html

PaulH
June 18, 2013 5:36 pm

The ensemble average of a Messerschmidt is still a Messerschmidt. :->

June 18, 2013 5:37 pm

What I’m trying to say is that the variance and mean of the “ensemble” of models is completely meaningless, statistically
Indeed. At best, the outputs of climate models are the opinions of climate modellers numerically quantified.
As such, I’d argue the variance is direct evidence the claimed consensus is weak.

mark
June 18, 2013 5:43 pm

damn.
just damn.

June 18, 2013 5:47 pm

rgb: “a small group of scientists and politicians,
It’s not a small group. It’s a large group.
Among US scientists, it’s the entire official institutional hierarchy, from the NAS, through the APS, the ACS to the AGU and the AMS. Among politicians, it’s virtually the entire set of Democratic electees, and probably a fair fraction of the Republican set, too.
And let’s not forget the individual scientists who have lied consistently for years. None of this would be happening without their conscious elevation of environmental ideology over scientific integrity. Further, none of this would be happening if the APS, etc., actually did due diligence on climate science claims, before endorsing them. The APS analysis, in particular, is pathetic to the point of incompetent.
And all of this has been facilitated by a press that has looked to their political prejudices to decide which group is telling the truth about climate. The press has overlooked and forgiven obvious shenanigans of climate scientists (e.g., Climategate I&II, back to 1400 CENSORED, the obvious pseudo-investigatory whitewashes, etc.) the way believers hold fast to belief despite the grotesqueries of their reverends. It’s been a large-scale failure all around; a worse abuse of science has never occurred, nor a worse failure by the press.

Admin
June 18, 2013 5:49 pm

Let’s face it, Ensemble Means were brought to us by the same idiots who thought multi-proxy averaging was a legitimate way to reduce the uncertainty of temporally uncertain temperature proxies.

June 18, 2013 5:53 pm

By the way, my 2008 Skeptic article provides an analysis of GCM systematic error, and shows that their projections are physically meaningless.
I’ve updated that analysis to the CMIP5 models, and have written up a manuscript for publication. The CMIP5 set are no better than the AMIP1 set. They are predictively useless.

tz2026
June 18, 2013 5:56 pm

Well put. In great detail too.

k scott denison
June 18, 2013 5:58 pm

Brilliant, thank you. Can’t wait to see the defenders of the faith stop by to tell us, once again, “but, but, but they’re the best we have!!!” Mosher comes to mind.
That the best we have are all no good never seems to cross some people’s minds. Dr. Brown, the simplicity of your advice to ask the key questions about the models is greatly appreciated.

MaxL
June 18, 2013 6:00 pm

I have been doing operational weather forecasting for several decades. The weather models are certainly a mainstay of our business. We generally look at several different models to gain a feel for what may occur. These include the Canadian, American and European models. They all have slightly differing physics and numerical methods. All too often the models show quite different scenarios, especially after about 48 hours. So what does one do? I have found through the years that taking the mean (i.e. ensemble mean of different models) very seldom results in the correct forecast. It is usually the case that one of the models produces the best result. But which one is the trick. And you never know beforehand. So you choose what you think is the most reasonable model forecast, bearing in mind what could happen given the other model output. And just because one model was superior in one case does not mean it will be the best in the next case.

Eeyore Rifkin
June 18, 2013 6:01 pm

Philip Bradley says:
“At best, the outputs of climate models are the opinions of climate modellers numerically quantified.”
Agreed, but I don’t believe that’s meaningless, statistically or otherwise.
“As such, I’d argue the variance, is direct evidence the claimed consensus is weak.”
I think the magnitude of the variance depends on the scale one uses. Pull back far enough and it looks like a strong “consensus” to exaggerate.

Greg L.
June 18, 2013 6:02 pm

I have mostly stayed out of the fray, as most of the arguing over runaway anthropogenic global warming has for a good bit of time looked to me as far more religious than scientific on all sides. Having said that, and as a professional statistician (who possesses graduate degrees in both statistics and meteorology), I finally have seen a discussion worth wading into.
The post given here makes good sense, but I want to add a caution to the interpretation of it. Saying that making a judgement about an ensemble (i.e., a collection of forecasts from a set of models and their dispersion statistics) has no scientific/statistical validity does not mean that such a collection has no forecast utility. Rather, it means that one cannot make a statement about the validity of any individual model contained within the set based upon the performance of the ensemble statistics versus some reference verification. And this is exactly the point. We are a long way from the scientific method here – the idea that an experimental hypothesis can be verified/falsified/replicated through controlled experiments. We are not going to be able to do that with most integrated atmospheric phenomena as there simply is no collection of parallel earths available upon which to try different experiments. Not only that, but the most basic forms of the equations that (we think) govern atmospheric behavior are at best unsolvable, and in a number of cases unproven. Has anyone seen a proof of the full Navier-Stokes equations? Are even some of the simplest cases of these equations solvable (see, for example, the solution to the simplest possible convection problem in Kerry Emanuel’s Atmospheric Convection text – it is an eighth-order differential equation with a transcendental solution)? And yet we see much discussion on proving or validating GCMs – which have at best crude approximations to many governing equations, do not include all feedbacks (and may even have the sign wrong on some that they do include), and are attempting to model a system that is extremely nonlinear …
Given this, I actually don’t think the statement in this post goes far enough. Even reducing the set of models to the 10% or so that have the least error does not tell one anything. We cannot even make a statement about a model that correlates 99% with reality as we do not know if it has gotten things “right” for the right reasons. Is such a model more likely to be right? Probably. But is it? Who knows. And anyone who has ever tried to fit a complicated model to reality and watch the out-of-sample observations fail knows quickly just how bad selection bias can be. For example, the field of finance and forecasting financial markets is saturated with such failures – and such failures involve a system far less complicated than the atmosphere/ocean system.
On the flip side – this post does not invalidate using ensemble forecasts for the sake of increasing forecast utility. An ensemble forecast can improve forecast accuracy provided the following assumptions hold – namely, that the distribution of results is bounded, the errors of the members are not systematically biased, and that the forecast errors of the members are at least somewhat uncorrelated. Such requirements do not mean whatsoever that the member models use the same physical assumptions and simplifications. But once again – this is a forecast issue – not a question of validation of the individual members. And moreover, in the case of GCMs within an ensemble, the presence of systematic bias is likely – if for no other reason than the unfortunate effects of publication bias, research funding survivorship (e.g., those who show more extreme results credibly may tend to get funding more easily), and the unconscious tendency of humans who fit models with far too many parameters to make judgement calls that cause the model results to look like what “they should be”.
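A minimal Python sketch of those conditions (illustrative numbers only, not actual forecast data): averaging member forecasts with unbiased, roughly independent errors reduces error, while a bias shared by every member passes straight through the ensemble mean.

```python
# Illustrative numbers only; "truth" and the error magnitudes are made up.
import numpy as np

rng = np.random.default_rng(7)
truth = 1.0
n_members, n_trials = 10, 5000

def rmse(forecasts):
    return float(np.sqrt(np.mean((forecasts - truth) ** 2)))

# Case 1: unbiased, independent member errors -> the ensemble mean beats a member.
unbiased = truth + rng.normal(0.0, 0.5, (n_trials, n_members))
print("single member RMSE        : %.3f" % rmse(unbiased[:, 0]))
print("ensemble mean RMSE        : %.3f" % rmse(unbiased.mean(axis=1)))

# Case 2: the same random errors plus a bias common to every member -> averaging
# removes the noise but the shared bias survives intact.
biased = unbiased + 0.4
print("biased ensemble mean RMSE : %.3f" % rmse(biased.mean(axis=1)))
```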

Chuck Nolan
June 18, 2013 6:02 pm

I believe you’re correct.
I’m not smart enough to know if what you are saying is true, but I like your logic.
Posting this on WUWT tells me you are not afraid of critique.
Everyone knows nobody gets away with bad science or math here.
My guess is the bad models are kept because it’s taxpayer money and there is no need for stewardship so they just keep giving them the money.
cn

Abe
June 18, 2013 6:04 pm

WINNER!!!!!
The vast majority of what you said went WAY over my head, but I totally agree that averaging models for stats as if they were actual data is totally wrong. I think looking at it in that light says a lot about the many climate alarmists who continue to use their model outputs as if they were actual collected data and ignore or dismiss real empirical data.

June 18, 2013 6:14 pm

All the climate models were wrong. Every one of them.
You cannot average a lot of wrong models together and get a correct answer.

June 18, 2013 6:18 pm

Eeyore Rifkin says:
June 18, 2013 at 6:01 pm

I agree with both your points.
I was agreeing with rgb’s statements in relation to the actual climate. Whereas my points related to the psychology/sociology of climate scientists, where the model outputs can be considered data for statistical purposes. And you may well be right that those outputs are evidence of collective exaggeration, or a culture of exaggeration.

June 18, 2013 6:22 pm

Can someone send enough money to RGB to get him to do the 10 minutes of work, and the extra work to publish a model scorecard and ranking for all to see? Like in golf or tennis. At the BH blog someone pointed out that some models are good for temperature, others for precipitation. So there could be a couple of ranking lists. But keep it simple.

Nick Stokes
June 18, 2013 6:22 pm


This is reflected in the graphs Monckton publishes above, where the AR5 trend line is the average over all of these models and in spite of the number of contributors the variance of the models is huge. It is also clearly evident if one publishes a “spaghetti graph” of the individual model projections (as Roy Spencer recently did in another thread) — it looks like the frayed end of a rope, not like a coherent spread around some physics supported result.
Note the implicit swindle in this graph — by forming a mean and standard deviation over model projections and then using the mean as a “most likely” projection and the variance as representative of the range of the error, one is treating the differences between the models as if they are uncorrelated random variates causing deviation around a true mean!
Say what?
This is such a horrendous abuse of statistics that it is difficult to know how to begin to address it. One simply wishes to bitch-slap whoever it was that assembled the graph and ensure that they never work or publish in the field of science or statistics ever again. One cannot generate an ensemble of independent and identically distributed models that have different code. One might, possibly, generate a single model that generates an ensemble of predictions by using uniform deviates (random numbers) to seed “noise” (representing uncertainty) in the inputs.
What I’m trying to say is that the variance and mean of the “ensemble” of models is completely meaningless, statistically because the inputs do not possess the most basic properties required for a meaningful interpretation.”

As I said on the other thread, what is lacking here is a proper reference. Who does this? Where? “Whoever it was that assembled the graph” is actually Lord Monckton. But I don’t think even that graph has most of these sins, and certainly the AR5 graph cited with it does not.
Where in the AR5 do they make use of ‘the variance and mean of the “ensemble” of models’?

Editor
June 18, 2013 6:24 pm

The idiocy of averaging “the terrible with the merely poor.” Nice.

edcaryl
June 18, 2013 6:28 pm

Averaging climate models is analogous to averaging religions, with about the same validity.

Pamela Gray
June 18, 2013 6:30 pm

That was like eating a steak. Every bite was meaty!

Mark Bofill
June 18, 2013 6:32 pm

~applause~
Very well said, Dr. Brown!
