New peer-reviewed paper finds the same global forecast model produces different results when run on different computers
Did you ever wonder how spaghetti like this is produced and why there is broad disagreement in the output that increases with time?
Graph above by Dr. Roy Spencer
Increasing mathematical uncertainty from initial conditions is the main reason. But some of it might be due to the fact that while some of the models share common code, they don’t produce the same results with that code, owing to differences in the way CPUs, operating systems, and compilers work. Now, with this paper, we can add software uncertainty to the list of uncertainties that are already known unknowns about climate and climate modeling.
I got access to the paper yesterday, and its findings were quite eye-opening.
The paper was published 7/26/13 in the Monthly Weather Review, a publication of the American Meteorological Society. It finds that the same global forecast model (compared here via 500-hPa geopotential height) run on different computer hardware and operating systems produces different results, with no other changes.
They say that the differences are…
“primarily due to the treatment of rounding errors by the different software systems”
…and that these errors propagate over time, meaning they accumulate.
According to the authors:
“We address the tolerance question using the 500-hPa geopotential height spread for medium range forecasts and the machine ensemble spread for seasonal climate simulations.”
…
“The [hardware & software] system dependency, which is the standard deviation of the 500-hPa geopotential height [areas of high & low pressure] averaged over the globe, increases with time.”
The authors find:
“…the ensemble spread due to the differences in software system is comparable to the ensemble spread due to the differences in initial conditions that is used for the traditional ensemble forecasting.”
The initial conditions of climate models have already been shown by many papers to produce significantly different projections of climate.
It makes you wonder if some of the catastrophic future projections are simply due to a rounding error.
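To make the paper’s mechanism concrete, here is a minimal sketch (my illustration, not code from the paper, using a logistic map as a stand-in for a real forecast model): two algebraically identical sums round to different doubles, and a chaotic toy model then amplifies that last-bit disagreement step by step, which is the sense in which rounding differences “accumulate.”

```cpp
// Minimal illustration (not from the paper): the same sum computed in two
// different orders differs in its last bits, and iterating a chaotic toy
// model from the two results lets that tiny discrepancy grow step by step.
#include <cstdio>
#include <cmath>

int main() {
    const double a = 0.1, b = 0.2, c = 0.3;

    // Two algebraically identical expressions that round differently.
    const double x = (a + b) + c;   // how one system might effectively order the sum
    const double y = a + (b + c);   // how another might order it
    std::printf("one-step difference: %.3e\n", x - y);

    // Iterate a simple nonlinear update (logistic map, a standard chaotic toy
    // model) from the two slightly different values; the difference compounds.
    double u = x, v = y;
    for (int step = 1; step <= 60; ++step) {
        u = 3.9 * u * (1.0 - u);
        v = 3.9 * v * (1.0 - v);
        if (step % 15 == 0)
            std::printf("step %2d  |u - v| = %.3e\n", step, std::fabs(u - v));
    }
    return 0;
}
```

A real atmospheric model is vastly more complicated, but the feedback principle is the same: any nonlinear model will magnify a one-bit disagreement in its inputs.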
Here is how they conducted the tests on hardware/software:
Table 1 shows the 20 computing environments including Fortran compilers, parallel communication libraries, and optimization levels of the compilers. The Yonsei University (YSU) Linux cluster is equipped with 12 Intel Xeon CPUs (model name: X5650) per node and supports the PGI and Intel Fortran compilers. The Korea Institute of Science and Technology Information (KISTI; http://www.kisti.re.kr) provides a computing environment with high-performance IBM and SUN platforms. Each platform is equipped with different CPU: Intel Xeon X5570 for KISTI-SUN2 platform, Power5+ processor of Power 595 server for KISTI-IBM1 platform, and Power6 dual-core processor of p5 595 server for KISTI-IBM2 platform. Each machine has a different architecture and approximately five hundred to twenty thousand CPUs.
And here are the results:

While the differences might appear small to some, bear in mind that these differences in standard deviation are for only 10 days’ worth of modeling on a short-term global forecast model, not a decades-out global climate model. Since the software effects they observed in this study are cumulative, imagine what the differences might be after years of calculation into the future, as we see in GCMs.
Clearly, an evaluation of this effect is needed over the long term for many of the GCMs used to project future climate, to determine whether it also affects those models and, if so, how much of their output is real and how much is simply accumulated rounding error.
Here is the paper:
An Evaluation of the Software System Dependency of a Global Atmospheric Model
Abstract
This study presents the dependency of the simulation results from a global atmospheric numerical model on machines with different hardware and software systems. The global model program (GMP) of the Global/Regional Integrated Model system (GRIMs) is tested on 10 different computer systems having different central processing unit (CPU) architectures or compilers. There exist differences in the results for different compilers, parallel libraries, and optimization levels, primarily due to the treatment of rounding errors by the different software systems. The system dependency, which is the standard deviation of the 500-hPa geopotential height averaged over the globe, increases with time. However, its fractional tendency, which is the change of the standard deviation relative to the value itself, remains nearly zero with time. In a seasonal prediction framework, the ensemble spread due to the differences in software system is comparable to the ensemble spread due to the differences in initial conditions that is used for the traditional ensemble forecasting.
h/t to The Hockey Schtick
Mark Negovan says:
July 28, 2013 at 6:03 am
“The problem is that if you just test the extremes of the error uncertainty after a time step in a GCM, you are just computing a different state of the system. Thus errors and uncertainty at any time step is unbounded and the error bound will increase to the extremes of where the climate has been in the past in a very short time.”
Exactly. “Defocussing” the state to a meaningless blur in a few time steps.
Rounding errors? We don’t care ’bout no stinkin’ rounding errors! As long as the results are in the ballpark & moving upward. /sarc
How about something as simple as unit tests? Do they exist for every subroutine?
@Nick Stokes
NS says: It isn’t an error. It is very well known that atmosphere modelling is chaotic (as is reality). Numerical discrepancies of all kinds grow, so that the forecast is no good beyond about 10 days. Rounding errors grow like everything else.
NS says: The alternative is no numerical forecast at all.
Nick, that is all blatantly wrong. You may not know how to deal with the problem, but many scientists do, so please don’t lump us all in with your limited science skills and abilities.
First, let’s sort out some terminology: errors that grow are called compounding errors, and they occur due to a refusal to deal with them. A chaotic system can be forecast if you know what you are doing and know how to deal with the errors; in fact, most weather forecasters do it every day. In quantum mechanics, chaos crops up in all sorts of forms, and we deal with it, forecast and model around it, and lift it into theories, usually as probabilities.
We did this dance with Nicola Scafetta and his astrological cycles; I am beginning to think ignorance of compounding errors must be common in the climate science community.
So Nick, without getting into some of the more complex mathematical and QM ways to deal with chaotic errors, the absolutely simple way to do it is to drag the model back down to reality and dump the errors each cycle. All weather forecasts typically do that: they ignore yesterday’s mistake and forecast forward.
Similarly, most manual human navigation systems have the concept of waypoints, where you visually locate a point and then fix that point as your new start point, removing all the previous errors.
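Here is a toy sketch of that “waypoint” re-anchoring idea (my illustration only; it is not code from the paper or from any forecasting system): a chaotic map with a small model error drifts far from the truth when run freely, but keeps a bounded error when it is reset to an observed state every few steps.

```cpp
// Toy "waypoint" illustration (a sketch, not operational code): a free-running
// chaotic model with a slight parameter error drifts away from the truth,
// while the same model re-anchored to observations every few steps keeps its
// error bounded.
#include <cstdio>
#include <cmath>

static double truthStep(double x) { return 3.9    * x * (1.0 - x); }  // "reality"
static double modelStep(double x) { return 3.9001 * x * (1.0 - x); }  // imperfect model

int main() {
    double truth    = 0.61234;
    double freeRun  = truth;   // never corrected
    double anchored = truth;   // re-initialized from observations ("waypoints")

    for (int t = 1; t <= 40; ++t) {
        truth    = truthStep(truth);
        freeRun  = modelStep(freeRun);
        anchored = modelStep(anchored);

        if (t % 5 == 0) {
            std::printf("t=%2d  free-run error %.2e   anchored error %.2e\n",
                        t, std::fabs(freeRun - truth), std::fabs(anchored - truth));
            anchored = truth;  // the "waypoint": reset to the observed state
        }
    }
    return 0;
}
```

The reset here uses perfect observations for simplicity; real data assimilation blends imperfect observations with the model state, but the error-dumping effect is the same in spirit.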
It’s really not that hard, and the most famous example of a chaotic system was found in 1964 (http://mathworld.wolfram.com/Henon-HeilesEquation.html).
You might want to start there before you make further stupid statements like “chaotic systems can’t be forecast or approximated”, or at least talk to a real scientist.
Seems completely obvious and not an issue to me. To say it plainly, we all know that a butterfly flapping its wings can influence the weather in a distant place and time; however, the same cannot (or at least, it is extremely improbable that it would) cause an ice age or a global warming in the distant future. Every single model’s prediction of a particular level of warming in a particular year is per se meaningless. It is the sum of all predictions that produces a meaningful result. And unless it is demonstrated that slightly different CPUs or, for that matter, initial conditions generate completely different _average trends_, the thing is completely irrelevant to climate models.
Nick Stokes says:
July 27, 2013 at 12:40 pm
Frank K. says: July 27, 2013 at 12:16 pm
“Unfortunately, Nick, climate (as formulated in most GCMs) is an initial value problem. You need initial conditions and the solution will depend greatly on them (particularly given the highly coupled, non-linear system of differential equations being solved).
“They follow patterns of synthetic weather”??
REALLY? Could you expand on that?? I have NEVER heard that one before…”
Here is something that is familiar, and is just a small part of what AOGCMs do: time-varying ocean currents, shown with SST. You see all the well-known effects – Gulf Stream, ENSO, Agulhas. It’s time-varying, with eddies.
If you run that on another computer, the time features won’t match. You’ll see eddies, but not synchronous. That’s because of the accumulation of error. There is no prediction of exactly what the temperature will be at any point in time. But the main patterns will be the same. The real physics being shown is unaffected.
Nick, the ‘real physics’ may be unaffected, but the different values for the various vectors will be, and if these are the starting conditions for a floating-point-constrained program modeling a chaotic system, then the results will be totally different depending on which computer produced the start point. This is indeed what we see with GCMs.
From experience, even trying to model mesoscale weather (say out to 20 Km) accurately is not feasible more than 20 – 30 minutes out.
Therefore, there will be hidden assumptions in the models about which physical laws can be discounted as unimportant at a particular scale, which are averaged together and which are completely retained. These assumptions will also create wild variance in the chaotic models. I presume that some of the ‘training’ of the models is to bound the wildness of these variances by rule of thumb to stay inside acceptable (to the researcher) bounds. Thus the models cease to be scientific – purely based on ‘the laws of physics’, and become merely the expression of the researcher’s best guess hidden behind complex software.
ZOMG, I take it back; after looking around, it turns out someone in climate science does understand chaos.
Very good article:
http://judithcurry.com/2011/03/05/chaos-ergodicity-and-attractors/
AND THIS I AGREE WITH:
=>Nothing tells us that such a finite dimensional attractor exists, and even if it existed, nothing tells us that it would not have some absurdly high dimension that would make it unknown forever. However the surprising stability of the Earth’s climate over 4 billion years, which is obviously not of the kind “anything goes,” suggests strongly that a general attractor exists and its dimension is not too high.
The challenge for a climate science modeler is to keep adjusting the attractor until you get a lock on the chaotic signal, and at that point you can perfectly predict.
LdB says:
July 28, 2013 at 7:47 am
@Nick Stokes
NS says: It isn’t an error. It is very well known that atmosphere modelling is chaotic (as is reality). Numerical discrepancies of all kinds grow, so that the forecast is no good beyond about 10 days. Rounding errors grow like everything else.
…
So Nick, without getting into some of the more complex mathematical and QM ways to deal with chaotic errors, the absolutely simple way to do it is to drag the model back down to reality and dump the errors each cycle. All weather forecasts typically do that: they ignore yesterday’s mistake and forecast forward.
Exactly, and this approach is used in many optimization systems: where you can have “waypoints”, this approach will constantly correct the model and can be used to adjust parameters. Anyone who has worked with manufacturing scheduling realizes this. The scheduling algorithms are typically NP-hard problems with approximate solutions. Correcting further by monitoring progress with real-time data collection reduces errors considerably; an obvious point, one would think.
My first work was in designing math processors that could multiply and divide numbers of any length — decimal or otherwise — with any degree of accuracy/precision desired. The math processing routines in today’s compilers do use approximations — as do the math co-processors in the more advanced chips. This is simply to keep computational time reasonable in “small” computation systems.
Many times the algorithm chosen to solve a problem is one that uses a lot of “multiply/divide” (where relative error is specified) when it could just as easily rely on an algorithm that uses mostly “add/subtract”, where you can typically use “absolute error”. Sometimes we have a choice of algorithm; sometimes we don’t.
The add/subtract error grows much more slowly than the error of the multiplication and division operations. It’s one reason the use of determinants to solve an equation set can often be a bad idea as opposed to LU factorization or one of the other algorithms for equation-set solution that use mostly add/subtract (Gauss-Seidel, if I recall correctly).
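For readers who have not met it, here is a minimal Gauss-Seidel sketch (my example, on a made-up 3x3 diagonally dominant system): the iterative style of solver the commenter contrasts with expanding determinants (Cramer’s rule).

```cpp
// Minimal Gauss-Seidel sketch (an illustration, not production code): solve
// A x = b for a small diagonally dominant system by iterative sweeps.
#include <cstdio>
#include <cmath>

int main() {
    const double A[3][3] = {{ 4.0, -1.0,  0.0},
                            {-1.0,  4.0, -1.0},
                            { 0.0, -1.0,  4.0}};
    const double b[3]    = { 15.0, 10.0, 10.0};
    double x[3]          = { 0.0, 0.0, 0.0};   // initial guess

    for (int sweep = 0; sweep < 100; ++sweep) {
        double maxChange = 0.0;
        for (int i = 0; i < 3; ++i) {
            double sum = b[i];
            for (int j = 0; j < 3; ++j)
                if (j != i) sum -= A[i][j] * x[j];   // uses the latest values
            const double newXi = sum / A[i][i];
            maxChange = std::fmax(maxChange, std::fabs(newXi - x[i]));
            x[i] = newXi;
        }
        if (maxChange < 1e-12) break;   // converged
    }
    std::printf("x = (%.6f, %.6f, %.6f)\n", x[0], x[1], x[2]);
    return 0;
}
```

For diagonally dominant systems this converges quickly and avoids the huge intermediate values that determinant expansion can produce.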
On occasion, when asked to evaluate why a computer is providing “bad answers”, I have created and run algorithms that gave an answer correct to, say, a hundred decimal places to make a point (whether on a micro-processor or a mainframe). It’s all about the time requirements and your appetite (tolerance) for error.
Advances in FPGAs and ASICs could change how we do calculations of popular algorithms, as we could design special-purpose chips or programs that would run only a particular algorithm, or at high accuracy. Again, it is a money vs. time vs. accuracy requirements issue.
Apologies for the simplistic explanation.
Anthony and others,
As a numerical modeler at the University of Washington, let me make something clear: this issue is not really relevant to climate prediction–or at least it shouldn’t be. The fact that differences in machine architecture, number of processors, etc. will change a deterministic weather forecast over an extended period is well known. It is a reflection of the fact that the atmosphere is a chaotic system and that small differences in initial state will eventually grow. This is highly relevant to an initial value problem, like weather forecasting. Climate modeling is something else…it is a boundary value problem…in this case the radiative effects due to changes in greenhouse gases.
To do climate forecasting right, you need to run an ensemble of climate predictions, and for a long climate run the statistical properties of such climate ensembles should not be sensitive to the initial state. And they should reveal the impacts of changing greenhouse gas concentrations.
…cliff mass, department of atmospheric sciences, university of washington
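A toy numerical sketch of the ensemble argument above (my illustration, not Cliff Mass’s code): individual chaotic runs from perturbed initial conditions disagree completely, yet the ensemble’s long-run mean barely depends on the initial conditions, while changing a “forcing” parameter typically does shift it. That is the sense in which a boundary-value (forcing) signal can survive initial-value chaos.

```cpp
// Toy ensemble sketch (an illustration only): the long-run mean of a chaotic
// map is insensitive to small initial-condition perturbations but responds to
// a change in the "forcing" parameter r.
#include <cstdio>

// Long-run time mean of the logistic map x <- r*x*(1-x) from a given start.
static double longRunMean(double r, double x0) {
    double x = x0, sum = 0.0;
    const int spinup = 1000, samples = 200000;
    for (int i = 0; i < spinup; ++i)  x = r * x * (1.0 - x);   // discard transient
    for (int i = 0; i < samples; ++i) { x = r * x * (1.0 - x); sum += x; }
    return sum / samples;
}

int main() {
    const double forcings[2] = {3.90, 3.97};   // two "forcing" scenarios
    for (double r : forcings) {
        const int members = 10;
        double ensembleMean = 0.0;
        for (int m = 0; m < members; ++m)
            ensembleMean += longRunMean(r, 0.51 + 1e-6 * m);   // perturbed starts
        std::printf("r = %.2f   ensemble long-run mean = %.4f\n",
                    r, ensembleMean / members);
    }
    return 0;
}
```

Whether real GCM ensembles actually achieve this insensitivity over multi-decade runs is, of course, exactly what several commenters here are questioning.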
This article was an eye-opener for me. I falsely assumed issues of rounding error, math-library calculation algorithms, and parallel processing were considered and handled properly in software. I assumed differences between measured values and climate projections were entirely due to climate simulation algorithms. I had always wondered how so many competent people were misled about the accuracy of projections of future warming. Now I see that it is quite possible to assume other folks are operating with the same or better level of knowledge and care as oneself and be led to invalid confidence in their claims.
Blarney says:
July 28, 2013 at 8:38 am
“And unless it is demonstrated that slightly different CPUs or, for that matters, inital conditions, generate completely different _average trends_, the thing is completely irrelevant to climate models.”
You sound as if you believed the average trends computed now by climate models were not already falsified.
It has been demonstrated that the average trend computed by GCMs does not correspond to reality; see the chart in the head post. So the null hypothesis holds – the climate is doing nothing extraordinary but slowly recovering from the LIA; CO2 concentrations do not affect it in a significant way.
Now the onus is on the programmers of the climate models to come up with a new hypothesis, incorporate it into their programs, make predictions, and wait for possible falsification again.
“You sound as if you believed the average trends computed now by climate models were not already falsified.”
This is a completely different matter. The influence of numerical approximations on predictions based on climate models can be (is, in my opinion) completely irrelevant, and still those models may be unable to produce accurate predictions for any other reason.
If you think that climate models are wrong, just don’t make the mistake of buying into any proposed explanation of why that is so, because instead of making your argument stronger, it makes it weaker.
cliff mass says:
July 28, 2013 at 8:52 am
“This is highly relevant to an initial value problem, like weather forecasting. Climate modeling is something else…it is a boundary value problem…in this case the radiative effects due to changes in greenhouse gases.”
You are claiming that there are no tipping points. “Tipping points” have been the number one mainstay of climate science for years, and still are; they continue talking about a “point of no return”.
Thanks for clarifying that official climate science now believes something else; namely, a simple one-to-one relationship between greenhouse gases and temperature.
In that case, why do they run 3 dimensional models at all? A handkerchief should provide sufficient space for extrapolating the temperature in the year 2100.
Are you saying they -cough- are wasting taxpayer money by buying themselves unnecessary supercomputers? Oh. Yes, that is exactly what you are saying.
The more important question is how do you validate a model of a chaotic system? To validate the model of a stable system is simple – you just compare prediction to reality (with error bounds) and throw the model out if the two don’t match. But when you are modelling a chaotic system you can have no expectation that prediction will look anything like reality even if the model is perfect. So how could you falsify such a model? An unfalsifiable model isn’t science.
The best method is to make use of the properties of chaotic systems. The solutions are not completely random – there will be relationships between the variables reflecting the fact that the solution set forms an attractor – a surface of lower dimension than the full space. Seek those relations. If observation does not lie on the predicted attractor, the model is falsified.
I don’t see much of this happening with climate models though. Instead of looking for relationships between the variables that would characterise the nature of the attractor and might allow the model to be tested, everyone just seems fixated on producing spaghetti graphs of temperature.
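A small sketch of that idea (my illustration, using the textbook Lorenz-63 system rather than a GCM): two runs from almost identical initial conditions end up in completely different instantaneous states, yet their long-run statistics, which are properties of the attractor, come out close. It is those attractor-level statistics, not the point-by-point trajectory, that a chaotic model can sensibly be tested against.

```cpp
// Sketch: pointwise divergence vs. statistical agreement for Lorenz-63
// (an illustration only; crude forward-Euler integration).
#include <cstdio>

struct State { double x, y, z; };

// One forward-Euler step of the classic Lorenz-63 system.
static State stepLorenz(State s, double dt) {
    const double sigma = 10.0, rho = 28.0, beta = 8.0 / 3.0;
    return { s.x + dt * sigma * (s.y - s.x),
             s.y + dt * (s.x * (rho - s.z) - s.y),
             s.z + dt * (s.x * s.y - beta * s.z) };
}

int main() {
    State a{1.0, 1.0, 20.0};
    State b{1.0 + 1e-9, 1.0, 20.0};   // tiny perturbation, like a rounding difference
    const double dt = 0.001;
    const long   nSteps = 10000000;   // long run, to sample the attractor

    double sumZa = 0.0, sumZb = 0.0;
    for (long i = 0; i < nSteps; ++i) {
        a = stepLorenz(a, dt);
        b = stepLorenz(b, dt);
        sumZa += a.z;
        sumZb += b.z;
    }
    std::printf("final z:  run A %8.3f   run B %8.3f   (pointwise: unrelated)\n", a.z, b.z);
    std::printf("mean  z:  run A %8.3f   run B %8.3f   (statistics: close)\n",
                sumZa / nSteps, sumZb / nSteps);
    return 0;
}
```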
This sort of thing is very difficult to control for. See, for example, Floating-Point Determinism.
steven says: July 27, 2013 at 11:17 am
Can’t wait to see what RGB has to say about this!
Just maybe we’ve stumbled across a reason for averaging GCM results: we’re averaging out random rounding errors. But if the rounding errors are that large, who is willing to trust any of the results?
Blarney says:
July 28, 2013 at 9:18 am
“If you think that climate models are wrong, just don’t make the mistake of buying in any proposed explanation of why it is so, because this instead of making your argument stronger, it makes it weaker.”
I haven’t even begun to cite the holes in the physics in the models because I didn’t want to derail the debate.
Reed Coray says:
July 28, 2013 at 10:06 am
“Just maybe we’ve stumbled across a reason for averaging GCM results: we’re averaging out random rounding errors. But if the rounding errors are that large, who is willing to trust any of the results?”
That argument, if it were made, would fail for two reasons:
-averaging would only help if the errors were equally or normally distributed; now, are they?
(Law of Large Numbers again; it does not hold for Cauchy-type distributions – see the quick numerical sketch below this list)
-the compounding errors grow over time, so the averaging would become more ineffectual with every time step. Why do it if no predefined level of quality can be maintained? A scientist should have a reason for doing things – not just average 20 runs because 20 is a nice number and averaging sounds cool.
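Here is the Law-of-Large-Numbers point as a quick numerical sketch (my example, not the commenter’s): the running mean of normally distributed errors settles down as the sample grows, while the running mean of Cauchy-distributed errors typically keeps wandering no matter how many samples are averaged.

```cpp
// Running means of normal vs. Cauchy "errors" (an illustration only).
#include <cstdio>
#include <random>

int main() {
    std::mt19937 rng(42);
    std::normal_distribution<double> normalErr(0.0, 1.0);
    std::cauchy_distribution<double> cauchyErr(0.0, 1.0);

    double sumNormal = 0.0, sumCauchy = 0.0;
    for (long n = 1; n <= 1000000; ++n) {
        sumNormal += normalErr(rng);
        sumCauchy += cauchyErr(rng);
        if (n == 100 || n == 10000 || n == 1000000)
            std::printf("n=%7ld   mean(normal) = %+.4f   mean(cauchy) = %+.4f\n",
                        n, sumNormal / n, sumCauchy / n);
    }
    return 0;
}
```

The normal means shrink roughly like 1/sqrt(n); the Cauchy means do not converge to anything, because the Cauchy distribution has no mean.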
cynical_scientist says:
July 28, 2013 at 9:31 am
“The more important question is how do you validate a model of a chaotic system? ”
-get as much data of the real system as possible for a time interval.
-initialize the model with a state that is as close to the state of the real system at the start of your reference time interval as possible.
-run the model.
-compare with what the real system did.
And that’s what climate scientists obviously never did. They do their hindcasting but they initialize with a random state. Now, is that incompetence or malice or both?
Many, many (too many actually) years ago when I took my first class in digital signal processing (DSP), the instructor assigned us the problem of writing software code that represented an infinite impulse response (IIR) digital filter having several “poles” just inside the unit circle. We computed values for the feedback loop coefficients that would locate the “poles” at their proper locations and entered those coefficients into the computer representation of the filter. The instructor had deliberately chosen “pole” locations that required coefficients to an extremely high degree of precision. The computer/software we were using rounded (or truncated, I’m not sure which) our input coefficient values. The result was that the rounded values represented a filter where the “poles” moved from just inside the unit circle to just outside the unit circle. For those familiar with DSP, a digital filter having a “pole” outside the unit circle is unstable–i.e., for a bounded input, the output will grow without bound. When we fed random data into our computer-based IIR filter, sure enough it didn’t take long before “overflow” messages started appearing. The purpose of the whole exercise was to make us aware of potential non-linear effects (rounding) in our construction of digital filters. Apparently the Climate Science(?) community would have benefited from a similar example.
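A toy version of that classroom lesson (my sketch, not the original exercise): the same two-pole IIR recursion is run with a pole radius just inside and just outside the unit circle (the kind of shift that coefficient rounding can cause), and only the second case grows without bound.

```cpp
// Two-pole IIR recursion y[n] = a1*y[n-1] - a2*y[n-2] + x[n], with poles at
// r*exp(+/- i*theta).  A pole radius just below 1 decays; just above 1 it
// grows without bound (an illustration only, not the original exercise).
#include <cstdio>
#include <cmath>

static void run(const char* label, double r) {
    const double theta = 0.3;
    const double a1 = 2.0 * r * std::cos(theta);   // feedback coefficients
    const double a2 = r * r;
    const int    N  = 200000;
    double y1 = 0.0, y2 = 0.0, tailMax = 0.0;
    for (int n = 0; n < N; ++n) {
        const double x = (n == 0) ? 1.0 : 0.0;     // unit impulse input
        const double y = a1 * y1 - a2 * y2 + x;
        y2 = y1;
        y1 = y;
        if (n >= N - 100) tailMax = std::fmax(tailMax, std::fabs(y));
    }
    std::printf("%-36s r = %.4f   max |y| near n = %d: %.3e\n", label, r, N, tailMax);
}

int main() {
    run("pole just inside the unit circle:", 0.9999);   // stable: output dies away
    run("pole just outside the unit circle:", 1.0001);  // unstable: output blows up
    return 0;
}
```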
“Nope. you are forgetting that re-ordering non-dependant floating point operations can give you different rounding error. That is not deterministic.”
The compiler may well change the sequence of floating point instructions for better performance (e.g. changing a*b+c*b into (a+c)*b to eliminate a multiply), and thereby produce different results to those you expect. I’m not aware of any CPU that will do the same.
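A concrete instance of that rewrite (my example values, chosen to show the effect): a*b + c*b and (a+c)*b are algebraically equal, but with a = 0.1, c = 0.2, b = 10 they round to different doubles, so a compiler that performs this transformation changes the bits of the result.

```cpp
// a*b + c*b vs. (a+c)*b: algebraically identical, numerically not always.
#include <cstdio>

int main() {
    const double a = 0.1, b = 10.0, c = 0.2;
    const double twoMultiplies = a * b + c * b;   // what the source code says
    const double oneMultiply   = (a + c) * b;     // what an optimizer might emit
    std::printf("a*b + c*b = %.17g\n", twoMultiplies);
    std::printf("(a+c)*b   = %.17g\n", oneMultiply);
    std::printf("difference = %.3e\n", twoMultiplies - oneMultiply);
    return 0;
}
```

Strict IEEE 754 modes forbid this kind of algebraic rewriting precisely because it is not value-preserving; aggressive “fast math” optimization settings allow it, which is one way the same Fortran or C source can give different answers on different systems.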
Mr. Layman here. A while ago I made a comment about models projecting out to 100 years. I tried to find it but couldn’t. (It had to do with models going that far out needing numbers for such things as the price of tea in China for the next 100 years, and how that might affect the crops that may be planted instead of tea, and their effect on CO2, etc. Lots of variables that need a number to run in a computer program.) Now we’re learning that when the “Coal Trains of Death” kill us all is dependent upon what kind of computer and which program is run?
Why bet trillions of dollars, national economies, and countless lives on such uncertainty?
Tsk Tsk says:
July 27, 2013 at 1:04 pm
It’s disturbing to hear them blame this on rounding errors. As Ric noted, all of the hardware should conform to the same IEEE standard, which means it will deliver the same precision regardless. Now the libraries and compilers are a different beast altogether, and I could see them causing divergent results. Different numerical integration methods could also be at play here. Whatever the ultimate cause, this is a creative definition of the word “robust.”
Reply:
My own experience has been that IEEE compliance helps accuracy, but the way the CPU implements floating point makes a larger difference. Intel x86 uses a floating point stack. As long as you keep numbers on the stack, they compute in 80-bit precision unless you tell it to round results. Once you take the result off the stack and store it in memory as 64-bit, it gets rounded. Depending on how you write your equations into C++ or some other language, the roundoff results can be very different.
I use neural networks in my work and MS Visual Studio C++ for development. Neural networks involve many calculations of dot products of two vectors that are fed forward through the network. In my case I calculate these in a single statement to keep the summation accuracy in the stack at 80-bits. If I do the calculation in an index loop, which might be the way most people would program it, I get wildly different results because each loop rounds the accumulated result down to 64-bit each time it loops. The MS C++ compiler also doesn’t implement 80-bit IEEE and so isn’t well suited to these calculations. I only get 80-bit because I’m aware of how the stack works. I doubt the MS compiler is used much for these climate simulations, and if it is, the accuracy should be questioned. Doing things at 80-bits in a compiler that supports it really doesn’t slow things down much.
Try this on a Power or ARM CPU and you will most likely get very different results even though they all support IEEE floating point. This means that you can’t just move the neural net (with fixed weight values) to another CPU and expect it to work.
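A portable sketch of the accumulator-precision point (my example, not the commenter’s neural-network code): the same dot product accumulated in float, double, and long double can give three slightly different answers. Whether long double actually maps to 80-bit x87 extended precision depends on the compiler and platform (as the commenter notes, it does not under the MS compiler), which is exactly the kind of system dependence at issue in the paper.

```cpp
// Dot product accumulated at three precisions (an illustration only).
#include <cstdio>
#include <cstdlib>
#include <vector>

int main() {
    // Two fixed pseudo-random vectors with values in [-1, 1].
    std::srand(12345);
    const int n = 100000;
    std::vector<double> a(n), b(n);
    for (int i = 0; i < n; ++i) {
        a[i] = 2.0 * std::rand() / RAND_MAX - 1.0;
        b[i] = 2.0 * std::rand() / RAND_MAX - 1.0;
    }

    float       dotF = 0.0f;
    double      dotD = 0.0;
    long double dotL = 0.0L;
    for (int i = 0; i < n; ++i) {
        dotF += static_cast<float>(a[i] * b[i]);        // single-precision accumulator
        dotD += a[i] * b[i];                            // double-precision accumulator
        dotL += static_cast<long double>(a[i]) * b[i];  // extended (maybe 80-bit) accumulator
    }
    std::printf("float accumulator:       %.12f\n", static_cast<double>(dotF));
    std::printf("double accumulator:      %.12f\n", dotD);
    std::printf("long double accumulator: %.12Lf\n", dotL);
    return 0;
}
```

The weights of a trained network are tuned to whichever of these answers the training machine produced, which is why moving fixed weights to a CPU with different accumulation behaviour can change the outputs.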
If you are writing heavy-duty code, in any language, to be used for a serious purpose, then there are a few tips that people need to be aware of:
Others have done, and still are doing, this job, so copy the best methods.
Use approved compilers and only use subsets of your language of choice.
A typical example is MISRA C. It is not a compiler but a set of rules and ‘tests’ ensuring that your chosen compiler/CPU will ALWAYS WORK THE SAME WAY EVERY TIME.
What does this mean in practice? Common programming tricks are banned because the ‘system’ cannot guarantee it will give the same result every time it runs. (Heard that one before?)
MISRA C is now the accepted standard to use if you are coding for automobiles etc.
http://www.misra.org.uk/misra-c/Activities/MISRAC/tabid/160/Default.aspx
If you want to code for aircraft flight control systems, it just gets worse from the programmer’s perspective (but safer for the passenger).
DO-178B (changing to version C soon) is much tougher and contains many more rule/procedure-based checking details.
To prove that you have correctly used MISRA C or DO-178 (the programming rules) you use a software QA toolset. LDRA (http://www.ldra.com/) produces tools that analyse your code before you even run it, getting rid of many bugs at the beginning.
If I were involved in spending 1 million on producing a GCM, I would expect a few professional programmers, a test engineer, and a QA engineer to prove to the customer that what we have produced fits the bill and can be seen to do so.
Nick Stokes, you keep saying global climate models are built on ‘simple’ and real physics.
The fact is we still struggle to model a ‘simple’ molecule of CO2 in physics. The maths required to do this is frightening.
If there are serious and difficult issues in describing how a single molecule of CO2 can be modelled accurately using highly complex maths, how on earth can we accurately predict how the climate will react to adding more CO2 to the atmosphere when we can’t accurately model a CO2 molecule?