New peer reviewed paper finds the same global forecast model produces different results when run on different computers
Did you ever wonder how spaghetti like this is produced and why there is broad disagreement in the output that increases with time?
Graph above by Dr. Roy Spencer
Increasing mathematical uncertainty from initial starting conditions is the main reason. But some of it might be due to the fact that while some of the models share common code, they don't produce the same results with that code, owing to differences in the way CPUs, operating systems, and compilers work. Now, with this paper, we can add software uncertainty to the list of uncertainties that are already known unknowns about climate and climate modeling.
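To see how compiler and CPU differences can creep in at all, here is a minimal illustration (mine, not from the paper): floating-point addition is not associative, so merely changing the order in which a compiler or parallel library accumulates terms changes the last bits of the result.

```python
# Floating-point addition is not associative: the grouping chosen by a
# compiler, optimization level, or parallel reduction changes the rounding.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c    # one evaluation order
right = a + (b + c)   # the same sum, grouped differently
print(left == right)          # False: the two orderings disagree
print(abs(left - right))      # a difference in the last bits
```

Each individual result is correctly rounded; it is the accumulation of such last-bit differences over many time steps that the paper measures.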
I got access to the paper yesterday, and its findings were quite eye-opening.
The paper was published 7/26/13 in Monthly Weather Review, a publication of the American Meteorological Society. It finds that the same global forecast model, run on different computer hardware and operating systems with no other changes, produces different results, as measured by the 500-hPa geopotential height.
They say that the differences are…
“primarily due to the treatment of rounding errors by the different software systems”
…and that these errors propagate over time, meaning they accumulate.
According to the authors:
“We address the tolerance question using the 500-hPa geopotential height spread for medium range forecasts and the machine ensemble spread for seasonal climate simulations.”
…
“The [hardware & software] system dependency, which is the standard deviation of the 500-hPa geopotential height [areas of high & low pressure] averaged over the globe, increases with time.”
The authors find:
“…the ensemble spread due to the differences in software system is comparable to the ensemble spread due to the differences in initial conditions that is used for the traditional ensemble forecasting.”
The initial conditions of climate models have already been shown by many papers to produce significantly different projections of climate.
It makes you wonder if some of the catastrophic future projections are simply due to a rounding error.
Here is how they conducted the tests on hardware/software:
Table 1 shows the 20 computing environments including Fortran compilers, parallel communication libraries, and optimization levels of the compilers. The Yonsei University (YSU) Linux cluster is equipped with 12 Intel Xeon CPUs (model name: X5650) per node and supports the PGI and Intel Fortran compilers. The Korea Institute of Science and Technology Information (KISTI; http://www.kisti.re.kr) provides a computing environment with high-performance IBM and SUN platforms. Each platform is equipped with different CPU: Intel Xeon X5570 for KISTI-SUN2 platform, Power5+ processor of Power 595 server for KISTI-IBM1 platform, and Power6 dual-core processor of p5 595 server for KISTI-IBM2 platform. Each machine has a different architecture and approximately five hundred to twenty thousand CPUs.
And here are the results:

While the differences might appear small to some, bear in mind that these differences in standard deviation are for only 10 days' worth of modeling with a short-term global forecast model, not a decades-out global climate model. Since the software effects observed in this study are cumulative, imagine what the differences might be after years of calculation into the future, as we see in GCMs.
Clearly, a long-term evaluation of this effect is needed for many of the GCMs used to project future climate, to determine whether it also affects those models and, if so, how much of their output is real and how much is simply accumulated rounding error.
Here is the paper:
An Evaluation of the Software System Dependency of a Global Atmospheric Model
Abstract
This study presents the dependency of the simulation results from a global atmospheric numerical model on machines with different hardware and software systems. The global model program (GMP) of the Global/Regional Integrated Model system (GRIMs) is tested on 10 different computer systems having different central processing unit (CPU) architectures or compilers. There exist differences in the results for different compilers, parallel libraries, and optimization levels, primarily due to the treatment of rounding errors by the different software systems. The system dependency, which is the standard deviation of the 500-hPa geopotential height averaged over the globe, increases with time. However, its fractional tendency, which is the change of the standard deviation relative to the value itself, remains nearly zero with time. In a seasonal prediction framework, the ensemble spread due to the differences in software system is comparable to the ensemble spread due to the differences in initial conditions that is used for the traditional ensemble forecasting.
h/t to The Hockey Schtick
When we know that in coming eons we will have oscillated through glacial to interglacial periods within limits of plus or minus 5 C, I think we should be looking at a well-established, major, determinative physical driver instead of believing we go from ice age to interglacial by "runaways" of a chaotic nature between two attractors. This latter bespeaks too much mathematical mind-clogging over the tiny centennial ripples we preoccupy ourselves with in climate science, the ripples on the inexorable megaennial movements of temperature.
DirkH,
“Non-linear complex systems such as climate are by their very nature chaotic,”
“No. Only when they amplify low order state bits. Complexity alone is not necessary and not sufficient. The Mandelbrot equation is not very complex yet chaotic.”
*************************
When you can show me the math that handles turbulence (or even the physics of clouds), I'll retract. Beyond the math, the behaviour of climate exhibits all the characteristics of chaotic behaviour. The more iterations, the quicker it spirals out to a completely different result. Simply compare the accuracy of a 1-day forecast to a 30-day one.
The basically ignorant supposition that simply running a GCM enough times will somehow give a credible result with a chaotic system, shows a total failure to understand the difference between an average predicted result and the complete unpredictability of chaotic systems.
The precision to which a computer calculates numbers is totally irrelevant if you're in the telling-the-future business. If you don't understand the problem, or don't have the adequate physics/math to express it, you're going to get a wrong result.
Pointman
Pointman says:
July 27, 2013 at 2:59 pm
“DirkH,
“Non-linear complex systems such as climate are by their very nature chaotic,”
“No. Only when they amplify low order state bits. Complexity alone is not necessary and not sufficient. The Mandelbrot equation is not very complex yet chaotic.”
*************************
When you can show me the math that handles turbulence (or even the physics of clouds), I'll retract. Beyond the math, the behaviour of climate exhibits all the characteristics of chaotic behaviour. The more iterations, the quicker it spirals out to a completely different result. Simply compare the accuracy of a 1-day forecast to a 30-day one."
You are of course completely right for climate; but you said
“Non-linear complex systems such as climate are by their very nature chaotic”
– where “such as climate” mentions an example, so let’s reduce it to the statement
“Non-linear complex systems are by their very nature chaotic”
which is not always correct. I’m picking nits, but definitions are all about the nits. 😉
1. This seems to be the IT equivalent of science’s ‘confirmation bias’. Only if you notice that rounding errors are causing problems do you fix them.
2. Weather forecasts use similar models and equations, and they start to fail just a few days ahead. That’s why the weather bureaus only make forecasts a few days ahead. Actually, they do make generalised longer term forecasts, but they aren’t worth much. Surely similar rules should apply to climate models.
3. As Willis Eschenbach has pointed out several times in the past, the climate models act as a black box in which the climate forecasts simply follow a single factor, their assumed ECS, and all the surrounding millions of lines of code have no effect. [From memory. Apologies, w, if I have misrepresented you].
Although I am blowing my own trumpet I explained precisely this problem in this post
http://wattsupwiththat.com/2013/03/08/statistical-physics-applied-to-climate-modeling/
and this
http://wattsupwiththat.com/2013/06/01/a-frank-admission-about-the-state-of-climate-modeling-by-dr-gavin-schmidt/
I claim my £5. Nothing wrong with being a sceptic. It just amazes me that the issue only now appears in a paper related to GCMs.
Pointman
(to DirkH)
When you can show me the math that handles turbulence (or even the physics of clouds), I’ll retract.
>>>>>>>>>>>>>>>
This is the other side of the problem, and Pointman has, in my opinion, nailed it. Beyond the inability of CPUs to handle extremely precise numbers in a consistent fashion, we forget in this day and age that computers are actually dumber than posts. Seriously, they are capable of only very simple instructions. Their advantage is that, for the very simple things they CAN do, they have the capability to do them very, very fast.
That’s just fine when you can break the problem you are working on into simple pieces. But if the problem itself is too complex for the human mind to understand, then the human mind cannot break it into simple pieces that are known to be correct, and any computer program built upon an incomplete understanding of the physics is going to produce correct results only by some miracle of chance.
So we only have about 31,755 days to go until the year 2100. So this means garbage in, garbage out into computer software handling differences = accurate IPCC projections for the year 2100. Oh, our children and grandchildren will have a field day in 2100. Historians won’t be able to type their books because of the huge belly laughs.
Would the models even produce the same result when run on the same computer on different runs?
Are the rounding errors always made to the high side? Following the usual CAGW fashion, of course…
What a fascinating discussion and what a privilege it is to be privy to the massive crowdsourcing that is made possible by this site.
The range of expertise available, and the freedom to express one's ideas (within the bounds of decency) made possible by the world's most viewed science blog, is quite breathtaking.
Thank you Anthony and all those who contribute (for better or for worse) to demonstrate the future of learning and enquiry.
Jimmy Haigh says:
July 27, 2013 at 3:25 pm
“Would the models even produce the same result when run on the same computer on different runs?”
If the errors are due to differences in the floating-point implementations of different computer systems, the result should stay constant on one system (given that the exact same initialization happens, which can usually be accomplished by using the same random seed for the random generator, if a random generator is used to fill the initial state of the model).
(deterministic)
If, on the other hand, errors are introduced by CPU errata or by race conditions between CPU cores, as mentioned by others, we would expect every run to have different results even when the initialization is identical.
(nondeterministic)
Correction: Depending on the nature of a CPU erratum, it could be present in the deterministic or in the nondeterministic camp. Many CPU errata are internal race conditions inside the CPU.
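DirkH's deterministic case is easy to sketch (a toy loop of my own, not any real model): with identical initialization, i.e. the same seed, a sequential program reproduces its results bit for bit on repeat runs, while a different seed gives different output.

```python
import random

# Toy illustration: a seeded, sequential "model" is fully deterministic.
def run(seed, steps=10):
    rng = random.Random(seed)
    state = [rng.random() for _ in range(100)]  # "initial conditions"
    for _ in range(steps):
        mean = sum(state) / len(state)
        state = [0.99 * x + 0.01 * mean for x in state]  # a toy "model step"
    return state

print(run(42) == run(42))  # True: identical runs match exactly
print(run(42) == run(43))  # False: different initialization, different output
```

The nondeterministic camp (races between cores, out-of-order parallel reductions) breaks exactly this guarantee: the same seed no longer implies the same answer.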
Jonathan Abbott (at 1:11PM)! So, you’re the boss in “Dilbert”!!! At least, you are bright and want to learn, unlike that guy. (and I’m sure you don’t style your hair like he does, either, lol)
Well, all I can say is, if you want some great insight into what it is like for your software engineers to work for someone who is new to coding, read Dilbert (by Scott Adams).
From your conscientiousness, I’m sure, given all the real Dilbert bosses out there, they consider themselves blessed.
Window into the world of being part of the “team”
“Since the software effects they observed in this study are cumulative”
Not necessarily. Models sometimes diverge, and they may diverge for any number of reasons, floating-point precision being the most common. However, the models *should* be stable, that is, return to a state (or trend) regardless of errors in the initial conditions or calculations. If the models are *not* stable, then they are glorified, curve-fitted, unstable extrapolations. Which they are.
It definitely is a heads up for climate modellers, but 10 days of simulation tells very little about how stable the simulation is as a whole. If you run a good weather model on different hardware or using different arithmetic settings, it will also produce different results based on type of rounding and arithmetic precision, but results from multiple runs will be spread around the same forecast values.
So it sure deserves attention and further examination but it’s too early to say that climate models are sensitive to that effect. It would be definitely great shame if they were, though.
davidmhoffer says:
If the programmer didn't take errata into account, the most likely result is that they are ALL wrong.
No. Unless you are programming in assembly language, it is generally the job of the compiler to deal with CPU bugs. A rare exception to this was the Pentium FDIV bug.
It is the job of the GCM programmer to understand the programming language's guarantees and to know how to correctly perform numerical calculations to the desired precision. They also need a good understanding of how errors can propagate.
Unless they wanted to use SIMD hardware (Single Instruction, Multiple Data) such as SSE instructions, or an FPU, they should be using integral data types. I.e., a temp of 15.11 C could be stored in an integer data type as e.g. 1511, or as 151100, depending on the precision you need.
They would also need to avoid division as an intermediate operation. And you always need to be careful with division, in order to maintain precision.
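ikh's fixed-point suggestion can be sketched in a few lines (the helper names and the scale factor are illustrative, not from any model): store 15.11 C as the integer 1511, so that addition and subtraction are exact integer operations.

```python
# A minimal fixed-point sketch: temperatures held as scaled integers.
SCALE = 100  # two decimal places of precision (an assumption for illustration)

def to_fixed(celsius):
    """Convert a reading to its scaled-integer representation."""
    return round(celsius * SCALE)

def from_fixed(raw):
    """Convert back to a float only at the edge, for display."""
    return raw / SCALE

t1 = to_fixed(15.11)   # stored as 1511
t2 = to_fixed(0.04)    # stored as 4
total = t1 + t2        # exact integer addition: 1515
print(from_fixed(total))
```

The one conversion at input and one at output are the only places rounding can occur; every intermediate sum is exact.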
Heather Brown's comments above are spot on. We have climate scientists and physicists coding climate models without sufficient training in computer science. No wonder they produce GIGO results.
I tried to look at the open-source climate models, including GISS E. All of the models I looked at were coded in Fortran. Most used mixed versions of the language, making it very difficult to understand or reason about the computing model they are following. Not one of them was even minimally documented, and they were almost completely lacking in comments. This makes them almost undecipherable to an outsider. Fortran is an archaic language that is almost never used in the commercial world, because we can get the same performance from more modern languages such as C or C++, with much better readability and easier reasoning for correctness.
We cannot even check these models ourselves, because we do not have access to the hardware necessary to run them: typically a cluster or supercomputer.
Irrespective of what we may think of the mathematical modelling in GCMs, we have a separate and independent criticism: they are not reproducible across hardware platforms.
/ikh
But this may indicate that the models exhibit multiple solutions (which is somewhat a trivial statement) and that the modellers fail to track the "physical" one (which is sort of surprising). I have wondered for a long time how they knew which solution to follow, but now it seems they do not.
It still seems to me that failing to identify the physical solution is too fundamental an aspect of the simulation to ignore, but maybe the coupled non-linear nature of GCMs makes the problem so intractable that the consensus science agreed just to ignore it.
For those of sufficient curiosity, get the old Mandelbrot set code, and set it up to use the maximum resolution of your machine. Now take a Julia set and drill down, keep going until you get to the pixels. This is the limit of resolution for your machine, if you’re lucky, your version of the Mandelbrot algorithm lets you select double precision floating point numbers which are subsequently truncated to integers for display, but still give you some billions of possible colors. The point is, every algorithm numerically approximated by a computer has errors. A computer can only compute using the comb of floating point numbers, not the infinite precision that the real world enjoys. Between every floating point number, there are a very large (how large? for the sleepless among us) number of numbers with infinite decimal places.
Solving PDEs using approximate solutions, gets you errors, period. Parameters can make the pictures look pretty, but they can’t make the solutions better. The precision isn’t there. That’s one of the reasons that “Cigar box physics” still rules. If you can’t fit the problem description and solution in a cigar box, you probably don’t understand the physics yet.
To regular readers here, note that Willis’ contributions are generally a perfect example of CBP.
wsbriggs says:
July 27, 2013 at 4:14 pm
“For those of sufficient curiosity, get the old Mandelbrot set code, and set it up to use the maximum resolution of your machine. Now take a Julia set and drill down, keep going until you get to the pixels. ”
What he means is, zoom into it until it becomes blocky. The blocks you see are artefacts, because your computer has run out of precision. They shouldn't be there if your computer did "real" maths with real numbers. Floating-point numbers are only a subset of the real numbers.
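The "comb" of floating-point numbers is directly measurable; `math.ulp` (Python 3.9+) gives the spacing between adjacent representable doubles, which is the resolution limit behind those blocky pixels.

```python
import math

# The spacing of doubles near 1.0 is one "unit in the last place" (ulp).
print(math.ulp(1.0))        # 2**-52: the gap between 1.0 and the next double
# A perturbation smaller than half an ulp is simply absorbed...
print(1.0 + 1e-17 == 1.0)   # True: the addition changes nothing
# ...while a perturbation above the ulp survives.
print(1.0 + 1e-15 == 1.0)   # False: now representable as a distinct value
```

Between any two adjacent doubles lies an uncountable infinity of real numbers the machine simply cannot express.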
DirkH says:
July 27, 2013 at 3:37 pm
Jimmy Haigh says:
July 27, 2013 at 3:25 pm
“Would the models even produce the same result when run on the same computer on different runs?”
I am sorry DirkH, but you are wrong. You are assuming far too simplistic a computing model. What you need to remember is that each core can do out-of-order execution, as long as the operations are non-dependent. This means that numerical calculations can be re-ordered.
And that is just on a single-core CPU. Then add multi-threading on multiple cores, and multi-processing across the cluster that makes up a supercomputer, and you have a completely non-deterministic piece of hardware. It is up to the programmer to impose order, and the climate modellers do not have that skill.
/ikh
I recall reading several years ago that modeling is an art, not a science, and that of ~140 ‘best practices’ learned the hard way by those whose living depended on accuracy (i.e. oil exploration, etc) over 120 were violated by climate models.
When I took a numerical computation class back in the 1970s (using FORTRAN IV and WATFIV) we spent weeks going over error terms and how to try to minimize them. They are generally represented by epsilon. Anyone with even a minimal background in computation would be aware of this. I'm sure there are advanced methods to try to compensate for these errors, but these clowns don't seem to find it necessary to get advice from experts in this field.
And this is even more remarkable when you consider that NASA computer people must be well aware of these problems, since their spacecraft seem to get where they are intended to go, for the most part anyway. You know, NASA: where Mr Hansen used to work.
Paul Jackson says:
July 27, 2013 at 1:32 pm
Edward Lorenz pretty much came to almost the exact same conclusion, in regard to almost the exact same computational problem, almost 50 years ago; this, as the Warmistas would say, is "settled science".
=============================
Exactly! And begat the study of non-linear dynamical systems, aka chaos theory.
This thread is an excellent discussion of potential math pitfalls while coding; even if the information is in bits and pieces.
I kept copying a comment with the intention of using the comment as an intro point for my comment; only to copy another comment further on.
Before I start a comment, I would like to remind our fellow WUWT denizens about some of the spreadsheets we’ve seen from the CAGW crowd. Lack of data, zeroed data, missing or incorrect sign. Just the basics behind developing models is flawed, let alone building code around them.
Having an idea for a program is good; programming straight from the idea is, well, not intelligent. The more complex a program is intended to be, the more intense and rigorous the design and testing phases must be.
Any program should be required to process known inputs and return verified results. If the program is chaotic, then all information, data and numbers must be output for every operation. Then someone has to sit down and verify that the program is processing correctly.
All too often, both design and testing phases are skipped or assumed good. What makes a good design? Not that the end result is a perfect match to the design, but that the design is properly amended with explanation so that it matches the result.
All three items (design, test inputs and outputs, and code) should be available for review. Protection? That is what copyright laws are intended for, or a patent if unique enough, though very few programs are truly unique.
Computer languages handle numerical information differently. Since IEEE 754-1985, modern IEEE-compliant languages handle floating-point calculations automatically. What happens when numbers exceed the precision of the representation is rounding.
Rounding is a known quantity and can be controlled; that is why the frequent complaint is that it is a novice who allows a program to follow or process incorrect rounding assumptions.
The intention of rounding is to follow a roughly sum-neutral process, where the total of the up-roundings cancels the total of the down-roundings; e.g., of 365 numbers rounded, 182 rounded up might contribute about +91 while 183 rounded down contribute about -91, leaving the aggregate nearly unchanged.
365 divided by two gives 182.5; rounding both halves up would make for 366 days, so one number must be rounded down while the other is rounded up. This rounding must be forced by the programmer!
The numbers rounded could be .4, normally rounded down, or .6, normally rounded up. Depending on the rounding approach, .5 is normally rounded up.
This concept follows an even-division approach: a 50/50 split in how numbers are rounded. Where datasets with huge arrays of numbers can really get caught hard is when large numbers of rounded values are aggregated or, God forbid, the program's default rounding approach is assumed good enough for climate work.
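The sum-neutral idea above is exactly what IEEE 754's default round-half-to-even ("banker's rounding") provides; a small sketch with Python's `decimal` module shows how it compares with always rounding .5 up:

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

# Round-half-to-even sends ties to the even neighbour, so the .5 cases
# balance out; ROUND_HALF_UP pushes every tie away from zero.
halves = [Decimal("0.5"), Decimal("1.5"), Decimal("2.5"), Decimal("3.5")]
half_even = [int(h.to_integral_value(rounding=ROUND_HALF_EVEN)) for h in halves]
half_up = [int(h.to_integral_value(rounding=ROUND_HALF_UP)) for h in halves]
print(half_even)  # [0, 2, 2, 4]: ties go to the even side
print(half_up)    # [1, 2, 3, 4]: every tie rounds up
exact = sum(float(h) for h in halves)  # 8.0
print(sum(half_even) - exact)  # 0.0: the bias cancels
print(sum(half_up) - exact)    # 2.0: the bias accumulates
```

Aggregated over a large dataset, the half-up bias grows with the number of ties, which is precisely the hazard described above.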
Nick Stokes supplied an example, along with a description of what happens when the code is run on two different systems. Nick mentions that the time features are wrong. Well, that could be because the portrayal of time in a program is a function of math computation, and of intensive use of the time function with accumulated roundings (Ingvar Engelbrecht's N+1 discussion). Note the "could", as I haven't dissected the code to follow and flesh out time. What happens internally is that the "time function" call to the system is defined, processed, rounded, tracked, and stored differently.
Uhoh!
I wouldn’t necessarily agree that the team must really know their math functions nowadays. It used to be that way, but the days where the programmer had to allocate specific memory, registers to store numbers, explicitly define every number field are pretty much past.
Mark Bofill states it fairly well in several different comments. The programmers must be diligent, with both calculations and output rigorously checked. If you've got to ask how the team is handling rounding errors, check and make sure you still have both legs, as your software engineers should know better. If a software engineer tells you he cobbled up the code over the weekend, assign someone else, preferably not a pal, to verify the code and all data handling, including the math.
Sounds like complete garbage to me.
It has been many years since compilers were responsible for doing floating point computations. Nowadays all the heavy lifting is done using hardware and pretty much all of it is IEEE 754 compliant. IEEE 754 provides ways to configure how rounding is done, so it should be possible to get pretty identical results on different hardware.
Even if the programmer failed to configure the floating point hardware (a real rookie error) the rounding errors ought to be very small. Algorithms with large numbers of iterations can compound small rounding errors into big errors, but that is usually the sign of a very naive algorithm. Better designed programs can usually avoid this.
Large errors are most likely due to software bugs.
Numerical software that cannot produce consistent results on different platforms is unreliable and should not be trusted on any platform.
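The claim that naive iteration compounds small errors while better-designed algorithms avoid it can be checked directly: a plain running sum of a million terms drifts, while Python's `math.fsum` (a compensated summation) returns the correctly rounded sum of the same inputs.

```python
import math

# A long naive accumulation compounds per-step rounding error; compensated
# summation tracks the lost low-order bits and recovers them.
values = [0.1] * 10**6
naive = sum(values)          # error compounds over a million additions
careful = math.fsum(values)  # Shewchuk's algorithm: correctly rounded result
print(abs(naive - 100000.0))    # visible drift from the true sum
print(abs(careful - 100000.0))  # only the unavoidable representation error
```

The residual error in the compensated sum comes solely from 0.1 not being exactly representable in binary; the algorithm itself adds none.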
calvertn says:
July 27, 2013 at 11:55 am
There’s a major difference between a program looking for a forecast for next week and one looking at climate for the next decade. For the forecast you want to know the weather conditions at various times next week. For the climate forecast you want to know the average conditions then. The actual conditions would be nice, but chaos says they can’t be predicted.
If two climate models are working by simulating the weather, then it really doesn’t matter if the instantaneous weather drifts widely apart – if the average conditions (and this includes tropical storm formation, ENSO/PDO/AMO/NAO/MJO and all the other oscillations) vary within similar limits, then the climate models have produced matching results. (If they’re really good, they’ll even be right.)
This bugs the heck out of me, so let me say it again – forecasting climate does not require accurately forecasting the weather along the way.
Another way of looking at it is to consider Edward Lorenz's attractor; see http://paulbourke.net/fractals/lorenz/ and http://en.wikipedia.org/wiki/Lorenz_system While modeling the attractor with slightly different starting points will lead to very different trajectories, you can define a small volume that will enclose nearly all of the trajectory.
The trajectory is analogous to weather – it has data that can be described as discrete points with numerical values. If some of the coefficients that describe the system change, then the overall appearance will change and that’s analogous to climate.
The trick to forecasting weather is to get the data points right. The trick to forecasting climate is to get the changing input, and the response to the changing input, right.
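The weather/climate distinction above can be sketched with a toy Euler integration of the Lorenz system (my own illustration, not any GCM's code): two runs differing by one part in a billion in the initial state diverge point by point, the "weather", yet both trajectories remain inside the same bounded region, the "climate".

```python
# Toy Euler integration of the Lorenz system with the classic parameters.
def lorenz_step(x, y, z, dt=0.002, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    return (x + dt * sigma * (y - x),
            y + dt * (x * (rho - z) - y),
            z + dt * (x * y - beta * z))

def trajectory(x0, steps=20000):
    state = (x0, 1.0, 1.05)
    path = []
    for _ in range(steps):
        state = lorenz_step(*state)
        path.append(state)
    return path

a = trajectory(1.0)
b = trajectory(1.0 + 1e-9)  # a one-part-in-a-billion nudge
# Largest pointwise gap in x over the final stretch: the "weather" has diverged.
max_gap = max(abs(pa[0] - pb[0]) for pa, pb in zip(a[-2000:], b[-2000:]))
# Both runs nevertheless stay on the attractor: the "climate" is bounded.
bound = max(abs(v) for s in a + b for v in s)
print(max_gap, bound)
```

The point forecasts become useless once the perturbation has amplified, but the enclosing volume, the statistics of the attractor, is unchanged, which is exactly the distinction between forecasting weather and forecasting climate.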