An Ocean of Overconfidence

Guest Post by Willis Eschenbach

I previously discussed the question of error bars in oceanic heat content measurements in “Decimals of Precision”. There’s a new study of changes in oceanic heat content, by Levitus et al., called “World Ocean Heat Content And Thermosteric Sea Level Change (0-2000 m), 1955-2010” (paywalled here). [UPDATE: Available here, h/t Leif Svalgaard] It’s highlighted over at Roger Pielke Senior’s excellent blog, where he shows this graph of the results:

Figure 1. From Levitus 2012. Upper graphs show changes in ocean heat content, in units of 10²² joules. Lower graphs show data coverage.

Now, there are some oddities in this graph. For one, the data starts at year 1957.5, presumably because each year’s value is actually a centered five-year average … which makes me nervous already, very nervous. Why not show the actual annual data? What are the averages hiding?

But what was of most interest to me are the error bars. To get the heat content figures, they are actually measuring the ocean temperature. Then they are converting that change in temperature into a change in heat content. So to understand the underlying measurements, I’ve converted the graph of the 0-2000 metre ocean heat content shown in Figure 1 back into units of temperature. Figure 2 shows that result.

Figure 2. Graph of ocean heat anomaly, 0-2000 metres, from Figure 1, with the units converted to degrees Celsius. Note that the total change over the entire period is 0.09°C, which agrees with the total change reported in their paper.
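As a rough cross-check of that conversion (my own back-of-envelope sketch, not taken from the paper), multiplying the 0.09°C change by the ocean mass quoted later in this post and an assumed seawater specific heat of about 3990 J/(kg·K) gives back a heat content change of roughly 24 × 10²² joules, which is the scale of the upper panel in Figure 1:

```python
# Back-of-envelope check (assumed numbers, not from the paper):
# heat content change implied by a 0.09 C warming of the top 2000 m.

mass_0_2000m = 673.42333e18   # kg, top 2000 m of the global ocean (673 quadrillion tonnes)
cp_seawater = 3990.0          # J/(kg*K), approximate specific heat of seawater

delta_T = 0.09                # deg C, total 1955-2010 change read from Figure 2
delta_Q = delta_T * mass_0_2000m * cp_seawater

print(f"Implied heat content change: {delta_Q:.2e} J (= {delta_Q / 1e22:.1f} x 10^22 J)")
# -> roughly 24 x 10^22 J, the scale shown in the upper panel of Figure 1.
```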

Here’s the problem I have with this graph. It claims that we know the temperature of the top two kilometres (1.2 miles) of the ocean in 1955-60 with an error of plus or minus one and a half hundredths of a degree C.

It also claims that we currently know the temperature of the top two kilometres of the global ocean, which is some 673,423,330,000,000,000 tonnes (673 quadrillion tonnes) of water, with an error of plus or minus two thousandths of a degree C.

I’m sorry, but I’m not buying that. I don’t know how they are calculating their error bars, but that is just not possible. Ask any industrial process engineer. If you want to measure something as small as an Olympic-size swimming pool full of water to the nearest two thousandths of a degree C, you need a fistful of thermometers; one or two would be wildly inadequate for the job. And the top two kilometres of the global ocean is unimaginably huge, with as much volume as 260,700,000,000,000 Olympic-size swimming pools …
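For anyone who wants to check that swimming-pool figure, here is the arithmetic with my own assumed numbers (a 50 m × 25 m × 2 m minimum-depth Olympic pool and a seawater density of about 1025 kg/m³); it lands in the same ballpark as the number above:

```python
# Ballpark check of the swimming-pool comparison. Assumed values: a 50 x 25 x 2 m
# (minimum-depth) Olympic pool and a seawater density of about 1025 kg/m^3.

ocean_mass_kg = 673.42333e18        # kg, top 2000 m of the global ocean
seawater_density = 1025.0           # kg/m^3
pool_volume_m3 = 50 * 25 * 2        # 2500 m^3

ocean_volume_m3 = ocean_mass_kg / seawater_density
n_pools = ocean_volume_m3 / pool_volume_m3
print(f"{n_pools:.3e} Olympic pools")   # ~2.6e14, the same order as the figure above
```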

So I don’t know where they got their error numbers … but I’m going on record to say that they have greatly underestimated the errors in their calculations.

w.

PS—One final oddity. If the ocean heating is driven by increasing CO2 and increasing surface temperatures as the authors claim, why didn’t the oceans warm in the slightest from about 1978 to 1990, while CO2 was rising and the surface temperature was increasing?

PPS—Bonus question. Suppose we have an Olympic-sized swimming pool, and one perfectly accurate thermometer mounted in one location in the pool. Suppose we take one measurement per day. How long will we have to take daily measurements before we know the temperature of the entire pool full of water to the nearest two thousandths of a degree C?
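One way to see why the answer is effectively “never” under realistic conditions is a toy simulation (all numbers invented for illustration): if the fixed thermometer’s location runs persistently warmer or cooler than the pool average, more daily readings shrink the quoted plus-or-minus but never remove that site bias.

```python
import numpy as np

rng = np.random.default_rng(0)

n_days = 10_000                 # roughly 27 years of daily readings
pool_mean_true = 20.0           # deg C, notional true pool-average temperature
site_offset = 0.3               # deg C, assumed persistent bias of the sensor's spot
daily_noise = 0.2               # deg C, assumed day-to-day fluctuation at that spot

readings = pool_mean_true + site_offset + rng.normal(0.0, daily_noise, n_days)

estimate = readings.mean()
sem = readings.std(ddof=1) / np.sqrt(n_days)
print(f"estimate {estimate:.4f} +/- {sem:.4f} deg C  (true pool mean: {pool_mean_true})")
# The quoted +/- shrinks toward millidegrees, but the 0.3 deg C site bias never does:
# more readings from one spot do not tell you the temperature of the whole pool.
```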

135 Comments
timg56
April 23, 2012 12:42 pm

RE ISO 9000 Standards.
Gail has it right. ISO Standards are basically feel good measures. They are documentation measures, not quality improvement measures.

John in L du B
April 23, 2012 12:51 pm

Rich Lambert says:
April 23, 2012 at 7:44 am
Climate science needs something like the ISO 9000 quality control standards used by industry.
REPLY: Yep, I argued for this back in 2008:
I’ve said the same thing here before too. I never believed that peer review was adequate quality assurance and neither do many regulators (e.g. the USFDA, USNRC). Actually, better would be something less customer focused, more along the lines of ISO 13485 or 21 CFR 820, Good Manufacturing Practices, which require you to fire the customer if they cut corners or are known to be engaging in fraud or dangerous practices.
“Sorry NSF, I’m returning your cheque because the Administration is misusing and misrepresenting my research to the National detriment”

Chuck Nolan
April 23, 2012 12:52 pm

Kelvin Vaughan says:
April 23, 2012 at 8:52 am
Why don’t heated swimming pool owners pump carbon dioxide into their pools? They will need less heating then.
————————
My guess is it would take a whole lot of CO2 to keep it warm in the winter.
/sarc

Deliberately Anonymous
April 23, 2012 1:02 pm

At the risk of not preserving anonymity, I know the company that makes the sensors on the ARGO floats – competitors of ours….
It is perfectly reasonable that they claim an accuracy of better than ±0.002°C for any individual measurement. In fact this particular company has an excellent reputation and can achieve even better than that. In order to achieve this claimed accuracy, they do pretty much what we do – basically you calibrate your sensors in a highly controlled environment, against highly accurate standards. Amongst other things, this means regular checking of your reference sensors against Triple Point of Water cells, Melting Point of Gallium cells, etc. It also means your calibration laboratory is temperature controlled (much like Gary Swift’s proving oven, above), and uses well stirred, insulated and very stable liquid baths. All of which is as you might expect.
Being in possession of several such insulated and stable liquid baths, I can tell you that the pinnacle of stability that we can achieve in a highly controlled calibration environment is to be able to hold ~1 m³ of water to ±0.001°C for about a minute or so* (what I mean by that is that taking point measurements at say 1Hz intervals in an aggressively stirred bath, the peak to peak noise over a minute will be less than ±0.001°C). Over a day it may fluctuate by up to ±0.05°C, or perhaps more if we are not careful. We can hold smaller volumes (1 or 2 litres) to perhaps ±0.0005°C for 2 or 3 minutes on a good day*. Not having been given a guided tour of our competitor’s facility I am straying into guesswork, but I would imagine that this is similar to their own experience.
My eventual point is that I agree with you, Willis. We can and do try our hardest to measure to this level of accuracy for a single measurement, but to suggest that an entire dataset can be reduced to a single value with this same level of accuracy just doesn’t add up. It fails the smell test. We go to a hell of a lot of effort to be able to measure temperature to this precision and accuracy in a controlled environment. The sea is not a controlled environment – I see data all the time where people take a temperature profile at a site, and then take another a few minutes later, and the results have changed by fractions of a degree (if they are lucky – sometimes it’s completely different). Even within a single profile, at “stable” conditions, the temperature data will vary within a metre or two by more than a couple of millidegrees.
I put my hands up – I am not a professional statistician, but am obviously familiar with most simple principles such as averaging to reduce white noise, statistical significance, distributions, SD, etc., etc.; I am however a scientist of a certain stripe by (first degree) training, and now a bit-part engineer, production engineer, finance guy, salesman and HR “expert”, as most entrepreneurs end up being, with an added dose of expertise in one’s chosen field, which in my case happens to be oceanography. The oceans are a highly dynamic environment, not a controlled one. I could be persuaded that minute by minute readings of a single location over a year could yield an average for that exact location to this order of millidegree accuracy. Would that average mean much? Not in isolation, no. I just don’t find it conceivable that the sporadic nature of ARGO measurements (and highly sporadic methods beforehand) can yield that level of accuracy for the whole globe. A finger in the air (or dipped in the sea)? I would suggest that considering their paucity of data, combined with the variability of the environment they are operating in, it would not be unreasonable to think that the globalised average number (over the top bit of the ocean only) was correct to within ±2°C, whatever that single distilled number is actually worth. But what do I know? Two things:
1) When the guys who make your test equipment** doubt your conclusions, there is something seriously wrong.
2) There should be many many more sensors in the sea 😉
(*) As measured against many tens of thousands of dollars’ worth of measurement bridges and PRTs.
(**) Just to reiterate – my company doesn’t make ARGO sensors, just instruments with similar sensors.

Curiousgeorge
April 23, 2012 1:32 pm

Just a comment on standardization in general: There are 3 basic types of standardization. In Europe and some other countries, Standards carry the weight of law and are enforceable as such. In the US we have a voluntary Standards system, which is usually contractually enforced between companies and their customers, but only carries civil liability in most cases. In Asia, Standards are usually company based – or in the case of Japan within the keiretsu – a large company will have their own internal standards that are imposed on suppliers. Very little government involvement.
These different systems often result in trade problems with import/export, as we see from time to time, with regard to products and commodities.
A bit of trivia – the ancient Chinese dynasties were among the first to implement quality and process control standards. A bow and arrow from one manufacturer would be identical to that from another, or somebody would lose their head. Same goes for many other ancient cultures. The Pyramids in Egypt could not have been built without formal standards (of measurement, process, etc.) that were rigidly enforced.

Jim G
April 23, 2012 2:01 pm

Willis,
“That’s not true, because the number of observations per buoy drops out of the equation. If you get accuracy Z from X buoys each taking Y observations per year, if you have ten times the buoys you get an extra decimal of precision regardless of whether Y is 10 or 1000 observations per year (as long as Y is constant).”
w.
I believe you mean as long as X is constant?
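A minimal simulation of the claim being quoted, with made-up error magnitudes: once each buoy carries a persistent error of its own (location or calibration), piling on more observations per buoy stops helping, and only the number of buoys matters.

```python
import numpy as np

rng = np.random.default_rng(1)

def spread_of_grand_mean(n_buoys, n_obs, sigma_buoy=1.0, sigma_obs=1.0, trials=5000):
    """Empirical spread (standard deviation) of the all-buoy mean.

    Toy model with invented magnitudes: each buoy has a persistent offset
    (location/calibration error, sd = sigma_buoy) plus independent per-observation
    noise (sd = sigma_obs). A buoy's own average over n_obs readings is then
    Normal(offset, sigma_obs / sqrt(n_obs)), which we sample directly.
    """
    buoy_means = (rng.normal(0.0, sigma_buoy, (trials, n_buoys))
                  + rng.normal(0.0, sigma_obs / np.sqrt(n_obs), (trials, n_buoys)))
    return buoy_means.mean(axis=1).std()

for x, y in [(100, 10), (100, 1000), (1000, 10), (1000, 1000)]:
    print(f"X={x:4d} buoys, Y={y:4d} obs/buoy -> spread of mean {spread_of_grand_mean(x, y):.4f}")
# Going from Y=10 to Y=1000 barely moves the spread; going from X=100 to X=1000
# cuts it by about sqrt(10). The observations-per-buoy term drops out once the
# persistent per-buoy error dominates.
```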

Eric in CO
April 23, 2012 2:14 pm

Forget the ocean data. I don’t buy that we have 1/10th of enough thermometers to measure the air temperature. When the temperature varies by as much as 10F on my way to work (15 miles) outside Denver and the closest NOAA station is in CO Springs, there is not enough data to plot global temperature. And don’t get me started on all the corrections they make.

Chuck Wiese
April 23, 2012 2:22 pm

There has been a lot of discussion about statistical inference of the claimed accuracy of the measurement of ocean heat, reverse engineered from temperature change. I side with Willis on his take of things here. He is doing an excellent job.
However, one important thing that needs to be stated here is that the measurements (regardless of the claim of accuracy) do not demonstrate that there has been any change in the Greenhouse factor G, which is the difference between the surface blackbody emission and the spectrally integrated outgoing longwave radiation, or G = Sg – OLR. That is what really counts, and measuring the ocean heat content does not identify what specific wavelengths of energy contributed to warming, regardless of the accuracy of the amount. There is no way to separate the contributing wavelengths, which vary anywhere from the solar ultraviolet to the far end of the infrared spectrum, so in this regard, the claim that a rising heat content is proof that greenhouse gases were the cause is patently ridiculous.
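For readers unfamiliar with the term, here is the greenhouse factor G = Sg − OLR worked out with round textbook numbers (mine, not values from this thread).

```python
# Greenhouse factor G = Sg - OLR with round textbook numbers (not from this thread).

SIGMA = 5.670e-8            # W/(m^2 K^4), Stefan-Boltzmann constant
T_SURFACE = 288.0           # K, canonical global-mean surface temperature

Sg = SIGMA * T_SURFACE**4   # surface blackbody emission, ~390 W/m^2
OLR = 239.0                 # W/m^2, typical satellite-era outgoing longwave radiation
G = Sg - OLR

print(f"Sg = {Sg:.0f} W/m^2, OLR = {OLR:.0f} W/m^2, G = {G:.0f} W/m^2")
# Ocean heat content by itself cannot tell you whether G has changed, which is
# the point being made above.
```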

Nisse
April 23, 2012 2:28 pm

Don’t know about the rest of you guys, but to me that’s an example of a climbing curve.

Alan S. Blue
April 23, 2012 2:32 pm

peterhdn,
A regular thermometer is fundamentally a point-source measurement.
That is: The actual instrument is only competent at measuring the temperature of the material at ‘one point’ – the tip of the thermometer.
None of the satellite methods are point-source measurements. They have different issues – but they’re -fundamentally- integrating in their operation.

P. Solar
April 23, 2012 2:36 pm

Bob Tisdale says:
April 23, 2012 at 7:41 am
>>
Willis: A free draft of the Levitus et al (2012) paper is available through the NODC website:
http://data.nodc.noaa.gov/woa/PUBLICATIONS/grlheat12.pdf
>>
Good find Bob, that copy includes the all important supplementary information section that covers the uncertainty calculations.

Dr Burns
April 23, 2012 2:47 pm

Is there any evidence of 1955 marine thermometer temperature measurements having a measurement accuracy better than +/- 0.5 degrees?

Curiousgeorge
April 23, 2012 2:57 pm

Dr Burns says:
April 23, 2012 at 2:47 pm
Is there any evidence of 1955 marine thermometer temperature measurements having a measurement accuracy better than +/- 0.5 degrees?
**********************************************************************************************
For things metrological, I’d suggest you contact NIST – http://www.nist.gov/index.html . They are very helpful folks if you email the right person.

Eyes Wide Open
April 23, 2012 3:10 pm

reclaimreality says:
“Thermal heat expansion of ocean water is ~independent of water temperature, and thus changes in heat content and total volume match each other well.”
Huh? Ever looked up the coefficient of expansion for water? It varies strongly with temperature and drops to zero as the temperature falls to about 4 degrees C. So that presents a challenge (and a benefit) in that heat entering the cold, deep layers of the ocean has little impact on sea level!
http://www.newton.dep.anl.gov/askasci/chem03/chem03335.htm
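A quick numerical illustration of how strongly that coefficient varies, using standard handbook densities for fresh water (my own sketch; seawater behaves somewhat differently, with its density maximum below the freezing point):

```python
# Thermal expansion coefficient of fresh water, estimated from standard handbook
# densities. Interval averages only; seawater values differ somewhat.

densities = {0: 999.8395, 4: 999.9720, 10: 999.7026, 20: 998.2071, 30: 995.6502}  # kg/m^3

temps = sorted(densities)
for t1, t2 in zip(temps, temps[1:]):
    rho1, rho2 = densities[t1], densities[t2]
    alpha = -(rho2 - rho1) / (t2 - t1) / (0.5 * (rho1 + rho2))   # 1/K, mean over interval
    print(f"{t1:2d}-{t2:2d} C: alpha ~ {alpha: .2e} /K")
# Negative below ~4 C (warming makes fresh water denser), rising past 2e-4 /K by
# 20-30 C: well over an order of magnitude of variation across ocean temperatures.
```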

JK
April 23, 2012 3:28 pm

Willis writes:
“Square root scaling is not just “back of the envelope”, it is built into the mathematics of how we calculate errors. You depend on it and call it “basic statistics” in your example above. … the number of observations per buoy drops out of the equation.”
I don’t think it’s quite as simple as that. Possibly what you say is true for large numbers of observations well enough spaced in space and time.
Consider your PPS problem. As has been said, however many measurements you take at one point – even millions – will not give you a good average for the pool temperature.
However, if you are allowed to take the same number of measurements at random positions then your measurements will eventually converge on the average (presuming that the average is constant in time).
I think those are different statistical problems, which show that it is not simply a question of the number of measurements.
For an analogous situation with the planet suppose that you were allowed to take as many million observations as you like, but were restricted to one hemisphere.
Or suppose that you could take as many observations as you like but they were restricted to a few time points across 5 years.
Coverage matters, which is why it’s hard to do the job with three buoys and many measurements (also you may land up taking so many measurements from each buoy that correlation between measurements starts to matter much more, so after a certain point they are not really independent.)
I guess that there are contributions to the error which are large for small numbers of buoys, reflecting the problem of limited coverage, but go to zero for better coverage. With large numbers and good coverage error will be more square-root like, but there may be other complications, too.
It is easy by ‘intuition’ to say that 3 or 30 buoys is too few, and that millions would be adequate. That’s like the case of the 28-foot man. I don’t think it’s a coincidence that the number of buoys we have is in the middle of this range. Most likely it was designed at minimum cost to get a worthwhile answer. That means, if the designers have done their job, that the error will likely be on the edge between obviously excellent and obviously terrible.
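A toy simulation of JK’s coverage point (field and noise values invented for illustration): the same number of measurements gives a precise but biased answer when taken at one fixed spot, and converges on the true mean when the positions are spread at random.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy "pool": a 100 x 100 grid with a smooth gradient plus some patchiness.
x, y = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))
field = 20.0 + 1.5 * x + 0.5 * np.sin(6 * np.pi * y)    # deg C
true_mean = field.mean()

n = 100_000
noise = rng.normal(0.0, 0.05, n)                        # instrument noise, sd 0.05 C

fixed = field[10, 10] + noise                           # every reading at one spot
i, j = rng.integers(0, 100, n), rng.integers(0, 100, n)
roving = field[i, j] + noise                            # same count, random positions

print(f"true mean        {true_mean:.4f}")
print(f"fixed-point mean {fixed.mean():.4f}  (precise, but biased by the site)")
print(f"random-site mean {roving.mean():.4f}  (converges on the true mean)")
```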

Deliberately Anonymous
April 23, 2012 4:08 pm

JK says:
“…Most likely it was designed at minimum cost to get a worthwhile answer….”
###########
More like……
We want to do this big temperature measurement thing…..
How much?
$100m!!!!!
OK – put a tender out….
Suppliers say: ” we can get $20k each for these things….”
Right – $20k each plus $10k deployment costs, less some cream off the top for administration 😉 ….
We can afford 3000 of them.. That’s how many we’ll have.
And that is how it is done………….

P. Solar
April 23, 2012 4:08 pm

I have just scanned the version that Bob linked above, the one that has the error calculations. This details the error propagation involved in their gridding method based on the variance of the readings. Unless I have misread it, it makes NO reference to the measurement uncertainty, i.e. it assumes (implicitly) that the measurement uncertainty is zero.
JK: However, to get an idea of the power of averaging suppose that there are 3000 argo floats, each reporting once every 10 days for 5 years. This will give 547,500 measurements. Each of those measurements will have an error associated with it. But to the extent that the error is random these will cancel out. From basic statistics we can expect that the error on the mean will be down by a factor of about square root of 547,500 which is about 740. That means that if there is a random error in each individual float of about 1.5 degrees the error on the 5 year global mean will be about 0.002 degrees.
Willis:
>>
Thanks, JK. You are correct in general. There are, as always, two questions. One is whether their assumption of random errors is correct. The other is whether they have done all parts of the statistics correctly. I am engaged in some fascinating research on that question right now, more to follow.
>>
No. This is the fundamental fallacy behind the ridiculously low ARGO uncertainty. You do not have 547,500 measurements of the same thing, which is what would allow the ‘divide by square root’ trick. You have 547,500 separate measurements of separate temperatures, each with the full uncertainty. An uncertainty that is far larger than the simple platinum sensor precision.
Some part of that uncertainty, the instrument calibration uncertainty, will be reduced in this way over the thousands of individual floats. In any case it will probably be negligible compared to other errors.
However, that idea cannot be applied to the averaging of the temperatures, which introduces a whole realm of other uncertainties. Principally, how representative is one reading at one point and a depth of the average temperature of the HUGE volume of water that it is being taken to represent. The uncertainty here is not millikelvin, it’s probably several degrees. That is several orders of magnitude bigger than some silly claims based on the total number of individual measurements, as JK suggested above.
This representative accuracy of the one temperature then has to be added to the variance of the individual readings used to calculate the propagation errors in the paper.
Just how that will affect the overall uncertainty needs digging into but it will make a significant difference.
The problem here is that this has been totally ignored, without the slightest discussion or recognition in the paper. Once again, climate scientists seem quite incapable (or unwilling) to correctly assess the uncertainty in their work.
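One way to frame the disagreement numerically, with assumed magnitudes on both sides: the square-root-of-N reduction only applies to whatever error really is independent between profiles, and the representativeness spread dwarfs the sensor precision even before asking whether it cancels.

```python
import math

# Assumed magnitudes: per-reading sensor accuracy of 0.002 C (the class of figure
# quoted for the ARGO sensors earlier in the thread) versus a representativeness
# spread of 2 C (P. Solar's "probably several degrees") for how well one profile
# stands in for its grid box.

n_profiles = 547_500                 # JK's 3000 floats x 1 profile per 10 days x 5 years
sigma_sensor = 0.002                 # deg C
sigma_representativeness = 2.0       # deg C

print(f"sensor term:             {sigma_sensor / math.sqrt(n_profiles):.1e} deg C")
print(f"representativeness term: {sigma_representativeness / math.sqrt(n_profiles):.1e} deg C")
# Even granting full sqrt-N cancellation, the representativeness term is ~1000x the
# sensor term, and it only cancels to the extent those deviations really are
# independent and unbiased, which is exactly what P. Solar and JK are debating.
```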

Justthinkin
April 23, 2012 4:26 pm

“Rich Lambert says:
April 23, 2012 at 7:44 am
Climate science needs something like the ISO 9000 quality control standards used by industry.”
Not a bloody chance! These clowns are already spewing out enough useless paperwork (and killing millions of innocent trees). All ISO 9000 and others do is produce jobs for writers and guys in companies like KPMG. As Gail pointed out, too much writing and not enough actual study. What climate science needs is a professional legal body like engineers, geologists, lawyers, etc. have.
As a quality assurance manager who was forced to introduce one of the ludicrous ISO programs, let me assure you that after a cost analysis of said implementation and the value added, it was scrapped pretty quickly.

JK
April 23, 2012 4:49 pm

P. Solar writes:
“Principally, how representative is one reading at one point and a depth of the average temperature of the HUGE volume of water that it is being taken to represent. The uncertainty here is not millikelvin, it’s probably several degrees.”
I agree that this is the main problem.
The question is can this error be reduced by averaging across many measurements?
(The real calculation should use heat content, although we can get a rough idea of what this means with an “equivalent temperature” calculation. Just to try to understand principles we can think a bit about temperatures.)
The huge body of water that each float represents does in fact have a real average temperature. Each float will measure a temperature that deviates from this true average. That deviation is the error.
Some floats will find themselves warmer than the average for their region. Others will find themselves cooler than the average for their region. The error will be a random variable. For example, it may be approximately normally distributed, with a standard deviation of several Kelvin.
In this simple model the errors on the floats are uncorrelated: whether one float finds itself at a higher or lower temperature than the average for its region is independent of whether its neighbours are higher or lower than the average for their regions. Note, in this model it is the errors that are uncorrelated, not the averages. If a float is in a high-average region then its neighbour is more likely to be in a high-average region. That’s a different question.
In that case then averaging, adding up these uncorrelated error terms, will indeed reduce their expected sum by a square root factor.
If you disagree with that, please explain where the disagreement is. Otherwise we can move on to clarify what changes in more realistic situations.

April 23, 2012 5:09 pm

When I ride in an elevator, there is always an inspection sticker signed by the inspector.
When I stop at the gas pump, there are certification stickers certifying accuracy, who checked and when.
When I use the grocer’s scales, there are certification stickers certifying accuracy, who checked and when.
What I wonder is who has the responsibility for certifying the accuracy of Argo floats. I looked and was unable to find the specification for the thermometer component. I did find that Argo thinks of the floats as drop and run till dead.
I have a very hard time understanding that whatever temperature sensor they are using (thermocouple?) is consistently accurate for years without adjustment. About that adjustment: I did find this

“…It is important to understand the basic variable naming structure within the profile files. There are two versions of temperature, salinity and pressure in each file: a “real time” version (TEMP, PSAL, PRES) and a “delayed mode” version (TEMP_ADJUSTED, PSAL_ADJUSTED, PRES_ADJUSTED). It is recommended that users work with the delayed mode, or adjusted version, if it is filled.
In the “D” files, the adjusted variables are filled with the data that has been corrected after examination by the oceanographic experts, making it the highest quality data for that profile…”

at http://www.argo.ucsd.edu/Argo_date_guide.html#realtime.
Hmmm, adjustments make for more accurate files… Yeah, sure.
I also located this about relative accuracy:

“…Uncertainties less than 0.5°C are shaded in Fig. 5 to represent the achievable accuracy for upper layer temperature estimation. This is equivalent to an accuracy in bimonthly heat content changes of 15 W/m² for a 50 m thick layer. At that level of accuracy, errors in seasonal changes in heat content are comparable to the errors sought in air-sea heat exchange estimates. It should be noted that the temperature and heat storage errors can be reduced by temporal or spatial averaging or through combination of in situ data with altimetric data (see next section). Of the three terms in the oceanic heat balance – storage, air-sea flux, and ocean heat transport – the storage term is potentially the most accurate because it is not subject to large systematic errors. Large areas exist in the Pacific where the desired accuracy is not available from the present XBT network (Fig. 5). The 3° by 3° array attains 0.5°C accuracy over most of the domain….”

from http://www.argo.ucsd.edu/argo-design.pdf.
So, we do not know the specifications for the thermometer(s) used in Argo, their life cycle, or their relative accuracy. Nor do we know their dependability.
From what I read of the link Leif provided, I did not see where the authors zeroed their temperatures. That is, did the authors use any other study to determine the relative accuracy of temperatures between the different versions of Argo floats, or how to aggregate their temperatures without aggravating gross errors?
Yeah, I know the “D” files are adjusted and considered accurate… Sure, and is there also a certification of accuracy that allows one to calculate heat gain to an accuracy of 0.01%?

Jaye Bass
April 23, 2012 5:32 pm

While we are at it, I would love to see all the unit tests, for all the functions, for all the climate models that the US government has paid for. Surely they do that, right?

April 23, 2012 5:49 pm

Thanks Willis
Between you, Gary Swift, Curiousgeorge and others (engineers aware of the impossibility of such precision), Tallbloke and Nic Lewis (James Annan’s evidence of mistakes in Levitus being stonewalled by Levitus and Science mag), I think we have sent the whole wretched shebang to Davy Jones, scuttled faster than any official peer review could salvage it.

stevefitzpatrick
April 23, 2012 7:06 pm

Willis,
The change in ocean heat content is dominated by changes in the top 300 meters or so. Yes, there is modest warming below 300 meters (how could there not be?), but most of the accumulation of heat is in fact relatively close to the surface. The increase in temperature in the top 300 meters is not the tiny number that the “average of 2000 meters” indicates. There is of course uncertainty in the measured ocean heat content (Argo is by no means perfect!), but the continuing trend over time suggests a gradual accumulation of heat.
Which is not at all surprising when you consider that the ocean surface temperature has increased modestly (about 0.6 – 0.7C) over the past 100 or so years. I have a sincere question for you: do you really doubt that a rising ocean surface temperature would be expected to lead to a rising ocean heat content? I can understand arguing about uncertainty in the exact values; I can’t understand arguing that a significant increase in ocean heat content is not expected based on the historical ocean surface temperature trend.

theofloinn
April 23, 2012 8:18 pm

ISO 9000 is like any other tool, capable of being used in foolish ways. I’ve been to companies that “have” SPC and proudly showed me walls wallpapered with computer printouts of control charts. But when I asked about various footprints I saw in the charts, I got blank looks on the bounceback. I had to tell them that SPC was not something you “have”, it’s something you “do.”
The same can be said for ISO 9000. It is certainly possible to turn it into a bureaucratic mess, especially if you approach it as “another paperwork requirement” rather than as an opportunity to study your processes systematically for inefficiencies, optimize those processes, then standardize and control to the optimum. The requirements of the standard are not especially onerous and can be found back in the Grand-daddy of em all, good ol’ MIL-Q-9858A.
+ + +
Precision, whether by the SQRT(n) or no, is a different critter from accuracy. It’s possible to get a very precise estimate of a totally inaccurate figure. Which is why “error bars” are not the whole story. There are other sorts of errors. My old boss, Ed Schrock, one of the founders of ASQC, liked to talk about “Errors of the Third Kind,” which was simply getting the wrong kind of data to begin with.
Besides, no matter how many men and women (n) you may measure, the tight precision on the average number of testicles will be utterly without physical meaning.
+ + +
It is always interesting to see confidence intervals on a parameter presented as if they were prediction intervals on the actual measurements. The former are always narrower than the latter; especially if you start with five-year rolling averages as your “data.”