Can We Tell If The Oceans Are Warming?

Guest Post by Willis Eschenbach

Well, I was going to write about hourly albedo changes, honest I was, but as is often the case I got sidetracked. My great thanks to Joanne Nova for highlighting a mostly unknown paper on the error estimate for the Argo dataset entitled "On the accuracy of North Atlantic temperature and heat storage fields from Argo" by R. E. Hadfield et al., hereinafter Hadfield2007. As a bit of history, three years ago in a post entitled “Decimals of Precision” I pointed out inconsistencies in the prevailing Argo error estimates. My calculations in that post showed that their claims of accuracy were way overblown.

The claims of precision at the time, which are unchanged today, can be seen in Figure 1 below, which reproduces Fig. 1(a) from the paper "Observed changes in top-of-the-atmosphere radiation and upper-ocean heating consistent within uncertainty", Norman G. Loeb et al., paywalled here, hereinafter Loeb2012.

 


Figure 1. This shows Fig. 1(a) from Loeb2012. ORIGINAL CAPTION: a, Annual global averaged upper-ocean warming rates computed from first differences of the Pacific Marine Environmental Laboratory/Jet Propulsion Laboratory/Joint Institute for Marine and Atmospheric Research (PMEL/JPL/JIMAR), NODC, and Hadley, 0–700m

I must apologize for the quality of the graphics, but sadly the document is paywalled. It’s OK, I just wanted to see their error estimates.

As you can see, Loeb2012 is showing the oceanic heating rates in watts per square metre applied over each year. All three groups report about the same size of error. The error in the earliest data is about 1 W/m2. However, the size of the error starts decreasing once the Argo buoys started coming on line in 2006. At the end of their record all three groups are showing errors well under half a watt per square metre.

 

Figure 2. This shows Fig. 3(a) from Loeb2012. Black shows the available heat for storage as shown by the CERES satellite data. Blue shows heating rates to 1800 metres, and red shows heating rates to 700 metres. ORIGINAL CAPTION: a, Global annual average (July to June) net TOA flux from CERES observations (based on the EBAF-TOA_Ed2.6 product) and 0–700 and 0–1,800m ocean heating rates from PMEL/JPL/JIMAR

Here we see that at the end of their dataset the error for the 1800 metre deep layer was also under half a watt per square metre.

But how much temperature change does that half-watt per square metre error represent? My rule of thumb is simple.

One watt per square metre for one year warms one cubic metre of the ocean by 8°C

(Yeah, it’s actually 8.15°C, but I do lots of general calcs, so a couple of percent error is OK for ease of calculation and memory). That means a half watt for a year is 4°C per cubic metre.

So … for an 1800 metre deep layer of water, Loeb2012 is saying the standard error of their temperature measurements is 4°C / 1800 = about two thousandths of a degree C (0.002°C). For the shallower 700 metre layer, since the forcing error is the same but the mass is smaller, the same error in W/m2 gives a larger temperature error of 4°C / 700, which equals a whopping temperature error of six thousandths of a degree C (0.006°C).
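
Here is a minimal R sketch of that arithmetic. The seawater density and specific heat are round values I've assumed for illustration, so the conversion comes out near 7.7°C per metre rather than exactly 8.15°C, but the layer errors are the same to the precision quoted above.

# rule of thumb: warming from 1 W/m2 applied for one year to a 1 m deep column of seawater
rho <- 1025                       # kg/m3, assumed seawater density
cp <- 3985                        # J/kg/K, assumed seawater specific heat
secsyear <- 365.25*24*3600        # seconds in a year
degperwatt <- secsyear/(rho*cp)   # ~7.7 degC per W/m2 per year, per metre of depth

0.5*degperwatt/1800               # 0-1800 m layer: ~0.002 degC for a 0.5 W/m2 error
0.5*degperwatt/700                # 0-700 m layer:  ~0.006 degC for a 0.5 W/m2 error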

I said at that time that this claimed accuracy, somewhere around five thousandths of a degree (0.005°C), was … well … highly unlikely.

Jo Nova points out that, curiously, the paper was written in 2007 but got little traction at the time or since. I certainly hadn’t read it when I wrote my post cited above. The following paragraphs from their study are of interest:

 

ABSTRACT:

Using OCCAM subsampled to typical Argo sampling density, it is found that outside of the western boundary, the mixed layer monthly heat storage in the subtropical North Atlantic has a sampling error of 10–20 Wm−2 when averaged over a 10° x 10° area. This error reduces to less than 10 Wm−2 when seasonal heat storage is considered. Errors of this magnitude suggest that the Argo dataset is of use for investigating variability in mixed layer heat storage on interannual timescales. However, the expected sampling error increases to more than 50 Wm−2 in the Gulf Stream region and north of 40°N, limiting the use of Argo in these areas.

and

Our analysis of subsampled temperature fields from the OCCAM model has shown that in the subtropical North Atlantic, the Argo project provides temperature data at a spatial and temporal resolution that results in a sampling uncertainty in mixed layer heat storage of order 10–20 Wm−2. The error gets smaller as the period considered increases and at seasonal [annual] timescales is reduced to 7 ± 1.5 Wm−2. Within the Gulf Stream and subpolar regions, the sampling errors are much larger and thus the Argo dataset will be less useful in these regions for investigating variability in the mixed layer heat storage.

Once again I wanted to convert their units of W/m2 to a temperature change. The problem I have with the units many of these papers use is that “7 ± 1.5 Wm−2” just doesn’t mean much to me. In addition, the Argo buoys are not measuring W/m2, they’re measuring temperatures and converting them to W/m2. So my question upon reading the paper was, how much will their cited error of “7 W/m2” for one year change the temperature of the “mixed layer” of the North Atlantic? And what is the mixed layer anyhow?

Well, they’ve picked a kind of curious thing to measure. The “mixed layer” is the top layer of the ocean that is mixed by both the wind and by the nightly overturning of the ocean. It is of interest in a climate sense because it’s the part of the ocean that responds to the changing temperatures above. It can be defined numerically in a number of ways. Basically, it’s the layer from the surface down to the “thermocline”, the point where the ocean starts cooling rapidly with depth. Jayne Doucette of the Woods Hole Oceanographic Institution has made a lovely drawing of most of the things that go on in the mixed layer. [For unknown reasons she’s omitted one of the most important circulations, the nightly overturning of the upper ocean.]

 

Figure 3. The mixed layer, showing various physical and biological processes occurring in the layer.

According to the paper, the definition that they have chosen is that the mixed layer is the depth at which the ocean is 0.2°C cooler than the temperature at ten metres depth. OK, no problem, that’s one of the standard definitions … but how deep is the mixed layer?

Well, the problem is that the mixed layer depth varies by both location and time of year. Figure 4 shows typical variations in the depth of the mixed layer at a single location by month.

 

Figure 4. Typical variations of the depth of the mixed layer by month. Sorry, no provenance for the graph other than Wiki. Given the temperatures I’m guessing North Atlantic. In any case, it is entirely representative of the species.

You can see how the temperature is almost the same all the way down to the thermocline, and then starts dropping rapidly.
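
To make that 0.2°C definition concrete, here's a toy R sketch; the depth and temperature profile is invented for illustration, not Argo data.

# toy profile: depths in metres, temperatures in degC (invented numbers)
depth <- c(0, 10, 20, 30, 40, 50, 60, 70, 80, 100, 150, 200)
temp <- c(18.1, 18.0, 18.0, 17.95, 17.9, 17.85, 17.7, 17.3, 16.6, 15.2, 12.8, 11.0)

t10 <- temp[depth == 10]                # temperature at ten metres depth
mld <- min(depth[temp < t10 - 0.2])     # shallowest depth that is 0.2 degC cooler than at 10 m
mld                                     # mixed layer depth for this toy profile, 60 metres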

However, I couldn’t find any number for the average mixed layer depth anywhere. So instead, I downloaded the 2°x2° mixed layer depth monthly climatology dataset entitled “mld_DT02_c1m_reg2.0_Global.nc” from here and took the area-weighted average of the mixed layer depth. It turns out that globally the mixed layer depth averages just under sixty metres. The whole process for doing the calculations including writing the code took about half an hour … I’ve appended the code for those interested.

Then I went on to resample their 2°x2° dataset to a 1°x1° grid, which of course gave me the same answer for the average, but it allowed me to use my usual graphics routines to display the depths.
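
The resampling itself is nothing fancy; each 2° cell is simply replicated into a 2 x 2 block of 1° cells. A minimal sketch, assuming "mldmap" is the 90 x 180 matrix of gridcell averages produced by the code appended at the end of this post:

# replicate each 2-degree gridcell into four 1-degree gridcells (90 x 180 -> 180 x 360)
mld1deg <- mldmap[rep(1:90, each=2), rep(1:180, each=2)]
dim(mld1deg)   # 180 360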

 

Figure 5. Average mixed layer depth around the globe. Green and blue areas show deeper mixed layers.

I do love climate science because I never know what I’ll have to learn in order to do my research. This time I’ve gotten to explore the depth of the mixed layer. As you might imagine, in the stormiest areas the largest waves mix the ocean to the greatest depths, which are shown in green and blue. You can also see the mark of the El Nino/La Nina along the Equator off the coast of Ecuador. There, the trade winds blow the warm surface waters to the west, and leave the thermocline closer to the surface. So much to learn … but I digress. I could see that there were a number of shallow areas in the North Atlantic, which was the area used for the Argo study. So I calculated the average mixed layer depth for the North Atlantic (5°N–65°N, 0°–90°W). This turns out to be 53 metres, about seven metres shallower than the global average.

Now, recalling the rule of thumb:

One watt per square metre for one year raises one cubic metre of seawater about eight degrees.

Using the rule of thumb with a depth of 53 metres, one W/m2 over one year warms the 53-metre-deep mixed layer by about 8/53 ≈ 0.15°C. However, they estimate the annual error at seven W/m2 (see their quote above). This means that Hadfield2007 are saying the Argo floats can only determine the average annual temperature of the North Atlantic mixed layer to within plus or minus 1°C …
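
As a quick check in R, using the rule-of-thumb figure of 8°C:

mldepth <- 53        # average North Atlantic mixed layer depth, metres
8/mldepth            # ~0.15 degC per W/m2 per year for a 53 m deep layer
7*8/mldepth          # ~1.1 degC for the 7 W/m2 annual sampling error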

Now, to me that seems reasonable. It is very, very hard to accurately measure the average temperature of a wildly discontinuous body of water like oh, I don’t know, say the North Atlantic. Or any other ocean.

So far, so good. Now comes the tough part. We know that Argo can measure the temperature of the North Atlantic mixed layer with an error of ±1°C. Then the question becomes … if we could measure the whole ocean with the same density of measurements as the Argo North Atlantic, what would the error of the final average be?

The answer to this rests on a curious fact—assuming that the errors are symmetrical, the error of the average of a series of measurements, each of which has its own inherent error, is smaller than the average of the individual errors. If the errors are all equal to some value E, then when we average N items, each with error E, the error of the average scales as

sqrt(N)/N (which is just 1/sqrt(N))

So for example if you are averaging one hundred items each with an error of E, your error is a tenth of E [ sqrt(100)/100 ].

If the N errors are not all equal, on the other hand, then what scales by sqrt(N)/N is not the error E but

sqrt(E^2 + SD^2)

where E is the mean of the errors and SD is the standard deviation of the errors.

Now, let’s assume for the moment that the global ocean is measured at the same measurement density as the North Atlantic in the study. It’s not, but let’s set that aside. For the 700 metre deep layer, we need to know how much larger its volume is than the volume of the NA mixed layer. It turns out that the global ocean down to 700 metres is 118 times the volume of the NA mixed layer.

Unfortunately, while we know the mean error (7 W/m2 = 1°C), we don’t know the standard deviation of those errors. However, they do say that there are many areas with larger errors. So if we assume a standard deviation of, say, 3.5 W/m2 = 0.5°C, we’d likely be conservative; it may well be larger.

Putting it all together: IF we can measure the North Atlantic mixed layer with a mean error of 1°C and an error SD of 0.5°C, then with the same measurement density we should be able to measure the global ocean to

sqrt(118)/118 * sqrt( 1^2 + 0.5^2 ) ≈ 0.1°C
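
A quick Monte Carlo in R shows the same scaling. It treats the 118 cells as equal-volume and independent, and the individual errors (mean 1°C, SD 0.5°C) are just the illustrative numbers used above, not anything from Hadfield2007.

set.seed(42)
ncells <- 118                                # global 0-700 m volume, in multiples of the NA mixed layer
sims <- 10000                                # number of simulated "global averages"

sdcell <- pmax(rnorm(ncells, 1, 0.5), 0.01)  # per-cell errors: mean 1 degC, SD 0.5 degC (assumed)
avgerr <- replicate(sims, mean(rnorm(ncells, 0, sdcell)))

sd(avgerr)                                   # simulated error of the global average, ~0.1 degC
sqrt(ncells)/ncells * sqrt(1^2 + 0.5^2)      # same thing from the formula above, ~0.1 degC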

Now, recall from above that Loeb2012 claimed an error of something like 0.005°C … which appears to be optimistic by a factor of about twenty.

And my guess is that underestimating the actual error by a factor of 20 is the best case. I say this because they’ve already pointed out that “the expected sampling error increases to more than 50 Wm−2 in the Gulf Stream region and north of 40°N”. So their estimate doesn’t even hold for all of the North Atlantic.

I also say it is a best case because it assumes that a) the errors are symmetrical, and b) all parts of the ocean are sampled as densely as the upper 53 metres of the North Atlantic. I doubt that either of those is true, which would make the uncertainty even larger.

In any case, I am glad that once again, mainstream science verifies the interesting work that is being done here at WUWT. If you wonder what it all means, look at Figure 1, and consider that in reality the error bars are twenty times larger … clearly, with those kinds of errors we can say nothing about whether the ocean might be warming, cooling, or standing still.

Best to all,

w.

PS: I’ve been a bit slow writing this because a teenage single mother and her four delinquent children seem to have moved in downstairs … and we don’t have a downstairs. Here they are:

CUSTOMARY REQUEST: If you disagree with someone, please quote the exact words you find problems with, so that all of us can understand your objection.

CODE: These days I mostly use the computer language “R” for all my work. I learned it a few years ago at the urging of Steve McIntyre, and it’s far and away the best of the dozen or so computer languages I’ve written code in. The code for getting the weighted average mixed layer depth is pretty simple, and it gives you an idea of the power of the language.

# load the ncdf library (the calls below use the older "ncdf" package;
# with the newer "ncdf4" package they would be nc_open / ncvar_get instead)
library(ncdf)

# specify URL and file name -----------------------------------------------
mldurl="http://www.ifremer.fr/cerweb/deboyer/data/mld_DT02_c1m_reg2.0.nc"
mldfile="Mixed Layer Depth DT02_c1m_reg2.0.nc"

# download file -----------------------------------------------------------
download.file(mldurl,mldfile)

# extract and clean up variable (90 rows latitude by 180 columns longitude by 12 months)
nc=open.ncdf(mldfile)
mld=aperm(get.var.ncdf(nc,"mld"),c(2,1,3)) # "aperm" changes from 180 rows x 90 cols to 90 x 180
mld[mld==1.000000e+09]=NA # replace missing values with NA

# create area weights ------------ (they use a strange unequal 2° grid with the last point at 89.5°N)
latline=seq(-88,90,2) # gridcell centre latitudes
latline[90]=89.5
latline=cos(latline*pi/180) # cosine-of-latitude area weights
latmatrix2=matrix(rep(latline,180),90,180)

# take array gridcell averages over the 12 months
mldmap=rowMeans(mld,dims = 2,na.rm = T)
dim(mldmap) # checking the dimensions of the result, 90 latitude x 180 longitude
# [1]  90 180

# take weighted mean of gridcells
weighted.mean(mldmap,latmatrix2,na.rm=T)
# [1] 59.28661

265 Comments
June 8, 2015 6:14 am

Richard, right on. An additional point is made in a description from Columbia U.
“When air in contact with the ocean is at a different temperature than the sea surface, heat transfer by conduction takes place. On average the ocean is about 1 or 2 degrees warmer than the atmosphere, so on average heat is transferred from ocean to atmosphere by conduction.
If the ocean were colder than the atmosphere (which of course happens) the air in contact with the ocean cools, becoming denser and hence more stable, more stratified. As such the conduction process does a poor job of carrying the atmosphere’s heat into the cool ocean.”
They calculate:
Solar heating of the ocean on a global average is 168 watts per square meter
Net LW radiation cools the ocean, on a global average by 66 watts per square meter.
On global average the oceanic heat loss by conduction is only 24 watts per square meter.
On global average the heat loss by evaporation is 78 watts per square meter.
https://rclutz.wordpress.com/2015/05/10/empirical-evidence-oceans-make-climate/

June 8, 2015 6:30 am

Lots of good comments above. However, ALL are still ignoring major factors that contribute to inaccurate measurement of the temperature of the ocean.
1. The stated accuracy is only good in laboratory conditions. Changes of ambient temperature will affect the reading.
2. The stated accuracy is for the Electronics – where is the accuracy for the RTD AND the accuracy for the loop (electronics and RTD and connecting conductors)?
3. All electronics operate differently at different temperatures and voltages. Where is the chart/table providing the degradation of accuracy in relation to the change in ambient temperature and operating voltage?
4. The buoys sit at [1000] feet for a long period of time, become acclimated to that temperature and then rise, taking the temperature at various elevations as it rises. What is the initial temperature of the electronics and the degraded accuracy for that temperature? What is the temperature of the electronics, and the degraded accuracy for that temperature, at each of the elevations where it takes another measurement?
5. The buoy has a pump that pumps water past the RTD where the temperature is taken. What is the TC for the flow of water from the suction to the RTD? What is the TC of the RTD? What is the TC of the medium protecting the RTD from the sea water? Where is all of the data for this?
6. Battery voltage and current capacity decrease with temperature. Since the buoy sat for a period of time at low temperature the electronics will be affected. The accuracy is affected by operating voltage. How is this factored into the reading? Does the buoy rise fast enough that it stays at this low voltage/temperature, affecting all readings, or will it warm up as it rises, causing different inaccuracies as it rises?
This device seems to be the most expensive piece of equipment to purposely ignore multiple important factors, to the point that the data it generates is, by design, garbage.

Reply to  usurbrain
June 8, 2015 6:32 am

“The buoys sit at 100 feet” S/B “The buoys sit at 1000 feet”

Crispin in Waterloo but really in Yogyakarta
Reply to  usurbrain
June 8, 2015 8:01 am

The electronics handle the voltage issues by operating the electronics at a voltage below the battery minimum.
The temperature is corrected by using a 4 wire RTD which automatically compensates for the changing resistance of the wires. It’s clever and simple. Most of what you mention are non-issues. There is a hysteresis on the temperature but the rise rate is known and can be automatically corrected by the onboard computer with a signal timing algorithm.
Your points about the electronics are spot on when it comes to interpreting the output from the RTD. It is a variable resistor, not a temperature reporter.
A gas chromatograph can do clever and precise and delicate things, but only once per 30 seconds. Same with an FTIR. I’ll bet an ARGO doesn’t make more than one reading per second.

Reply to  Crispin in Waterloo but really in Yogyakarta
June 8, 2015 8:42 am

Get on ARGO and/or Seabird and read what they are doing. They calibrate the electronics to a Standard RTD resistance source and rely upon the fact that the RTD follows the manufacturer’s standard resistance/temperature curve, NOT the actual RTD at a known temp. That is not how temperature instrumentation is calibrated for nuclear power plants, oil refining and other processes requiring accurate temperature. That works, more or less, under laboratory conditions with no problems, but how does it work with the electronics at 1000 meters at an ambient temp of 5–6 °C for several hours?

RACookPE1978
Editor
Reply to  usurbrain
June 8, 2015 9:10 am

usurbrain

That works, more or less, under laboratory conditions with no problems, but how does it work with the electronics at 1000 meters at an ambient temp of 5–6 °C for several hours?

Worse – each different buoy over its lifetime is irregularly coated by different layers of different marine biologics, scum and contaminants – NONE of which can be “calibrated out”, because each is different on each different buoy, as are the different times each buoy spends in each different sea temperature and sunlight condition.
The calibration (at the manufacturer) is in a single tank with clean (sterile!) water at absolutely known conditions. Thereafter? Every buoy will change uncontrollably over its lifetime, differently from every other buoy!
Now, to the point of the paper: Is a series of un-controlled ships randomly dropping buckets over the side in specific shipping lanes, using ??? uncalibrated, unknown thermometers recorded under unknown conditions, BETTER and MORE RELIABLE to 1/2 of one degree, such that you CHANGE the calibrated buoy temperatures back to what the ship buckets claim they had?

Reply to  Crispin in Waterloo but really in Yogyakarta
June 8, 2015 10:13 am

ARGO measurement rate (for reporting purposes) is 1 per second.

rgbatduke
June 8, 2015 9:30 am

I’ve never taken ARGO’s error estimates particularly seriously, any more than I take HadCRUT4’s estimates seriously. HadCRUT4, recall, has surface temperature anomaly error estimates in the mid-1800s only two times larger than they are today. I don’t think so. In fact, I think it is an absurd claim.
ARGO has the same general problems that the surface temperature record has, only much worse. For one thing, it is trying to measure an entire spatiotemporal profile in a volume, so they lose precision to the third dimension compared to the two dimensional (surface!) estimates of HadCRUT4. HadCRUT4 presumably at this point incorporates ARGO for sea surface temperatures (or should). Yet it only asserts a contemporary anomaly precision on the order of 0.15 C. Surely this is on the close order of the error estimate for the ocean in depth, as it is a boundary condition where the other boundary is a more or less fixed 4 C on the vast body of the ocean.
This absolutely matters because we are really looking at non-equilibrium solutions to the Navier-Stokes equation with variable driving on at least the upper surface and with a number of novel nonlinear terms — in particular the haline density component and the sea ice component and the wind evaporation component (which among other things couples it to a second planetary-scale Navier-Stokes problem — the atmosphere). The Thermohaline Circulation pattern of the ocean — the great conveyor belt — carries heat (really enthalpy) up and down the water column at the same time it moves it great transverse distances at the same time it drives turbulent mixing at the same time the atmosphere and rivers and ice melt are dropping in fresh water and evaporating off fresh water and binding it up in sea surface ice and heating it and cooling it so that the density and hence relative buoyancy varies as it flows around the irregular shapes of the continents, islands and ocean bottom on the rotating non-inertial reference frame surface of the spinning oblate spheroid that gravitationally binds it.
Just computing the relaxation times of the bulk ocean is a daunting process. If we “suddenly” increased the average temperature of the surface layer of the ocean by 1 C and held it there, how long would it take for this change to equilibrate throughout the bulk ocean? Most estimates I’ve read (which seem reasonable) suggest centuries to over a thousand years. The only reason one can pretend to know ocean bulk temperatures to some high precision is because the bulk of the ocean is within a degree of 4 C and there is a lot of ocean in that bulk. This knowledge is true but useless in talking about the variation in the heat content of the ocean because the uncertainty in the knowledge of the upper boundary condition is as noted order of 0.1-0.2 C. — if you believe HadCRUT4 and the kriging and assumptions used to make the error estimate this small.
The sad thing about ocean water is that it is a poor conductor, is stratified by density and depth, and is nearly stable in its stratification, so much so that the relaxation time below the thermocline is really, really long and involves tiny, tiny changes in temperature as heat transport (pretty much all modes) decreases with the temperature difference. This also makes assumptions about how the temperature varies both vertically and horizontally beg the question when it comes to evaluating probable bulk precision.
rgb

Kevin Kilty
June 8, 2015 9:34 am

Mike June 7, 2015 at 10:50 pm
re Willis Eschenbach June 7, 2015 at 12:59 pm
This map is interesting, but I think you’ll find the ITCZ is the zone, just above the equator, where we see *more* floats? Either side there is a bit less float density. This would suggest that there is some drift towards the ITCZ. There will be surface winds and wind-induced surface currents, as this is where there is most rising air. Air is drawn in on either side and thus deflected westwards by Coriolis forces, causing the warm westward currents either side of the ITCZ.
KK says : “the buoys, being buoyant, will tend to drift toward higher ocean surface”
No, a buoy is a massive object and will go to the lowest gravitational potential: a dip, like a ball on uneven ground.
The sea level is higher along the ITCZ but that is due to the same winds and wind-driven currents that seem to affect the ARGO distribution. You raise a valid and interesting point that had not occurred to me before; just the logic was wrong about the cause.
I’m sure Karl et al can make suitable correction to ARGO to create some more global warming using this information 😉

Mike:
This is more complex than you state. The buoys are massive objects, indeed, but so is a unit volume of water–both are buoyant in the sense that a pressure distribution on the mass maintains its vertical position. This is not so simple as stating that water and other massive things go downhill. We need to consider the lateral pressure gradients, or lateral forces, that are not likely to be in static equilibrium. Think of the warm-core rings that break off the gulf stream as an example.
There are several forces a person must consider in deciding what the buoys will do. Maintaining an anomalous height (slope) of the ocean surface requires a lateral pressure gradient. What is the origin of this? Let’s look at the floats from a Lagrangian standpoint–a coordinate system that moves with the float. The lateral pressure gradient derives from gradients in density (temperature or salinity), and because the coordinate system is non-inertial there is also potentially a coriolis force and centrifugal force (no arguments, please, about whether or not these are real forces…we are in a non-inertial system and they behave like real forces). The lateral temperature and salinity gradients are not effective at the surface to maintain a slope and so water would slowly flow downhill if not for the dynamic influences. So to maintain a slope at the surface requires coriolis or centrifugal forces that sum to an inward directed force, or a constant compensating flow of water from depth. That is, the water mass must rotate or there has to be an inflow at depth or both. If the floats drift at 1000m depth they can easily drift toward the ocean surface high. If they spend enough time at the surface, they might drift away from the surface high following the likely divergent water flow. What one needs is information about the secondary flows of water involved in maintaining ocean height anomalies. I see the potential for temperature bias in all this, and I cannot get any information about how people analyzed this problem and decided if it is a problem or not.
I think it is time to begin the arduous task of gathering drift data to see what it looks like statistically.
Kilty


Science or Fiction
Reply to  Willis Eschenbach
June 8, 2015 12:53 pm

Agree. See this freely available international standard for formal support of your claim:
“Guide to the expression of uncertainty in measurement”
http://www.bipm.org/en/publications/guides/
Section 4.2.3 covers the uncertainty of the average – the “experimental standard deviation of the mean”.
IPCC failed to notice this international guideline in their guidelines.
“Guidance Note for Lead Authors of the IPCC Fifth Assessment Report on Consistent Treatment of Uncertainties”
IPCC can easily be outperformed here.

Crispin in Waterloo but really in Yogyakarta
Reply to  Willis Eschenbach
June 8, 2015 6:22 pm

Thank you Willis.
This is your comparison:
Average of True weights rounded to 1 pound
We are 95% confident the average lies between 149.99 and 150.12.
Average of the same True weights rounded to 10 pounds
We are 95% confident the average lies between 149.85 and 150.11
I am familiar with the concepts you outline and sometimes use them in my work. Your true weight values do not have any uncertainty attached. Rounding does not introduce uncertainty; it introduces a perfectly balanced rounding which, done on the same 100,000 random numbers, produces virtually no change in the average, nor in the standard deviation – which, all should notice, increased rather than decreased when rounded to 10 pounds. The magnitude of the increase is related to the number of samples, of course. Do it with 3600 numbers.
Your example uses ‘real’ weights of people and rounds them ‘precisely’. You weighed them all with the same perfect scale, mathematically speaking. That is not a good analogy. What we are discussing is not an idealised calculation, it is a practical experiment with additional considerations.
Generate a set of random weights as precisely as you like, 10 digits. Apply to each one a random error which is akin to the repeatability value of each scale (this applies as well to RTDs). This will truncate the number of valid digits. You can keep lots of significant digits, but beyond a certain number they carry no information. Store the value of the uncertainty applied because you will need it later.
Then apply to each weight another random error which is akin to the accuracy of each scale as they drift away from reality over time. This further truncates the number of digits that carry valid information. Store the value of the uncertainty applied because you will need it later.
Then add another random error which represents whether or not the individual just ate a donut. This is akin to the micro climate of the water body being measured. Getting a person’s True Weight is not really possible because their weight changes all day long, just as does the temperature of any volume of real water. (I just had to get the lab staff to use a much smaller volume of water in order to calibrate the thermocouples against the RTD’s.) Store the value of the uncertainty applied because you will need it later.
Anyone’s ‘true weight’ has a +- attached. In other words, the numbers you generated as people’s true weights have to be recognised as having their own uncertainties. You compared their average with the 10-pound-rounded average and indicated they are very similar. The two averages are not merely similar; they are, statistically speaking, functionally identical, because the system cannot confidently tell whether or not rounding to 1 pound or 10 pounds makes a difference. Your proof is self-referential. It is a proof, near as dang-it, that the numbers were randomly generated.
Instead, have 100,000 San Franciscans put on a watch that weighs 0.01 pounds. Weigh them all using 100,000 different scales with and without the watch, first using a scale that is accurate to 10 pounds and then another accurate to 1 pound. (Using a 1 pound scale and rounding to 10 is not the same thing. That’s what you did.) Calculate their average weights with and without the watch. Will you detect the watch? Everyone weighs more, right? There is 1000 pounds of watches in there.
No, it is an ‘undetectable change’. Enough people will have eaten or not eaten a donut to hide the mass of the watch. Enough scales will have deviated their second readings enough to entirely hide the watch. The 10 pound readings couldn’t detect a purse.
Put a watch on half the group. Weigh the two groups using the same 100,000 scales and they all change scales this time. Can you tell which group has watches and which does not? No. The variability of each reading is larger than the mass of the watch. If you weighed each person 100,000 times each on their respective scales, you might be able to detect the watch but not with one weighing per person.
What is the standard deviation of one reading?
Next have 100,000 different people from the same total SF population put on a hat that weighs 0.1 pounds and repeat the readings using the 1 pound scales. You know the hats are there. Can you tell if the people participating are from the first lot or the second? If not, then the deviation of the first and second average weights from the true average weight is smaller than the Limit of Quantification (LoQ). Good so far.
Can you detect the hat’s mass, or even if it is there at all? No, because the Limit of Detection (LoD) is larger than the mass of the hat. The true measurement of LoD should also incorporate all systematic errors into the calculation of standard deviation. You can prove the average is ‘within bounds’ but you cannot detect the hat no matter how many people you weigh once each.
We are right back to square one. If the measuring device cannot confidently detect the change, averaging a large number of readings that cannot detect the change will not detect it confidently. Higher precision with a low precision instrument is only obtainable by taking a large number of readings of the same thing with that same instrument, and that still doesn’t make it accurate. Accuracy requires calibration and recalibration. A certified lab spends 20% of their time calibrating systems. How much time is spent calibrating ARGOs?
Taking a large number of readings (OK, not all that large) of isolated portions of an ocean which are known not to be the same temperature throughout cannot allow one to claim that the average temperature of the whole ocean is known with greater precision and accuracy than the precision and accuracy of the individual measurements of each portion. If it were true, we would not need RTDs, we could just put hundreds of thousands of cheap, uncalibrated thermocouples into the water and average the readings.
Unlike UAH satellites, these ARGO floats cannot be recalibrated after launch. All we know is that measurement uncertainty starts off above 0.01 C and increases with time. While one can estimate that their condition will drift ‘randomly’ and that ‘on average’ their conditions will be the average of the manufacturer’s curve, they are not measuring the same thing, ever, so it doesn’t help to know that. It is just another increase in uncertainty.
Uncertainties must propagate through subsequent calculations. No one can confidently detect a 0.005 degree change in ocean temperature with measurements at 3600 sample locations that are individually +- 0.06 Degrees. Moby Dick can swim twice through the uncertainty hole.

Bruce Williams
Reply to  Willis Eschenbach
June 8, 2015 8:40 pm

The average may be very accurate, but the total energy content will not be accurate.
Because temperature is the Kinetic Energy of the associated material, you will get different results based on how and when the math is done. Example (I use a unit temperature-to-kinetic-energy conversion such that Temp^2 * 1 = KE):
Example
Energy Calculation
(#1 – 10C 1unit volume + #2 – 2C 1 unit volume)/2 = Avg KE
KE_1 = 100 KE_2 = 4
Total KE = 104
Avg KE = 52
Sqrt(Avg KE) = 7.21C
Temperature Calculation
(T_1 + T_2)/2 = 6C
So, averages are nice numbers but they do not calculate out when roots, powers, or division are used, because these functions are non-linear.
Which means, unless I am wrong, that you cannot add temperatures together and use the average for energy content, because mixing equal amounts of water at 2 different temperatures does not produce the average of the two temperatures.
You can test this yourself. You need an IR temperature sensor (for quick response). A 2 cup bowl, a 2/3 cup measuring cup, some really cold water, some hot tap water and a microwave.
Put 2/3 cup of faucet hot water into the bowl and then heat up in the microwave for about a minute. You should have water in the 150F range. Stir the water to make sure you don’t have an upper layer of hot water!
Measure the temperature of your cold water; if from the refrigerator it should be on the order of 30F to 40F. Now pour the 2/3 cup of cold water into the hot water and stir it up well. Then measure the temperature of the mixture.
This mixing and measuring should take less than a minute or you may be cooling off due to natural processes.
I just did mine again and I got
Hot water – 152.2F
Cold water – 38F
Mixed water – 98.3F
Average of 152F and 38F is 95.1F.
A difference of 3.2F (about 3.3%).
If they used the average temperature from various readings to calculate the oceans heat content then they are way off!
And even at that, to prevent bias due to measurement accuracy it seems they would have to calculate each and every points high and low energy values then sum them to get any where near an accurate number for the energy content of the ocean at any one time, especially for comparison from year to year. Does anyone know if they did that?

Crispin in Waterloo but really in Yogyakarta
Reply to  Bruce Williams
June 9, 2015 9:41 am

Bruce, the average is not very accurate, not in the sense claimed. If I used a 100 pound resolution scale and put a 2 pound dog on it, and measured it 1 billion times, do you think the dog would eventually show up as having a non-zero mass?
The reason why not is the same reason the cosmic background radiation was not found for so long: they needed far more precise instruments to detect it. Measuring billions of times with a less precise instrument did not detect something below the detection limit. You can generate hundreds of examples. This is different from leaving a photo plate exposed to a distant galaxy and waiting for a long time. That works because the film really can detect the incoming photons, which are very sparse but energetic. Placing a metal plate over the film, rendering the photons ‘undetectable’, will not generate a picture of the galaxy because the film can’t detect them anymore.
An ARGO float cannot detect a temperature change of 0.005 degrees, at all, let alone reliably. Therefore we cannot know the average temperature of the oceans to that level of precision based on the measurements available.

Reply to  Crispin in Waterloo but really in Yogyakarta
June 9, 2015 10:01 am

Crispin,
I’ve been working with surface stations, and to be honest, I’m not sure what is correct or not.
So, if I may, I’d like to outline what I’m doing and get your thoughts, plus it will be more of a real example.
Let’s start with a single surface station; NCDC says the temps are +/-0.1F.
I take today’s minimum temp and subtract tomorrow’s to get the difference. In my mind this is today-tomorrow +/-0.2F. Now I take tomorrow’s min, and subtract the day after tomorrow’s from that.
I now have the difference for the first pair of days and the second pair of days. But the min temp tomorrow can’t be both +0.1F and -0.1F at the same time; it can only be one or the other.
Are the difference values +/- 0.15F?
Then if I string together 365 days, would the daily difference be +/- 0.1F +/- 0.00027F, so slightly more than +/-0.1F?
Should this be rounded to 1 decimal place?
Okay, second scenario: instead of a single station I have 1,000 stations I want to calculate the day-to-day change for. I take the difference I calculated for the single station, and average that difference for each of the 1,000 stations together as the average day-to-day change.
What is the accuracy and precision for the average day to day change?
How many decimal places should this be?
Thanks!

Bruce Williams
Reply to  Bruce Williams
June 10, 2015 9:27 pm

Crispin
I realize what you are saying about the accuracy of any measurement by the Argo instruments; I am saying you cannot take 2 temperature readings, average them and get the average temperature of the (in this case) ocean. The temperatures are not linear functions and therefore cannot be averaged, regardless of how accurate each reading is.
The fundamental physics behind temperature (average Kinetic Energy [Ke = 0.5*m*v^2] ) cannot be determined by adding several temperatures and averaging them. So the whole exercise seems to be a waste of time. You have to average the Ke, not temperature.

Reply to  Willis Eschenbach
June 9, 2015 3:04 am

Willis writes “This demonstrates that it is absolutely NOT true that the average can only be as precise as the underlying measurements.”
Crispin said it in detail, but for those who don’t read his post, Willis has made the mistake of applying a symmetrical “correction” across the data, and that won’t impact the average much. His example doesn’t demonstrate his argument. Sorry Willis.

Crispin in Waterloo but really in Yogyakarta
Reply to  TimTheToolMan
June 9, 2015 9:30 am

TheToolMan
The key ingredient in Willis’ example is that he used a random set of numbers with a known average, and used the same numbers rounded to the nearest 10. That is not the same as generating numbers to the nearest 10 and another set of numbers to the nearest 1. Using the ‘1’ numbers and rounding the last digit has the predictable effect of not moving the average much because the ’10’ numbers are in fact the same as the ‘1’ numbers.
With measurements made once with unique instruments in an environment you know is different each time, there is no gain on the precision of the average if one takes additional, unique measurements of different things with additional instruments. The ‘rules’ of making lots of measurement are reserved for multiple measurements of the same thing made using the same instruments.
That is why the 0.005 change in ocean temps is called ‘false precision’. No single instrument can detect such a change – it is literally lost in the noise. Only taking many readings of the same place, probably at the same time, can detect the signal in the noise.
It relates to Willis’ swimming pool. We know for sure there are regions of the pool that are not homogeneous. The temperature varies all over. Therefore all readings are unique and taken once. To get a ‘more precise value’ from the instruments used, the swimming pool has to be stirred so it is the same temperature everywhere, in which case there is no need to spread the devices around – they can all be in one place because the temperature is the same everywhere.
Obviously that is never the case, and it is the same in the oceans. Each measurement stands alone with its set of uncertainties. The standard deviation of a single measurement is “Does Not Apply”. There is no CoV for 1 measurement. A measurement taken 5 minutes later is in different conditions, and a different answer is expected. Additional precision comes from measurements that are expected to be the same.
The implications of this are huge for the outrageous precision claimed for land and sea temperatures. To say we are measuring ‘1 atmosphere’ is not gonna carry the day. The atmosphere is inhomogeneous. All readings of its temperature stand alone in terms of precision, unless you have a way to make multiple measurements at each site.
Suppose you put 1000 thermometers in a Stevenson screen, each one giving 0.5 deg accuracy, i.e. readable by eye in 0.5 degree steps. Assume competence in the readers. You calibrate them and record temps. There you have 1000 readings of the same thing. This approach is taken at CERN and in XRF analysis, where fantastic precision can be obtained by taking a large number of readings and patiently recording them for hours – with the same instrument. Using multiple instruments introduces some uncertainty but the precision will be much better than any individual thermometer, and must be reported with an uncertainty. One can even say it is good to 0.01 with A confidence, 0.1 with B confidence and 0.5 with C confidence. 2014 was the hottest year evah with 3/8ths confidence, out of 1.0. I am surprised he had the guts to admit it after making such a silly claim. False precision, false confidence, false conclusion IMV.
That example is completely different from taking one reading from each of 1000 locations each of which is expected to be different, with 1000 different thermocouples, even if they were calibrated once-upon-a-time.

Reply to  TimTheToolMan
June 9, 2015 7:46 pm

Crispin writes “That is not the same as generating numbers to the nearest 10 and another set of numbers to the nearest 1.”
Absolutely. Willis is concentrating on the error of the measurement itself and not on the error inherent in the method.

June 8, 2015 12:52 pm

Willis,
I still think you need an education in elementary statistical process control.
Climate science though is special. There are special rules for it.
Sorry to see you catching the disease.

Science or Fiction
Reply to  M Simon
June 8, 2015 1:14 pm

I disagree.
And I also have an issue with your argument. It is not a decent argument. It is an Ad hominem argument.
The uncertainty of the average will be reduced by 1/ (square root of the number of measurements)
if the individual observations differ in value because of random variations in the influence quantities, or random effects. Averaging will not reduce the error caused by systematic effects in measurement or sampling.
See:
“Guide to the expression of uncertainty in measurement”
http://www.bipm.org/en/publications/guides/
Section 4.2.3 covers the uncertainty of the average – the “experimental standard deviation of the mean”.

Reply to  Science or Fiction
June 9, 2015 3:25 am

“Averaging will not reduce the error caused by systematic effects in measurement or sampling.”
Such as too slow a rate of sampling resulting in the average moving before enough data is acquired to calculate it, for example?

Bruce Williams
June 8, 2015 8:49 pm

Being new to this, does anyone know how many papers are based on temperature rather than energy content/absorption in this whole warming/climate debate? Such as, are the models based on linear temperature changes with energy or are they based on a squared/square root functions of some sort?

June 8, 2015 9:37 pm

The design and functioning of the earth is such that when environmental heat increases, then parallel heat within the internal environment also increases – ocean warming then becomes inevitable. There are reports of increasing underwater volcanic eruptions: https://www.scribd.com/doc/248327805/Truth-About-Climate-Change-How-It-is-Unfolding-and-Can-We-Survive

June 8, 2015 10:12 pm

“As you might imagine, in the stormiest areas the largest waves mix the ocean to the greatest depths, which are shown in green and blue. You can also see the mark of the El Nino/La Nina along the Equator off the coast of Ecuador.”
Willis, I disagree. I believe that what you are seeing is upwelling and downwelling areas, or if you prefer, areas of deep water ventilation and areas of deep water formation.
In areas of upwelling the mixed layer is zero, and in areas of downwelling, it is very deep. Upwelling occurs as a result of Ekman transport along the eastern continental edges and the trade wind ITCZ edges of the Hadley gyres. You can even see delineations of very shallow mixed layer along both tropics, the mean poleward edge of the Hadley/Ferrel analog of the ITCZ.
For reasons not yet clear (but for an ice covered continent) the very same wind shear seems to produce downwelling along the poleward edges of the Ferrel cells. Downwelling is unimpeded on the Southern Hemisphere but the only vestige in the continent clogged Northern Hemisphere is near Greenland (notably the best NH approximation of an ice covered continent).

Crispin in Waterloo but really in Yogyakarta
Reply to  Willis Eschenbach
June 9, 2015 10:40 pm

Willis I appreciate your demonstration. You are not grasping the fundamental problem of trying to report something that cannot be detected by the instrument. It is not a matter of knowing how to run a program.
If you measure to 5 significant digits once, how much confidence can you have in the numerical value of a 6th significant digit? None whatsoever. Why? Because you only have one reading to use to estimate (guess) it. The standard deviation of one reading is 0.000. CoV of one reading is 0.000. We have 100% confidence that the one reading is the one reading. We can have no confidence in another significant digit because it was not measured. Karl et al (and many others) claim the ARGOs did. WUWT??
Did you follow the example of the extra mass on 50,000 people and another 50,000 people without it? If the scales they are using are only allowed to measure the mass once, the mass is undetectable. Full stop.
Us being right or wrong about averages of large numbers of random numbers is not relevant to the measurement problem. If you used 10,000,000 random numbers the result would have been closer. Why?
You did not add, as I suggested to make the demonstration relevant, variability (un-confidence) to the numbers, right? All your numbers have no error. You set the final target to be 150.000000000 unless you used double precision.
randoms=round(rnorm(100000,150,20),0)
Why then is there any surprise that the final answer is about 150? You guaranteed the answer would be close. We don’t know the actual average temperature of the ocean, that is why we are measuring it.
Did you catch my point about your swimming pool? Taking multiple measurements of the pool is not going to give you ‘better precision’ of its average temperature because the water temperature is different in each position. It is not multiple readings ‘of the same thing’. Each reading has a precision. You can’t measure in each place once to two significant digits and get an average answer with three significant digits, or four. That’s high school lesson material. By implication, you are claiming it is possible, in concert with Messrs Karl et al.
The measurement problem is that each reading has an inherent variability comprised of multiple factors and the total uncertainty of every reading is larger than the claimed trend in ocean temperature. That claim is not supportable by mathematical manipulation. The data needed to make, at a higher precision, a claim as to where the centre of the error bars are, is simply not there. We do not have 100,000 readings or even 30 from each position of each instrument. We have only one, and each has its little imprecision bars to go with it. Such uncertainty propagates.
Having 100,000 readings from a temperature-inhomogeneous ocean is not the equivalent of 100,000 readings of a temperature-homogeneous ocean. The CAGW edifice rests on such fundamental conceptual errors (and models without skill, of course). Unique location measurements of air temperature to within half a degree carry their error bands with them through all subsequent calculations. This is standard ‘propagation of errors’ stuff. My life would be a lot easier if errors disappeared instead of propagating! They are fecund little buggers.
There is no data set with a known standard deviation available for each ARGO data point. The entire business of ‘calculating things to a higher level of precision’ than is available from the raw data is not even smoke and mirrors. There is no mirror. There is no smoke. The claim to be able to confidently report the temperature of any ocean to within 0.005 degrees C is just wrong by slightly more than a numerical, figurative and logical order of magnitude.

June 10, 2015 7:13 am

Willis writes ” I have four measurements, each of which has an inherent uncertainty of ± 2 units. ”
In the case of Argo, the uncertainty is not known and almost certainly changes considerably by location and time of day and year.

Crispin in Waterloo but really in Seoul
Reply to  Willis Eschenbach
June 11, 2015 4:57 pm

First Willis, I am sure that Tim and I agree that your demonstrations of math and stats are correct, but they are partial. You have demonstrated certain statistical techniques. But there remains an insurmountable challenge: you cannot use statistical techniques to correct a conceptual error.
Karl et al is ultimately claiming that something below the level of detection can be detected by clever math. It is not a matter of working out from multiple measurements of different things how to find it. There are severe limits placed on that approach by the nature of instrumental readings.
“That one is obvious as well … which makes it clear that despite the inhomogeneity more measurements give us a better estimate of the overall average density.”
I have been searching around the Net for quotes that are relevant to this subject. Here is a suitable one from 2003:
From: http://en.wikipedia.org/wiki/Experimental_uncertainty_analysis (well down the page)
My bold, my italics. The bold indicates the point you make, the italics indicates the point I am about to make:
======
Sample size
What is missing here, and has been deliberately avoided in all the prior material, is the effect of the sample size on these calculations. The number of measurements n has not appeared in any equation so far. Implicitly, all the analysis has been for the Method 2 approach, taking one measurement (e.g., of T) at a time, and processing it through Eq(2) to obtain an estimate of g.
To use the various equations developed above, values are needed for the mean and variance of the several parameters that appear in those equations. In practical experiments, these values will be estimated from observed data, i.e., measurements. These measurements are averaged to produce the estimated mean values to use in the equations, e.g., for evaluation of the partial derivatives. Thus, the variance of interest is the variance of the mean, not of the population, and so, for example,
[gives examples]
which reflects the fact that, as the number of measurements of T increases, the variance of the mean value of T would decrease. There is some inherent variability in the T measurements, and that is assumed to remain constant, but the variability of the average T will decrease as n increases. Assuming no covariance amongst the parameters (measurements), the expansion of Eq(13) or (15) can be re-stated as
[Formula]
where the subscript on n reflects the fact that different numbers of measurements might be done on the several variables (e.g., 3 for L, 10 for T, 5 for θ, etc.)
This dependence of the overall variance on the number of measurements implies that a component of statistical experimental design would be to define these sample sizes to keep the overall relative error (precision) within some reasonable bounds.
===========
The number of measurements is 1. Each unique ‘experiment’ consists of a single measurement made at a certain place in 3D and time. There are no ‘multiple measurements’. Note the point the author makes that there is an inherent variability in the T measurements. That inherent variability is dealt with statistically by making multiple measurements of the same thing with the same instrument. We never have that for a land or sea temperature data set. Every measurement is unique and it represents an experiment performed once.
Last paragraph: There is an absolute requirement that in order to constrain the increase in the uncertainty caused by the inherent variability of all instruments and the rising number of readings, multiple measurements must be made of each data point, with a statistical design method that keeps the overall error ‘within reasonable bounds’.
ARGO floats do not, as a group, keep the average of all readings ‘within reasonable bounds’. Why? Because they are not measuring the same thing. There are no multiple measurements. The designers of the experiment know full well they are, as Monckton has said, measuring bodies of water on average as large as the volume of Lake Superior, and they have to be treated as independent bodies.
A good analogy is cups of coffee. Put 1000 cups of coffee on 1000 tables in 1000 restaurants in San Francisco. Using 1000 thermocouples accurate to 0.1 degrees C, measure the temperature of the coffee. Average the results. Can the average temperature be known to within 0.01 degrees? No. It cannot. Measuring the temperature of 2000 cups in 2000 restaurants with 2000 instruments will not reduce the uncertainty by half. It is the same or worse than measuring one cup once.
The math you propose is only valid for multiple measurements of one cup of coffee with one instrument. Even using 100 instruments to take 1 reading each of one cup of coffee is to invalidate the statistical claim to have increased the precision. Just as there is an inherent variability in the taking of each measurement with a single instrument, there is an inherent variability between instruments, and further, they may not be well calibrated against each other.
This whole air temperature and ocean temperature measurement to 0.001 degrees is so much statistical BS. Correctly described, each claim to ‘remarkable precision’ has to be accompanied by a confidence level.
There is a certain level of confidence attached to each level of precision, which is to say (these numbers are illustrative):
Confidence that the average temperature is 30 degrees = 100%
Confidence that the average temperature is 30.0 degrees = 95%
Confidence that the average temperature is 30.00 degrees = 40%
Confidence that the average temperature is 30.000 degrees = <<1%
We have just been lampooning the claim that 2014 was the hottest year evah, at 38% confidence, because there is 62% confidence that it was not. If Karl et al. claim a change with 0.1% confidence, there is 99.9% confidence that it was not detected. In other words, that it was a different number, and further, there is no way to know whether it is higher or lower. The confidence we can have in the value is limited by the quality of the inputs, which in this case rules out the use of the technique you have proposed.
The shape of the 'curve' of the confidence numbers is dictated by the uncertainty of the original measurements and the other confounding factors: area and depth weighting, vertical and horizontal position and so on. Each uncertainty adds to the standard deviation, reducing the confidence with which one can claim to have detected a very small change in the average.
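To make the shape of that curve concrete, here is a small sketch (illustrative numbers only, not anyone's actual error estimate): assume the estimated average has a normally distributed error with a standard error of 0.05 degrees C, and compute the confidence that the true value lies within successively narrower windows.

import math

# Illustrative only: assume the estimate of the average is unbiased and
# normally distributed with a (hypothetical) standard error of 0.05 C.
standard_error = 0.05

def confidence_within(half_width, se):
    """P(|estimate - true value| < half_width) for a normal error."""
    return math.erf(half_width / (se * math.sqrt(2)))

for half_width in (0.5, 0.05, 0.005, 0.0005):   # ever-narrower claimed precision
    pct = 100 * confidence_within(half_width, standard_error)
    print(f"+/- {half_width:g} C: {pct:.1f}% confidence")

Each extra decimal place of claimed precision costs a large slice of confidence, which is the point being made above.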
You have raised the flag of "we can't tell anything from ARGO measurements". That is not what anyone is saying. We can tell lots, but we cannot tell if the oceans (plural) have warmed by 0.005 degrees with any meaningful confidence.
So Karl et al. should have included a number that reflects how confident we can be that a change of 0.005 degrees has been detected. My confidence in that number is very close to zero. I am not objecting to 0.005 as the ‘value’ they detected; I am objecting to the claim that the change is ‘known’ to that level of precision with, say, 95% or 68% or some other meaningful level of confidence. The instruments, the distribution, the inherent variability, and an inhomogeneous ocean simply cannot support such a claim.

June 12, 2015 1:10 pm

Willis, found this on the Univ. of Colo. Tide Gauge Sea Level page.
Do you know anything about their database? I am looking for new ways of looking at the diurnal cycles other than sea level pressure. I don’t want to waste my time if it has the same flaws as the pressure stations. I am sure it is already corrupted, but maybe it gives insight into their conclusions.
“Major conclusions from tide gauge data have been that global sea level has risen approximately 10-25 cm during the past century.”

Crispin in Waterloo
June 12, 2015 2:11 pm

Sorry, Willis, about anything to do with the wording. I am really tired, travelling thousands of miles, and can’t remember to delete everything you will react to. What is important are the methods of determining precision and accuracy. I am just about impossible to offend, so don’t worry about harshness. I had a management advisor who was much worse than you ever will be.
I have consulted several more people on this and I am sorry to say that I have been unable to get you to view the problem as it is, instead of as you wish it to be. Here is a quote:
“Sorry, amigo, but that’s simply not true. If I know your weight with an uncertainty of 2 pounds, and I know my weight and two other people’s weight with an uncertainty of 2 pounds, then I know the average weight of the four of us to an uncertainty of 1 pound. ”
You have once again repeated the error of making the measurement with the same instrument. Further, you say, “I know your weight.” There is no uncertainty in that statement. But real measurements have uncertainty. The formula only applies when there is no uncertainty about “my weight”. Weigh everyone once using the same scale which has a resolution of 4 pounds. It yields numbers ± 2 pounds.
You do not really know my weight, you know my weight within a 4 pound range centered on the indicated value, say 150 pounds (which is not my real weight). You cannot know with greater certainty what my true weight is nor can you make a better estimate of the true position of the centre of the 4 pound range because you only have one measurement of my weight. This uncertainty propagates.
You also do not know your own weight save that it is within a 4 pound range centered on an indicated value, say, also 150 pounds. To calculate our total weight, it will be the sum of the indicated values plus or minus the sum of the uncertainties of each. The answer is 300 pounds with an uncertainty of ± (2+2 = 4) pounds. Our true combined weight could be as low as 296 or as high as 304. We do not know. Adding another two identical sized people weighed with the same precision would give a total weight of 600 pounds plus or minus 8 pounds. The average weight of the four of us is 150 pounds plus or minus 2 pounds.
I will apply the quadrature formula to the one reading we have for each person:
Quadrature applied to a single measurement: Sqrt(2^2) = 2 which is the same as before.
No matter how many single measurements we make the uncertainty of their average is not reduced. To reduce uncertainty we have to have more than one measurement of each person’s weight.
Suppose we weighed each person 4 times. This will more accurately place the centre of the range of uncertainty. The indicated (average) value will probably move up or down and the uncertainty range is reduced by half. The uncertainty about each person’s weight will be reduced to 1 pound because we have four measurements to rely on instead of only 1. The uncertainty of the average of all 4 of us will still be 1 pound even though we made 16 measurements total. In order to reduce the uncertainty of the average you would have to weigh each of us a larger number of times.
Similarly the uncertainty of the average weight of 8 or 16 people is not reduced just because you have included more people. To achieve that you have to take more measurements of each person. The reduction in uncertainty of each person’s weight is limited by the number of measurements of each person, in quadrature as you indicated.
Do you agree?
Now consider the same measurements made with 4 different scales, one for each person. This introduces additional uncertainties related to each instrument’s readings: is it biased? Was it calibrated correctly? Is its response linear with changing mass?
You have to consider the accuracy and drift of the different instruments in the equation that calculates the uncertainty. The net is filled with examples, but I was unable to find the exact formula for the case where all the readings are taken once each by a diversity of instruments. People keep writing about how many repeat measurements they must make of the same ‘thing’ (a specific point in the ocean) with the same apparatus (the surface temperature buoy). We do not have the luxury of multiple measurements, nor of using the same instrument everywhere.
Try sending a device for testing to each of four different labs: four samples from a manufacturing run, four labs with four sets of people, and four sets of lab instruments. You will get four different results. Averaging the results is not more precise than any of the individual results. In many cases it will be less precise than the best individual result. In some cases you actually lose not just certainty but significant digits.
“You still don’t seem to understand that it doesn’t matter what is measured.”
You do not seem to understand that there is a fundamental difference between measuring the diameter of one penny 1000 times and measuring the diameter of 1000 pennies once each.
Say the uncertainty is 0.1 mm per reading. The uncertainty of the first case is Sqrt(0.1^2 * 1000); the uncertainty of each of the 1000 single measurements is Sqrt(0.1^2).
Your statement implies that we can know the average diameter of all 1000 pennies, measured once each, just as precisely as we will know the exact diameter of one penny measured 1000 times. The 1000 pennies are not all the same diameter; there is a spread present there which one penny does not have. That variability has to be carried into the precision and the uncertainty of the average if the 1000 pennies are only measured once each.
When 1000 pennies are measured with 1000 different instruments that have not been calibrated in three years, another type of uncertainty is introduced, which concerns whether or not the readings are accurate, and the fact that different instruments may have different levels of inherent variability.
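Here is a rough sketch of the penny comparison, with invented numbers: the same penny measured 1000 times versus 1000 different pennies measured once each. What it illustrates is that in the second case the computed average includes the penny-to-penny spread and says nothing more precise about any individual penny.

import random

# Illustrative sketch only (numbers are made up): one penny measured 1000
# times, versus 1000 different pennies measured once each.
random.seed(1)

noise = 0.1            # instrument uncertainty per reading, mm
true_diameter = 19.05  # one particular penny, mm
penny_spread = 0.05    # manufacturing spread across pennies, mm
n = 1000

# Case 1: one penny, 1000 readings -> the average converges on that penny's
# diameter; only the instrument noise is being averaged down.
case1 = [true_diameter + random.gauss(0, noise) for _ in range(n)]

# Case 2: 1000 pennies, one reading each -> the average estimates the mean
# diameter of the population; the penny-to-penny spread is part of what is
# averaged, and no individual penny's diameter is known any better.
case2 = [random.gauss(true_diameter, penny_spread) + random.gauss(0, noise)
         for _ in range(n)]

print(f"one penny x 1000 readings: mean = {sum(case1)/n:.4f} mm")
print(f"1000 pennies x 1 reading : mean = {sum(case2)/n:.4f} mm (a population average)")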

Crispin in Waterloo
Reply to  Willis Eschenbach
June 15, 2015 2:46 pm

Willis: “Argo doesn’t tell us anything about the ocean? I don’t think so …”
I am glad I never said anything like that. ARGO tells us a lot.
I am saying that Karl et al. didn’t find anything. They cannot find an unquantifiable quantity smaller than the limit of detection of the instruments available.

Crispin in Waterloo
Reply to  Willis Eschenbach
June 15, 2015 3:01 pm

Willis: “Similarly, we don’t have to weigh every cup of coffee in San Francisco in order to get an accurate view of the average weight of a cup of coffee in SF. All we need to do is to take a representative sample.”
Working with a sample increases uncertainty. That is why the number in the denominator is N-1.
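As a small aside, that N-1 is the Bessel correction; a two-line illustration with made-up readings:

import statistics

# Tiny illustration of the N-1 (Bessel) correction mentioned above,
# using invented readings.
readings = [10.1, 9.8, 10.4, 9.9]

population_sd = statistics.pstdev(readings)  # divides by N
sample_sd = statistics.stdev(readings)       # divides by N - 1 (always larger)

print(f"divide by N   : {population_sd:.3f}")
print(f"divide by N-1 : {sample_sd:.3f}")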
Taking a representative sample of all cups of coffee, one measurement each, allows one to calculate an average which represents the centre of the values recorded. None of the readings is necessarily correct, and this applies equally to their average. All may be high. The distribution may not be normal.
The accuracy of the average is not affected by the number of readings; it is inherent in the instrument, which you gave as +/- 2 units. The true average value may not lie close to the calculated average of all readings. The only assurance from the manufacturer is that it lies within +/- two units of the average of the measurements. This is fundamentally different from increasing the accuracy. If the scale was mis-calibrated, all of the readings will be off.
So what’s the lesson here? All calculated averages are constructs, and the result is no better than the accuracy of the readings. Increasing the confidence about where the middle is doesn’t reduce the range, which remains at the accuracy of the instrument. To get a ‘better’ answer, people have to use more accurate instruments.

K. Kilty
June 13, 2015 2:25 pm

Willis Eschenbach June 12, 2015 at 3:49 pm
Let me take another shot at this. Here are some questions. Please include your calculations.
FIRST QUESTION
We have four numbers. Each has an associated uncertainty. Let’s say that they all have the same uncertainty of 2 units.
What is the uncertainty of the average of the four numbers?
SECOND QUESTION
We want to know if our four cats are gaining weight. We weigh each of them on a scale with an uncertainty of 2 units.
What is the uncertainty of the average of the four numbers?
THIRD QUESTION
Every day I go to a different coffee shop and weigh my coffee. My scale is good to ± 2 ounces. After I do this for four days, I take the average so I can find out how much coffee I’m drinking daily. What is the uncertainty of the average?
The part that people seem to have trouble with is that the uncertainty of an average is LESS than the average of the uncertainties. Average uncertainty in each of the above cases is 2 units … but the actual uncertainty of the average is only 1 unit.

Everyone has moved on in this thread, but here I am on a Saturday, killing time by reading papers on ARGO, and I come across this little challenge. Here goes.
FIRST QUESTION
The uncertainty of the average is sigma (common to all four numbers) divided by the square root of 4, which is to say 2/2 or one unit.
SECOND QUESTION
You say you want to know if the cats (as a group) are gaining weight, which implies two sets of measurements and then a comparison of the averages, with an associated uncertainty of the difference. However, you ask only for the average of the four numbers (i.e. the average weight of four cats at a single point in time). I can assume that each weighing of a different cat has the same uncertainty. This may not be so: the scale may have an uncertainty that varies with weight (heteroscedasticity). If all is ideal, then the average of the four numbers is the total weight of the four cats divided by four, with an uncertainty of one unit.
If you really want to know whether the set of four cats is gaining weight, then the cats’ varying weights present an additional uncertainty at each set of measurements. If we assume that only the scale contributes uncertainty, then the differencing would produce a number with an uncertainty of 1.414 units (the two averages, each good to 1 unit, combined in quadrature: sqrt(1^2 + 1^2)). But if the cats’ weights are varying throughout the period, so that there is an uncertainty in each cat’s weight according to time of day, then the uncertainty is larger.
THIRD QUESTION
Do you take the same coffee cup with you to each coffee shop? If so, then the cup is not a source of uncertainty, only your scale contributes, and your average is uncertain by 1 unit again. If you use the local shop’s cup, then the cups present additional uncertainty.
If uncertainty is always statistical, and there is no bias, then the uncertainty of an average is less than the average of the uncertainty. But in the worst possible case of non-statistical uncertainty we may have to estimate the upper bound of uncertainty as the sum of the absolute values of individual uncertainties. In manufacturing this is known as the iron-clad rule of stack-up error. In this worst case the additional measurements do not improve uncertainty at all. In two of the stated cases above we had to assume the data were distributed identically and the measurements independent in order to calculate anything at all.
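For what it is worth, the two limiting cases in that last paragraph can be written out numerically (illustrative values only): independent zero-mean errors combined in quadrature, versus the worst-case stack-up in which absolute uncertainties simply add and averaging gains nothing.

import math

# Two limiting ways of combining four independent +/-2-unit uncertainties,
# as contrasted in the comment above (illustrative numbers only).
uncertainties = [2.0, 2.0, 2.0, 2.0]
n = len(uncertainties)

# Independent, zero-mean ("statistical") errors: add in quadrature,
# then divide by n for the average.
quadrature_sum = math.sqrt(sum(u**2 for u in uncertainties))
quadrature_avg = quadrature_sum / n

# Worst case ("stack-up"): absolute values simply add; averaging gains nothing.
stackup_sum = sum(abs(u) for u in uncertainties)
stackup_avg = stackup_sum / n

print(f"quadrature: sum +/- {quadrature_sum:g}, average +/- {quadrature_avg:g}")  # 4 and 1
print(f"stack-up  : sum +/- {stackup_sum:g}, average +/- {stackup_avg:g}")        # 8 and 2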

Crispin in Waterloo
Reply to  K. Kilty
June 14, 2015 10:20 pm

K Kilty
Your last paragraph recognises that any expression of a calculated value has to be accompanied by a statement of confidence in that value. Adding ‘precision’ (meaning significant digits in base 10 or base 2) reduces the confidence. Why? Because the calculation cannot improve the accuracy of any or all of the measurements. If you weigh all 4 cats at the same time, you get the total to the reporting precision and inherent accuracy of the scale. Then divide by 4. OK, that’s better. But weighing them one by one, once each, is not the same at all. Stacking error.

Crispin in Waterloo
June 14, 2015 10:15 pm

Willis, I have consulted a P.Eng in Materials who deals with the issues of sampling and asked him how I can better communicate the major points using the analogies so far. He taught engineers for a few years.
He insists I communicate the following: we “have not agreed on the definitions of terms and clarified when they will be used.” Oops.
++++++
“SECOND QUESTION
“We want to know if our four cats are gaining weight. We weigh each of them on a scale with an uncertainty of 2 units.
“What is the uncertainty of the average of the four numbers?”
The moderator is catching on. If you only have one measurement of each, you have a reduced uncertainty in the average, but you have not increased the precision of the readings, nor the accuracy of each, nor the accuracy of the average. The calculated average is not a mass; it is the centre of a range that is the same as before, +/- 2 units. Averaging the numbers does not give a ‘truer’ value, it just gives a number which is a ‘better guess’; that is to say, you have more confidence as to where the centre of the range is probably located. Read on for an example below.
The poor definition lies with the term ‘uncertainty of the average’, with its implication that an ‘average’ number is going to be closer to the true mass than the individual weights. Behind that thought is the expectation that 4 measurements will be normally distributed about the true mass. That is a logical error: you don’t have 4 measurements of each cat. You don’t get a ‘normal distribution’ from 1 measurement.
++++++++
“THIRD QUESTION
“Every day I go to a different coffee shop and weigh my coffee. My scale is good to ± 2 ounces. After I do this for four days, I take the average so I can find out how much coffee I’m drinking daily. What is the uncertainty of the average?”
This analogy (discrete cups) is one step closer to the real-world example of the ARGO floats and the surface buoys. One more step is needed to frame the problem correctly.
The following is built upon the P.Eng’s recommendation for how to illustrate this type of problem.
Background:
You have a scale which reads 0-100 ounces with a readout precision of 1 oz and an accuracy of +/-2 oz. The readout precision comes from the markings on the scale, and the accuracy is from a test against a very precise Standard Scale. Every scale made by the manufacturer gives, on average and with a high confidence (99%), a reading within 2 oz of the true weight. (The P.Eng pointed out right from the start that no matter what you do with that scale, any calculated number will still be +/- 2 oz, which is why they make more accurate scales. Averaging readings does not change the accuracy of the scale any more than summing readings does.)
You carry the scale with you, and that is not like the ARGO floats, which are separate instruments. But this is your experiment. Measuring different things once each is analogous to the float and buoy measurements, which are in a different volume of water each time.
Each coffee shop you will visit has a “Whizzo Coffee Machine” that is calibrated to reliably produce 10 oz coffees within the limit of the legal definition of ’10 oz’, i.e. at least 9.75 oz and not more than 10.25 oz. All the coffee shops you are going to visit sell 9.8 oz coffees, though you are not aware of that. You are going to weigh them yourself.
The first cup reading is 10 oz
The second cup reading is 11 oz
The third cup reading is 11 oz
The fourth cup reading is 11 oz
All the readings are correct within the attested accuracy of the scale. The total mass of coffee is 43 oz +/- 8. The ‘8’ is because you have only made a single measurement of each serving, the point picked up by the moderator: the uncertainties add, +/-2 each x 4 = +/-8 for the total.
The average is 10.75 oz, but you are not allowed to claim that, because to do so would attribute spurious precision to the number. You have only 2 significant digits from your instrument, so the answer is only valid to two; the average is 11. It is not 10.75 or 10.8. It is 11. (Technically speaking, because your scale goes to 100, not 99, it is a 2-1/2 digit instrument, but you are not allowed to use the ‘1/2’ in this argument.)
The average of the readings is, logically, within the +/- 2 oz range of the true (unknown) value because all of the readings really are within the range. The scale is performing to spec.
The average of 11 oz is attended by a level of confidence. You can be very confident, say 99%, that the true average, which is 9.8 oz, lies within 2 oz of 11. And it does. You know no more than that. The average is not data (a measurement); it is a mathematical construct. If you claim added precision, you lose confidence. The claim is that you can guess more precisely, using the stats procedure, where the middle of the +/- 2 range is.
If you wish to say, “But I have 4 readings, and surely I can state the average with higher precision?”, the reply is that yes, you can! But you have to state that you have less confidence in any particular number that is within spec. The true value might be 12 oz. You have no proof that it is not. All the readings are within 2 oz of 12. Any average has a confidence level. In fact all measurements have a confidence level; we just ignore them most of the time.
You could say you have at least some confidence that the average is 10.8 and that the true mean is within +/- 1 oz of it. In fact the true mean does lie just within +/-1 oz, but because you do not know what the true mean is, you have to make that statement with a reduction in confidence. You cannot be as confident that it is within 1 oz of 10.8 as you can that it is within 2 oz of 11. None of the readings is correct, and their distribution is not normal, because each cup is unique (and may be a little bit different).
You decide to drink more coffee. The readings are 11, 10, 10 and 12. The ’12’ is there because the instrument cannot report the 11.8 it should have been; it rounds to the nearest 1. In fact even the Weights and Measures Inspector might not know if that particular coffee was 10.2 or 10.0 or 9.8 oz. She makes sure it falls within +/- 0.25.
Your scale is not precise or accurate enough to tell us. What we do know is that, given a lot of single weighings of multiple objects, each result reported will be within +/- 2 oz of the true value. [Actually there are different ways of reporting repeatability with real scales (round up, round to even, etc.), which are beyond the scope of this post.]
Because we can’t tell whether the last coffee was actually heavier, the best we can do is assume that the true weight was within +/-2 of 12, remembering that the value was rounded to the nearest oz because 1 oz is the reporting precision of the instrument.
The calculated average is now 10.75, which we have to round to 11. The total mass of 8 coffees is 86 oz +/- 16, and the reported average is 11 +/- 2. You could also report that the average is 10.8 +/-1 with a reduced level of confidence, or 10.75 +/- 0.5 with very little confidence, or 10.750 +/- 0.25 (or whatever) with virtually no confidence at all. The ‘guess’ is based on the available data. If you want to ‘know’ the value with greater precision and greater accuracy, you have to get a scale that reports to more significant digits and has greater accuracy. And that is precisely (ha ha) why people make them.
Increasing the number of readings of each serving would allow for an enhanced level of confidence but would not reduce the ‘full confidence’ range from +/-2. Any number emerging from a stats procedure will always have attached to it the accuracy, which is +/-2, because it is inherent in the instrument and therefore in the measurement. The true average is still 9.8. You might make 3000 readings, all of which are above 10 or 11, because the scale is only accurate to +/-2. Maybe all the readings are high by 1 oz, but that is within spec. A formula can’t solve that. And then there is instrument drift…
Lastly, consider that the ARGO and buoy measurements are made once each using different instruments! That would be like each coffee shop having its own scale on which you weigh its single serving once. Readings might vary from 8 to 12 even if the servings were all 9.8 oz, and all would be within spec. If the servings actually varied from 10.25 to 9.75, as allowed by law, none of the 8 weighed values above would change, meaning that the true total served falls within a 1500 oz range over 3000 cups.
Disclaimer: I have simplified this last example. The results would actually be worse than indicated once every influence bearing on single measurements of individual servings, each made on a different scale, is taken into account.
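Here is a minimal simulation of that last point, with invented numbers: a scale reading to 1 oz with a fixed in-spec calibration offset cannot resolve the legally allowed 9.75 to 10.25 oz variation in the servings, so every cup reports the same number.

import random

# A minimal sketch of the coffee-shop example above (invented numbers): a
# scale with 1 oz readout precision and a fixed in-spec calibration offset
# cannot resolve the legal 9.75-10.25 oz variation in the servings.
random.seed(0)

scale_offset = 1.2   # oz; within the +/-2 oz spec

def reading(true_weight):
    return round(true_weight + scale_offset)   # 1 oz readout precision

servings = [round(random.uniform(9.75, 10.25), 2) for _ in range(8)]
readings = [reading(w) for w in servings]

print("true servings :", servings)
print("scale readings:", readings)             # all identical despite real variation
print("average reading:", sum(readings) / len(readings))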

Crispin in Waterloo
Reply to  Crispin in Waterloo
June 19, 2015 3:26 am

Willis I am again replying to Question 2 (cat weights) and part of 3:
“…uncertainty is always statistical, and there is no bias, then the uncertainty of an average is less than the average of the uncertainty.”
This is quite true; however, you do not know whether the ‘more certain’ average is accurate. That is the issue, not straightforward statistics of perfect and normally distributed numbers. Knowing the average of a data set with less uncertainty does not mean knowing more accurately what the average is. My reply above, inspired by the P.Eng, sets that out pretty clearly. If the accuracy of the instrument is not good, the statistics do not rescue the following claim:
“If I make more measurements of something with an inaccurate instrument, whatever the precision (number of significant digits reported) I know ‘more accurately’ the true average value”.
That sentence is untrue. If the readings are all high by ‘2’, then the more-precisely-known average is off by 2 every single time. One cannot assume that all readings from an instrument are distributed around the true value. They can be very nicely and normally distributed about some other value away from the true value.
Land temps: because every single reading made on a visually read thermometer is +/-0.5 and the instrument itself might be calibrated to +/- 0.5 degrees, the total range is 1.0 degrees about its true value. It is an error to assume that the calibrations are normally distributed around the true value. It is an error to assume that readings are normally distributed around the displayed value each time.
Claims that an average temperature on land has been calculated to 0.001 degrees rest on the shifting mud of unlikely assumptions and spurious precision read from the mantissa of a calculator. The improvement in the precision of the average comes out differently when the same calculation is done in bases other than 10, because the number of significant digits changes. Under some conditions the number of significant digits decreases.
Karl et al. shows why one should not mix data sets; it should be put into textbooks as an object lesson, with an explanation as to why not: because it turns a pause into a trend where there is none.