Guest Essay by Kip Hansen
Introduction:
Temperature and Water Level (MSL) are two hot-topic measurements widely bandied about, and vast sums of money are being invested in research to determine whether, on a global scale, these physical quantities (Global Average Temperature and Global Mean Sea Level) are changing, and if so, at what magnitude and at what rate. The Global Averages of these ever-changing, continuous variables are said to be calculated to extremely precise levels, hundredths of a degree for temperature and millimeters for Global Sea Level, and minute changes on those scales are claimed to be significant and important.
In my recent essays on Tide Gauges, the question of the durability of original measurement uncertainty raised its toothy head in the comments section.
Here is the question I will try to resolve in this essay:
If original measurements are made to an accuracy of +/- X (some value in some units), does the uncertainty of the original measurement devolve on any and all averages – to the mean – of these measurements?
Does taking more measurements to that same degree of accuracy allow one to create more accurate averages or “means”?
My stated position in the essay read as follows:
If each measurement is only accurate to ± 2 cm, then the monthly mean cannot be MORE accurate than that — it must carry the same range of error/uncertainty as the original measurements from which it is made. Averaging does not increase accuracy.
It would be an understatement to say that there was a lot of disagreement from some statisticians and those with classical statistics training.
I will not touch on the subject of precision or the precision of means. There is a good discussion of the subject on the Wiki page: Accuracy and precision.
The subject of concern here is plain vanilla accuracy: "accuracy of a measurement is the degree of closeness of measurement of a quantity to that quantity's true value." [True value means the actual real-world value, not some cognitive construct of it.]
The general statistician’s viewpoint is summarized in this comment:
“The suggestion that the accuracy of the mean sea level at a location is not improved by taking many readings over an extended period is risible, and betrays a fundamental lack of understanding of physical science.”
I will admit that at one time, fresh from university, I agreed with the StatsFolk. That is, until I asked a famous statistician this question and was promptly and thoroughly drummed into submission with a series of homework assignments designed to prove to myself that the idea is incorrect in many cases.
First Example:
Let’s start with a simple example about temperatures. Temperatures, in the USA, are reported and recorded in whole degrees Fahrenheit. (Don’t ask why we don’t use the scientific standard. I don’t know). These whole Fahrenheit degree records are then machine converted into Celsius (centigrade) degrees to one decimal place, such as 15.6 °C.
This means that each and every temperature between, for example, 72.5 and 71.5 °F is recorded as 72 °F. (In practice, one or the other of the precisely .5 readings is excluded and the other rounded up or down). Thus an official report for the temperature at the Battery, NY at 12 noon of "72 °F" means, in the real world, that the temperature, by measurement, was found to lie between 71.5 °F and 72.5 °F; in other words, the recorded figure represents a range 1 degree F wide.
In scientific literature, we might see this in the notation: 72 +/- 0.5 °F. This then is often misunderstood to be some sort of “confidence interval”, “error bar”, or standard deviation.
It is none of those things in this specific example of temperature measurements. It is simply a form of shorthand for the actual measurement procedure which is to represent each 1 degree range of temperature as a single integer — when the real world meaning is “some temperature in the range of 0.5 degrees above or below the integer reported”.
Any difference of the actual temperature, above or below the reported integer is not an error. These deviations are not “random errors” and are not “normally distributed”.
Repeating for emphasis: The integer reported for the temperature at some place/time is shorthand for a degree-wide range of actual temperatures, which though measured to be different, are reported with the same integer. Visually:

Even though the practice is to record only whole integer temperatures, in the real world, temperatures do not change in one-degree steps — 72, 73, 74, 72, 71, etc. Temperature is a continuous variable. Not only is temperature a continuous variable, it is a constantly changing variable. When temperature is measured at 11:00 and at 11:01, one is measuring two different quantities; the measurements are independent of one another. Further, any and all values in the range shown above are equally likely — Nature does not “prefer” temperatures closer to the whole degree integer value.
[ Note: In the U.S., whole degree Fahrenheit values are converted to Celsius values rounded to one decimal place –72°F is converted and also recorded as 22.2°C. Nature does not prefer temperatures closer to tenths of a degree Celsius either. ]
While the current practice is to report an integer to represent the range from half a degree below to half a degree above that integer, some other notation could have served just as well. We might instead have reported the integer to represent all temperatures from that integer up to the next, as in 71 meaning "any temperature from 71 to 72". The current system of using the midpoint integer is better because the reported integer is centered in the range it represents, but this practice is easily misunderstood when notated 72 +/- 0.5.
Because temperature is a continuous variable, deviations from the whole integer are not even "deviations"; they are just the portion of the temperature, measured in degrees Fahrenheit, normally represented by the decimal fraction that would follow the whole degree notation, the ".4999" part of 72.4999 °F. These decimal portions are not errors; they are the unreported, unrecorded part of the measurement and, because temperature is a continuous variable, must be considered evenly spread across the entire scale. In other words, they are not, not, not "normally distributed random errors". The only reason they are uncertain is that, even when measured, they have not been recorded.
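A minimal sketch of the recording convention just described (the sample values are invented, and Python's round() stands in for whatever rounding rule a given network actually applies):

# Sketch of the recording convention described above (illustrative values only).

def record_fahrenheit(measured_f: float) -> int:
    """Round a measured temperature to the whole degree F actually recorded."""
    return round(measured_f)

def to_celsius_tenths(whole_f: int) -> float:
    """Machine-convert the recorded whole degree F to Celsius, one decimal place."""
    return round((whole_f - 32) * 5 / 9, 1)

for measured in (71.5001, 71.9, 72.0, 72.4999):
    rec_f = record_fahrenheit(measured)
    rec_c = to_celsius_tenths(rec_f)
    # Every measured value between 71.5 and 72.5 ends up recorded as 72 F / 22.2 C,
    # i.e. the record stands for a range one degree F wide, not a point value.
    print(f"measured {measured:>8.4f} F -> recorded {rec_f} F -> {rec_c} C")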
So what happens when we now find the mean of these records, which, remember, are short-hand notations of temperature ranges?
Let’s do a basic, grade-school level experiment to find out…
We will find the mean of a whole three temperatures; we will use these recorded temperatures from my living room:
11:00   71 degrees F
12:00   72 degrees F
13:00   73 degrees F
As discussed above, each of these recorded temperatures really represents any of the infinitely many possible temperatures within its one-degree range; with that in mind, I will make this little boxy chart:

Here we see each hour's temperature represented as the highest value in the range, the midpoint value of the range (the reported integer), and the lowest value of the range. [Note: Between each box in a column, we must remember that there are an infinite number of fractional values; we just are not showing them at this time.] These are then averaged (the mean calculated) left to right: the three hours' highest values give a mean of 72.5, the midpoint values give a mean of 72, and the lowest values give a mean of 71.5.
The resultant mean could be written in this form: 72 +/- 0.5 which would be a short-hand notation representing the range from 71.5 to 72.5.
The accuracy of the mean, represented in notation as +/- 0.5, is identical to the original measurement accuracy — they both represent a range of possible values.
Note: This uncertainty does not stem from the instrumental accuracy of the original measurement; that is a separate issue, and its effect must be considered additive. The uncertainty discussed here arises solely from the fact that measured temperatures are recorded as one-degree ranges, with the fractional information discarded and lost forever, leaving us with a lack of knowledge of what the actual measurement itself was.
Of course, the 11:00 actual temperature might have been 71.5, the 12:00 actual temperature 72, and the 13:00 temperature 72.5. Or it may have been 70.5, 72, 73.5.
Finding the means kitty-corner (diagonally) gives us 72 for each corner-to-corner combination, and across the midpoints still gives 72.
Any combination of high, mid-, and low, one from each hour, gives a mean that falls between 72.5 and 71.5 — within the range of uncertainty for the mean.

Even for these simplified grids, there are many possible combinations of one value from each column. The mean of any such combination falls between 71.5 and 72.5.
There are literally an infinite number of potential values between 72.5 and 71.5 (someone correct me if I am wrong, infinity is a tricky subject) as temperature is a continuous variable. All possible values for each hourly temperature are just as likely to occur — thus all possible values, and all possible combinations of one value for each hour, must be considered. Taking any one possible value from each hourly reading column and finding the mean of the three gives the same result — all means have a value between 72.5 and 71.5, which represents a range of the same magnitude as the original measurement’s, a range one degree Fahrenheit wide.
The accuracy of the mean is exactly the same as the accuracy for the original measurement — they are both a 1-degree wide range. It has not been reduced one bit through the averaging process. It cannot be.
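For readers who want to check this with more than pencil and paper, here is a minimal Monte Carlo sketch (my own illustration, using the three living-room readings above):

import random

# The three recorded (whole-degree) readings from the example above.
recorded = [71, 72, 73]

means = []
for _ in range(100_000):
    # Each record stands for some actual temperature anywhere in its +/- 0.5 range,
    # with every value in that range equally likely.
    actual = [r + random.uniform(-0.5, 0.5) for r in recorded]
    means.append(sum(actual) / len(actual))

print(min(means), max(means))   # never below 71.5, never above 72.5
# The mean of the recorded integers is 72, but the mean of the actual (unrecorded)
# temperatures can be anything in a range one degree wide: 72 +/- 0.5.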
Note: Those who prefer a more technical treatment of this topic should read Clyde Spencer's "The Meaning and Utility of Averages as it Applies to Climate" and my series "The Laws of Averages".
And Tide Gauge Data?
It is clear that the original measurement uncertainty in the temperature record arises from the procedure of reporting only whole degrees F (or degrees C to one decimal place), giving us not single-valued measurements but ranges in their place.
But what about tide gauge data? Isn’t it a single reported value to millimetric precision, thus different from the above example?
The short answer is NO, but I don’t suppose anyone will let me get away with that.
What are the data collected by Tide Gauges in the United States (and similarly in most other developed nations)?

The Estimated Accuracy is shown as +/- 0.02 m (2 cm) for individual measurements and claimed to be +/- 0.005 m (5 mm) for monthly means. When we look at a data record for the Battery, NY tide gauge we see something like this:
| Date Time | Water Level (m) | Sigma (m) |
| 9/8/2017 0:00 | 4.639 | 0.092 |
| 9/8/2017 0:06 | 4.744 | 0.085 |
| 9/8/2017 0:12 | 4.833 | 0.082 |
| 9/8/2017 0:18 | 4.905 | 0.082 |
| 9/8/2017 0:24 | 4.977 | 0.18 |
| 9/8/2017 0:30 | 5.039 | 0.121 |
Notice that, as the spec sheet says, we have a record every six minutes (1/10th hr), water level is reported in meters to the millimeter level (4.639 m) and the “sigma” is given. The six-minute figure is calculated as follows:
“181 one-second water level samples centered on each tenth of an hour are averaged, a three standard deviation outlier rejection test applied, the mean and standard deviation are recalculated and reported along with the number of outliers. (3 minute water level average)”
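A rough sketch of the quoted procedure (simulated samples only; this is not NOAA's code, and the 0.09 m scatter is an arbitrary stand-in for the real sensor behavior):

import statistics
import random

# Simulated stand-in for the 181 one-second water-level samples (metres)
# centred on a tenth of an hour; the real samples come from the acoustic sensor.
samples = [4.74 + random.gauss(0, 0.09) for _ in range(181)]

mean0 = statistics.mean(samples)
sd0 = statistics.stdev(samples)

# Three-standard-deviation outlier rejection, as described in the quoted procedure.
kept = [s for s in samples if abs(s - mean0) <= 3 * sd0]
outliers = len(samples) - len(kept)

# Mean and standard deviation recalculated on the remaining samples; the mean is
# what gets recorded as the six-minute value, and "sigma" is the standard deviation.
six_minute_value = round(statistics.mean(kept), 3)
sigma = round(statistics.stdev(kept), 3)

print(six_minute_value, sigma, outliers)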
Just to be sure we would understand this procedure, I emailed CO-OPS support [ @ co-ops.userservices@noaa.gov ]:
To clarify what they mean by accuracy, I asked:
When we say spec’d to the accuracy of +/- 2 cm we specifically mean that each measurement is believed to match the actual instantaneous water level outside the stilling well to be within that +/- 2 cm range.
And received the answer:
That is correct, the accuracy of each 6-minute data value is +/- 0.02m (2cm) of the water level value at that time.
[ Note: In a separate email, it was clarified that “Sigma is the standard deviation, essential the statistical variance, between these (181 1-second) samples.” ]
The question and answer verify that both the individual 1-second measurements and the 6-minute data values represent a range of water level 4 cm wide: 2 cm plus or minus the value recorded.
This seemingly vague accuracy, with each measurement actually a range 4 cm (about 1.6 inches) wide, is the result of the mechanical procedure of the measurement apparatus, despite its resolution of 1 millimeter. How so?

NOAA’s illustration of the modern Acoustic water level tide gauge at the Battery, NY shows why this is so. The blow-up circle to the top-left shows clearly what happens at the one second interval of measurement: The instantaneous water level inside the stilling well is different than the instantaneous water level outside the stilling well.
This one-second reading, which is stored in the "primary data collection platform" and later used as one of the 181 readings averaged to get the 6-minute recorded value, will differ from the actual water level outside the stilling well, as illustrated. Sometimes it will be lower than the actual water level, sometimes higher. The apparatus as a whole is designed to limit this difference, in most cases, at the one-second time scale, to a range of 2 cm above or below the level inside the stilling well. Some readings will fall far outside this range and will be discarded as "outliers": the rule is to discard all 3-sigma outliers from the set of 181 readings before calculating the mean that is reported as the six-minute record.
We cannot regard each individual measurement as measuring the water level outside the stilling well; they measure the water level inside the stilling well. These inside-the-well measurements are both very accurate and precise, to 1 millimeter. However, each 1-second record is a mechanical approximation of the water level outside the well (the actual water level of the harbor, which is a constantly changing, continuous variable), specified to an accuracy range of +/- 2 centimeters. The recorded measurements represent ranges of values. These measurements do not have "errors" (random or otherwise) when they differ from the actual harbor water level; the water level in the harbor or river or bay itself was never actually measured.
The data recorded as “water level” is a derived value – it is not a direct measurement at all. The tide gauge, as a measurement instrument, has been designed so that it will report measurements inside the well that will be reliably within 2 cm, plus or minus, of the actual instantaneous water level outside the well – which is the thing we wish to measure. After taking 181 measurements inside the well, throwing out any data that seems too far off, the remainder of the 181 are averaged and reported as the six-minute recorded value, with the correct accuracy notation of +/- 2 cm — the same accuracy notation as for the individual 1-second measurements.
The recorded value denotes a value range – which must always be properly noted with each value — in the case of water levels from NOAA tide gauges, +/- 2 cm.
NOAA quite correctly makes no claim that the six-minute records, which are the means of 181 1-second records, have any greater accuracy than the original individual measurements.
Why then do they claim that monthly means are accurate to +/- 0.005 meters (5 mm)? In those calculations, the original measurement accuracy is simply ignored altogether, and only the reported/recorded six-minute mean values are considered (confirmed by the author); it is the same error made in almost all other large data set calculations: applying the inapplicable Law of Large Numbers.
Accuracy, however, as demonstrated here, is determined by the accuracy of the original measurements when a non-static, ever-changing, continuously variable quantity is measured and then recorded as a range of possible values (the range of accuracy specified for the measurement system), and it cannot be improved when (or by) calculating means.
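Put in interval terms, the essay's argument looks like this. A minimal sketch, using the six-minute values from the table above, that treats each recorded value as a range (value +/- 2 cm) and averages the ranges end to end; note that this interval treatment is the author's claim, not NOAA's own error model:

# Treat each recorded value as a range: (recorded - 0.02, recorded + 0.02) metres.
recorded = [4.639, 4.744, 4.833, 4.905, 4.977, 5.039]   # the six-minute values above
half_width = 0.02                                        # the +/- 2 cm accuracy spec

lows  = [v - half_width for v in recorded]
highs = [v + half_width for v in recorded]

mean_low  = sum(lows) / len(lows)
mean_high = sum(highs) / len(highs)
mean_mid  = sum(recorded) / len(recorded)

# The mean of the lower bounds and the mean of the upper bounds still differ by
# 2 * half_width: averaging the ranges has not narrowed the range.
print(round(mean_mid, 4), "+/-", round((mean_high - mean_low) / 2, 4))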
Take Home Messages:
- When numerical values are ranges, rather than true discrete values, the width of the range of the original value (measurement in our cases) determines the width of the range of any subsequent mean or average of these numerical values.
- Temperatures from ASOS stations are recorded and reported as ranges 1°F wide (about 0.55°C), and such temperatures are correctly recorded as "Integer +/- 0.5°F". The means of these recorded temperatures cannot be more accurate than the original measurements; because the original measurement records are themselves ranges, the means must be denoted with the same +/- 0.5°F.
- The same is true of Tide Gauge data as currently collected and recorded. The primary record of 6-minute values, though recorded to millimetric precision, is also a set of ranges with an original accuracy of +/- 2 centimeters. This is a result of the measurement instrument's design and specification, which is that of a sort of mechanical averaging system. The means of tide gauge recorded values cannot be made more accurate than +/- 2 cm; that said, +/- 2 cm is far more accurate than needed for measuring tides and determining safe water levels for ships and boats.
- When original measurements are ranges, their means are also ranges of the same magnitude. This fact must not be ignored or discounted; doing so creates a false sense of the accuracy of our numerical knowledge. Often the mathematical precision of a calculated mean overshadows its real world, far fuzzier accuracy, leading to incorrect significance being given to changes of very small magnitude in those over-confident means.
# # # # #
Author’s Comment Policy:
Thanks for reading. I know that this will be a difficult concept for some. For those, I advise working through the example for yourselves. Use as many measurements as you have patience for. Work out all the possible means of all the possible values of the measurements, within the ranges of those original measurements, then report the range of the means found.
I’d be glad to answer your questions on the subject, as long as they are civil and constructive.
# # # # #
Kip,
I am a scientist and therefore a skeptic about climate science and many other things. When there is an easy way to test something I like to try it myself, and in this case there is an easy empirical test. We do not have to believe the statisticians' theories or assumptions. (Mainly the assumptions are the problem, but not in this case.)
So I was curious about what you are saying and decided to build a simulation, using the program Mathematica, which enables one to do very large simulations quickly. I wrote the following code (see below), which simulates taking temperature measurements every 1/10 of an hour, for 365 days.
I first generate a simulated "actual" temperature pattern varying sinusoidally over each day, and over one year to simulate the seasons. Then I add some "random walk" noise to that, which can easily add more than a few degrees to the deterministic "actual temps" (simulating weather variations over the day and year). I thought this might be important for capturing any problems with the integer rounding of each measurement done by meteorologists.
I then find the Tmax and Tmin for each day from the integer "measurement" data (i.e., the "actual temp" rounded to the nearest integer), and average those to get the "measured daily mean" temperature. The program then compares that to the mean measured using the "raw" (not rounded) data, where the actual daily mean uses all 240 measurements made during each day.
To my surprise, the actual and measured daily means are very close to each other (usually agreeing to 2 or 3 decimal places, sometimes 4) when averaged over the whole year (i.e., the yearly mean, found by taking the mean of all 365 daily means).
This simulation takes 10 seconds on my Mac, so I ran it a dozen times or more.
However, of course, the daily measured mean is off by up to +/- 1 degree, and on average is off by about 0.5.
I do remain skeptical of measured temperatures, and global averages, due to measurement environments warming (Anthony’s work shows it) and temperature adjustment biases of some of the climate scientists.
Here is the code; it is easy to follow. "rhv" is the name I gave to the "random walk" variation added, where each 0.1 hour the temp can increase or decrease randomly by up to 0.1 degree. Since it is cumulative it can easily accumulate variations of several degrees and is not bounded in how far it can vary from the sum of the two sinusoidal variations, which have excursions of +/- 12 degrees (daily) and +/- 18 degrees (yearly), together allowing deterministic variations of +/- 30 degrees. 0 degrees is assumed to be the actual mean over all time in the simulation. The final comparison is calculated to 8 decimal places. Mathematica does arbitrarily large precision.
(* Simulated temperature sampled every 0.1 hour (240 samples per day) for 365 days. *)
(* rhv is a cumulative "random walk" of up to +/- 0.1 degree per 0.1-hour step.     *)
dailyt = {}; actualdailymean = {}; dailyhigh = {}; dailylow = {};
measdailymean = {}; rhv = 0;
Do[
 rhv = rhv + RandomReal[{-.1, .1}];
 (* daily cycle of +/- 12 deg (period 240 samples) plus yearly cycle of +/- 18 deg *)
 temp = 12*Sin[2*Pi*t/240] + rhv + 18*Sin[2*Pi*t/(240*365)];
 AppendTo[dailyt, temp];
 If[Mod[t, 240] == 0,
  (* end of a day: the "actual" daily mean uses all 240 raw (unrounded) values *)
  AppendTo[actualdailymean, Mean[dailyt]];
  (* the "measured" daily mean uses only the rounded whole-degree Tmax and Tmin *)
  rounddailyt = Round[dailyt];
  tmax = Max[rounddailyt];
  tmin = Min[rounddailyt];
  AppendTo[dailyhigh, tmax];
  AppendTo[dailylow, tmin];
  AppendTo[measdailymean, (tmax + tmin)/2];
  dailyt = {}],
 {t, 1, 10*24*365}];
(* compare the yearly mean of the measured daily means with the true yearly mean *)
N[{Mean[measdailymean], Mean[actualdailymean]}, 8]
I averaged the AC voltage coming out of my electrical outlet and it was zero. Since the voltage is zero that means it’s safe to touch. Right?
I have a square hole and a round hole and make multiple measurements of diameter of each hole and average them out until I get results precise to 10 decimal places. Both diameters measure exactly the same therefore the holes are identical.
Averaging throws away information and reduces dimensionality. AC voltages have dimensions of frequency and voltage. Averaging throws away the frequency information leaving only the DC voltage information which is useless.
Averaging works very well for measuring DC since it is known that there are no frequency components which contain information and all frequencies can be filtered out. There is only one dimension to measure: voltage.
With AC there are at least three dimensions, voltage, phase and frequency. This is why all the “over unity” generators use AC measurements. By reducing three dimensions to one, the loss of information hides the fact the no energy is actually created.
When making AC or time-varying measurements, information on the frequency must be known in advance so that all other frequencies can be filtered out of the measurement. What precise frequencies need to be filtered out to make daily temperature measurements? Averaging doesn't work, since it only removes higher frequencies; it doesn't remove lower frequencies, which will affect measurement precision. Errors caused by low-frequency noise are indistinguishable from measurement error unless the precise frequency values are known and filtered out.
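A minimal numeric sketch of the AC outlet example above (values invented; 120 V RMS mains assumed): the arithmetic mean of a sine-wave voltage is essentially zero, while the RMS value, which is what matters for heating and shock, is not.

import math

# One cycle of a 120 V RMS (about 170 V peak) sine wave, sampled finely.
peak = 170.0
n = 10_000
samples = [peak * math.sin(2 * math.pi * k / n) for k in range(n)]

mean_v = sum(samples) / n
rms_v = math.sqrt(sum(v * v for v in samples) / n)

print(round(mean_v, 6), round(rms_v, 1))
# mean is ~0 V (averaging has thrown away the frequency/amplitude information);
# RMS is ~120 V, which is why the outlet is not safe to touch.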
Climate “science” is like trying to find the pea under the cup. The trick is not trying to follow the cups, but realizing that sleight of hand is used to hide the pea so that it isn’t under any of the cups. Extremely high dimensional computer models are collapsed down to two dimensions, time and temperature, throwing away massive amounts of information in an attempt to fool the common herd.
What’s important is not what is shown, but what is hidden.
Yes, there is obviously a huge loss of information when averaging. No one debates that. But the average (i.e., the mean) is mostly unaffected, even by coarse measurement and a simplified method of finding the mean (e.g., averaging Tmax and Tmin for the day), if the coarse measurement is sufficiently fine (i.e., 10x to 30x smaller than the true variation in the quantity being measured) and if the coarse measurement is done consistently over time, with all the daily coarse measurements of (Tmax+Tmin)/2 averaged over all of the days of a year.
Not an obvious result, but not all that surprising either.
That result would clearly NOT be true if there were a consistent asymmetry in the daily variation of temperature (i.e., more time spent near Tmin than near Tmax). In fact that may occur in actual temps in some (or many) locations on the earth (e.g., where the ocean keeps temps near the Tmin for 10 to 20 hours a day, but peak solar heating (1 pm to 2 pm), with only brief cloud-free periods in the afternoon, determines the Tmax).
In that case taking (Tmin + Tmax)/2 to be the daily mean temp is a BIG MISTAKE, which should be obvious to all meteorologists, and grade school children as well.
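A quick sketch of that asymmetry point (the hourly profile is invented): a day that sits near Tmin for most of its hours has a (Tmax+Tmin)/2 noticeably warmer than the true all-hours mean.

# An invented, asymmetric daily profile: 20 hours near 10 degrees, a 4-hour spike to 20.
hourly = [10.0] * 20 + [14.0, 20.0, 18.0, 12.0]

true_mean = sum(hourly) / len(hourly)
minmax_mean = (max(hourly) + min(hourly)) / 2

print(round(true_mean, 2), minmax_mean)
# true all-hours mean is 11.0, while (Tmax+Tmin)/2 = 15.0: the min/max "mean" is
# biased warm whenever the day spends most of its time near Tmin.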
Thanks TA. Someone above suggested he may do a similar thing but it looks like you’ve knocked it out in no time.
For those who cannot see the logic or refuse to see it, that pretty much knocks it on the head. Nice work.
You have a sinusoidal process with random noise. Now try that with an added constant, linear slope, say +.01 deg per day, and see what happens.
On initial review, I believe your method is a tautology.
You have assumed the result.
There is no “actual temperature” series – only temperature intervals (as represented by the “actual temperature” series). Your process reproduces (approx.) the original series – as it must with your underlying model assumptions.
Thin Air,
One thing that comes to mind is whether the random walk of 0.1 deg is sufficient. When a cold front moves through, tens of degrees change may be observed in a matter of minutes. Similarly, when a dense cloud moves overhead, degrees of change can be expected rather than 0.1 deg. Gusts of wind over a water surface are more likely to provide cooling in the range of a few tenths of a degree. Certainly, when a heat wave descends on an area, changes of much more than 0.1 degree can be expected. I’m curious what would happen if you increased the random walk increment, or had two different random walks of different magnitude and different periods, with the longer period RW having the larger magnitude.
Nicely done. Haven’t seen that in that few lines of code.
The arguers here cannot seem to separate the question of precision (or accuracy lost to limited precision) from the completely separate question of whether the mean temperature over a year is physically meaningful (or meaningful across geographies).
It’s completely valid (and correct) to argue that precision is not an issue, but the physical meaning of the average temperature of a year is a legitimate issue. It’s also very bad to argue as Kip as done about precision and its influence on accuracy (i.e. incorrectly), it just taints the rest of the valid arguments.
BTW the simulation needs to go a bit further. Though I don’t think anyone has addressed it, the calibration source and precision’s influence on accuracy needs to also be taken into account, because Kip is correct if we are talking about a single thermometer from a single calibration source. Again the CLT applies but only if there are many thermometers calibrated by different operators. I’d be curious if anyone has studied long term changes in calibration methods.
Superb !!
This piece should be required reading for every voter in the U.S., the E.U. and every other country.
If you have one perfectly calibrated thermometer whose output is integers reading 72 degrees F, the distribution of possible actual temperature is flat, with all temperatures between 71.5 and 72.5 being equally probable. If you have two such thermometers in two different places both reading 72 degrees F, then the distribution of possible actual average temperatures of both places, although it ranges from 71.5 to 72.5 degrees F, is not flat, but with probability of the actual average temperature of both places being zero for 71.5 or 72.5 F, and half as great for 71.675 or 72.375 F as for 72 F. Increasing the number of thermometers does not reduce the maximum possible difference between indicated and actual average temperatures of the region they measure, but the probability of any given actual temperature gets concentrated towards the indicated temperature.
Oh my, another assertion fan. If the actual temp is 71.5 and you have one perfect thermometer telling you it's 72, you believe it could be 71.5. But if you have two telling you the same thing, you now state that there is zero probability that the real temp is 71.5, even though we know it is.
Seems like it really is the twilight of the age of reason. 🙁
Keeping it real ! Nice one, Greg.
I said two different thermometers in different places, which are not necessarily at the same exact temperature.
Apologies Donald, I was not following what you were describing. Having seen Nick’s link below, I realise what you meant.
I tried to illustrate this here.
Nick: I found it necessary to click on your highlighted “here” to see what you posted.
Also, my original statement is incorrect in minor ways. I meant to say that with two thermometers rounding to the nearest degree F but otherwise perfect and in two different places and agreeing, that the actual average temperature of the two places had 50% probability of being 3/8 degree off, but I typo’ed one of my figures by .05 degree F off. The second way was my failure to see correctly the probability of various ways for two rolled dice to add up to various numbers – I assumed a parabolic arch, and now that I think about this more I see this as with upsloping and downsloping straight lines. (With a flat region in the center if the two dice have an even and finite number of sides, getting smaller as the number of sides that the dice have increases.)
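A small Monte Carlo sketch of the point being made here (my own illustration): a single reading's rounding offset is uniform on +/- 0.5, but the average offset of two independent readings has a triangular distribution; the worst case is still +/- 0.5, it just becomes increasingly improbable.

import random

def offset():
    # Rounding offset of one reading: uniform on [-0.5, +0.5].
    return random.uniform(-0.5, 0.5)

n = 200_000
avg_two = [(offset() + offset()) / 2 for _ in range(n)]

# How often the average offset lands near 0 versus near the +/- 0.5 extremes.
near_zero = sum(1 for x in avg_two if abs(x) < 0.05) / n
near_edge = sum(1 for x in avg_two if abs(x) > 0.45) / n

print(round(near_zero, 3), round(near_edge, 3))
# roughly 0.19 vs 0.01; a single reading would give 0.10 for both bands. Same
# worst-case range, but the probability piles up near zero (a triangular
# distribution), which is the two-thermometer point above.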
In order to give their published results the appearance (illusion?) of greater credibility, a statistical trick is often employed by scientists. The 'trick' used by many scientists involves the "Probable Error of the Mean" (PEM). Many of them misuse it in ignorance of its purpose and meaning. When replicate measurements are made on a quantity it is trivial to calculate the mean and standard deviation. Often this provides a rough idea of the magnitude, but the SD may produce what look like unsavory 'error bars'. To circumvent this embarrassment, the scientists may focus on the PEM, a quantity that is smaller than the SD by a factor of the square root of (N-1), where N is the number of measurements. If 100 measurements were made of the quantity, then the PEM will be nearly an order of magnitude smaller than the SD, a much more 'confident' result.
The problem is that this is ONLY valid when the replicate measurements are made on exactly the same sample, with the same techniques, and within a brief time span during which the sample cannot be expected to change.
I once published my direct measurements by mass spectrometry of the helium-3 content of atmospheric air in a peer-reviewed journal. Not long after publication I had another researcher contact me and practically BEG me to report the PEM rather than the SD. I would not do this as the measurements were taken on a large number of discrete samples. He was somehow disappointed.
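For concreteness, a minimal numeric sketch of the SD versus PEM arithmetic described above (the 100 replicate "measurements" are invented):

import statistics
import math
import random

# 100 invented replicate "measurements" of some quantity.
random.seed(1)
values = [50.0 + random.gauss(0, 2.0) for _ in range(100)]

sd = statistics.stdev(values)            # spread of the individual measurements
pem = sd / math.sqrt(len(values) - 1)    # "probable error of the mean" as described above

print(round(sd, 2), round(pem, 2))
# The quoted uncertainty shrinks by roughly a factor of 10 (sqrt(99)), which is only
# legitimate when the replicates really are repeated measurements of the same thing.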
Nice post Kip.
Naturally assuming precision greater than the noise is a pathway to error.
Leads to proclaiming a signal smaller than that noise to be significant.
Apparently the next statistical skill is to pronounce the average of this imaginary signal to significant digits well beyond physical meaning.
If you take these statisticians' arguments seriously, then enough measurements would lead to absolute accuracy.
The willful blindness of some of the critics is astounding.
But that is Climate Science, an ideology wearing the cloak of science.
WUWT's past post on the methodology of measurement needs to be reposted.
I think something was lost from general reasoning when we transitioned from slide rules to calculators.
Some of the arguments against your simple statement are amazing, brings to mind Douglas Adams and “Six Impossible Things Before Breakfast.”
Obviously the answer to life the universe and everything was wrong because after 4 billion years of calculations it was output to the nearest integer.
Now read this:
https://wattsupwiththat.com/2017/10/14/durable-original-measurement-uncertainty/comment-page-1/#comment-2636995
But it was calculated to be 42.000000000000000000000000000000000000 and the computer was intelligent enough not to waste paper. Truncated. It’s done like this now all over the place, keep seeing it on invoices, documents, etc. Very shitty and very annoying, keep an eye out for it. Jumps right out to bite you.
I think my secretary said you could turn it off/on in windows7 in some spreadsheet stuff when I mentioned it to her.
Measurement instruments must be accurate.
They must be checked at least every year to verify accuracy.
There must be enough measurements, well distributed and sited.
The people collecting and compiling the data have to be trustworthy.
The people compiling the global average have to be trustworthy.
A global average must be a useful statistic that represents the climate.
There are no real time temperature data for 99.999% of earth’s history.
There is little data for the Southern Hemisphere before 1940.
No one knows what average temperature is “normal”, or if there is a “normal”.
No one can identify any problems specifically caused by the average temperature rising +1 degree C. since 1880 … +/- 1 degree C., in my opinion
No one knows if the 1 degree C. range of average temperature since 1880 is unusually large, or unusually small.
The data infilling and “adjustments” are done by people who expect to see global warming and have predicted global warming in the past — are they trustworthy?
I consider myself to be a logical person — after 20 years of reading about climate change, I believe the +1 degree rise of the average temperature, mainly at night, and adding CO2 to the air, ARE BOTH GOOD NEWS.
So, long before we get to debating math and statistics, why not debate whether CO2 controls the climate (no evidence of that, IMHO), and whether adding CO2 to the air is beneficial because it greens the earth (I think so).
Climate blog for non-scientists:
http://www.elOnionBloggle.Blogspot.com
Very good Richard, and I 100% agree that BEFORE anything else it would be, to say the very least, useful to answer the CO2 question. This would however rather reduce this blog to a one-horse race and Anthony won't be going down that route. It does rear its head from time to time, but that particular elephant only pops its head round the door or sits in the corner of the room farting. We all know he is there and some of us comment on the smell. Usually ctm manages to lead it out.
So if you want to start a debate about “Why it’s not Carbon Dioxide” you will have to do it elsewhere. You know all the places (or should do) and I would be happy to join you for my $0.02 worth as a few others might BUT it’s generally a mighty quiet corner with very few active participants compared to WUWT.
RG,
You offered the advice, “They must be checked at least every year to verify accuracy.” That depends on the application and the consequences of being out of calibration. If you have a device that goes out of calibration the first month after installation, and it isn’t checked again until after a year’s production, you might have to throw everything away and declare bankruptcy. Or if you are monitoring toxins that accumulate in the body, a year might be far too long. That advice should be qualified by what it is that is being measured or monitored. A year might even be too long for use in an airport.
Thank you for the very nice detailed analysis of measurements. I like it. I confess I did not delve too deeply into the weeds of the article, but I believe I got the gist of it.
I would make a similar assertion with respect to long-term temperature estimates. For example, consider the oft-cited spaghetti chart of time-temperature series of over 100 GCM runs with some kind of an “average line” snaking through them. The chart does not indicate a best estimate with a range of uncertainty. The chart, at best, might define an infinite series of rectangular probability distributions along the time axis with ranges equal to the extreme high and low temperature estimates for all times. All temperature values within the range for any time, t, have the same probability of occurring. The analysis represents one bar of an incomplete probability bar graph. The bar’s fraction of the total probability of 1.0 is unknown.
This is a quick reaction. I may have gone into the abyss on this one. Any comments?
I’ll pull you up to the edge of the abyss, the rest is up to you.
My wife and all the other members of the coven have had a bake-in for charity. They have made a veritable ensemble of cakes of numerous varieties and flavours. I want a HOT sausage roll.
Kip,
Fine article. You address a crucial point in the climate debate: How accurate are the measurements and, therefore, how accurate can predictions (guesses) be?
You hit on a question I've puzzled about often: "When temperature is measured at 11:00 and at 11:01, one is measuring two different quantities; the measurements are independent of one another."
Are the observations of a time series like the temperature of a particular weather station independent? On the face of it, I’d say no. They’re highly correlated. If the temperature at 10:00 AM is 20 degrees, it’s unlikely to be -40 at 10:01 and even less likely to be -40 at 10:00:01 and less so at 10:00:00.01 etc.
How should this autocorrelation be properly handled?
Pat
Pat ==> If one is doing something reasonable, like deciding if it is too early in the year to plant potatoes, one needn't worry about it. The same is true if you are deciding whether to take a sweater on your hour-long walk at 5 in the evening. We know that the temperature "moves" (changes) from one temperature to another, moving through all the intervening infinitesimal steps. Temperatures are certainly auto-correlated on the basis of days and seasons.
The bottom line in this article, and others that attempt to explain measurement, especially of temperature and water levels, is how temperature and water levels are finally reported to the world in general. I doubt there is a single environmental or science reporter who could understand any of the issues associated with measurement precision and accuracy or means/averages. I asked someone just today if they believed in CAGW and if so why. They said because "scientists" were telling us. I then asked why do you trust what scientists tell you, after all they are only human. The response was "Ah, gee, good point." We then discussed briefly measuring things, means, accuracy and precision. Ask the next reporter you talk to to explain anomalies.
Now, a real world problem.
What is the “proper” way to get an “average” hourly weather (2 meter temperature, wind speed, wet bulb temperature (thus relative humidity) and pressure) for each day of the year at sea level, for 83 north latitude?
I've got 4 years of measurements for each hour at Kap Jessup Greenland. What I need is the average "weather" for each hour of each day of the year to determine (approximate really) the heat loss from the ocean at that assumed air temperature, humidity, and pressure.
Thus, the "average" 12:00 weather over the 4 years could simply be "Average the 2010, 2011, 2012, 2013 12:00 readings for each day-of-year." Plot all average temperatures, develop an equation (or set of equations) that curve-fits the daily cycle from the 5:00 am low to the 2:00 PM high and the yearly cycle from a winter's low to the mid-summer high. Trust the curve-fitting and the 4 data points for each hour to smooth out storms and clear periods.
But, is that a valid, adequately correct “average 12:00 temperature, pressure, humidity, wind speed” for 12 August? 12 Feb? 12 Dec?
“Weather” for 12:00 o’clock (on average for a yearly model) you expect to change slowly over the year’s time, but very rapidly over a 3-4 day period as storms go through. Should the storms and clear periods be “averaged through” as if they did not exist? What if May 12 had storms (high winds, high humidity, near-zero sunlight) 3 years of the 4?
“Weather” data for 12:00 should be close to that of 11:00 and 13:00. Should data for those hours be used to smooth the 12:00 information, or does that confuse it?
If one assumes 12 Aug 12:00 is the average of the 4 yearly 12 August 12:00 records, should 11 Aug 12:00 and 13 Aug 12:00 data be included in the average for 12 Aug to "get more data points"? After all, the expected daily change from noon on 11 Aug to noon on 13 Aug should be very small compared to the difference between 12 Jan and 12 Aug.
Should that be expanded to successive noon readings for the 4 year’s of noon records for 10 Aug, 11 Aug, 12 Aug, 13 Aug, and 14 Aug?
"heat loss from the ocean at that assumed air temperature, humidity, and pressure."
and wind speed ??
IIRC evaporation is proportional to the square of wind speed. S-B is T^4. Any non-linearity will mean that using averages is not correct. Whatever formula you come up with, you should apply it directly and average (or sum) the resulting heat losses.
Well, the idea of calculating each hour’s data independently has merit.
Each of the four losses vary differently: convection losses and evaporation losses are directly proportional to ocean surface temperature, air temperature, relative humidity and wind speed squared. (Evaporation losses are zero if the surface is ice-covered.) Long wave radiation losses (from open ocean or from ice-covered surfaces) vary by ocean surface temperature^4, air temperature^4, ice surface temperature^4 and relative humidity. Ice surface temperature adds in conduction losses from below, proportional to ice thickness.
Evaporation as a function of wind speed is a sublinear function, at least once the wind speed is great enough for turbulence to develop or if convection is occurring. Expect something similar to the difference between temperature and windchill as a function of wind speed.
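A tiny illustration of the nonlinearity point above (temperatures invented): for a T^4 radiation term, the loss computed from the average temperature is not the average of the losses.

# Two half-days at different (absolute) temperatures, Kelvin.
t1, t2 = 263.0, 283.0             # -10 C and +10 C
sigma = 5.670e-8                  # Stefan-Boltzmann constant, W m^-2 K^-4

avg_of_losses = (sigma * t1**4 + sigma * t2**4) / 2
loss_of_avg   = sigma * ((t1 + t2) / 2) ** 4

print(round(avg_of_losses, 1), round(loss_of_avg, 1))
# The two differ by a couple of W/m^2 here: with any nonlinear term (T^4, wind
# speed squared), apply the formula first and average the results, not the inputs.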
This is a very interesting discussion. It is like trying to assess the accuracy of the statement that the average American family has 2.5 children. While mathematically it may be accurate, it still represents an impossible representation of any actual family.
For the purpose of adding a touch of humor to this discussion Hoyt……..there is a small time window, when a mother in labor is delivering her 3rd child, that an actual family can have 2.5 children. It’s a very fleeting thing, but entirely possible.
Mark S Johnson,
Until the child is completely out of the birth canal (born), and has taken its first breath, it is not actually counted as a living child. Indeed, for purposes of expected longevity from birth, unless the baby survives the first year, it is not included in actuarial tables.
Careful Clyde, don’t let a rabid pro-lifer hear what you just said.
Hoyt,
So, probably what should be stated is that the average American family typically has two or three children. Short of a Solomon-like decision to cut one baby in half, humans come in whole numbers and it makes more sense to describe an average family in the units they come in.
Ah, but what if the reality is that families typically have either one child, or a larger family of four children? This is where averaging fails us. As I posted earlier, meaning is in the details. Averaging obscures details.
Hoyt,
I do agree with you that averaging hides a multitude of ‘sins.’
"if the coarse measurement is sufficiently fine (i.e., 10x to 30x smaller than the true variation in the quantity being measured)"
And there's the rub! We are asked to accept fractions of a degree Celsius per decade as measurable fact. Would anybody actually claim that the individual "coarse" measurements are better resolved than 0.1°C over decades at any one spot? They are good enough to approximate the absolute temperature (which swings by 15-20 deg. or more daily in most places, thus 10+ times the accuracy of the measurement), but hardly good enough for detecting minuscule long-term drifts.
I once witnessed a group of ME PhDs doing vibration analysis on an exotic machine. They were taking data samples tens of minutes long with A/Ds running at 4-8 kHz; these were 14-bit PXI modules, I think. They seriously flubbed up the grounding and had to deal with some line noise as well, but they learned to ignore the 60 Hz line and its harmonics. One day one of them was pondering a sharp spike at 1 Hz, and some lesser ones at 3, 5, and 7 Hz. The spike was several bits below the resolution of the A/D, low microvolts. I personally was surprised that you could even see anything but noise at that level, but it was a clear spike. When I offered that the source of the anomaly was the blinking light on the front panel of the signal conditioner, and offered an argument for the plausibility of my theory, I was laughed out of the room. If they figured it out I never knew about it.
Moral, it’s hard to argue an expert out of a position if he thinks he is more expert than you. Can be wrong on both counts but no matter.
A big thanks to Kip Hansen for a very clear explanation of the use of statistics and ‘averaging’ in the realm of Climate Fraud.
There is more than one person using the moniker “Phil” on this thread.
Phil ==> (all of you) — In truth, that’s why it is simply better to use full real names — there are other “Kip Hansens”s in the world — at least one in the Hollywood movie business — but only one that writes and comments here and in the NY Times.
The way stats work, I've been told, is that averaging the quantity itself doesn't change the accuracy, but doing the same for the anomaly does.
The problem I see is the item to be measured changes. So there is no fixed item being changed. Which means, I think, the error bar doesn’t change even for anomalies.
If an item is fixed but the measurement fluctuates around the correct number in a random way, I see repeated measurements averaging to the “real” value. But if each surface or SST spot and time is different, why would the anomalies average to a more accurate value?
The assumption for a global value has to be that WITHOUT external, i.e. CO2 forcing, the temperatures averaged over the planet would be unchanging. If there were even a cyclic variation of years in length, this assumption would be invalid. Which would mean variations in the global temperature would have larger uncertainties than presented.
I would like to see a discussion of the statistical probability of the actual temperature history within the error bars. We see the center line, but could the wander reasonably be ANYTHING within a 1.0 r value?
Every interpretation of our global temperature changes relies on an expectation of thermal stability without the influence of our forcings of interest. It's good to say, because of a-CO2 it is X. But if Gore and Mann were to show a "non-CO2" history, I'll bet we'd be skeptical. There would be too much stability these days shown for the average citizen to believe.
Douglas ==> This:
“The problem I see is the item to be measured changes. So there is no fixed item being changed. Which means, I think, the error bar doesn’t change even for anomalies.”
is exactly our point of agreement.
Bartemis October 15, 2017 at 11:43 am
“Nick is correct. It is well established statistical theory that averaging of quantized signals can improve accuracy.”
…………………………
Bartemis, the correct sentence is that ” … that averaging of quantized signals can improve PRECISION.”
You, Nick and others should not be propagating this fallacy when all it shows is that there is a problem of which you are a part. It is time for you to learn.
Example 1. The Hubble Telescope went into orbit with an error. Wrong mathematics gave a mirror with the wrong curvature. Operators could take as many repeat measurements of an object as they wanted, with NO IMPROVEMENT to accuracy. To correct the accuracy, another optical element was sent to space and fitted.
Example 2. The measurement of radiation balance at the top of the atmosphere has been performed by several satellites. See the problem –
http://www.geoffstuff.com/toa_problem.jpg
Simple eyeballing shows that there are accuracy problems from one satellite to another. The precision seems quite high. Over a short time, if the signal does not vary, there appears to be a small amount of noise and repeated sampling along the X-axis is doing just what I claim in the correct sentence above. Precision is being improved by repeated sampling of data from any one satellite. Accuracy is untouched by repeated sampling.
These examples are not hard to digest. Why, then, is there such a problem for stats fanciers to get their brains around the analogs to these examples when dealing with Kip's examples of ground thermometry and sea level measurement? I have given vivid pictures to help comprehension and I now repeat what I wrote above, for emphasis. There is NO WAY I can be shown to be wrong, but contaminated minds will possibly try.
Repeated quote: "It will be a happy future day when climate authors routinely quote a metrology measurement authority like BIPM (Bureau of Weights and Measures, Paris) in their lists of authors. Then a lot of crap that now masquerades as science would be rejected before publication and save us all a lot of time wading through sub-standard literature to see if any good material is there."
Geoff – The Hubble inaccuracy could not be corrected because correcting it required knowledge that was unavailable, i.e., a full mathematical description of the optical aberrations.
For this: “Simple eyeballing shows that there are accuracy problems from one satellite to another.”
It shows there are biases in each measurement set. Recall the assumptions of the model:
1) the data are homogeneous
2) the measurements are unbiased
3) the underlying signal is traversing quantization levels rapidly and independently
Bias runs afoul of assumption #2. If different types of instruments were used, that runs afoul of assumption #1.
Under the assumptions that I outlined and repeated above, the nature of quantization error is known, and accuracy can be improved. It is the assumptions that are the problem in the application at hand, not the process.
Here is a practical example. I have a constant quantity I want to estimate based on measurements. Let’s make it
K = 100.3
I am measuring this signal with a measurement that is polluted by a sinusoidal signal of amplitude 20, and a period of 20 samples, and then quantized to the nearest whole number.
The sinusoid will ensure that quantization levels are traversed rapidly, so that the error model holds reasonably well. I will average measurements 20 samples at a time to ensure that the sinusoidal signal is suppressed.
In MATLAB, I will construct my data set as follows:
x=round(100.3 + 20*sin((pi/10)*(ones(1000,1)*(1:20)+2*pi*rand(1000,1)*ones(1,20))));
The random phase ensures I have a different sample set for each row-wise trial.
I take the mean over the rows of this matrix:
y=mean(x,2);
The mean of y should be close to 100.3:
mean(y) = 100.2817
The estimated standard deviation is
std(y) = 0.0698
I expect the estimated standard deviation to be near the expected value
1/sqrt(20*12) = 0.0645
and, it is.
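For readers wondering where the 1/sqrt(20*12) figure comes from: under the stated assumptions, the quantization error of a single sample is uniform on [-1/2, +1/2], whose variance is the integral of e^2 from -1/2 to +1/2, i.e. 1/12 (standard deviation 1/sqrt(12), about 0.289). Averaging N = 20 effectively independent such errors divides the variance by N, giving a standard deviation of sqrt(1/(12*20)) = 1/sqrt(240), about 0.0645, which is the expected value quoted above.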
What is the uncertainty range of each measurement? Do you get measurements ranging from say 101 to 99. That is the point we are making. How do you tell which ones are correct? Your little experiment doesn’t seem to have any uncertainty built into it.
Bartemis,
You claimed, “The Hubble inaccuracy could not be corrected because correcting it required knowledge that was unavailable, i.e., a full mathematical description of the optical aberrations.”
Perhaps you could explain how the corrective mirror was manufactured if they didn’t have “…a full mathematical description of the optical aberrations.”
‘Perhaps you could explain how the corrective mirror was manufactured if they didn’t have “…a full mathematical description of the optical aberrations.”’
Good point. It appears I was hasty in my dismissal. Correcting the images via deconvolution with the optical response might have been possible. Indeed, googling “correcting hubble images via deconvolution” gives a plethora of references, and it apparently was done with some success.
A workshop paper from 1990 concluded:
So, the COSTAR was not so much an issue of correcting the images as it was of enhancing the SNR. Keep in mind, the servicing mission was in 1993, which was 24 years ago. There have been significant advances in algorithms and computing power since then. Perhaps, if it happened today, it could all be fixed with software.
Bart, there was never any “inaccuracy” in the Hubble telescope. It was designed to be “nearsighted” from the start. The first few years of operation of the Hubble was not to peer out into outer space, but rather to be used as the best spy satellite ever put into orbit. After the “spooks” had their fill of using it, they initiated the designed “repair” with the Shuttle to restore the device to be a true astronomical telescope.
Kip,
You, as many contributors to WUWT, have once again provided me with real world information to present to my classroom full of young minds. This information about accuracy and precision plays right into a topic I spend at least 5 weeks on in high school junior-level Chemistry. While my students won't get to see much in the way of biochemistry or nuclear chemistry, they will be able to calculate moles/atoms/grams, convert anything into a corresponding unit, and comprehend the periodic table and periodicity.
You are absolutely correct that this topic is woefully under represented in STEM courses, because it does not appear to be fully understood by upper elementary and jr. high math teachers, and isn’t taught with any rigor if at all. High school STEM teachers assume our students have been taught this subject so it is glossed over. I made that mistake my first semester, but will not do it again. I tell my students that we will not move on until they have mastered the various sub-sets of scientific measurements and calculations. In order to motivate the learning, they are shown how the grade system will be enforced, and how consequential it will be to their final grade of the semester.
We toured a petrochemical refinery two weeks ago. As the HSE Super was driving my group around the refinery and talking about what is done here and there I pointed out various capacity placards and asked my students if they could do the calculations if they knew the dimensions of the vessels and piping and could they then calculate the weight of the contents if they knew the density of the liquids, etc. It nearly brought tears to my eyes as I watched the connections to the classroom instruction click into place and my students were going, “Yeah! Mr. Dees, we could do that using…., then … , and the sig figs should be about….”
This is yet another article from WUWT that I’ll be shamelessly borrowing to drive home another topic in science to encourage critical thinking skills.
Though I am usually in agreement with the articles presented here, I do carefully select those that are straight up science and evidence based. I will not use articles which contain excessive bias or opinionated editorial in my classroom. As much as I’d like to shape the opinions of my students to match my own, that isn’t my job. My job is to build critical thinking skills using the scientific method and grow an understanding of chemistry, physics, and environmental science.
Thank you Anthony Watts and Kip Hanson.
PRDJ
PRDJ ==> Very high praise indeed. I shamelessly admit to having planned to be a HS Science teacher myself when I was a youth. Maybe this is my second chance….in a way.
I’m not exactly a youth myself. I’ve entered the classroom after spending two decades in water treatment for drinking, wastewater, and power plants (pre-treatment, water/steam, and discharges). I reckoned it was time to pass on the wisdom and try to affect a change in the classroom curricula to better reflect what these kids need in the “real world”.
The first discussion of this paper went fairly well. Two more class periods today and then finish up tomorrow. I’ll let you know how it goes.
PRDJ ==> Thanks, Teach’ !
Here’s the report succinct as possible.
The students with strong reading comprehension and vocabulary understood the concepts and were able to report out in a cohesive and comprehensive manner.
Those that struggle with reading comprehension and have weak vocabulary… well, they struggled.
Ultimately, I asked them to agree or disagree with the message. Their choice did not determine their grade. What determined their final assignment grade was how strongly they defended their position. I directed them to search the title and your name to access a digital copy of the paper, then follow and peruse the embedded links to assist in defending their chosen position.
While it may sound like a wishy-washy assignment, I find these students to be ingrained in rote regurgitation of information and weak in critical thinking skills. Therefore, I have assignments such as this after building their knowledge base about specific topics.
I’m still grading their papers, but wanted to report out as promised.
Thanks again, Kip, for your contributions here.
PRDJ ==> Thanks for the update, very interesting. I am a huge fan of teaching critical thinking skills over the memorizing of facts — and a huge fan of practicing mental math skills (quick down-and-dirty estimation of numerical data).
From time to time, I have made this point on Dr Spencer’s blog. I consider that people fail to appreciate the significance of the fact that at no time during the thermometer time-series reconstruction are we repeatedly measuring the same quantity, or even measuring at the same site, or even using the same equipment, or even using the same practice and procedure of measurement…
The stations that composed the sample set in 1880 are not the very same stations as composed the sample set in 1900, which in turn are not the very same stations as composed the sample set in 1920, which in turn are not the very same stations as composed the sample set in 1940, which in turn are not the very same stations as composed the sample set in 1960, and so on and so forth.
And so rests the case.
Exactly right Richard Verney. Exactly.
To quote from the “Scientist and Engineer’s Guide to Digital Signal Processing”, by S.W. Smith:
“Accuracy is the difference between the true value and the mean of the underlying process that generates the data. Precision is the spread of the values, specified by the standard deviation, the signal-to-noise ratio or the CV.”
If the ‘underlying process that generates the data’ is random error (i.e. the uncertainty is aleatoric), then, in accordance with the Central Limit Theorem, the spread of values of an increasing number of measurements will decrease, resulting in increased precision. Whether or not this also results in an increase in accuracy, however, is a matter of calibration (only for a perfectly calibrated instrument will the mean value of repeated measurements tend to the true value and, hence, precision equate to accuracy). The problem with calibration uncertainty is that it is epistemic rather than aleatoric. Therefore, no amount of repeated measurement will shed any light on its magnitude. Consequently, the extent to which a given level of precision can be taken as a measure of accuracy remains an open question. Certainly, if the level of uncertainty associated with calibration error exceeds the uncertainty associated with imprecision then you may be in deep doo-doo. In view of the controversy surrounding the calibration of the temperature and sea-level measurements, I would have thought the levels of precision being quoted were highly dubious.
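A minimal Python sketch (toy numbers of my own, not John’s) of that distinction: averaging more readings shrinks the random, aleatoric part of the error, but an unknown calibration bias rides along untouched.

import random

random.seed(42)

TRUE_VALUE = 20.0        # the real-world quantity being measured (hypothetical)
CALIBRATION_BIAS = 0.3   # unknown systematic offset: the epistemic part
RANDOM_SPREAD = 0.5      # 1-sigma of the random, aleatoric error

def one_reading():
    # each reading = truth + fixed bias + fresh random noise
    return TRUE_VALUE + CALIBRATION_BIAS + random.gauss(0, RANDOM_SPREAD)

for n in (10, 100, 10000):
    mean = sum(one_reading() for _ in range(n)) / n
    print(f"n={n:6d}  mean={mean:.3f}  offset from truth={mean - TRUE_VALUE:+.3f}")

# The scatter of the mean shrinks roughly as RANDOM_SPREAD / sqrt(n): more precision.
# But the mean converges to TRUE_VALUE + CALIBRATION_BIAS, not TRUE_VALUE:
# no amount of averaging reveals or removes the calibration error.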
John ==> I think that’s what I said….:-)
The StatsFolk (and the signal processing folks) are entitled to all the precision they can squeeze out of long divisions to many decimal places, averaging their heads off to their hearts’ content.
But if the original measurements are only vaguely accurate, then the means, though precise, are also vaguely accurate.
Precisely!
Boo!
🙂 Another scum with a bad pun… Ohh! It just makes my toes quiver!
I’ve just returned to my comment and realise I made a simple gaffe. I had meant to say that the spread of values of the mean will decrease, i.e. the standard error of the mean decreases. I also misrepresented the relevance of the central limit theorem. However, my main point remains. One can expect random errors to cancel out, but systematic errors will not. Uncertainty is often underestimated because the epistemic component is overlooked. I am not sure whether I am agreeing with Kip because I am unsure what is meant by the phrase ‘vaguely accurate’.
Fascinating conversation. It seems like both “sides” of this conversation have excellent points to make, and although I am not a statistics expert, I would like to make an analogy to something that I do somewhat understand and ask wiser heads than mine if it applies to this conversation as well…….
The world of digital video is filled with Analog-to-Digital converters to capture image “data” and then Digital-to-Analog converters that do the reverse, so that we can see the original source with as much fidelity as possible.
The quality of the “data” – and the ultimate reproduction of it – depends greatly on the frequency of the samples being taken and the precision of each sample being generated. There is also a spatial component in this process, with the largest number of “pixels” (or sample locations) producing the best data and ultimately the best reproduction.
It seems to me like this is almost exactly the same thing as trying to store and then “recreate” climate data so it can be used to “estimate” things like the average global atmospheric temperatures or average global sea height.
In the real world, these three variables together can generate many objectionable artifacts if the sampling method is not matched to the data being captured and reproduced.
To state the obvious, 12-bit video sampling yields data that is far more “accurate” than 8-bit samples.
And 240 Hertz sampling is more accurate on fast-moving objects than 60 Hertz sampling.
And 4K spatial sampling reveals more detail than a 640×480 sample on a complex image.
The analogies to this topic seem straightforward.
The number of bits used to sample video is equivalent to whether or not you use whole numbers to record temperature. The frame rate in video is equivalent to how often you measure the temperature (by the second, minute, hour, day, week, etc.). The spatial distribution (what video calls resolution) is equivalent to how many different locations you sample the temperature at.
My experience has been that it is absolutely possible to “improve” the accuracy of the reproduction by “averaging” many samples over time. After all, this is how digital time exposures work. Therefore, it should be possible (?) to observe climate trends more accurately than the accuracy of the initial measurements would imply.
The devil is in the details, of course, and some algorithms used to “improve” picture quality can easily create objectionable artifacts in the picture that are not really there.
Thoughts?
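A quick Python sketch (illustrative only; the one-dimensional “scene” and the noise figures are invented) of the frame-stacking idea above: averaging many noisy captures of the same static scene beats down random sensor noise, though it cannot remove a fixed sensor offset or recover detail the sampling never captured.

import random

random.seed(1)

# Hypothetical 1-D "image": true pixel brightnesses on a 0-255 scale
true_scene = [10, 50, 120, 200, 250]

def noisy_frame(scene, sigma=20):
    # one capture: each pixel perturbed by random sensor noise
    return [p + random.gauss(0, sigma) for p in scene]

def average_frames(scene, n_frames):
    sums = [0.0] * len(scene)
    for _ in range(n_frames):
        for i, v in enumerate(noisy_frame(scene)):
            sums[i] += v
    return [s / n_frames for s in sums]

for n in (1, 16, 256):
    est = average_frames(true_scene, n)
    rms = (sum((e - t) ** 2 for e, t in zip(est, true_scene)) / len(true_scene)) ** 0.5
    print(f"{n:4d} frames averaged -> RMS error {rms:.2f}")

# Stacking frames beats down *random* sensor noise; it cannot remove a fixed
# offset in the sensor, nor recover detail the sampling never captured.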
Mike Fayette ==> Appreciate your input. The trouble with analogies…..is that they are analogies.
In A-to-D video conversion, you are capturing one “screenful” of image data, one bit (loosely) or pixel at a time, trying to find the best estimate for that pixel from the still analog image. Selecting the stills at some frequency (more being better), you do this pixel by pixel at your chosen resolution by some sort of averaging technique. Once you’ve got all the pixels for a screenful, you can save it or throw it at a video screen and we have a still image. Enough still images delivered quickly enough gives us a motion picture.
I hope I have that right — been a while since I digitized live video coverage at the Masters Golf Tournament (Tiger Woods’ first year there) for internet broadcast (local only, for Lou Gerstner, IBM CEO). [ Gerstner’s comment upon seeing my demo of the tech? “That’s going to cost me a lot of money.”]
Trying to calculate global mean sea level is more like trying to capture an accurate screenful of data with ONE average value — and it results in “lite gray” very precisely, but not in a very accurate representation of the still image, much less a motion picture.
Here is the difference. Suppose your sampling device delivered ‘color’ anywhere from green to yellow and you don’t know which is accurate at each sample. This is what measurement uncertainty means. You can average the values, but are you sure you’re getting the right color when you do?
“Thoughts?”
Yes, it’s a similar problem, but much, much more difficult.
In the case you describe, there’s a right answer, but with analogues (proxies) like temperature and sea level, there’s no reference. No “right” answer to compare with. There are calibrated instruments that measure these things but none of them purport to measure global values. As a result, there’s nothing to compare them with.
The basic ideas behind global average temperature or global sea level depend on the belief that there is such a thing, and that remains to be seen. What does “global temperature” mean? How should it be measured? By averaging local temps? Willfully combining apples and oranges? Many think not. From a measurement theory perspective, it just doesn’t make a lot of sense, but it’s the best we can do with the tools we have.
I think it’s more important to question precision, as Kip has done. We have no way to measure “global temperature” or “global sea surface” with the precision that’s being presented to us. It simply can’t be done. The underlying data prevent it.
Kip, it would help a lot if you stopped using the work uncertainty to mean accuracy; they’re different. I thought you understood that and that it was the point of your essay?
Accuracy follows individual measurements; the mean of a repeated measure is only as accurate as the individual measure. Accuracy isn’t improved by arithmetic operations, but precision may be, assuming the measurement error is normally distributed.
No matter how many folks like Nick confuse the correct treatment of accuracy and precision those facts don’t change. You can try to improve that situation by encouraging use of the right words.
of course that would be “word” uncertainty. A simple typing inaccuracy…
Bartleby ==> In a perfect world, and in a perfect language, each and every word would have only one very accurate and precise meaning. You are right, of course, that not everyone means the same thing when they use the word uncertainty.
However, even in the rarefied field of science, there are multiple meanings for individual words.
It is quite right to speak of “measurement uncertainty” — under these definitions:
I have tried to point out that temperature measurements expressed as an integer only are in reality a RANGE, and thus “we don’t know” the value that existed and may have been measured, but was not recorded, to the right of the decimal point.
From the same document above:
“two numbers are really needed in order to quantify an uncertainty. One is the width of the margin, or interval. The other is a confidence level, and states how sure we are that the ‘true value’ is within that margin.”
For temperatures as they are recorded today, we have an INTERVAL of 1 degree Fahrenheit (0.55 degrees Celsius) to very near 100% certainty (ignoring all the variables of instrument error). The INTERVAL is the RANGE, and represents the uncertainty, because we don’t know exactly where in the interval the actual measurement lies.
For Tide Gauges, the INTERVAL results from the instrument itself — as explained in the essay, with a bit of the confidence removed by the method of averaging 181 one-second values, tossing out 3-sigma outliers, and re-calculating a mean.
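For the curious, a minimal Python sketch of that procedure; the water level and spread figures are invented, and the point is only that the reported mean is numerically precise while the instrument’s stated +/- 2 cm accuracy still attaches to it.

import random
import statistics

random.seed(7)

TRUE_LEVEL_MM = 1500.0    # hypothetical instantaneous water level
SENSOR_SPREAD_MM = 10.0   # assumed spread of individual 1-second readings

# 181 one-second readings, reported at 1 mm resolution
readings = [round(random.gauss(TRUE_LEVEL_MM, SENSOR_SPREAD_MM)) for _ in range(181)]

first_mean = statistics.mean(readings)
spread = statistics.pstdev(readings)

# toss 3-sigma outliers and re-calculate the mean, as described above
kept = [r for r in readings if abs(r - first_mean) <= 3 * spread]
reported = statistics.mean(kept)

print(f"first-pass mean {first_mean:.1f} mm, kept {len(kept)} of 181, reported {reported:.1f} mm")
# The reported figure is numerically very precise, but the gauge's stated
# +/- 2 cm accuracy still attaches to it: the true level is only known to
# lie somewhere in that interval around the reported value.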
Kip –
OK, if you’d like to distinctly use the words “Uncertainty” and “Confidence” to describe the two, rather than “Accuracy” and “Precision” I suppose I can support it. I’d encourage you to print a small thesaurus making the choice clear. I disagree that’s a good choice since throughout statistics texts you’ll find the terms “precision”, “uncertainty” and “confidence” freely interchanged, while “accuracy” means only one thing.
I use the more common terms “accuracy” and “precision” or “confidence.” It means that I consider “accuracy” the best term to use to describe the known error, and either “precision” or “confidence” to describe the error of estimate, the observational error using a calibrated instrument.
We can know (ironically through repeated measure) the accuracy of an instrument, such as a tide gauge or a thermometer. We assume that accuracy is fixed and that if the instrument is used precisely, the observations made using it will always fall within the range described, usually with a 68% probability (1 sigma).
Further, we assume there will be error introduced in any single observation using that instrument. In most cases, the observational error is assumed infinite for any single measure and is reduced as a function of the number of measures made. So, while “accuracy” of the instrument is never improved by multiple observation, the “precision” of the measure is increased with the count of observations. This assumption is based on the idea that observation error is normally distributed and multiple observations, when averaged, will tend towards the true value or “mean” (average).
As long as you make a very large point of distinguishing “accuracy” from “precision” (or in your case “confidence”) in some way there’s no problem. But make sure you communicate the difference to your readers and also make all efforts to be consistent in your usage?
Bart ==> Actually, I chose the words in the essay — and used far too many examples already. Those who read the essay should find it easy to understand as long as they are not blindered by too much brainwashing with “only one definition allowed”.
My editor already eliminated about half the redundant verbiage originally included to make sure everyone understood the concepts I was trying to get across, even if they may have been indoctrinated with a contrary idea.
If you’ve read all 389 previous comments above — you would find that further discussion of the same point(s) is not going to add to the conversation.
(Sorry — just tired of the repeated insistence that I use one particular set of nomenclature from one narrow field.)
Bartleby,
For me, the word “precision” implies the number of significant figures that can be assigned to a measurement. Whereas “uncertainty” implies a range, such as +/- 1 SD. “Confidence,” to me, brings in a subjective probability assessment that tells us something is highly probable, highly improbable, or somewhere in between.
There exists one international standard for expression of uncertainty in measurement:
“The following seven organizations supported the development of the Guide to expression of uncertainty, which is published in their name:
BIPM: Bureau International des Poids et Mesures
IEC: International Electrotechnical Commission
IFCC: International Federation of Clinical Chemistry
ISO: International Organization for Standardization
IUPAC: International Union of Pure and Applied Chemistry
IUPAP: International Union of Pure and Applied Physics
OIML: International Organization of Legal Metrology …”
The standard is freely available. I think of it as a really good idea to use that standard for what should be obvious reasons. Even some climate scientists are now starting to realize that international standards should be used. See:
Uncertainty information in climate data records from Earth observation:
“The terms “error” and “uncertainty” are often unhelpfully conflated. Usage should follow international standards from metrology (the science of measurement), which bring clarity to thinking about and communicating uncertainty information.”
Well, all I can say is that “certainty” and “uncertainty” in statistics are terms derived arithmetically and shouldn’t be confused with “accuracy”. They are used to establish accuracy in an instrument, but they’re distinct (operationally) from the confidence or precision used to describe the use of such an instrument.
I wrote a description of this difference in a direct reply, I hope I made it clear.
Essentially, the “accuracy” of an instrument is pretty much its known “precision”, if that makes any sense. Use of the instrument makes up the measure of its observational “error” or “experimental precision”, if that helps.
The accuracy of a calibrated instrument reflects its ability to, when correctly used, produce an observation “precise” to the stated range.
The precision of a measurement taken using such a device is determined by observational error, which is reduced by repeated measure.
I don’t know how else to say it.
S or F ==> They don’t have a single word for the concept I am trying to communicate specifically. If you find one in there, you let me know.
The lack of an Internationally agreed-upon assigned nomenclature for the concept changes nothing.
It is fully legitimate to identify a shortcoming or a flaw in the Guide to the Expression of Uncertainty in Measurement.
Personally, I am aware of one issue with it that should be taken care of: the way it kind of recognizes ‘subjective probability’. I have plans for writing a post on that issue.
However, I think you will need to pay extreme attention to definitions and clarity of your argument to be able to do that. If you use terminology that is already defined in that standard while meaning something else, you will run into problems.
This post contains some principles that I think you will need to observe closely to succeed with your arguments:
https://dhf66.wordpress.com/2017/08/06/principles-of-science-and-ethical-guidelines-for-scientific-conduct-v8-0/
Wish you the best of luck with your effort. 🙂 SorF
Bart ==> Apparently you haven’t read all 390 (now) comments above — and if I were writing a statistics essay, I would use their particular, somewhat peculiar, nomenclature. Your definitions are quite correct, of course, and repeated ad nauseam by stats students young and old above.
I am, however, writing about measurement, not statistics. Statistics is about probability. Think “engineering”.
I’m afraid that this horse has been beaten far beyond death — into horse puree — and must now be left for the street sweeper to clean up.
Kip,
I’m aware of the difficulty, I’m retired now but was once a practicing statistician involved in the design of experiments so I have a practical understanding of exactly what you’re trying to say.
I can’t claim to have read all 360 comments above, but I did read quite a few and some are mine. I only brought this up after reading an extended (and rather pointless I might add) debate with Nick Stokes, who apparently can’t differentiate between the concept of an accurate measure and a precise one. As mentioned, unless you agree on terms you’ll end with “horse puree” as you say 🙂
For those with a strong interest in this topic, I think that the Guide to the Expression of Uncertainty (the link to which is provided above, courtesy of Science or Fiction) is well worth the time to read.
Probably the most salient point to be found in it is as follows:
“4.2.7 If the random variations in the observations of an input quantity are correlated, for example, in time, the mean and experimental standard deviation of the mean [AKA standard error of the mean] as given in 4.2.1 and 4.2.3 MAY BE INAPPROPRIATE ESTIMATORS (C.2.25) of the desired statistics (C.2.23). In such cases, the observations should be analysed by statistical methods specially designed to treat a series of correlated, randomly-varying measurements.”
It is generally acknowledged that both temperature time-series and sea level time-series are auto-correlated. Thus, the caution should be taken to heart by those who are defending the position that the best estimate of uncertainty for temperature and sea level is the standard deviation divided by the square root of the number of observations.
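A small Python sketch of why that caution matters, assuming for illustration a simple AR(1) correlation with phi = 0.8: for correlated readings, the naive sd/sqrt(n) figure can badly understate how much the computed mean actually wanders.

import random
import statistics

random.seed(3)

def ar1_series(n, phi=0.8, sigma=1.0):
    # correlated "noise": each value remembers 80% of the previous one
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + random.gauss(0, sigma)
        out.append(x)
    return out

N, TRIALS = 100, 2000
means = [statistics.mean(ar1_series(N)) for _ in range(TRIALS)]

one_run = ar1_series(N)
naive_sem = statistics.pstdev(one_run) / N ** 0.5   # sd / sqrt(n)
true_spread = statistics.pstdev(means)              # how much the mean really varies

print(f"naive sd/sqrt(n) estimate: {naive_sem:.3f}")
print(f"observed spread of the mean over {TRIALS} runs: {true_spread:.3f}")
# With this level of autocorrelation, the naive figure understates the real
# uncertainty of the mean, here by roughly a factor of three.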
“Nick Stokes, who apparently can’t differentiate between the concept of an accurate measure and a precise one”
No quotes given. I haven’t talked about either very much. The frustration of these threads is that people persistently talk about the wrong problem. And it’s obviously the wrong problem. There is no citation of someone actually making repeated measurements of the same thing to try to improve precision. What is in fact happening is the use of repeated single measurements of different samples to estimate a population mean. And the right language there is sampling error and bias, and that is what I’ve been talking about.
Mr Stokes, the issue is not measuring an unchanging population mean, but a changing mean temperature, derived from instruments with a limited degree of both accuracy and precision. The dispute is whether one can claim greater resolution in determining that change than is inherent in the instruments.
“the issue is not measuring an unchanging population mean”
Often it is measuring a mean that changes slowly relative to the measurement frequency. You can see it as a version of smoothing. In any case, the issues are known. What it isn’t is a metrology problem. It is a matter of combining separately measured samples (which have expected variability) to estimate some kind of expected mean or smoothed value.
Nick ==> And you are entitled to “estimate some kind of expected mean or smoothed value” in the way you describe. What the method can’t do is claim to find a highly accurate mean far beyond the accuracy of the original measurements.
With all the statistical nitpicking of commenters above, Kip Hansen makes the extremely valid point that if someone tries to claim that sea level is rising at a rate of 2 mm per year or 4 mm per year, what people are really debating is whether the “average” sea level rose one increment of the instrument (20 mm) in 10 years or in 5 years.
Of course, the data cited by Kip showed that the water level at Battery Park rose 0.40 meter in a half hour, meaning that over the 6 hours or so between low and high tide, the water level could rise 2 meters or more (then return back to the original value 6 hours or so later).
If the long-term “average” sea level is rising at 2 to 4 mm/year, it might take several decades to sort out the slow-rising “signal” from the huge “noise” of twice-daily fluctuations two orders of magnitude larger, and twice-monthly fluctuations (due to phases of the moon) in the amplitude of the twice-daily fluctuations.
Then, if it is found that a 1-meter sea-level rise would flood a coastal city, in a few decades we could find out whether the city has 250 years or 500 years to build a 1-meter high seawall to protect the city. Most cities could afford to wait, in order to determine whether the investment is necessary.
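A rough Python sketch of that signal-versus-noise problem, with all figures hypothetical (a 3 mm/yr trend under a roughly 1 m semidiurnal tide plus +/- 2 cm gauge scatter); real records also carry weather-driven and interannual variability that this toy leaves out, so real-world detection takes even longer.

import math
import random
import statistics

random.seed(11)

TREND_MM_PER_YR = 3.0
TIDE_AMPLITUDE_MM = 1000.0   # ~1 m semidiurnal tide at a hypothetical station
GAUGE_SIGMA_MM = 20.0        # +/- 2 cm gauge scatter
HOURS_PER_YEAR = 24 * 365

def simulated_record(years):
    t_hours = list(range(years * HOURS_PER_YEAR))
    y = []
    for h in t_hours:
        tide = TIDE_AMPLITUDE_MM * math.sin(2 * math.pi * h / 12.42)  # M2 tide, 12.42 h period
        trend = TREND_MM_PER_YR * (h / HOURS_PER_YEAR)
        y.append(trend + tide + random.gauss(0, GAUGE_SIGMA_MM))
    return t_hours, y

def ols_slope_mm_per_year(t, y):
    # ordinary least-squares slope, converted from mm/hour to mm/year
    mt, my = statistics.mean(t), statistics.mean(y)
    num = sum((ti - mt) * (yi - my) for ti, yi in zip(t, y))
    den = sum((ti - mt) ** 2 for ti in t)
    return num / den * HOURS_PER_YEAR

for years in (2, 10, 30):
    t, y = simulated_record(years)
    print(f"{years:2d} yr record -> fitted trend {ols_slope_mm_per_year(t, y):+.2f} mm/yr (true {TREND_MM_PER_YR})")

# Short records are dominated by the tidal "noise"; only the longer ones
# converge on the small underlying trend.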
Steve –
I think it’s a bit more complicated than that.
An instrument like a thermometer or a tide gauge has a known accuracy, one that’s based on repeated measures using the device. So a tide gauge might be said to be accurate +/- 2cm. The resulting measure taken with the instrument is said to fall somewhere between +2 and -2 cm of any observation. The actual number is presumed to be somewhere within that range.
But the value itself is assumed to come from a normal distribution. So if it’s reported as “x” cm, it’s expected to lie somewhere between x-2 and x+2 cm, with a 68% probability. As the measure is repeated, our confidence that the value falls within that range goes up and we say the estimate of the true value is more “precise”. So after some number of repeated measures we can have more confidence (perhaps as much as 98%) that the value is correct, but that doesn’t change the range at all. We’re more confident the value is x, but only within the range of x +/- 2 cm.
This is hard to do in English. I hope that makes sense.
Maybe an example:
We use an instrument with an accuracy of +/- 2cm once to measure a value. The value is assumed (according to the accuracy of the device) to fall within the stated range with a 68% probability; it will be within that range 68% of the time.
We repeat the measure and collect the readings, then average them. As the number of observations increases, our confidence in the observations increases and the value of x changes. In the end, after many repeated measures, we can say we’re confident the value “x” is within the range of x +/- 2 cm with more than 68% confidence. The accuracy of the measure hasn’t changed, but its precision has.
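A back-of-envelope Python sketch of Bartleby’s point, treating the +/- 2 cm spec (purely for illustration) as a 1-sigma spread of single readings: the standard error of the mean shrinks with more readings, while the instrument’s stated accuracy band does not.

# Assumed figure: the +/- 2 cm spec treated as a 1-sigma spread of single readings.
SIGMA_CM = 2.0

for n in (1, 10, 100, 1000):
    sem = SIGMA_CM / n ** 0.5   # standard error of the mean of n readings
    print(f"n={n:5d}  standard error of the mean = {sem:.3f} cm   stated accuracy still +/- {SIGMA_CM:.0f} cm")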
Bart ==> I do appreciate your participation here … and your support on the issue at hand.
Kip, it’s difficult for me to compare my attempts to convey basic measurement theory with the strength and necessity of your essay.
Anything I can do to help. I seriously appreciate your efforts.
Kip – I forgot to thank you for putting up with the typos 🙂 There’s no “edit” button on this board and I have learned to depend on post hoc editing. What can I say? I’m a bad, bad typist…
Firstly, thank you Kip Hansen for a great post and for taking the time to respond to so many comments in this long thread.
I totally agree with Kip and appreciate his very well written and simple explication of this aspect of metrology.
I’ve come to this post late and have read all the comments above but only wanted to interject, if I wasn’t repeating what had been said already.
Kip describes an obvious truth to me, one that is a well-known, well-documented and well-understood issue for practitioners in the real world.
I’m a visual person, and the commonly used image of a target to symbolise the accuracy/precision issue is the first thing I thought of. But when doing this mental exercise, one can easily see that there is a need for a third term! I drew pictures of each “target” with the aim of making a cartoon example that would make it very clear to a layman. I was sure that anybody who had given this more than a moment’s thought would also have come up with the same result, so I did just a little research:
Hope this helps to visualise the issues.
http://i63.tinypic.com/a40kye.gif
cheers,
Scott
*BIPM International Vocabulary of Metrology (VIM)
**I’ve adapted my graphic from the Wikipedia commons image.
Scott ==> Beauty. From the standard you quote:
The “measurand” is “A quantity intended to be measured. An object being measured.”
The only time “accuracy” applies to the mean of a set of measurements is when they are measurements of the same measurand.
As you can tell from this and other essays, I use a lot of images to help communicate the issues I write about.
Like your modified Accuracy/Precision trio.
@Kip – While I’m well aware that it would take you weeks to digest all comments, I’ll make a comment nevertheless. Take your time, if interested. 🙂
I would stick to the standard – Wikipedia is not the standard. Definitions matter – a lot:
«1.2 This Guide is primarily concerned with the expression of uncertainty in the measurement of a well-defined physical quantity — the measurand — that can be characterized by an essentially unique value. If the phenomenon of interest can be represented only as a distribution of values or is dependent on one or more parameters, such as time, then the measurands required for its description are the set of quantities describing that distribution or that dependence.» – GUM
To investigate the issue of rounding, I think the measurand can be defined as:
The average of the temperature measurements by 2000 thermometers,
where:
these thermometers have a resolution of 0.10 °C,
where:
these thermometers have an uncertainty of 0.01 °C at a 95% confidence level,
where:
the thermometers are not drifting,
where:
these thermometers are used to measure a continuous variable,
where:
the variable is random in the measurement range of the thermometer,
where:
all the measurements are uncorrelated
and where:
each reading is rounded to the nearest integer.
In that case, rounding each measurement to the nearest integer does not cause a significant increase in the uncertainty of the average, compared with the average that would be obtained from the original unrounded readings of the thermometers.
If one condition or premise is added or changed, the conclusion may no longer be valid and will have to be reconsidered.
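A quick Python check of that claim, using invented numbers (uniformly spread “true” temperatures and no instrument error), together with the failure mode mentioned further down the thread, where the temperature range is narrow compared to the rounding step.

import random
import statistics

random.seed(5)

N_THERMOMETERS = 2000

def rounding_effect(low, high, trials=200):
    """Average absolute difference between the mean of exact readings and the
    mean of the same readings rounded to the nearest whole degree."""
    diffs = []
    for _ in range(trials):
        exact = [random.uniform(low, high) for _ in range(N_THERMOMETERS)]
        rounded = [round(x) for x in exact]
        diffs.append(abs(statistics.mean(exact) - statistics.mean(rounded)))
    return statistics.mean(diffs)

# Readings spread over many degrees: the individual rounding errors average out.
print("wide range, 0 to 30 deg :", round(rounding_effect(0.0, 30.0), 4), "deg")

# Readings confined inside a single rounding interval: they all round the same
# way, and the rounding error does NOT average out.
print("narrow band, 20.6-20.9  :", round(rounding_effect(20.6, 20.9), 4), "deg")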
Whether the average temperature is representative of the average temperature of the earth is an entirely different question that certainly deserves some consideration. And whether the so-called raw measurements really are unadjusted, and whether the adjustments to the temperature data are valid, are serious questions.
«D.1.1 The first step in making a measurement is to specify the measurand — the quantity to be measured; the measurand cannot be specified by a value but only by a description of a quantity. However, in principle, a measurand cannot be completely described without an infinite amount of information. Thus, to the extent that it leaves room for interpretation, incomplete definition of the measurand introduces into the uncertainty of the result of a measurement a component of uncertainty that may or may not be significant relative to the accuracy required of the measurement.» – GUM
Actually, the global average temperature is not defined in the Paris Agreement. And climate scientists keep changing their measurement of ‘it’ all the time without properly defining their products.
Science or Fiction,
You remarked,
“these thermometers are used to measure a continuous variable,
where:
the variable is random in the measurement range of the thermometer,…”
These two statements strike me as being contradictory. Am I misunderstanding what you meant?
You go on to offer, “all the measurements are uncorrelated”. This seems to be at odds with your first statement. For a time-series at a particular site, there will be autocorrelation. That is, one does not expect temperature to change instantaneously by some tens of degrees, which would be the case if the variable were truly random. Any particular site can be considered a sample, which is composited for the global average. Therefore, all the sites will exhibit a degree of autocorrelation, which will vary depending on the season and location.
Kip==> Thank you again for taking the time to respond. You say:
Sure, I do agree with you but wanted to tease out the very first principles.
If you aim at one target and take one shot – one measurement of one measurand – you can only talk about the result in terms of trueness (the proximity of the measurement result to the true value), or how close the bullet hole is to the centre of the target. You cannot yet talk about accuracy, because that requires a knowledge of precision. Precision is the repeatability of the measurement; it equates to a cluster of shots – or “the closeness of agreement among a set of results”.
I simply meant, that more than one shot is required either of the same target or single shots on multiple targets before you can determine precision. Once you have a “measure” of precision and trueness you can begin to talk about accuracy.
The reason precision is a necessary condition for accuracy is because it is impossible to separate random error – or a random truth in this case – from systematic error. A gun bolted to a test bench hitting the bullseye in one shot can not speak to accuracy because precision hasn’t been tested. The next several shots hitting all over the target or all shots going through the hole in the bullseye will illuminate the situation however!
If there are two sides to the argument in the thread above, I think it is because people are talking at cross purposes. One side is arguing precision and the other trueness and they are confusing either term with accuracy.
To restate: the trueness of your tide gauge example equates to the target’s bullseye – the instantaneous water level outside the stilling well. While the 181 one-millimetre-resolution readings (the “cluster”) provide precision, the +/- 2 cm calibrated range of the apparatus, added to the recorded value, represents its real accuracy and true range.
And I agree, averaging large numbers of these values will only produce a spurious accuracy because, although a quasi-precision might be gained, trueness is lowered and thus accuracy lost!
cheers,
Scott
@Clyde Spencer
I was just trying to define a hypothetical set of conditions to make it easier to understand that rounding of the readings per se will not be a problem. However, as you point out, the conditions are not true for a real-world attempt to measure global average temperature with a number of thermometers. There are a large number of issues with an attempt to measure global average temperature. I’m most concerned about the adjustments. How adjustments for the urban heat effect apparently increase the temperature trend rather than reduce it is one of the things that makes me wonder. I don’t think rounding per se is the most significant problem, even though rounding to the nearest integer could be a significant problem if the temperature range is small compared to the size of the rounding.