Guest Essay by Kip Hansen
Temperature and Water Level (Mean Sea Level, MSL) are two hot-topic measurements being widely bandied about, and vast sums of money are being invested in research to determine whether, on a global scale, these physical quantities — Global Average Temperature and Global Mean Sea Level — are changing, and if so, by what magnitude and at what rate. The global averages of these ever-changing, continuous variables are said to be calculated to extremely precise levels — hundredths of a degree for temperature and millimeters for Global Sea Level — and minute changes on those scales are claimed to be significant and important.
In my recent essays on Tide Gauges, the question of the durability of original measurement uncertainty raised its toothy head in the comments section.
Here is the question I will try to resolve in this essay:
If original measurements are made to an accuracy of +/- X (some value in some units), does the uncertainty of the original measurement carry over to any and all averages — the means — of these measurements?
Does taking more measurements to that same degree of accuracy allow one to create more accurate averages or “means”?
My stated position in the essay read as follows:
If each measurement is only accurate to ± 2 cm, then the monthly mean cannot be MORE accurate than that — it must carry the same range of error/uncertainty as the original measurements from which it is made. Averaging does not increase accuracy.
It would be an understatement to say that there was a lot of disagreement from some statisticians and those with classical statistics training.
I will not touch on the subject of precision or the precision of means. There is a good discussion of the subject on the Wikipedia page: Accuracy and precision.
The subject of concern here is plain vanilla accuracy: “accuracy of a measurement is the degree of closeness of measurement of a quantity to that quantity’s true value.” [ True value means the actual real-world value — not some cognitive construct of it. ]
The general statistician’s viewpoint is summarized in this comment:
“The suggestion that the accuracy of the mean sea level at a location is not improved by taking many readings over an extended period is risible, and betrays a fundamental lack of understanding of physical science.”
I will admit that at one time, fresh from university, I agreed with the StatsFolk. That is, until I asked a famous statistician this question and was promptly and thoroughly drummed into submission with a series of homework assignments designed to let me prove to myself that the idea is incorrect in many cases.
Let’s start with a simple example about temperatures. Temperatures, in the USA, are reported and recorded in whole degrees Fahrenheit. (Don’t ask why we don’t use the scientific standard. I don’t know). These whole Fahrenheit degree records are then machine converted into Celsius (centigrade) degrees to one decimal place, such as 15.6 °C.
This means that each and every temperature between, for example, 71.5 and 72.5 °F is recorded as 72 °F. (In practice, one of the two precisely .5 readings is excluded and the other rounded up or down.) Thus an official report for the temperature at the Battery, NY at 12 noon of “72 °F” means, in the real world, that the temperature, by measurement, was found to lie between 71.5 °F and 72.5 °F — in other words, the recorded figure represents a range 1 degree F wide.
In scientific literature, we might see this in the notation: 72 +/- 0.5 °F. This then is often misunderstood to be some sort of “confidence interval”, “error bar”, or standard deviation.
It is none of those things in this specific example of temperature measurements. It is simply a form of shorthand for the actual measurement procedure which is to represent each 1 degree range of temperature as a single integer — when the real world meaning is “some temperature in the range of 0.5 degrees above or below the integer reported”.
Any difference of the actual temperature, above or below the reported integer is not an error. These deviations are not “random errors” and are not “normally distributed”.
Repeating for emphasis: The integer reported for the temperature at some place/time is shorthand for a degree-wide range of actual temperatures, which though measured to be different, are reported with the same integer. Visually:
Even though the practice is to record only whole integer temperatures, in the real world, temperatures do not change in one-degree steps — 72, 73, 74, 72, 71, etc. Temperature is a continuous variable. Not only is temperature a continuous variable, it is a constantly changing variable. When temperature is measured at 11:00 and at 11:01, one is measuring two different quantities; the measurements are independent of one another. Further, any and all values in the range shown above are equally likely — Nature does not “prefer” temperatures closer to the whole degree integer value.
[ Note: In the U.S., whole degree Fahrenheit values are converted to Celsius values rounded to one decimal place — 72 °F is converted and also recorded as 22.2 °C. Nature does not prefer temperatures closer to tenths of a degree Celsius either. ]
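The recording practice described above can be sketched in a few lines of Python. This is an illustrative sketch of the rounding and conversion steps, not any agency’s actual code; the function names are my own:

```python
def record_fahrenheit(actual_temp_f: float) -> int:
    """Record a temperature as the nearest whole degree F,
    discarding the fractional part forever."""
    return round(actual_temp_f)

def to_celsius_record(whole_f: int) -> float:
    """Machine-convert a recorded whole-degree F value to Celsius,
    rounded to one decimal place."""
    return round((whole_f - 32) * 5 / 9, 1)

# Every actual temperature in the 1-degree-wide range collapses
# to the same record, and the same one-decimal Celsius value:
for actual in (71.5, 71.9, 72.0, 72.4999):
    rec = record_fahrenheit(actual)
    print(actual, "->", rec, "F ->", to_celsius_record(rec), "C")
```

Running this shows four different actual temperatures all recorded as 72 °F and 22.2 °C — the fractional information is simply gone from the record.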
While the current practice is to report an integer representing the range from half a degree below to half a degree above that integer, some other notation could have served just as well. We might instead have reported the integer to represent all temperatures from that integer up to the next, as in 71 meaning “any temperature from 71 to 72”. The current system, using the midpoint integer, is better because the reported integer is centered in the range it represents — but it is easily misunderstood when notated 72 +/- 0.5.
Because temperature is a continuous variable, deviations from the whole integer are not even “deviations” — they are just the portion of the temperature, measured in degrees Fahrenheit, normally represented by the decimal fraction that would follow the whole degree notation — the “.4999” part of 72.4999 °F. These decimal portions are not errors; they are the unreported, unrecorded part of the measurement and, because temperature is a continuous variable, must be considered evenly spread across the entire scale — in other words, they are not, not, not “normally distributed random errors”. The only reason they are uncertain is that, even when measured, they have not been recorded.
So what happens when we now find the mean of these records, which, remember, are short-hand notations of temperature ranges?
Let’s do a basic, grade-school level experiment to find out…
We will find the mean of a whole three temperatures; we will use these recorded temperatures from my living room:
11:00 — 71 °F
12:00 — 72 °F
13:00 — 73 °F
As discussed above, each of these recorded temperatures really represents any of the infinitely variable intervening temperatures; nonetheless, I will make this little boxy chart:
Here we see each hour’s temperature represented as the highest value in the range, the midpoint value of the range (the reported integer), and the lowest value of the range. [ Note: Between each box in a column, we must remember, there are an infinite number of fractional values; we just are not showing them at this time. ] These are then averaged — the mean calculated — left to right: the three hours’ highest values give a mean of 72.5, the midpoint values a mean of 72, and the lowest values a mean of 71.5.
The resultant mean could be written in this form: 72 +/- 0.5 which would be a short-hand notation representing the range from 71.5 to 72.5.
The accuracy of the mean, represented in notation as +/- 0.5, is identical to the original measurement accuracy — they both represent a range of possible values.
Note: This uncertainty does not stem from the instrumental accuracy of the original measurement, which is a separate issue and must be considered additive to the uncertainty discussed here. The uncertainty here arises solely from the fact that measured temperatures are recorded as one-degree ranges, with the fractional information discarded and lost forever, leaving us with a lack of knowledge — uncertainty — about what the actual measurement was.
Of course, the 11:00 actual temperature might have been 71.5, the 12:00 actual temperature 72, and the 13:00 temperature 72.5. Or it may have been 70.5, 72, 73.5.
Finding the means kitty-corner — diagonally, corner to corner — gives us 72 in each case, and across the midpoints still gives 72.
Any combination of high, mid-, and low, one from each hour, gives a mean that falls between 72.5 and 71.5 — within the range of uncertainty for the mean.
Even for these simplified grids, there are many possible combinations of one value from each column. The means of any of these combinations falls between the values of 72.5 and 71.5.
There are literally an infinite number of potential values between 71.5 and 72.5 (someone correct me if I am wrong, infinity is a tricky subject) as temperature is a continuous variable. All possible values for each hourly temperature are equally likely to occur — thus all possible values, and all possible combinations of one value for each hour, must be considered. Taking any one possible value from each hourly reading column and finding the mean of the three gives the same result — every mean has a value between 71.5 and 72.5, which represents a range of the same magnitude as the original measurements’: a range one degree Fahrenheit wide.
The accuracy of the mean is exactly the same as the accuracy for the original measurement — they are both a 1-degree wide range. It has not been reduced one bit through the averaging process. It cannot be.
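Readers who want to check the claim can run the enumeration themselves. The sketch below takes the low, midpoint, and high value of each hour’s range from the example above and computes the mean of every combination (the variable names are mine, chosen for illustration):

```python
from itertools import product

# The three recorded hourly temperatures, each shorthand for a
# 1-degree-wide range: 71, 72, 73 degrees F.
ranges = [(70.5, 71.5), (71.5, 72.5), (72.5, 73.5)]

# For each hour, take the low, midpoint, and high value of its range...
choices = [(lo, (lo + hi) / 2, hi) for lo, hi in ranges]

# ...and compute the mean of every combination, one value per hour.
means = [sum(combo) / 3 for combo in product(*choices)]

print(min(means), max(means))  # the means span 71.5 .. 72.5
```

Every one of the 27 means lands between 71.5 and 72.5 — a range one degree wide, the same width as the original measurement ranges. Sampling fractional values inside each range, rather than just the endpoints and midpoints, gives the same bounds.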
Note: Those who prefer a more technical treatment of this topic should read Clyde Spencer’s “The Meaning and Utility of Averages as it Applies to Climate” and my series “The Laws of Averages”.
And Tide Gauge Data?
It is clear that the uncertainty in the temperature record arises from the procedure of reporting only whole degrees F (or degrees C to one decimal place), giving us not single-valued measurements but ranges in their places.
But what about tide gauge data? Isn’t it a single reported value to millimetric precision, thus different from the above example?
The short answer is NO, but I don’t suppose anyone will let me get away with that.
What are the data collected by Tide Gauges in the United States (and similarly in most other developed nations)?
The Estimated Accuracy is shown as +/- 0.02 m (2 cm) for individual measurements and claimed to be +/- 0.005 m (5 mm) for monthly means. When we look at a data record for the Battery, NY tide gauge we see something like this:
Date Time | Water Level | Sigma
Notice that, as the spec sheet says, we have a record every six minutes (1/10 hour), that water level is reported in meters to the millimeter (4.639 m), and that a “sigma” is given. The six-minute figure is calculated as follows:
“181 one-second water level samples centered on each tenth of an hour are averaged, a three standard deviation outlier rejection test applied, the mean and standard deviation are recalculated and reported along with the number of outliers. (3 minute water level average)”
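The quoted procedure can be sketched in a few lines of Python. This is an illustrative sketch of the described steps, not NOAA’s actual code, and the sample values in the usage example are invented:

```python
import statistics

def six_minute_value(samples):
    """Average one-second water level samples into a six-minute value,
    per the quoted CO-OPS procedure: apply a three-standard-deviation
    outlier rejection test, then recalculate and report the mean,
    the standard deviation, and the number of outliers."""
    mu = statistics.mean(samples)
    sigma = statistics.pstdev(samples)
    kept = [s for s in samples if abs(s - mu) <= 3 * sigma]
    n_outliers = len(samples) - len(kept)
    return statistics.mean(kept), statistics.pstdev(kept), n_outliers

# Invented example: 180 plausible readings (metres) plus one wild outlier.
samples = [4.63 + 0.001 * i for i in range(180)] + [9.0]
level, sd, n_out = six_minute_value(samples)
print(f"six-minute value: {level:.4f} m, sigma: {sd:.4f}, outliers: {n_out}")
```

The outlier at 9.0 m is rejected and the reported value is the mean of the remaining 180 readings — still denoted, per the spec, with the same +/- 2 cm accuracy as each individual reading.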
Just to be sure we would understand this procedure, I emailed CO-OPS support [ @ email@example.com ]:
To clarify what they mean by accuracy, I asked:
When we say spec’d to the accuracy of +/- 2 cm we specifically mean that each measurement is believed to match the actual instantaneous water level outside the stilling well to be within that +/- 2 cm range.
And received the answer:
That is correct, the accuracy of each 6-minute data value is +/- 0.02m (2cm) of the water level value at that time.
[ Note: In a separate email, it was clarified that “Sigma is the standard deviation, essential the statistical variance, between these (181 1-second) samples.” ]
The question and answer verify that both the individual 1-second measurements and the 6-minute data values represent a range of water level 4 cm wide: 2 cm plus or minus the value recorded.
This seemingly vague accuracy — each measurement actually a range 4 cm, or about 1½ inches, wide — is the result of the mechanical procedure of the measurement apparatus, despite its resolution of 1 millimeter. How so?
NOAA’s illustration of the modern acoustic water level tide gauge at the Battery, NY shows why this is so. The blow-up circle at the top left shows clearly what happens at the one-second interval of measurement: the instantaneous water level inside the stilling well differs from the instantaneous water level outside the stilling well.
This one-second reading, stored in the “primary data collection platform” and later used as one of the 181 readings averaged into the 6-minute recorded value, will differ from the actual water level outside the stilling well, as illustrated. Sometimes it will be lower than the actual water level, sometimes higher. The apparatus as a whole is designed to limit this difference, in most cases and at the one-second time scale, to a range of 2 cm above or below the level inside the stilling well — though some readings will fall far outside this range and be discarded as “outliers” (the rule is to discard all 3-sigma outliers from the set of 181 readings before calculating the mean that is reported as the six-minute record).
We cannot regard each individual measurement as measuring the water level outside the stilling well — they measure the water level inside the stilling well. These inside-the-well measurements are both very accurate and precise — to 1 millimeter. However, each 1-second record is a mechanical approximation of the water level outside the well — the actual water level of the harbor, a constantly changing continuous variable — specified to the accuracy range of +/- 2 centimeters. The recorded measurements represent ranges of values. These measurements do not have “errors” (random or otherwise) when they differ from the actual harbor water level. The water level in the harbor or river or bay itself was never actually measured.
The data recorded as “water level” is a derived value – it is not a direct measurement at all. The tide gauge, as a measurement instrument, has been designed so that it will report measurements inside the well that will be reliably within 2 cm, plus or minus, of the actual instantaneous water level outside the well – which is the thing we wish to measure. After taking 181 measurements inside the well, throwing out any data that seems too far off, the remainder of the 181 are averaged and reported as the six-minute recorded value, with the correct accuracy notation of +/- 2 cm — the same accuracy notation as for the individual 1-second measurements.
The recorded value denotes a value range – which must always be properly noted with each value — in the case of water levels from NOAA tide gauges, +/- 2 cm.
NOAA quite correctly makes no claim that the six-minute records, which are the means of 181 1-second records, have any greater accuracy than the original individual measurements.
Why then do they claim that monthly means are accurate to +/- 0.005 meters (5 mm)? In those calculations, the original measurement accuracy is simply ignored altogether, and only the reported six-minute mean values are considered (confirmed to the author) — the same error made in almost all other large-data-set calculations: applying the inapplicable Law of Large Numbers.
Accuracy, however, as demonstrated here, is determined by the accuracy of the original measurements: when a non-static, ever-changing, continuously variable quantity is measured and recorded as a range of possible values — the range of accuracy specified for the measurement system — that range cannot be narrowed by calculating means.
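Under this interval reading — each recorded value standing for a range 2 cm either side of the number written down — propagating the ranges through the averaging shows why, on the essay’s argument, the mean carries the same width. The water levels below are invented for illustration:

```python
# Each recorded six-minute value v is shorthand for the interval
# [v - 0.02, v + 0.02] metres (the +/- 2 cm spec).
levels = [4.639, 4.641, 4.650, 4.655, 4.662]  # invented recorded values, m
half_width = 0.02                             # +/- 2 cm accuracy

# Averaging the interval endpoints gives the interval of the mean.
lo = sum(v - half_width for v in levels) / len(levels)
hi = sum(v + half_width for v in levels) / len(levels)
mid = sum(levels) / len(levels)

print(f"mean = {mid:.4f} m, interval = [{lo:.4f}, {hi:.4f}] m")
# The interval around the mean is still 4 cm wide: hi - lo == 2 * half_width.
```

Averaging any number of such records leaves an interval exactly 4 cm wide around the mean; only the midpoint changes. (The classical statistical treatment, which models the deviations as independent random errors, reaches the opposite conclusion — the disagreement is precisely the subject of this essay.)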
Take Home Messages:
- When numerical values are ranges, rather than true discrete values, the width of the range of the original value (measurement in our cases) determines the width of the range of any subsequent mean or average of these numerical values.
- Temperatures from ASOS stations are recorded and reported as ranges 1 °F wide (0.55 °C), correctly denoted as “integer +/- 0.5 °F”. The means of these recorded temperatures cannot be more accurate than the original measurements — because the original measurement records themselves are ranges, the means must carry the same +/- 0.5 °F.
- The same is true of tide gauge data as currently collected and recorded. The primary records of 6-minute values, though recorded to millimetric precision, are also ranges, with an original accuracy of +/- 2 centimeters. This is the result of the measurement instrument’s design and specification — that of a sort of mechanical averaging system. The means of tide gauge recorded values cannot be made more accurate than +/- 2 cm — which is still far more accurate than needed for measuring tides and determining safe water levels for ships and boats.
- When original measurements are ranges, their means are also ranges of the same magnitude. This fact must not be ignored or discounted; doing so creates a false sense of the accuracy of our numerical knowledge. Often the mathematical precision of a calculated mean overshadows its real world, far fuzzier accuracy, leading to incorrect significance being given to changes of very small magnitude in those over-confident means.
# # # # #
Author’s Comment Policy:
Thanks for reading — I know that this will be a difficult concept for some. For those, I advise working through the example yourself. Use as many measurements as you have patience for. Work out all the possible means of all the possible values of the measurements, within the ranges of those original measurements, then report the range of the means found.
I’d be glad to answer your questions on the subject, as long as they are civil and constructive.
# # # # #