Guest Essay by Kip Hansen
Introduction:
Temperature and Water Level (MSL) are two hot-topic measurements being widely bandied about, and vast sums of money are being invested in research to determine whether, on a global scale, these physical quantities — Global Average Temperature and Global Mean Sea Level — are changing, and if changing, at what magnitude and at what rate. The Global Averages of these ever-changing, continuous variables are said to be calculated to extremely precise levels — hundredths of a degree for temperature and millimeters for Global Mean Sea Level — and minute changes on those scales are claimed to be significant and important.
In my recent essays on Tide Gauges, the question of the durability of original measurement uncertainty raised its toothy head in the comments section.
Here is the question I will try to resolve in this essay:
If original measurements are made to an accuracy of +/- X (some value in some units), does the uncertainty of the original measurement devolve on any and all averages – to the mean – of these measurements?
Does taking more measurements to that same degree of accuracy allow one to create more accurate averages or “means”?
My stated position in the essay read as follows:
If each measurement is only accurate to ± 2 cm, then the monthly mean cannot be MORE accurate than that — it must carry the same range of error/uncertainty as the original measurements from which it is made. Averaging does not increase accuracy.
It would be an understatement to say that there was a lot of disagreement from some statisticians and those with classical statistics training.
I will not touch on the subject of precision or the precision of means. There is a good discussion of the subject on the Wiki page: Accuracy and precision.
The subject of concern here is plain vanilla accuracy: “accuracy of a measurement is the degree of closeness of measurement of a quantity to that quantity’s true value.” [True value means the actual, real-world value — not some cognitive construct of it.]
The general statistician’s viewpoint is summarized in this comment:
“The suggestion that the accuracy of the mean sea level at a location is not improved by taking many readings over an extended period is risible, and betrays a fundamental lack of understanding of physical science.”
I will admit that at one time, fresh from university, I agreed with the StatsFolk. That is, until I asked a famous statistician this question and was promptly and thoroughly drummed into submission with a series of homework assignments designed to prove to myself that the idea is incorrect in many cases.
First Example:
Let’s start with a simple example about temperatures. Temperatures, in the USA, are reported and recorded in whole degrees Fahrenheit. (Don’t ask why we don’t use the scientific standard. I don’t know). These whole Fahrenheit degree records are then machine converted into Celsius (centigrade) degrees to one decimal place, such as 15.6 °C.
This means that each and every temperature between, for example, 71.5 and 72.5 °F is recorded as 72 °F. (In practice, one or the other of the precisely .5 readings is excluded and the other rounded up or down.) Thus an official report of “72 °F” for the temperature at the Battery, NY at 12 noon means, in the real world, that the temperature was found, by measurement, to lie somewhere between 71.5 °F and 72.5 °F — in other words, the recorded figure represents a range 1 degree F wide.
In scientific literature, we might see this in the notation: 72 +/- 0.5 °F. This then is often misunderstood to be some sort of “confidence interval”, “error bar”, or standard deviation.
It is none of those things in this specific example of temperature measurements. It is simply a form of shorthand for the actual measurement procedure which is to represent each 1 degree range of temperature as a single integer — when the real world meaning is “some temperature in the range of 0.5 degrees above or below the integer reported”.
Any difference between the actual temperature and the reported integer, above or below, is not an error. These deviations are not “random errors” and are not “normally distributed”.
Repeating for emphasis: The integer reported for the temperature at some place/time is shorthand for a degree-wide range of actual temperatures, which though measured to be different, are reported with the same integer. Visually:

Even though the practice is to record only whole integer temperatures, in the real world, temperatures do not change in one-degree steps — 72, 73, 74, 72, 71, etc. Temperature is a continuous variable. Not only is temperature a continuous variable, it is a constantly changing variable. When temperature is measured at 11:00 and at 11:01, one is measuring two different quantities; the measurements are independent of one another. Further, any and all values in the range shown above are equally likely — Nature does not “prefer” temperatures closer to the whole degree integer value.
[ Note: In the U.S., whole degree Fahrenheit values are converted to Celsius values rounded to one decimal place — 72 °F is converted and also recorded as 22.2 °C. Nature does not prefer temperatures closer to tenths of a degree Celsius either. ]
While the current practice is to report an integer to represent the range from half a degree below that integer to half a degree above it, some other notation could have served just as well. We might, for instance, have reported the integer to represent all temperatures from that integer up to the next, so that 71 would mean “any temperature from 71 to 72”. The current system of using the midpoint integer is better because the reported integer is centered in the range it represents — but the practice is easily misunderstood when notated 72 +/- 0.5.
Because temperature is a continuous variable, deviations from the whole integer are not even “deviations” — they are just the portion of the measured temperature normally represented by the decimal fraction that would follow the whole-degree notation — the “.4999” part of 72.4999 °F. These decimal portions are not errors; they are the unreported, unrecorded part of the measurement, and because temperature is a continuous variable they must be considered evenly spread across the entire scale — in other words, they are not, not, not “normally distributed random errors”. The only reason they are uncertain is that, even when measured, they have not been recorded.
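To make the recording convention concrete, here is a minimal sketch in Python (my own illustration, not part of any official procedure; the rounding rule for the exactly .5 readings is assumed):

```python
import math

def record_fahrenheit(true_temp_f):
    """Record a temperature as a whole degree F (round-half-up is assumed here)."""
    return math.floor(true_temp_f + 0.5)

def to_celsius_1dp(recorded_f):
    """Machine-convert the whole-degree F record to Celsius, one decimal place."""
    return round((recorded_f - 32) * 5.0 / 9.0, 1)

# Every actual temperature from 71.5 up to (but not including) 72.5 is recorded identically:
for true_temp in (71.5, 71.87, 72.0, 72.3, 72.4999):
    rec = record_fahrenheit(true_temp)
    print(true_temp, "->", rec, "F ->", to_celsius_1dp(rec), "C")
# All five lines print "72 F -> 22.2 C": the record stands for a one-degree-wide range.
```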
So what happens when we now find the mean of these records, which, remember, are short-hand notations of temperature ranges?
Let’s do a basic, grade-school level experiment to find out…
We will find the mean of just three temperatures; we will use these recorded temperatures from my living room:
| Time | Recorded Temperature |
| 11:00 | 71 °F |
| 12:00 | 72 °F |
| 13:00 | 73 °F |
As discussed above, each of these recorded temperatures really represents any of the infinitely many possible intervening temperatures; nevertheless, I will make this little boxy chart:

Here we see each hour’s temperature represented as the highest value in the range, the midpoint value of the range (the reported integer), and the lowest value of the range. [ Note: Between each box in a column, we must remember, there are an infinite number of fractional values; we just are not showing them at this time. ] These are then averaged — the mean calculated — left to right: the three hours’ highest values give a mean of 72.5, the midpoint values give a mean of 72, and the lowest values give a mean of 71.5.
The resultant mean could be written in this form: 72 +/- 0.5 which would be a short-hand notation representing the range from 71.5 to 72.5.
The accuracy of the mean, represented in notation as +/- 0.5, is identical to the original measurement accuracy — they both represent a range of possible values.
Note: This uncertainty does not stem from the instrumental accuracy of the original measurement — that is a separate issue, and must be considered additive to the uncertainty discussed here. The uncertainty discussed here arises solely from the fact that measured temperatures are recorded as one-degree ranges, with the fractional information discarded and lost forever, leaving us uncertain — lacking knowledge — of what the actual measurement was.
Of course, the 11:00 actual temperature might have been 71.5, the 12:00 actual temperature 72, and the 13:00 temperature 72.5. Or it may have been 70.5, 72, 73.5.
Finding the means kitty-corner gives us 72 for each corner-to-corner diagonal, and across the midpoints still gives 72.
Any combination of high, mid-, and low, one from each hour, gives a mean that falls between 72.5 and 71.5 — within the range of uncertainty for the mean.

Even for these simplified grids, there are many possible combinations of one value from each column. The mean of any such combination falls between 71.5 and 72.5.
There are literally an infinite number of potential values between 72.5 and 71.5 (someone correct me if I am wrong, infinity is a tricky subject) as temperature is a continuous variable. All possible values for each hourly temperature are just as likely to occur — thus all possible values, and all possible combinations of one value for each hour, must be considered. Taking any one possible value from each hourly reading column and finding the mean of the three gives the same result — all means have a value between 72.5 and 71.5, which represents a range of the same magnitude as the original measurement’s, a range one degree Fahrenheit wide.
The accuracy of the mean is exactly the same as the accuracy for the original measurement — they are both a 1-degree wide range. It has not been reduced one bit through the averaging process. It cannot be.
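For readers who would rather let a computer do the grade-school exercise, here is a minimal sketch (my own, using the three recorded values above); it simply confirms that every possible combination of actual temperatures yields a mean inside the same one-degree-wide range. Whether the clustering of those means near the middle of the range justifies a tighter uncertainty is exactly what the statisticians in the comments below dispute.

```python
import random

# Each recorded integer stands for a one-degree-wide range: integer +/- 0.5.
recorded = [71, 72, 73]                              # 11:00, 12:00, 13:00 records
ranges = [(r - 0.5, r + 0.5) for r in recorded]

# Means of the lows, the midpoints, and the highs:
print(sum(lo for lo, _ in ranges) / 3)               # 71.5
print(sum(recorded) / 3)                             # 72.0
print(sum(hi for _, hi in ranges) / 3)               # 72.5

# Draw many random combinations of possible actual temperatures, one from each
# hour's range, and record the mean of each combination.
means = []
for _ in range(100_000):
    combo = [random.uniform(lo, hi) for lo, hi in ranges]
    means.append(sum(combo) / 3)

print(min(means), max(means))      # every mean lies between 71.5 and 72.5
```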
Note: Those who prefer a more technical treatment of this topic should read Clyde Spencer’s “The Meaning and Utility of Averages as it Applies to Climate” and my series “The Laws of Averages”.
And Tide Gauge Data?
It is clear that the uncertainty in the temperature record arises from the procedure of reporting only whole degrees F (or degrees C to one decimal place), thus giving us not single-valued measurements but ranges in their place.
But what about tide gauge data? Isn’t it a single reported value to millimetric precision, thus different from the above example?
The short answer is NO, but I don’t suppose anyone will let me get away with that.
What are the data collected by Tide Gauges in the United States (and similarly in most other developed nations)?

The Estimated Accuracy is shown as +/- 0.02 m (2 cm) for individual measurements and claimed to be +/- 0.005 m (5 mm) for monthly means. When we look at a data record for the Battery, NY tide gauge we see something like this:
| Date Time | Water Level (m) | Sigma (m) |
| 9/8/2017 0:00 | 4.639 | 0.092 |
| 9/8/2017 0:06 | 4.744 | 0.085 |
| 9/8/2017 0:12 | 4.833 | 0.082 |
| 9/8/2017 0:18 | 4.905 | 0.082 |
| 9/8/2017 0:24 | 4.977 | 0.18 |
| 9/8/2017 0:30 | 5.039 | 0.121 |
Notice that, as the spec sheet says, we have a record every six minutes (1/10th hr), water level is reported in meters to the millimeter level (4.639 m) and the “sigma” is given. The six-minute figure is calculated as follows:
“181 one-second water level samples centered on each tenth of an hour are averaged, a three standard deviation outlier rejection test applied, the mean and standard deviation are recalculated and reported along with the number of outliers. (3 minute water level average)”
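A minimal sketch of that stated procedure, as I read it (the one-second water levels below are invented for illustration):

```python
import random
import statistics

def six_minute_value(samples):
    """Average 181 one-second samples, reject 3-standard-deviation outliers,
    then recompute and return the mean, the standard deviation, and the
    number of rejected outliers (the recipe quoted above)."""
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)
    kept = [s for s in samples if abs(s - mean) <= 3 * sd]
    return statistics.mean(kept), statistics.stdev(kept), len(samples) - len(kept)

random.seed(1)
# Invented one-second water levels (metres) inside the stilling well:
samples = [4.64 + random.gauss(0, 0.01) for _ in range(181)]
samples[50] += 0.15                       # one spurious spike, to be rejected

level, sigma, n_outliers = six_minute_value(samples)
print(round(level, 3), round(sigma, 3), n_outliers)
```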
Just to be sure we would understand this procedure, I emailed CO-OPS support [ @ co-ops.userservices@noaa.gov ]:
To clarify what they mean by accuracy, I asked:
When we say spec’d to the accuracy of +/- 2 cm we specifically mean that each measurement is believed to match the actual instantaneous water level outside the stilling well to be within that +/- 2 cm range.
And received the answer:
That is correct, the accuracy of each 6-minute data value is +/- 0.02m (2cm) of the water level value at that time.
[ Note: In a separate email, it was clarified that “Sigma is the standard deviation, essential the statistical variance, between these (181 1-second) samples.” ]
The question and answer verify that both the individual 1-second measurements and the 6-minute data values represent a range of water level 4 cm wide: 2 cm plus or minus the value recorded.
This seemingly vague accuracy — each measurement actually a range 4 cm or 1 ½ inches wide — is the result of the mechanical procedure of the measurement apparatus, despite its resolution of 1 millimeter. How so?

NOAA’s illustration of the modern Acoustic water level tide gauge at the Battery, NY shows why this is so. The blow-up circle to the top-left shows clearly what happens at the one second interval of measurement: The instantaneous water level inside the stilling well is different than the instantaneous water level outside the stilling well.
This one-second reading, which is stored in the “primary data collection platform” and later used as one of the 181 readings averaged to produce the 6-minute recorded value, will differ from the actual water level outside the stilling well, as illustrated. Sometimes it will be lower than the actual water level, sometimes higher. The apparatus as a whole is designed to limit this difference, in most cases, at the one-second time scale, to a range of 2 cm above or below the level inside the stilling well — although some readings will fall far outside this range and will be discarded as “outliers” (the rule is to discard all 3-sigma outliers from the set of 181 readings before calculating the mean that is reported as the six-minute record).
We cannot regard each individual measurement as measuring the water level outside the stilling well — they measure the water level inside the stilling well. These inside-the-well measurements are both very accurate and precise — to 1 millimeter. However, each 1-second record is a mechanical approximation of the water level outside the well — the actual water level of the harbor, which is a constantly changing continuous variable — specified to the accuracy range of +/- 2 centimeters. The recorded measurements represent ranges of values. These measurements do not have “errors” (random or otherwise) when they are different than the actual harbor water level. The water level in the harbor or river or bay itself was never actually measured.
The data recorded as “water level” is a derived value – it is not a direct measurement at all. The tide gauge, as a measurement instrument, has been designed so that it will report measurements inside the well that will be reliably within 2 cm, plus or minus, of the actual instantaneous water level outside the well – which is the thing we wish to measure. After taking 181 measurements inside the well, throwing out any data that seems too far off, the remainder of the 181 are averaged and reported as the six-minute recorded value, with the correct accuracy notation of +/- 2 cm — the same accuracy notation as for the individual 1-second measurements.
The recorded value denotes a value range – which must always be properly noted with each value — in the case of water levels from NOAA tide gauges, +/- 2 cm.
NOAA quite correctly makes no claim that the six-minute records, which are the means of 181 1-second records, have any greater accuracy than the original individual measurements.
Why then do they claim that monthly means are accurate to +/- 0.005 meters (5 mm)? In those calculations the original measurement accuracy is simply ignored altogether, and only the reported/recorded six-minute mean values are considered (confirmed by the author) — the same error made in almost all other large-data-set calculations: applying the inapplicable Law of Large Numbers.
Accuracy, however, as demonstrated here, is determined by the accuracy of the original measurements. When a non-static, ever-changing, continuously variable quantity is measured and recorded as a range of possible values — the range of accuracy specified for the measurement system — that range cannot be narrowed when (or by) calculating means.
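To see the two competing treatments side by side, here is a minimal sketch with numbers of my own choosing (it is not NOAA’s published calculation): worst-case interval propagation leaves the monthly mean with the instrument’s +/- 2 cm, while treating each six-minute deviation as an independent, zero-mean random error shrinks the nominal uncertainty of the mean as 1 over the square root of n. Which treatment is appropriate for a continuously varying water level is precisely the point in dispute here.

```python
import math

n_per_month = 30 * 24 * 10        # six-minute values in a 30-day month = 7200
accuracy = 0.02                   # the stated +/- 2 cm, in metres

# Treatment 1: interval (worst-case) propagation. The mean of values each known
# only to within +/- 0.02 m is itself known only to within +/- 0.02 m.
interval_uncertainty = accuracy

# Treatment 2: standard error of the mean. IF the +/- 0.02 m were the spread of
# independent, zero-mean random errors, the uncertainty of the mean would be:
standard_error = accuracy / math.sqrt(n_per_month)

print(f"interval propagation: +/- {interval_uncertainty:.3f} m")   # +/- 0.020 m
print(f"standard error:       +/- {standard_error:.5f} m")         # about +/- 0.00024 m
```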
Take Home Messages:
- When numerical values are ranges, rather than true discrete values, the width of the range of the original value (measurement in our cases) determines the width of the range of any subsequent mean or average of these numerical values.
- Temperatures from ASOS stations are recorded and reported as ranges 1 °F wide (0.55 °C); such temperatures are correctly recorded as “integer +/- 0.5 °F”. The means of these recorded temperatures cannot be more accurate than the original measurements — because the original measurement records themselves are ranges, the means must be denoted with the same +/- 0.5 °F.
- The same is true of Tide Gauge data as currently collected and recorded. The primary records of 6-minute values, though recorded to millimetric precision, are also ranges with an original accuracy of +/- 2 centimeters. This is the result of the measurement instrument’s design and specification, which is that of a sort of mechanical averaging system. The means of tide gauge recorded values cannot be made more accurate than +/- 2 cm — which is still far more accurate than needed for measuring tides and determining safe water levels for ships and boats.
- When original measurements are ranges, their means are also ranges of the same magnitude. This fact must not be ignored or discounted; doing so creates a false sense of the accuracy of our numerical knowledge. Often the mathematical precision of a calculated mean overshadows its real world, far fuzzier accuracy, leading to incorrect significance being given to changes of very small magnitude in those over-confident means.
# # # # #
Author’s Comment Policy:
Thanks for reading — I know that this will be a difficult concept for some. For those, I advise working through the example for themselves. Use as many measurements as you have patience for. Work out all the possible means of all the possible values of the measurements, within the ranges of those original measurements, then report the range of the means found.
I’d be glad to answer your questions on the subject, as long as they are civil and constructive.
# # # # #
I’m no expert in measurement, just thinking:
First the instrument has to be calibrated. This process probably has a normally distributed outcome, but to the end user of an instrument this is meaningless. So the instrument has a fixed accuracy x +/- e_cal.
The instrument will make measurements with errors normally distributed with respect to the calibration error.
Repeated measurements of the same quantity with this instrument will make it possible to reduce this error (e_norm).
The absolute error will be x +/- e_cal +/- e_norm and cannot be reduced below +/- e_cal even if e_norm were averaged to 0.0.
If you use all calibrated instruments simultaneously to measure the same quantity, the sum of the calibration error and e_norm should be reduced by averaging because the calibration process was assumed to have normal errors. The same should apply if you were measuring different quantities and calculated their average.
The example with Fahrenheit is different since this deals with the quantisation error which indeed cannot be reduced for one and the same instrument by averaging.
But if you assume the calibration process to have normally distributed errors, using an ensemble of those instruments should make it possible to overcome this threshold.
While I consider myself a “sceptic”, so far these musings leave me in principle on the side of those who claim the error can be reduced by averaging (considering the scenario of global temperature measurement (whatever that may mean in the end)).
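A minimal simulation of the ensemble idea sketched in the comment above (all numbers invented): repeated readings with one instrument never average away that instrument's calibration offset, but an ensemble of instruments whose calibration offsets are independent and zero-mean can.

```python
import random
random.seed(0)

TRUE_VALUE = 20.0     # the quantity being measured (invented)
CAL_SD = 0.5          # spread of calibration offsets across instruments (e_cal, invented)
NOISE_SD = 0.2        # per-reading random noise (e_norm, invented)

def reading(cal_offset):
    return TRUE_VALUE + cal_offset + random.gauss(0, NOISE_SD)

# One instrument, many readings: the calibration offset remains as a floor.
offset = random.gauss(0, CAL_SD)
single = sum(reading(offset) for _ in range(10_000)) / 10_000
print("single-instrument error:", round(single - TRUE_VALUE, 3))   # roughly the offset

# Many independently calibrated instruments, one reading each: the offsets
# now behave like random error and largely cancel in the ensemble average.
ensemble = [reading(random.gauss(0, CAL_SD)) for _ in range(10_000)]
print("ensemble-average error: ", round(sum(ensemble) / 10_000 - TRUE_VALUE, 3))
```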
Did the thermometer read 0.0 deg C in a slush of distilled/deionized water at SLP (sea level pressure)?
Did the thermometer read 100.0 deg C in rolling/boiling distilled/deionized water at SLP?
If these two calibration points are a perfect fit (or not, record the error and note it on any subsequent measurements made), then the thermometer is “calibrated”. This applies to checking the accuracy of Hg or alcohol thermometers but can be used to calibrate “thermistors” and RTD’s.
Question: When I had to make water/wastewater outfall temperature checks, one must use a “calibrated” and/or “traceable” thermometer that is verified/calibrated yearly by ASTM standards. If I fail to do so, my records are tossed out, my company is fined (NOV), I might lose my job and possibly my license to treat water or wastewater or in cases of “pencil whipping” one could lose their freedom. I practiced due diligence for 20 years and left water/wastewater for the classroom with a clean record.
When a climate scientist is feeding data into policy decisions which affect hundreds of millions or billions of people… who is checking their calibration, measurement consistency, etc.? What is the consequence of a failure of due diligence? We see from the response to Mr. Watts et al.’s site-checking project that the policy makers and the policy feeders want little oversight. Too bad those that should be providing the oversight are also the ones creating the records.
You are entirely correct, and thanks for pointing this out.
I just realized about Kip’s exact example above – one Tide Station – he is entirely correct that for that one tide station you cannot exceed the precision of that tide station because it’s ONE station and the calibration accuracy cannot be better than the precision of the instrument calibrated. So I apologize to Kip here, I was wrong about that detail. It takes at least 30 tide stations independently calibrated to exceed the accuracy and precision of the tidal measurement instrument (or the temperature measurement instrument).
I still strongly believe (from professional opinion) that as long as the calibration sources have a precision and accuracy better than that of the measurement instruments, and the calibrations are independent, that the accuracy and precision of the global average temperature and tide levels exceeds that of individual instruments.
Here’s a new problem though: from a signal processing interpretation, the calibration interval induces a noise at the frequency of the inverse of that interval, and the noise has a level corresponding to the instrument precision. And that calibration interval could alias with the horrible boxcar averaging methods used by climate scientists, further creating errors in the data. Yikes.
Peter
krmmtoday writes:
While I consider myself a “sceptic”, so far these musings leave me in principle on the side of those who claim the error can be reduced by averaging (considering the scenario of global temperature measurement (whatever that may mean in the end)).
Although I agree, you should take note of how little has been accomplished in this thread despite the considerable praise heaped upon the essay. The OP expends many words upon a triviality, namely that multiple measurements where true values are confined to within a finite interval around the reported value will average to an estimate necessarily confined to an interval equal to the average of the raw measurement intervals. (the intervals need not be equal, though they are in KH’s example). This could be deduced from a single line of mathematics. The discussion goes off the rails when KH makes the following claim:
*If each measurement is only accurate to ± 2 cm, then the monthly mean cannot be MORE accurate than that — it must carry the same range of error/uncertainty as the original measurements from which it is made.* **Averaging does not increase accuracy.**
The bolded claim is an equivocation between the total range of possible mean values vs the accuracy (i.e. the probable range) of deviation of the mean from that estimated by averaging recorded results. Imagine the counterpart of the above statement for measurements collected with measurement errors known a priori to be additive gaussian random variables. Since the “possible range” of a gaussian rv is +/- infinity, none of the measurements are informative according to the “possible range” criterion, and neither is the average, as it is also gaussian by implication. Sound persuasive?

In order to talk meaningfully about the accuracy of a statistical estimate we have to talk probabilities, and it doesn’t answer the mail to make statements like “this article is about measurements not probabilities”. To say something informative about an estimated mean value we have to know something meaningful about the probability distribution of the error of the estimated average. That’s what probabilities are for … to make quantitative statements about achieved real world values in the presence of various forms of uncertainty including sampling and measurement errors. You cannot pinch-hit for probabilities with interval arithmetic.

To estimate parameters of the probability distribution of the error of a data average we need to consult some representation of the joint probability distribution of the measurement errors. That is what we need to know and it is all we need to know. But how can we make meaningful statements about something as potentially complicated as the multivariate probability distribution of instrument sampling errors? The answer arises from instrument design and testing, followed by calibration and data quality assurance procedures in instrument use.
The instrument designer strives to: 1. Reduce systematic error and drift by design and calibration, and 2. Increase sample-to-sample measurement error independence and decorrelation of random uncontrollable errors, often by adjusting the sample rate to conform to typical variations in signal to levels comparable to sampling error levels. These efforts can succeed in making the usual 1/n reduction in estimated variance a viable approximation, but inevitably run afoul of any residual instrument bias and, if n is pushed too far, e.g. by oversampling, might materially underestimate error variance from residual correlations of measurement error. Fortunately the former error can be estimated from repeated measurement of known constant signals from lab standards, and the latter can be checked against observed variances of signal to ensure that the variance of the estimated mean is dominated by signal variation rather than sampling noise. A typical example of joint compromise would be to add a random dither signal to avoid quantization biases from signals whose multi-sample variation is below the LSB of a digitizer, at the expense of an increase (but a useful increase) in random error variance, etc. In any case, use of the statistical formulae requires knowledge of the signals being interrogated, the instruments employed, adequate support from testing and, in the end, a non-conspiratorial attitude toward the actual errors encountered in practice. This latter attitude, although it routinely is and should be under constant surveillance, is altogether customary in science and engineering generally, but appears to be a very “hard sell” for many on this board.
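A minimal sketch of the dither example mentioned above (hypothetical numbers, not any particular instrument): a steady signal sitting between quantization steps averages to the wrong value no matter how many samples are taken, whereas adding half an LSB of random dither before quantization lets the average recover it.

```python
import random
random.seed(0)

TRUE_SIGNAL = 20.3      # a steady signal whose variation is below the 1-unit LSB
N = 100_000

def quantize(x):
    return round(x)     # a 1-unit least-significant-bit digitizer

# Without dither every sample quantizes to 20; averaging cannot recover the 0.3,
# because the quantization error is systematic for this signal.
plain = sum(quantize(TRUE_SIGNAL) for _ in range(N)) / N
print("no dither:  ", plain)                               # 20.0

# With +/- 0.5 LSB of uniform dither added before quantization, the quantization
# error becomes (nearly) zero-mean and averages away.
dithered = sum(quantize(TRUE_SIGNAL + random.uniform(-0.5, 0.5)) for _ in range(N)) / N
print("with dither:", round(dithered, 2))                  # about 20.3
```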
I have no illusions that anything that I may write will change the evident opinion of many readers of WUWT, i.e. that numerous practitioners of “climate science” (I hate that term) are utterly sunk in confirmation bias. But skeptics need to acknowledge that Judith Curry’s “uncertainty monster” cuts both ways, and the only way out of the morass is the concrete accomplishments of well founded science that can achieve real, not manufactured, consensus amongst competent and well informed participants. I don’t think that the OP of this thread has materially advanced that cause.
Carl:
I see you have some signal processing background and I agree with what you say, but I disagree that the article has done nothing. Despite the article being wrong, it’s inspired some of us to think about all sorts of interesting sources of errors.
Also, in regards to a single station, Kip is right, though he doesn’t state correctly WHY he is right. He’s correct because the accuracy of calibration cannot exceed the precision of the instrument being calibrated.
Of course, often calibration is done with finer-grained measurements than station reporting, and we can also independently calibrate multiple stations and thus get a gaussian curve of calibration errors and then get more accuracy (approaching the accuracy of the calibration source), but that’s not discussed in the article either. Has anyone looked into calibration methods for thermometers and tide stations 100 years ago?
Furthermore, instruments are re-calibrated at some interval (such as yearly), which introduces noise at the level of the precision of the instrument and at a frequency inverse to the calibration interval. This WILL affect low-frequency noise and potentially introduce aliasing, false positives on step-change detection, etc. Changes in calibration methods will induce even lower-frequency noise, which will show up as a trendline.
Don’t get me started on how horrible it is to draw a trend line on time-series data. It violates Nyquist and should not be done.
Peter
The root cause of so much advocating to and fro is simply that averaging can improve accuracy AND it can’t. BUT it depends on the kind of error, the kind of data, and the context of use.
In general use it is best NOT to use an average to attempt to remove error, as it gives false accuracy if used wrongly, and most people don’t know when to use it and when not to. Thus the rules taught to me in chemistry class: carry the error range forward unchanged.
Yet IFF you have an extensive property (and temperature is NOT one) you can use an average of measurements OF THE SAME THING OR EVENT (and sequential temperature readings over time or from different places are not the same air mass) to remove RANDOM error (but not systematic error).
https://chiefio.wordpress.com/2011/07/01/intrinsic-extrinsic-intensive-extensive/
So both sides are technically correct that sometimes averaging can increase accuracy, but also most times it can not.
Averaging temperature data is fundamentally broken due to the three things already listed. Temperature is an intrinsic property. The measurements are of different air masses, so we are not measuring the same air with 10 thermometers to remove the random errors between them. Then the third problem is that much of the error is systematic (the humidity problem in electronic sensors of one type, change from one class of instrument (LIG) to another (fast response electronic), change from whitewash to latex paint, aging of latex paint, etc.), so it is not removed by averaging anyway.
In short, under very limited circumstances and done with great care, some types of measurement can be improved by averaging. BUT, unfortunately, temperature data is not that type, the measurements are not of the same thing, and the dominant errors are systematic so not improved anyway.
The use of averages to “improve” temperature accuracy is hopelessly broken and wrong, but getting warmers to see why is nearly impossible, in part due to the existence of examples where other data can be improved (like gravity (one thing, an extrinsic property) measured many times to remove random errors between the measurements). That the specific does not generalize to temperatures at different times and places with systematic equipment error escapes them.
“Temperature is an intrinsic property.”
“gravity (one thing, an extrinsic property) measured many times to remove random errors”
Actually, gravity is intensive, or intrinsic. But the point of the distinction is that an intensive property, when integrated over space, becomes extensive. Density ρ is intensive, but when integrated over space becomes mass, extensive. The product ρg is intensive, but integrated becomes weight (force), extensive.
The process of averaging temperature is integrating over a surface, so the integral is extensive (not quite, perhaps, because it isn’t over a volume, but if you took it as representing a surface layer it would be extensive).
“The measurements are of different air masses “
The whole point is to measure different air masses, to get an estimate of an extensive property by sampling. Think of trying to estimate a gold ore body. You drill for samples, and measure gm/ton, an intensive property. You get as many as you can, at different and known spatial locations. Then you integrate that over space to see how much gold you have. People invest in that. The more samples, the better coverage you have, but also the more the effect of incidental errors of each local collection of ore are reduced in the final figure.
Systematic error is usually reduced by averaging. Only some have aged latex, so the effect on the average is reduced. But a bias remains and affects the average. People make a lot of effort to identify and remove that bias.
NS,
The point of drilling as many cores as can be afforded reasonably is to be sure that the volume is not seriously undersampled. Assuming that the gold has a normal distribution (probably an invalid assumption) a single sample could represent any point on the probability distribution curve. With more samples, it is more likely that the samples will fall within the +/- 1 SD interval, and the average will give an accurate estimate of the mean. That speaks to the accuracy of the volume estimate, which is of concern to investors. In this analogy, precision is of lesser concern than accuracy.
The gold analogy breaks down with respect to temperatures and sea levels because in the gold case the attempt is to estimate the total FIXED quantity of gold in the ore body. The point of contention in the climatology issues is whether the annual average value changes over time are real, or are an artifact of sampling error. Because the annual average changes are typically to the right of the decimal point, the precision becomes important. Before one can even apply the standard error of the mean (IF it IS valid!) there has to be agreement on what the standard deviation of the sampled population is. The approach of using monthly averages to calculate annual averages strongly filters out extreme values, which affect the standard deviation. I have made the argument that based on the known range of Earth land temperatures, the standard deviation of diurnal variations is highly likely to be several tens of degrees.
Climatologists are worried about the weight of the fleas on a dog when they aren’t sure of the weight of the dog without the fleas.
NS,
You said, “Systematic error is usually reduced by averaging.” I seriously question the validity of that claim. I can imagine situations where it might happen. However, one of the most serious examples of systematic bias in climatology is the orbital decay of satellites, resulting in the temperatures being recorded at increasingly earlier times, which were not random! Systematic error can be corrected if its presence is identified, and can be attributed to some measurable cause. The point of calling it “systematic” is that it is NOT random and generally not amenable to correction by averaging.
” The point of calling it “systematic” is that it is NOT random and generally not amenable to correction by averaging.”
The way averaging reduces error is by cancellation. The only way it can completely fail to do that is if there is no cancellation – i.e., all errors are the same, in the same direction. With a small number of sensors, as with satellites, that can happen. But with systematic error due, say, to aged latex, some will have it and some not. If you average a set of readings where half are affected by aging, the average will reflect about half the effect.
Correct, but one would never plot a trendline on the gold sample data and use that analysis to make an investment on the next plot over, which is what the climate scientists are asking us to do…
NS,
You said, ” People make a lot of effort to identify and remove that bias.” The classic study done by Anthony demonstrates that they either failed frequently in their effort, or didn’t make the effort as you claim.
If temperature data that are systematically high, because of poor siting, are averaged with other data, the bias is diluted. [Does homogenization propagate this bias beyond a single site?] However, the other side of the coin is that the ‘good’ data are corrupted. Neither is the same as “cancelling,” as when the variations are random.
In the situation of the aging latex paint, that means EVERY station is subject to a degradation which is ongoing and continuous, with it being worse for old stations. That is, there is an increasing bias or trend built into every site!
Stokes and others,
I’d like to draw your attention to a quote from “An Introduction to Error Analysis,” (Taylor, 1982), p.95:
“We saw in Section 4.4 that the standard deviation of the mean [aka standard error of the mean] {sigma sub bar x} approaches zero as the number of measurements N is increased. This result suggested that, if you have the patience to make an enormous number of measurements, then you can reduce the uncertainties indefinitely, without having to improve your equipment or technique. We can now see that this is not really so. Increasing N can reduce the RANDOM component, {delta k sub random} = {sigma sub bar k} indefinitely. But any given apparatus has SOME systematic uncertainty, which is NOT reduced as we increase N. It is clear from (4.25) that little is gained from further reduction of {delta k random} once {delta k random} is smaller than {delta k systematic.} In particular, the total {delta k} can never be made less than {delta k systematic}. This simply confirms what we already guessed, that in practice a large reduction of the uncertainty requires improvements in techniques or equipment in order to reduce both the random and the systematic errors in each single measurement.”
Basically, without a careful definition of the measurand, and identification of the types and magnitude of all the uncertainties, and rigorous assessment of the calculated statistics and their uncertainties, one is not justified in categorically stating that the accuracy and precision of the mean annual global temperature or mean sea level is simply the (unstated) standard deviation divided by the square root of the number of measurements.
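A minimal numerical illustration of the point in the Taylor quote (units and magnitudes invented): the random component of the uncertainty falls as 1 over the square root of N, but the combined uncertainty never drops below the systematic component. (Combining the two in quadrature is one common convention, used here only for illustration.)

```python
import math

random_per_reading = 0.5   # random uncertainty of a single reading (invented)
systematic = 0.2           # systematic uncertainty of the apparatus (invented)

for n in (1, 10, 100, 10_000):
    random_of_mean = random_per_reading / math.sqrt(n)
    total = math.sqrt(random_of_mean**2 + systematic**2)   # quadrature combination
    print(f"N = {n:>6}:  random {random_of_mean:.4f},  total {total:.4f}")
# The total approaches 0.2, the systematic floor, no matter how large N becomes.
```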
Clyde ==> “But any given apparatus has SOME systematic uncertainty, which is NOT reduced as we increase N. ”
When measurements are intentionally given as ranges, that IS the systematic uncertainty, by definition. The system is to state the measurement as a range, within which the true value certainly resides — equally certain at ANY point inside the range.
Clyde,
“I’d like to draw your attention to a quote “
This seems to go on endlessly. That is the wrong problem!!! It is not the situation in any kind of climate/sea level context that has been raised. The quote concerns the problem of trying to improve the accuracy of a single measurement by repetition. The climate problem is the estimation of a population mean by averaging many single measurements of different things. OK, they may be all temperatures. And at one site, they may be measured on different days with the same instrument. But measuring today’s max, and then tomorrow’s, is not a repeated measure in the sense of Taylor. No-one expects those measures to get the same result.
Taylor’s text is here. The text starts by specifying that repeated measurement of the same thing is his topic.
The important difference to climate is that there is now not just one “systematic uncertainty”. There are thousands, and they themselves will be in different directions and will be much reduced in the final average. There may be a residual component common to all the samples. That is the bias.
In global temperature, the main defence against bias is the formation of anomalies. That removes consistent bias. Then you only have to worry about circumstances in which the bias changes. That is what homogenisation is all about.
NS,
I think it is you who does not understand. You said, “The quote concerns the problem of trying to improve the accuracy of a single measurement by repetition.” That is wrong. The quote is about estimating the value of a fixed parameter by taking multiple measurements. It boils down to “…the estimation of a population mean by averaging many single measurements…” The diameter of a ball bearing and a fictitious representative global temperature are analogous problems except that, in the case of a variable, the parameter supposedly being measured is changing and becomes part of the systematic component, increasing the inherent uncertainty. Also, the simple SODM is not appropriate for data sets, such as time-series, that are correlated.
I think that you have lost sight of the fact that the point of contention (such as claimed by Mark S Johnson) is whether or not the SODM can be reduced indefinitely in order to provide sufficient precision to say that the annual temperature difference between year 1 and year 2 is statistically significant.
You claim, “There are thousands, and they themselves will be in different directions and will be much reduced in the final average.” That is an unproven assumption, and without quantitative proof.
You further claim, “In global temperature, the main defence against bias is the formation of anomalies.” Anomalies correct for elevation and climate differences. However, without determining what all the systematic errors are, and quantifying them, you are on thin ice to claim that they will all cancel out. They could just as easily be additive. You simply don’t know, and are hoping that they cancel out.
Thank you for the link. It looks like a different edition of Taylor than what I have in my library and I will compare the two.
There’s a pernicious inability here to distinguish between the inherent measurement error that persists in individual data points and the strongly reduced effect manifest in temporal or aggregate averages.
To expand upon the above point in the relevant context of continuous-time geophysical signals, consider the measurement M(t) at any time t to consist of the true signal value s(t) plus the signal-independent measurement error or noise n(t). At no instant of time will the noise disappear from the measurements. According to the Parseval Theorem, the total variance of the measurements will always be the sum of the respective variances of signal and noise.
But if we take the temporal average ⟨M(t)⟩ = ⟨s(t)⟩ + ⟨n(t)⟩, the contribution of the noise term will tend to zero for unbiased, zero-mean noise, thereby greatly improving the available statistical estimate of the signal mean value. A similar reduction in variance takes place whenever the averaging is done over the aggregate of sampled time-series (station records) within a spatially homogeneous area.

What seems to confuse signal analysis novices is the categorical difference between instantaneous measurements and statistical constructs such as the mean of measurements — which is always the result not of direct measurement but of statistical estimation.
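A minimal sketch of the decomposition described above, M(t) = s(t) + n(t), with invented numbers: the noise keeps its full size in every individual measurement but contributes only on the order of 1/sqrt(N) to the temporal mean.

```python
import math
import random
random.seed(0)

N = 50_000
# Invented 'true' signal: a slow sinusoid about a mean level of 10.0.
signal = [10.0 + 2.0 * math.sin(2 * math.pi * t / N) for t in range(N)]
noise = [random.gauss(0, 1.0) for _ in range(N)]    # zero-mean measurement noise
measured = [s + n for s, n in zip(signal, noise)]

print(round(sum(signal) / N, 3))     # mean of the true signal, about 10.0
print(round(sum(measured) / N, 3))   # mean of the noisy measurements, also about 10.0
# Each individual measurement is off by about +/- 1.0, yet the temporal mean
# is off by only about 1.0 / sqrt(N), roughly 0.004 here.
```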
Kip,
I share your respect for Nick and I am impressed by his understanding of the probabilistic quantification of uncertainty (though I do not always share his use of terminology). In contrast, I am no statistician (as my previous fumbled posts ably demonstrate). Nevertheless, I do know that probability theory is most definitely not the only mathematical instrument available for the quantification of uncertainty. See, for example, Info-gap analysis and possibility theory. I could add Dempster-Shafer Theory (although this is strictly-speaking an extension of subjective probabilities) and fuzzy logic (although its advocates claim that it embraces probability theory). I would be interested to hear Nick’s views on this subject area. For example, the following paper is offered as an example of the approaches that are being developed in order to address uncertainties that are not amenable to probability theory:
https://link.springer.com/article/10.1007/s11023-017-9428-3
I don’t think the debate is settled by deciding between probabilistic and non-probabilistic approaches. Both are needed for a comprehensive treatment of uncertainty.
John ==> Thanks for the link — very interesting — it is fascinating to me that anyone would honestly think, in the real world, that the only way to think about, deal with, or quantify uncertainty is with probability theory. It shakes my faith in sanity and common sense.
John,
“See, for example, Info-gap analysis and possibility theory”
Info-gap analysis is more commonly called info-gap decision theory. It does not quantify uncertainty, but the costs (or benefits) of uncertainty. You can then rig up a what-if sequence to calculate some kind of worst case exposure, with no associated likelihood. Possibility theory is more like Kip’s interval notion, but with fractional values. Wiki describes it as an extension of fuzzy logic. But the important thing is that by itself, it is not useful. If all things are possible, we’ve learnt nothing. To make it useful, you need a second number, the necessity. A key quote from Wiki:
“The intersection of the last two cases is nec(U) = 0 and pos(U) = 1, meaning that I believe nothing at all about U. Because it allows for indeterminacy like this, possibility theory relates to the graduation of a many-valued logic, such as intuitionistic logic, rather than the classical two-valued logic.”
We’re not in Kansas any more.
But the main thing is, they aren’t quantifying uncertainty, but something else. If you want to press the relevance here, you would have to show a scientific problem involving averaging on which it gave sensible results.
Nick,
Okay, point taken Nick. To be precise, I should have said that Info-gap Decision Theory is a technique that models uncertainty in a non-probabilistic manner in order to determine a robust strategy. My point still stands, however, that probability theory is not the only game in town and there are circumstances (many of which are highly relevant to climate change) when such non-probabilistic techniques are more applicable for calculating how to proceed under uncertainty. It is not always possible (or, at least, it may sometimes be inadvisable) to model uncertainty using probability theory.
As far as possibility theory is concerned, I am not sure what point you are trying to make with your wiki quote. At the end of the day, possibility theory is a non-probabilistic technique. It does not employ a probability density function (pdf) but a so-called possibility density function (πdf). Probability plays no role in the way in which uncertainty is modelled. Pointing out possibility theory’s kinship with fuzzy logic is simply to compare it to another non-probabilistic technique. Also, I presume you had meant to say that, by itself, ‘possibility’ is useless – not ‘possibility theory’.
I think your claim that possibility theory ‘does not quantify uncertainty but something else’ depends upon whether you see confidence as the key indicator of uncertainty. In possibility theory, confidence in the predicate A, for the proposition ‘x is A’, is given by the difference between the possibility of A and the possibility of the complement of A. Given the relationship between possibility and necessity, this works out as:
Confidence(A) = Possibility(A) + Necessity(A) – 1
I’m happy to read that as a quantification of uncertainty.
I wonder if your insistence that possibility theory would have to find application in ‘a scientific problem involving averaging on which it gave sensible results’ betrays a probabilistic bias in your definition of uncertainty. Nevertheless, I offer the following two links that I trust will satisfy your curiosity:
https://link.springer.com/chapter/10.1007/3-540-34777-1_40?no-access=true
http://home.iitk.ac.in/~partha/possibility
The first links to a research paper that uses possibility theory to analyse uncertainties associated with parameter perturbation in climate modelling. The second link proposes applications within transport analysis.
Best regards
John,
Thanks for the links. They do seem to be trying to get to useful applications. I haven’t been able to get the full text of the first, but it seems to be using possibility language for a bayesian outcome. I’ll read the second more carefully, and keep trying to get the full text of the first.
My Kansas comment referred to the suggestion that possibility only made sense in a multi-valued logic. That’s a big switch for ordinary thinking about uncertainty.
Kip,
You say that it shakes your faith in sanity and common sense but, to be fair, before probability theory came along, no-one was thinking methodically about uncertainty at all. Since then it has enjoyed enormous success, to the extent that many would have their faith in sanity and common sense shaken to hear you express your views 🙂
Nevertheless, despite probability theory’s success, practitioners and philosophers alike are still unsure just what probability is! Thankfully, a more mature view of uncertainty is fast emerging. It’s just a shame that the revolution hasn’t reached the IPCC yet. Here’s another link that I think you will love:
https://link.springer.com/article/10.1007%2Fs10670-013-9518-4
John ==> There is nothing wrong with probability theory that applying it only where it is correctly applicable (even under its own rules) doesn’t solve….it is not a universal panacea.
It is this: “and I insist there is nothing else. ” [but probability theory — to deal with original measurement uncertainty] that gives me the intellectual heebie-jeebies.
Exactly.