Durable Original Measurement Uncertainty

Guest Essay by Kip Hansen

 

Introduction:

Temperature and Water Level (MSL) are two hot-topic measurements widely bandied about, and vast sums of money are being invested in research to determine whether, on a global scale, these physical quantities — Global Average Temperature and Global Mean Sea Level — are changing, and if they are changing, at what magnitude and at what rate. The global averages of these ever-changing, continuous variables are said to be calculated to extremely precise levels — hundredths of a degree for temperature and millimeters for Global Sea Level — and minute changes on those scales are claimed to be significant and important.

In my recent essays on Tide Gauges, the question of the durability of original measurement uncertainty raised its toothy head in the comments section.

Here is the question I will try to resolve in this essay:

If original measurements are made to an accuracy of +/- X (some value in some units), does the uncertainty of the original measurement devolve on any and all averages – to the mean –  of these measurements?

 Does taking more measurements to that same degree of accuracy allow one to create more accurate averages or “means”?

My stated position in the essay read as follows:

If each measurement is only accurate to ± 2 cm,  then the monthly mean cannot be MORE accurate than that — it must carry the same range of error/uncertainty as the original measurements from which it is made.   Averaging does not increase accuracy.

It would be an understatement to say that there was a lot of disagreement from some statisticians and those with classical statistics training.

I will not touch on the subject of precision or the precision of means.  There is a good discussion of the subject on the Wikipedia page Accuracy and precision.

The subject of concern here is plain vanilla accuracy:  “accuracy of a measurement is the degree of closeness of measurement of a quantity to that quantity’s true value.”  [True value here means the actual real-world value — not some cognitive construct of it.]

 The general statistician’s viewpoint is summarized in this comment:

“The suggestion that the accuracy of the mean sea level at a location is not improved by taking many readings over an extended period is risible, and betrays a fundamental lack of understanding of physical science.”

I will admit that at one time, fresh from university, I agreed with the StatsFolk.  That is, until I asked a famous statistician this question and was promptly and thoroughly drummed into submission with a series of homework assignments designed to prove to myself that the idea is incorrect in many cases.

 First Example:

Let’s start with a simple example about temperatures.   Temperatures, in the USA, are reported and recorded in whole degrees Fahrenheit.  (Don’t ask why we don’t use the scientific standard.  I don’t know).  These whole Fahrenheit degree records are then machine converted into Celsius (centigrade) degrees to one decimal place, such as 15.6 °C.

This means that each and every temperature between, for example, 71.5 and 72.5 °F is recorded as 72 °F.  (In practice, one of the two readings of exactly .5 is excluded and the other rounded up or down.)  Thus an official report of “72 °F” for the temperature at the Battery, NY at 12 noon means, in the real world, that the temperature was found by measurement to lie in the range of 71.5 °F to 72.5 °F — in other words, the recorded figure represents a range 1 degree F wide.
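To make the recording convention concrete, here is a minimal sketch (my own illustration, not any agency's code; the function names are mine) of what "record in whole degrees F, then convert to tenths of a degree C" does to an actual temperature reading:

```python
def record_fahrenheit(actual_temp_f: float) -> int:
    """Record the temperature as the nearest whole degree F, as described above."""
    return round(actual_temp_f)

def recorded_celsius(recorded_f: int) -> float:
    """Convert the recorded whole-degree F value to Celsius, one decimal place."""
    return round((recorded_f - 32) * 5.0 / 9.0, 1)

for actual in (71.6, 72.0, 72.4):   # all of these lie inside the 71.5 to 72.5 range
    rec_f = record_fahrenheit(actual)
    print(actual, "->", rec_f, "F ->", recorded_celsius(rec_f), "C")
# Every one of these prints "... -> 72 F -> 22.2 C": the fractional part of the
# actual temperature is discarded and never appears in the record.
```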

In scientific literature, we might see this in the notation:  72 +/- 0.5 °F.  This then is often misunderstood to be some sort of “confidence interval”, “error bar”, or standard deviation.

It is none of those things in this specific example of temperature measurements.  It is simply a form of shorthand for the actual measurement procedure which is to represent each 1 degree range of temperature as a single integer — when the real world meaning is “some temperature in the range of 0.5 degrees above or below the integer reported”.

Any difference between the actual temperature and the reported integer, above or below, is not an error.  These deviations are not “random errors” and are not “normally distributed”.

Repeating for emphasis:  The integer reported for the temperature at some place/time is shorthand for a degree-wide range of actual temperatures, which though measured to be different, are reported with the same integer.  Visually:

[Figure: the one-degree-wide range of actual temperatures, 71.5 °F to 72.5 °F, all reported as 72 °F]

Even though the practice is to record only whole integer temperatures, in the real world, temperatures do not change in one-degree steps — 72, 73, 74, 72, 71, etc.  Temperature is a continuous variable.  Not  only is temperature a continuous variable, it is a constantly changing variable.  When temperature is measured at 11:00 and at 11:01, one is measuring two different quantities; the measurements are independent of one another.  Further, any and all values in the range shown above are equally likely — Nature does not “prefer” temperatures closer to the whole degree integer value.

[Note:  In the U.S., whole-degree Fahrenheit values are converted to Celsius values rounded to one decimal place — 72°F is converted and also recorded as 22.2°C.  Nature does not prefer temperatures closer to tenths of a degree Celsius either.]

While the current practice is to report an integer to represent the range from half a degree above the integer to half a degree below it, the notation could have been something else just as well.  It might have been to report the integer representing all temperatures from that integer to the next, as in 71 meaning “any temperature from 71 to 72”.  The current system of using the midpoint integer is better because the integer reported is centered in the range it represents — this practice, however, is easily misunderstood when notated 72 +/- 0.5.

Because temperature is a continuous variable, deviations from the whole integer are not even “deviations” — they are just the portion of the temperature, measured in degrees Fahrenheit, normally represented by the decimal fraction that would follow the whole-degree notation — the “.4999” part of 72.4999°F.  These decimal portions are not errors; they are the unreported, unrecorded part of the measurement and, because temperature is a continuous variable, must be considered evenly spread across the entire scale — in other words, they are not, not, not “normally distributed random errors”.  The only reason they are uncertain is that, even when measured, they have not been recorded.

So what happens when we now find the mean of these records, which, remember, are short-hand notations of temperature ranges?

Let’s do a basic, grade-school level experiment to find out…

We will find the mean of a whole three temperatures; we will use these recorded temperatures from my living room:

11:00     71 degrees F

12:00     72 degrees F

13:00     73 degrees F

As discussed above, each of these recorded temperatures really represents any of the infinitely variable intervening temperatures; however, I will make this little boxy chart:

[Chart: for each hour, the highest, midpoint, and lowest values of the recorded range — 11:00: 71.5 / 71 / 70.5;  12:00: 72.5 / 72 / 71.5;  13:00: 73.5 / 73 / 72.5 — with row means of 72.5, 72, and 71.5]

Here we see each hour’s temperature represented as the highest value in the range, the midpoint value of the range (the reported integer), and the lowest value of the range.  [Note: Between each box in a column, we must remember that there are an infinite number of fractional values; we are just not showing them at this time.]   These are then averaged — the mean calculated — left to right:  the three hours’ highest values give a mean of 72.5, the midpoint values give a mean of 72, and the lowest values give a mean of 71.5.

The resultant mean could be written in this form:  72 +/- 0.5 which would be a short-hand notation representing the range from 71.5 to 72.5.

The accuracy of the mean, represented in notation as +/- 0.5, is identical to the original measurement accuracy — they both represent a range of possible values.

Note:  This uncertainty does not stem from the instrumental accuracy of the original measurement, which is a separate issue and must be considered additive to the uncertainty discussed here.  The uncertainty discussed here arises solely from the fact that measured temperatures are recorded as one-degree ranges, with the fractional information discarded and lost forever, leaving us with uncertainty — a lack of knowledge — of what the actual measurement itself was.

Of course, the 11:00 actual temperature might have been 71.5, the 12:00 actual temperature 72, and the 13:00 temperature 72.5.  Or it may have been 70.5, 72, 73.5.

Finding the means kitty-corner gives us 72 for each corner-to-corner diagonal, and across the midpoints still gives 72.

Any combination of high, mid-, and low, one from each hour, gives a mean that falls between 72.5 and 71.5 — within the range of uncertainty for the mean.

[Charts: additional grids showing combinations of one value — high, midpoint, or low — taken from each hour’s column, and the means of those combinations]

Even for these simplified grids, there are many possible combinations of one value from each column.  The means of any of these combinations falls between the values of 72.5 and 71.5.

There are literally an infinite number of potential values between 72.5 and 71.5 (someone correct me if I am wrong, infinity is a tricky subject) as temperature is a continuous variable.  All possible values for each hourly temperature are just as likely to occur — thus all possible values, and all possible combinations of one value for each hour, must be considered. Taking any one possible value from each hourly reading column and finding the mean of the three gives the same result — all means have a value between 72.5 and 71.5, which represents a range of the same magnitude as the original measurement’s, a range one degree Fahrenheit wide.
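For readers who would rather check this numerically than by inspecting the boxy charts, here is a small sketch (my own, purely illustrative) that draws many possible "actual" temperatures uniformly from each hour's one-degree range and computes their means:

```python
import random

recorded = [71, 72, 73]     # the three whole-degree records from the example above
half_width = 0.5            # each record stands for "integer +/- 0.5" degrees F

means = []
for _ in range(100_000):
    # Pick one possible actual temperature from each hour's range, uniformly,
    # since no value within a range is preferred over any other.
    actuals = [r + random.uniform(-half_width, half_width) for r in recorded]
    means.append(sum(actuals) / len(actuals))

print(round(min(means), 3), round(max(means), 3))
# Every mean falls between 71.5 and 72.5; no combination of possible actual
# values can push the mean outside a range one degree F wide.
```

Trying this with more measurements changes the numbers but not the conclusion: the mean of range-valued records is itself confined to a range of the same width.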

The accuracy of the mean is exactly the same as the accuracy for the original measurement — they are both a 1-degree wide range.     It has not been reduced one bit through the averaging process.  It cannot be.

Note: Those who prefer a more technical treatment of this topic should read Clyde Spencer’s “The Meaning and Utility of Averages as it Applies to Climate” and my series “The Laws of Averages”.

And Tide Gauge Data?

It is clear that the original measurement uncertainty in the temperature record arises from the procedure of reporting only whole degrees F (or degrees C to one decimal place), thus giving us not measurements with a single value, but ranges in their place.

But what about tide gauge data?  Isn’t it a single reported value to millimetric precision, thus different from the above example?

The short answer is NO, but I don’t suppose anyone will let me get away with that.

What are the data collected by Tide Gauges in the United States (and similarly in most other developed nations)?

[Table: NOAA water-level sensor specifications — measurements recorded every six minutes, 1 mm resolution, estimated accuracy of +/- 0.02 m for individual measurements and +/- 0.005 m for monthly means]

The Estimated Accuracy is shown as +/- 0.02 m (2 cm) for individual measurements and claimed to be +/- 0.005 m (5 mm) for monthly means. When we look at a data record for the Battery, NY tide gauge we see something like this:

Date  Time  Water Level (m)  Sigma (m)
9/8/2017 0:00 4.639 0.092
9/8/2017 0:06 4.744 0.085
9/8/2017 0:12 4.833 0.082
9/8/2017 0:18 4.905 0.082
9/8/2017 0:24 4.977 0.18
9/8/2017 0:30 5.039 0.121

Notice that, as the spec sheet says, we have a record every six minutes (1/10th hr), water level is reported in meters to the millimeter level (4.639 m) and the “sigma” is given.  The six-minute figure is calculated as follows:

“181 one-second water level samples centered on each tenth of an hour are averaged, a three standard deviation outlier rejection test applied, the mean and standard deviation are recalculated and reported along with the number of outliers. (3 minute water level average)”
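As a sketch of my reading of that quoted procedure (the actual CO-OPS processing code is not shown here, and the variable names are mine), the averaging and 3-sigma outlier rejection might look like this:

```python
import random
import statistics

def six_minute_value(samples):
    """Average the one-second samples, reject 3-standard-deviation outliers,
    then recompute and return the mean, the standard deviation ("sigma"),
    and the number of outliers rejected, per the quoted procedure."""
    mean = statistics.mean(samples)
    sigma = statistics.pstdev(samples)
    kept = [s for s in samples if abs(s - mean) <= 3 * sigma]
    return statistics.mean(kept), statistics.pstdev(kept), len(samples) - len(kept)

# Made-up illustration: 181 one-second readings (metres) inside the stilling well.
one_second_readings = [4.70 + random.uniform(-0.05, 0.05) for _ in range(181)]
level, sigma, n_outliers = six_minute_value(one_second_readings)
print(round(level, 3), round(sigma, 3), n_outliers)
# The reported six-minute value is a mean of inside-the-well readings; per the
# spec sheet it still carries the +/- 2 cm accuracy relative to the water level
# outside the well.
```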

Just to be sure we understood this procedure correctly, I emailed CO-OPS support [co-ops.userservices@noaa.gov]:

To clarify what they mean by accuracy, I asked:

When we say spec’d to the accuracy of +/- 2 cm we specifically mean that each measurement is believed to match the actual instantaneous water level outside the stilling well to be within that +/- 2 cm range.

 And received the answer:

That is correct, the accuracy of each 6-minute data value is +/- 0.02m (2cm) of the water level value at that time. 

 [ Note:  In a separate email, it was clarified that “Sigma is the standard deviation, essential the statistical variance, between these (181 1-second) samples.” ]

The question and answer verify that both the individual 1-second measurements and the 6-minute data values represent a range of water level 4 cm wide, 2 cm plus or minus the value recorded.

This seemingly vague accuracy — each measurement actually a range 4 cm or 1 ½ inches wide — is the result of the mechanical procedure of the measurement apparatus, despite its resolution of 1 millimeter.  How so?

[Figure: NOAA’s illustration of the acoustic water-level tide gauge at the Battery, NY, with a blow-up showing the difference between the instantaneous water levels inside and outside the stilling well]

NOAA’s illustration of the modern Acoustic water level tide gauge at the Battery, NY shows why this is so.  The blow-up circle to the top-left shows clearly what happens at the one second interval of measurement:  The instantaneous water level inside the stilling well is different than the instantaneous water level outside the stilling well.

This one-second reading, which is stored in the “primary data collection platform” and later used as part of the 181 readings averaged to get the 6-minute recorded value, will be different from the actual water level outside the stilling well, as illustrated.  Sometimes it will be lower than the actual water level, sometimes it will be higher.  The apparatus as a whole is designed to limit this difference, in most cases, at the one second time scale, to a range of 2 cm above or below the level inside the stilling well  — although some readings will be far outside this range, and will be discarded as “outliers” (the rule is to discard all 3-sigma outliers — of the set of 181 readings — from the set before calculating the mean which is reported as the six-minute record).

We cannot regard each individual measurement as measuring the water level outside the stilling well — they measure the water level inside the stilling well. These inside-the-well measurements are both very accurate and precise — to 1 millimeter. However, each 1-second record is a mechanical approximation of the water level outside the well — the actual water level of the harbor, which is a constantly changing continuous variable  — specified to the accuracy range of +/- 2 centimeters. The recorded measurements represent ranges of values.  These measurements do not have “errors” (random or otherwise) when they are different than the actual harbor water level.  The water level in the harbor or river or bay itself was never actually measured.

The data recorded as “water level” is a derived value – it is not a direct measurement at all.  The tide gauge, as a measurement instrument, has been designed so that it will report measurements inside the well that will be reliably within 2 cm, plus or minus, of the actual instantaneous water level outside the well – which is the thing we wish to measure.  After taking 181 measurements inside the well and throwing out any data that seem too far off, the remainder of the 181 are averaged and reported as the six-minute recorded value, with the correct accuracy notation of +/- 2 cm — the same accuracy notation as for the individual 1-second measurements.

The recorded value denotes a value range – which must always be properly noted with each value — in the case of water levels from NOAA tide gauges, +/- 2 cm.

NOAA quite correctly makes no claim that the six-minute records, which are the means of 181 1-second records, have any greater accuracy than the original individual measurements.

Why then do they claim that monthly means are accurate to +/- 0.005 meters (5 mm)?  In those calculations, the original measurement accuracy is simply ignored altogether, and only the reported/recorded six-minute mean values are considered (confirmed by the author) — the same error made in almost all other large-data-set calculations: applying the inapplicable Law of Large Numbers.

Accuracy, however, as demonstrated here, is determined by the accuracy of the original measurements when a non-static, ever-changing, continuously variable quantity is measured and then recorded as a range of possible values — the range of accuracy specified for the measurement system — and it cannot be improved when (or by) calculating means.

Take Home Messages:

  1. When numerical values are ranges, rather than true discrete values, the width of the range of the original value (measurement in our cases) determines the width of the range of any subsequent mean or average of these numerical values.
  2. Temperatures measured at ASOS stations are recorded and reported as ranges 1°F wide (0.55°C), and such temperatures are correctly recorded as “integer +/- 0.5°F”. The means of these recorded temperatures cannot be more accurate than the original measurements — because the original measurement records themselves are ranges, the means must be denoted with the same +/- 0.5°F.
  3. The same is true of Tide Gauge data as currently collected and recorded. The primary records of 6-minute values, though recorded to millimetric precision, are also ranges with an original accuracy of +/- 2 centimeters.  This is the result of the measurement instrument design and specification, which is that of a sort of mechanical averaging system.  The means of tide gauge recorded values cannot be made more accurate than +/- 2 cm — which is far more accurate than needed for measuring tides and determining safe water levels for ships and boats.
  4. When original measurements are ranges, their means are also ranges of the same magnitude. This fact must not be ignored or discounted; doing so creates a false sense of the accuracy of our numerical knowledge.  Often the mathematical precision of a calculated mean overshadows its real world, far fuzzier accuracy, leading to incorrect significance being given to changes of very small magnitude in those over-confident means.

# # # # #

Author’s Comment Policy:

Thanks for reading — I know that this will be a difficult concept for some.   For those, I advise working through the example for themselves.  Use as many measurements as you have patience for. Work out all the possible means of all the possible values of the measurements, within the ranges of those original measurements, then report the range of the means found.

I’d be glad to answer your questions on the subject, as long as they are civil and constructive.

# # # # #

 

The Reverend Badger
October 15, 2017 3:39 am

Well this is an interesting topic!

Thanks to Kip for a very clear and well thought out explanation. Easy peasy or so I thought. Just reminded me about all the maths and statistics I studied when I was 17/18/19 and like Kip showed us I used to work out homework type examples myself. Pencil and paper, basic calculator. I didn’t even see a scientific calculator until 1977.

Well I thought all this was basic, elementary, simple. Easy to grasp. Foundational stuff for any STEM degree course. Ingrained in the brains of all those who graduated, known by all PhDs. FUN-DER-MENTAL.

Apparently NOT !!!!!!!!!!!!!!!!!

Where oh where to start? I literally have no idea now. If only Nick were one of my students, we would have him stay behind and sit with all the others (it’s going to be a big room) with just a pencil and paper and a basic calculator. Unfortunately I expect the homework will not even be attempted as the students will just start arguing with the teacher (again).

I’ll have a look through my library and see if I can find some of the older books on metrology, probably still got a few somewhere. May be coming back later with some references after the weekend.

TLDR Kip right, all the “others” SO SO WRONG.

October 15, 2017 4:11 am

Apologies for this being OT, but I thought it may predict exact path of hurricane Ophelia,
[snip – you thought wrong -mod]

Greg
Reply to  vukcevic
October 15, 2017 4:46 am

Yes, that line caught my eye yesterday. Very strange. In the animation you can see that the strong winds blow towards that line then stop instead of diverting. I can only assume that means that the horizontal component more or less goes to zero and the air goes straight up.

There was nothing like this on the recent tropical cyclones.

Greg
Reply to  Greg
October 15, 2017 4:49 am

here is a visual of the cloud :
http://www.meteofrance.com/integration/sim-portail/satellite_commentee/image.jpg

Clear warm air to the south, meeting cold air blowing down from the north.

Greg
Reply to  Greg
October 15, 2017 4:54 am

Here is an animation, that line does not seem to be the future storm track, the whole system is moving eastwards and the storm is dispersing.

http://www.meteofrance.com/previsions-meteo-france/animation/satellite/europe

October 15, 2017 4:26 am

” and vast sums of money are being invested in research to determine whether, on a global scale, these physical quantities — Global Average Temperature and Global Mean Sea Level — are changing, and if changing, at what magnitude and at what rate”

Ah no. Not vast sums. Hardly anything at all. GISS spends less than a 1/4 man year on temps.
last I looked CRU was maybe a Post doc.
Cowtan and Way.. volunteer.
Berkeley earth, all volunteer.

Not vast sums at all.

The other efforts “re analysis” which Judith Curry takes as the gold standard, is also cheap
and some folks even make money off it.

Kip, you have no valid points. I only pray the Red team asks you to Join.
That would doom it.

Greg
Reply to  Steven Mosher
October 15, 2017 4:57 am

which says a lot about your impartiality and objectivity.

Don K
Reply to  Greg
October 15, 2017 5:41 am

No one has ever accused Steven of impartiality, objectivity, (or civility). The issue is the extent to which he is correct.

In this case, I think he has simply misunderstood what Kip said. I mean, what is climate modeling, but an attempt to project future temps? I’m told that the modeling is not cheap. Likewise, the principal use of satellite Radar Altimetry seems to be projecting sea level rise. At a first approximation, nothing associated with satellites is cheap. Ever.

Greg
Reply to  Greg
October 15, 2017 6:09 am

He is deliberately missing the point. If GISS spend half a man-year trying to ‘adjust’ the temperature record to fit their climate models, they still need the data collection, and that is a global network of 100,000s of meteo stations, deployment and maintenance of floating and anchored sensors, ships’ records, etc. etc.

Before Climategate came out CRU were being paid $1 million per year to maintain the land surface record and that was just archiving ( which they failed to do and the cat got it ) and processing.

He also wilfully ignored the whole question of MSL despite having included it in the snip he quoted.

RW
Reply to  Steven Mosher
October 16, 2017 9:38 pm

lol Mosher. Someone was describing your more recent posts the other day as drive-bys. Pretty lazy drive-by. I hope you and kip know one another and that you’re just razzing him.

Editor
October 15, 2017 4:56 am

Thanks, Kip. Re temperature: I think it is important for everyone to understand that in this particular post you are only addressing a small part of the total problem. [You know this of course, and you have addressed some other issues in other posts.]. Correct me if I have got it wrong, but …..
1. Your analysis addresses only the temperature measurements that are made. It makes no allowance for the temperature measurements that are not made. Temperature measurements that are not made include missing entries from an existing station and missing stations. By “missing stations” I mean the areas in which there are no stations at all, areas that are too different to their nearby stations’ locations to be represented by those stations, and changes in the set of stations over time. All of those temperature measurements that are not made have to be estimated, and that introduces significant further error.
2. The temperatures being measured are not necessarily the temperatures that are required for climate purposes. For example, all temperature measurements in urban areas have to be corrected for UHE (Urban Heat Effect). This again introduces significant further error, because UHE is not fully understood, the factors required for accurate correction are not available, and some of the methods being used by the providers of some temperature sets are quite simply wrong. Some argue that UHE is insignificant because urban areas are such a small part of the total surface area, but this argument is incorrect because urban stations’ readings are used to estimate the temperature measurements that are not made as per 1 above. UHE is only one such source of error, other sources include land-use changes, aircraft movements at airports, pollution, poor station siting, etc.
3. The temperature measurements that are made are not necessarily correct, ie. not necessarily within the 1 deg F range that you describe. Depending on whether a station is or was automated, there could be human or equipment error. This tends to be dealt with by ignoring outliers, but this simply adds to the set of temperatures that are not made as per 1 above, it does not trap those errors which leave readings within the accepted range, and it risks eliminating genuinely unusual temperatures.
4. The temperature inside the station may vary from the temperature outside, if for example there is a fault with the station’s design or siting or changes in its condition.

The end result of all of the above is that the (in)accuracy ranges that you describe are only a small part of the total inaccuracy.

NB. This is not in any way a criticism of Kip’s post. Kip’s post addressed one particular issue only. All I am doing is making sure that others understand that the issue addressed by Kip in this post relates to just a small part of the total temperature error.

October 15, 2017 4:58 am

Kip – you have a difficult row to hoe, here’s the first two sentences from the Executive Summary from Chapter five of the IPCC Fourth Assessment Report: Climate Change 2007 (AR4):

The oceans are warming. Over the period 1961 to 2003, global ocean temperature has risen by 0.10°C from the surface to a depth of 700 m.

Really? Two-place accuracy for the entire globe over a 42-year period? When you’re dealing with people who are in charge and also write that sort of nonsense, it gives you a hopeless feeling. How many reviewers with a PhD put their stamp of approval on that over-the-top sophistry?

Chas
October 15, 2017 5:23 am

The uncertainty of GPS position doesn’t seem to decrease simply as root n:
https://www.syz.com/gps/gpsaveraging.html
Unfortunately this interesting study doesn’t look at the accuracy of the position.

Don K
Reply to  Chas
October 15, 2017 6:58 am

That IS an interesting paper

“The uncertainty of GPS position doesn’t seem to decrease simply as root n:”

One quick answer is (probably) that some of the errors are due to things like mis-estimates of ionospheric delay and satellite position that tend to average out over time. In technospeak, observations that are close together in time tend to have errors that are correlated.

I expect that’s not the full story.

Reply to  Chas
October 15, 2017 11:27 am

Chas – as far as I know, the GPS system delivers a coordinate with a fixed variability due to the way the signals have to be analysed. Before GPS went beyond the US military the position was “fuzzed”, so it did not represent a random distribution around a centerpoint but a level probability around the point of something like 6-10 meters. Now they aren’t fuzzing the output and the coordinates represent a point anywhere between 6-10 centimeters. You can’t query the position several times and expect the coordinates returned to have a random Gaussian probability of being anywhere in the box. There is no statistical centerpoint as when a measurement has a Gaussian distribution probability.

Very similar to what Kip is talking about.

Most of the methods used by the Standards organizations above deal with measurements where individual measures can be expected to have a Gaussian distribution – chemical tests, electrical measurements, conventional surveying, engineering measurements, etc. That doesn’t apply when the measurement is rounded to an arbitrary figure. There are several good posts on how the Australian BOM goofed when they introduced electronic thermometers. The WMO standard requires averaging measurements over 10 minutes, to mimic the previously used mercury thermometers. The BOM was picking the highest reading found within 1 second and using that as the average. Then, for a while, they had low limits programmed into the data gathering that were well above reasonable limits: -10.5C was automatically reported as -10.0C in several areas that had routinely in the past reported -12, -13, -14.

October 15, 2017 5:35 am

Comment on Temperature:

Temperature provides very limited information about the energy state of any system not in complete thermal equilibrium. A temperature reading is a highly localized measurement of instantaneous kinetic energy. But the very existence of weather proves the Earth we are measuring is not in thermal equilibrium. To approximate the actual energy state of the large volume of atmosphere or water represented by a single thermometer, we would have to know a lot more about the heat capacities and thermal conductivities and thermal gradients present throughout that volume. And to have any hope of accurately approximating with a single pseudo temperature value the energy state of the dynamically changing entire Earth’s surface at any single moment, we should need a much more uniform and dense distribution of thermometers than we have today.

Comment on Accuracy:

I would reframe the debate above in the following terms.

1. Consensus does not necessarily equal truth.

2. Measurements are analogous to opinions: they each have some degree of truth and some degree of ignorance/error.

3. Averaging ignorant opinions leads to consensus. Averaging erroneous measurements leads to consensus.

4. Averaging more ignorant opinions or erroneous measurements firms the consensus but does not force the consensus to converge toward truth.

5. The a priori assumption that ignorance or error is random and self-cancelling rather than correlated/biased is unscientific, and likely ignorant and erroneous in its own right.

Kip is essentially right. Rick C PE is more precisely correct. Mind your significant figures.

Bill Marsh
Editor
October 15, 2017 5:38 am

Late to the discussion and perhaps (most likely I think) my understanding of Stats has declined considerably since my graduate work in that area 40 years ago, but I’m having trouble accepting this statement by Kip as valid.
“When temperature is measured at 11:00 and at 11:01, one is measuring two different quantities; the measurements are independent of one another.”

I may be misunderstanding the concept of ‘independence’ as used. I think that the temperature at 11:01 is dependent to some extent on the temperature at 11:00, at least in the physical world. There has to be some physical limit to how much the temperature can change in one minute. If the measurements are dependent it makes things mathematically very ugly, very quickly.

Reply to  Bill Marsh
October 15, 2017 8:21 am

There are two ideas here. Repeatedly measuring a board deals only with measurement error, presumed normally distributed and very tractable via the law of large numbers. Measuring temperature at a place at different times is NOT ‘the same board’, so errors do NOT wash out. Kip’s point. But you are correct that the measurements will be autocorrelated. Many time series in economics are autocorrelated. This introduces a number of complications into the statistics concerning them. For one example, see McKitrick’s paper on the statistical significance of the various pauses in temperature anomaly time series.

Dave in Canmore
Reply to  ristvan
October 15, 2017 2:12 pm

“Repeatedly measuring a board deals only with measurement error, presumed normally distributed and very tractable via the law of large numbers. Measuring temperature at a place at different times is NOT ‘the same board’, so errors do NOT wash out.”

Not to be a jerk to the contrarians, but I’m truly amazed that this is so difficult to grasp! If you follow along the thought process in the real world, with real instruments measuring real changing things, it’s pretty simple to see just WHY the errors don’t wash out. Divorce the numbers from context and I suppose it becomes harder to see why it doesn’t work.

Reply to  Bill Marsh
October 15, 2017 11:59 am

It’s not the temps that are independent, it is the measurements. You are not measuring the same thing at both times.

tty
Reply to  Bill Marsh
October 16, 2017 6:43 am

You are right. The measurements are not independent measurements of the same quantity. They are strongly autocorrelated.

rckkrgrd
October 15, 2017 5:48 am

We seem to be discussing the resolution of gauges. There are also conversion errors, observation errors, recording errors, and the biases and prejudices of recording personnel, to name a few. Together with siting problems, I would think surface gauge averages indicating 0.85 degrees of warming to be totally meaningless. Surface observations such as frost-free periods or river ice breakup dates are probably more reliable indicators if records were available for long enough periods. Glacial melt-backs and sea ice extent do not seem to me to be a reliable indicator. After all, the ice in my drink will continue to melt without warming the room as long as the room temperature is above the freezing point. In fact, the melting ice would cool the room a tiny amount.
https://wattsupwiththat.com/2011/01/22/the-metrology-of-thermometers/

October 15, 2017 5:52 am

I will quietly say it one more time.
If the measurements of a single physical quantity are randomly distributed about the ‘true’ value, the error in the mean of the results falls as the square root of the number of samples.

This fact is actually used to make digital audio recordings sound better. Digital audio sampling is accurate to the last bit, and values near that bit will be consistently too low, or too high. By adding about a bit of random noise, the samples AVERAGE OUT to the correct value with a greater precision than is possible using a single digital sample on its own.

Simple thought experiment. You want to represent 7.5 using only integers and averages.

One sample of 8 and one sample of 7 gives you 7.5

In fact any real (decimal) number can be represented by the average of an (infinite, if needs be) sum of integers. That’s a similar case to the ‘fractions versus decimals’ argument.

Going back to the audio example, we have a CONSISTENT error. And that means that without the addition of randomness to randomize the error, trying to measure 7.49 for example, nets us 7, all the time, forever.

And that seems to be the key misunderstanding. Consistent error and normal probability errors. Consistent errors will stay the same no matter how many samples we take. If the thermometer is mis-calibrated and is reading a degree low, no amount of readings will improve the result.

On the other hand a sample of 1000 thermometers dipped in the same bucket, as long as they have a random error distribution, will.

The ‘average temperature of the earth’ has meaning because we give it meaning. It probably means ‘to a very close approximation the average over a year of many perfect thermometers readings taken every ten seconds, at 1km intervals over the surface of the earth’.

The less readings there are and the more imperfect the thermometers the less meaningful that average is.

Expressed as a rise in error bar size. BUT to deny that that average is more meaningful than a single measurement taken once, is um, Scientific and Mathematical Denialism frankly.

Statistics is hard. I hated it more than any other maths. It’s regularly abused, BUT its hugely useful if you know what you are doing. Unfortunately most people don’t, and I don’t exclude myself.

But I do know the very basics, and that’s what I have tried to illustrate here. The difference between consistent bias on all measurements, and random error probability. One can be averaged out, the other cannot be.

Kip seems to confuse the two
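A small numeric sketch of the dithering effect described above (the 7.49 target and the uniform noise of roughly one quantization step are illustrative choices, not from any audio specification):

```python
import random

true_value = 7.49   # the quantity we try to represent using only integers

# Without added noise, a consistent value always rounds to the same integer.
plain = [round(true_value) for _ in range(100_000)]
print(sum(plain) / len(plain))          # 7.0, every time, forever

# With roughly one step of random noise added before rounding (dither), the
# average of many quantized samples converges toward the true value.
dithered = [round(true_value + random.uniform(-0.5, 0.5)) for _ in range(100_000)]
print(sum(dithered) / len(dithered))    # close to 7.49
```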

Greg
Reply to  Leo Smith
October 15, 2017 6:38 am

Leo: “On the other hand a sample of 1000 thermometers dipped in the same bucket, as long as they have a random error distribution, will.”

But what we have is 1000 thermometers dipped into 1000 different buckets. So the calibration and construction errors will still average out but are no longer measuring the same thing.

We have several thousand gridcells with variable numbers of readings done by varying methods.

How does that affect the stats?

The “meaning” we attribute to the mean is not arbitrary; it is being taken as a cast-iron indication of the supposed effect of the GHE, i.e. it is not just a question of how good the mean is as a mean (an “expectation value”); it is implicitly a calorimeter: a measure of the total heat energy.

Crispin in Waterloo but really in Beijing
Reply to  Greg
October 16, 2017 9:53 am

Greg:

“So the calibration and construction errors will still average out but are no longer measuring the same thing. ”

We have no idea if it is true they will average out. Manufacturers are under no obligation to create instruments that report results that are randomly distributed between the error limits. It is far more likely they tune it from one side, stop tuning once it gets inside the limits, then move on to the next one.

Bill Marsh
Editor
Reply to  Leo Smith
October 15, 2017 6:42 am

Your evidence that the error distribution for thermometer measurements is random? Not saying it isn’t, just that I’ve seen no proof that it is or is not. That would be an interesting study.

Reply to  Kip Hansen
October 15, 2017 10:05 pm

But what we have is 1000 thermometers dipped into 1000 different buckets. So the calibration and construction errors will still average out but are no longer measuring the same thing.

Yes, they are measuring something that is the same thing – the signal over that time frame of 1,000 buckets. That’s exactly how your sound system works. Take 1,000 samples of the sound with an 8-bit A/D. Then average them. You get *one* output sample that has 13 bits of resolution (8 + log2(sqrt(oversampletimes))). Do that at 20 MHz and suddenly your terrible 8-bit A/D is not so bad for 20 kHz signals.

You are trading off time resolution with measurement resolution. This is standard signal processing work. If you want formal proof, it’s done in the first year of EE graduate school in typically the digital signal processing class. Hope you like math…

Or you could look at some pictures. This datasheet from Atmel explains it fairly well: http://www.atmel.com/Images/doc8003.pdf

Peter.
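A rough numeric sketch of the oversampling arithmetic described above, assuming the quantization error behaves like random noise (dither present); the signal value and sample count are illustrative:

```python
import math
import random

adc_bits = 8
oversample = 1000

# Effective resolution from averaging N dithered samples: half a bit per doubling.
print(round(adc_bits + math.log2(math.sqrt(oversample)), 1))   # about 13 bits

# Quick check: quantize a constant signal with an 8-bit ADC plus dither, average.
lsb = 1 / 2 ** adc_bits
true_signal = 0.123456                       # in full-scale units (0..1)
samples = [round(true_signal / lsb + random.uniform(-0.5, 0.5)) * lsb
           for _ in range(oversample)]
error = abs(sum(samples) / oversample - true_signal)
print(error < lsb)                           # True: error well below one 8-bit step
```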

Tom Halla
October 15, 2017 5:58 am

To throw something into the discussion, it would appear Nick Stokes and Mark Johnson are reifying “average temperature” and “average sea level”, presuming that the concepts have a sort of Platonic essence outside the procedure used to derive them. Reification has occurred in psychology with terms like “intelligence quotient”, where practitioners fall down a metaphoric rabbit hole when they forget that the number is the result of tests with some repeatability with a given subject, and some correlation with other tests purportedly measuring the same thing.
Nick S and MSJ, one is measuring different things multiple times, not the same thing multiple times. So I agree with Kip Hansen, that the average has no more precision than any one of the separate measurements. Plato is treacherous as a guide to reality.

Reply to  Tom Halla
October 15, 2017 6:33 am

+1

MSJ and Stokes-as-he-ever-was need the Central Limit Theorem to apply to temperatures and sea level data, so they just keep saying it does, no matter whether it makes any sense or not. If you have measured air temperature at one time and one place, you have measured one temperature, one time. You cannot measure it again, as it is not the same next time.

Reply to  Michael Moon
October 16, 2017 4:15 pm

Peter Sable,

time se·ries
tīm ˈsirēz/
noun (Statistics)
noun: time series; plural noun: time series; modifier noun: time-series
a series of values of a quantity obtained at successive times, often with equal intervals between them.

Thermometers and tide gauges do not measure “a quantity.” They measure different things each reading. Flail away at it all you like, no CLT…

LdB
Reply to  Tom Halla
October 15, 2017 7:38 am

+10. You have realized one of the two real problems the Stokes group haven’t addressed. No-one asked them for their calibration because they are dealing with things that are proxies, which is what you are trying to describe.

I don’t agree with your conclusion however, because I simply don’t know or see the answer to how much has been done. I am fence sitting and I am interested to see what the scientists did with this problem because it’s technical it’s not something you will find in the general media.

Even out in the real world it is solvable; you see, LIGO has to do this sort of thing. I am genuinely interested to see how climate science has dealt with it.

Reply to  LdB
October 15, 2017 10:08 pm

The Central Limit Theorem as applied to time series signals is simply trading time-resolution for measurement resolution. If you have 365 days worth of temperature, you know the year’s average temperature (one number) to a pretty good degree of measurement accuracy – better than the individual measurements. You know nothing extra about what the temperature was on Feb 2nd.

If you don’t believe this, then you must sell any digital sound or video equipment you have and go find some vintage analog gear, because this principle is how all modern digital A/V gear works…

Reply to  LdB
October 16, 2017 7:04 am

to a pretty good degree of measurement accuracy

Ooops, editor escape. Should have said precision.

Accuracy is a different problem, though I’ll argue you get the same central limit theorem effect by independently calibrating 1000 thermometers. Unless one can prove a bias effect in such calibration that varies over time… (but that’d be a different topic)

LdB
Reply to  LdB
October 19, 2017 7:12 am

What you now need to do is look at when the Central Limit Theorem breaks which is quite often in Signal Processing. A reasonable start is https://people.eecs.berkeley.edu/~wkahan/improber.pdf
It does an analysis of a number of papers which failed because they relied on the CLT, and it isn’t universally true.

MikeP
October 15, 2017 6:10 am

Kip, You correctly point out that the probability is flat across each original data point interval and that when you average a number of points that possible outcomes span the same width interval. What you left out is that the probability is no longer flat, as the extreme cases only occur for one possible combination of values whereas the middle can be found from many combinations.

Reply to  Kip Hansen
October 15, 2017 9:05 am

And for the $64,000 question Kip, is statistical sampling a measurement?

Nick Stokes
Reply to  Kip Hansen
October 15, 2017 10:07 am

Kip,
Your tide example involved sigma’s, which you wrongly interpreted as ranges. They are probability moments. In science, you can’t get away from them. You normally don’t have ranges at all, and they are not what proper scientists mean when they talk about error. They mean standard deviation, or standard error.

MikeP
Reply to  Kip Hansen
October 15, 2017 1:34 pm

sorry, but when you average, the only way to get an extreme result is if every measurement happens to be at the same extreme end. You can get the middle result a multitude of ways. The more data points averaged, the closer the average is to Gaussian, even though every single point being averaged represents an interval.

Nick Stokes
Reply to  Kip Hansen
October 15, 2017 9:18 pm

Kip,
“They say accuracy, and they mean accuracy.”
Well, they say estimated accuracy. But there is nothing in what they say (as opposed to what you put to them) that is inconsistent with the normal understanding that if a number is expressed a±b, then b is a sigma – a normal deviation within which 2/3 of occurrences would be found. From your email chain, I agree that their explicit remark about sigma was about the 6 minute numbers, not the 2cm estimate. Sorry for getting that wrong.

The fact that they do reduce 2cm to 5mm for the monthly average is consistent with that a±b interpretation. You insist that they mean intervals and so get it wrong. I think the evidence for that is weak; much more likely is that they use the standard interpretation as a sigma, and it reduces as the number of readings averaged grows.

garymount
October 15, 2017 6:17 am

I am considering coding up a computer model to exercise the various concepts of temperature measurement as discussed in this article. With an idealized model, probably consisting of various continuum math equations so that an exact answer is known a priori, and then testing various simulated temperature-taking scenarios, I hope to definitively confirm whether Kip is right or whether his detractors have a case.
I will have to sketch out on paper for a while how to proceed. How to simulate someone measuring a temperature within my model. What various forms of varying temperature models to use and how statistics might be used.

Greg
Reply to  garymount
October 15, 2017 6:49 am

“I hope to definitively confirm if Kip is right or if his detractors have a case.”

Bad objective, likely to lead to confirmation bias or non reporting of a negative result. This is exactly what is wrong with climatology.

I would suggest you set out to improve your understanding of how errors accumulate and combine. If you think you will find +/- 0.5 degrees I can tell you now you won’t, but I would say do your test, fully document your model, and preferably make it available for others to play with. There may be improvements to be made to how you simulate uncertainties and measure errors. I do think you will find current estimations of uncertainty are optimistic, so the effort could be worthwhile.

Greg
Reply to  Greg
October 15, 2017 7:04 am

Sorry, I misread what I quoted. Hoping to have a positive indication one way or the other is fine. Like I said it is well worth looking into.

Steve from Rockwood
October 15, 2017 6:47 am

Nick Stokes is talking about purely random error. Systematic error does not reduce by the number of readings. There is also the problem of averaging readings from different areas, each with its own error.

Greg
Reply to  Steve from Rockwood
October 15, 2017 7:09 am

Also, if you have two random errors which are orthogonal ( independent causes ) you still have two errors whose uncertainties need to be combined.

LdB
Reply to  Greg
October 15, 2017 8:49 am

OR you could have a number of localized results distorting the average. What they have not talked about is any sort of main effect testing which has stunned me given they are statistics peeps. They have said absolutely nothing about the behaviour of the sample space.

Nick Stokes
Reply to  Steve from Rockwood
October 15, 2017 10:01 am

“Nick Stokes is talking about purely random error.”
So is Kip. There is no systematic error in either of his examples.

The thing is, it isn’t easy to be systematic. In any large dataset, the errors are apt to be uncoordinated. It would be hard to coordinate. If there is systematic error, that is bias. And yes, it doesn’t reduce under averaging, which is why people make strenuous efforts to identify and remove it.

Clyde Spencer
Reply to  Nick Stokes
October 15, 2017 2:33 pm

NS,
The stated range of +/- 2cm implies that each and every site has its unique systematic bias. The manufacturers are warranting that none in the array exceed that value. That is the only reasonable explanation for tide gauges that have a precision of 1 mm, but an accuracy of 20 mm.

Clyde Spencer
Reply to  Nick Stokes
October 15, 2017 8:08 pm

Kip,
I would agree that one could have a bias that is dependent on the tide. Someone else has pointed out that because of time delay, the water in the stilling well probably lags what the water outside is doing.

Crispin in Waterloo but really in Beijing
Reply to  Steve from Rockwood
October 16, 2017 10:08 am

We term these ‘systematic errors’ and ‘experimental errors’. One expects that experimental errors are normally distributed unless there is good cause to say otherwise. Systematic errors are irreducible because they are built into every measurement.

If a thermocouple is mis-calibrated by 2 degrees, all the readings are out by two degrees and are not made ‘more correct’ by take many more readings. Similarly a thermocouple that is ‘within spec’ is not necessarily giving randomly varying readings around a perfect result, it is giving results within the specific limits.

Expecting sea level readings to be normally distributed around the true level is like expecting waves on the ocean to be sine waves. Given the 20mm range in the original, that 5mm long term range is unsupportable.

Greg
October 15, 2017 6:51 am

Global temps are continually “corrected” yet the uncertainty is always the same. This means that the method of assessing the uncertainty is not accounting for all the errors. The “adjustments” which were deemed necessary were not in the original error model !

Phil
October 15, 2017 6:57 am

This discussion is fun. In Physical Chemistry 101 we had a lab, naturally. One of the experiments was to measure the acceleration of gravity with paper tape apparatus. There were 200+ students split into teams of 3 or 4. The focus was on experimental error. Measuring the distance between dots on the tape with a steel rule graduated in 2 mm ticks and a stop watch showing full seconds and tenths, we had to measure the accuracy of each measurement by multiple measurements of two marks and measuring the time. The overall results for the class were quite good, something like 9.65-9.9 m/sec^2 with a standard deviation of … The range between teams was not very good, something like 8.9-10.3 m/sec^2. The reports had to include a total estimated error – the final range of possible error summed according to the equation. With all the individual acceleration measurements, only about half fell within +/- 1 std deviation (fat tails), but every measurement but one fell within the total error estimates.

The Prof specifically said when reviewing the results that the point was not to be too focused on the result but to have a realistic estimate of how far wrong an experiment could be. The other point made was that a few teams had results within 0.1 m/sec^2, while others made as many tries and got a spread of 0.5 m/sec^2 – the difference between accuracy and precision.

More or less to the point Kip was making. Measurements are not the same as what you are measuring, and it matters how measurements are made and what they actually represent. Measuring m/sec^2 is pretty trivial. Trying to estimate something as insubstantial as the global average temperature trend is getting to the point of meaninglessness, since we can’t even begin to know how all the variables involved affect the result, and the GAT, which is an extensive measurement, has a very convoluted and undefined relationship to what the climate actually does.

Reply to  Phil
October 15, 2017 9:44 pm

Suppose this experiment with the same many measuring teams was repeated in a place where gravity was .2 m/sec^2 stronger or weaker, or the local gravity changed by .2 m/sec^2 due to something that is a matter of fiction. What is the expectation of this hypothetical change of gravity being detected, and with how high % of confidence? How much would the gravity in your lab have to change for the sum of your student teams to have detected a change with 95% confidence that the change occurred within +/- 99.9%, +200/-99.9%, or +/- 50% of the gravity change indicated by your teams? With consideration that your low accuracy high precision teams are probably in good shape to detect a change in what they are measuring inaccurately with high precision? (This reminds me of a story by someone young trying out shooting a gun at a gun range, misinterpreting how to use the old tech sight, and shooting a tight small cluster whose size was smaller than its center’s distance from the center of the target.)

October 15, 2017 7:08 am

Kip, you have a gift for making complicated things simple. I dare say your classic stats critic has stepped back. It’s not that he or she wouldn’t understand, of course, but rather that, in a negligent reading of your point, they saw only a violation of a well-established principle that actually wasn’t at issue in the case you were describing.

The main criticism I would have of the unhappy statistician is that, having read your piece negligently, this person then resorts to an insult re your competence in science. I’m not sure I would have confidence in a statistician who has a habit of not reading carefully before arriving at such an outrageous conclusion, unless you have since received a sincere apology.

Clyde Spencer
Reply to  Kip Hansen
October 15, 2017 2:36 pm

Kip,
Gary is from the Old School, where civility was still considered a virtue. I’m not sure about the “unhappy statistician.”

AZeeman
October 15, 2017 7:29 am

The argument is about measurements which vary in time and value such as temperature versus measurements which vary in value only such as hole size.
One measurement has two dimensions, time and temperature. The other measurement has one dimension, size.
When measuring temperature it’s impossible to increase accuracy by taking multiple measurements since the temperature varies in time. The only way to increase accuracy is by taking multiple measurements using multiple thermometers at exactly the same time and averaging them. This eliminates the time dimension and makes it possible to use single dimension tools like averaging and standard deviation to estimate the error.
When an item is measured in a metrology lab, extreme measures are taken to keep things like the measuring instrument and the item being measured at the same temperature. Air drafts are blocked and everything is handled as little as possible so that there are no variations of dimension over time. With the time dimension eliminated, it is now possible to make repeated single-dimension measurements at different intervals using a single measuring instrument.
The only way that single thermometer measurements can be averaged is when the temperature is known to be stable over the measuring interval such as when a thermometer is calibrated using a known stable temperature such as the triple point of ice. Repeated measurements can be made and single dimension average and standard deviation can be calculated.
Which brings up another point. Assume that you are calibrating a digital thermometer which reads to 1 degree and it’s calibrated using a precision temperature source accurate to 1/100th degree. The readings are 100, 100, 100, 100, 100, 100, etc. What is the accuracy of the thermometer?
You have a measuring stick which has 10 foot units and use it measure the average population height. The readings are 0, 0, 0, 0, 0, etc. What is the average population height and standard deviation?

Clyde Spencer
Reply to  AZeeman
October 15, 2017 2:47 pm

AZeeman,
You asked, “What is the average population height and standard deviation?” The obvious conclusion is that there is a minimum measurement interval that is required to improve measurements. That minimum is what will produce a different value each time. The implication is also that even with that minimum fiducial increment, there is a limit to how much the accuracy or precision can be improved. One cannot take an infinite number of samples or measurements and get infinite accuracy or precision!


HankHenry
October 15, 2017 8:11 am

Can you measure “global temps” by measuring averaged air temperatures or even combined air and sea temperatures? (Notice the use of the word “combined” rather than “averaged.”) It seems all global temps are good for is observing trends. The true temperature of the earth’s surface is much lower than what air temperatures indicate, due to the extreme cold of the ocean abyss. Trenberth has acknowledged this when he claims that deficiencies in models are due to heat hiding in the oceans.

Taylor Ponlman
Reply to  Kip Hansen
October 15, 2017 8:52 am

Kip, here’s a thought experiment that might help. Say I’m trying to get the average height of a human male, and I measure 1000 randomly selected males with an accuracy of 1 inch. Averaging all those measurements gives me a ‘suspected’ average height. Now I measure 1000 males a second time, but I can’t guarantee the same set of 1000 is in my second sample. I keep repeating this until I have 100 such samples. If I average those 100 results, can I say the result is any more accurate? Intuitively, I don’t think so, since each time I’m not measuring the same thing. On the other hand, if I measured the SAME 1000 males 100 times, I could potentially feel better about the result, since actual measurement errors would tend to cancel out.
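A rough simulation of the two procedures, with invented numbers (population mean 69 inches, SD 3 inches, readings rounded to the nearest inch); it only shows what each repeated average settles toward, not which procedure is preferable:

```python
import numpy as np

rng = np.random.default_rng(1)
population = rng.normal(69.0, 3.0, 1_000_000)       # hypothetical heights, inches

def measure(heights):
    """Reading to the nearest inch, with a little random observer error."""
    return np.round(heights + rng.normal(0, 0.3, heights.shape))

# Same 1000 men measured 100 times: errors cancel toward that group's own mean.
group = rng.choice(population, 1000, replace=False)
same_group = np.mean([measure(group).mean() for _ in range(100)])

# A fresh random 1000 each time: the averages scatter around the population mean.
fresh = np.mean([measure(rng.choice(population, 1000, replace=False)).mean()
                 for _ in range(100)])

print(group.mean(), same_group)      # close to each other
print(population.mean(), fresh)      # also close, but to a different target
```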

Crispin in Waterloo but really in Beijing
Reply to  Taylor Ponlman
October 16, 2017 10:15 am

Taylor

I would say that your accuracy remains the same, and the experimental error is reduced. What some above refuse to accept is that these two things are additive. Vastly reducing the experimental error does nothing to reduce the uncertainty of the measurement, which is an inherent property of the apparatus.
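A sketch of that point with invented values (a gauge with an inherent +/- 2 cm uncertainty and readings that scatter by 5 cm), combining the two terms by simple addition as Crispin describes (adding in quadrature is the other common convention):

```python
import math

instrument_uncertainty = 2.0       # cm, a fixed property of the gauge itself
reading_scatter = 5.0              # cm, scatter of the individual readings

for n in (1, 100, 10_000):
    sampling_error = reading_scatter / math.sqrt(n)   # shrinks as n grows
    combined = instrument_uncertainty + sampling_error
    print(f"n={n:>6}  sampling={sampling_error:.3f}  combined={combined:.3f}")
# the sampling term falls toward zero, but the combined figure never drops
# below the instrument's own +/- 2 cm
```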

October 15, 2017 8:49 am

Is the ‘global surface temperature’ construct valid or meaningful at all?

A crude example:

Air temperature in Canada increases this year by 1 degree, air temperature in Australia decreases by 1 degree, global average temperature remains unchanged.

Canada is predominantly forest and plains and cool, Australia is predominantly desert and hot.

My crude example exposes the myth that a single figure for a worldwide temperature can represent something useful.

Increasing the number of temperature sensors and adding increasingly devious ways to merge temperatures together does not make the single figure more useful.

Trying to say that anomalies get round the problem is a bit like saying ‘I know the answer is wrong, but I get the same result no matter what method I use, therefore it doesn’t matter how we do the sums…’.

I’m with Kip.

Clyde Spencer
Reply to  Steve Richards
October 15, 2017 2:52 pm

Steve Richards,
I have previously advocated monitoring climate zones for changes. They can be aggregated subsequently with weighted averages, but we could see more readily if all areas are changing, and if they are, which are changing most. It would certainly give us a better understanding of what is happening than using a single number!

Reply to  Steve Richards
October 15, 2017 10:16 pm

The global temperature can be measured with great precision. The Central Limit Theorem ensures this. Kip’s above analysis is completely flawed in this regard, which is too bad because he does other good work.

Global temperature cannot be measured with great accuracy over spans of centuries. Far too many biases, known and unknown. I believe the satellite data, that’s about it, and given the recent changes, I keep increasing the error bars in my head… if y’all would stop adjusting things, your error bars would be far more believable.

As to whether it matters if the global temperature changes by 1.0 °C? Not at all noticeable. If it were to change by 10 °C we would likely all notice. I note this scale of ‘notice-ability’ is non-linear, which is why the small changes we see are useless metrics to look at, and why they are being extrapolated to the tune of billions of dollars of mis-spent money.
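A toy illustration of the precision/accuracy split being drawn here, with all numbers invented: averaging ten thousand station readings makes the computed mean very precise, but a shared 0.3-degree bias passes through the average untouched.

```python
import numpy as np

rng = np.random.default_rng(2)
true_mean = 15.0                    # hypothetical "true" value, deg C
shared_bias = 0.3                   # e.g. an unrecognised siting or adjustment bias
station_noise = 0.5                 # random error of each reading, deg C

readings = true_mean + shared_bias + rng.normal(0, station_noise, 10_000)

estimate = readings.mean()
precision = readings.std(ddof=1) / np.sqrt(len(readings))

print(f"estimate {estimate:.3f} +/- {precision:.3f}")   # precise to ~0.005
print(f"error vs truth {estimate - true_mean:+.3f}")    # still about +0.300
```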

The Reverend Badger
October 15, 2017 8:53 am

Reductio ad absurdum.

Let us assume that taking multiple low accuracy measurements and applying some mathematical treatment to the results CAN derive a figure more accurate than the individual low accuracy measurements. A figure closer to the actual reality. If this is so then the increase in the accuracy must bear some relationship to the number of measurements taken for a given mathematical treatment.

We assume the mathematical treatment is fixed and simply increase the number of measurements. If the accuracy increases, does it increase forever? In other words, if we take a massive number of readings, say a trillion, will we achieve, for temperature for example, a resolution of 0.0000001 degree? If this were true then we would not have to waste time and energy designing more accurate instruments. We could simply continue to use the low-accuracy ones and take more readings.

Want to know the length of a piece of steel to a tenth of a thou of an inch but only have a schoolboy’s ruler? Simple, just take 10,000,000 readings and use the special app on your iPhone. Nonsense? Yes, I think we will agree on that example. So the conclusion is, assuming we CAN derive greater accuracy via an increased number of readings, that there MUST be a limit. If there is a limit then there is a certain type of mathematical relationship between the increase in accuracy and the number of readings.

This relationship must be a graph, showing increase in accuracy versus number of readings. The graph must be something fundamental, not dependent on the thing being measured or its physical units; it MUST be something mathematical. So for a given mathematical/statistical treatment of any measurement, it would be possible to derive an exact number of measurements necessary to, say, double the accuracy. Let this number be N.

So we take Y measurements with an accuracy of, say, +/- 10%, i.e. the reading taken may differ by up to 10% from the REALITY.
By then taking (N x Y) readings we can, via mathematical treatment of them, derive a figure which is better: only +/- 5% variation from reality.

However, my colleague, who had more time to waste, actually did take (N x Y) original readings with the 10%-accuracy instrument. So all he has to do is take (N^2 x Y) readings to double his accuracy via mathematics.

However, if we now plot these two examples on our graph we have a straight line, indicating that you can increase the accuracy ad infinitum, which we have already agreed is ridiculous.

Reductio ad absurdum. QED.
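For reference, the graph in question can be simulated under the textbook assumption of purely random, unbiased reading error, which is exactly the assumption in dispute for real instruments. A sketch with invented values:

```python
import numpy as np

rng = np.random.default_rng(4)
true_length = 10.0                  # hypothetical true value
reading_sd = 1.0                    # each reading off by roughly +/- 10%

for n in (10, 40, 160, 640):
    # spread of the averaged result over many repeats of the experiment
    averages = [np.mean(true_length + rng.normal(0, reading_sd, n))
                for _ in range(5_000)]
    print(f"n={n:>4}  spread of the average = {np.std(averages):.4f}")
# halving the spread takes four times as many readings, so the improvement
# slows but never reaches zero for any finite number of readings
```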

Clyde Spencer
Reply to  The Reverend Badger
October 15, 2017 3:11 pm

Reverend,
I don’t have a mathematical proof or citation to provide. However, I suspect that the practical improvement of precision is one, or at most two, orders of magnitude because of the requirement to have measuring increments that will result in getting different values each time a measurement is made. That is to say, if one has a measurement of a fixed value that has one uncertain figure beyond the last significant figure, that uncertainty can be resolved by multiple measurements. That suggests that 100 measurements is the practical limit for improving precision.

On the other hand, for a very large population that is sampled to estimate the mean of a variable with a large range (e.g. temperature), where the measurements define a probability distribution of the value of the mean, probably the estimate of the mean can be improved with more than 100 measurements.

However, in a practical sense, it seems to me that the standard deviation is more informative of the behavior of the variable than an estimate of the mean with improved accuracy. I don’t believe that one is justified in assigning more precision to the estimate of the mean of the variable than the precision of the original measuring instrument. I do address this issue with the Empirical Rule in one of my previous posts.

October 15, 2017 9:15 am

“When temperature is measured at 11:00 and at 11:01, one is measuring two different quantities; the measurements are independent of one another.”

Independent? This is time-series data with the expectation of serial correlation over time. If temperature is measured in various places, there will be serial correlation in space, including all three dimensions.

Temperature measurements are serially correlated in four dimensions, implying dependence in four dimensions.
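A small sketch of what that dependence does to the usual independent-readings arithmetic, assuming AR(1) noise with an invented serial correlation of 0.9:

```python
import numpy as np

rng = np.random.default_rng(3)

def ar1_series(n, rho, sd=1.0):
    """Serially correlated series: each value depends on the previous one."""
    x = np.zeros(n)
    for i in range(1, n):
        x[i] = rho * x[i - 1] + rng.normal(0, sd)
    return x

n, rho = 1000, 0.9
means = [ar1_series(n, rho).mean() for _ in range(2_000)]

one_run = ar1_series(n, rho)
naive_se = one_run.std(ddof=1) / np.sqrt(n)   # what independence would predict

print(f"standard error assuming independence: {naive_se:.3f}")
print(f"actual spread of the mean:            {np.std(means):.3f}")  # several times larger
```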

October 15, 2017 9:34 am

Speaking about unrealized uncertainty,

Consider that the consensus sensitivity of 0.8C +/- 0.4C per W/m^2 is expressed with +/- 50% uncertainty, and this doesn’t even include the additional 50% uncertainty added by the RCP scenarios. We know that intelligent life exists elsewhere in the Universe with far more certainty than this (I would put the probability at well over 99%). Which of these two is settled?