The Limits Of Uncertainty

Guest Post by Willis Eschenbach

In the comments to my post called “Inside the Acceleration Factory”, we were discussing how good the satellite measurements of sea surface heights might be. A commenter said:

Ionospheric Delay is indeed an issue. For Jason, they estimate it using a dual frequency technique. As with most everything in the world of satellite Sea Level Rise, there is probably some error in their estimate of delay, but it’s hard to see why any errors don’t either cancel or resolve over a very large number of measurements to a constant bias in their estimate of sea level — which shouldn’t affect the estimate of Sea Level Rise.

Keep in mind that the satellites are making more than 1000 measurements every second and are moving their “target point” about 8km (I think) laterally every second. A lot of stuff really will average out over time.

I thought I should write about this common misunderstanding.

The underlying math is simple. The uncertainty of the average (also called the “mean”) of a group of numbers is equal to the standard deviation of the numbers (a measure of how spread out the numbers are), divided by the square root of how many numbers there are. In Mathspeak, this is

\frac{\sigma}{\sqrt{N}}

where sigma (σ) is the standard deviation and N is how many numbers we’re analyzing.

Clearly, as the number of measurements increases, the uncertainty about the average decreases. This is all math that has been well-understood for hundreds of years. And it is on this basis that the commenter is claiming that by repeated measurements we can get very, very good results from the satellites.
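As a quick numerical illustration of that formula (a minimal sketch with made-up numbers, not anything from the satellite data), here is the standard error of the mean computed for simulated batches of readings; the assumed true value of 85.6 mm and spread of 0.3 mm are placeholders chosen to match the credit-card example below.

```python
# Minimal sketch: how sigma / sqrt(N) shrinks as the number of readings grows.
# The "true" value and spread are illustrative assumptions, not real data.
import numpy as np

rng = np.random.default_rng(0)
true_value = 85.6   # hypothetical quantity being measured, mm
sigma = 0.3         # hypothetical spread of individual readings, mm

for n in (10, 100, 10_000, 1_000_000):
    readings = rng.normal(true_value, sigma, size=n)
    sem = readings.std(ddof=1) / np.sqrt(n)
    print(f"N={n:>9,}  mean={readings.mean():.5f}  sigma/sqrt(N)={sem:.5f}")
```

On paper, a million such readings quote an uncertainty of a few ten-thousandths of a millimetre, and that is exactly the kind of number the credit-card example below puts to the test.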

With that prologue, let me show the limits of that rock-solid mathematical principle in the real world.

Suppose that I want to measure the length of a credit card.

So I get ten thousand people to use the ruler in the drawing to measure the length of the credit card in millimeters. Almost all of them give a length measurement somewhere between 85 mm and 86 mm.

That would give us a standard deviation of their answers on the order of 0.3 mm. And using the formula above for the uncertainty of the average gives us:

\frac{0.3}{\sqrt{10000}} = 0.003

Now … raise your hand if you think that we’ve just accurately measured the length of the credit card to the nearest three thousandths of one millimeter.

Of course not. And the answer would not be improved if we had a million measurements.
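To make the point concrete, here is a hedged sketch (my own illustration with invented numbers, not the drawing or any real survey): the same averaging exercise, but with the ruler read to whole millimetres and carrying a small assumed calibration bias. The average converges very precisely to the wrong number, and adding more readings never fixes it.

```python
# Sketch only: readings of a card assumed to be 85.60 mm long, taken with a
# ruler that is read to the nearest millimetre and carries a hypothetical
# +0.25 mm calibration bias.  All numbers are illustrative, not real data.
import numpy as np

rng = np.random.default_rng(1)
true_length = 85.60   # assumed true length, mm
bias = 0.25           # assumed systematic (calibration) error, mm
sigma = 0.3           # assumed spread of individual eyeball readings, mm

for n in (10_000, 1_000_000):
    raw = rng.normal(true_length + bias, sigma, size=n)
    recorded = np.round(raw)                    # ruler read to whole millimetres
    mean = recorded.mean()
    sem = recorded.std(ddof=1) / np.sqrt(n)
    print(f"N={n:>9,}  mean={mean:.4f} mm  sigma/sqrt(N)={sem:.4f} mm  "
          f"offset from true={mean - true_length:+.4f} mm")
```

The quoted σ/√N keeps shrinking, but the systematic offset (here roughly a quarter of a millimetre of assumed mis-calibration, plus a small extra bias from the rounding itself) never averages away. That distinction between precision and accuracy is what most of the argument in the comments below turns on.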

Contemplating all of that has given rise to another of my many rules of thumb, which is:

Regardless of the number of measurements, you can’t squeeze more than one additional decimal out of an average of real-world observations.

Following that rule of thumb, if you are measuring, say, temperatures to the nearest degree, no matter how many measurements you have, your average will be valid to the nearest tenth of a degree … but not to the nearest hundredth of a degree.

As with any rule of thumb, there may be exceptions … but in general, I think that it is true. For example, following my rule of thumb I would say that we could use repeated measurements to get an estimate of the length of the credit card to the nearest tenth of a millimeter … but I don’t think we can measure it to the nearest hundredth of a millimeter no matter how many times we wield the ruler.

Best wishes on a night of scattered showers,

w.

My general request: when you comment please quote the exact words you are referring to, so we can avoid misunderstandings.

Brian
December 20, 2018 9:09 pm

Willis,

As an experimental scientist, I can say that your arguments here are mostly wrong. The measurement precision of an instrument has effectively no impact on how precisely or accurately a quantity can be measured. To see why, let’s break down the various elements of your post.

With regard to precision (random error), we can take your credit card example. Let’s suppose that people measuring the credit card all get a number between 85 and 86 mm and their results cover the full range. As part of our measurement protocol, we tell them to choose not any number between the two but one or the other. They should make that choice based on which is closer to the actual length they observe. Our data will consist, then, of a seemingly random string of 85s and 86s. We are taking the average of this string of data, some of which are 85 and the rest of which are 86.

If we have ten such measurements, the calculated averages will be numbers like 85.5 (5, 5 split), 85.6 (4, 6 split), 85.7 (3, 7 split), etc. For ten thousand measurements, however, the numbers will be 85.5000 (5000, 5000 split), 85.5001 (4999, 5001 split), 85.5002 (4998, 5002 split), etc. With an arbitrarily large number of measurements, any average between 85 and 86 is possible. But how good are these averages?

The uncertainty you mentioned (sigma/sqrt(N)), also called the “standard deviation of the mean” or the “standard error,” tells us the precision or random error of the mean thus obtained. Do an experiment with N measurements. Calculate the mean. If later experiments are done with N measurements under the same conditions, the standard deviation of the mean (SDM) tells us how much those other means will vary from the first. In general, 95% of means obtained by all N-measurement experiments will be within 2 SDMs of each other. Thus, with an arbitrarily large number of measurements, we can get means from many experiments to be as consistent as we want. There is, in other words, no fundamental limit on how precisely we can determine the result (mean) of an experiment. The measurement precision of the apparatus used is irrelevant.
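A rough sketch of the standard-deviation-of-the-mean point (my own toy numbers, not Brian's data): run many independent 10,000-reading "experiments" under the 85-or-86 protocol described above and look at how tightly their means cluster.

```python
# Toy sketch of the "standard deviation of the mean": repeated experiments,
# each recording 10,000 readings of either 85 or 86 mm.  The probability of
# a reader choosing 86 is a made-up assumption.
import numpy as np

rng = np.random.default_rng(2)
p86 = 0.6            # assumed fraction of readers who judge the edge nearer 86
n = 10_000           # readings per experiment
experiments = 1_000  # number of repeated experiments

means = np.array([
    np.where(rng.random(n) < p86, 86.0, 85.0).mean()
    for _ in range(experiments)
])
sdm_theory = np.sqrt(p86 * (1 - p86)) / np.sqrt(n)  # sigma/sqrt(N) for an 85/86 split

print(f"scatter of the {experiments} experiment means: {means.std(ddof=1):.5f} mm")
print(f"predicted sigma/sqrt(N):             {sdm_theory:.5f} mm")
```

The means of the repeated experiments agree with one another to a few thousandths of a millimetre, just as the SDM predicts. Whether they agree with the card is the separate, systematic-error question taken up next.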

Now your objection to this would likely be “But that doesn’t guarantee that the mean is the right answer. We might get 85.567 +- 0.003 mm when the actual length is 85.329 mm. The mean is consistent but consistently wrong.”

This objection would be correct, and is an example of systematic error (inaccuracy). Every experiment is subject to many systematic errors of varying sizes, and these inaccuracies cause the results to be wrong. Systematic errors are governed by a fundamental principle similar to your incorrect rule of thumb (“you can’t squeeze more than one additional decimal out of an average of real-world observations”):

No systematic error smaller than the uncertainty (random error) can be identified.

This is the point of making more and more measurements. By reducing the uncertainty of the result, we are able to potentially uncover smaller and smaller systematic errors. A small uncertainty doesn’t guarantee that the experiment is that accurate, but it does allow for the experiment to be MADE that accurate by eliminating the systematic errors. Just as a falling tide uncovers more and more objects on the beach, a falling uncertainty uncovers more and more systematic error. And like objects on a beach, this allows the systematic errors to be identified and removed.

But how are systematic errors actually identified? By doing different KINDS of experiments. If the same quantity is measured by many different techniques, those techniques are likely to have different systematic errors. For temperature, this would mean using thermometers, satellites, balloons, etc. For sea level this might be tide gauges, laser altimetry, etc. If we make all those techniques precise enough (through N measurements), they will eventually disagree. We use that disagreement to identify the systematic errors causing the discrepancies. Once the systematic errors are identified and removed, the various methods will agree to high precision. AT THAT POINT, WE KNOW THE ACTUAL RESULT TO THAT LEVEL OF UNCERTAINTY. Again, our ability to do this does not depend in any fundamental way on the measuring precision of the apparatus.

In practice, of course, this process is slow and laborious, requiring a great deal of time and resources. In practice, our ability to reduce the uncertainty may depend on available technology and manpower. But the precision of the measuring device does not place any fundamental limit on the process.

Reply to  Willis Eschenbach
December 21, 2018 6:10 am

Willis, what say you to this specific contention:

“The measurement precision of an instrument has effectively no impact on how precisely or accurately a quantity can be measured.”

This seems to me to be saying that having a ruler with only inches on it is no better or worse than having one with no graduations, or one marked in millimeters.
I do not think any machinist or engineer would agree with this statement.

Brian
Reply to  Willis Eschenbach
December 21, 2018 9:19 pm

Willis,

No, measuring something by multiple methods doesn’t imply that the other methods have better resolution or precision (a micrometer vs. a ruler). There’s an advantage to be gained even if the other methods have identical resolution and precision to the first. The key, again, is choosing methods that likely have different systematic errors. That’s the only way that systematic errors can be identified and eliminated.

This key point is most definitely NOT what you said or implied. Yes, you focused on using only a ruler and said that taking more measurements doesn’t help you out. But even that claim misses the point of how measurements are done in science. Any single measuring device that purports to be accurate is calibrated against other reference devices, or against reference quantities. Every scientific measuring device must be traceable to fundamental measurement standards. A calibrated device and the procedure for using it already has its systematic errors defined and quantified through the use of multiple types of experiments. So even if you are using only the ruler, it is already backed by the multiple devices and experiments I mentioned.

The bottom line is that a ruler calibrated through use of a specific measurement procedure can have systematic errors much less than the resolution of the device. And such a procedure can involve a large number of measurements. By making this large number of measurements according to the designated calibration procedure, the measurement uncertainty is indeed reduced to a small value, and the result is both precise and accurate. And this is all with the use of a single device.

Brian
Reply to  Willis Eschenbach
December 22, 2018 12:31 pm

Willis,

You say “Yes, you can use all kinds of other techniques to improve your accuracy. But to do that, you have to use other techniques.”

The point you are ignoring is that ANY measuring device needs to be calibrated and it gets calibrated by using other techniques and devices. That doesn’t mean you have to use the other techniques as part of YOUR measurement. Look, you provided a ruler as your example. Presumably you weren’t assuming that the ruler is a piece of garbage. Presumably you think it works correctly according to its design. In that case, it’s been calibrated using other techniques. You can’t use the “other techniques” excuse to wiggle out of my point.

And you are just plain wrong about this. It is always possible to obtain much smaller systematic errors with a device AND A PRESCRIBED MEASUREMENT PROCEDURE than the device resolution. And once the systematic errors are small, the only thing preventing a highly accurate measurement is the random error, which is made smaller by repeated measurements. Under those conditions (where the systematic error has already been made small), high accuracy is obtained by repeated measurements.

Reply to  Willis Eschenbach
December 22, 2018 2:36 pm

Brian –> You are conflating a real world measurement with a lab experiment. You seem to be missing the whole point. Willis’ example doesn’t exactly apply to sea level measurements by satellite nor to temperature measurements.

In both cases you get only one measurement of one thing at one time. You don’t have a static substance that you can continue to measure or to diddle with in order to get a more accurate or precise measurement.

All the ADC and other discussions are fine, but they simply don’t apply.

Brian
Reply to  Willis Eschenbach
December 23, 2018 8:45 pm

Willis,

Getting systematic and random errors much less than the base resolution is done all the time and can be taken to extreme levels.

One good example is LIGO, the gravitational wave observatory. LIGO works on interferometry of 1-micron light, which means the base resolution (equivalent to the marking spacing on a ruler) is only 1 micron. Yet they measure deviations of space on the order of 10^-18 m, or 1 trillionth the base resolution. This is accomplished by reducing systematic errors (such as vibrations from earthquakes and traffic) to a very low level and then using statistics to reduce the random error. In this case, the repeated measurements required to gauge random error are provided by the high number of photons in the cavity. Much as I described in my first comment, this allows shifts of less than 1 millionth of a wavelength to be measured. So your statement that errors can be reduced to only about 1/10 the resolution of the device is wrong.

Micky H Corbett
Reply to  Brian
December 21, 2018 3:30 am

I too started my professional life doing a PhD in experimental physics. I then moved on to safety-critical software and then to ion propulsion for satellites. I became very familiar with metrology and had to improve my scientific method.

So, taking multiple measurements will not change anything if you cannot determine that your sample distributions are identical. After all, a necessary condition for the Central Limit Theorem that underpins the error of the mean is that samples are i.i.d. – independent and identically distributed.

If the uncertainty of the individual elements of your sample is sufficiently large, the uncertainty in the distribution goes up. In other words, noise is greater than signal. For temperature measurements this is the case when looking at changes of 0.1 degrees. For satellite measurements it may also be the case if a certain range of heights is required.

More measurements do not automatically produce less uncertainty; the reduction only follows when the conditions of the CLT are met.
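As an illustration of why the i.i.d. condition matters (my own sketch, using an assumed AR(1) noise model rather than any real temperature or altimetry series): when readings are strongly autocorrelated, the naive σ/√N badly understates how much the computed mean actually wanders.

```python
# Illustrative sketch: sigma/sqrt(N) assumes independent, identically
# distributed readings.  For a strongly autocorrelated series (an assumed
# AR(1) process) the naive formula understates the real scatter of the mean.
import numpy as np

rng = np.random.default_rng(3)

def ar1_series(n, phi=0.9, sigma=1.0):
    """Generate an AR(1) series x[t] = phi * x[t-1] + noise."""
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal(0.0, sigma)
    return x

n, trials = 1_000, 500
means = np.array([ar1_series(n).mean() for _ in range(trials)])
naive_sem = ar1_series(n).std(ddof=1) / np.sqrt(n)

print(f"naive sigma/sqrt(N) from one series: {naive_sem:.3f}")
print(f"actual scatter of {trials} sample means: {means.std(ddof=1):.3f}")
```

With this level of autocorrelation the true scatter of the mean is several times larger than the naive formula suggests, which is the point tty makes further down about climate series being "usually rather strongly autocorrelated".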

Clyde Spencer
December 20, 2018 9:50 pm

Brian
For Willis’ example, very few if any people will record 85. Almost all will record 86. Thus, you will get a long string of 86s. If nobody records an 85, you will have actually reduced the precision by reporting 86, when it is clearly less than 86!

Brian
Reply to  Clyde Spencer
December 21, 2018 9:29 pm

Clyde,

That’s why I said this:

“Let’s suppose that people measuring the credit card all get a number between 85 and 86 mm and their results cover the full range,”

in keeping with Willis’ own description. That’s how he got SD = 0.3. My proposed measurement procedure for getting highly precise results actually depends on individual measurements being as imprecise as the device resolution to avoid systematic error (or bias) that can creep into people’s judgement.

December 20, 2018 10:41 pm

The above is in the eyes-glazing-over category.

Two things come to mind. Julius Caesar was asked about the loyalty of his guards. He replied, yes, that is of interest, but who is going to guard the guards?

Second, Joseph Stalin is said to have said, “It does not matter how many vote; what matters is who counts the votes.”

All of the above is wide open to cherry picking and adjustments to the data.

MJE

Reply to  Michael
December 21, 2018 6:14 am

Which is why Alexander Hamilton introduced the Electoral College, never allowed to be all in the same room at the same time. Counting-statistics may be popular, yet when this brilliant idea actually works, the winner is called “populist”.

Steve Reddish
December 20, 2018 10:50 pm

“Regardless of the number of measurements, you can’t squeeze more than one additional decimal out of an average of real-world observations.

Following that rule of thumb, if you are measuring say temperatures to the nearest degree, no matter how many measurements you have, your average will be valid to the nearest tenth of a degree ”

I think you won’t even get an additional tenth. I think you will, at best, get an additional 1/2 of a degree. Suppose you ask a bunch of people to take a reading in whole degrees from a thermometer marked off in whole degrees. If the group overwhelmingly says the thermometer indicates 35 degrees, you must record 35 degrees.

If approximately 50% say it is 35, while the others say it is 36, you can be very confident rounding the temperature to 35.5 degrees. This is because you asked people simply to decide which whole degree the thermometer indicator was closest to. If nearly equal numbers decided each way, the indicator must be very close to midway between.

If 25% say 35 degrees, while 75% say 36 degrees, you may record the temperature as 35.5 degrees, but not as something like 35.7 degrees. This is because you asked for a reading to the nearest degree. All you know for sure is that most of the group determined the indicator was closer to 36 than 35. Because some said it was closer to 35, the indicator must be still very close to midway between 35 and 36. If the indicator actually was very close to what would be 35.7, the group would have overwhelmingly noted it was closer to 36 than 35.

By asking for a reading to a whole degree, you were asking people to judge only whether the reading was more or less than .5 of a degree. Thus, you can sometimes squeeze out an extra 1/2 degree, but no finer.

Automated equipment works in the same manner because that is what it has been set up to do.

SR

Steve Reddish
Reply to  Steve Reddish
December 20, 2018 10:57 pm

I forgot to put in my conclusion: multiple readings only increase the precision to 1/2 of the smallest division the measuring device is marked to, or is capable of determining.

SR

Engineer Bob
December 20, 2018 11:59 pm

I have used a system which gets higher accuracy.

You need to add in a truly random noise source, whose standard deviation is some convenient small multiple of the measurement unit, and whose distribution is well known. For example, the thermal noise in a resistor.

Now the quantity of interest is represented by a population of different measurements. And the statistics of that population give a way to calculate the underlying value more precisely.
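A hedged sketch of the kind of scheme Bob describes (a toy example of my own, not his actual system): a fixed "true" value sitting between the whole-unit marks is digitized once with no added noise and once with a known, zero-mean noise source injected before quantization.

```python
# Toy sketch of the noise-injection ("dithering") idea described above.
# The value and noise model are my assumptions, not Bob's actual system.
import numpy as np

rng = np.random.default_rng(4)
true_value = 85.63   # assumed value, lying between the 1-unit marks
n = 100_000

# Without added noise every quantized reading is identical, so averaging
# can never recover the fraction below the resolution.
no_dither = np.round(np.full(n, true_value))

# With known, zero-mean noise added before quantization, the average of the
# quantized readings converges toward the underlying value.
dithered = np.round(true_value + rng.uniform(-0.5, 0.5, size=n))

print(f"mean without dither: {no_dither.mean():.4f}")
print(f"mean with dither:    {dithered.mean():.4f}")
```

Whether this helps in practice is exactly what is disputed in the replies below: it only works if the injected noise really is zero-mean with a known distribution, and if the underlying value holds still while the readings are taken.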

tty
Reply to  Engineer Bob
December 21, 2018 3:30 am

You are fooling yourself, Bob, if you think adding noise to a series of measurements improves accuracy. What it does is allow you to use a formula that isn’t applicable. In your case, if you do an infinite number of measurements it will only allow you to get back to the original accuracy before you added the noise.

tty
Reply to  Willis Eschenbach
December 21, 2018 12:10 pm

He is still fooling himself. An infinite number of measurements would then allow him to get rid of the noise he added but would not improve the accuracy of the original measurement.

Reply to  tty
December 26, 2018 4:10 pm

You are fooling yourself, Bob, if you think adding noise to a series of measurements improves accuracy

Funny, in electronics we call this “dithering” and it’s used all the time to turn the correlated noise caused by quantization errors into uncorrelated noise, which is exactly what’s happening in the case of the ruler.

The general topic is “quantization noise”, go ahead and Google it.

That quantization noise is white noise, and you can get an extra 1/2 bit of resolution per halving of bandwidth (which is exactly analogous to the sigma/sqrt(N)).

Without dithering, the quantization noise is correlated to the input signal and you are lucky if you get 3 bits or one decimal of improvement.

Of course in climate signals, unlike the ruler, there’s tons of dithering, but who knows whether the dither signal is white noise … often not. But I think making blanket statements about “at most one decimal place” isn’t correct either. Why not just estimate using the Hurst exponent, which can be roughly derived from the data itself?

Reply to  Willis Eschenbach
January 1, 2019 4:02 pm

Peter, I use the Hurst exponent to correct for autocorrelation. However, I’ve never heard of using it to get a more accurate average of a number of measurements. Could you say a bit more about that?

If the Hurst exponent is between 0 and 0.5 you get a more accurate average because the signal has negative autocorrelation. This was discussed in the comments of your Hurst article.

That’s the principle behind delta-sigma converters. Negative feedback allows about one decimal place per oversample ratio in a 3-stage design. (though designers typically talk about bits, or powers of 2, instead of powers of 10).

Unless your measurement system can interact with the climate (hah!), you are not going to get negative autocorrelation from the measurement system! But I wouldn’t be surprised if negative autocorrelation appears somewhere in nature.

BTW, on further reading, estimating the Hurst exponent consistently requires about 20k samples, non-trivial to say the least! So my proposition of “why not just estimate from the Hurst exponent” is not really practical in many cases. One might interpret that limit in a pessimistic fashion given the idea of the Null Hypothesis.

Hocus Locus
Reply to  Engineer Bob
December 21, 2018 4:07 am

You need to add in a truly random noise source,

Are you suggesting something like heating the credit card above the temperature of the ruler until the edge of the card seems centered on a ‘pip’? And a measurement system that yields information ‘at’ the pips but not in between, like a balance scale for weights (or electrical +/- comparator with precision source)? And once the balance point is reached, using the degree of temperature rise and the known properties of the material to ‘back estimate’ where between the pips the card edge was at ambient/ruler temperature? Proxies like turtles all the way down?

Molecules are so skittery and slippery and everything is throbbing along an edge as long as the coastline of Britain. That’s why God created the photon, to count integral sheep and get some rest.

Joe H
December 21, 2018 1:28 am

Willis, I was researching this online recently. There is conflicting information, but I believe it is correct that measurement resolution is a systematic error and not a random one. Hence it doesn’t reduce by sqrt(N) but is irreducible. What is the uncertainty here? Some say one tenth of the smallest division (i.e. 0.1 mm), but I’ve seen others say that is too optimistic and the uncertainty is around half the smallest division (i.e. 0.5 mm). Either way, in your example, I do not believe the credit card length, using the ruler specified, can be known with an uncertainty less than 0.1 mm no matter how many people measure it.

Reply to  Joe H
December 21, 2018 5:17 am

“Observational error (or measurement error) is the difference between a measured value of a quantity and its true value.[1] In statistics, an error is not a “mistake”. Variability is an inherent part of the results of measurements and of the measurement process.
Measurement errors can be divided into two components: random error and systematic error.”
https://en.wikipedia.org/wiki/Observational_error#Systematic_versus_random_error

Joe H
Reply to  Menicholas
December 21, 2018 12:22 pm

I’ve no idea what your reply means M. I know the difference between error and uncertainty but that has absolutely nothing to do with my post. My post is about random error (reducible) v systematic error (irreducible) and whether or not measurement minimum resolution is a random or systematic error.

Michael Carter
December 21, 2018 1:29 am

Surely the accuracy of the satellite measurements is testable?

Take the number of passes required to establish mean sea level according to the theory, then do the same a number of times within a short time-frame (months). If the system is accurate, the precision will be tight.

Not rocket science, surely?

I have the feeling that we are not being told everything. I would love to know what variation they are really getting.

Cheers

M

Reply to  Michael Carter
December 21, 2018 5:03 am

“If the system is accurate, the precision will be tight”

This is not the case.
From the Wikipedia on accuracy and precision:

“A measurement system can be accurate but not precise, precise but not accurate, neither, or both. For example, if an experiment contains a systematic error, then increasing the sample size generally increases precision but does not improve accuracy. The result would be a consistent yet inaccurate string of results from the flawed experiment. Eliminating the systematic error improves accuracy but does not change precision.”

Additionally and separately:
Measuring the height of the ocean is not something like measuring a credit card, which holds still and does not change (much, although atoms are likely eroding from the edges and some slight thermal effects are occurring).
The ocean and the height of it never holds still and is never the same even in the same place over a short period of time.
And water blown or flowing away from one place winds up somewhere else, except when it does not (evaporation, precipitation, rivers, volumetric changes due to temperature, etc.).
Nor is it like doing lab experiments in which there is an “accepted value” to compare results against.
But we do have tide gauges.

In case you missed it, there is a link to an essay from Kip Hansen which gets at these issues very succinctly.
https://wattsupwiththat.com/2017/10/14/durable-original-measurement-uncertainty/

Steve O
Reply to  Menicholas
December 21, 2018 5:18 am

Kip’s article is wrong, and the comments give it a thorough debunking. I’ll go so far as to say that its presence on WUWT degrades the credibility of WUWT.

Reply to  Steve O
December 21, 2018 5:47 am

“Wrong” is a broad term.
Do you mean to say everything he wrote is the opposite of true, or he made one or a few errors in his analysis?
One could read this thread and come away saying Willis is wrong, or he is correct, depending on who you decide to pay attention to.
His article and the discussion are important, for different reasons.
In every one of these discussions, we have some people comparing apples to oranges, others just mixing up terminology, and some others refuting what is accepted and taught in every science classroom in every university in the country, or at least what used to be taught … who the heck knows what they are teaching now.
Saying something is wrong without being specific is itself wrong, as it is unhelpful for a discussion.

IMO, this is wrong:

“Surely the accuracy of the satellite measurements is testable?
Take the number of passes required to establish mean sea level according to the theory, then do the same a number of times within a short time-frame (months). If the system is accurate, the precision will be tight.
Not rocket science, surely?”

If you were in charge, we would know the sea level and variations in the world ocean and the issue would be settled…is that your position?
Everyone else is an idiot?

Steve O
Reply to  Menicholas
December 21, 2018 8:47 am

Kip’s column had a main assertion which was 100% dead wrong, and proved wrong by math, as well as proved wrong by real world examples.

I don’t know much about satellite measurement technology and so can’t comment on it. But I know enough about statistics to point out mathematical errors.

Clyde Spencer
Reply to  Menicholas
December 21, 2018 1:38 pm

Steve O

You said, “I know enough about statistics to point out mathematical errors.” Then please do, instead of just asserting that Kip is wrong.

Reply to  Menicholas
December 22, 2018 2:58 pm

I agree with Clyde.
We mostly all seem to be here for a productive and informative conversation.
Simply saying that you know enough to point out errors, or that math proved his main assertion wrong, amounts to an unsupported opinion.
Opinions are fine, and your opinion may be correct.
But it is unhelpful to make such statements without being specific, unless you merely wish to express your opinion and be done with the discussion.
Kip’s post was a fairly long one, and the comment thread was one of the most extensive I have ever seen … over 500 separate comments!
I stated up top just what I am thinking now: there are a lot of disagreements on these threads regarding the topics of uncertainty, precision, accuracy, and when statistical analysis is and is not valid. And there are several reasons for the disagreements, including people talking about different things (apples vs. oranges disagreements), people using inconsistent terminology or the wrong verbiage for what they are trying to say, whether measurements are of the same thing or of different things when the thing being measured is temperature or the level of the sea, and some others.
Lots of people here know plenty, and yet disagreements abound.
If we all make an effort to at least communicate effectively, we can at the very least have a chance of learning something new, or of giving someone else another way of looking at something, or even just offering an alternative viewpoint or an interesting perspective.
Of course, sometimes one might just wish to offer one’s own opinion and not spend time doing more than that. I do this sometimes.
In such cases I try to recognize and say when something is merely my own opinion or view, but all too often we find others offering their opinion as if they were facts or settled or incontrovertible.
I am not even sure if we disagree on anything, but I cannot tell unless you want to be specific.
Saying read the comments when there are over 500 of them, and saying they give a thorough debunking when they are anything but all in agreement, and not even taking the time to say WHICH of his assertions have supposedly been debunked…is unhelpful.
If there is one thing years at this site and others like it has taught us, it is that no one has a monopoly on correct opinions or valid arguments.
Warmistas who claim to be experts at math and science tell us that Earth is burning up and that we are doomed unless we believe that the matter is settled and we do what they say.
Some of them have PhDs in math, and yet many of us are certain they are full of crap and wrong.

Reply to  Steve O
December 21, 2018 5:56 am

Steve O,
Without paying attention I thought your comment came from Michael Carter.
In any case, I never said that Kip was correct and his article settles the issues being discussed.
I said he lays out the issues succinctly.
I am not a statistician…and I am not going to be the one to settle these back and forth debates about what averaging can and cannot do.
For one thing various people keep using sloppy language.
Another is that the statistical properties of a series of numbers are not the same thing as a measurement, made with a device, of something that is changing with time.
Anyway, I have no interest in arguing about statistics, but only getting at how some people misuse them to further an agenda.

R Percifield
Reply to  Menicholas
December 21, 2018 9:01 am

Steve O,

Please tell us what the incorrect assertion in Kip’s post is. To say something is wrong without identifying the actual statement does not help with the correction, or the discussion. Blanket statements are next to worthless in technical discussions.

Steve O
Reply to  Menicholas
December 27, 2018 4:32 am

“Measuring the air on different days in different places is not measuring the same thing.
You keep ignoring that!”

I’m not ignoring it. I’m actively disputing that it’s as important as you think it is.

Let’s say you measure the temperature, and then a second later measure it again. You take 3600 readings over the course of an hour. How important is it that you’re taking the reading at a different time, when the temperature is slightly different? It’s not important at all, especially when what you are interested in is the average temperature over the one hour period. It’s exactly what you need to do. You cannot throw out mathematical statistics just because you’re “measuring different things.” Math still applies!

It’s the same when determining the average temperature for a year-long period. You take 365 measurements of daily mid-range, covering every single day of the year. You can indeed determine the average of the mid-ranges for the year to within a decimal point of its true value, even with a wide error band. It is also true that you cannot know to within a decimal point the temperature/mid-range on any particular day. But nobody cares about any particular day.

I’m not some warmist troll. Climate science is absolutely sick with bad methodology, but it’s important that criticisms be scientifically sound in order to maintain credibility. There’s enough that is unsound that we don’t have to attack the parts that are rock solid.

Steve O
Reply to  Clyde Spencer
December 22, 2018 9:21 pm

“You said, “I know enough about statistics to point out mathematical errors.” Then please do, instead of just asserting that Kip is wrong.”

— There’s no reply button for your previous comment, so let me reply here. Then I’ll go back and read your columns.

Kip claimed that since temperature measurements were rounded to the nearest whole number, estimating historical temperature to the tenth of a degree was not possible, even if the statistical math that everyone uses says otherwise.

I would agree that if you have only one temperature measurement, that he would be correct. But when you have hundreds of measurements, the errors can be largely eliminated with multiple readings. And in one year, you have 365 readings. You can round the temperature of each day’s reading to the nearest degree, and you can still determine the average temperature to a tenth of a degree. Obviously, systematic errors will not be corrected, but that’s another issue.

I created for him a math experiment that simulated measurements, with rounding, and it returned a result which would have been impossible if he had been right. The experiment said he was wrong. The math said he was wrong. Other people offered their own examples to show why he was wrong.

I describe my experiment in other comments, but it’s short enough that I can repeat it here: Enter a number to represent a true value being measured, and copy it down for 1000 rows in Excel. Enter a column of random numbers, and another column with those columns added together. In the next column, round the numbers to the nearest whole number. The average of all the rounded numbers will be very close to the true value.

If you have only one measurement, your error range is very wide. If you have 1000 measurements, you can get very, very close to the true value.
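For readers without a spreadsheet handy, here is a rough Python equivalent of the experiment described above (the "true value" and the size of the random errors are placeholders of my own, not Steve O's actual sheet):

```python
# Rough Python equivalent of the spreadsheet experiment described above.
# The true value and error range are placeholder assumptions.
import numpy as np

rng = np.random.default_rng(5)
true_value = 52.3                           # arbitrary "true" value
errors = rng.uniform(-1.0, 1.0, size=1000)  # symmetric random measurement errors
measurements = true_value + errors
rounded = np.round(measurements)            # record to the nearest whole number

print(f"true value:             {true_value:.3f}")
print(f"mean of rounded values: {rounded.mean():.3f}")
```

With a symmetric error of that size, the average of the rounded readings typically lands within a few hundredths of the true value, which is the behaviour reported here; the catch pressed by other commenters is the assumption that the errors are symmetric, bias-free and independent in the first place.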

Reply to  Clyde Spencer
December 23, 2018 12:45 am

Steve O,
You said:
“Obviously, systematic errors will not be corrected, but that’s another issue.”
It is not another issue, it is central to the questions at hand.
No one here disputes statistical mathematics.
We are concerned with what the actual temperature is, whether it is known to a high degree of certainty for the entire planet back in the 19th century, and whether it is changing and, if so, by how much.
We know it changes.
We do not know the actual values, we only know what some measurements said it was.
Measuring the air on different days in different places is not measuring the same thing.
You keep ignoring that!
There may or may not be a such thing as a global average temperature, but if there is, it is not a measure that means what people think it means, and there are not enough data points to know what it was.
You have to ignore a lot to be a warmista, but to be a scientist you have to take everything known into account.
You are ignoring a lot here Steve.
Why?

Clyde Spencer
Reply to  Clyde Spencer
December 25, 2018 10:58 am

Steve O
You said, “And in one year, you have 365 readings.” You have 365 DIFFERENT readings of 365 DIFFERENT items. That is analogous to taking 365 pieces of a puzzle, weighing them, and reporting the average weight of all the pieces. Each measurement has associated with it the error and uncertainty of the balance used to measure the pieces. The Standard Deviation can be calculated, and as a sanity check, estimated from the Empirical Rule. That tells you something about the probability of another piece of the puzzle having a weight within the range of the original 365 samples. However, the mean weight is of questionable practical value. The uncertainty of the individual weights is nominally the same for all pieces. The Standard Error of the Mean may give you a sense of having improved the precision. But, what is really important here is the range of values, because the 2 SD error bar tells you that about 95% of the samples fall within that range.
Compare that procedure with taking 365 measurements of the diameter of a precision ball bearing. It is the same, unvarying diameter (except for possible negligible ellipticity), measured within a short enough time period that other issues like instrumental calibration drift can be ignored. That is to say, almost all the variance can be attributed to random error, not systematic changes in the measured object or the measuring instrument, and it is reasonable to expect that the random errors will cancel. Thus the justification for dividing by the square root of the number of measurements: a large number of measurements will cancel more effectively than just a couple.
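A sketch of the contrast being drawn here, with illustrative numbers of my own: (a) one ball bearing measured 365 times with random instrument error, versus (b) 365 different puzzle pieces, each weighed once with the same size of error.

```python
# Sketch of the contrast above (illustrative numbers only):
# (a) one ball bearing measured 365 times with random instrument error, vs.
# (b) 365 different puzzle pieces, each measured once with the same error.
import numpy as np

rng = np.random.default_rng(6)
n = 365
instrument_sigma = 0.5   # assumed random error of the instrument

# (a) Same object, repeated measurements: the mean homes in on the true value.
true_diameter = 10.0
bearing_readings = true_diameter + rng.normal(0, instrument_sigma, n)
print(f"bearing: mean = {bearing_readings.mean():.3f}, "
      f"sigma/sqrt(N) = {instrument_sigma / np.sqrt(n):.3f}")

# (b) Different objects, one measurement each: the mean estimates the
# population average, while each individual piece is still only known to
# about +/- instrument_sigma.
piece_weights = rng.normal(3.0, 1.0, n)   # assumed spread of the pieces themselves
piece_readings = piece_weights + rng.normal(0, instrument_sigma, n)
print(f"pieces:  mean = {piece_readings.mean():.3f}, "
      f"spread of readings (1 SD) = {piece_readings.std(ddof=1):.3f}")
```

In case (a) the σ/√N logic applies directly; in case (b) the mean is a population statistic, and the spread of the readings, not the standard error of the mean, is what describes any individual piece.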

Reply to  Clyde Spencer
December 25, 2018 12:10 pm

Steve O –> “If you have only one measurement, your error range is very wide. If you have 1000 measurements, you can get very, very close to the true value.”

No, you cannot. That is where you don’t understand what you are calculating. You can take one million readings and average them out to 10 decimal places. You are calculating the mean to a very accurate number, with a concurrently small uncertainty of the mean. THAT IS NOT THE TRUE VALUE except in very specific circumstances.

As Clyde said, “But, what is really important here is the range of values, because the 2 SD error bar tells you that about 95% of the samples fall within that range.”

The standard deviation is very important. I know you and others think you can average days to months, months to years, and years to decades and tell whether the average is increasing or decreasing. What we are trying to tell you is that you can’t make that assumption for a number of reasons.

The range of values determines how far from the mean the actual value can be. As you move to 2 and 3 standard deviations, you increase the range within which any measurement can fall. As you move to more and more SDs you approach the limit of the original error. Consequently, you must quote the mean with an error that ultimately is the measurement error, i.e. +- 0.5 degrees.

Look at my post below describing target shooting. Do you really think the mean will tell where the rifle described will shoot? Ultimately, you can only describe a circle that tells you where the errors are.

Steve O
Reply to  Clyde Spencer
December 27, 2018 4:54 am

Clyde, sorry it took me so long to get back on this. You might not be there anymore. Anyway, regarding your first article, I see a couple of problems.

Nobody cares what the temperature is on any particular day. When people reference “the temperature in 1850” nobody ever picks out one particular day. They are referring to the average temperature for the year. (Actually, the mid-range of two measurements, which is only a proxy, but it’s easier to just say “temperature.”)

You say that to determine the temperature for a day, you should have continuous measurements, and integrate under the curve, which is true. How much different is that from taking 365 daily measurements to estimate the average temperature for a year? Instead of taking 3,600 measurements to determine the average temperature of a day, you are taking 365 mid-ranges to determine the average temperature of a year.

If you do that in 1850 and again in 1950, and use this method to estimate a change in average temperature, how much different will your answer be if you plotted a curve and applied calculus to the values in 1850 and again in 1950?

You also have a math error. Mathematically, it does not matter if you have rounded off each day’s measurements to the nearest degree.

tty
December 21, 2018 2:52 am

It might be helpful to repeat the mathematical requirements for using multiple measurements to reduce uncertainty. Very briefly, they only apply to random errors of independent, identically distributed measurements of the same value.

1. Errors must be random, more measurements have no effect whatsoever on systematic errors

2. Measurements must be independent, i. e. measurements must not have any influence on each other. This for example makes repeated measurements by the same person dubious since his memory of earlier measurements may affect his readings.

3. Measurements must be identically distributed, i.e. they must have the same distribution function. This means that measurements with different methods can only be pooled if it is known that they have the same distribution (normal or whatever)

4. The same value must be measured, for example in the credit-card measurement above the temperature of the card mustn’t change during the measurement. It need hardly be pointed out that a moving satellite never measures the same value twice.

Steve O
December 21, 2018 4:54 am

“Regardless of the number of measurements, you can’t squeeze more than one additional decimal out of an average of real-world observations.”

I’m not sure this is correct.

Let’s say that we know from another instrument that the true length of the card is 85.7525 mm +/-0.0025 mm. Out of 1,000 people, we have 750 who give us 86mm and 250 who give us 85mm. Our estimate for the card length would be 85.75.

If we repeated the experiment with more cards and more 1,000’s of people we could even determine if there was a systematic error we needed to adjust for, and we could establish error bars around our point estimate. But as long as the measurement errors are randomly distributed, you can add accuracy by adding measurements.

If the next card had an actual length of 86.05mm, the reported observations might be centered around 86mm, with 900 people reporting 86mm, 49 reporting 85mm, and 51 reporting 87mm.

When Kip posted a column about temperature readings being only to the nearest degree, he also claimed a limit in the accuracy of estimating the true value. I created a mathematical experiment in Excel, laying in random “measurement errors” around a true value. The measurements themselves were then rounded to the nearest whole number. An average of the observations was still correct within several decimal points. Rounding to the nearest whole value had almost no impact on the accuracy of the estimation of the mean.

Reply to  Steve O
December 21, 2018 5:32 am

It seems that a percentage of people have the view that a device’s measurement resolution can be increased by just doing many iterations of the same process of measurement.
This ignores that the device has a resolution limit.
Using this logic, we could use yardsticks to measure the length of ants by just repeating the measurement a whole bunch of times.
Or maybe I am missing something.

A separate issue involves conflating such things as a measurement of temperature of the air and a series of random numbers.
I do not think this is valid.
This article may be relevant:

“Random errors are errors in measurement that lead to measurable values being inconsistent when repeated measurements of a constant attribute or quantity are taken. Systematic errors are errors that are not determined by chance but are introduced by an inaccuracy (involving either the observation or measurement process) inherent to the system.[3] Systematic error may also refer to an error with a non-zero mean, the effect of which is not reduced when observations are averaged.”

And:

“Random error (or random variation) is due to factors which cannot or will not be controlled. Some possible reason to forgo controlling for these random errors is because it may be too expensive to control them each time the experiment is conducted or the measurements are made. Other reasons may be that whatever we are trying to measure is changing in time (see dynamic models), or is fundamentally probabilistic (as is the case in quantum mechanics — see Measurement in quantum mechanics).”
https://en.wikipedia.org/wiki/Observational_error#Systematic_versus_random_error

Steve O
Reply to  Menicholas
December 21, 2018 9:11 am

“It seems that a percentage of people have the view that a device’s measurement resolution can be increased by just doing many iterations of the same process of measurement.”

Andrew Preece has a comment above that explains how this is done — not with a thought experiment, or on a spread sheet, but how electronics engineers do this in real world applications.

tty
Reply to  Steve O
December 21, 2018 10:00 am

Andrew’s comment deals with a situation where the measured value varies randomly around a fixed value, with an amplitude greater than the precision of the measuring device. In this case it is possible to narrow down the uncertainty, but only if you know that the average really is fixed, that the variation really is random, and you also know the distribution of the errors (it doesn’t need to be normally distributed, but it must be known).

Reply to  Steve O
December 21, 2018 6:28 am

Steve O –> You are missing several attributes of the problem. First, your ‘random’ numbers probably have a normal distribution; no one doubts that averaging works in that case. Second, you are missing Kip’s point. You don’t have 1000 people line up to read a thermometer. You have one person who read it once and, in the past, rounded it to the nearest degree. Please tell us how you can determine what the real temperature was when the accuracy was +- 0.5 degrees. I suspect you’ll realize that you won’t be able to know if the temperature was 50.5 or 49.5 degrees or anywhere in between.

The same thing applies when you begin to average temperatures that have been read once and rounded. If Tmax is 55.0 +- 0.5 and Tmin is 40 +- 0.5, what is the true average? Is it (55.5 + 40.5)/2 = 48, or (55.5 + 39.5)/2 = 47.5, or (54.5 + 39.5)/2 = 47. Or worse, anywhere in between?

Look at the range you get. It is somewhere between 47 and 48, i.e. the nearest degree, and you have no way to decide where it actually is. This same error carries through any number of temperature readings, of daily, monthly, or yearly averages.

Steve O
Reply to  Jim Gorman
December 21, 2018 8:38 am

My errors don’t need to be normally distributed. As long as there is no bias or skewness in the distribution, the errors will be symmetrically distributed and everything will work.

If you take five minutes to run your own experiment, you’ll see that Kip was wrong and you’ll be able to figure out exactly how you get greater accuracy starting with data readings that are rounded to the nearest whole number.

In Kip’s case of reading the historical temperatures, you don’t have 1,000 people reading each thermometer, but you do have a reading for every single day, and you have thermometers at multiple locations. It is only true that you do not have very good accuracy for any one location on any one particular day. The rounding of measurements has almost no impact on your ability to determine the true value to within a decimal point. His point was that rounding the readings and restricting the data to whole numbers degraded the accuracy and limited the precision in determining the true value, and that’s not correct.

So, open up Excel and copy down 1000 instances of any “true value” you choose. In the next column, create measurement errors between 1 and -1 using a random function, and in the next column add the two columns together to come up with 1000 measurements. In the next column, round the measurements to the nearest whole number.

Now you have 1,000 incorrect measurements, all rounded to the nearest whole number. How close do you think the average of all your bad measurements will be to the true value?

tty
Reply to  Steve O
December 21, 2018 10:04 am

For your information climate time series are almost never normally distributed and are often skewed to some extent (and usually rather strongly autocorrelated).

Paramenter
Reply to  Steve O
December 21, 2018 2:48 pm

Now you have 1,000 incorrect measurements, all rounded to the nearest whole number. How close do you think the average of all your bad measurements will be to the true value?

I’ve followed the procedure and it does work nicely for me: mean of incorrect measurements is very close to the mean of true values. However, this is because randomizers usually generate a sample (or samples) from the ‘standard normal’ distribution. When I use random numbers generated by algorithms used for common cryptography it does introduce a detectable error.

Steve O
Reply to  Paramenter
December 22, 2018 8:54 pm

That’s a perfectly valid observation. If there is a skewness in the errors, or any systematic bias, then that error will not be corrected.

Reply to  Steve O
December 21, 2018 2:54 pm

You are dealing with numbers and how to handle them. I am dealing with measurements.

You are ignoring the questions I asked about the temperature measurements. The numbers there are very pertinent to this.

Please answer my questions if you can.

Then you may realize what the problems are.

Paramenter
Reply to  Jim Gorman
December 21, 2018 3:22 pm

I reckon what Steve is trying to say is that, even if single measurements contain significant uncertainty, once the sample size increases this uncertainty diminishes.

The same thing applies when you begin to average temperatures that have been read once and rounded. If Tmax is 55.0 +- 0.5 and Tmin is 40 +- 0.5, what is the true average? Is it (55.5 + 40.5)/2 = 48, or (55.5 + 39.5)/2 = 47.5, or (54.5 + 39.5)/2 = 47. Or worse, anywhere in between?

I would say then it is 47.5 +/- 0.5.

What if you have, say, 1000 such measurements each with +/- 0.5 deg accuracy and the trend is +0.3? Is this trend detectable or not?

Reply to  Jim Gorman
December 21, 2018 6:30 pm

Para –> The problem is that you’re not measuring the same thing 1000 times. You are measuring temperature a day later. This is a brand new measurement of a brand new thing. You can’t average them and say you have a more accurate measurement of either. The measurement error must be carried through.

Climate scientists use 47.5 and drop the +-0.5. This immediately makes one assume the data is more accurate than it really is.

To answer your question, you can’t recognize a trend that is less than the error. However, climate scientists do since they don’t even acknowledge the measurement errors! Too many deal with numbers in a computer and believe these are real. They are not.

Steve O
Reply to  Jim Gorman
December 22, 2018 9:38 pm

When you take the measurement of a temperature, it is recorded as a number. I don’t know how to get around that. I don’t understand why such a distinction is important.

“You have one person who reads it once and in the past, rounded it to the nearest degree. Please tell us how you can determine what the real temperature was when the accuracy was +- 0.5 degrees. ”

You don’t have one person who reads it once in the past. You have one person who read it once each day for 365 days over the course of a year. 3,650 times over the course of 10 years. You do not know the temperature on any particular day, but you DO know the average of your readings, even if the readings are rounded to the nearest degree. As far as knowing the average temperature on any particular day, you’ll need to provide me readings throughout the day, which don’t exist. Nobody knows what the temperature was on any particular day, but it’s also not important. If you have a reading taken at the same time each day, a lot of errors will wash out. You can determine if there is any trend in the data over time, to within a tenth of a degree, even if you take only one reading each day (at the same time) over the course of long period of time.

When errors are symmetrically distributed, the errors can be eliminated with multiple sampling.
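A rough simulation of the yearly-average argument above; every number in it is a placeholder of my own (an assumed seasonal cycle and an assumed symmetric reading error), not station data.

```python
# Rough simulation of the yearly-average argument: 365 daily "true" values,
# each read with a symmetric error and logged only to the nearest degree.
# All numbers are placeholder assumptions, not station data.
import numpy as np

rng = np.random.default_rng(7)
days = np.arange(365)
true_daily = 15 + 10 * np.sin(2 * np.pi * days / 365)  # assumed seasonal cycle
read_error = rng.uniform(-0.5, 0.5, size=365)          # assumed symmetric reading error
recorded = np.round(true_daily + read_error)           # logged to the nearest degree

print(f"average of the true daily values:    {true_daily.mean():.2f}")
print(f"average of the rounded observations: {recorded.mean():.2f}")
```

The close agreement here leans entirely on the assumed symmetric, independent reading error and a stable instrument; the systematic errors and non-i.i.d. behaviour raised by tty, Jim Gorman and Clyde Spencer are precisely what this toy model leaves out.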

R Percifield
Reply to  Steve O
December 21, 2018 6:35 pm

Dithering is not about the randomness of the signal but about a dithering signal whose amplitude is greater than the resolution of the measuring device. If your resolution is 1 mV, then your random dithering would have to vary the signal randomly with some known distribution and amplitude (e.g. 2-5 mV); the larger the signal, the more resolution you would achieve. You would also have to have this occur at a frequency much greater than your sampling frequency to meet the Nyquist criterion. If you want a 1 kHz sampling rate (1 reading every millisecond) and you want to use a dithering average of 200 samples, the effective sampling rate would need to be 200 kHz, and the dithering would have to be at 200 kHz or greater, or you would not be doing a full dithered sample of a value which you hope isn’t changing during the 200 samples. Also, the 200 samples are not independent; they are of the same value with a displacement of time and random signal. For dithering to work, the value must not change.

This additional overhead is why I do not use this method in my systems. We cannot afford the processing power to make this work. And since we were talking about measuring signals without any dithering applied at the time of measurement, we are still limited by the base resolution of the measurement itself. Thus, my contention still stands that averaging a group of purely independent measurements with both time and spatial displacement does not give you any more accuracy than the base resolution used.

Clyde Spencer
Reply to  Steve O
December 25, 2018 11:02 am

Steve O
The problem with your Excel ‘experiment’ is that you are starting with a single value instead of many values.

Steve O
Reply to  Clyde Spencer
December 26, 2018 9:23 am

“The problem with your Excel ‘experiment’ is that you are starting with a single value instead of many values.”

It only takes a few minutes to change the experiment so that you start with a set of values. At the end of the experiment, you will still be able to determine the numerical average of the set. This would more closely resemble an estimation of the average temperature of a year where you have 365 mid-range estimates based on two daily readings.

Or, you could say it applies to the average location of a chicken in a yard. Your estimated location can be off by 20 feet, and the chicken can be moving all over the yard, and you’ll be able to determine the “average location” of the chicken. You can even round off the last digit in order to make your inaccurate measurement less accurate and it will hardly make any difference.

Paramenter
December 21, 2018 5:18 am

Hey Willis,

Good stuff. This subject has been discussed here several times in different forms. I reckon the belief that ‘things will average out when the sample size increases’ is widespread in the climate science community. It’s a convenient way of hiding all the uncertainties and measurement errors.

I would echo a question posted earlier: how does the situation look with respect not to averaging direct measurements, but to averaging the averages of those? So, say you’ve got the mean of 1000 measurements, then another mean from another 1000, and so on. Does it change much?

knr
Reply to  Paramenter
December 21, 2018 6:27 am

To be fair, that is easier when you start with the result you ‘need’ and work towards it.

Julian
December 21, 2018 5:44 am

Willis

In a previous post on this blog there was a mention of mixing satellite altimetry data and terrestrial tide gauge data.

In my day-to-day life I work with tide gauges, from older mechanical gauges with rubber diaphragms to the state-of-the-art electronic gauges available today. Measuring tide (especially offshore) was always one of the largest components of my error budget.

Disregarding the geodesy and its associated errors, as a starting point the instruments I use claim a precision of 1.0 mm and an accuracy of +/- 10 mm in optimum conditions; however, in the real world we never see that.

My question is: how can ‘scientists’ see sea level rises of 3 mm per year?

knr
Reply to  Julian
December 21, 2018 6:25 am

They can’t; they ‘model’ them,
for no other reason than the very limited number of measurements compared to the vast area to be measured. And that is before we get to allowing for errors and other factors.
We are once again in an area where ‘that will do’ was the norm, because that was often all that was needed. Now there are claims of unquestionable accuracy, and demands for fundamental changes based on those claims, while the data-gathering process has mostly remained at the ‘that will do’ standard.
It is still the case that they cannot make a weather prediction for more than 72 hours ahead worth much more than ‘in the summer it is warmer than in winter’, hence why they still get it wrong so often. Add that issue to the data collection problems and you can see that the best ‘stunt’ they ever pulled was to get so many to buy into the idea that climate ‘science’ is ‘settled science’ in the first place.

Tom
December 21, 2018 6:16 am

Willis,
This discussion brings to mind the old adage: “It’s easy to read a thermometer, it’s difficult to measure temperature”. A corollary might be: “It’s easy to read a ruler, it’s difficult to measure length”. This is partially because we’re co-mingling more than one task. With more precise measuring tools, we would find that there is no single value for the ‘length’ of the credit card. Its edges are not exactly parallel nor are they exactly straight, along both their lengths and along their thicknesses. This makes a single value for the precise length meaningless. So now we must add the dreaded – statistics.

No number of multiple readings will make an instrument more accurate than its inherent accuracy, nor more precise than its inherent precision. (I include the ability of an observer to interpolate as part of inherent accuracy.) However, multiple observations of an instrument CAN, indeed, improve statistical summaries of its data. Torturing data sets with statistics can also make the results meaningless. We see that a lot from Climate “Scientists”.

This brings about another corollary: “It’s easy to calculate an average of a set of numbers, it’s difficult to know what it means.” There is, indeed, an average temperature of earth. I doubt that anyone has come even close to determining what it is. It’s too bad we can’t simply place a rectal thermometer in Washington DC, and read it.

Reply to  Tom
December 21, 2018 6:14 pm

You can put the rectal thermometer anywhere in DC and get the global temperature. If “global temperature” means anything, it means the whole globe, anywhere on it, whatever point you want. It also means you really need only one thermometer to measure it, anomalies included.

December 21, 2018 7:52 am

Clearly the card is 3 3/8 inches. That makes it 85.725 mm. /sarc

If all the measurements use the same units and the same rounding rule, your standard deviation would be zero, implying a very precise measurement. If you randomly choose the rounding rule and calculate the stats, then all you know is the mean and standard deviation of the choice of rounding rule. On top of real data it looks like bias, which is the problem I have when looking at year-over-year temperature trends.

Editor
December 21, 2018 8:12 am

No matter how many measurements are taken with a ruler marked to whole millimeters and recorded as whole millimeters, when one averages the records, one gets the Mean of the Measurements.

The Mean of Measurements can be considered accurate to 1/10th mm.

However, the Mean of Measurements must not be claimed to identify the actual length of the credit card. The Mean of Measurements only refers to the measurements taken — not to the physical object being measured.

For data sets of measurements such as Sea Surface Height — very complex systems involving numerous conversions of electronic signals into ‘distances’ and multiple confounding factors — knowing that the accuracy and precision of your results apply only to the averages themselves, and not to the physical thing (the distance from the satellite to the sea surface), is extremely important.

Reply to  Kip Hansen
December 21, 2018 10:11 am

Realistically, there is no way to measure the “physical thing”, but our measured impressions of the physical thing are a practical necessity — that’s why we measure — to attempt to synchronize human consciousness with a reality that we can never really grasp, but with which we can interact in a more controlled fashion, using our measurements of it.

That being said, we should know the limitations of our measurements — we should know when they are anywhere near real or somewhere in the realm of made-up or overly stated in terms of precision.

I still find the measurement of sea level from outer space mysterious. Where’s the reference point? How can we establish that any reference point in space stays in its place? I mean, the Earth is moving constantly in all sorts of ways. How do you determine a fixed place in space with respect to a constantly moving Earth — star-field background, maybe? Even so, there still seem to be lots of movements to take account of in order to establish that fixed reference point. And if the Earth is moving with respect to that eventually established reference point, then are you still measuring a sea-level height, or the distance the Earth has drifted?

Seems like lots of room for error.

tty
Reply to  Robert Kernodle
December 21, 2018 12:32 pm

Coordinates based on the Earth’s center-of-mass. These are known to a high degree of accuracy, though not to millimeter precision.
NASA is planning a new series of small geodetic satellites (GRITSS) that will hopefully improve the accuracy of the coordinate system to about 1 mm, i.e. about 10 times worse than the precision already claimed for the sea-level measurements.

Interestingly satellite measurements of arctic ice thickness are universally acknowledged to have decimeter precision at best, despite being vastly simpler than sea-level measurement since they measure the difference in altitude between the top of the ice floes and the intervening leads, and so are unaffected by atmospheric effects and independent of the absolute level of the sea.

December 21, 2018 10:00 am

Einstein’s doctorate, and five papers on Brownian motion, used statistics together with physics to prove the existence of molecules, which at the time were immeasurably small.
https://www.britannica.com/science/probability-theory/Brownian-motion-process#ref407453
Einstein, who never accepted statistics as natural law, was indeed the expert there, and as a physicist was not limited by measurement accuracy, precision, repeatability…

Considering radar propagation through the atmosphere and ionosphere, is there data available from Jason-x-type satellites over at least one full 22-year solar cycle? Older data might not have GPS info.

Dr. Strangelove
Reply to  bonbon
December 21, 2018 8:23 pm

Brownian motion follows the normal distribution. The probability density of the normal distribution is

P(x) = A (2\pi\sigma^2)^{-1/2} e^{-(x-\mu)^2/(2\sigma^2)}

where A = normalization factor, σ = standard deviation, μ = mean.

Einstein replaced the probability density P with the mass density ρ, substituted σ² = 2Dt, μ = 0, A = N, and expressed it as a function of two variables (x, t):

\rho(x, t) = N (4\pi D t)^{-1/2} e^{-x^2/(4 D t)}

where N = number of particles, D = mass diffusivity, t = time.

Solving this equation enabled Einstein to determine the number of atoms in a mole, and thus Avogadro’s constant.
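As a hedged aside for readers who want the missing step: the link from D to Avogadro’s number is the textbook Stokes–Einstein relation, which is not spelled out in the comment above,

D = \frac{RT}{N_A \, 6 \pi \eta a} \qquad \Longrightarrow \qquad N_A = \frac{RT}{6 \pi \eta a D}

where R is the gas constant, T the absolute temperature, η the fluid viscosity, and a the particle radius; measuring D from the observed mean-square displacement ⟨x²⟩ = 2Dt then fixes N_A.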

Bill McCarter
December 21, 2018 1:54 pm

This is why I threw out all my Starret mics, and just use a wooden yardstick in the machine shop.
All I have to do is keep measuring with that yardstick and I can get whatever accuracy I want. Works great on crankshafts … to 100 thou, no problem. /sarc

Clyde Spencer
December 21, 2018 1:54 pm

This is all rather discouraging! From my long-time reading of WUWT, I conclude that the commenters are generally bright, well-educated, and technically experienced. Yet, we can’t reach agreement on whether or not a large number of readings of some constant value, let alone a variable, can improve the precision of the estimate of the mean of the many readings, or reduce the standard deviation of the samples. What to do! Can we get William Briggs to weigh in on this?

tty
Reply to  Clyde Spencer
December 22, 2018 2:30 pm

The answer unfortunately is “it depends”.

If the value is really constant, the errors are random, and the measurements are independent and identically distributed, then it does. Otherwise it doesn’t, or at least the improvement is not proportional to the square root of the number of measurements.

Clyde Spencer
Reply to  tty
December 22, 2018 3:33 pm

tty
Personally, I agree with your assessment. However, my concern is how do we convince those who disagree with us that they are wrong?

Reply to  Clyde Spencer
December 23, 2018 12:25 am

Very discouraging, Clyde.
Steve O in particular just will not get away from his conviction that measuring an ever changing value with a device, and doing so over many days in many places, will somehow wash out all errors and give you a number that very closely describes a trend over the time period in question.
I have started to read the posts you linked to, the articles you wrote here over the past couple of years on this topic, and am now swimming through the long comment sections.
In addition to what we have from Steve O, I have noted that some people claim that different temperature measurements in multiple places, taken on multiple days (and even in-filled fake numbers) over some period of time, are not measuring something different each time…they are measuring “global average temperature”.
And since they give it that name, this magical number is “one thing”, that can be precisely and accurately determined with a long series of not particularly precise measurements of unknown accuracy…as long as we have lots and lots of them!

I refer back to my analogy of weighing cats, and claiming that you now know how much “cats” weigh if you do it lots of times over a long period … even though they are different cats, and sometimes the same cat on a different day … or not. And they are not holding still while you weigh them; they are just running around and sometimes running over one of the scales you have set up in cat town.
But the weight of cats may hover around a fixed value when averaged over many cats and many years. It may not, but it may. But can you say after a hundred years that you know what cats weigh to the nearest 1/100th of a gram and if there is a trend of 0.01 grams per decade?
Even if your scale only measured to the nearest pound?
Even if the scale outside the Midnight Cat Café was replaced with a laser scanner in 1994, and then moved to the Hissing Fence, which now has the dumpster of a fish market next to it?
*sorry, sometimes you just have to find something to lighten up the mood*

Everything I have learned tells me that the LLN (law of large numbers) applies to measuring the same thing, and we have several eloquent explanations of why it is simply not applicable to the subjects at hand, even if someone gives the subject a name that implies it is one thing.

Steve O
Reply to  Menicholas
December 23, 2018 7:29 am

What if we change what we say we are measuring? We’re not measuring the average daily temperature by taking a high reading and a low reading. Instead, let’s say that we’re measuring the average temperature for a one year period — by taking 2 measurements each day for 365 days.

Do we really need to take 240 measurements each day or can we get away with taking high and low readings and using the mid-range in our calculations?

Reply to  Menicholas
December 23, 2018 11:23 am

I have no idea why you ignore what everyone else is saying Steve.
No one is ignoring what you are saying.
And in your own words, you confirmed that Clyde is correct.
I am coming to the conclusion that you may well be a warmista troll.

Steve O
Reply to  Menicholas
December 26, 2018 9:56 am

“And since they give it that name, this magical number is “one thing”, that can be precisely and accurately determined with a long series of not particularly precise measurements of unknown accuracy…as long as we have lots and lots of them…”

Maybe this is the core of the disagreement. What makes an average temperature a magical thing? Does it not actually exist in reality? Why does the math not apply?

If you could measure the surface temperature at each square centimeter, simultaneously, around the entire globe, with precise instruments, could we not determine the average value with a great degree of accuracy? If we reduce the number of thermometers by a factor of 1,000, so that we only measure each square foot, will our accuracy decline appreciably? No. If we rounded each thermometer’s reading to the nearest degree, would the average of them change? Not by very much. If we added random errors to each thermometer, would our average of the readings change? Not by very much.

Getting within a certain level of accuracy is a matter of the number of readings you take. Take enough readings, in enough places, and you can determine the average to whatever degree of accuracy you desire. Or start with the number of readings that are available, and you can determine the level of accuracy you have achieved.

I also do not see why it matters that the temperature changes over the course of a year. You can measure the average location of a chicken in a chicken coop, with measurement errors that are also rounded off, and you’ll still be able to determine the “average location” of the chicken.

The same is true with the average weight of the cats of the world. With enough measurements, you can indeed know the average weight to within 1/100th of a gram. And if there is a trend of 0.01 grams per decade, with enough measurements you’ll be able to detect it.
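For scale, here is a back-of-envelope sketch in Python of what "enough measurements" would mean for the 0.01-gram-per-decade example, assuming the readings are independent and free of systematic bias (which is exactly what is disputed downthread); all numbers are hypothetical.

```python
# Rough sketch: how many independent, unbiased readings would be needed to
# resolve a 0.01 g/decade trend with a scale that only reads to the nearest pound.
# Standard error of a least-squares slope: se = sigma / (sd_x * sqrt(N)).
import math

sigma = 227.0 / math.sqrt(3)   # sd of a +/- half-pound (~227 g) rounding error
sd_x = 100.0 / math.sqrt(12)   # spread of the time axis for readings over a century (years)
slope = 0.001                  # 0.01 g per decade, expressed per year

N = (3 * sigma / (slope * sd_x)) ** 2   # N that puts the slope ~3 standard errors above zero
print(f"{N:.1e}")                       # on the order of 2e8 readings
```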

Reply to  Menicholas
December 27, 2018 6:17 am

Not if the variance from the true value exceeds the trend.

Reply to  Menicholas
December 27, 2018 6:20 am

Keep in mind that at no time were all the cats in the world measured, and there was no systematic effort to make sure that the cats measured were truly representative of the average of all cats in the world.
Let alone any assurance that the subsequent measurements tracked how the true average for all the cats in the world was changing over time.

Pat Lane
December 21, 2018 5:18 pm

Here’s a simple thought experiment based on the method used by Morice, Kennedy, Rayner and Jones in “Quantifying uncertainties in global and regional temperature change using an ensemble of observational estimates: the HadCRUT4 data set”.

The maximum and minimum temperatures for a given day can be averaged to give a daily mean temperature.
Calculate the uncertainty as the standard deviation of the daily means. For the sake of argument say it’s 0.5 degrees Celsius. Half a degree? That’s not very accurate.

To get the average monthly temperature, note that there are about 30 days in a month, and two measurements per day. That’s 60 measurements. Divide the daily-mean standard deviation by SQRT(60). That gives us an uncertainty of 0.5/SQRT(60) = 0.065 degrees.
Now divide by SQRT(12), since there are 12 months in a year. That reduces the uncertainty of the yearly mean to just 0.019 degrees.
Why stop there? We can keep going and calculate a decadal mean. Since there are 10 years in a decade, divide by SQRT(10). Now the uncertainty is 0.0059 degrees.
By this same line of reasoning we can divide by SQRT(10) again, and we now know the average temperature over the past century to an amazing 0.0019 degrees Celsius.
None of this required any equipment more complicated than a pocket calculator. I’m old enough to be able to do the calculations by hand with pencil and paper. (Yes, I was taught how to extract square roots in primary school.)
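For anyone who wants to check the arithmetic, a short Python sketch reproducing the (deliberately naive) divide-by-square-root chain above, nothing more:

```python
# Pat Lane's divide-by-sqrt chain, step by step.
import math

u_daily   = 0.5                           # assumed std dev of daily means, deg C
u_month   = u_daily / math.sqrt(60)       # ~0.065  (30 days x 2 readings)
u_year    = u_month / math.sqrt(12)       # ~0.019
u_decade  = u_year / math.sqrt(10)        # ~0.0059
u_century = u_decade / math.sqrt(10)      # ~0.0019
print(u_month, u_year, u_decade, u_century)
```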

tty
Reply to  Pat Lane
December 22, 2018 2:40 pm

“The maximum and minimum temperatures for a given day can be averaged to give a daily mean temperature.”

However, this is not the mathematically correct daily mean temperature. That can only be determined by integrating the temperature curve over the given day, and it often will not be even close to the average of the maximum and minimum temperatures. The latter definition is, quite honestly, solely motivated by the fact that many weather stations have had maximum and minimum thermometers for a long time, so this data is available while the integral is not. And even that value depends on when the thermometers are read, hence the need for TOBS corrections (which is one correction that is actually defensible, though it will of course increase the uncertainty even more).

Clyde Spencer
Reply to  tty
December 22, 2018 3:43 pm

tty and Pat Lane,
Yes, this has been gone over before. The value computed from TMax and TMin is at best a median, more properly called a mid-range value.

https://en.wikipedia.org/wiki/Mid-range

A mid-range has none of the characteristics of a true arithmetic mean, and as tty correctly points out, may not even be close to a true mean as derived from many discrete observations or the integration of a function.

Therefore, any subsequent calculation of the means of daily medians, represented as a mean of means, is a misrepresentation. At best, one is looking at the distribution of daily medians. This is the state of the ‘science’ of climatology.

Steve O
Reply to  tty
December 23, 2018 7:19 am

If you feel better calling this a mid-range value, then I don’t see any harm. But I also don’t believe it changes anything. Whatever difference there may be between the mid-range value and the area under the curve of a continuous reading throughout the day is not important. If, because of the warming and cooling pattern, the average difference is 0.487 degrees to one side or the other, it doesn’t have any impact as long as we are always using the mid-range value. And if we mis-name the value, calling it an average or a mean, it may be a bit sloppy, but I don’t see how it changes anything.

Dr. Strangelove
December 21, 2018 7:27 pm

Willis
Assuming the lengths of credit cards have a normal distribution, you can predict, but not measure, lengths smaller than the 1-mm resolution of the ruler.
Let:
μ = mean length, σ = standard deviation, x = 10 = the multiple of the standard deviation
Since 86 and 85 mm are the upper and lower limits, the mean is the midpoint of the limits:
μ = (86 + 85)/2 = 85.5

Upper limit:
μ + xσ = 86
85.5 + 10σ = 86 (Eq. 1)
Lower limit:
μ − xσ = 85
85.5 − 10σ = 85 (Eq. 2)
Solving for σ from Equations 1 and 2:
σ = (86 − 85)/(2 × 10) = 0.05

Taking 10,000 samples, all of them will have lengths greater than 85 and less than 86. You can predict the lengths L within 1 standard deviation of the mean (68% of the samples):
10,000 × (0.68/2) = 3,400 samples have a length
μ ≤ L ≤ μ + σ, i.e. 85.5 ≤ L ≤ 85.55
and another 3,400 samples have a length
μ − σ ≤ L ≤ μ, i.e. 85.45 ≤ L ≤ 85.5
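A quick numerical check of those counts, assuming (as the comment does) that the lengths really are distributed as N(85.5, 0.05); a sketch only, with the distribution itself being the comment's assumption rather than a measured fact.

```python
# Sample 10,000 lengths from the assumed N(85.5, 0.05) distribution and count
# how many fall within one standard deviation above and below the mean.
import random

random.seed(0)
samples = [random.gauss(85.5, 0.05) for _ in range(10_000)]

upper = sum(1 for s in samples if 85.5 <= s <= 85.55)   # ~3,400
lower = sum(1 for s in samples if 85.45 <= s <= 85.5)   # ~3,400
outside = sum(1 for s in samples if s < 85 or s > 86)   # essentially none at 10 sigma
print(upper, lower, outside)
```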

Reply to  Dr. Strangelove
December 22, 2018 11:31 pm

Dr. Strangelove, I don’t understand your example and calculation. If instead of x =10 in your example, I assume x = 50: then I must conclude from your math that the standard deviation of all measurements has jumped downward from 0.05 to 0.01. Since just four standard deviations (Z=4, or “4-sigma”) includes 99.997% of all data points in a normal Gaussian distribution, how would adding the few extra data points associated with 50 standard deviations (Z=50, or “50-sigma”) between 85.0 and 85.5, and then a few more data points between 85.5 and 86.0, change the 1-sigma (one standard deviation) value so drastically?

iron brian
December 21, 2018 8:52 pm

How different is Mike’s Nature Trick (TM) from comparing electronic measurements to historic primitive thermometer values read by eye? I call for a primitive-instrument revival, read in parallel so the two kinds of values can be compared.

iron brian

iron brian
December 21, 2018 9:14 pm

I used to model pump stations with ocean intakes, using tide heights tables to change the suction head for the pump intakes. The predicted tide height was only an estimate, and barometric pressure can surely change actual water heights, by storm surge for example.

Do the satellite measurements use tide tables? What is the resolution of their tide calculations?

iron brian

n.n
December 21, 2018 9:52 pm

NIST Technical Note 1297

The stated NIST policy regarding reporting uncertainty is (see Appendix C):

Report U together with the coverage factor k used to obtain it, or report uc.

When reporting a measurement result and its uncertainty, include the following information in the report itself or by referring to a published document:

A list of all components of standard uncertainty, together with their degrees of freedom where appropriate, and the resulting value of uc. The components should be identified according to the method used to estimate their numerical values:

those which are evaluated by statistical methods,
those which are evaluated by other means.

A detailed description of how each component of standard uncertainty was evaluated.
A description of how k was chosen when k is not taken equal to 2.

It is often desirable to provide a probability interpretation, such as a level of confidence, for the interval defined by U or uc. When this is done, the basis for such a statement must be given.

kribaez
December 22, 2018 4:40 am

Willis,

You wrote:-

“Regardless of the number of measurements, you can’t squeeze more than one additional decimal out of an average of real-world observations.”

There are numerous real-world examples where precision error is reduced to a negligible contribution by repeat measurements. The critical factor in determining whether precision error is reducible in this sense is the ratio of the range of the measurements to the error range that comes from measurement precision. If this ratio is less than or equal to one (as is the case in your credit-card example), then it is correct to say that precision error becomes irreducible in the average value, no matter how many repeat experiments you carry out. Otherwise, your statement is simply not true.

You can test this for yourself in just a few minutes on a spreadsheet or by generating a simple R script.

(1) Take 500 random samples from an N(0,1) distribution, and calculate the sample mean and variance.

(2) Round the sample values to the nearest integer value. You should find that nearly all of the values take an integer value between -2 and plus 2. You have now imposed a precision error on the sample dataset equivalent to a Uniform distribution on the interval (-0.5, +0.5). It is equivalent to measuring only to the nearest unit value.

(3) Calculate the difference between the sample mean from (1) and the mean of the rounded values from (2).

(4) Record the value in (3)

(5) Choose a new set of 500 random numbers, return to (1), rinse and repeat.

The difference in means which you calculated in (3) represents the difference between using precise measurements (of each realisation from the N(0,1) distribution) and using rounded measurements with a precision error of plus or minus 0.5. If you repeat the above numerical experiment enough times, you will find for the values above that these differences have a mean of zero and a standard deviation of just 0.01291. This latter value is the contribution of the precision error to the estimate of the mean value of the 500 measurements. You should note that (i) it is far smaller than the precision error itself and (ii) it is further reducible by increasing the number of sample measurements.
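For convenience, here is the same recipe as a short Python sketch rather than a spreadsheet or R script; the numbers match those stated in the comment, and the result only holds under its assumptions (independent draws and unbiased rounding).

```python
# kribaez's steps (1)-(5): compare the mean of precise N(0,1) samples with the
# mean of the same samples rounded to the nearest unit, repeated many times.
import random
import statistics

random.seed(42)
diffs = []
for _ in range(2_000):                                   # (5) rinse and repeat
    sample = [random.gauss(0, 1) for _ in range(500)]    # (1) 500 draws from N(0,1)
    rounded = [round(x) for x in sample]                 # (2) "measure" only to the nearest unit
    diffs.append(statistics.mean(sample) - statistics.mean(rounded))  # (3)-(4)

print(statistics.mean(diffs), statistics.stdev(diffs))
# mean ~0, standard deviation ~0.0129, i.e. about (1/sqrt(12))/sqrt(500)
```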

Reply to  kribaez
December 22, 2018 2:46 pm

Using your example of 500 random samples: do you really think that a glass thermometer in the late 1800s had 500 sample measurements for each recorded temperature? How about satellite measurements? Do we have 500 satellites taking the measure of a particular stretch of ocean at a time?

Take the ADC discussion above. In essence you have a one bit sample of 0 or 1 taken ONCE. From that you have to determine where the signal level actually is. How accurate would that really be?

Paramenter
December 23, 2018 4:51 am

Hey Steve,

“When you take the measurement of a temperature, it is recorded as a number. I don’t know how to get around that. I don’t understand why such a distinction is important.”

My understanding of that: Jim is saying that measuring one thing 1,000 times (as per Willis’s analogy with the credit-card length) is a different kind of fish from measuring a variable temperature signal twice per day for 1,000 days, where each measurement represents a different temperature. For the former, you may hope that the mean of the repeated measurements converges to the true value. For the latter, where each measurement has an associated uncertainty, say +/- 0.5 deg C, such hope may be questionable.

Steve O
Reply to  Paramenter
December 26, 2018 9:28 am

“Jim is saying that measuring one thing 1,000 times (as per Willis’s analogy with the credit-card length) is a different kind of fish from measuring a variable temperature signal twice per day for 1,000 days, where each measurement represents a different temperature.”

— I guess there is something here that I don’t understand. Why does that matter? Yes, a credit card is an unchanging value. Temperature is a changing value. Why is that an important distinction? The AVERAGE of the temperature over a period of time is not a changing value. At the end of a year, there are 365 measurements of a temperature that has changed over time. But there is only one average.

Paramenter
Reply to  Steve O
December 28, 2018 2:48 pm

Hey Steve,

I hope you had a glorious Christmas and you’re getting ready for New Year ball madness.

— I guess there is something here that I don’t understand. Why does that matter? Yes, a credit card is an unchanging value. Temperature is a changing value. Why is that an important distinction? The AVERAGE of the temperature over a period of time is not a changing value. At the end of a year, there are 365 measurements of a temperature that has changed over time. But there is only one average.

Again, my understanding is as follows: yes, you’ve got only one average, made out of 365 daily mid-range values. Still, the uncertainty associated with each measurement, say +/- 0.5 deg C, persists and does not disappear as you go along with the averaging. So if you’ve got a trend of, say, +0.3 deg C, that trend is not detectable given an uncertainty of +/- 0.5 deg, even if your average number is quoted as, say, 8.231 deg, i.e. to thousandths of a degree Celsius.

Steve O
Reply to  Paramenter
January 3, 2019 1:58 pm

” Still, the uncertainty associated with each measurement, say +/-0.5 deg C persists and does not disappear as you go along with averaging.”

Thanks, but the math says otherwise. As long as the +/- 0.5 deg C errors are symmetrically distributed, the error is reduced with sampling. When you are measuring the average temperature for an entire year, you have a LOT of measurements. The argument is that because the temperature is changing, the Central Limit Theorem does not apply, since you’re measuring something different each time. That argument is simply not correct.

A secondary argument is being made that there is a systematic error in the measurements. That’s another topic, and fails to refute the fact that the random errors go away with sampling. But the real topic applies to changes over time. And even with a systematic error, you can measure changes over time with a high degree of accuracy as long as the systematic error remains constant. Without a reason to claim that the systematic error has changed, it’s a reasonable assumption to make. If there is a reason to claim that the systematic error has changed, then that has to be taken into account.
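A minimal sketch of that last point, with made-up numbers: a constant offset added to every reading shifts the intercept of a fitted line, not its slope, so a fixed systematic error does not by itself hide a trend.

```python
# Fit a least-squares slope to data with a known trend, a constant bias,
# and random noise; the recovered slope is unaffected by the bias.
import random

random.seed(7)
years = list(range(100))
true_trend = 0.01                  # units per year
bias = 2.5                         # constant systematic error
obs = [true_trend * t + bias + random.gauss(0, 0.5) for t in years]

n = len(years)
mean_t = sum(years) / n
mean_y = sum(obs) / n
slope = sum((t - mean_t) * (y - mean_y) for t, y in zip(years, obs)) \
        / sum((t - mean_t) ** 2 for t in years)
print(round(slope, 3))             # close to 0.01 despite the 2.5-unit offset
```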

December 23, 2018 10:36 am

Willis wrote:

Now … raise your hand if you think that we’ve just accurately measured the length of the credit card to the nearest three thousandths of one millimeter.

Of course not. And the answer would not be improved if we had a million measurements.

Yet you provide no proof.

I here provide an existence proof. It probably sits in your living-room stereo, if you have anything digital such as a CD player.

I propose that if the person doing the measurement has Parkinson’s and shakes randomly with an oscillation that exceeds about 1 mm, then with a million measurements the precision would improve according to the formula you describe.

Now, unless whoever calibrated that ruler also shook randomly with an oscillation exceeding about 1 mm, the accuracy would not improve to match that precision.

This is precisely how delta-sigma converters work in your stereo system. They utilize noise above the final pass-band (20 Hz to 20 kHz) that works like the hypothetical Parkinson’s ruler user to provide that sigma/sqrt(n) improvement in the pass-band.

The improvement for a well-engineered system is actually much better than 1/sqrt(n), because the noise is shaped. That’s pretty advanced math, so I’ll let you read up on it if you wish.
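A minimal sketch of the dither idea in plain Python (illustrative numbers only, not a model of any real converter): a 1-unit-resolution reading of a constant is stuck at the nearest unit, but if noise wider than the resolution is present before quantization, the average of many quantized readings recovers the sub-resolution value.

```python
# Quantize a constant value to whole units, with and without dither noise
# added before the quantizer, then average many readings of each.
import random

random.seed(3)
true_value = 85.3

no_dither   = [round(true_value) for _ in range(100_000)]
with_dither = [round(true_value + random.uniform(-1.0, 1.0)) for _ in range(100_000)]

print(sum(no_dither) / len(no_dither))      # 85.0 -- stuck at the base resolution
print(sum(with_dither) / len(with_dither))  # ~85.3 -- sub-resolution value recovered
```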

That also brings up the fact that the improvement can be worse than 1/sqrt(n) if the noise is correlated, which in nature is quite often the case. I’ll leave that math for another day.

Nevertheless, your example is not correct unless it is clearly stipulated that there is no noise whose amplitude exceeds that of the one-measurement precision.

best regards,

Peter

Reply to  Peter Sable
December 23, 2018 10:40 am

There’s an experiment you can try yourself with just your finger.

Put your finger on something with a slight amount of texture without moving it. Note to yourself how much texture you feel.

Now move your finger. Notice how much more texture you feel.

It requires a motion that exceeds the size of the texture (and the distance between your finger’s nerves), but you can notice the improvement with motion.

Reply to  Peter Sable
December 23, 2018 3:50 pm

You keep changing the experiment. Deal with the experiment as it was originally stated.

Perhaps a different example of what we are talking about will help also. We are going to discuss target shooting. I have a rifle that consistently shoots in a two-inch circle around the point of aim. We decide to shoot 100 times a day for 30 days and, guess what, that gun shoots in a two-inch circle around the point of aim. We then take each target, put a grid on it, and measure the distances from a fixed point to each hole in order to determine data points. You create a spreadsheet of the distances and determine the mean and the uncertainty of the mean. This should be the point of aim, and, guess what, you can calculate that out to 5 digits if you want. We can add more and more shots to get any uncertainty you want.

Now we go out to shoot on the 31st day and I ask you to put a mark on the target where you think the first shot will go. Will you use your mean that is calculated out to 5 digits or will you use something else?

The same thing occurs when you start to average temperatures. Do you still think the mean of several days, months, or years gives you any better idea of what the real temperature was? I think you’ll find the errors carry forward, and that is all the accuracy you can get.

Reply to  Jim Gorman
December 23, 2018 4:16 pm

“Now we go out to shoot on the 31st day and I ask you to put a mark on the target where you think the first shot will go. Will you use your mean that is calculated out to 5 digits or will you use something else? ”

I forgot to add: “Let’s assume you choose to put a mark elsewhere; what is the probability the bullet will hit your mark? And if you choose the mean, what is the probability the bullet will hit there?”

Same thing applies to temperatures. You can calculate a mean or average and a given uncertainty. But, what is the probability that mean is the actual value you should be using?

Reply to  Jim Gorman
December 25, 2018 3:08 pm

Your shooting experiment is not the same as the ruler or global-temperature experiments, so “let’s not change the experiment”.

If you asked me where the mean of the next 100 shots would be on the 31st day, I’d put a mark on the target at the mean for the prior 30 days. However, being an actual shooter, if the temperature or wind has changed (or I don’t have a $2000 scope), I’d not have much confidence in the prior mean.

That being said, shooting is not the same experiment as the ruler, global temperatures, or sea levels from satellites. In fact, the most closely related experiments in the set { ruler, shooting, global temps, sea levels } are global temperatures and sea levels. The ruler and shooting experiments have completely different sources of error and completely different noise spectra.

Reply to  Peter Sable
December 23, 2018 10:58 pm

You keep changing the experiment. Deal with the experiment as it was originally stated.

but if we change the experimental conditions you get a different answer … ya think?

The ruler thought experiment is a false analogy to global temperature or sea level measurements because, unlike the climate measurements, the ruler measurement has no noise. I instinctively changed the experiment to make a correct analogy, because I don’t like to see someone I respect using false analogies.

With the global temperature and sea level measurements there’s high-frequency noise that provides resolution not otherwise available with low-resolution instruments. Just like a delta-sigma converter gets 24 bits of resolution at 20 kHz on your stereo with one bit of output (but that 1 bit is oscillating in the 10 MHz+ range).

That noise is in three dimensions: the location on Earth and the time. Both of these are auto-correlated, but that’s not what Willis was attempting to show.

I do thank you for the link on how to determine N from the Hurst exponent. Good write-up. I’m adding it to my bookmarks library.

Reply to  Peter Sable
December 24, 2018 4:54 am

You didn’t address my questions. They are pertinent to determining the actual signal in temperature.

The result is that the measurement error carries through to the end. If the variance you see is less than the measurement error of the independent measurements of different things, then you simply cannot quote a figure with higher resolution than the errors in those independent measurements.

Answer the questions about target shooting and you will see the problem.