Guest Post by Willis Eschenbach
In the comments to my post called “Inside the Acceleration Factory”, we were discussing how good the satellite measurements of sea surface heights might be. A commenter said:
Ionospheric Delay is indeed an issue. For Jason, they estimate it using a dual frequency technique. As with most everything in the world of satellite Sea Level Rise, there is probably some error in their estimate of delay, but it’s hard to see why any errors don’t either cancel or resolve over a very large number of measurements to a constant bias in their estimate of sea level — which shouldn’t affect the estimate of Sea Level Rise.
Keep in mind that the satellites are making more than 1000 measurements every second and are moving their “target point” about 8km (I think) laterally every second. A lot of stuff really will average out over time.
I thought I should write about this common misunderstanding.
The underlying math is simple. The uncertainty of the average (also called the “mean”) of a group of numbers is equal to the standard deviation of the numbers (a measure of how spread out the numbers are), divided by the square root of how many numbers there are. In Mathspeak, this is

Uncertainty of the mean = σ / √N
where sigma (σ) is the standard deviation and N is how many numbers we’re analyzing.
Clearly, as the number of measurements increases, the uncertainty about the average decreases. This is all math that has been well-understood for hundreds of years. And it is on this basis that the commenter is claiming that by repeated measurements we can get very, very good results from the satellites.
With that prologue, let me show the limits of that rock-solid mathematical principle in the real world.
Suppose that I want to measure the length of a credit card.

So I get ten thousand people to use the ruler in the drawing to measure the length of the credit card in millimeters. Almost all of them give a length measurement somewhere between 85 mm and 86 mm.
That would give us a standard deviation of their answers on the order of 0.3 mm. And using the formula above for the uncertainty of the average gives us:

0.3 mm / √10,000 = 0.3 mm / 100 = 0.003 mm
Now … raise your hand if you think that we’ve just accurately measured the length of the credit card to the nearest three thousandths of one millimeter.
Of course not. And the answer would not be improved if we had a million measurements.
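For anyone who wants to play with this, here is a minimal Python sketch of the situation. The true length and the fixed ruler bias in it are invented purely for illustration; only the 0.3 mm scatter and the ten thousand readings come from the example above.

```python
import numpy as np

rng = np.random.default_rng(42)

true_length = 85.60   # mm -- an assumed "real" card length, for illustration only
ruler_bias  = 0.20    # mm -- an assumed fixed bias in the ruler or technique
scatter     = 0.30    # mm -- person-to-person reading scatter, as in the example
n           = 10_000  # number of people measuring

# Each reading = true length + fixed bias + random scatter
readings = true_length + ruler_bias + rng.normal(0.0, scatter, n)

mean = readings.mean()
sem  = readings.std(ddof=1) / np.sqrt(n)   # standard error of the mean

print(f"mean of readings     : {mean:.4f} mm")
print(f"standard error       : {sem:.4f} mm   (about 0.003 mm, as above)")
print(f"distance from truth  : {abs(mean - true_length):.4f} mm  (stuck near the 0.2 mm bias)")
```

The standard error duly shrinks to a few thousandths of a millimetre, but the average never gets closer to the assumed true length than the bias allows.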
Contemplating all of that has given rise to another of my many rules of thumb, which is:
Regardless of the number of measurements, you can’t squeeze more than one additional decimal out of an average of real-world observations.
Following that rule of thumb, if you are measuring say temperatures to the nearest degree, no matter how many measurements you have, your average will be valid to the nearest tenth of a degree … but not to the nearest hundredth of a degree.
As with any rule of thumb, there may be exceptions … but in general, I think that it is true. For example, following my rule of thumb I would say that we could use repeated measurements to get an estimate of the length of the credit card to the nearest tenth of a millimeter … but I don’t think we can measure it to the nearest hundredth of a millimeter no matter how many times we wield the ruler.
Best wishes on a night of scattered showers,
w.
My general request: when you comment please quote the exact words you are referring to, so we can avoid misunderstandings.
Here is a good explanation of the “uncertainty of a mean”.
http://bulldog2.redlands.edu/fac/eric_hill/Phys233/Lab/LabRefCh6%20Uncertainty%20of%20Mean.pdf
As is stated in their text,
“the estimate of the uncertainty of the mean given by equation 6.1 has two properties that we know (or at least intuitively expect) to be true. First, it implies that the uncertainty of the mean is indeed smaller than the uncertainty of a single measurement (by the factor 1/sqrt(N)), as we would expect from the argument given in the previous section. Second, the more measurements we take, the smaller Um (uncertainty of the mean) becomes, implying that the mean becomes a better estimate of the measurement’s true value as the number of measurements it embraces increases”
Max,
“Second, the more measurements we take, the smaller Um (uncertainty of the mean) becomes, implying that the mean becomes a better estimate of the measurement’s true value as the number of measurements it embraces increases”
This statement is incorrect. That should read ‘measurement’s true correspondence to the instrument’s indication.’ The point is that averaging merely provides a more PRECISE instrument indication. The averaging does not address the underlying ACCURACY of the instrument itself.
The ACCURACY of an instrument is a finite physical characteristic. It is a function of its basic design limitations. Those limitations are typically described in its specifications and/or in its metrology lab calibration statements. It is an error to assume an instrument’s ACCURACY can be improved by averaging any measurements done with it. The best that can be done, given reasonably random scattering of measured values, is to produce a value that more PRECISELY represents the instrument’s measurement indication. ACCURACY is still no better than what is shown on the instrument’s calibration sticker. Also, keep in mind that no claim has ever been made by instrument manufacturers or cal labs that calibration errors are randomly scattered within the range specified on the sticker.
An argument is frequently made that the absolute accuracy of the measuring instrument is not important when measuring trends. Assuming that measured values collected at different times with the same instrument will each have the same error relative to the true physical value is also false. Instrument calibration drifts with time, and not necessarily in a constant direction or at a constant rate. Some instrument types are better than others in this regard. There are also issues with hysteresis, mechanical friction, thermal history, instrument contact with the measured process, etc.
The concept is simple: never claim more accuracy than the instrument specification. On top of that, you must add any degradation in that accuracy caused by how the instrument is connected to what it is supposed to be measuring. Averaging measured values does not reduce those errors.
It has nothing to do with instrument accuracy. It is just the fundamental statistics of sampling. You can choose not to believe statistics if you wish.
Interesting response. You say instrument accuracy has no bearing on data collected with that instrument? Very accurate, sorta accurate, and barely working instruments are all the same for collecting data?
I definitely believe statistics when used appropriately. The trick is understanding what the result of a statistical manipulation is. Just because a particular statistical algorithm can be run does not mean its results apply to the desired end. A simple example is calculating the average life span of humans and then letting your kids play in the middle of a busy street cuz statistics say they will live decades longer. Calculating the average life span is valid. Assuming it means very much for a particular individual is not valid. The same principle applies to this discussion.
Maybe I don’t explain it too well, so I will give it one more shot. You can never make one measurement more accurately than your instrument will allow. But if the errors are randomly distributed around the actual value, then the accuracy of the measuring device is not critical when you take a large number of measurements. However, it is very likely that the more accurate instrument is also the one more likely to have errors that are in fact random, while a less accurate instrument will likely have a bias in its errors.
Maybe this equation will help, or at least help me.
We can write a measured value as
U = Ut + delta
where U is the value you measure, Ut is the true, real value of what you are measuring (which you don’t know) and delta is the error in measurement. If you take a whole bunch of N measurements and then take an average we get
sum(U)/N = sum(Ut + delta) / N = sum(Ut)/N + sum(delta)/N
Now if the errors are truly random around the real value
sum(delta) = 0
and since Ut is the one true value, sum(Ut) = N*Ut
and sum(U)/N is just the average of our measurements, i.e. Uavg
So with all that we can get that
Uavg = Ut for our N measurements
and voila, the average of our bunch of measurements will in fact converge to the actual value, as long as the errors are random.
We employ this method in boundary layer turbulence statistics.
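Here is a minimal Python sketch of that cancellation argument, assuming the deltas really are independent and zero-mean; the same script also shows what happens when a constant bias is present, which is the case the replies below are concerned with (all values are made up for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

Ut = 20.0      # the "true" value (illustrative)
N  = 100_000   # number of measurements

# Case 1: purely random, zero-mean errors -> the average converges toward Ut
delta = rng.normal(0.0, 0.5, N)
print("random errors only  :", np.mean(Ut + delta))          # ~20.00

# Case 2: the same random errors plus a constant instrument bias
bias = 0.3
print("random + 0.3 bias   :", np.mean(Ut + delta + bias))   # ~20.30, not 20.00
```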
Max, there is no reason to think that the errors are randomly distributed around the true value.
In fact, no one knows what the true value is, and besides the measurements are not of the same thing.
There is every reason to be sure that there is NO such known random distribution.
So everything you wrote after that is moot.
No one is declaring statistical mathematics to be invalid when used correctly.
But you seem to be saying that we can ignore the rules about when the LLN applies, and when it does not.
Max –> Don’t forget we are not talking about sampling here. We are talking about one temperature measurement, taken one time, on one day.
Some folks want to say you can average daily temperatures, monthly temperatures, etc. and get better and better certainty of the mean. However, that just doesn’t work. It doesn’t help you determine what the real temperature was on any given day.
You only have to look at temperatures for two days. The first is 50 +- 0.5 and the second 51 +- 0.5. What is the average? It is 50.5 when not considering the possible values determined by the error. Does that mean each day was really 50.5, or just the first day, or just the second day? What was the average when you throw in the errors? What was the actual temperature for day 1, how about day 2?
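For what it is worth, here is a short Python sketch that propagates those stated ±0.5 intervals through the two-day average using plain worst-case interval arithmetic; the numbers are the ones above, nothing else is assumed.

```python
# Two daily readings with their stated uncertainty
day1 = (50.0, 0.5)   # (value, +/- uncertainty)
day2 = (51.0, 0.5)

avg  = (day1[0] + day2[0]) / 2
low  = ((day1[0] - day1[1]) + (day2[0] - day2[1])) / 2   # average of the lower bounds
high = ((day1[0] + day1[1]) + (day2[0] + day2[1])) / 2   # average of the upper bounds

print(f"average = {avg} with worst-case bounds [{low}, {high}]")
# -> 50.5 with bounds [50.0, 51.0]: the +/- 0.5 instrument uncertainty has not shrunk
```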
An unstated presumption, without supporting evidence, is that the standard deviation remains constant, i.e. no growth in time, i.e. no rate of change: False.
Ha ha.
‘Certainty’ methods/discussions aside, color me skeptical (for now) on the accuracy of SLR derived from satellites. After all, look at the contested ‘accuracy’ around the height of a well-known and relatively stationary object that does not have ocean swells nor tides…Mt. Everest.
From https://www.britannica.com/place/Mount-Everest:
…A Chinese survey in 1975 obtained the figure of 29,029.24 feet (8,848.11 metres), and an Italian survey, using satellite surveying techniques, obtained a value of 29,108 feet (8,872 metres) in 1987, but questions arose about the methods used. In 1992 another Italian survey, using the Global Positioning System and laser measurement technology, yielded the figure 29,023 feet (8,846 metres) by subtracting from the measured height 6.5 feet (2 metres) of ice and snow on the summit, but the methodology used was again called into question.
In 1999 an American survey, sponsored by the (U.S.) National Geographic Society and others, took precise measurements using GPS equipment. Their finding of 29,035 feet (8,850 metres), plus or minus 6.5 feet (2 metres), was accepted by the society and by various specialists in the fields of geodesy science and cartography. The Chinese mounted another expedition in 2005 that utilized ice-penetrating radar in conjunction with GPS equipment. The result of this was what the Chinese called a “rock height” of 29,017.12 feet (8,844.43 metres), which, though widely reported in the media, was recognized only by China for the next several years. Nepal in particular disputed the Chinese figure, preferring what was termed the “snow height” of 29,028 feet. In April 2010 China and Nepal agreed to recognize the validity of both figures.
Use the long-term land-gauges that are on non-rising/non-subsiding rock for sea level measurement, I say. The rest is too close to nonsense.
“Following that rule of thumb, if you are measuring say temperatures to the nearest degree, no matter how many measurements you have, your average will be valid to the nearest tenth of a degree … but not to the nearest hundredth of a degree.”
The point that may slip by here is what is being measured: In your credit card example, you are measuring the credit card. When you switch to temperatures (which is often the case when talking about climate) you are not usually measuring temperature – per se, but the temperature OF something – air, water, etc. Your example falls apart when we realize that it is nearly impossible [outside of a laboratory] to make very many temperature measurements of EXACTLY the same ‘piece’ of air under the same ambient conditions.
Conclusion: It is VERY difficult to even achieve the additional decimal point of better precision
Beyond the accuracy of the measurements themselves, there’s the difficulty of taking enough measurements so that you can claim to be taking a representative sampling of the entire earth.
Back in 1850, there were at most a couple of hundred sensors, mostly in western Europe and the eastern US.
The rest of the world was virtually unmeasured.
You might be able to say that we knew what the temperature of the eastern US and western Europe was, but to go from that to claiming that you could measure the temperature of the entire world, much less to a tenth of a degree, is ridiculous.
nw sage,
Yes, there’s a big difference between measuring one invariant thing and calculating the average of many measurements of different things. In the first case, the precision and accuracy are limited by the measuring instrument, and the average of multiple measurements will not improve this, as the instrument will always return the same value. In the second case, the randomness of the data improves the precision and decreases the uncertainty in the average, but not for any one measurement. The result has no predictive power regarding what the next measurement will be, but accurately predicts what the average of many measurements will converge to.
After reading all the comments it seems to me that the concept of uncertainty has a very uncertain meaning. Not even the mathematical average of all the various proposals seems to be certain.
If the raw data is poor, the average of the raw data is poor.
So this is really just another manifestation of the Garbage In, Garbage Out principle.
OK Willis. So you don’t believe in statistics. That’s fine with me. There are surely practical issues with applying statistics although I’m not sure I follow your logic. And truly I don’t much care how many angels can dance on the head of a pin. I’ll leave it to Nick Stokes or his equivalent to argue with you about those details. I expect they are firing up their keyboards to discuss your disturbing lack of faith even as I type this.
But you’ve also missed an important point about satellite sea level data. It’s very noisy. You aren’t measuring a credit card, you are measuring an earthworm whose length changes constantly (within limits). And to some extent, your measuring tools are calibrated rubber bands whose lengths aren’t as constant as one might like.
How do you deal with that? You take a LOT of measurements and average. You don’t think that’s a valid procedure? Do you have a practical alternative?
If your point is that we really don’t know exactly how accurate the resulting averages are, I agree we don’t. Mostly because we don’t really know the distribution of the noise in the raw sea level data, and probably don’t have all that good a grip on some of the calibration errors either.
One further point. We’re mostly worried about Sea Level Change, not absolute Sea Level. How do we compute that? We subtract Sea Level estimates at some time T0 from some other time T1. That doesn’t do anything for random error since the values at T0 and T1 have uncertainties due to random error. The difference of two uncertain numbers is on average a somewhat more uncertain number. But some of our errors are probably always about the same magnitude and sign. Those are biases, not random errors. Biases largely go away when we subtract.
“Some of our errors are probably always about the same magnitude and sign. Those are biases, not random errors. Biases largely go away when we subtract.” If we just could identify them ..
Mr. Magoo looks at something. His eyesight is not good so it is fuzzy. With this logic all it takes is enough Mr. Magoos all looking at a fuzzy something to give you a crystal clear image of that thing. NOT!!!!
Modern image processing techniques would seem to contradict this. What you are talking about is something done routinely for imaging deep space. For example, the Hubble Deep Field.
Don K, you write “You aren’t measuring a credit card, you are measuring an earthworm whose length changes constantly (within limits). And to some extent, your measuring tools are calibrated rubber bands whose lengths aren’t as constant as one might like.”
And out of that process you believe statistics allows you to report an earthworm result with GREATER precision, accuracy, and to more significant figures than measuring a plastic card with a wooden ruler?
And out of your belief in statistics, what do you affirm or deny regarding the FRACTAL properties of the “edges” of either an earthworm or a credit card? Analogously, how long is the coastline? Is it conceivable in your estimation of Willis’s problem that the top length of the (not quite rectangular) credit card is different from the (nearly but not quite parallel) bottom edge, by some length within the uncertainty of the measurement process?
I don’t understand your claim with sufficient detail to determine whether or not it’s persuasive.
And out of that process you believe statistics allows you to report an earthworm result with GREATER precision, accuracy, and to more significant figures than measuring a plastic card with a wooden ruler?
Actually, I’m not a true believer in statistics, because it depends on assumptions about the world that seem to me to be rarely met. But I think the Standard Error of the Mean, which is what we are dealing with here, might be one of the few things in the world that actually has a Gaussian distribution. If it actually is Gaussian, textbook statistics might just work. But that’s not what I’m claiming.
Anyway what we have here is more like good precision but lousy accuracy. We’re measuring Willis credit card with a really good micrometer. But the folks making the measurements don’t really know how to use it very well so we get substantially different measurements every time we try. Why do you believe that averaging won’t improve the results?
Fractals? Those are boundary thingees. Not very relevant I think (hope). The physical phenomenon we depend on is reflection of a microwave radio pulse from the sea surface and I think the ambiguity there is quite small. I believe it’s one of the least uncertain elements in the process. Trouble is that the sea surface is usually not very flat and the radio beam isn’t that narrow, so what the satellite “sees” is something difficult to describe or analyze.
Don K you write: “We’re measuring Willis credit card with a really good micrometer.”
Well, not according to Willis!
HE writes: …”people … use the ruler … to measure the length of the credit card in millimeters.
Whoever “we” might be, “we” agree that “we” have lousy “accuracy”. But I guess my question to you is, do you believe increased (statistical) precision moves “our” actual understanding of a measurement from “lousy” to “good”? You correctly interpret me to believe that averaging doesn’t “improve” things. We have a measure in millimeters and I can’t see why YOU believe that CAN be improved. How do we communicate across that gap?
DonK
The surface of waves are “boundary thingees!”
Fractals?
Watch this if the subject interests you…or if it does not and you wonder what is up with that?
https://youtu.be/56gzV0od6DU
I’m a little rusty on this, but I have done instrument measurement uncertainty calculations for NIST traceable Safety Related measurements, to assure that Nuclear Safety Limits are met. These aren’t your run of the mill guesswork kinds of things, but exist to show regulatory compliance under penalty of law.
Measurement Uncertainty relates to the conformance of actual process to the readout that represents that process. In other words, it’s how close the actual process is to what you say it is. In the case of temperature, it’s how close the temperature is to what the instrument readout says it is. The first time I did one of these calculations I was shocked at how uncertain things really are.
To calculate uncertainty, one must include everything that can possibly affect the reading: the analog sensor (including drift and calibration accuracy), digital conversion, readout precision, theoretical limits (that is, the degree to which a correlation formula actually represents the conversion process), and uncertainties in any physical correlations that can affect the fundamental relations. One must also add the uncertainty of the test equipment employed for calibration. Each of these terms (and this is not a complete list) has various components that contribute to the uncertainty of each term. It is not uncommon for two dozen or more individual uncertainty terms to be included in a properly done uncertainty calculation.
There are two kinds of errors at play: random error and bias error. Random errors can be reduced (but not eliminated) by more measurement. But bias errors never go away and are NOT improved with more measurements. Total uncertainty is calculated by Square Root of Sum of Squares (SRSS) of all random errors linearly added to SRSS of all bias errors.
Once the uncertainty for a given measurement system is determined, the total uncertainty can be reduced only if there are multiple independent measurements; that is, only if there is more than one source for a given reading. One cannot improve uncertainty by making multiple measurements with the same device, although one can (in principle) improve precision with more measurements.
It is clear from simple algebra that the total uncertainty of a measurement is bounded by the worst term in the system. For example, if the sensor error is 1%, no amount of additional precision or accuracy in subsequent processing can make the total uncertainty better than 1%. Uncertainty is always worse than the worst term, because SRSS can only produce positive numbers. Each additional term makes the uncertainty worse. Usually the total is dominated by just a few big terms. Adding a 0.01% accurate readout to the tail end of a 1% accurate sensor doesn’t make any real difference.
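A minimal Python sketch of that bookkeeping, using the combination rule described above (SRSS of the random terms added linearly to SRSS of the bias terms); the term names and percentages are invented for illustration, not taken from any real instrument.

```python
import math

# Illustrative uncertainty terms, in percent of reading
random_terms = [1.0, 0.25, 0.1]   # e.g. sensor repeatability, ADC noise, readout resolution
bias_terms   = [0.5, 0.2]         # e.g. calibration standard, installation effect

def srss(terms):
    """Square root of the sum of squares."""
    return math.sqrt(sum(t * t for t in terms))

total = srss(random_terms) + srss(bias_terms)
print(f"random SRSS = {srss(random_terms):.3f} %")
print(f"bias SRSS   = {srss(bias_terms):.3f} %")
print(f"total       = {total:.3f} %   (never better than the worst single term)")
```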
It’s kind of depressing to realize that you really don’t know what you think you know when you take a reading. In the Safety Related business, uncertainty calculations are vital to provide assurance that design limits are met and that plant operation, even during accidents, will not injure the public.
-BillR
+100
Yeah!
What Jim Gorman said! (+100)
Hallelujah!
Thank you William!
May I point out that the term “the uncertainty of the mean” means what it says. The mean value of a number of measurements can be calculated very accurately.
It does not tell you if the individual measurements are either accurate or precise. For example, I can take a million measurements of the same credit card with a device that reads to 5 decimal places. I can calculate the uncertainty of the mean and get a very small number. Now what is the uncertainty of the mean if I tell you that the ruler is off by 5 mm? Does it change at all?
The physical measurement errors can not be removed through this method, especially if, as someone has already pointed out, you are measuring different things with different instruments. Several of us have been harping on this for a long time. Measurement errors must be propagated throughout the calculations and are not removed by statistical manipulations. In other words, if you can only measure a temperature to within +- 1 degree, your average can only be accurate to within +- 1 degree.
Think about it. I’ll give you three readings, 52+-1 degree, 51+- 1 degree, and 50 +- 1 degree. What is the average? Is it (51+50+49)/3 = 50? Or is it (53+52+51)/3 = 52 Or is it perhaps somewhere in between? What is the uncertainty of the mean in this case.
Jim, excellent point…. I always say: the more numbers you add, the bigger the margin of error.
Willis: While I agree with your primary point, I would go a bit further. First, the standard deviation of the mean should be multiplied by 2 to estimate the normal 95% confidence interval. But there is also a need to evaluate and include other effects and create an “uncertainty budget”. ISO’s ‘Guide to the Expression of Uncertainty in Measurement’ (GUM) provides a detailed methodology for doing this that requires some quite heavy duty math as well as considerable training in metrology. For example, there are six factors that are commonly considered in determining the MU of a high quality screw type micrometer.
I can say that when I was responsible for operations in an ISO 17025 laboratory and implementing the GUM requirements, I was taken aback by how large the Measurement Uncertainty that we had to report was for much of our expensive and high precision equipment.
+100
Yeah!
What Jim Gorman said again! (+100)
Once again, he beat me to it.
I can recall the building and the lab where I took the lab portion of analytical chemistry (also called qualitative and quantitative analysis in previous years). The floor was concrete, six inches thick.
The scale was incredibly precise, I do not recall how many decimal places it had, but I do recall that in addition to of course having a glass enclosure to prevent air movements from corrupting the measurement…if someone walked into the room, the reading would bounce all over the place for several tens of seconds. If you shifted your feet, it moved. And the scale was on a solid stone bench and the building had, again, a concrete floor.
No matter how well you knew the procedure of the experiment, and how carefully you followed it, if you did not have really excellent technique your result would suck. And no matter how careful you were to do everything exactly the same, the multiple trials you needed to do to get your 95% confidence interval would all be different.
Grades in that class were assigned on how well your result agreed with the accepted value, and it was tough. Some years no one got an A.
You had to know the course work and theory perfectly, and all of these limitations, and report results correctly, and be extremely careful, and do the exact same thing over multiple times, to even have any chance of getting a good result.
How are readings from satellites being published that are in complete disagreement with the established method of determining sea level changes?
How can sea level graphs now be outside the error bars of the same graphs using the same data as what was reported in 1982?
“First the standard deviation of the mean should be multiplied by 2 to estimate the normal 95% confidence interval.”
WARNING! This is only true for normally distributed data, which climate data often isn’t. You can calculate 2 SD for any distribution, but only for normally distributed data does it correspond to a 95% confidence level.
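A quick way to check this for yourself in Python: compare the fraction of values falling inside mean ± 2 SD for a normal sample and for a strongly skewed one (a lognormal sample is used here purely as an example of a non-normal distribution).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

def coverage(x):
    """Fraction of values within mean +/- 2 standard deviations."""
    m, s = x.mean(), x.std()
    return np.mean((x > m - 2 * s) & (x < m + 2 * s))

print("normal    :", coverage(rng.normal(size=n)))                          # about 0.954
print("lognormal :", coverage(rng.lognormal(mean=0.0, sigma=2.0, size=n)))  # typically ~0.99, not 0.95
```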
Just imagine what a considerable tome “How to lie with statistics” would become if updated now with the benefit of “climate science”.
I believe the point of Willis’ post is that the most that can ever be achieved, even under the best circumstances, is 1 additional decimal place. Increasing the number of measurements won’t improve the accuracy. Therefore, any calculated measurement indicating better than that should be discounted back to the original measurements’ level of accuracy.
Thus, any claim of sea level rise acceleration, for example, that is based upon a change of mere hundredths of a millimeter, when the individual measurements used for the calculations were made to tenths of a millimeter, should be noted as unfounded.
Of course, as you point out nw, under less than ideal circumstances, resultant accuracy will be even less.
SR
I am trying to figure out the significance of all this. Is somebody multiplying a mean sea level rise of .003 mm by a thousand years and getting 3 mm of sea level rise?
Usually you don’t worry about significant digits in mean calculations. Additionally, under the rules of significant digits the figure .003 represents one significant digit, not three. You are supposed to ignore leading zeros. If it were recorded as .0030 it might be considered two significant digits, as that suggests you derived that last zero from the observation.
The bottom line is that accuracy is related to systematic error and precision is related to random error. Random error can be reduced by multiple readings of a FIXED value, but systematic error requires comparison to a standard or an instrument of known higher accuracy AND precision.
Willis’s point is spot on.
Other Sea level impacts…
Temperature of the water.
Direction of the wind.
Barometric pressure.
Wave peaks and troughs.
Tidal movement.
Position of the moon and sun.
And exactly how does the satellite know the local conditions while it is scanning the water?
Nor should one overlook NOAA/NASA’s modeling their sea surface heights according to isostasy and Earth’s geoid structure.
And somehow, NASA/NOAA claims to measure sea level rise of 3.2 mm annually?
Not a chance.
Believe it or not, Jason 3 measures or has inputs for all of those, except maybe barometric pressure; I would need to look that up. I know most of the others you mention are actually recorded in each data entry.
Again, Jason 3 accuracy is 2.5 cm RMS; it claims nothing more, and it is calibrated at numerous sites to ensure it stays within that.
Climate scientists do analyses that infer greater accuracy, but that has nothing to do with the raw data.
At the moment you are blaming an instrument for an accuracy that it doesn’t claim .. if you have a problem, identify the right party.
FYI … read the calibration at one of the sites
https://www.mdpi.com/2072-4292/10/11/1679/htm
I was going to make pretty much the same comment. Even if they tried to measure the deck of an aircraft carrier, it would probably vary a couple of feet over a year's time at the same location, and much more at different locations. To claim that measurements can be made on the ocean surface to accuracy or precision within millimeters appears a little farfetched, to say the least.
Well, a 1000+ foot aircraft carrier would expand and contract with temperature, the upper and lowest sections would bend several feet as the seas (large waves) move from fore to aft, and I’ve seen the long open inside vehicle decks of cargo vessels (800 feet long) twist by feet as the bow twists and then the mid and then the stern. But the length changes caused by temperature changes (-15 C to +35 C, for example) are not several feet.
A simple change in atmospheric pressure alone could result in a water elevation change of a few inches; at 25.4 mm per inch, say 3 inches equals roughly 76 mm of change, which would blow the millimeter claims of precision/accuracy out of the water.
I think I will go measure a piece of 20 or 40 grit sandpaper to get an accurate measurement of its thickness to the nearest .001 inch. I’ll report back with my measurements. Meanwhile someone else can measure the mean sea level in 20-30 ft seas to the nearest millimeter and we’ll compare results. We can use simple or complex math equations and compare the results to a baseline measurement of a fart in a high wind.
Jason 3 does do an adjustment for Barometric pressure
The datasheet gives you the formula it uses to correct it .. again it is calibrated and stays within the 2.5cm RMS accuracy claimed.
You guys seem to have a lot of complaints about an instrument without reading a single word about what it does and how it does it.
Your example is simple, concrete & useful to show one limitation of averaging. However, what about creating an average of the world’s temperature? So, we can average the temperature at the North Pole, New York and Jamaica.
That would be more like averaging my credit card, the size of my wallet and a passport. The number is meaningless. And yet, climate “scientists” do this all the time.
Willis your rule-of-thumb is correct. It is founded in the mathematics of “Reliability and Statistics.”
In simple terms the mathematics would be:
Total Error = Model (or System) Error + Measurement (or Data) Error
Where the above errors are probability (or random) distributions. If the system were perfect, then the system error would be zero but this rarely happens in the real world.
To use your thermometer example, the thermometer (system) would have an error of ± 0.1 °C. Therefore, no matter how many measurements we take, the error would never be better than ± 0.1 °C and could be much worse, depending on the size of the measurement errors.
Angus, I am afraid that you are not a climatologist.
Curious George, brilliant!
The problems with our temperature measurements are not limited to the instrument itself. We know that the readings will be impacted greatly by the choice and condition of the site at which the thermometers are read. And we know from past work that there are huge issues with site selection and their condition.
Averaging won’t significantly reduce those errors when urban and anthropogenic effects almost always bias high.
My analogy is measuring the height of adults. Australian men are 1.76 m on average, and to keep things simple, the median is 1.75 m. Women are 1.62 m and, again to keep it simple, all under 1.75 m. Round 1000 measurements to the nearest 0.5 m and you get 75% of the adult population are 1.5 m tall while 25% are 2 m tall. That means that Australian adults are 1.625 m tall on average and men are only 5 mm taller than women on average in Australia. Or if we measured the two sexes rounding off to the nearest half metre, men are 250 mm taller than women on average.
Making 1 million measurements will not make it all better.
Statistics is a wild beast. Take the statement Willis quotes: “it’s hard to see why any errors don’t either cancel or resolve over a very large number of measurements to a constant bias.” Let’s consider a related situation, Brownian motion – a small particle in water moved randomly by the impact of water molecules. How does the large number of impacts average? What can we say about the particle position statistically?
It turns out that if you simulate the walk, in one or two dimensions the particle keeps wandering back to the neighbourhood of its starting point, while in three dimensions it drifts away for good. In every case the mean squared displacement of the particle from its original position grows linearly in time – I vaguely remember that this relation actually got measured as a way to determine Boltzmann’s constant.
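For anyone who wants to see that in numbers, here is a small Python random-walk sketch (the walker and step counts are arbitrary): the mean squared displacement grows in proportion to the number of steps, so the typical distance from the start grows as the square root of time rather than averaging away.

```python
import numpy as np

rng = np.random.default_rng(3)

n_walkers = 5_000
n_steps   = 1_000

# One-dimensional random walk: each step is +1 or -1 with equal probability
steps = rng.choice([-1.0, 1.0], size=(n_walkers, n_steps))
positions = np.cumsum(steps, axis=1)

for t in (10, 100, 1_000):
    msd = np.mean(positions[:, t - 1] ** 2)
    print(f"steps = {t:5d}   mean squared displacement ~ {msd:7.1f}   (close to {t})")
```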
“A lot of stuff really will average out over time”.
Errors which go in one direction don’t, that’s why bias and extremism tends to get worse over time.
Isn’t there a large systematic drift error with the height of the satellites? I do recall a proposed GRASP system to try and reduce this cumulative error to 0.1 mm a year.
Here is another question. Why don’t the providers of temperature data sets require that users (i.e., scientists, etc.) not only acknowledge the measurement error range contained in the data set but also quantify how these errors affect their conclusions? In other words, evaluate their results based upon the low range of temperatures and upon the high range of the errors.
This would go a long way toward making everyone aware of the inaccuracies being ignored.
This process of averaging multiple measurements to get a more accurate measurement is done all the time in electronics. Take the case of an 8-bit sampling system (ADC) which can resolve 256 different voltage levels (2^8). Imagine it has a full scale input range of 0 to 25.6 volts (unlikely numbers but it makes the maths easy). We can then resolve voltages 25.6/256 apart (0.1V apart). You might say that is the limit of our accuracy but that isn’t so.
Take an input voltage of 10.973V, just as a for instance. The nearest ADC level is 11.0V (level 110) and in the absence of noise that is what the ADC would read. It would be wrong by 0.027V. If there is some noise (dither) on the signal, however, the voltage will bob up and down around 10.973V randomly. However, it will apparently spend more time at the 11.0V level (level 110) than the 10.9V level (level 109). In fact it will spend time at each of these levels in proportion to how close the real input voltage is to that level. In this case, the signal is 0.027V from level 110 and 0.073V away from level 109. If we take a thousand measurements with the ADC we will end up with approximately 270 readings at ADC level 109 and 730 at level 110. We can use that information to narrow down the true voltage at the input to near the real value of 10.973V even though at first glance we can’t get a better accuracy than the native accuracy of the ADC, one part in 256 (or 0.1V in this particular example). In this way, 8-bit ADCs can be made equivalent to 9-bit, 10-bit, 11-bit or more, at the expense of taking multiple readings.
We can take this principle to its extreme and come up with a 1-bit ADC. This has only two output states and looks to be fairly useless at first glance. Say the input voltage range is the same as before, 25.6V. Then if the input voltage is <12.8V the output will read 0 (equivalent to 0V to 12.8V) and if it's higher it will read 1 (12.8V to 25.6V). A resolution of 12.8V, apparently rubbish. However, if we apply a large amplitude noise dither on the input voltage, the 1-bit ADC will read a signal that is at exactly 12.8V 50% of the time as a 0 and 50% of the time as a 1. Given enough readings, we will find we get equal numbers of 0s and 1s. Thus we know, despite the apparently awful resolution of a 1-bit ADC, that the input signal is at 12.8000…V despite the only two levels we can actually read being 0V or 25.6V. If the input voltage strays away from 12.8V, say to 13.273671V, then if we take enough measurements we will find we get an excess of 1s over 0s, and from the proportions of these two results we can determine the input voltage to an arbitrary accuracy, limited only by the number of samples. One can turn a 1-bit ADC with dither into a 24-bit equivalent ADC just by oversampling in this way. In fact, there would be other factors that would limit the accuracy of the measurement no matter how many samples we take, but this does not impact on the primary question – that taking more samples can indeed improve accuracy in a sampled data system – it is done all the time.
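Here is a small Python sketch of that dithering argument; the 8-bit, 0–25.6 V numbers are the ones above, and the uniform dither of half a step either way is an assumption. Without dither every sample lands on the same code and averaging gains nothing, while with dither the average of many samples recovers the 10.973 V input to well below the 0.1 V step.

```python
import numpy as np

rng = np.random.default_rng(7)

step = 0.1        # volts per code: 25.6 V full scale / 256 codes
v_in = 10.973     # true input voltage
n    = 100_000    # number of samples

def adc(v):
    """Quantise to the nearest 0.1 V code, clipped to the 0-25.6 V range."""
    return np.clip(np.round(v / step), 0, 255) * step

# Without dither: every conversion returns the same code, so averaging cannot help
print("no dither :", adc(np.full(n, v_in)).mean())   # 11.0 exactly

# With dither: noise spreads the samples over the two neighbouring codes
dither = rng.uniform(-step / 2, step / 2, n)
print("dithered  :", adc(v_in + dither).mean())      # ~10.973
```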
This is a perfect example. Thank you.
Andrew,
You seem to be forgetting that your overall accuracy is still limited by the A to D converter. Yes, by dithering you can jitter the value across the ADC steps and thus estimate a value between ADC steps. This is done in many systems. However, the final accuracy, no matter how finely you divide an ADC step, is ultimately only the ADC's basic accuracy. In many cases, such as signal analysis, absolute ADC accuracy is not as important as monotonicity and even step size. Dithering helps minimize ADC step intermodulation. Again, accuracy is that of the ADC, not how finely you can subdivide each ADC step.
With just dithering you only get 1/2 bit per 2x oversample, or 2^48 oversampling to get 24 bits, because at best the dithering noise is uncorrelated. I don’t think there’s any system that can oversample fast enough to get 2^48 oversampling…
Delta-Sigma converters use negative correlation, by feeding back an inverted signal, which allows the Hurst exponent of the quantization noise to be between 0 and 0.5, thus giving M + 1/2 bits per doubling, where M is the number of feedback stages (the “order” of the delta-sigma converter).
A typical commercial delta-sigma ADC is third order giving 4.5 bits per doubling, so 32x-64x oversampling is sufficient to get 24 bits
Funnily enough, Willis’s rule is correct here: you can get at most 3 additional bits (which is ~1 decimal place) out of an 8-bit system, because there are 8 comparators and thus differences between the comparators (manifesting as non-linearities), which means oversampling has diminishing returns.
BTW your typical 8 bit A/D in a microcontroller has terrible linearity so is already only effectively 7 bits.
This is why 1-bit delta-sigma converters are so nice. There’s only one comparator and making one comparator linear is relatively straightforward, and can be done with fairly standard CMOS processes. This is why all your computers have a delta-sigma converter for audio output.
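For the curious, here is a toy first-order delta-sigma modulator in Python (a bare-bones sketch, not a model of any particular commercial part): a single comparator fed by an integrator of the error, with the input recovered by nothing more than averaging the one-bit output stream.

```python
import numpy as np

def delta_sigma_first_order(x):
    """First-order delta-sigma modulator: returns a +/-1 bitstream whose average tracks x."""
    integrator = 0.0
    feedback = 0.0
    bits = np.empty_like(x)
    for i, sample in enumerate(x):
        integrator += sample - feedback                   # integrate the error
        feedback = 1.0 if integrator >= 0.0 else -1.0     # one-bit quantiser
        bits[i] = feedback
    return bits

# A constant input of 0.3 on a -1..+1 scale, heavily oversampled
bits = delta_sigma_first_order(np.full(100_000, 0.3))
print("recovered input ~", bits.mean())   # ~0.3, from nothing but a one-bit stream
```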
references:
https://www.renesas.com/us/en/www/doc/application-note/an9504.pdf
http://www.seas.ucla.edu/brweb/papers/Journals/BRSpring16DeltaSigma.pdf
Willis,
Working in engineering I experience this all of the time. My assertion is the following:
Resolution:
The maximum ability of the measurement system to quantify a value. A 10-bit A to D has at best 1024 individual values, 0-1023. If for example a 5V reference is used, the system will divide the 5 volts into 1024 steps. This implies each step is approximately 4.88mV, so the maximum resolution is 4.88mV. No matter how many measurements you make, you will never be able to tell the difference between two values that are less than 4.88mV apart.
Accuracy:
The delta from the real to measured value in the system. This is a calibration issue and independent of precision. In the A to D system above that is determined by the reference voltage accuracy and the internal offsets within the signal processing system.
Precision:
The variation of the measurements due to random error within the system. In the A to D example above the system is capable of +/- 1 Least Significant Bit.
My assertion is this:
1. For Resolution taking multiple samples will not improve resolution. If a value is between two buckets no number of samples will resolve them.
2. Accuracy is an offset and sample size will not resolve this problem, only calibration will.
3. Precision can be affected by samples assuming a predictable and stable variation. If the variation is systematic, i.e. in certain conditions the offset is different, is not predictable, or nonlinear, then an average will not work.
Thus, I would say you do not gain an extra digit by averaging, without knowing the other factors affecting the system.
Thanks, R. That’s well explained and clearly written, and I can only agree. You’ll note that I said my rule of thumb involved what in general is the maximum real-world gain you can get by averaging. Yes, sometimes you can get more. And often you can get no gain at all, as you point out.
Much appreciated,
w.
Willis,
When I posted, much of the intervening explanation you had stated was for some reason not displayed. I fight software people all of the time who think that they can improve resolution by oversampling. Don’t get me started about sensor variation……. Have a good weekend and holiday.
I’ll raise my hand.
But then I had a science education.
And a bit of maths.
Sometimes Willis, you are dangerously wrong.
I will raise my hand as well.
Here is a nice video explaining standard error of sample means and why the more samples you take the smaller the error of the sample mean will be.
Max, I discussed the calculation of the standard error of the mean in the head post. That video adds nothing. And if you raised your hand, you missed the whole point of the post. Likely my fault, my writing may not be clear, but no, you cannot measure a credit card to the nearest three thousandths of a millimeter using an ordinary ruler no matter how many measurements you take.
Using the normal procedure for calculating the s.e.m. that both I and the Khan Academy laid out, you end up with an answer which is the standard error of the mean of the MEASUREMENTS, but which is NOT related to the actual length of the credit card—a vital difference.
w.
Leo Smith
Are you suggesting that you are one of the few people here with a science education and exposure to mathematics?
Leo Smith December 20, 2018 at 7:43 pm
Seriously? You are claiming that you can measure a credit card length to the nearest three thousandths of a millimeter using an ordinary ruler just by repeating the measurements?
Really? We can throw away our micrometers, rulers are adequate?
Someone is dangerously wrong here, but it’s not me …
w.
I guess I don’t need my expensive 8.5 digit DMM anymore either. Instead I can just use a cheap handheld one & just take loads of measurements 🙂
The standard deviation (SD) is independent of the standard error (SE). So reducing SE has no effect on SD. The SE only tells you how well your sample represents the population. For example, you want to determine the average age of a city’s population. From your sample, you calculated mean = 30 and SD = 1. How do you know SD is really 1? You did not ask all the people in the city. It could be 2. SE tells you that the actual SD is close to 1 and far from 2. Therefore, your sample is good.
But your sample size does not affect SD because that is a property of the population. Another city could have SD = 0.1 or 0 (all people have the same age). SE has nothing to do with whether or not you can measure the mean age to the last hour, minute or second.
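A quick Python illustration of that distinction (the ages and the spread are invented): as the sample grows, the estimated SD settles near the population SD while the standard error keeps shrinking, and neither says anything about whether the individual ages were recorded accurately.

```python
import numpy as np

rng = np.random.default_rng(5)

# A made-up city where ages scatter with a true SD of 1 year around a mean of 30
population_mean, population_sd = 30.0, 1.0

for n in (10, 100, 10_000, 1_000_000):
    sample = rng.normal(population_mean, population_sd, n)
    sd = sample.std(ddof=1)
    se = sd / np.sqrt(n)
    print(f"n = {n:9d}   sample SD = {sd:.3f}   standard error of the mean = {se:.5f}")
```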
Dr. Strangelove
Another way to look at that is that there is a fundamental relationship between the range of a normal distribution and the standard deviation. Increasing the number of samples will have a small impact on the range because most of the samples will be in the high probability central region of the distribution.
Another way to look at it is the normal distribution is a mathematical model. SE tells you how well your data fit the model. If the population is really normally distributed, the greater the sample size (N) the better the fit. But it doesn’t tell you whether or not your data is accurate. If you just made up the data, you can have a perfect fit and it would still be wrong.
The normal distribution is a continuous probability function:
P (x) = e^(a x^2 + b x + c)
It’s the base of natural log raised to the power of a quadratic equation.
When the number of trials (or samples) n is large, the normal distribution approximates the binomial distribution, which is a discrete probability function. If the probability of x is p, then the probability of getting x in n trials is a binomial distribution:
P (x) = n! / (x! (n – x)!) p^x (1 – p)^(n – x)
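A short Python check of that approximation (the n and p used are arbitrary): for large n the binomial probabilities lie close to the normal density with the same mean n p and variance n p (1 - p).

```python
import math

n, p = 1000, 0.3                       # illustrative values
mean, var = n * p, n * p * (1 - p)

def binomial_pmf(k):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def normal_pdf(k):
    return math.exp(-(k - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

for k in (270, 290, 300, 310, 330):
    print(f"k = {k}:  binomial = {binomial_pmf(k):.6f}   normal approximation = {normal_pdf(k):.6f}")
```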
I agree with Willis here, but I keep hearing on both sides the notion that a random distribution is an accessible absolute that unveils uncertainty!
Everyone who actually employs this term should do the experiment in the real world, by tossing a coin for example and noting the result of even that simple 50/50 probability.
This is the way I was first introduced to “Experiment Probability” and it has stuck with me, ever since.
Lately it seems, it has become plain old probability and the “experimental” part has been dropped.
In reality, nature doesn’t have to obey the theory of probability as we commonly experience it. Only as the number of experiments – coin tosses – increase does reality approach the theory.
Convergence is not guaranteed because there is nothing in reality to stop the experiment from diverging from the theory, though it does become less and less likely as the number of “experiments” increases.
In lieu of doing the recommended experiment in real life, I just found an excellent video* that does a good job of modelling the common real world experience.
I would add that it is a cheat to use a random generator because the lesson to be learned here, is that nature only ever approaches “randomness” via large numbers. Therefore notions of ”distribution” are misleading in the “real” world!
Once more to be clear, strings of heads or tails or complex patterns are not ruled out in actuality – the experiment – despite the theory!
* https://www.khanacademy.org/math/ap-statistics/probability-ap/randomness-probability-simulation/v/experimental-versus-theoretical-probability-simulation
Typo above, I meant to say experimental probability but empirical probability or relative frequency, would smell as sweet!
I just found a much better example here:
https://www.youtube.com/watch?v=dXEBVv8PgZM
I highly recommend this for anybody interested in probability or relative frequency/distribution, particularly from 36:00 onwards.
Coin tossing follows the Law of Long Leads or the Arcsine Law.
*In a coin toss, the symmetry does not show itself by being heads half the time, it shows itself in half the “sample paths” being above or below the line of theoretical probability most of the time!
cheers,
Scott
*Professor Raymond Flood, Probability and Its Limits
Willis, Thank You for this post. This is the primary reason that the pseudo science of AGW is a farce.
Also have a look at windy.tv; almost all the weather is over, and because of, the oceans.
We are on planet water not planet earth.