The Limits Of Uncertainty

Guest Post by Willis Eschenbach

In the comments to my post called “Inside the Acceleration Factory”, we were discussing how good the satellite measurements of sea surface heights might be. A commenter said:

Ionospheric Delay is indeed an issue. For Jason, they estimate it using a dual frequency technique. As with most everything in the world of satellite Sea Level Rise, there is probably some error in their estimate of delay, but it’s hard to see why any errors don’t either cancel or resolve over a very large number of measurements to a constant bias in their estimate of sea level — which shouldn’t affect the estimate of Sea Level Rise.

Keep in mind that the satellites are making more than 1000 measurements every second and are moving their “target point” about 8km (I think) laterally every second. A lot of stuff really will average out over time.

I thought I should write about this common misunderstanding.

The underlying math is simple. The uncertainty of the average (also called the “mean”) of a group of numbers is equal to the standard deviation of the numbers (a measure of how spread out the numbers are), divided by the square root of how many numbers there are. In Mathspeak, this is

\frac{\sigma}{\sqrt{N}}

where sigma (σ) is the standard deviation and N is how many numbers we’re analyzing.

Clearly, as the number of measurements increases, the uncertainty about the average decreases. This is all math that has been well-understood for hundreds of years. And it is on this basis that the commenter is claiming that by repeated measurements we can get very, very good results from the satellites.
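
Here is a minimal sketch of that arithmetic in Python (the 0.3 mm spread is just a ballpark number for illustration, not anything taken from the satellite record):

```python
import math

sigma = 0.3  # standard deviation of the individual measurements, in mm (illustrative number)

# uncertainty of the average = sigma / sqrt(N)
for n in (100, 10_000, 1_000_000):
    print(f"N = {n:>9,}:  sigma / sqrt(N) = {sigma / math.sqrt(n):.5f} mm")
```

On paper, then, the uncertainty of the average just keeps shrinking as N grows.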

With that prologue, let me show the limits of that rock-solid mathematical principle in the real world.

Suppose that I want to measure the length of a credit card.

So I get ten thousand people to use the ruler in the drawing to measure the length of the credit card in millimeters. Almost all of them give a length measurement somewhere between 85 mm and 86 mm.

That would give us a standard deviation of their answers on the order of 0.3 mm. And using the formula above for the uncertainty of the average gives us:

\frac{0.3}{\sqrt{10000}} = 0.003

Now … raise your hand if you think that we’ve just accurately measured the length of the credit card to the nearest three thousandths of one millimeter.

Of course not. And the answer would not be improved if we had a million measurements.
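
To make that concrete, here is a small simulation sketch. It assumes, purely for illustration, that the ruler markings are slightly off so that every reading comes out about 0.2 mm too long; that systematic error is an invented number, not a measured one. The statistical uncertainty of the average shrinks toward zero, but the average never gets closer to the true length than the bias:

```python
import numpy as np

rng = np.random.default_rng(0)

true_length = 85.60   # invented "true" card length, mm
ruler_bias = 0.20     # invented systematic error: every reading comes out 0.2 mm long
sigma = 0.30          # scatter of the individual eyeballed readings, mm

for n in (10_000, 1_000_000):
    readings = true_length + ruler_bias + rng.normal(0.0, sigma, size=n)
    readings = np.round(readings, 1)            # people report to the nearest 0.1 mm
    mean = readings.mean()
    sem = readings.std(ddof=1) / np.sqrt(n)     # the "uncertainty of the average"
    print(f"N = {n:>9,}:  mean = {mean:.4f} mm,  sigma/sqrt(N) = {sem:.5f} mm,"
          f"  actual error = {mean - true_length:+.3f} mm")
```

More measurements shrink the statistical uncertainty, but they do nothing at all about the bias.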

Contemplating all of that has given rise to another of my many rules of thumb, which is:

Regardless of the number of measurements, you can’t squeeze more than one additional decimal out of an average of real-world observations.

Following that rule of thumb, if you are measuring say temperatures to the nearest degree, no matter how many measurements you have, your average will be valid to the nearest tenth of a degree … but not to the nearest hundredth of a degree.

As with any rule of thumb, there may be exceptions … but in general, I think that it is true. For example, following my rule of thumb I would say that we could use repeated measurements to get an estimate of the length of the credit card to the nearest tenth of a millimeter … but I don’t think we can measure it to the nearest hundredth of a millimeter no matter how many times we wield the ruler.

Best wishes on a night of scattered showers,

w.

My general request: when you comment please quote the exact words you are referring to, so we can avoid misunderstandings.


335 thoughts on “The Limits Of Uncertainty”

  1. “A lot of stuff really will average out over time.”
    A septic tank does that. Averages stuff, over time. But it’s still crap.

  2. Willis
    What happens if the thing you are measuring is not a single thing (like the length of a credit card) but is itself the average of (say) 1000 things? In that case, would taking 1 million, or 100 million measurements make the average more accurate?

    • My understanding, and I could be wrong, is that such techniques are valid for multiple measurements of the same thing.
      But not for multiple measurements of something which is in the process of varying over time, or which is different depending on which part of the thing you are measuring.
      I know that a number of commenters in previous threads had this same argument regarding such things as the global temp, and averaging readings from different places or the same place on different days.
      My understanding, which I freely admit may be wrong, is that you have to be measuring the same thing multiple times. If you have one ore sample, it is one thing, and results can be averaged.
      You cannot average measurements taken one for each of one hundred samples. AFAIK.

      Besides that, in every single class I ever took, you cannot report any result to more significant figures than the least number of sig figs in the measurements, whenever two or more measurements are factors in the final result. Because then you are multiplying errors and claiming precision you did not measure.
      I have not read any of the comments yet, but I know this is always a huge debate whenever this comes up. These are my out-of-the-gate thoughts only…and going by memories from a long time ago.
      I do not think I have done this sort of work since I got an A in analytical chemistry in college, which was a long time ago.
      I am looking forward to this comment thread, because I do not know nearly as much about this as Willis or probably numerous others here.

      • You’re right, Menicholas, and so is Willis.

        The lower limit of accuracy is given by the resolution of the instrument. In the case of Willis’ ruler, the smallest division is 1 mm. The most careful measurement, using one’s most analytical eye-ball, is to 0.25 mm.

        It won’t matter how many gazillion measurements are taken, the length of the credit card will never be known better than, e.g., 8.6 cm ± 0.25 mm. And that’s assuming all the score-marks on the ruler are regularly spaced to 10-times that accuracy.

        There’s another little interesting fillip about satellite sea level. Assuming all the error is random, it will average out as per Willis’ example: 1/sqrt(N). But if the true height of the sea level varies from place to place, then the measured heights will have different physically true values.

        That means sea level will have a state uncertainty because its height — its physical state itself — varies. So, the reported sea level should be reported as mean±error, and mean±(physical variation). Unless all the physical variations can be shown to be normally distributed about the mean.

        • Provided the measurement instrument has the accuracy levels claimed for it and you deal with operator error.
          So in this case, to what level of precision is the ruler marked, is 1 mm really 1 mm through the whole range being used, and what measures were taken to deal with factors such as eyesight?
          These really matter when you start to make great claims about either the data or the accuracy of the information.

          You can try the average route but you still average the numbers without knowing any errors and hoping, not knowing, if the process covers the errors.

          • Remember Wittgenstein’s Ruler: Unless you have confidence in the ruler’s reliability, if you use a ruler to measure a table, you may also be using the table to measure the ruler.

        • The ENSO state uncertainty comes to mind.

          https://geosciencebigpicture.files.wordpress.com/2015/08/nino-sea-level.png

          It can be seen that the multivariate index and sea level show a range of some 5 “normalized values”. It would be interesting to know how these values translate into meters, but this is no random variation.

          Theoretically this variation is achieved by ENSO causing more rain to fall over the ocean and less over land, causing particularly South American tropical land droughts.

          The Rossby and Kelvin alternating reaction waves bouncing back and forth across the basins, and the trade wind stacking are other state uncertainties at annual scale.

      • The rule works if errors are perfectly random and so with millions of measurements, nearly every measurement of +0.251 is cancelled by a -0.251 measurement. You can’t round measurements to the nearest whole number and expect the rule to still be true. Even if you did each measurement to many more significant figures, you use the rule assuming no systematic error, e.g., a temperature change during the measurements that has contracted the ruler and card differently. A bit silly to just assume it when applying the rule to a million measurements even if they don’t vary spatially or with time.

        • Thank you Pat and Robert for the replies.
          Very good point about systematic errors…I was thinking the same thing…the ruler could be defective, or it could be at a far different temperature than when it was manufactured.
          The picture could have been taken from an angle, instead of square on, etc…

          Another thing I wanted to mention was being careful to use consistent terminology, and to use the words that mean what one is attempting to communicate, in particular the distinction between accuracy and precision.
          In common parlance, these terms are interchangeable, but we know that is not the case when discussing measurements, and errors, and statistics, and such. Mentioning this for any readers who may not be familiar with this distinction, which is anything but a trivial one. But also to remind myself, because even though I know this, I still find myself using the wrong words sometimes in my haste to get my thoughts typed out.

          About the sea level measurements…we know for sure that the level of the sea is not anything like a symmetrical oblate spheroid, and for several reasons this is true. The actual shape of the Earth is called the geoid. I will post a link to an image of this shape. It is a bumpy, lumpy and irregular shape. And sea level is even more so, given that sea level is defined in terms of the gravimetric field of the Earth at each given spot of the surface, and this varies tremendously, as mentioned and depicted briefly in the Minute Physics video on sea level.
          I am not even gonna pretend I have any special insight based on knowledge of the method used by the people interpreting the satellite data…but given all of the comments over the past few days from people who evidently do have such knowledge of at least some aspects of the method and possible confounders, and what I have gleaned about what “sea level” even means…I think I would doubt the results of satellites over tide gauges and old photos of known landmarks, even if I thought the people in charge of the entire process were as unbiased as a person could possibly be…which I do not.

          • I was working on a complex analogy, and realised that it was summed up by Robert’s point that each error +0.251 is balanced by a -0.251 if the errors each side of the measurement are balanced.

            However, is this discussion really relevant to satellite measurements? If we are discussing tide gauges then yes, measurements are taken to the nearest millimeter (or 10th or whatever they use). Satellites will measure to much greater precision than that. I just hope that the average was taken before the measurements were rounded.

          • It’s been a while since I read up on it. I remember that the average distance to a swathe of surface is measured with a precision of ±2 cm, with a method of judging what the conditions of the area are and choosing the right model for the wave conditions. You really can’t expect the law of large numbers to fix up any systematic errors.

      • Haven’t read the other responses yet, but my first reaction is, you are correct. My background is geology (so last century), but we had to take chemistry and then had geochemistry classes. Don’t get me started on the geophysics professor who had no concept of sig figs.

        • I took a bunch of geology classes, starting with physical, and then took one called Geology of the National Parks, and also took Earth history. At that time I was thinking I would pursue a degree program called Interdisciplinary Natural Science, but mostly I was just taking classes in subjects I wanted to know more about and was not really thinking about a degree.
          So I also took physics, history of science classes, zoology and other biology classes, and some other classes that were heavy on earth history and the history of science.
          Then I started to take more classes in physical geography, since I wanted to study the weather and those were prerequisite to meteorology and climatology and hydrology.
          But then I found out that physical geography, meteorology and all of those classes were not considered natural science classes, they were in the humanities dept, and I had so many science credits it was impossible to get a degree in any of those subjects without taking over a year of extra classes. At that point I had a lot of chemistry classes, and it seemed very easy to me and hard to other people, and also it seemed like pretty much everything is based on chemistry and physical chemistry at some level, so chemistry it was.
          If I had been getting good advice or I had in mind getting a degree from the get go that was going to result in a wide choice of readily available jobs, I would have been in the engineering dept, or premed/medical school.
          I was just about to decide on what graduate degree to pursue when I found myself under intense pressure to help out with the family business, the plant nursery, when my dad got a brain tumor and was unable to do anything and was facing a tax nightmare. Previously, I was just building the place for them on my weekends and Holidays and a couple of summers…no plans to have anything to do with the biz, but I was the only one in the whole clan who knew anything about construction, so I just built and built until I had put up 80,000 square feet under shade and glass.
          I was getting letters of employment offers from everyone from the Navy (civilian nuclear tech on a sub) to the EPA, and the nursery thing was supposed to be just a season, then a year.
          By the time I realized it, I had been out of school so long, could not decide on what to pursue or where to do it (I had been accepted to Penn twice…family legacy) and then wound up renovating historical buildings and shooting pool and womanizing.

          Life is never what you think it will be, at least it was not for me.

    • There is no way to answer your question AndyL, this comes back to the issue of the salary for the Greenpeace employees. Once you start abstracting things you need to know the distribution of the thing we are talking about.

      In the Visa card situation you will get most around the same sort of number and a few who make complete mistakes etc so you can apply averaging knowing that you will get closer to the answer.

      Wages for example can have a heavily Cauchy distribution, with management having one grouping of salaries and pleb workers having a different grouping of salaries. An average in such a situation is meaningless in most analysis because it represents a salary no-one actually gets. No matter how many samples you take it doesn’t improve things because you need to first understand the distribution.

    • What Willis did not make especially clear in this post is that he is describing the precision of the standard deviation, not the precision of the measurement of card length.

      So yes, collecting a million measurements and having them all fit within the described range of 1 mm, then the precision of the standard deviation becomes very tight indeed. But the standard deviation itself does not grow smaller … it remains 0.3 mm.

      Same with any other statistical description based upon real world measurements, or conditions.

      • And amazingly, the SD from 36 measurements (samples) is almost always the same as the SD when taking a million measurements.
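
That is easy to check numerically. A minimal sketch with made-up numbers: the standard deviation of the readings barely moves between 36 and a million samples, while the standard error of the mean keeps shrinking:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.3   # spread of the individual (made-up) card measurements, mm

for n in (36, 10_000, 1_000_000):
    x = rng.normal(85.6, sigma, size=n)
    sd = x.std(ddof=1)       # standard deviation of the readings: barely changes with N
    sem = sd / np.sqrt(n)    # standard error of the mean: keeps shrinking
    print(f"N = {n:>9,}:  SD = {sd:.3f} mm,  SEM = {sem:.5f} mm")
```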

  3. if you are measuring say temperatures to the nearest degree, no matter how many measurements you have, your average will be valid to the nearest tenth of a degree … but not to the nearest hundredth of a degree.

    This assertion is probably wrong.

    Because the statement clearly assumes that there are 10 distinct markings in between a degree (C or F or whatever) on the scale. However one can always subdivide the scale to a hundred, thousand or even a million markings (increasing precision). Once thousands of observers make millions of recordings of a set temperature (let’s say freezing water at sea level), the uncertainty of the mean will be smaller than a hundredth of a degree using the above formula for the Standard Error of the Estimate (SEE). However, it does not mean the measured mean value will be exactly 0 degrees C (i.e., accurate).

    In other words, there seems to be confusion about accuracy and precision in the above statement.

    • Chris, my point is that beyond a certain limit, increasing the number of measurements doesn’t increase either the real-world accuracy or the real-world precision. All it changes is the statistical uncertainty, not the real-world uncertainty.

      Best regards,

      w.

      • “doesn’t increase either the real-world accuracy or the real-world precision”

        I don’t really understand what these two terms mean. Never heard “real-world” metrics before.

        However, I agree with you that there is a limit to what can be measured. It is called the Planck length, 1.6 × 10^-35 m. Perhaps you meant this as your “real-world precision”.

        • You are both correct, in a way, but also incorrect. I’ve had a bit of training in metrology. You both seem to be confusing accuracy and precision. The limit on the precision of any measurement [or aggregation of measurements] is the precision available to you provided by the measuring tool. The limit on the precision of the ruler in the image is 1 mm. You cannot measure tenths of a millimeter, or any fraction of a millimeter, with a tool that is not marked in fractions of a millimeter. If you take a thousand measurements with that ruler, or one, you will never measure more precisely than ±1 mm. Similarly, if your radar altimeter, on one measurement, reads to the nearest tenth of a millimeter, your aggregation of fifty million measurements taken with that altimeter will never be more precise than ±0.1 mm. Whether it is accurate or not is another question, requiring comparison to a measurement taken of a known standard.

          • Brings to mind the same issue with the “warmest the Earth has been in history by 0.04°C” garbage we heard about a couple of years ago (or was it last spring). Up until the mid 20th Century, meteorological temperatures were measured to the nearest degree (F in the case of Central England, which has the longest record) because greater precision didn’t matter. And until the late 20th century the tenth of a degree was eyeballed (estimated). I know, I was an observer. There were no computers to number crunch and make predictions, and plus or minus one degree didn’t matter. Claims of warmer or colder with greater precision than one degree F (1 degF is more precise than 1 degC) are pure garbage!!!!!

          • “If you take a thousand measurements with that ruler, or one, you will never measure more precisely than ± 1mm.”

            This is not actually correct. Andrew Preece’s comment (way, way below) gives an example from the world of electronics of how it is done.

            When Kip posted a column about exactly the same thing, I created a math experiment in Excel that you can repeat for yourself. You can create 1,000 instances of measurements, with randomly distributed errors around a specific known value. Round each individual measurement to the nearest round number. You can still estimate the mean with surprising accuracy. More measurements produce better estimates. With 1,000 measurements, you’ll almost always be within two decimal places.

          • You can create 1,000 instances of measurements, with randomly distributed errors around a specific known value. Round each individual measurement to the nearest round number.

            This is the fundamental assumption where theory departs from reality. The reality is that measurements are rarely evenly distributed around the value you want to measure.

            If a measuring device is capable of reporting, say, only 3 values (1, 2 and 3) and you’re wanting to measure 2.75, then you’ll probably actually measure 3. If on the other hand you’re trying to measure 2.5, then you might do it if there are evenly distributed values of 2 and 3 coming from the device.
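
A tiny sketch of that point, using the 2.75 example above (the noise levels are assumptions, chosen only to show when averaging rounded readings can and cannot recover a value finer than the device’s resolution):

```python
import numpy as np

rng = np.random.default_rng(2)
true_value = 2.75
n = 100_000

for noise_sd in (0.0, 0.05, 0.5):
    # the device rounds every reading to the nearest whole number
    readings = np.round(true_value + rng.normal(0.0, noise_sd, size=n))
    print(f"noise SD = {noise_sd:4.2f}:  average of {n:,} rounded readings = {readings.mean():.3f}")
```

With little or no random scatter, every reading is 3 and the average stays at 3 forever; only when the scatter is comparable to the 1-unit step does the average drift toward the true 2.75.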

        • ” Never heard “real-world” metrics before.”

          Measure the thickness of printer paper, I’ll provide the yardstick.

          • Measure the thickness of 500 sheets of printer paper 400 times.

            That’s more relevant.

            Then try to figure out “why” the differences are present:
            Too much pressure on the caliper?
            Too little pressure sometimes?
            Different air masses between the 500 sheets of paper in each ream?
            Different number of sheets in different reams?
            Different sheet thicknesses between reams and between paper mills?
            Different calipers?
            Bad or changing caliper measurements?
            Caliper differences in successive measurements?
            Different users for the calipers?
            Different humidity when measured?

            but remember! The alarmists would have you believe none of these occur in the sea level measurements!

          • Instead of using credit cards or sheets of paper as an analogy (even though you have shown how many things can vary with even something as straightforward seeming as measuring something which seems fixed and definite), maybe a better analogy would be trying to find out how much cats weigh, and if they are gaining weight over time, and if so, is the weight gain accelerating?
            Suppose one had set up scales all over the cat world and the weight of cats was measured as they ran or walked across it.
            And someone else was driving around with a very accurate laser scanner which had been cleverly designed to measure the shape and size of cats very precisely, and had also devised an algorithm to translate these readings into a weight.

            Or even something simpler…how long are cats from nose to tip of tail? And, like the sea and the tides, the darn things never hold still; they can change the shape and length of the tail, and do so continuously.
            And then I graphed all the results…and on the page with the graphs I said all of the values were averages which had removed seasonal and daily variations to correct for meal times and cats that got more to eat during certain times of year.

          • I like comparing trying to measure cats with trying to determine the level of the never-still sea surface! Very apt analogy

            SR

          • “…the darn things [cats] are never holding still…” Oh then you’ve never met my slacker cat. She will sit in your lap for hours on end.

        • Andrew Stanbarger December 20, 2018 at 4:02 pm
          You have also forgotten “repeatability” of the instrument.
          Ever done Gauge Capability Studies?

      • ” beyond a certain limit, increasing the number of measurements doesn’t increase either the real-world accuracy or the real-world precision.”

        Number of measurements increases precision but not always accuracy. You can be precisely inaccurate.

        You really should take a stat class sometime.

        In general, the S.E. (the S.D. divided by (approximately) the sqrt of the N) dictates the significant figures of the measurement. You can easily add one or two decimals with 30 measurements. And with 1000 measurements, go crazy.

        That said, and you do haf to use a little judgement (no math formulas involved), but we have extended really crappy measurements to nice statistical differences all the time. It’s not rocket science. Just stats 101.

        If you measure the height of the two sides of your favorite table enuf times, you will always find them different. (unless you use a ruler calibrated in cm only with no tenths of a mm.)

        • trafamadore December 20, 2018 at 3:41 pm

          ” beyond a certain limit, increasing the number of measurements doesn’t increase either the real-world accuracy or the real-world precision.”

          Number of measurements increases precision but not always accuracy. You can be precisely inaccurate.

          Accuracy is how close successive measurements are to the true value.

          Precision is how close successive measurements are to each other.

          Please explain, using my credit card example above, how taking more measurements of the credit card will make the measurements either more accurate or more precise.

          w.

          • The more measurements the more confident you can be that the actual value is within a certain range.
            However, the size of that range is determined by the physical nature of the measuring process and equipment.
            They are two different things. But they are both called uncertainty because they are both related to the chance that a measurement isn’t reflective of the actual value.
            Hope that helps resolve the confusion.

          • Actually, with your credit card example, if the true value was 85.8 mm, more people will get 86 than 85 and even fewer 87, and even fewer 84 and 88. So even though the ruler was calibrated in mm, you can easily increase precision of the measurement.

            But I think you’re just being purposely obtuse on this, you could easily look that up in a stat book.

          • Willis, as long as the errors are randomly distributed, it is possible to determine the true value with surprising accuracy.

            In a cell in Excel, create a “true value” between 85 and 86mm. Copy that number down into a column 1,000 rows long, and in the next column use the random function to create measurement errors and then add the errors to the true value. (Create random numbers between 1 and minus 1. Or heck, make the errors between 2 and minus 2.) In the next column, round all the measurements to the nearest whole number.

            Now you have 1,000 measurements, all of which are wrong. How close do you expect the average of the 1,000 incorrect measurements to be to your true value? If you take 5 minutes to do this, I guarantee you are in for a surprise.
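
For anyone who would rather not build the spreadsheet, here is roughly the same experiment sketched in Python. The “true value” is arbitrary and the uniform ±1 mm error model is taken from the description above; both are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

true_value = 85.37                             # arbitrary "true value" between 85 and 86 mm
errors = rng.uniform(-1.0, 1.0, size=1_000)    # random measurement errors, as described above
measurements = np.round(true_value + errors)   # round every reading to the nearest whole mm

print(f"average of 1,000 rounded measurements: {measurements.mean():.3f}")
print(f"difference from the true value:        {measurements.mean() - true_value:+.3f}")
```

How well this carries over to real instruments depends, as the rest of the thread argues, on the errors actually being random rather than systematic.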

          • Steve O, isn’t it possible you learned something surprising about Excel’s “Random()” function?

          • trafamadore December 21, 2018 at 6:33 am

            Actually, with your credit card example, if the true value was 85.8 mm, more people will get 86 than 85 and even fewer 87, and even fewer 84 and 88. So even though the ruler was calibrated in mm, you can easily increase precision of the measurement.

            I agree … but only so far. First, despite your snide comment to me about taking a statistics class, you cannot increase precision by repeated measurements. Precision is the standard deviation of the measurements, which generally doesn’t change much with repeated measurements.

            And while you can increase the accuracy by the means you describe, you cannot do so indefinitely. You’ll get another decimal of accuracy out of your procedure, but not another three decimals of accuracy. WHICH IS WHAT I SAID!

            But I think you’re just being purposely obtuse on this, you could easily look that up in a stat book.

            “Purposely obtuse”? Perhaps you and your friends play underhanded tricks like that. I do not, nor do I appreciate being accused of that kind of behavior. I tell the truth as best I know it.

            w.

          • Steve O December 21, 2018 at 7:51 am

            Willis, as long as the errors are randomly distributed, it is possible to determine the true value with surprising accuracy.

            In a cell in Excel, create a “true value” between 85 and 86mm. Copy that number down into a column 1,000 rows long, and in the next column use the random function to create measurement errors and then add the errors to the true value. (Create random numbers between 1 and minus 1. Or heck, make the errors between 2 and minus 2.) In the next column, round all the measurements to the nearest whole number.

            Now you have 1,000 measurements, all of which are wrong. How close do you expect the average of the 1,000 incorrect measurements to be to your true value? If you take 5 minutes to do this, I guarantee you are in for a surprise.

            Steve, I’ve done that a number of times, starting back when I first got hold of a computer.. So despite your “guarantee”, that is absolutely no surprise to me at all.

            Stop assuming I don’t know what I’m talking about, and start thinking about the example I gave.

            w.

          • Willis, you posted :
            “Accuracy is how close successive measurements are to the true value.
            Precision is how close successive measurements are to each other.”

            Well, I was taught that “accuracy” is how closely the reported measurement represents the true value of the parameter being measured. On the other hand, “precision” just represents the number of digits used in the numerical value being reported.

            Thus, for example, a mechanical caliper may report a length measurement of 6.532 inches and yet be much more accurate than a scanning electron microscope that reports the same object feature as being 6.54798018* inches long if the SEM has been incorrectly calibrated or has an undetected electronic failure. (*Note: modern SEMs can indeed achieve resolutions of one nanometer, or about 4e-8 inches.)

            So, precision actually has nothing to do with accuracy and the number of times a given parameter is measured. And accuracy is not necessarily related to precision of the measurement, but it can be improved by statistical analysis of repeated measurements at any given level of precision.

            It is the combination of using highest precision within the known/calibrated range of accuracy of the measuring device that is of utmost importance to “truthful” value reporting, whether it be for a single measurement or multiple measurements of the same parameter.

          • Trafamadore,
            if the true value was 85.8 mm, more people will get 86 than 85
            That’s totally true. You can get SOME increased accuracy, but not ANY increase just by increasing the number of measurements.
            In your example, you can be sure that your average will be closer to 86 than to 85. But it is NOT true that, by increasing the number of measurements, you will get 86 exactly 80% of the time and 85 the other 20%. It may be 80% or 70% or 90%. You won’t change that by taking more and more measurements. You could increase precision to about 0.1 mm as Willis said, but not more than that, being realistic.
            This doesn’t mean that the maths are wrong. What it means is that the condition for the maths to work (perfectly evenly distributed errors) never happens in real world.

      • IF SATELLITES are capable of measuring
        sea level to the nearest millimeter,
        or even to the nearest centimeter,
        I’ll eat my hat (a scientific term).

        And I would say the same thing
        if the oceans had no tides,
        and no waves.

        In my opinion, there is no accuracy,
        and 100% uncertainty with the
        satellite methodology !

        Of course after the “proper adjustments,”
        the conclusion will be the usual:
        “It’s even worse than we thought.”

        The real question is whether sea level rise
        is causing property damage on the shorelines,
        such as on the Maldives Islands … where
        investors are building new resorts
        like money was growing on trees there.

        Who is getting hurt by sea level rise now?

        Who is getting hurt by global warming now?

        The correct answer is “the taxpayers” —
        getting fleeced by leftist con men,
        scaremongering about
        (harmless) global warming
        and sea level rise.

        Satellite sea level data
        does not pass the “smell test”
        (another scientific term)

        My climate science blog:
        http://www.elOnionBloggle.Blogspot.com

        • It is incredible how the smell test, the nose, makes scientists look blind.
          Smell is the most ancient sense. The smells of peppermint and caraway, quite distinct, are yet of two molecular enantiomers.
          Even we, nowhere near a cat, can actually smell chirality (rotational polarity).
          There is more to measurement than meets the eye!

          The satellites are sensing through a plasma, the atmosphere, of highly variable electromagnetic polarizability. Ironically, they may be telling us more about the atmosphere and about ocean surface physics than about sea level itself. It would be an expensive scandal if that data is thrown away in a mad pursuit of height accuracy.

        • “Jason-2 flies in a low-Earth orbit at an altitude of 1336 km. With global coverage between 66°N and 66°S latitude and a 10-day repeat of the ground track, Jason maps 95% of the world’s ice-free oceans every ten days. Sea surface height accuracy is currently 3.4 centimetres, with 2.5 expected in the future.” — source: https://www.eumetsat.int/jason/print.htm#page_1.3.0

          The “Inside the Acceleration Factory” article linked in the above article’s first sentence states that the C&W data analysis gives a SLR slope of 2.1 +/- 0.5 mm per year for the last 20 years, using a large amount of data from one or more spacecraft instruments (presumably the Poseidon-3 dual frequency altimeter that is on Jason-2, or something with similar accuracy), having at best a 25 mm accuracy.

          As to how anyone can assert that satellite radio altimetry (independent of GPS use) above oceans is accurate to +/- 1 mm or better . . . go figure.

        • Richard Greene:

          If satellites weren’t capable of measuring
          sea level to the nearest millimeter,
          or even to the nearest centimeter,
          Why did they build and deploy them?

          Even though the oceans have tides.
          And waves.
          The area struck by the radar beam
          is many waves wide.
          It can’t see them.

          Averaging enough points
          can handle random noise.
          It is systematic error
          we must fear.

          The GPS in your cell phone
          will soon be accurate to 1 cm.

          The first Grace satellites
          measured their separation to 1 um
          over 220 kilometers with microwaves
          The second generation with visible lasers
          should be much more accurate.

          LIGO detected gravity waves
          by detecting motions of
          1/10,000 the width of a nucleus.
          1/1,000 wasn’t good enough.

          You are talking out of your hat.
          Now eat it. (It doesn’t pass the smell test.)

          For those who don’t know,
          the satellites are being calibrated
          by measuring the distance down
          to sites with sea level known by GPS.

          Using one set of sites for calibration
          and a second set for validation
          might work.

          The potential for systematic error
          is high.

        • I take real-world measurements at my holiday cottage, the address of which is:

          1 Derwater Street
          Tipping Point
          Maldive Islands

          Merry Christmas to all! Glub glub glu…..

      • “beyond a certain limit” – What defines that “certain limit”? One can imagine many experiments where you can get very high precision with very crude tools. For instance, think of Buffon’s needle to determine many digits of PI. It is basically binary precision that leads to multi-digit precision.
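
For anyone unfamiliar with Buffon’s needle: drop a needle of length L on a floor ruled with parallel lines a distance d ≥ L apart, and the probability of the needle crossing a line is 2L/(πd), so the observed crossing fraction gives an estimate of π from purely yes/no observations. A minimal Monte Carlo sketch (convergence is slow, roughly one extra digit for every hundred-fold increase in throws):

```python
import numpy as np

rng = np.random.default_rng(4)

L, d = 1.0, 1.0   # needle length and line spacing (needle no longer than the spacing)

for n in (10_000, 1_000_000, 10_000_000):
    y = rng.uniform(0.0, d / 2, size=n)           # distance from needle centre to the nearest line
    theta = rng.uniform(0.0, np.pi / 2, size=n)   # acute angle between needle and the lines
    crossings = np.count_nonzero(y <= (L / 2) * np.sin(theta))
    pi_estimate = 2 * L * n / (d * crossings)
    print(f"{n:>10,} throws:  pi estimate = {pi_estimate:.5f}")
```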

    • Of course there were none of the subdivisions of a degree in 1850 when humans were visually eye-balling liquid mercury thermometers in a few sparse locations.

      So to pretend that we know what the “global temperature” was in 1850 to a tenth, or a hundredth, of a degree is completely absurd, dishonest and completely unscientific.

      • Not to mention the number of sensors was 2 to 3 orders of magnitude too few to even begin contemplating such a thing, even if they were accurate to 0.001C.

    • ChrisB,

      Not sure, even with my glasses, if I could differentiate the markings if the scale were “subdivide[d] … to a hundred, thousand or even a million markings.”

      “In theory there is no difference between theory and practice; in practice there is.”

      No one could read a scale with the divisions you suggest. And you mention a confusion about accuracy and precision. Your divisions could be precise, but not accurate.

      • PhilR: “No one could read a scale with the divisions you suggest. ”

        There is a tool called the electron microscope. With it one can measure a distance of 43 picometers. That is more than twenty billion times smaller than a meter. https://www.youtube.com/watch?v=eSKTFXv5rdI.

        As for the discussion about precision and accuracy please google it. You’ll be surprised what is out there.

        • So what, you do not use an electron microscope to measure Sea Level or temperatures, in fact for anything “large” either.
          So how accurate or precise is an electron microscope at measuring something 1 metre across?

    • Willis’ article refers to the practical limits of measurement accuracy; the Planck length is a theoretical limit on the measurement of distance which you will never achieve.

      In the Willis article the ruler itself has practical limits of measurement accuracy; it, for example, expands and contracts with temperature, as does the Visa card. So no matter what number you finally agree on, it is only representative of a certain temperature. We have all made an assumption that the ruler, which was probably printed in China, is actually the right scale, which is why most countries have standards bodies to oversee devices that measure things.

      So there are practical limits to the measurement accuracy of any measurement equipment and, no, averaging does not improve that limit because the errors lie outside the measurement distribution.

      • The point about a ruler and eyesight is that you are measuring to 1 mm and guessing or estimating anything less.

    • It is you who is mixing up terminology Chris. Subdividing the scale will increase resolution, not precision. Precision relates to repeated measurements. A meter could have excellent resolution (resolution is the smallest signal change that it can detect) and poor precision, although in my experience meters with excellent resolution tend to have excellent precision. The word precision is not recommended for use anymore because so many people mix up what it means. Instead the term Repeatability is recommended.

      I recommend the ISO Guide to the Uncertainty of Measurement to everyone interested in this topic. An excellent guide to it with loads of examples is the SAC Technical Reference 1.

      • “Precision relates to repeated measurements.”

        I disagree. Let’s say I measure a coin with digital calipers and record its value as 20.2 mm, which may be all I need to assist in verifying its authenticity. But in reality, I could record the full readout of 20.1825 mm displayed on the caliper’s digital scale. The first numerical value is less precise than the second numerical value but both represent the exact same measurement.

        I don’t need any repeat measurement to establish a given level of precision.

        And the above measurement scenario tells you absolutely nothing about how accurate those numerical values are unless you know the digital caliper has been recently calibrated (or you yourself just did so) against a known standard at some time before or after that measurement.

        • “When values obtained by repeat measurements of a particular quantity exhibit
          little variability, we say that those values are precise” Les Kirkup & Bob Frenkel, ‘An Introduction to Uncertainty in Measurement’, 2006 page 33 section 3.1.9.

  4. The old ‘the errors cancel out’ anti-argument. I bet those that claim that would believe the same even if guessing the future from goat entrails. You just need many goats for accurately guessing the future 🙂

    The same line of thinking goes for averaging wrong results from computer models. Why don’t they just pick results at random then average them, if they think the errors will cancel out no matter what?

    On a less funny note, the central limit theorem has its limits.

  5. Excellent explanation.
    Those that have used Slide Rules easily understand this. A simple way of realizing / demonstrating this is to perform a moderately complex calculation on a slide rule (one with more than three steps) as you normally would. Write down the answer. Next, perform the first step of the calculation and write down the result of that step. Slide the slider back and forth, then set the slide on that result. Do the same for the rest of the steps: each time reading and writing down the result of the intermediate step, sliding the slide back and forth, then resetting to that intermediate result to obtain the next result. Even with problems that only have two or three intermediate steps, the final result is drastically different (there is a sketch of this rounding-at-every-step effect just after this comment). A clear example of why this happens is when you divide the circumference by the radius, the index is EXACTLY on PI. If your slide rule does not have a clear mark for PI you will never put the slide in the correct spot.

    Also have problems over the fact that the satellite is moving, thus measuring a different spot on the ocean in a different portion of the Swell, a different wave height, and even a different ocean level, through a different wavy atmosphere.
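
The rounding-at-every-step effect described in the comment above is easy to sketch numerically. Here three significant figures stand in for slide-rule reading precision, which is an assumption, and the five input numbers are arbitrary:

```python
from math import floor, log10

def round_sig(x, sig=3):
    """Round x to `sig` significant figures (roughly what a slide rule can be read to)."""
    return round(x, sig - 1 - floor(log10(abs(x))))

# an arbitrary multi-step chain: ((a * b) / c) * d / e
a, b, c, d, e = 2.37, 8.41, 3.96, 7.25, 1.83

exact = ((a * b) / c) * d / e        # carrying full precision all the way through

step = round_sig(a * b)              # read off and re-set the slide after each step
step = round_sig(step / c)
step = round_sig(step * d)
stepwise = round_sig(step / e)

print(f"full precision throughout : {exact:.4f}")
print(f"rounded at every step     : {stepwise:.4f}")
```

The gap is modest with only three intermediate steps and no reading error; on a physical slide rule, where each re-setting also adds eye and alignment error, it grows faster, which is the point made above.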

    • Only my dad could use a slide rule. Messed with them a bunch when I was a kid, but could never figure how it worked.

      • I still have one NEW, in its box! My son, who is in the Oregon National Guard discovered them throwing one out and brought it home to me. I still remember how to multiply, divide and do logs but not much more. I will keep it, one of these years it will probably be worth a bunch of money.

        • There is a web site for that.
          Look it up.
          I have a yellow metal Pickett Dual Base ~1961
          Yours should have the name of the maker, copyright year, & model #.
          Slide rules as a topic cycle through about every 2 years.
          Last time, I spent an hour looking at sites.
          Give it a try.

        • I still have the slide rules I used in the first two years of mechanical engineering, nuclear physics, statics, dynamics, and reactor design. I competed in slide rule and mathematics in high school though, so they are well-used. Definitely “not in the box” shape!

          • And I have a Carmody Navigational Slide Rule. Never actually got past multiplication and division though, just kept using Norie’s Tables.
            But at Nautical College before my Mate’s exam we had a lecturer who used one all the time. We set him up. During an exercise someone looked up from his paper and asked as if he had forgotten, “What is the square root of four?” Out came the slide rule and the answer “two nearly” had the class in stitches. The rule didn’t get used nearly as much after that.

    • Usurbrain

      My slide rule had PI clearly marked but it only showed up when the circumference was divided by the diameter or twice the radius! : ]

  6. To make this more like climate science, let’s make that measuring 1000 different credit cards using 1000 different rulers by 1000 different people.

    Would these measurements increase the overall accuracy at all?

    • Unlike thermometers, the readings from rulers won’t drift over time if they aren’t re-calibrated regularly.

      • That all depends on the ruler material’s sensitivity to environmental conditions.
        Wood: heat and humidity
        Plastic: heat and humidity
        metal: heat
        laser: everything

      • They will drift because they expand and contract with temperature, and many will be printed in China with a “that is close enough” attitude 🙂

    • “To make this more like climate science, let’s make that measuring 1000 different credit cards using 1000 different rulers by 1000 different people. Would these measurements increase the overall accuracy at all?”

      Only after the data was properly adjusted of course…

    • MarkW
      It is worse than you suggest. The credit card doesn’t change length over time, whereas the temperature is never the same even in the same place. Thus, the randomness of the measuring error doesn’t come into play.

      • I think weighing all the cats in the world to determine if they are fattening up at an accelerating rate, may be a robust analogy.

  7. Sooo, with respect to satellite sea level measurements, what uncertainty can be expected?

    Also, I am curious why the rate of sea level change as measured by satellite seems always to be a factor of 2 higher than that measured by tide gauges. Can it be a reflection effect, like an echo, where the change in distance that the satellite measures is twice the change in sea level?

    • There was a thread a while back showing that satellites are measuring different rates in different areas, I dimly recall noting that they in general showed higher levels of rise in the middle of the oceans than they did close to shore. I can think of a few ways that this might actually be physically possible given all the cycles at play and the short observation time we have. But let’s put that aside for the moment and assume it is true.

      If the processes at play express themselves first (or more at this point in time) in the deep ocean than they do close to shore, then tide gauges, which by definition are located at the shore, would show less rise than would the satellites which are looking at a lot more area.

    • Originally the satellite showed a decline in sea levels. And, as they did with the ARGO data, they adjusted it to fit their preconceived ideas — the adjustment is called GIA and based on the belief that tectonic activity was hiding the “real” sea level rise.

      This of course begs the question: if the sea isn’t rising where we live and where the tide gauges are, why should we care?

      • Well, yes, the GIA adjustment is clearly bogus and has to be removed to get Eustatic (apparent) sea level. But it’s only about 12% (roughly 0.35 mm/yr). The rest of the discrepancy between tidal gauges and satellites is a mystery.

      • When the original readings did not agree with what they were expecting, they went in and looked at everything very carefully and identified some things that were out of whack or improperly calibrated, or maybe they rewrote the algorithm, or all of the above.
        Then they got a result which not only showed what they wanted to find, but found that it was worse than we thought!
        Jackpot!
        They then stopped looking for things that might be out of whack or miscalibrated and stopped looking for better algorithms.
        Just a hypothetical, but this is how confirmation bias, and climate science, works.

    • From a satellite sea level measurement perspective the biggest problem is waves. You are trying to measure the surface of something that has these lumpy things on it. Those waves by definition distort the very surface you are trying to measure locally; the more of them, the more they distort.

  8. Many years ago, on my first day in CHEM 101 class, the instructor told us we would have points deducted for any lab or test result where we express an answer to a calculation with a higher degree of accuracy than the least accurate measurement that was used in the calculation.

    This story makes me think of the announcements saying the current year is the hottest ever because the average temperature is 0.00023 degrees higher. If the weather stations providing these measurements have an accuracy of only 0.05 degrees, someone is full of bull waste products.

    Do we need to send these people back to undergrad classes to remind them of the basic rules of math and science?

    • “Do we need to send these people back to undergrad classes to remind them of the basic rules of math and science?”

      That would be a waste of time and money, as witch doctors don’t follow the basic rules of math and science. But we do need to keep reminding readers that the reason ‘climate scientists’ don’t follow basic freshmen chemistry science rules is because they’re witch doctors, and NOT scientists!

  9. As applied to climate studies, a better analogy would be a large group of riflemen, shooting at different targets. All the riflemen have varying skills, and all the rifles/ammunition have differing inherent accuracies.
    So would combining the results give any better understanding of the odds of any one target being hit?

  10. Do those satellites used for sea level measurements ever measure something that is at a known level, just for a reality check?
    Every time a satellite passes over a harbor, why not take a reading, at the same degree of accuracy, at that moment?

    SR

    • My understanding is that they are checked by measuring the level over lakes, where waves and tides are not much of a problem. However, the same issues persist.

      w.

        • Wouldn’t that also mean the satellite reading would necessarily be constrained to the same degree of accuracy as can be had from the lake’s gauge?

          SR

        • If it is, then the uncertainty of the tide gauge must be included as part of the uncertainty calculation for the satellite measurement. Thus the satellite measurement uncertainty figure must be higher than the lake gauge uncertainty figure.

      • If they are using lake surfaces for reference, how do they know what the level of the lake is? Prior to GPS this was all done with eyeballs and transits. And even GPS has a prolate ellipsoidal error envelope. It has about 6 meter radial horizontal accuracy and 10-20 meters vertical accuracy.
        That’s one reason inexpensive drones won’t be making deliveries any time soon.
        https://www.gps.gov/systems/gps/performance/accuracy/

        • Large lakes often have their own “tide gauges”. Here’s a link to a web site for the Lake Champlain site at Rouses Point New York. https://water.weather.gov/ahps2/hydrograph.php?wfo=btv&gage=roun6
          And another for a site at Burlington, VT https://nh.water.usgs.gov/echo_gage/measurements.htm

          There would be all sorts of problems with trying to use these as calibration points. All the problems of tide gauges less tides themselves but plus changes in lake level due to precipitation, evaporation, river inflow and outflow. Plus which, the level of any point on the lake varies with which way the wind is/has been blowing.

          I honestly don’t know how/if satellite RA’s are calibrated against surface targets. There are many papers on the subject and they concur that it’s a tough problem at the accuracy levels required. My vague impression is that they are checked periodically to make sure they are not drifting off into fantasy land, but that there’s no overt correction for RA calibration error.

          I’ll add it to my lengthy list of things to look into.

        • Had some surveying done for property values this Fall. He just used a GPS pole for legal surveys. Next year use your smartphone!

          “Smartphones’ GPS systems are going to start getting a lot more accurate. According to IEEE Spectrum, Broadcom is starting to make a mass-market GPS chip that can pinpoint a device’s accuracy to within 30 centimeters, or just under one foot. That’s compared to today’s GPS solutions, which typically have a range of three to five meters, or up to 16 feet away.”
          https://www.theverge.com/circuitbreaker/2017/9/25/16362296/gps-accuracy-improving-one-foot-broadcom

      • There are specific sites around the globe with their own floating laser reflective targets which have GPS surface fixes.

        I know Lake Issykkul (Kyrgyzstan) has one; Jason 1 and 2 had a floating target in Bass Strait off Tasmania in Australia, which I assume Jason 3 also uses.

        Jason 3 is now flying the same orbital path Jason 2 used to so I imagine it uses the same calibration sites.

        • So the algae and seaweed and dirt growing underneath and on top of the calibration site buoys didn’t change the “elevation” of the reflectors over those many years in the middle of Russia?

          • The buoy is GPS positioned, what you are saying makes no sense????

            All that matters is that the buoy is floating on the water surface; are you saying it isn’t?

          • The buoy is floating on a water surface. The reflectors on the top of the buoy are held above the water level, obviously tilting as the buoy jerks and moves on its anchor chain above the bottom as random waves go by, as wind tilts the buoy slightly.

            As the buoy gets physically heavier (as with every marine object with a waterline and anchor chain supporting growing biologics and dirt and scum), the buoy sinks down. This requires daily cleaning all over the surface, if you’re going to claim a “sub-millimeter” calculation accuracy in the RESULT of the “measurements” made from your instrument that is calibrated from a moving irregular source. A laser reflector on the moon, on the other hand, IS capable of giving higher accuracy because the only things moving it are “moonquakes” and aliens.

          • I could add imagine the lake got lifted or dropped 100m in an earthquake. The GPS reading on the floating buoy measures the new position. Jason 3 sees the new height of the lake and it should match the buoy position from it’s own GPS as up/down 100m and so we know the satellite is in calibration. The only way it can fail is if the buoy isn’t floating.

          • As for accuracy, Jason 3 clearly states its accuracy as 2.5 cm RMS, so I don’t know where you get sub-millimeter accuracy. Climate science models things from the data, but that has nothing to do directly with Jason 3.

          • LdB

            No, earthquakes move the surface anywhere from 2 cm (mag 4, 5 or 6) to 2-3 meters sideways (mag 7.5+) and “maybe” 1/2 cm to 1-2 meters vertically in limited areas around the quake. 100 meters lake surface movement? No.

            Look at the fence broken in the 1906 San Francisco quake, or the road busted a few days ago in Alaska. Underwater, even the Japan 8+ quake disturbed the sea floor “only” a few meters over a long line.

            Well, you see, regardless of what the satellite accuracy is, the result of the satellite readings is a sea level change “accelerating” from 2.1 mm/year to 2.5 mm/year, with a “trend line” being analyzed to 3 and 4 decimal places to find anything but a linear (or steady) trend. But, you see, an “acceleration” had to be found!

          • The sea level measurement and acceleration you are talking about are a different thing; they have nothing to do with the instrument doing the measurement. That belongs to the climate scientists.

            At the moment your criticisms are all over the place like a rabid dog but you clearly don’t know how it all works. I suggest you do some reading so you can voice whatever complaint you want to make.

  11. One of my favorites is the first sentence of Chapter 5 of the IPCC’s 4th Assessment Report:

    Climate Change 2007: Working Group I: The Physical Science Basis *
    The oceans are warming. Over the period 1961 to 2003, global ocean temperature has risen by 0.10°C from the surface to a depth of 700 m

    Really? They can measure the global ocean to within 0.10°C? Is that second zero after the decimal point really significant? Let’s see: 0.10 / (2003 − 1961) ≈ 0.002°C per year. They must have some very accurate data.

    * That’s a WayBack Machine link. It seems that the people at the IPCC have recently decided to make it difficult to navigate their website. Could that possibly be by design?

  12. “Regardless of the number of measurements, you can’t squeeze more than one additional decimal out of an average of real-world observations.” I was taught that you can’t get more accurate than the least accurate number used when averaging but I can see why you might push it out one decimal place. My caveat to that is while I’ve always enjoyed math I stayed away from statistics classes.

    • True, Darrin, the concept of “significant figures” seems to have been forgotten in climate science.

      w.

      • Significant figures in climate science.
        (To quote someone else about something else)
        “They don’t know what it is, they just know what it’s called.”

        • “Significant figures in climate science.”

          That’s the ilk of Mann et al isn’t it? Nothing to do with real numbers.

  13. And to top it all off, the “sea level” that the satellites measure is not the “sea level” that human civilization needs to care about. I hate to claim it’s not possible, but I doubt even the most sophisticated deep ocean creature will notice if the mid-Pacific ocean depth changes by a meter or two. But in any case, we don’t. We care about sea level relative to shorelines where there is significant population and fixed infrastructure.

    And in those cases, if sea level rise is a problem, it’s most likely due more to land subsidence than actual change in the ocean level.

    The satellite data is interesting and no doubt useful for some things, but it isn’t what we should pay attention to.

    • There is one case where open ocean measurement is critical: tsunami warning. They start as harmless swell, until they arrive at some coastline. A lot of talk about buoy sensors went nowhere, AFAIK, after the big one in Indonesia. And there were strong rumors that a deep-sea ocean creature, a US Ohio-class boat, was severely damaged, rolled like a toy, and had to limp home.
      Some of these are caused by major earthquakes, which, as it happens, also have warning signatures with ionospheric effects that ironically could in theory be monitored with the same satellite radar system (if subtle effects are not averaged out). The “Fukushima” earthquake precursors were in fact measured; resistance to this earthquake-precursor “smell test” blindsides otherwise capable scientists, and gets a lot of people killed.

  14. An excellent common-sense rule, Willis, thank you. Regarding sea level measurements: back when estimates for 2100 were a rise of twelve feet plus (Hansen?), I argued that if that is the worry, we needn’t run down to the sea with a micrometer; a yardstick will do, and if worse, then axe handles are sufficient for the job.

    Also, for a global average temperature rise of 6°C, which seemed to be the worry, a dozen thermometers scattered across the Arctic, where with Arctic enhancement we would see 2 or 3 times this warming, would be a perfectly adequate warning system. Had they done this in 1988, we would know by now that another degree at most is really all we are in for.

    It would be instructive to know what all the T adjustments have done to the uncertainty of measurement. Indeed, the algorithms are changing previous temperatures as we speak. As Mark Steyn noted at a Senate hearing on data quality, ‘we don’t know what the temperature in 1950 will be by 2100, and yet certainty is high about 2100’s temperature’.

    • Even today 1950 is a lot colder than it was in 1950. Proof positive of global warming. The further we go into the future the colder the past becomes. Entropy in action. Eventually 1950 will be the date of the big bang.

      • Yogi Berra would have been a climate scientist if he had been born 100 years later than he was.
        “It gets late early out here”, “But the towels at the hotels are the best…I could barely close my suitcase”.

        • From my file of tag lines and smart remarks:

          Climatologists don’t know what the temperature is today within ±0.5°C, and don’t know what it was 100 years ago, but they know it’s 0.8°C ±0.1°C hotter now than it was then.

        • That deserves a house point, or two. Brilliant.
          I am off to calibrate my hockey stick, I use it to measure climate change, I sometimes need to re-calibrate it, especially when historic matches have to be replayed.

        • CO2, is a temporal gas, it steals heat from the past and moves it to the future.

          I’ve added that one to my smart remarks and tag line file.

  15. Accuracy depends on resolution. Uncertainty depends on distribution relative to the true statistic (so-called “normal”). The average temperature is highly time and space dependent, with a potentially large spread in even a narrow frame of reference.

  16. Averaging measurements WILL increase resolution for a discretely sampled signal, as long as there exists appropriate random dither noise.

    See the excellent paper by Walt Kester, “ADC Input Noise: The Good, The Bad, and The Ugly. Is No Noise Good Noise?”, published in Analog Dialogue 40-02, Feb 2006.

  17. Willis wrote: “Regardless of the number of measurements, you can’t squeeze more than one additional decimal out of an average of real-world observations.”

    That depends on what type of error you are dealing with. Using Excel or some other software, generate 10,000 numbers with a mean of 15 (°C) and a standard deviation of 1 (°C); using an Excel add-on or any statistics package, each value will have ten or more digits to the right of the decimal point. Calculate the mean and standard deviation of the 10,000 numbers you actually received, with all of their decimal places. Then make another column of data in which the values are rounded to the nearest 0.1. Take the mean and standard deviation of the first 100 rounded values, and separately of all 10,000 rounded values. Do you get closer to the “right” answer using 100 rounded values or all 10,000?
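
    For anyone who wants to try this without Excel, here is a minimal sketch of that experiment in R (the seed and sample sizes are only illustrative):

    set.seed(1)                            # any seed; just makes the sketch repeatable
    x  <- rnorm(10000, mean = 15, sd = 1)  # 10,000 simulated readings, many decimals each
    xr <- round(x, 1)                      # the same readings rounded to the nearest 0.1
    mean(x)                                # full-precision mean
    mean(xr[1:100])                        # mean of the first 100 rounded values
    mean(xr)                               # mean of all 10,000 rounded values
    # All 10,000 rounded values land closer to 15 than the first 100 do, because
    # here the 0.1 rounding error is small compared with the spread of 1.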

    If you have random noise in your data, averaging a lot of values can give you a more accurate mean. If you have systematic errors in your data, your mean won’t get closer to the right answer. For example, if you fail to look straight down and always look at a slight angle, parallax could make all of your measurements systematically bigger or smaller than they should be. If you sometimes look slightly from the left and equally often slightly from the right, your measurements will have random noise in them, and averaging can help.

    IMO, because there are so many large adjustments involved in converting the radar signal’s return time into a distance, the possibility of systematic error is the biggest problem. One large adjustment is for humidity, and it involves meters, IIRC. Humidity data is determined from re-analysis. Re-analysis is based on a set of inputs that changes over time. A gradually changing small bias in humidity of 0.1%/yr appears capable of biasing the trend by 1 mm/yr. IIRC, there have been three major corrections of systematic errors in calculating SLR from satellite altimetry data.

    • Exactly. What Willis is saying is basically that in the real world you can never count on having completely eliminated every possible systematic error in your measurements. That only happens in the world of mathematics. Assuming that less than 10% of the error of any measurement is systematic is a wrong assumption.

  18. I’m confused. The commenter said,”…resolve over a very large number of measurements to a constant bias…”, which strikes me as the gist of the comment. I don’t think he was commenting on the accuracy of the measurement but on the precision of the measurement. If the measurements are as precise as he seems to imply then the inaccuracy should be consistent. You can usually correct for inaccuracy if your measuring device is precise.

  19. This seems to be what I call the False Precision fallacy. A true measure of accuracy (not precision) would be sigma/mean, but of course we cannot measure the mean accurately, which is why we are resorting to statistics in the first place.

    Let’s generate a random number between 1 and 100, a million times. The mean would be 50, every time, whether we measured it to 1 significant digit or 100.

    • Sorry, Robert, not true. First, I think you meant a random number between zero and a hundred. Here’s the experiment:

      mean(runif(1000000,0,100))
      [1] 50.00737

      Note that this is in line with the theoretical uncertainty of the mean as calculated above, viz:

      sd(runif(1000000,0,100))/sqrt(1000000)
      [1] 0.02888416

      w.

        • Sorry, Mark, but that is NOT the problem with Robert’s claim. The random number generator is giving numbers close enough to random.

          The problem is that his claim goes against centuries of mathematical knowledge about the “standard error of the mean”. Even with N = one million, you don’t get exactly fifty as Robert claims.

          w.

  20. The length of the credit card is at least assumed to remain effectively constant. That is not the case for either temperature measurements or those of sea-surface height. They are both dynamic.

  21. Willis,

    I think you are confusing resolution (1 mm divisions on the ruler) with precision (how repeatable the measurements are). Also, you are using an example whose precision/repeatability is smaller than its resolution (everyone measures between 85 and 86 mm).

    I think if you redo your analysis with something that changes more than the resolution of your measurement you will come up with different conclusion.

    For example, what if I wanted to estimate the average length of pencil being used by primary-grade students in a district with 100,000 students by asking them to measure their pencils and report the result to their teacher to the nearest mm? And let’s say students will have up to a quarter-mm error (they might round x.25 mm up or x.75 mm down, but not worse).

    In this example sampling more students will still improve accuracy (assuming zero bias) because the variation in individual pencils will be much greater than measurement precision even if the average over the district changes very little. With enough students I can still get better than 0.1 mm accuracy (relative to the actual mean) even with measurements rounded to the nearest mm.

    David

    • Thanks, David. I’ve chosen that example because it is related to our measurements using thermometers. If the actual temperature is 85.37°F and we ask 10,000 people what the temperature is, that is very much like my example, and unlike your example.

      Finally, your example only works if the errors are symmetrically distributed … which in the real world is generally not true. For example, when measuring extreme cold people will tend to round down, because it’s more impressive …

      My best to you,

      w.

  22. I was not in the original discussion, but measuring ocean conditions such as wave heights can be misleading. I was on the deck of a very large derrick barge one time off the shore of the North Island in New Zealand, trying to get a fixed production platform piled to the ocean floor. The waves did not seem too high to me compared to other locations around the world. As I was pondering what to do, I noticed the wave pole on the fixed platform. The swell was 19 feet. I then timed the swell period, and it was 20 seconds. I then thought about the 4,000-mile fetch of open water and understood why one could not see the wave/swell height changing very much.

    With the satellite moving so fast, I wonder how the surface measurement is corrected for wave heights, periods, and fetches. There would be a difference between the middle of an ocean and the near-shore area. As an old sea salt, Willis probably understands this aspect well.

  23. This is why we have tolerances on drawings/tooling: to eliminate guesswork and scrap.
    In my position as a mechanical engineer for over 50 years, I had the distinction of being the evil one for rejecting mechanical parts that failed unusually tight tolerances per the drawing. Many times the problem was alleviated by loosening the tolerance, where allowed, with an Engineering Change Order agreed to by all involved. No need to use a micrometer, let’s say, on a credit card. Or a yardstick on a precision part.

    Regards, retired mechanical engineer

  24. The issue is with the absolute uncertainty of each measurement. If this error is consistently biased in one direction and/or not random, then the error will not cancel and the uncertainty of the average converges to the bias. However; if the error is normally distributed around actual values, then the precision of the average will continue to increase as the number of samples increases.

    • Thanks, CO2, but that’s not the issue I’m highlighting. In my example, the error will be normally distributed, but we still cannot use a ruler as a micrometer no matter how many readings we take.

      w.

      • Willis,

        Yes, a ruler can’t be used as a micrometer by measuring one distance over and over and taking an average, but that’s not the case with satellite based sea level measurements.

        In the first case, the same instrument will always result in the same measurement of the same thing, so there’s no variability to average out. However; if you add stochastic noise to the measurement whose variability is more than the precision of the measuring device and centered around the actual distance, then there is variability to average out and the precision will continue to get better as more measurements are made.

        In the second case, the same instrument makes many measurements of different distances at different places at different times and never measures the same thing twice. To the extent that the data is normally distributed around the steps of the measurements, the precision of the average will continue to get better as more measurements are made.

  25. Here is a good explanation of the “uncertainty of a mean”.

    http://bulldog2.redlands.edu/fac/eric_hill/Phys233/Lab/LabRefCh6%20Uncertainty%20of%20Mean.pdf

    As is stated in their text,

    “the estimate of the uncertainty of the mean given by equation 6.1 has two properties that we know (or at least intuitively expect) to be true. First, it implies that the uncertainty of the mean is indeed smaller than the uncertainty of a single measurement (by the factor 1/sqrt(N)), as we would expect from the argument given in the previous section. Second, the more measurements we take, the smaller Um (uncertainty of the mean) becomes, implying that the mean becomes a better estimate of the measurement’s true value as the number of measurements it embraces increases”

    • Max,

      “Second, the more measurements we take, the smaller Um (uncertainty of the mean) becomes, implying that the mean becomes a better estimate of the measurement’s true value as the number of measurements it embraces increases”

      This statement is incorrect. It should read ‘the measurement’s true correspondence to the instrument’s indication.’ The point is that averaging merely provides a more PRECISE instrument indication. The averaging does not address the underlying ACCURACY of the instrument itself.

      The ACCURACY of an instrument is a finite physical characteristic. It is a function of its basic design limitations. Those limitations are typically described in its specifications and/or in its metrology lab calibration statements. It is an error to assume an instrument’s ACCURACY can be improved by averaging any measurements done with it. The best that can be done, given reasonably random scattering of measured values, is to produce a value that more PRECISELY represents the instrument’s measurement indication. ACCURACY is still no better than what is shown on the instrument’s calibration sticker. Also, keep in mind that no claim has ever been made by instrument manufacturers or cal labs that calibration errors are randomly scattered within the range specified on the sticker.

      An argument has frequently been made that the absolute accuracy of the measuring instrument is not important when measuring trends. Assuming that measured values collected at different times with the same instrument will each have the same error relative to the true physical value is also false. Instrument calibration drifts with time, and not necessarily in a constant direction or at a constant rate. Some instrument types are better than others in this regard. There are also issues with hysteresis, mechanical friction, thermal history, instrument contact with the measured process, etc.

      The concept is simple, never claim more accuracy than the instrument specification. On top of that, you must add any degradation in that accuracy added by how the instrument is connected to what it is supposed to be measuring. Averaging measured values does not reduce those errors.

      • It has nothing to do with instrument accuracy. It is just fundamental statistics of sampling. You can choose not to believe statistics if you wish.

        • Interesting response. You say instrument accuracy has no bearing on data collected with that instrument? Very accurate, sorta accurate, and barely working instruments are all the same for collecting data?

          I definitely believe statistics when used appropriately. The trick is understanding what the result of a statistical manipulation is. Just because a particular statistical algorithm can be run does not mean its results apply to the desired end. A simple example is calculating the average life span of humans and then letting your kids play in the middle of a busy street cuz statistics say they will live decades longer. Calculating average life span is valid. Assuming it means very much for a particular individual is not valid. The same principle applies to this discussion.

          • Maybe I don’t explain it too well, so I will give it one more shot. You can never make one measurement more accurately than your instrument will allow. But if the errors are randomly distributed around the actual value, then the accuracy of the measuring device is not critical when you take a large number of measurements. However, it is very likely that a more accurate instrument is also more likely to have errors that are in fact random, while a less accurate instrument will likely have a bias in its errors.

            Maybe this equation will help, or at least help me.
            We can write a measured value as

            U = Ut + delta

            where U is the value you measure, Ut is the true, real value of what you are measuring (which you don’t know) and delta is the error in measurement. If you take a whole bunch of N measurements and then take an average we get

            sum(U)/N = sum(Ut + delta) / N = sum(Ut)/N + sum(delta)/N

            Now, if the errors are truly random around the real value, then

            sum(delta)/N approaches zero as N gets large,

            and since Ut is the one true value, sum(Ut) = N*Ut
            and sum(U)/N is just the average of our measurements ie. Uavg

            So with all that we get

            Uavg ≈ Ut for large N

            and voila, the average of our bunch of measurements will in fact converge to the actual value, as long as the errors are random.
            We employ this method in boundary-layer turbulence statistics.
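
            A quick R sketch of where that argument does and does not hold (the true value of 20 and the 0.3 offset are just made-up numbers for illustration):

            set.seed(42)
            Ut     <- 20                                   # the (unknown) true value
            N      <- 100000
            random <- Ut + rnorm(N, mean = 0,   sd = 1)    # purely random errors: delta averages toward 0
            biased <- Ut + rnorm(N, mean = 0.3, sd = 1)    # same noise plus a 0.3 systematic offset
            mean(random)   # approaches 20 as N grows
            mean(biased)   # approaches 20.3, no matter how large N gets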

          • Max, there is no reason to think that the errors are randomly distributed around the true value.
            In fact, no one knows what the true value is, and besides the measurements are not of the same thing.
            There is every reason to be sure that there is NO such known random distribution.
            So everything you wrote after that is moot.
            No one is declaring statistical mathematics to be invalid when used correctly.
            But you seem to be saying that we can ignore the rules about when the LLN applies, and when it does not.

          • Max –> Don’t forget we are not talking about sampling here. We are talking about one temperature measurement, taken one time, on one day.

            Some folks want to say you can average daily temperatures, monthly temperatures, etc. and get better and better certainty of the mean. However, that just doesn’t work. It doesn’t help you determine what the real temperature was on any given day.

            You only have to look at temperatures for two days. The first is 50 +- 0.5 and the second 51 +- 0.5. What is the average? It is 50.5 when not considering the possible values determined by the error. Does that mean each day was really 50.5, or just the first day, or just the second day? What was the average when you throw in the errors? What was the actual temperature for day 1, how about day 2?

  26. An unstated presumption, without supporting evidence, is that the standard deviation remains constant, i.e. no growth in time, i.e. no rate of change: False.

    Ha ha.

  27. ‘Certainty’ methods/discussions aside, color me skeptical (for now) on the accuracy of SLR derived from satellites. After all, look at the contested ‘accuracy’ around the height of a well-known and relatively stationary object that does not have ocean swells nor tides…Mt. Everest.
    From https://www.britannica.com/place/Mount-Everest:
    …A Chinese survey in 1975 obtained the figure of 29,029.24 feet (8,848.11 metres), and an Italian survey, using satellite surveying techniques, obtained a value of 29,108 feet (8,872 metres) in 1987, but questions arose about the methods used. In 1992 another Italian survey, using the Global Positioning System and laser measurement technology, yielded the figure 29,023 feet (8,846 metres) by subtracting from the measured height 6.5 feet (2 metres) of ice and snow on the summit, but the methodology used was again called into question.

    In 1999 an American survey, sponsored by the (U.S.) National Geographic Society and others, took precise measurements using GPS equipment. Their finding of 29,035 feet (8,850 metres), plus or minus 6.5 feet (2 metres), was accepted by the society and by various specialists in the fields of geodesy science and cartography. The Chinese mounted another expedition in 2005 that utilized ice-penetrating radar in conjunction with GPS equipment. The result of this was what the Chinese called a “rock height” of 29,017.12 feet (8,844.43 metres), which, though widely reported in the media, was recognized only by China for the next several years. Nepal in particular disputed the Chinese figure, preferring what was termed the “snow height” of 29,028 feet. In April 2010 China and Nepal agreed to recognize the validity of both figures.

    Use the long-term land-gauges that are on non-rising/non-subsiding rock for sea level measurement, I say. The rest is too close to nonsense.

  28. “Following that rule of thumb, if you are measuring say temperatures to the nearest degree, no matter how many measurements you have, your average will be valid to the nearest tenth of a degree … but not to the nearest hundredth of a degree.”
    The point that may slip by here is what is being measured: In your credit card example, you are measuring the credit card. When you switch to temperatures (which is often the case when talking about climate) you are not usually measuring temperature – per se, but the temperature OF something – air, water, etc. Your example falls apart when we realize that it is nearly impossible [outside of a laboratory] to make very many temperature measurements of EXACTLY the same ‘piece’ of air under the same ambient conditions.
    Conclusion: It is VERY difficult to even achieve the additional decimal point of better precision.

    • Beyond the accuracy of the measurements themselves, there’s the difficulty of taking enough measurements so that you can claim to be taking a representative sampling of the entire earth.
      Back in 1850, there were at most a couple hundred sensors, mostly in western Europe and the eastern US.

      The rest of the world was virtually unmeasured.
      You might be able to say that we knew what the temperature of the eastern US and western Europe was, but to go from that to claiming that you could measure the temperature of the entire world, much less to a tenth of a degree, is ridiculous.

    • nw sage,

      Yes, there’s a big difference between measuring one invariant thing and calculating the average of many measurements of different things. In the first case, the precision and accuracy are limited by the measuring instrument, and the average of multiple measurements will not improve this, as the instrument will always return the same value. In the second case, the randomness of the data improves the precision and decreases the uncertainty in the average, but not for any one measurement. The result has no predictive power regarding what the next measurement will be, but it accurately predicts what the average of many measurements will converge to.

  29. After reading all the comments it seems to me that the concept of uncertainty has a very uncertain meaning. Not even the mathematical average of all the various proposals seems to be certain.

  30. If the raw data is poor, the average of the raw data is poor.

    So this is really just another manifestation of the Garbage In, Garbage Out principle.

  31. OK Willis. So you don’t believe in statistics. That’s fine with me. There are surely practical issues with applying statistics although I’m not sure I follow your logic. And truly I don’t much care how many angels can dance on the head of a pin. I’ll leave it to Nick Stokes or his equivalent to argue with you about those details. I expect they are firing up their keyboards to discuss your disturbing lack of faith even as I type this.

    But you’ve also missed an important point about satellite sea level data. It’s very noisy. You aren’t measuring a credit card, you are measuring an earthworm whose length changes constantly (within limits). And to some extent, your measuring tools are calibrated rubber bands whose lengths aren’t as constant as one might like.

    How do you deal with that? You take a LOT of measurements and average. You don’t think that’s a valid procedure? Do you have a practical alternative?

    If your point is that we really don’t know exactly how accurate the resulting averages are, I agree, we don’t. Mostly that’s because we don’t really know the distribution of the noise in the raw sea level data, and probably don’t have all that good a grip on some of the calibration errors either.

    One further point. We’re mostly worried about Sea Level Change, not absolute Sea Level. How do we compute that? We subtract the Sea Level estimate at some time T0 from the estimate at some other time T1. That doesn’t do anything for random error, since the values at T0 and T1 both have uncertainties due to random error; the difference of two uncertain numbers is, on average, a somewhat more uncertain number. But some of our errors are probably always about the same magnitude and sign. Those are biases, not random errors. Biases largely go away when we subtract.
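
    As a rough illustration of that last point, here is a hedged R sketch with made-up numbers (a 2 mm/yr true rise, 10 mm of noise per annual estimate, a constant 50 mm bias, and a 0.5 mm/yr drifting bias):

    set.seed(7)
    yr    <- 0:24                                          # 25 years of annual estimates
    true  <- 2 * yr                                        # a true rise of 2 mm/yr
    est   <- true + 50 + rnorm(length(yr), sd = 10)        # constant 50 mm bias plus random noise
    coef(lm(est ~ yr))["yr"]                               # recovered trend: still ~2 mm/yr
    drift <- true + 0.5 * yr + rnorm(length(yr), sd = 10)  # a bias that drifts by 0.5 mm/yr
    coef(lm(drift ~ yr))["yr"]                             # recovered trend: now biased, ~2.5 mm/yr

    A constant bias drops out of the difference or trend; a bias that drifts with time does not.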

    • “Some of our errors are probably always about the same magnitude and sign. Those are biases, not random errors. Biases largely go away when we subtract.” If we just could identify them ..

    • Mr. Magoo looks at something. His eyesight is not good so it is fuzzy. With this logic all it takes is enough Mr. Magoos all looking at a fuzzy something to give you a crystal clear image of that thing. NOT!!!!

      • Modern image processing techniques would seem to contradict this. What you are talking about is something done routinely for imaging deep space. For example, the Hubble Deep Field.

    • Don K, you write “You aren’t measuring a credit card, you are measuring an earthworm whose length changes constantly (within limits). And to some extent, your measuring tools are calibrated rubber bands whose lengths aren’t as constant as one might like.”

      And out of that process you believe statistics allows you to report an earthworm result with GREATER precision, accuracy, and to more significant figures than measuring a plastic card with a wooden ruler?

      And out of your belief in statistics, what do you affirm or deny regarding the FRACTAL properties of the “edges” of either an earthworm or a credit card? Analogously, how long is the coastline? Is it conceivable in your estimation of Willis’s problem that the top length of the (not quite rectangular) credit card is different from the (nearly but not quite parallel) bottom edge, by some length within the uncertainty of the measurement process?

      I don’t understand your claim with sufficient detail to determine whether or not it’s persuasive.

      • And out of that process you believe statistics allows you to report an earthworm result with GREATER precision, accuracy, and to more significant figures than measuring a plastic card with a wooden ruler?

        Actually, I’m not a true believer in statistics, because it depends on assumptions about the world that seem to me to be rarely met. But I think the Standard Error of the Mean which is I think what we are dealing with here might be one of the few things in the world that actually has a Gaussian distribution. If it actually is Gaussian, textbook statistics might just work. But that’s not what I’m claiming.

        Anyway, what we have here is more like good precision but lousy accuracy. We’re measuring Willis’s credit card with a really good micrometer. But the folks making the measurements don’t really know how to use it very well, so we get substantially different measurements every time we try. Why do you believe that averaging won’t improve the results?

        Fractals? Those are boundary thingees. Not very relevant I think (hope). The physical phenomenon we depend on is reflection of a microwave radio pulse from the sea surface and I think the ambiguity there is quite small. I believe it’s one of the least uncertain elements in the process. Trouble is that the sea surface is usually not very flat and the radio beam isn’t that narrow, so what the satellite “sees” is something difficult to describe or analyze.

        • Don K you write: “We’re measuring Willis credit card with a really good micrometer.”

          Well, not according to Willis!

          HE writes: …”people … use the ruler … to measure the length of the credit card in millimeters.

          Whoever “we” might be, “we” agree that “we” have lousy “accuracy”. But I guess my question to you is, do you believe increased (statistical) precision moves “our” actual understanding of a measurement from “lousy” to “good”? You correctly interpret me to believe that averaging doesn’t “improve” things. We have a measure in millimeters and I can’t see why YOU believe that CAN be improved. How do we communicate across that gap?

  32. I’m a little rusty on this, but I have done instrument measurement uncertainty calculations for NIST traceable Safety Related measurements, to assure that Nuclear Safety Limits are met. These aren’t your run of the mill guesswork kinds of things, but exist to show regulatory compliance under penalty of law.

    Measurement Uncertainty relates to the conformance of actual process to the readout that represents that process. In other words, it’s how close the actual process is to what you say it is. In the case of temperature, it’s how close the temperature is to what the instrument readout says it is. The first time I did one of these calculations I was shocked at how uncertain things really are.

    To calculate uncertainty, one must include everything that can possibly affect the reading: the analog sensor (including drift and calibration accuracy), digital conversion, readout precision, theoretical limits (that is, the degree to which a correlation formula actually represents the conversion process), and uncertainties in any physical correlations that can affect the fundamental relations. One must also add the uncertainty of the test equipment employed for calibration. Each of these terms (and this is not a complete list) has various components that contribute to its uncertainty. It is not uncommon for two dozen or more individual uncertainty terms to be included in a properly done uncertainty calculation.

    There are two kinds of errors at play: random error and bias error. Random errors can be reduced (but not eliminated) by more measurement. But bias errors never go away and are NOT improved with more measurements. Total uncertainty is calculated by Square Root of Sum of Squares (SRSS) of all random errors linearly added to SRSS of all bias errors.

    Once the uncertainty for a given measurement system is determined, the total uncertainty can be reduced only if there are multiple independent measurements; that is, only if there is more than one source for a given reading. One cannot improve uncertainty by making multiple measurements with the same device, although one can (in principle) improve precision with more measurements.

    It is clear from simple algebra that the total uncertainty of a measurement is bounded by the worst term in the system. For example, if the sensor error is 1%, no amount of additional precision or accuracy in subsequent processing can make the total uncertainty better than 1%. Uncertainty is always worse than the worst term, because SRSS can only produce positive numbers. Each additional term makes the uncertainty worse. Usually the total is dominated by just a few big terms. Adding a 0.01%-accurate readout to the tail end of a 1%-accurate sensor doesn’t make any real difference.
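
    To put rough numbers on that, here is the arithmetic of the rule just described, in R, with invented error terms:

    random_terms <- c(1.0, 0.05, 0.01)   # e.g. 1% sensor, 0.05% ADC, 0.01% readout (illustrative)
    bias_terms   <- c(0.2, 0.05)         # illustrative bias terms
    sqrt(sum(random_terms^2))            # ~1.0013%: the 1% sensor term dominates
    sqrt(sum(random_terms^2)) + sqrt(sum(bias_terms^2))   # total per the SRSS-plus-bias rule, ~1.21%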

    It’s kind of depressing to realize that you really don’t know what you think you know when you take a reading. In the Safety Related business, uncertainty calculations are vital to provide assurance that design limits are met and that plant operation, even during accidents, will not injure the public.

    -BillR

  33. May I point out that the term “the uncertainty of the mean” means what it says. The mean value of a number of measurements can be calculated very accurately.

    It does not tell you if the individual measurements are either accurate or precise. For example, I can take a million measurements of the same credit card with a device that reads to 5 decimal places. I can calculate the uncertainty of the mean and get a very small number. Now what is the uncertainty of the mean if I tell you that the ruler is off by 5 mm? Does it change at all?

    The physical measurement errors cannot be removed through this method, especially if, as someone has already pointed out, you are measuring different things with different instruments. Several of us have been harping on this for a long time. Measurement errors must be propagated throughout the calculations and are not removed by statistical manipulations. In other words, if you can only measure a temperature to within ±1 degree, your average can only be accurate to within ±1 degree.

    Think about it. I’ll give you three readings: 52 ±1 degrees, 51 ±1 degrees, and 50 ±1 degrees. What is the average? Is it (51+50+49)/3 = 50? Or is it (53+52+51)/3 = 52? Or is it perhaps somewhere in between? What is the uncertainty of the mean in this case?
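
    For what it’s worth, here is that example as straight worst-case interval arithmetic in R (just the bounds, not a statistical claim):

    readings <- c(52, 51, 50)
    err      <- 1
    mean(readings)          # 51, the nominal average
    mean(readings - err)    # 50, if every reading happened to read 1 degree high
    mean(readings + err)    # 52, if every reading happened to read 1 degree low
    # Worst case, the average is 51 +/- 1; the +/- 1 does not shrink unless
    # something guarantees that the three errors partly cancel.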

  34. Willis: While I agree with your primary point, I would go a bit further. First, the standard deviation of the mean should be multiplied by 2 to estimate the normal 95% confidence interval. But there is also a need to evaluate and include other effects and create an “uncertainty budget”. ISO’s ‘Guide to the Expression of Uncertainty in Measurement’ (GUM) provides a detailed methodology for doing this that requires some quite heavy-duty math as well as considerable training in metrology. For example, there are six factors that are commonly considered in determining the MU of a high-quality screw-type micrometer.

    I can say that when I was responsible for operations in an ISO 17025 laboratory and implementing the GUM requirements, I was taken aback by how large the Measurement Uncertainty that we had to report was for much of our expensive, high-precision equipment.

    • I can recall the building and the lab where I took the lab portion of analytical chemistry (also called qualitative and quantitative analysis in previous years). The floor was concrete, six inches thick.
      The scale was incredibly precise. I do not recall how many decimal places it had, but I do recall that, in addition of course to having a glass enclosure to prevent air movements from corrupting the measurement, if someone walked into the room the reading would bounce all over the place for several tens of seconds. If you shifted your feet, it moved. And the scale was on a solid stone bench, and the building had, again, a concrete floor.
      No matter how well you knew the procedure of the experiment, and how carefully you followed it, if you did not have really excellent technique your result would suck. And no matter how careful you were to do everything exactly the same, the multiple trials you needed to do to get your 95% confidence interval would all be different.
      Grades in that class were assigned on how well your result agreed with the accepted value, and it was tough. Some years no one got an A.
      You had to know the course work and theory perfectly, and all of these limitations, and report results correctly, and be extremely careful, and do the exact same thing over multiple times, to even have any chance of getting a good result.
      How are readings from satellites being published that are in complete disagreement with the established method of determining sea level changes?
      How can sea level graphs now be outside the error bars of the same graphs using the same data as what was reported in 1982?

    • “First the standard deviation of the mean should be multiplied by 2 to estimate the normal 95% confidence interval.”

      WARNING! This is only true for normally distributed data, which climate data often isn’t. You can calculate 2 SD for any distribution, but only for normally distributed data does it correspond to a 95% confidence level.

  35. Just imagine what a considerable tome “How to Lie with Statistics” would become if updated now with the benefit of “climate science”.

  36. I believe the point of Willis’ post is that the most that can ever be achieved, even under the best circumstances, is 1 additional decimal place. Increasing the number of measurements won’t improve the accuracy. Therefore, any calculated measurement indicating better than that should be discounted back to the original measurements’ level of accuracy.

    Thus, any claim of sea level rise acceleration, for example, that is based upon a change of mere hundredths of a millimeter when the individual measurements used for the calculations were made to tenths of a millimeter should be noted as unfounded.

    Of course, as you point out nw, under less than ideal circumstances, resultant accuracy will be even less.

    SR

    • I am trying to figure out the significance of all this. Is somebody multiplying a mean sea level rise of 0.003 mm over a thousand years and getting 3 mm of sea level rise?

      Usually you don’t worry about significant digits in mean calculations. Additionally, by the rules of significant digits, the figure 0.003 represents one significant digit, not three; you are supposed to ignore leading zeros. If it were recorded as 0.0030 it might be considered two significant digits, as that suggests you derived the last zero from the observation.

    • The bottom line is that accuracy is related to systematic error and precision is related to random error. Random error can be reduced by multiple readings of a FIXED value, but systematic error requires comparison to a standard or an instrument of known higher accuracy AND precision.

  37. Willis’s point is spot on.

    Other Sea level impacts…
    Temperature of the water.
    Direction of the wind.
    Barometric pressure.
    Wave peaks and troughs.
    Tidal movement.
    Position of the moon and sun.

    And exactly how does the satellite know the local conditions while it is scanning the water?

    Nor should one overlook NOAA/NASA’s modeling of sea surface heights according to isostasy and Earth’s geoid structure.

    And somehow, NASA/NOAA claims to measure sea level rise of 3.2 mm annually?
    Not a chance.

    • Believe it or not, Jason 3 measures or has inputs for all of those, except maybe barometric pressure; I would need to look that up. I know most of the others you mention are actually recorded in each data entry.

      • Again, Jason 3’s accuracy is 2.5 cm RMS; it claims nothing more, and it is calibrated at numerous sites to ensure it stays within that.

        Climate scientists do analyses that infer greater accuracy, but that has nothing to do with the raw data.

        At the moment you are blaming an instrument for an accuracy that it doesn’t claim … if you have a problem, identify the right party.

    • I was going to make pretty much the same comment. Even if they tried to measure the deck of an aircraft carrier it would probably vary a couple of feet over a years time at the same location and much more at different locations. To claim that measurements can be made on the ocean surface to accuracy or precision within millimeters appears a little farfetched to say the least.

      • I was going to make pretty much the same comment. Even if they tried to measure the deck of an aircraft carrier it would probably vary a couple of feet over a years time at the same location and much more at different locations.

        Well, a 1000+ foot aircraft carrier would expand and contract with temperature, the upper and lowest sections would bend several feet as the seas (large waves) move from fore to aft, and I’ve seen the long, open inside vehicle decks of cargo vessels (800 feet long) twist by feet as the bow twists, then the midships, then the stern. But the length changes because of temperature changes (−15°C to +35°C, for example) are not several feet.

      • A simple change in atmospheric pressure alone could result in a water elevation change of a few inches; at 25.4 mm per inch, 3 inches is about 76 mm of change, which would blow the millimeter claims of precision/accuracy out of the water.

        I think I will go measure a piece of 20 or 40 grit sandpaper to get an accurate measurement of its thickness to the nearest 0.001 inch. I’ll report back with my measurements. Meanwhile, someone else can measure the mean sea level in 20-30 ft seas to the nearest millimeter, and we’ll compare results. We can use simple or complex math equations and compare the results to a baseline measurement of a fart in a high wind.

        • Jason 3 does do an adjustment for Barometric pressure

          As atmospheric pressure increases and decreases, the sea surface tends to respond hydrostatically, falling or rising respectively. Generally, a 1-mbar increase in atmospheric pressure depresses the sea surface by about 1 cm. This effect is referred to as the inverse barometer (IB) effect.
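
          That 1 cm per mbar figure is just hydrostatic balance; here is a back-of-envelope check in R (the standard inverse-barometer relation, not necessarily the exact formula in the Jason-3 data sheet):

          rho <- 1025      # typical seawater density, kg/m^3
          g   <- 9.81      # gravity, m/s^2
          dP  <- 100       # 1 mbar = 100 Pa
          dP / (rho * g)   # ~0.00995 m, i.e. roughly 1 cm of depression per mbar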

          The datasheet gives you the formula it uses to correct it .. again it is calibrated and stays within the 2.5cm RMS accuracy claimed.

          You guys seem to have a lot of complaints about an instrument without reading a single word about what it does and how it does it.

  38. Your example is simple, concrete & useful to show one limitation of averaging. However, what about creating an average of the world’s temperature? So, we can average the temperature at the North Pole, New York and Jamaica.

    That would be more like averaging my credit card, the size of my wallet and a passport. The number is meaningless. And yet, climate “scientists” do this all the time.

  39. Willis your rule-of-thumb is correct. It is founded in the mathematics of “Reliability and Statistics.”
    In simple terms the mathematics would be:

    Total Error = Model (or System) Error + Measurement (or Data) Error

    Where the above errors are probability (or random) distributions. If the system were perfect, then the system error would be zero but this rarely happens in the real world.

    To use your thermometer example, the thermometer (system) would have an error of ±0.1°C. Therefore, no matter how many measurements we take, the error would never be better than ±0.1°C, and could be much worse, depending on the size of the measurement errors.

    • The problems with our temperature measurements are not limited to the instrument itself. We know that the readings will be impacted greatly by the choice and condition of the site at which the thermometers are read. And we know from past work that there are huge issues with site selection and their condition.

      Averaging won’t significantly reduce those errors when urban and other anthropogenic effects almost always bias the readings high.

  40. Statistics is a wild beast. Take the statement Willis quotes “its hard to see why any errors don’t ether cancel or resolve over a very large number of measurements to a constant bias.” Let’s consider a related situation, the Brownian motion – a small particle in water moved randomly by the impact of water molecules. How does the large number of impacts average? What can we say about the particle position statistically?
    It turns out that the impacts do not simply average away: if you simulate the particle, its mean position stays put, but its mean squared displacement from the original position grows linearly in time, so the typical distance from the start keeps growing like the square root of time. I vaguely remember that this got actually measured (by Perrin) as a way to determine Boltzmann’s constant.
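
    A one-dimensional random walk shows this in a few lines of R (a sketch with 1,000 particles each taking 10,000 unit steps):

    set.seed(3)
    nparticles <- 1000
    nsteps     <- 10000
    steps <- matrix(sample(c(-1, 1), nparticles * nsteps, replace = TRUE), nrow = nparticles)
    pos   <- rowSums(steps)      # final position of each particle after 10,000 unit steps
    mean(pos)                    # mean displacement: close to zero
    sqrt(mean(pos^2))            # RMS displacement: close to sqrt(10000) = 100 steps
    # The impacts "average out" in the mean, but the spread keeps growing with time.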

  41. My analogy is measuring the height of adults. Australian men are 1.76 m on average, and to keep things simple, the median is 1.75 m. Women are 1.62 m and, again to keep it simple, all under 1.75 m. Round 1,000 measurements to the nearest 0.5 m and you get that 75% of the adult population are 1.5 m tall while 25% are 2 m tall. That means that, according to the rounded data, Australian adults are 1.625 m tall on average, barely 5 mm taller than the true female average. Or if we compared the two sexes using measurements rounded to the nearest half metre, men would be 250 mm taller than women on average.

    Making 1 million measurements will not make it all better.
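
    That distortion is easy to reproduce; here is a sketch in R using rough normal stand-ins for the two height distributions (the particular standard deviations are invented for illustration):

    set.seed(11)
    n      <- 1e6
    men    <- rnorm(n / 2, mean = 1.76, sd = 0.07)   # rough stand-in distributions
    women  <- rnorm(n / 2, mean = 1.62, sd = 0.06)
    adults <- c(men, women)
    mean(adults)                         # the "true" mean, about 1.69 m
    mean(round(adults / 0.5) * 0.5)      # mean of heights rounded to the nearest 0.5 m, about 1.64 m
    # With a 0.5 m rounding step the rounded mean settles on a biased value; more
    # measurements only pin that biased value down more precisely.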

  42. “A lot of stuff really will average out over time”.

    Errors which go in one direction don’t, that’s why bias and extremism tends to get worse over time.

  43. Isn’t there a large systematic drift error in the height of the satellites? I do recall a proposed GRASP system to try to reduce this cumulative error to 0.1 mm a year.

  44. Here is another question: why don’t the providers of temperature data sets require that users (i.e., scientists, etc.) not only acknowledge the measurement error range contained in the data set but also quantify how these errors affect their conclusions? In other words, evaluate their results at both the low end and the high end of the error range.

    This would go a long way toward making everyone aware of the inaccuracies being ignored.

  45. This process of averaging multiple measurements to get a more accurate measurement is done all the time in electronics. Take the case of an 8-bit sampling system (ADC) which can resolve 256 different voltage levels (2^8). Imagine it has a full scale input range of 0 to 25.6 volts (unlikely numbers but it makes the maths easy). We can then resolve voltages 25.6/256 apart (0.1V apart). You might say that is the limit of our accuracy but that isn’t so.

    Take an input voltage of 10.973V, just as an example. The nearest ADC level is 11.0V (level 110) and in the absence of noise that is what the ADC would read. It would be wrong by 0.027V. If there is some noise (dither) on the signal, however, the voltage will bob up and down around 10.973V randomly. However, it will apparently spend more time at the 11.0V level (level 110) than the 10.9V level (level 109). In fact it will spend time at both of these levels pro rata to the distance the real input voltage is from these levels. In this case, the signal is 0.027V from level 110 and 0.073V away from level 109. If we take a thousand measurements with the ADC we will end up with approximately 270 readings at ADC level 109 and 730 at level 110. We can use that information to narrow down the true voltage at the input to near the real value of 10.973V even though at first glance we can’t get a better accuracy than the native accuracy of the ADC, one part in 256 (or 0.1V in this particular example). In this way, 8-bit ADCs can be made equivalent to 9-bit, 10-bit, 11-bit or more, at the expense of taking multiple readings.

    We can take this principle to its extreme and come up with a 1-bit ADC. This has only two output states and looks to be fairly useless at first glance. Say the input voltage range is the same as before, 25.6V. Then if the input voltage is <12.8V the output will read 0 (equivalent to 0V to 12.8V) and if it's higher it will read 1 (12.8V to 25.6V). A resolution of 12.8V, apparently rubbish. However, if we apply a large amplitude noise dither on the input voltage, the 1-bit ADC will read a signal that is at exactly 12.8V 50% of the time as a 0 and 50% of the time as a 1. Given enough readings, we will find we get equal numbers of 0s and 1s. Thus we know, despite the apparently awful resolution of a 1-bit ADC, that the input signal is at 12.80000…V despite the only two levels we can actually read being 0V or 25.6V. If the input voltage strays away from 12.8V, say to 13.273671V, then if we take enough measurements we will find we get an excess of 1s over 0s, and by the proportions of these two results we can determine the input voltage to an arbitrary accuracy, limited only by the number of samples. One can turn a 1-bit ADC with dither into a 24-bit equivalent ADC just by oversampling in this way. In fact, there would be other factors that would limit the accuracy of the measurement no matter how many samples we take, but this does not impact on the primary question – that taking more samples can indeed improve accuracy in a sampled data system – it is done all the time.
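
    Here is that 8-bit example as a hedged R sketch (Gaussian dither with a standard deviation of about one LSB is just one convenient choice; range clipping is ignored):

    set.seed(5)
    vin    <- 10.973                              # true input voltage
    lsb    <- 0.1                                 # step size of the 8-bit ADC over 0-25.6 V
    n      <- 10000
    dither <- rnorm(n, mean = 0, sd = lsb)        # Gaussian dither of roughly one LSB
    codes  <- round((vin + dither) / lsb) * lsb   # quantized readings (range clipping ignored)
    mean(codes)                                   # within a few mV of 10.973 V, far finer than the 0.1 V step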

    • Andrew,
      You seem to be forgetting that your overall accuracy is still limited by the A-to-D converter. Yes, by dithering you can jitter the value across ADC steps and thus estimate a value between ADC steps. This is done in many systems. However, the final accuracy, no matter how finely you divide an ADC step, is ultimately only the ADC's basic accuracy. In many cases, such as signal analysis, absolute ADC accuracy is not as important as monotonicity and even step size. Dithering helps minimize ADC step intermodulation. Again, the accuracy is that of the ADC, not how finely you can subdivide each ADC step.

    • One can turn a 1-bit ADC with dither into a 24-bit equivalent ADC just by oversampling in this way.

      With just dithering you only get 1/2 bit per 2x oversampling, so you would need something like 2^48x oversampling to gain 24 bits, because at best the dithering noise is uncorrelated. I don’t think there’s any system that can oversample fast enough to get 2^48x oversampling…

      Delta-sigma converters use negative correlation, feeding back an inverted signal, which allows the Hurst exponent of the quantization noise to be between 0 and 0.5, thus giving M + 1/2 bits per doubling, where M is the number of feedback stages (the “order” of the delta-sigma converter).

      A typical commercial delta-sigma ADC is third order giving 4.5 bits per doubling, so 32x-64x oversampling is sufficient to get 24 bits

      Take the case of an 8-bit sampling system (ADC)

      Funny enough, Willis’s rule is correct here: you can get at most about 3 additional bits (which is ~1 decimal place) out of an 8-bit system, because there are 8 comparators and thus differences between the comparators (manifesting as non-linearities), which means oversampling has diminishing returns.

      BTW your typical 8 bit A/D in a microcontroller has terrible linearity so is already only effectively 7 bits.

      This is why 1-bit delta-sigma converters are so nice. There’s only one comparator and making one comparator linear is relatively straightforward, and can be done with fairly standard CMOS processes. This is why all your computers have a delta-sigma converter for audio output.

      references:
      https://www.renesas.com/us/en/www/doc/application-note/an9504.pdf
      http://www.seas.ucla.edu/brweb/papers/Journals/BRSpring16DeltaSigma.pdf

  46. Willis,

    Working in engineering I experience this all of the time. My assertion is the following:
    Resolution:
    The maximum ability of the measurement system to quantify a value. A 10-bit A-to-D has at best 1024 individual values, 0-1023. If, for example, a 5 V reference is used, the system will divide the 5 volts into 1024 steps. This implies each step is approximately 4.88 mV, so the maximum resolution is 4.88 mV. No matter how many measurements you make, you will never be able to tell the difference between two values separated by less than 4.88 mV.

    Accuracy:
    The delta from the real to measured value in the system. This is a calibration issue and independent of precision. In the A to D system above that is determined by the reference voltage accuracy and the internal offsets within the signal processing system.

    Precision:
    The variation of the measurements due to random error within the system. In the A to D example above the system is capable of +/- 1 Least Significant Bit.

    My assertion is this:

    1. For Resolution taking multiple samples will not improve resolution. If a value is between two buckets no number of samples will resolve them.

    2. Accuracy is an offset and sample size will not resolve this problem, only calibration will.

    3. Precision can be improved by more samples, assuming a predictable and stable variation. If the variation is systematic, i.e., the offset differs under certain conditions, is not predictable, or is nonlinear, then an average will not work.

    Thus, I would say you do not gain an extra digit by averaging, without knowing the other factors affecting the system.

    • Thanks, R. That’s well explained and clearly written, and I can only agree. You’ll note that I said my rule of thumb involved what in general is the maximum real-world gain you can get by averaging. Yes, sometimes you can get more. And often you can get no gain at all, as you point out.

      Much appreciated,

      w.

      • Willis,

        When I posted, much of the intervening explanation you had given was for some reason not yet displayed. I fight software people all of the time who think that they can improve resolution by oversampling. Don’t get me started about sensor variation… Have a good weekend and holiday.

  47. Now … raise your hand if you think that we’ve just accurately measured the length of the credit card to the nearest three thousandths of one millimeter.

    Of course not. And the answer would not be improved if we had a million measurements.

    I’ll raise my hand.

    But then I had a science education.

    And a bit of maths.

    Sometimes Willis, you are dangerously wrong.

    • I will raise my hand as well.
      Here is a nice video explaining standard error of sample means and why the more samples you take the smaller the error of the sample mean will be.

      • Max, I discussed the calculation of the standard error of the mean in the head post. That video adds nothing. And if you raised your hand, you missed the whole point of the post. Likely my fault, my writing may not be clear, but no, you cannot measure a credit card to the nearest three thousandths of a millimeter using an ordinary ruler no matter how many measurements you take.

        Using the normal procedure for calculating the s.e.m. that both I and the Khan Academy laid out, you end up with an answer which is the standard error of the mean of the MEASUREMENTS, but which is NOT related to the actual length of the credit card—a vital difference.
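
        To make that concrete, here is a minimal Python sketch of the distinction (the true length and the 0.15 mm ruler bias are invented numbers; each reading is rounded to the nearest millimeter):

          import random, statistics

          TRUE_LENGTH = 85.72     # mm, unknown to the measurers (invented for illustration)
          RULER_BIAS  = 0.15      # mm, a systematic error shared by every reading (invented)
          N = 10_000

          readings = [round(TRUE_LENGTH + RULER_BIAS + random.gauss(0, 0.3))   # read to nearest mm
                      for _ in range(N)]

          mean = statistics.mean(readings)
          sem  = statistics.stdev(readings) / N ** 0.5

          print(f"mean of readings        : {mean:.4f} mm")
          print(f"s.e.m. of the readings  : {sem:.4f} mm")      # a few thousandths of a mm
          print(f"actual error of the mean: {mean - TRUE_LENGTH:+.4f} mm")
          # The last number stays around the bias (plus a rounding residue) no matter how big N gets.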

        w.

    • Leo Smith
      Are you suggesting that you are one of the few people here with a science education and exposure to mathematics?

    • Leo Smith December 20, 2018 at 7:43 pm

      Now … raise your hand if you think that we’ve just accurately measured the length of the credit card to the nearest three thousandths of one millimeter.

      Of course not. And the answer would not be improved if we had a million measurements.

      I’ll raise my hand.

      But then I had a science education.

      And a bit of maths.

      Sometimes Willis, you are dangerously wrong.

      Seriously? You are claiming that you can measure a credit card length to the nearest three thousandths of a millimeter using an ordinary ruler just by repeating the measurements?

      Really? We can throw away our micrometers, rulers are adequate?

      Someone is dangerously wrong here, but it’s not me …

      w.

      • I guess I don’t need my expensive 8.5 digit DMM anymore either. Instead I can just use a cheap handheld one & just take loads of measurements 🙂

  48. The standard deviation (SD) is independent of the standard error (SE). So reducing SE has no effect on SD. The SE only tells you how well your sample represents the population. For example, you want to determine the average age of a city’s population. From your sample, you calculated mean = 30 and SD = 1. How do you know SD is really 1? You did not ask all the people in the city. It could be 2. SE tells you that the actual SD is close to 1 and far from 2. Therefore, your sample is good.

    But your sample size does not affect SD, because that is a property of the population. Another city could have SD = 0.1 or 0 (all people have the same age). SE has nothing to do with whether or not you can measure the mean age to the last hour, minute or second.
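
    A quick Python sketch of that distinction, using the invented city example above (mean age 30, population SD of 1): as the sample grows, the sample SD settles near the population SD while the standard error of the mean keeps shrinking.

      import random, statistics

      random.seed(1)

      for n in (10, 100, 1_000, 10_000):
          sample = [random.gauss(30, 1.0) for _ in range(n)]   # ages drawn from the assumed population
          sd = statistics.stdev(sample)
          se = sd / n ** 0.5
          print(f"n = {n:6d}   sample SD ~ {sd:4.2f}   standard error ~ {se:6.4f}")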

    • Dr. Strangelove
      Another way to look at that is that there is a fundamental relationship between the range of a normal distribution and the standard deviation. Increasing the number of samples will have a small impact on the range because most of the samples will be in the high probability central region of the distribution.

      • Another way to look at it is the normal distribution is a mathematical model. SE tells you how well your data fit the model. If the population is really normally distributed, the greater the sample size (N) the better the fit. But it doesn’t tell you whether or not your data is accurate. If you just made up the data, you can have a perfect fit and it would still be wrong.

      • The normal distribution is a continuous probability function:
        P(x) = e^(a x^2 + b x + c), with a < 0
        It’s the base of the natural log raised to the power of a quadratic in x.

        When the number of trials (or samples) n is large, the normal distribution approximates the binomial distribution, which is a discrete probability function. If the probability of success on each trial is p, then the probability of getting x successes in n trials follows the binomial distribution:
        P(x) = n! / (x! (n – x)!) p^x (1 – p)^(n – x)
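
        A quick numerical check of that approximation in Python (n = 100 and p = 0.3 are arbitrary choices): the binomial probabilities track a normal curve with mean n*p and variance n*p*(1 - p).

          from math import comb, exp, pi, sqrt

          n, p = 100, 0.3
          mu, var = n * p, n * p * (1 - p)

          def binom_pmf(x):
              return comb(n, x) * p ** x * (1 - p) ** (n - x)

          def normal_pdf(x):
              return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

          for x in (20, 25, 30, 35, 40):
              print(f"x = {x:2d}   binomial = {binom_pmf(x):.4f}   normal = {normal_pdf(x):.4f}")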

        • I agree with Willis here, but I keep hearing, on both sides, the notion that a random distribution is an accessible absolute that unveils uncertainty!

          Everyone who actually employs this term should do the experiment in the real world, by tossing a coin for example and noting the result of even that simple 50/50 probability.

          This is the way I was first introduced to “Experiment Probability” and it has stuck with me, ever since.

          Lately it seems, it has become plain old probability and the “experimental” part has been dropped.

          In reality, nature doesn’t have to obey the theory of probability as we commonly experience it. Only as the number of experiments – coin tosses – increase does reality approach the theory.

          Convergence is not guaranteed because there is nothing in reality to stop the experiment from diverging from the theory, though it does become less and less likely as the number of “experiments” increases.

          In lieu of doing the recommended experiment in real life, I just found an excellent video* that does a good job of modelling the common real world experience.

          I would add that it is a cheat to use a random generator, because the lesson to be learned here is that nature only ever approaches “randomness” via large numbers. Therefore notions of “distribution” are misleading in the “real” world!

          Once more to be clear, strings of heads or tails or complex patterns are not ruled out in actuality – the experiment – despite the theory!

          * https://www.khanacademy.org/math/ap-statistics/probability-ap/randomness-probability-simulation/v/experimental-versus-theoretical-probability-simulation

          • Typo above, I meant to say experimental probability but empirical probability or relative frequency, would smell as sweet!

          • I just found a much better example here:

            https://www.youtube.com/watch?v=dXEBVv8PgZM

            I highly recommend this for anybody interested in probability or relative frequency / distribution, particularly from 36.00 onwards.

            Coin tossing follows the Law of Long Leads or the Arcsine Law.

            *In a coin toss, the symmetry does not show itself as heads coming up half the time; it shows itself in half the “sample paths” being above (and half below) the line of theoretical probability most of the time!

            cheers,

            Scott

            *Professor Raymond Flood, Probability and Its Limits
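
            Here is a rough Python simulation of that “law of long leads” (the toss and path counts are arbitrary): for each simulated run, count the fraction of tosses during which heads is ahead. Most runs spend nearly all their time on one side, as the arcsine law says, rather than hovering near 50/50.

              import random

              random.seed(2)
              TOSSES, PATHS = 1_000, 2_000
              fractions = []

              for _ in range(PATHS):
                  lead, above = 0, 0
                  for _ in range(TOSSES):
                      lead += 1 if random.random() < 0.5 else -1   # heads minus tails so far
                      if lead > 0:
                          above += 1
                  fractions.append(above / TOSSES)

              near_half  = sum(0.4 <= f <= 0.6 for f in fractions) / PATHS
              near_edges = sum(f <= 0.1 or f >= 0.9 for f in fractions) / PATHS
              print(f"runs ahead 40-60% of the time : {near_half:.2f}")    # roughly 0.13
              print(f"runs ahead <10% or >90%       : {near_edges:.2f}")   # roughly 0.4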

  49. Willis, Thank You for this post. This is the primary reason that the pseudo science of AGW is a farce.
    Also have a look at windy.tv; almost all the weather is over, and because of, the oceans.
    We are on planet water not planet earth.

  50. Willis,

    As an experimental scientist, I can say that your arguments here are mostly wrong. The measurement precision of an instrument has effectively no impact on how precisely or accurately a quantity can be measured. To see why let’s break down the various elements of your post.

    With regard to precision (random error), we can take your credit card example. Let’s suppose that people measuring the credit card all get a number between 85 and 86 mm and their results cover the full range. As part of our measurement protocol, we tell them to choose not any number between the two but one or the other. They should make that choice based on which is closer to the actual length they observe. Our data will consist, then, of a seemingly random string of 85s and 86s. We are taking the average of this string of data, some of which are 85 and the rest of which are 86.

    If we have ten such measurements, the calculated averages will be numbers like 85.5 (5, 5 split), 85.6 (4, 6 split), 85.7 (3, 7 split), etc. For ten thousand measurements, however, the numbers will be 85.5000 (5000, 5000 split), 85.5001 (4999, 5001 split), 85.5002 (4998, 5002 split), etc. With an arbitrarily large number of measurements, any average between 85 and 86 is possible. But how good are these averages?

    The uncertainty you mentioned (sigma/sqrt(N)), also called the “standard deviation of the mean” or the “standard error,” tells us the precision or random error of the mean thus obtained. Do an experiment with N measurements. Calculate the mean. If later experiments are done with N measurements under the same conditions, the standard deviation of the mean (SDM) tells us how much those other means will vary from the first. In general, 95% of means obtained by all N-measurement experiments will be within 2 SDMs of each other. Thus, with an arbitrarily large number of measurements, we can get means from many experiments to be as consistent as we want. There is, in other words, no fundamental limit on how precisely we can determine the result (mean) of an experiment. The measurement precision of the apparatus used is irrelevant.
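
    A minimal Python sketch of that procedure (the 0.68/0.32 split is an invented true value; it assumes readers pick the closer millimeter mark without bias): each experiment is N readings of 85 or 86, and the means of repeated experiments agree to within a couple of standard errors.

      import random, statistics

      TRUE = 85.68                   # invented; an unbiased reader says 86 with probability 0.68
      N, EXPERIMENTS = 10_000, 20

      def one_experiment():
          readings = [86 if random.random() < TRUE - 85 else 85 for _ in range(N)]
          mean = statistics.mean(readings)
          sdm = statistics.stdev(readings) / N ** 0.5     # standard deviation of the mean
          return mean, sdm

      results = [one_experiment() for _ in range(EXPERIMENTS)]
      means = [m for m, _ in results]
      print(f"experiment means range from {min(means):.3f} to {max(means):.3f}")
      print(f"typical SDM of one experiment: {results[0][1]:.4f}")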

    Now your objection to this would likely be “But that doesn’t guarantee that the mean is the right answer. We might get 85.567 +- 0.003 mm when the actual length is 85.329 mm. The mean is consistent but consistently wrong.”

    This objection would be correct, and is an example of systematic error (inaccuracy). Every experiment is subject to many systematic errors of varying sizes, and these inaccuracies cause the results to be wrong. Systematic errors are governed by a fundamental principle similar to your incorrect rule of thumb (“you can’t squeeze more than one additional decimal out of an average of real-world observations”):

    No systematic error smaller than the uncertainty (random error) can be identified.

    This is the point of making more and more measurements. By reducing the uncertainty of the result, we are able to potentially uncover smaller and smaller systematic errors. A small uncertainty doesn’t guarantee that the experiment is that accurate, but it does allow for the experiment to be MADE that accurate by eliminating the systematic errors. Just as a falling tide uncovers more and more objects on the beach, a falling uncertainty uncovers more and more systematic error. And like objects on a beach, this allows the systematic errors to be identified and removed.

    But how are systematic errors actually identified? By doing different KINDS of experiments. If the same quantity is measured by many different techniques, those techniques are likely to have different systematic errors. For temperature, this would mean using thermometers, satellites, balloons, etc. For sea level this might be tide gauges, laser altimetry, etc. If we make all those techniques precise enough (through N measurements), they will eventually disagree. We use that disagreement to identify the systematic errors causing the discrepancies. Once the systematic errors are identified and removed, the various methods will agree to high precision. AT THAT POINT, WE KNOW THE ACTUAL RESULT TO THAT LEVEL OF UNCERTAINTY. Again, our ability to do this does not depend in any fundamental way on the measuring precision of the apparatus.

    In practice, of course, this process is slow and laborious, requiring a great deal of time and resources. In practice, our ability to reduce the uncertainty may depend on available technology and manpower. But the precision of the measuring device does not place any fundamental limit on the process.

    • Brian December 20, 2018 at 9:09 pm

      Willis,

      As an experimental scientist, I can say that your arguments here are mostly wrong.

      Brian, you say that:

      By reducing the uncertainty of the result, we are able to potentially uncover smaller and smaller systematic errors. A small uncertainty doesn’t guarantee that the experiment is that accurate, but it does allow for the experiment to be MADE that accurate by eliminating the systematic errors.

      And how do you do that? Again according to you:

      But how are systematic errors actually identified? By doing different KINDS of experiments.

      So in fact, you agree with me that UNDER THE CONDITIONS I SPECIFIED, more measurements do NOT help. In order to get a better answer, more measurements are meaningless—according to you that can only be done by using different KINDS of experiment.

      So what you’re telling us is that if you subsequently measure the credit card with a micrometer (a different KIND of experiment) we can get more decimals. No duh …

      But you have agreed with me that more measurements BY THEMSELVES won’t help us.

      Which is exactly what I said.

      w.

      • Willis, what say you to this specific contention:

        “The measurement precision of an instrument has effectively no impact on how precisely or accurately a quantity can be measured.”

        This seems to me to be saying that having a ruler with only inches on it is no better or worse than having one with no graduations, or one marked in millimeters.
        I do not think any machinist or engineer would agree with this statement.

        • Menicholas December 21, 2018 at 6:10 am

          Willis, what say you to this specific contention:

          “The measurement precision of an instrument has effectively no impact on how precisely or accurately a quantity can be measured.”

          This seems to me to be saying that having a ruler with only inches on it is no better or worse than having one with no graduations, or one marked in millimeters.
          I do not think any machinist or engineer would agree with this statement.

          I’d say that you are confusing resolution with precision. Resolution is the smallest difference that can be measured. I have a digital scale that I get on every morning … sometimes to my regret …

          It measures to the nearest 0.2 pounds. That is its resolution. Or in your ruler example, the resolution would be either inches or mm.

          Precision, on the other hand, is how close successive measurements are to each other. That is a wholly separate and distinct question from resolution.

          w.

      • Willis,

        No, measuring something by multiple methods doesn’t imply that the other methods have better resolution or precision (a micrometer vs. a ruler). There’s an advantage to be gained even if the other methods have identical resolution and precision to the first. The key, again, is choosing methods that likely have different systematic errors. That’s the only way that systematic errors can be identified and eliminated.

        This key point is most definitely NOT what you said or implied. Yes, you focused on using only a ruler and said that taking more measurements doesn’t help you out. But even that claim misses the point of how measurements are done in science. Any single measuring device that purports to be accurate is calibrated against other reference devices, or against reference quantities. Every scientific measuring device must be traceable to fundamental measurement standards. A calibrated device and the procedure for using it already has its systematic errors defined and quantified through the use of multiple types of experiments. So even if you are using only the ruler, it is already backed by the multiple devices and experiments I mentioned.

        The bottom line is that a ruler calibrated through use of a specific measurement procedure can have systematic errors much less than the resolution of the device. And such a procedure can involve a large number of measurements. By making this large number of measurements according to the designated calibration procedure, the measurement uncertainty is indeed reduced to a small value, one that is both precise and accurate. And this is all with the use of a single device.

        • Brian, my point remains—no matter how many measurements you take with a ruler, you CANNOT measure to the nearest .001 mm. Period.

          Yes, you can use all kinds of other techniques to improve your accuracy. But to do that, you have to use other techniques.

          Finally, you say:

          “The bottom line is that a ruler calibrated through use of a specific measurement procedure can have systematic errors much less than the resolution of the device.”

          It doesn’t matter if the lines are scribed on the ruler using laser interferometry. You STILL cannot use repeated measurements with that ruler to measure a length to the nearest .001 mm. And while you can get systematic errors LESS than the resolution of the ruler, you cannot get “systematic errors MUCH less than the resolution” of the ruler as you claim … in general you can get one more decimal, but not two more.

          w.

          • Willis,

            You say “Yes, you can use all kinds of other techniques to improve your accuracy. But to do that, you have to use other techniques.”

            The point you are ignoring is that ANY measuring device needs to be calibrated and it gets calibrated by using other techniques and devices. That doesn’t mean you have to use the other techniques as part of YOUR measurement. Look, you provided a ruler as your example. Presumably you weren’t assuming that the ruler is a piece of garbage. Presumably you think it works correctly according to its design. In that case, it’s been calibrated using other techniques. You can’t use the “other techniques” excuse to wiggle out of my point.

            And you are just plain wrong about this. It is always possible to obtain much smaller systematic errors with a device AND A PRESCRIBED MEASUREMENT PROCEDURE than the device resolution. And once the systematic errors are small, the only thing preventing a highly accurate measurement is the random error, which is made smaller by repeated measurements. Under those conditions (where the systematic error has already been made small), high accuracy is obtained by repeated measurements.

          • Brian, is there some part of the following that was unclear?

            It doesn’t matter if the lines are scribed on the ruler using laser interferometry. You STILL cannot use repeated measurements with that ruler to measure a length to the nearest .001 mm.

            Yes, I understand both the math and the techniques involved in utilizing the CLT to improve measurements. I am simply pointing out that in many real-world situations, the math gives us wrong answers.

            You say:

            And you are just plain wrong about this. It is always possible to obtain much smaller systematic errors with a device AND A PRESCRIBED MEASUREMENT PROCEDURE than the device resolution

            Not true. It is almost always possible to obtain SMALLER errors in the manner you describe. But it is not always possible to obtain MUCH SMALLER errors in the manner you describe.

            w.

          • Brian –> You are conflating a real world measurement with a lab experiment. You seem to be missing the whole point. Willis’ example doesn’t exactly apply to sea level measurements by satellite nor to temperature measurements.

            In both cases you get only one measurement of one thing at one time. You don’t have a static substance that you can continue to measure or to diddle with in order to get a more accurate or precise measurement.

            All the ADC and other discussions are fine, but they simply don’t apply.

          • Willis,

            Getting systematic and random errors much less than the base resolution is done all the time and can be taken to extreme levels.

            One good example is LIGO, the gravitational wave observatory. LIGO works on interferometry of 1-micron light, which means the base resolution (equivalent to the marking spacing on a ruler) is only 1 micron. Yet they measure deviations of space on the order of 10^-18 m, or 1 trillionth the base resolution. This is accomplished by reducing systematic errors (such as vibrations from earthquakes and traffic) to a very low level and then using statistics to reduce the random error. In this case, the repeated measurements required to gauge random error are provided by the high number of photons in the cavity. Much as I described in my first comment, this allows shifts of less than 1 millionth of a wavelength to be measured. So your statement that errors can be reduced to only about 1/10 the resolution of the device is wrong.

    • I too started in professional life doing a PhD in experimental physics. I then moved onto safety critical software then ion propulsion for satellites. I became very familiar with metrology and had to improve my scientific method.

      So, taking multiple measurements will not change anything if you cannot determine that your sample distributions are identical. After all, a necessary condition for the Central Limit Theorem that underpins the error of the mean is that samples are i.i.d. – independent and identically distributed.

      If the uncertainty of the individual elements of your sample is sufficiently large, the uncertainty in the distribution goes up. In other words, noise is greater than signal. For temperature measurements this is the case when looking at changes of 0.1 degrees. For satellite measurements it may also be the case if a certain range of heights is required.

      More measurements do not, by themselves, produce less uncertainty. The reduction comes from the CLT, and only when its conditions are met.

  51. Brian
    For Willis’ example, very few if any people will record 85. Almost all will record 86. Thus, you will get a long string of 86s. If nobody records an 85, you will have actually reduced the precision by reporting 86, when it is clearly less than 86!

    • Clyde,

      That’s why I said this:

      “Let’s suppose that people measuring the credit card all get a number between 85 and 86 mm and their results cover the full range,”

      in keeping with Willis’ own description. That’s how he got SD = 0.3. My proposed measurement procedure for getting highly precise results actually depends on individual measurements being as imprecise as the device resolution to avoid systematic error (or bias) that can creep into people’s judgement.

  52. The above is in the eyes-glazing-over category.

    Two things come to mind. Julius Caesar was asked about the loyalty of his guards. He replied, yes, that is of interest, but who is going to guard the guards?

    Second, Joseph Stalin is said to have said, “It does not matter how many vote, it does matter who counts the votes.”

    All of the above is wide open to cherry picking and adjustments to the data.

    MJE

    • Which is why Alexander Hamilton introduced the Electoral College, never allowed to be all in the same room at the same time. Counting-statistics may be popular, yet when this brilliant idea actually works, the winner is called “populist”.

  53. “Regardless of the number of measurements, you can’t squeeze more than one additional decimal out of an average of real-world observations.

    Following that rule of thumb, if you are measuring say temperatures to the nearest degree, no matter how many measurements you have, your average will be valid to the nearest tenth of a degree ”

    I think you won’t even get an additional tenth. I think you will, at best, get an additional 1/2 of a degree. Suppose you ask a bunch of people to take a reading in whole degrees from a thermometer marked off in whole degrees. If the group overwhelmingly says the thermometer indicates 35 degrees, you must record 35 degrees.

    If approximately 50% say it is 35, while the others say it is 36, you can be very confident rounding the temperature to 35.5 degrees. This is because you asked people simply to decide which whole degree the thermometer indicator was closest to. If nearly equal numbers decided each way, the indicator must be very close to midway between.

    If 25% say 35 degrees, while 75% say 36 degrees, you may record the temperature as 35.5 degrees, but not as something like 35.7 degrees. This is because you asked for a reading to the nearest degree. All you know for sure is that most of the group determined the indicator was closer to 36 than 35. Because some said it was closer to 35, the indicator must be still very close to midway between 35 and 36. If the indicator actually was very close to what would be 35.7, the group would have overwhelmingly noted it was closer to 36 than 35.

    By asking for a reading to a whole degree, you were asking people to judge only whether the reading was more or less than .5 of a degree. Thus, you can sometimes squeeze out an extra 1/2 degree, but no finer.

    Automated equipment works in the same manner because that is what it has been set up to do.

    SR

    • I forgot to put in my conclusion: multiple readings only increase the precision to 1/2 of the smallest division the measuring device is marked to or capable of determining.

      SR

  54. I have used a system which gets higher accuracy.

    You need to add in a truly random noise source, whose standard deviation is some convenient small multiple of the measurement unit, and whose distribution is well known. For example, the thermal noise in a resistor.

    Now the quantity of interest is represented by a population of different measurements. And the statistics of that population give a way to calculate the underlying value more precisely.

    • Bob, I’d agree with a slight difference. I’d say:

      Now the quantity of interest is represented by a population of different measurements. And the statistics of that population MAY give a way to calculate the underlying value more precisely.

      While I can certainly envision situations where your procedure would help, in my credit card example, I fail to see how adding random noise to the measurements would increase either the accuracy or the precision of the measurements.

      Regards,

      w.

    • You are fooling yourself, Bob, if you think adding noise to a series of measurements improves accuracy. What it does is allow you to use a formula that isn’t applicable. In your case, even an infinite number of measurements will only allow you to get back to the original accuracy you had before you added the noise.

      • tty, Bob is not saying add noise to the measurements.

        He’s saying add noise to the actual quantity being measured. In that way you can indeed get better accuracy through repeated measurements.

        However, in the world of credit cards (or climate science) this is rarely an option.
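
        A minimal Python sketch of the difference (the 85.72 mm length and the amount of jitter are invented): readings of a perfectly still card all round the same way, so their average is stuck at a whole millimeter; if the quantity itself jitters randomly by a known, zero-mean amount before each reading, the average of the rounded readings recovers the underlying length.

          import random, statistics

          random.seed(5)
          TRUE = 85.72      # mm, invented
          N = 100_000

          still    = [round(TRUE) for _ in range(N)]                              # no noise in the quantity
          jittered = [round(TRUE + random.uniform(-0.5, 0.5)) for _ in range(N)]  # quantity jitters before each reading

          print(f"average, still card    : {statistics.mean(still):.3f}")     # stuck at 86.000
          print(f"average, jittered card : {statistics.mean(jittered):.3f}")  # close to 85.72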

        w.

        • He is still fooling himself. An infinite number of measurements would then allow him to get rid of the noise he added but would not improve the accuracy of the original measurement.

      • You are fooling yourself Bob if you think adding noise to series of measurement improves accuracy

        Funny, in electronics we call this “dithering”, and it’s used all the time to turn the correlated noise due to quantization errors into uncorrelated noise, which is exactly what’s happening in the case of the ruler.

        The general topic is “quantization noise”, go ahead and Google it.

        That quantization noise is white noise, and you can get 1/2 bit of resolution per halving of bandwidth, i.e. per doubling of the oversampling ratio (which is exactly analogous to the sigma/sqrt(N) rule).

        Without dithering, the quantization noise is correlated to the input signal and you are lucky if you get 3 bits or one decimal of improvement.

        Of course in climate signals, unlike the ruler, there’s tons of dithering but who knows whether the dither signal is white noise.. often not. But I think making blanket statements about “at most one decimal place” isn’t correct either. Why not just estimate using the Hurst exponent that can be roughly derived from the data itself?

        • Peter Sable December 26, 2018 at 4:10 pm

          Of course in climate signals, unlike the ruler, there’s tons of dithering but who knows whether the dither signal is white noise.. often not. But I think making blanket statements about “at most one decimal place” isn’t correct either. Why not just estimate using the Hurst exponent that can be roughly derived from the data itself?

          Peter, I use the Hurst exponent to correct for autocorrelation. However, I’ve never heard of using it to get a more accurate average of a number of measurements. Could you say a bit more about that?

          There are other problems with climate measurements. Often, you are not repeatedly measuring the same constant thing. Instead, you’re measuring the variations in something.

          Also, even in theory you can’t average temperature because it is an intensive quantity … in order to average something you have to add the individual measurements up. But unlike the sum of say weights, the sum of temperatures has no physical meaning …

          w.

          • Peter, I use the Hurst exponent to correct for autocorrelation. However, I’ve never heard of using it to get a more accurate average of a number of measurements. Could you say a bit more about that?

            If the Hurst exponent is between 0 and 0.5 you get a more accurate average because the signal has negative autocorrelation. This was discussed in the comments of your Hurst article.

            That’s the principle behind delta-sigma converters. Negative feedback allows about one decimal place per doubling of the oversampling ratio in a 3-stage design (though designers typically talk about bits, or powers of 2, instead of powers of 10).

            Unless your measurement system can interact with the climate (hah!), you are not going to get negative autocorrelation from the measurement system! But I wouldn’t be surprised if negative autocorrelation appears somewhere in nature.

            BTW on further reading estimating the Hurst exponent consistently requires about 20k samples, non trivial to say the least! So my proposition of “why not just estimate from the Hurst exponent” is not really practical in many cases. One might interpret that limit in a pessimistic fashion given the idea of the Null Hypothesis.
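
            For what it’s worth, here is a rough Python illustration of the negative-autocorrelation point (this is just a generic anti-persistent noise series, first differences of white noise, not the actual delta-sigma feedback): the error of the mean falls like 1/N rather than 1/sqrt(N).

              import random, statistics

              random.seed(3)

              def mean_of_noise(n, anti=False):
                  w = [random.gauss(0, 1) for _ in range(n + 1)]
                  # anti=True uses first differences of white noise: strongly negatively autocorrelated
                  e = [w[i + 1] - w[i] for i in range(n)] if anti else w[:n]
                  return sum(e) / n

              for n in (100, 400, 1600):
                  trials = 500
                  sd_white = statistics.stdev([mean_of_noise(n) for _ in range(trials)])
                  sd_anti  = statistics.stdev([mean_of_noise(n, anti=True) for _ in range(trials)])
                  print(f"n = {n:5d}   sd of mean: white ~ {sd_white:.4f}   anti-correlated ~ {sd_anti:.5f}")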

    • You need to add in a truly random noise source,

      Are you suggesting something like heating the credit card above the temperature of the ruler until the edge of the card seems centered on a ‘pip’? And a measurement system that yields information ‘at’ the pips but not in between, like a balance scale for weights (or electrical +/- comparator with precision source)? And once the balance point is reached, the degree of temperature rise and known property of the material used to ‘back estimate’ where between the pips the card edge was at ambient/ruler temperature? Proxies like turtles all the way down?

      Molecules are so skittery and slippery and everything is throbbing along an edge as long as the coastline of Britain. That’s why God created the photon, to count integral sheep and get some rest.

  55. Willis, I was researching this online recently. There is conflicting information, but I believe it to be correct that measurement resolution is a systematic error and not a random one. Hence it doesn’t reduce by sqrt(N) but is irreducible. What is the uncertainty here? Some say one tenth of the smallest division (i.e. 0.1mm), but I’ve seen others say that is too optimistic and that it is around half the smallest division (i.e. 0.5mm). Either way, in your example I do not believe the credit card length, using the ruler specified, can be known with an uncertainty less than 0.1mm no matter how many people measure it.

      • I’ve no idea what your reply means M. I know the difference between error and uncertainty but that has absolutely nothing to do with my post. My post is about random error (reducible) v systematic error (irreducible) and whether or not measurement minimum resolution is a random or systematic error.

  56. Surely the accuracy of the satellite measurements are testable?

    Take the number of passes required to establish mean sea level according to the theory, then do the same a number of times within a short time-frame (months). If the system is accurate, the precision will be tight.

    Not rocket science, surely?

    I have the feeling that we are not being told everything. I would love to know what variation they are really getting.

    Cheers

    M

    • “If the system is accurate the precision will be tight”

      This is not the case.
      From the Wikipedia on accuracy and precision:

      “A measurement system can be accurate but not precise, precise but not accurate, neither, or both. For example, if an experiment contains a systematic error, then increasing the sample size generally increases precision but does not improve accuracy. The result would be a consistent yet inaccurate string of results from the flawed experiment. Eliminating the systematic error improves accuracy but does not change precision.”

      Additionally and separately:
      Measuring the height of the ocean is not something like measuring a credit card, which holds still and does not change (much, although atoms are likely eroding from the edges and some slight thermal effects are occurring).
      The ocean and the height of it never holds still and is never the same even in the same place over a short period of time.
      And water blown or flowing away from one place winds up somewhere else, except when it does not (evaporation, precip, rivers, volumetric changes due to temp, etc)
      Nor is it like doing lab experiments in which there is an “accepted value” to compare results against.
      But we do have tide gages.

      In case you missed it, there is a link to an essay from Kip Hansen which gets at these issues very succinctly.
      https://wattsupwiththat.com/2017/10/14/durable-original-measurement-uncertainty/

      • Kip’s article is wrong, and the comments give it a thorough debunking. I’ll go so far as to say that its presence on WUWT degrades the credibility of WUWT.

        • “Wrong” is a broad term.
          Do you mean to say everything he wrote is the opposite of true, or he made one or a few errors in his analysis?
          One could read this thread and come away saying Willis is wrong, or he is correct, depending on who you decide to pay attention to.
          His article and the discussion are important, for different reasons.
          In every one of these discussions, we have some people comparing apples to oranges, others just mixing up terminology, and some others refuting what is accepted and taught in every science classroom in every university in the country, or at least what used to be… who the heck knows what they are teaching now.
          Saying something is wrong without being specific is itself wrong, as it is unhelpful for a discussion.

          IMO, this is wrong:

          “Surely the accuracy of the satellite measurements are testable?
          Take the number of passes required to establish mean sea level according to the theory, then do the same a number of times within a short time-frame (months). If the system is accurate, the precision will be tight.
          Not rocket science, surely? ”

          If you were in charge, we would know the sea level and variations in the world ocean and the issue would be settled…is that your position?
          Everyone else is an idiot?

          • Kip’s column had a main assertion which was 100% dead wrong, and proved wrong by math, as well as proved wrong by real world examples.

            I don’t know much about satellite measurement technology and so can’t comment on it. But I know enough about statistics to point out mathematical errors.

          • Steve O

            You said, “I know enough about statistics to point out mathematical errors.” Then please do, instead of just asserting that Kip is wrong.

          • I agree with Clyde.
            We mostly all seem to be here for a productive and informative conversation.
            Simply saying that you know enough to point out errors, or that math proved his main assertion wrong, amounts to an unsupported opinion.
            Opinions are fine, and you opinion may be correct.
            But it is unhelpful to make such statements without being specific, unless you merely wish to express your opinion and be done with the discussion.
            Kip’s post was a fairly long one, and the comment thread was one of the most extensive I have ever seen… over 500 separate comments!
            I stated up top just what I am thinking now: there are a lot of disagreements on these threads regarding the topics of uncertainty, precision, accuracy, and when statistical analysis is and is not valid. And there are several reasons for the disagreements, including people talking about different things (apples-vs-oranges disagreements), people using inconsistent terminology or the wrong verbiage for what they are trying to say, disputes over whether measurements of temperature or of sea level are measurements of the same thing or of different things, and some others.
            Lots of people here know plenty, and yet disagreements abound.
            If we all make an effort to at least communicate effectively, we can at the very least have a chance of learning something new, or of giving someone else another way of looking at something, or even just offering an alternative viewpoint or an interesting perspective.
            Of course, sometimes one might just wish to offer one’s own opinion and not spend time doing more than that. I do this sometimes.
            In such cases I try to recognize and say when something is merely my own opinion or view, but all too often we find others offering their opinion as if they were facts or settled or incontrovertible.
            I am not even sure if we disagree on anything, but I cannot tell unless you want to be specific.
            Saying read the comments when there are over 500 of them, and saying they give a thorough debunking when they are anything but all in agreement, and not even taking the time to say WHICH of his assertions have supposedly been debunked…is unhelpful.
            If there is one thing years at this site and others like it has taught us, it is that no one has a monopoly on correct opinions or valid arguments.
            Warmistas who claim to be experts at math and science tell us that Earth is burning up and that we are doomed unless we believe that the matter is settled and we do what they say.
            Some of them have PhDs in math, and yet many of us are certain they are full of crap and wrong.

        • Steve O,
          Without paying attention I thought your comment came from Michael Carter.
          In any case, I never said that Kip was correct and his article settles the issues being discussed.
          I said he lays out the issues succinctly.
          I am not a statistician…and I am not going to be the one to settle these back and forth debates about what averaging can and cannot do.
          For one thing various people keep using sloppy language.
          Another is that the properties and stats describing a series of numbers are not the same as measuring, with a device, something that is changing with time.
          Anyway, I have no interest in arguing about statistics, but only getting at how some people misuse them to further an agenda.

          • Steve O,

            Please tell us what is the incorrect assertion in Kip’s post. To say something is wrong and not defining the actual statement does not help with the correction, or discussion. Blanket statements are next to worthless in technical discussions.

          • “Measuring the air on different days in different places is not measuring the same thing.
            You keep ignoring that!”

            I’m not ignoring it. I’m actively disputing that it’s as important as you think it is.

            Let’s say you measure the temperature, and then a second later measure it again. You take 3600 readings over the course of an hour. How important is it that you’re taking the reading at a different time, when the temperature is slightly different? It’s not important at all, especially when what you are interested in is the average temperature over the one hour period. It’s exactly what you need to do. You cannot throw out mathematical statistics just because you’re “measuring different things.” Math still applies!

            It’s the same when determining the average temperature for a year-long period. You take 365 measurements of daily mid-range, covering every single day of the year. You can indeed determine the average of the mid-ranges for the year to within a decimal point of its true value, even with a wide error band. It is also true that you cannot know to within a decimal point the temperature/mid-range on any particular day. But nobody cares about any particular day.

            I’m not some warmist troll. Climate science is absolutely sick with bad methodology, but it’s important that criticisms be scientifically sound in order to maintain credibility. There’s enough that is unsound that we don’t have to attack the parts that are rock solid.

          • “You said, “I know enough about statistics to point out mathematical errors.” Then please do, instead of just asserting that Kip is wrong.”

            — There’s no reply button for your previous comment, so let me reply here. Then I’ll go back and read your columns.

            Kip claimed that since temperature measurements were rounded to the nearest whole number, estimating historical temperature to the tenth of a degree was not possible, even if the statistical math that everyone uses says otherwise.

            I would agree that if you have only one temperature measurement, that he would be correct. But when you have hundreds of measurements, the errors can be largely eliminated with multiple readings. And in one year, you have 365 readings. You can round the temperature of each day’s reading to the nearest degree, and you can still determine the average temperature to a tenth of a degree. Obviously, systematic errors will not be corrected, but that’s another issue.

            I created for him a math experiment that simulated measurements, with rounding, and it returned a result which would have been impossible if he had been right. The experiment said he was wrong. The math said he was wrong. Other people offered their own examples to show why he was wrong.

            I describe my experiment in other comments, but it’s short enough that I can repeat it here: Enter a number to represent a true value being measured, and copy it down for 1000 rows in Excel. Enter a column of random numbers, and another column with those columns added together. In the next column, round the numbers to the nearest whole number. The average of all the rounded numbers will be very close to the true value.

            If you have only one measurement, your error range is very wide. If you have 1000 measurements, you can get very, very close to the true value.
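
            For the record, here is a Python version of that spreadsheet experiment (the true value is arbitrary; the errors are assumed symmetric and at least as wide as the rounding step; a systematic offset in the errors would survive the averaging untouched):

              import random, statistics

              random.seed(4)
              TRUE_VALUE = 71.37                 # arbitrary "true" value
              N = 1_000

              # symmetric random errors between -1 and +1, then round to the nearest whole number
              rounded = [round(TRUE_VALUE + random.uniform(-1, 1)) for _ in range(N)]

              print(f"true value              : {TRUE_VALUE}")
              print(f"mean of rounded readings: {statistics.mean(rounded):.3f}")   # typically within a few hundredths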

          • Steve O,
            You said:
            “Obviously, systematic errors will not be corrected, but that’s another issue.”
            It is not another issue, it is central to the questions at hand.
            No one here disputes statistical mathematics.
            We are concerned with what the actual temperature is, whether it is known to a high degree of certainty for the entire planet back in the 19th century, and whether it is changing and, if so, by how much.
            We know it changes.
            We do not know the actual values, we only know what some measurements said it was.
            Measuring the air on different days in different places is not measuring the same thing.
            You keep ignoring that!
            There may or may not be a such thing as a global average temperature, but if there is, it is not a measure that means what people think it means, and there are not enough data points to know what it was.
            You have to ignore a lot to be a warmista, but to be a scientist you have to take everything known into account.
            You are ignoring a lot here Steve.
            Why?

          • Steve O
            You said, “And in one year, you have 365 readings.” You have 365 DIFFERENT readings of 365 DIFFERENT items. That is analogous to taking 365 pieces of a puzzle, weighing them, and reporting the average weight of all the pieces. Each measurement has associated with it the error and uncertainty of the balance used to measure the pieces. The Standard Deviation can be calculated, and as a sanity check, estimated from the Empirical Rule. That tells you something about the probability of another piece of the puzzle having a weight within the range of the original 365 samples. However, the mean weight is of questionable practical value. The uncertainty of the individual weights is nominally the same for all pieces. The Standard Error of the Mean may give you a sense of having improved the precision. But, what is really important here is the range of values, because the 2 SD error bar tells you that about 95% of the samples fall within that range.
            Compare that procedure with taking 365 measurements of the diameter of a precision ball bearing. It is the same, unvarying diameter (except for possible negligible ellipticity), measured within a short enough time period that other issues like instrumental calibration drift can be ignored. That is to say, almost all the variance can be attributed to random error, not systematic changes in the measured object or the measuring instrument, and it is reasonable to expect that the random errors will cancel. Thus the justification for dividing by the square root of the number of measurements: a large number of measurements will cancel more effectively than just a couple.

          • Steve O –> “If you have only one measurement, your error range is very wide. If you have 1000 measurements, you can get very, very close to the true value.”

            No you can not. That is where you don’t understand what you are calculating. You can take one million readings and average them out to 10 decimal places. You are calculating the mean to a very accurate number with a concurrent small uncertainty of the mean. THAT IS NOT THE TRUE VALUE except for very specific circumstances.

            As Clyde said, “But, what is really important here is the range of values, because the 2 SD error bar tells you that about 95% of the samples fall within that range.”

            The standard deviation is very important. I know you and others think you can average days to months, months to years, and years to decades and tell whether the average is increasing or decreasing. What we are trying to tell you is that you can’t make that assumption for a number of reasons.

            The range of values determines how far away the actual value can be. As you move to 2 and 3 standard deviations, you increase the range within which any measurement can fall. As you move to more and more SD’s, you approach the limit of the original error. Consequently, you must quote the mean with an error that ultimately is the measurement error, i.e. +- 0.5 degrees.

            Look at my post below describing target shooting. Do you really think the mean will tell you where the rifle described will shoot? Ultimately, you can only describe a circle that tells you where the errors are.

          • Clyde, sorry it took me so long to get back on this. You might not be there anymore. Anyway, regarding your first article, I see a couple of problems.

            Nobody cares what the temperature is on any particular day. When people reference “the temperature in 1850” nobody ever picks out one particular day. They are referring to the average temperature for the year. (Actually, the mid-range of two measurements, which is only a proxy, but it’s easier to just say “temperature.”)

            You say that to determine the temperature for a day, you should have continuous measurements, and integrate under the curve, which is true. How much different is that from taking 365 daily measurements to estimate the average temperature for a year? Instead of taking 3,600 measurements to determine the average temperature of a day, you are taking 365 mid-ranges to determine the average temperature of a year.

            If you do that in 1850 and again in 1950, and use this method to estimate a change in average temperature, how much different will your answer be if you plotted a curve and applied calculus to the values in 1850 and again in 1950?

            You also have a math error. Mathematically, it does not matter if you have rounded off each day’s measurements to the nearest degree.

  57. It might be helpful to repeat the mathematical requirements for using multiple measurements to reduce uncertainty. Very briefly they only apply to random errors of independent, equally distributed measurements of the same value.

    1. Errors must be random, more measurements have no effect whatsoever on systematic errors

    2. Measurements must be independent, i. e. measurements must not have any influence on each other. This for example makes repeated measurements by the same person dubious since his memory of earlier measurements may affect his readings.

    3. Measurements must be equally distributed i. e. they must have the same distribution function. This means that measurements with different methods can only be pooled if it is known that they have the same distribution (normal or whatever)

    4. The same value must be measured, for example in the credit-card measurement above the temperature of the card mustn’t change during the measurement. It need hardly be pointed out that a moving satellite never measures the same value twice.

  58. “Regardless of the number of measurements, you can’t squeeze more than one additional decimal out of an average of real-world observations.”

    I’m not sure this is correct.

    Let’s say that we know from another instrument that the true length of the card is 85.7525 mm +/-0.0025 mm. Out of 1,000 people, we have 750 who give us 86mm and 250 who give us 85mm. Our estimate for the card length would be 85.75.

    If we repeated the experiment with more cards and more 1,000’s of people we could even determine if there was a systematic error we needed to adjust for, and we could establish error bars around our point estimate. But as long as the measurement errors are randomly distributed, you can add accuracy by adding measurements.

    If the next card had an actual length of 86.05mm, the reported observations might be centered around 86mm, with 900 people reporting 86mm, 49 reporting 85mm, and 51 reporting 87mm.

    When Kip posted a column about temperature readings being only to the nearest degree, he also claimed a limit in the accuracy of estimating the true value. I created a mathematical experiment in Excel, laying in random “measurement errors” around a true value. The measurements themselves were then rounded to the nearest whole number. An average of the observations was still correct to within several decimal places. Rounding to the nearest whole value had almost no impact on the accuracy of the estimation of the mean.

    • It seems that a percentage of people have the view that a device’s measurement resolution can be increased by just doing many iterations of the same process of measurement.
      This ignores that the device has a resolution limit.
      Using this logic, we could use yardsticks to measure the length of ants by just repeating the measurement a whole bunch of times.
      Or maybe I am missing something.

      A separate issue involves conflating such things as a measurement of temperature of the air and a series of random numbers.
      I do not think this is valid.
      This article may be relevant:

      “Random errors are errors in measurement that lead to measurable values being inconsistent when repeated measurements of a constant attribute or quantity are taken. Systematic errors are errors that are not determined by chance but are introduced by an inaccuracy (involving either the observation or measurement process) inherent to the system.[3] Systematic error may also refer to an error with a non-zero mean, the effect of which is not reduced when observations are averaged.”

      And:

      “Random error (or random variation) is due to factors which cannot or will not be controlled. Some possible reason to forgo controlling for these random errors is because it may be too expensive to control them each time the experiment is conducted or the measurements are made. Other reasons may be that whatever we are trying to measure is changing in time (see dynamic models), or is fundamentally probabilistic (as is the case in quantum mechanics — see Measurement in quantum mechanics).”
      https://en.wikipedia.org/wiki/Observational_error#Systematic_versus_random_error

      • “It seems that a percentage of people have the view that a device’s measurement resolution can be increased by just doing many iterations of the same process of measurement.”

        Andrew Preece has a comment above that explains how this is done — not with a thought experiment, or on a spread sheet, but how electronics engineers do this in real world applications.

        • Andrew’s comment deals with a situation where the measured value varies randomly around a fixed value with an amplitude greater than the precision of the measuring device. In this case it is possible to narrow down the uncertainty, but only if you know that the average really is fixed, that the variation really is random, and you also know the distribution of the errors (it doesn’t need to be normally distributed, but it must be known).

    • Steve O –> You are missing several attributes of the problem. First, your ‘random’ numbers probably have a normal distribution. No one doubts that this would happen with that. Second, you are missing Kip’s point. You don’t have 1000 people line up to read a thermometer. You have one person who reads it once and in the past, rounded it to the nearest degree. Please tell us how you can determine what the real temperature was when the accuracy was +- 0.5 degrees. I suspect you’ll realize that you won’t be able to know if the temperature was 50.5 or 49.5 degrees or anywhere in between.

      The same thing applies when you begin to average temperatures that have been read once and rounded. If Tmax is 55.0 +- 0.5 and Tmin is 40 +- 0.5, what is the true average? Is it (55.5 + 40.5)/2 = 48, or (55.5 + 39.5)/2 = 47.5, or (54.5 + 39.5)/2 = 47. Or worse, anywhere in between?

      Look at the range you get. It is somewhere between 47 and 48, i.e. the nearest degree, and you have no way to decide where it actually is. This same error carries through any number of temperature readings, of daily, monthly, or yearly averages.
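
      A tiny worked version of that interval arithmetic in Python (same numbers as above):

        tmax, tmin, u = 55.0, 40.0, 0.5          # readings and their +/- uncertainty

        lo = ((tmax - u) + (tmin - u)) / 2       # (54.5 + 39.5) / 2 = 47.0
        hi = ((tmax + u) + (tmin + u)) / 2       # (55.5 + 40.5) / 2 = 48.0
        print(f"the daily mid-range lies anywhere between {lo} and {hi}")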

      • My errors don’t need to be normally distributed. As long as there is no bias, or skewness in the distribution the errors will be symmetrically distributed and everything will work.

        If you take five minutes to run your own experiment, you’ll see that Kip was wrong and you’ll be able to figure out exactly how you get greater accuracy starting with data readings that are rounded to the nearest whole number.

        In Kip’s case of reading the historical temperatures, you don’t have 1,000 people reading each thermometer, but you do have a reading for every single day, and you have thermometers at multiple locations. It is only true that you do not have very good accuracy for any one location on any one particular day. The rounding of measurements has almost no impact on your ability to determine the true value to within a decimal point. His point was that rounding the readings and restricting the data to whole numbers degraded the accuracy and limited the precision in determining the true value, and that’s not correct.

        So, open up Excel and copy down 1000 instances of any “true value” you choose. In the next column, create measurement errors between 1 and -1 using a random function, and in the next column add the two columns together to come up with 1000 measurements. In the next column, round the measurements to the nearest whole number.

        Now you have 1,000 imperfect measurements, all rounded to the nearest whole number. How close do you think the average of all your bad measurements will be to the true value?

        • For your information, climate time series are almost never normally distributed, are often skewed to some extent, and are usually rather strongly autocorrelated.

        • Now you have 1,000 incorrect measurements, all rounded to the nearest whole number. How close do you think the average of all your bad measurements will be to the true value?

          I’ve followed the procedure and it does work nicely for me: mean of incorrect measurements is very close to the mean of true values. However, this is because randomizers usually generate a sample (or samples) from the ‘standard normal’ distribution. When I use random numbers generated by algorithms used for common cryptography it does introduce a detectable error.

          • That’s a perfectly valid observation. If there is a skewness in the errors, or any systematic bias, then that error will not be corrected.

        • You are dealing with numbers and how to handle them. I am dealing with measurements.

          You are ignoring the questions I asked about the temperature measurements. The numbers there are very pertinent to this.

          Please answer my questions if you can.

          Then you may realize what the problems are.

          • I reckon what Steve is trying to say is that even if single measurements contain significant uncertainty, that uncertainty diminishes once the sample size increases.

            The same thing applies when you begin to average temperatures that have been read once and rounded. If Tmax is 55.0 +- 0.5 and Tmin is 40 +- 0.5, what is the true average? Is it (55.5 + 40.5)/2 = 48, or (55.5 + 39.5)/2 = 47.5, or (54.5 + 39.5)/2 = 47. Or worse, anywhere in between?

            I would say then it is 47.5 +/- 0.5.

            What if you have, say, 1000 such measurements each with +/- 0.5 deg accuracy and the trend is +0.3? Is this trend detectable or not?

          • Para –> The problem is that you’re not measuring the same thing 1000 times. You are measuring temperature a day later. This is a brand new measurement of a brand new thing. You can’t average them and say you have a more accurate measurement of either. The measurement error must be carried through.

            Climate scientists use 47.5 and drop the +-0.5. This immediately makes one assume the data is more accurate than it really is.

            To answer your question, you can’t recognize a trend that is less than the error. However, climate scientists claim to, since they don’t even acknowledge the measurement errors! Too many deal with numbers in a computer and believe those numbers are real. They are not.

          • When you take the measurement of a temperature, it is recorded as a number. I don’t know how to get around that. I don’t understand why such a distinction is important.

            “You have one person who reads it once and in the past, rounded it to the nearest degree. Please tell us how you can determine what the real temperature was when the accuracy was +- 0.5 degrees. ”

            You don’t have one person who reads it once in the past. You have one person who read it once each day for 365 days over the course of a year, or 3,650 times over the course of 10 years. You do not know the temperature on any particular day, but you DO know the average of your readings, even if the readings are rounded to the nearest degree. As for knowing the average temperature on any particular day, you’d need readings throughout the day, which don’t exist. Nobody knows what the temperature was on any particular day, but it’s also not important. If you have a reading taken at the same time each day, a lot of errors will wash out. You can determine if there is any trend in the data over time, to within a tenth of a degree, even if you take only one reading each day (at the same time) over the course of a long period of time.

            When errors are symmetrically distributed, the errors can be eliminated with multiple sampling.

        • Dithering is not about randomness already present in the signal; it is about adding a dithering signal whose amplitude is greater than the resolution of the measuring device. If your resolution is 1 mV, then your dithering would have to vary the signal randomly with some known distribution and an amplitude of, say, 2 to 5 mV; the larger the dither, the more extra resolution you can achieve. You would also have to have this occur at a frequency much higher than your output sample rate to meet the Nyquist criterion. If you want a 1 kHz output rate (one reading every millisecond) and you want to average 200 dithered samples per reading, the effective sampling rate needs to be 200 kHz, and the dither has to be at 200 kHz or faster; otherwise you are not taking a full dithered sample of the value, which you have to hope isn’t changing during those 200 samples. Also, the 200 samples are not independent; they are of the same value, displaced in time and offset by the random dither. For dithering to work, the value must not change.

          This additional overhead is why I do not use this method in my systems; we cannot afford the processing power to make it work. And since we were talking about measuring signals without any dithering applied at the time of measurement, we are still limited by the base resolution of the measurement itself. Thus, my contention still stands that averaging a group of purely independent measurements with both time and spatial displacement does not give you any more accuracy than the base resolution used.
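          As a numerical illustration of the dithering requirement described above (a sketch with made-up values, not anyone’s actual system): quantize a fixed signal with and without a dither wider than the quantization step and compare the averages.

          import numpy as np

          rng = np.random.default_rng(0)
          true_signal = 3.37      # mV, a made-up fixed value
          step = 1.0              # 1 mV quantization step
          n = 200                 # samples averaged per reading

          def quantize(x, step):
              return np.round(x / step) * step   # snap to the nearest quantization level

          # Without dither every sample lands on the same level, so averaging gains nothing
          no_dither = quantize(np.full(n, true_signal), step)

          # With dither (5 mV peak-to-peak, wider than the step) the average recovers the value
          with_dither = quantize(true_signal + rng.uniform(-2.5, 2.5, size=n), step)

          print("average, no dither:  ", no_dither.mean())    # stuck at 3.0
          print("average, with dither:", with_dither.mean())  # close to 3.37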

    • Steve O
      The problem with your Excel ‘experiment’ is that you are starting with a single value instead of many values.

      • “The problem with your Excel ‘experiment’ is that you are starting with a single value instead of many values.”

        It only takes a few minutes to change the experiment so that you start with a set of values. At the end of the experiment, you will still be able to determine the numerical average of the set. This would more closely resemble an estimation of the average temperature of a year where you have 365 mid-range estimates based on two daily readings.

        Or, you could say it applies to the average location of a chicken in a yard. Your estimated location can be off by 20 feet, and the chicken can be moving all over the yard, and you’ll be able to determine the “average location” of the chicken. You can even round off the last digit in order to make your inaccurate measurement less accurate and it will hardly make any difference.

  59. Hey Willis,

    Good stuff. This subject has been discussed here several times in different forms. I reckon the belief that ‘things will average out when the sample size increases’ is widespread in the climate science community. It’s a convenient way of hiding all the uncertainties and measurement errors.

    I would echo a question posted earlier: how does the situation look with respect not to averaging direct measurements but to averaging the averages of those? Say you’ve got the mean of 1,000 measurements, then another mean from another 1,000, and so on. Does it change much?
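    A quick numerical check of that question, assuming equal-sized batches of independent measurements (a sketch with arbitrary numbers): the mean of the batch means equals the grand mean, and the spread of the batch means is just what sigma/sqrt(N) predicts for the batch size.

    import numpy as np

    rng = np.random.default_rng(1)
    batches, per_batch = 50, 1000
    data = rng.normal(loc=10.0, scale=0.5, size=(batches, per_batch))

    batch_means = data.mean(axis=1)
    print("grand mean:            ", data.mean())
    print("mean of batch means:   ", batch_means.mean())        # identical for equal batch sizes
    print("spread of batch means: ", batch_means.std(ddof=1))   # ~ 0.5 / sqrt(1000)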

  60. Willis

    In a previous article posted on this blog, there was a mention of mixing satellite altimetry data and terrestrial tide gauge data.

    In my day-to-day work I use tide gauges, from older mechanical gauges with rubber diaphragms to the state-of-the-art electronic gauges available today. Measuring tide (especially offshore) was always one of the largest components of my error budget.

    Disregarding the geodesy and its associated errors, as a starting point the instruments I use claim a precision of 1.0 mm and an accuracy of +/- 10 mm in optimum conditions; however, in the real world we never see that.

    My question is: how can ‘scientists’ see sea level rises of 3 mm per year?

    • They can’t; they ‘model’ them,
      if for no other reason than the very limited number of measurements compared to the vast area to be measured. And that is before we get to allowing for errors and other factors.
      We are once again in an area where ‘that will do’ was the norm, because that was often all that was needed. Now it’s claims of unquestionable accuracy, and demands for fundamental changes based on those claims, while the data-gathering process has mostly remained at the ‘that will do’ standard.
      It’s still the case that they cannot make a weather prediction more than 72 hours ahead worth much more than ‘in summer it is warmer than in winter’, which is why they still get it wrong so often. Add that issue to the data collection problems and you can see that the best ‘stunt’ they ever pulled was to get so many to buy into the idea that climate ‘science’ is ‘settled science’ in the first place.

  61. Willis,
    This discussion brings to mind the old adage: “It’s easy to read a thermometer, it’s difficult to measure temperature”. A corollary might be: “It’s easy to read a ruler, it’s difficult to measure length”. This is partially because we’re co-mingling more than one task. With more precise measuring tools, we would find that there is no single value for the ‘length’ of the credit card. Its edges are not exactly parallel nor are they exactly straight, along both their lengths and along their thicknesses. This makes a single value for the precise length meaningless. So now we must add the dreaded – statistics.

    No number of multiple readings will make an instrument more accurate than its inherent accuracy, nor more precise than its inherent precision. (I include the ability of an observer to interpolate as part of inherent accuracy.) However, multiple observations of an instrument CAN, indeed, improve statistical summaries of its data. Torturing data sets with statistics can also make the results meaningless. We see that a lot from Climate “Scientists”.

    This brings about another corollary: “It’s easy to calculate an average of a set of numbers, it’s difficult to know what it means.” There is, indeed, an average temperature of earth. I doubt that anyone has come even close to determining what it is. It’s too bad we can’t simply place a rectal thermometer in Washington DC, and read it.

    • You can put the rectal thermometer anywhere in DC and get the global temperature. If “global temperature” means anything, it means the whole globe, anywhere on it, whatever point you want. It also means you really need only one thermometer to measure it, anomalies included.

  62. Clearly the card is 3 3/8″. Makes it 85.725 mm. /sarc

    If all the measurements use the same units and the same rounding rule, your standard deviation would be zero, implying a very precise measurement. If you randomly choose the rounding rule and calculate the stats, then all you know is the mean and standard deviation of the choice of rounding rule. On top of real data it looks like bias, which is the problem I have when looking at year-over-year temperature trends.

  63. No matter how many measurements are taken with a ruler marked to whole millimeters and recorded as whole millimeters, when one averages the records, one gets the Mean of the Measurements.

    The Mean of Measurements can be considered accurate to 1/10th mm.

    However, the Mean of Measurements must not be claimed to identify the actual length of the credit card. The Mean of Measurements only refers to the measurements taken — not to the physical object being measured.

    For data sets of measurements such as Sea Surface Height (very complex systems involving numerous conversions of electronic signals into ‘distances’, plus multiple confounding factors), it is extremely important to know that the accuracy and precision of your results apply only to the averages themselves, and not to the physical thing being measured (the distance from the satellite to the sea surface).

    • Realistically, there is no way to measure the “physical thing” itself, but our measured impressions of the physical thing are a practical necessity; that’s why we measure: to attempt to synchronize human consciousness with a reality that we can never really grasp, but with which we can interact in a more controlled fashion, using our measurements of it.

      That being said, we should know the limitations of our measurements; we should know when they are anywhere near real, and when they are somewhere in the realm of made-up or overstated in precision.

      I still find the measurement of sea level from outer space mysterious. Where’s the reference point? How can we establish that any reference point in space stays in its place? I mean, the Earth is moving constantly in all sorts of ways. How do you determine a fixed place in space with respect to a constantly moving Earth? Star-field background, maybe? But even so, there still seem to be lots of movements to take account of in order to establish that fixed reference point. And if the Earth is moving with respect to that eventually established reference point, then are you still measuring a sea-level height, or a drift in the distance that Earth has moved?

      Seems like lots of room for error.

      • Coordinates based on the Earth’s center-of-mass. These are known to a high degree of accuracy, though not to millimeter precision.
        NASA is planning a new series of small geodetic satellites (GRITSS) that will hopefully improve the accuracy of the coordinate system to about 1 mm, i.e. about 10 times worse than what is already claimed for sea-level measurements.

        Interestingly, satellite measurements of Arctic ice thickness are universally acknowledged to have decimeter precision at best, despite being vastly simpler than sea-level measurement, since they measure the difference in altitude between the top of the ice floes and the intervening leads, and so are unaffected by atmospheric effects and independent of the absolute level of the sea.

  64. Einstein’s doctorate, and five papers on Brownian motion, used statistics together with physics to prove the existence of molecules, which were at that time immeasurably small.
    https://www.britannica.com/science/probability-theory/Brownian-motion-process#ref407453
    Einstein, who never accepted statistics as natural law, was indeed the expert there, and as a physicist he was not limited by measurement accuracy, precision, repeatability….

    Considering radar propagation through the atmosphere and ionosphere, is there data available from Jason-x-type satellites over at least one 22-year solar cycle? Older data might not have GPS info.

    • Brownian motion follows the normal distribution. The probability density of the normal distribution is

      P(x) = A\,(2\pi\sigma^2)^{-1/2}\,e^{-(x-\mu)^2/(2\sigma^2)}

      where A = normalization factor, \sigma = standard deviation, \mu = mean.

      Einstein replaced the probability density P with the mass density \rho, substituted \sigma^2 = 2Dt, \mu = 0, A = N, and expressed it as a function of two variables (x, t):

      \rho(x, t) = N\,(2\pi\cdot 2Dt)^{-1/2}\,e^{-x^2/(4Dt)}

      where N = number of particles, D = mass diffusivity, t = time.

      Solving this equation enabled Einstein to determine the number of atoms in a mole, and thus Avogadro’s constant.

  65. This is why I threw out all my Starrett mics, and just use a wooden yardstick in the machine shop.
    All I have to do is keep measuring with that yardstick and I can get whatever accuracy I want. Works great on crankshafts… to 100 thou, no problem /sarc

  66. This is all rather discouraging! From my long-time reading of WUWT, I conclude that the commenters are generally bright, well-educated, and technically experienced. Yet, we can’t reach agreement on whether or not a large number of readings of some constant value, let alone a variable, can improve the precision of the estimate of the mean of the many readings, or reduce the standard deviation of the samples. What to do! Can we get William Briggs to weigh in on this?

    • The answer unfortunately is “it depends”.

      If the value is really constant, the errors are random, and the measurements are independent and identically distributed, then it does. Otherwise it doesn’t, or at least not in proportion to the square root of the number of measurements.
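      One way to see the “it depends” numerically (a sketch with arbitrary numbers, not any real data): compare how the uncertainty of the mean shrinks for independent errors versus strongly autocorrelated errors of the same size.

      import numpy as np

      rng = np.random.default_rng(2)
      n, reps, phi = 1000, 2000, 0.9      # series length, replicates, AR(1) coefficient

      # Independent, identically distributed errors: SD of the mean ~ sigma / sqrt(N)
      iid = rng.normal(size=(reps, n))
      print("iid errors, SD of the mean:   ", iid.mean(axis=1).std())
      print("sigma / sqrt(N):              ", 1 / np.sqrt(n))

      # Strongly autocorrelated (AR(1)) errors with the same marginal SD of 1
      ar = np.zeros((reps, n))
      shocks = rng.normal(scale=np.sqrt(1 - phi**2), size=(reps, n))
      for t in range(1, n):
          ar[:, t] = phi * ar[:, t - 1] + shocks[:, t]
      print("AR(1) errors, SD of the mean: ", ar.mean(axis=1).std())   # several times larger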

      • tty
        Personally, I agree with your assessment. However, my concern is how do we convince those who disagree with us that they are wrong?

        • Very discouraging, Clyde.
          Steve O in particular just will not get away from his conviction that measuring an ever-changing value with a device, and doing so over many days in many places, will somehow wash out all errors and give you a number that very closely describes a trend over the time period in question.
          I have started to read the posts you linked to, to articles here that you wrote over the past couple of years on this topic, and am now swimming through the long comments sections.
          In addition to what we have from Steve O, I have noted that some people claim that different temperature measurements in multiple places, taken on multiple days (and even in-filled fake numbers) over some period of time, are not measuring something different each time…they are measuring “global average temperature”.
          And since they give it that name, this magical number is “one thing”, that can be precisely and accurately determined with a long series of not particularly precise measurements of unknown accuracy…as long as we have lots and lots of them!

          I refer back to my analogy of weighing cats, and claiming that you now know how much “cats” weigh, if you do it lots of times over a long period…even though they are different cats, and sometimes the same cat on a different day…or not. And they are not holding still while you weigh them, but just running around and sometimes running over one of the scales you have set up in cat town.
          But the weight of cats may hover around a fixed value when averaged over many cats and many years. It may not, but it may. But can you say after a hundred years that you know what cats weigh to the nearest 1/100th of a gram and if there is a trend of 0.01 grams per decade?
          Even if your scale only measured to the nearest pound?
          Even if the scale outside the Midnight Cat Café was replaced with a laser scanner in 1994, and then moved to the Hissing Fence, which now has the dumpster of a fish market next to it?
          *sorry, sometimes you just have to find something to lighten up the mood*

          Everything I have learned has informed me that the law of large numbers applies to measuring the same thing, and we have several eloquent explanations for how it is simply not applicable to the subjects at hand, even if someone has a name for it that implies it is one thing.

          • What if we change what we say we are measuring? We’re not measuring the average daily temperature by taking a high reading and a low reading. Instead, let’s say that we’re measuring the average temperature for a one year period — by taking 2 measurements each day for 365 days.

            Do we really need to take 240 measurements each day or can we get away with taking high and low readings and using the mid-range in our calculations?

          • I have no idea why you ignore what everyone else is saying Steve.
            No one is ignoring what you are saying.
            And in your own words, you confirmed that Clyde is correct.
            I am coming to the conclusion that you may well be a warmista troll.

          • “And since they give it that name, this magical number is “one thing”, that can be precisely and accurately determined with a long series of not particularly precise measurements of unknown accuracy…as long as we have lots and lots of them…”

            Maybe this is the core of the disagreement. What makes an average temperature a magical thing? Does it not actually exist in reality? Why does the math not apply?

            If you could measure the surface temperature at each square centimeter, simultaneously, around the entire globe, with precise instruments, could we not determine the average value with a great amount of accuracy? If we reduce the number of thermometers by a factor of 1,000, such that we only measure each square foot, will our accuracy decline appreciably? No. If we rounded each thermometer’s reading to the nearest degree, would the average of them change? Not by very much. If we added random errors to each thermometer, would our average of the readings change? Not by very much.

            To get within a certain level of accuracy is a matter of the number of readings you take. Take enough readings, in enough places and you can determine the average according to whatever degree of accuracy you desire. Or, start with the number of readings that are available, and you can determine the level of accuracy which you have achieved.

            I also do not see why it matters that the temperature changes over the course of a year. You can measure the average location of a chicken in a chicken coop, with measurement errors that are also rounded off, and you’ll still be able to determine the “average location” of the chicken.

            The same is true with the average weight of the cats of the world. With enough measurements, you can indeed know the average weight to within 1/100th of a gram. And if there is a trend of 0.01 grams per decade, with enough measurements you’ll be able to detect it.

          • Keep in mind that at no time were all the cats in the world measured, and there was no systematic effort to make sure that the cats measured were truly representative of the average of all cats in the world.
            Let alone that the subsequent measurements tracked the average of the true value of how all the cats in the world were changing over time.

  67. Here’s a simple thought experiment based on the method used by Morice, Kennedy, Rayner and Jones in “Quantifying uncertainties in global and regional temperature change using an ensemble of observational estimates: the HadCRUT4 data set”.

    The maximum and minimum temperatures for a given day can be averaged to give a daily mean temperature.
    Calculate the uncertainty as the standard deviation of the daily means. For the sake of argument say it’s 0.5 degrees Celsius. Half a degree? That’s not very accurate.

    To get the average monthly temperature, note that there are about 30 days in a month and two measurements per day. That’s 60 measurements. Divide the average daily temperature’s standard deviation by SQRT(60). That gives us an uncertainty of 0.5/SQRT(60) = 0.065 degrees.
    Now divide by SQRT(12) since there’s 12 months in a year. That reduces the uncertainty of the yearly mean to just 0.019 degrees.
    Why stop there? We can keep going and calculate a decadal mean. Since there 10 years in a decade, divide by SQRT(10). Now the uncertainty is 0.0059 degrees.
    By this same line of reasoning we can divide by SQRT(10) again and we now know the average temperature over the past century to an amazing 0.0019 degrees Celsius.
    None of this required any equipment more complicated than a pocket calculator. I’m old enough to be able to do the calculations by hand with pencil and paper. (Yes, I was taught how to extract square roots in primary school.)
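    For the record, the chain of divisions above is just this (a few lines of Python reproducing the arithmetic; the 0.5 degree starting value is the for-the-sake-of-argument figure):

    import math

    u_day = 0.5                              # assumed SD of a daily mean, deg C
    u_month   = u_day / math.sqrt(60)        # ~0.065  (30 days x 2 readings)
    u_year    = u_month / math.sqrt(12)      # ~0.019
    u_decade  = u_year / math.sqrt(10)       # ~0.0059
    u_century = u_decade / math.sqrt(10)     # ~0.0019
    print(u_month, u_year, u_decade, u_century)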

    • “The maximum and minimum temperatures for a given day can be averaged to give a daily mean temperature.”

      However, this is not the mathematically correct daily mean temperature. That can only be determined by integrating the temperature curve over the given day, and it will often not be even close to the average of the maximum and minimum temperatures. The latter definition is, quite honestly, solely motivated by the fact that many weather stations have had maximum and minimum thermometers for a long time, so this data is available while the integral is not. And even that value depends on when the thermometers are read off, hence the need for TOBS corrections (which is one correction that is actually defensible, though it will of course increase the uncertainty even more).
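      A minimal illustration of that distinction, using an invented, skewed diurnal curve rather than station data: the mid-range value (Tmax + Tmin)/2 and the integrated daily mean can differ by a couple of degrees.

      import numpy as np

      # Invented diurnal cycle: a cool day with a short warm spike in mid-afternoon
      hours = np.linspace(0, 24, 24 * 60, endpoint=False)
      temp = 10 + 8 * np.exp(-((hours - 15) ** 2) / (2 * 2.0 ** 2))   # deg C

      midrange = (temp.max() + temp.min()) / 2    # what (Tmax + Tmin)/2 reports: ~14.0
      daily_mean = temp.mean()                    # numerical integral of the curve: ~11.7

      print("mid-range (Tmax+Tmin)/2:", round(midrange, 2))
      print("integrated daily mean:  ", round(daily_mean, 2))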

      • tty and Pat Lane,
        Yes, this has been gone over before. The value computed from TMax and TMin is at best a median, more properly called a mid-range value.

        https://en.wikipedia.org/wiki/Mid-range

        A mid-range has none of the characteristics of a true arithmetic mean, and as tty correctly points out, may not even be close to a true mean as derived from many discrete observations or the integration of a function.

        Therefore, any subsequent calculation of the means of daily medians, representing them as a mean of means, is a misrepresentation. At best, one is looking at the distribution of daily medians. This is the state of the ‘science’ of climatology.

      • If you feel better calling this a mid-range value, then I don’t see any harm. But I also don’t believe it changes anything. Whatever difference there may be between the mid-range value and the area under the curve of a continual reading throughout the day is not important. If because of the warming and cooling pattern, the average difference is 0.487 degrees to one side or the other, it doesn’t have any impact as long as we are always using the mid-range value. And if we mis-name the value, calling it an average or a mean, it may be a bit sloppy but I don’t see how it changes anything.

  68. Willis
    Assuming the lengths of credit cards have a normal distribution, you can predict, but not measure, lengths smaller than the 1-mm resolution of the ruler.
    Let:
    u = mean length, o = standard deviation, x = 10 = multiple of standard deviation
    Since 86 and 85 mm are the upper and lower limits, the mean is the midpoint of the limits:
    u = (86 + 85)/2 = 85.5

    Upper limit:
    u + x o = 86
    85.5 + 10 o = 86 (Eq. 1)
    Lower limit:
    u – x o = 85
    85.5 – 10 o = 85 (Eq. 2)
    Solving for o from Equations 1 and 2:
    o = (86 – 85)/(2 x 10) = 0.05

    Taking 10,000 samples, all of them will have lengths greater than 85 and less than 86. You can predict the length L within 1 standard deviation = 68%.
    10,000 (0.68/2) = 3,400 samples have a length between u and u + o:
    85.5 ≤ L ≤ u + o = 85.5 + 0.05 = 85.55
    Another 3,400 samples have a length between u – o and u:
    u – o = 85.5 – 0.05 = 85.45 ≤ L ≤ 85.5

    • Dr. Strangelove, I don’t understand your example and calculation. If instead of x =10 in your example, I assume x = 50: then I must conclude from your math that the standard deviation of all measurements has jumped downward from 0.05 to 0.01. Since just four standard deviations (Z=4, or “4-sigma”) includes 99.997% of all data points in a normal Gaussian distribution, how would adding the few extra data points associated with 50 standard deviations (Z=50, or “50-sigma”) between 85.0 and 85.5, and then a few more data points between 85.5 and 86.0, change the 1-sigma (one standard deviation) value so drastically?

  69. How different is Mike’s Nature Trick (TM) from comparing electronic measurements to historic primitive thermometer values read by eye? I call for a revival of primitive instruments, to be read in parallel for comparison values.

    iron brian

  70. I used to model pump stations with ocean intakes, using tide height tables to change the suction head for the pump intakes. The predicted tide height was only an estimate, and barometric pressure can surely change actual water heights, by storm surge for example.

    Do the satellite measurements use tide tables? What is the resolution of their tide calculations?

    iron brian

  71. NIST Technical Note 1297

    The stated NIST policy regarding reporting uncertainty is (see Appendix C):

    Report U together with the coverage factor k used to obtain it, or report uc.

    When reporting a measurement result and its uncertainty, include the following information in the report itself or by referring to a published document:

    A list of all components of standard uncertainty, together with their degrees of freedom where appropriate, and the resulting value of uc. The components should be identified according to the method used to estimate their numerical values:

    those which are evaluated by statistical methods,
    those which are evaluated by other means.

    A detailed description of how each component of standard uncertainty was evaluated.
    A description of how k was chosen when k is not taken equal to 2.

    It is often desirable to provide a probability interpretation, such as a level of confidence, for the interval defined by U or uc. When this is done, the basis for such a statement must be given.

  72. Willis,

    You wrote:-

    “Regardless of the number of measurements, you can’t squeeze more than one additional decimal out of an average of real-world observations.”

    There are numerous real-world examples where precision error is reduced to a negligible contribution by repeat measurements. The critical factor in determining whether precision error is reducible in this sense is the ratio of the range of measurements to the error range that comes from measurement precision. If this ratio is less than or equal to one (as is the case in your credit card example), then it is correct to say that precision error becomes irreducible in the average value no matter how many repeat experiments you carry out. Otherwise, your statement is simply not true.

    You can test this for yourself in just a few minutes on a spreadsheet or by generating a simple R script.

    (1) Take 500 random samples from an N(0,1) distribution, and calculate the sample mean and variance.

    (2) Round the sample values to the nearest integer value. You should find that nearly all of the values take an integer value between −2 and +2. You have now imposed a precision error on the sample dataset equivalent to a Uniform distribution on the interval (−0.5, +0.5). It is equivalent to measuring only to the nearest unit value.

    (3) Calculate the difference between the sample mean from (1) and the mean of the rounded values from (2).

    (4) Record the value in (3)

    (5) Choose a new set of 500 random numbers, return to (1), rinse and repeat.

    The difference in means which you calculated in (3) represents the difference between using precise measurements (of each realisation from the N(0,1) distribution) and using rounded measurements with a precision error of plus or minus 0.5. If you repeat the above numerical experiment enough times, you will find for the values above that these differences have a mean of zero and a standard deviation of just 0.01291. This latter value is the contribution of the precision error to the estimate of the mean value of the 500 measurements. You should note that (i) it is far smaller than the precision error itself and (ii) it is further reducible by increasing the number of sample measurements.
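    For anyone who would rather run this than read it, here is a minimal sketch of the same procedure in Python rather than a spreadsheet or R (arbitrary seed); the ~0.0129 figure falls out as (1/√12)/√500.

    import numpy as np

    rng = np.random.default_rng(3)
    n_samples, n_repeats = 500, 10000

    diffs = []
    for _ in range(n_repeats):
        x = rng.normal(size=n_samples)             # (1) 500 draws from N(0, 1)
        rounded = np.round(x)                      # (2) round to the nearest integer
        diffs.append(rounded.mean() - x.mean())    # (3) difference of the two means
    diffs = np.asarray(diffs)                      # (4)-(5) repeated n_repeats times

    print("mean of differences:", diffs.mean())    # ~0
    print("SD of differences:  ", diffs.std())     # ~0.0129
    print("(1/sqrt(12))/sqrt(500):", 1 / np.sqrt(12) / np.sqrt(500))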

    • kribaez December 22, 2018 at 4:40 am

      Willis,

      You wrote:-

      “Regardless of the number of measurements, you can’t squeeze more than one additional decimal out of an average of real-world observations.”

      There are numerous real-world examples where precision error is reduced to a negligible contribution by repeat measurements.

      Indeed there are. And there are also numerous real-world examples where precision error CANNOT be reduced to negligible contribution by repeat measurements. Which was my point.

      I see over and over people claiming that we can know e.g. the temperature in 1850 to the nearest hundredth of a degree, simply because of the number of measurements involved. Those kinds of claims are the subject of the post.

      Here’s one way that I use to understand the issues. If you want one more decimal of accuracy in your estimate of the mean, you need one hundred times as much data. And the reverse is true: if you have a hundredth of the data, you lose one decimal of accuracy. It varies as the square root of N.

      Here’s an example. Berkeley Earth says that the North America temperature in 1870 was about 1.7 ±0.3°C. They also say that there were 100 temperature stations at that time.

      And by the immutable laws of statistics, that would mean that if they only had one temperature station at that time, they could tell the temperature of North America to ±3°C … and I don’t believe that for one minute.
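      Spelled out with those numbers, that scaling is just

      \frac{\sigma}{\sqrt{N/100}} = 10\cdot\frac{\sigma}{\sqrt{N}} \quad\Rightarrow\quad \pm 0.3\,^{\circ}\mathrm{C}\times\sqrt{100} = \pm 3\,^{\circ}\mathrm{C}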

      You can see why this is an important issue … you cannot blindly apply the Central Limit Theorem to some given dataset.

      Best regards,

      w.

    • Using your example of 500 random samples: do you really think that a glass thermometer in the late 1800s had 500 sample measurements for each recorded temperature? How about satellite measurements? Do we have 500 satellites taking the measure of a particular stretch of ocean at a time?

      Take the ADC discussion above. In essence you have a one-bit sample of 0 or 1, taken ONCE. From that you have to determine where the signal level actually is. How accurate would that really be?

  73. Hey Steve,

    When you take the measurement of a temperature, it is recorded as a number. I don’t know how to get around that. I don’t understand why such a distinction is important.

    My understanding of that: Jim is saying that measuring one thing 1,000 times (as per Willis’s analogy with the credit card length) is a different kind of fish from measuring a variable temperature signal twice per day for 1,000 days, where each measurement represents a different temperature. For the former, you may hope that the mean of the repeated measurements converges to the true value. For the latter, where each measurement has an associated uncertainty of, say, +/- 0.5 deg C, such hope may be questionable.

    • “Jim is saying that measuring one thing 1,000 times (as per Willis’s analogy with the credit card length) is a different kind of fish from measuring a variable temperature signal twice per day for 1,000 days, where each measurement represents a different temperature.”

      — I guess there is something here that I don’t understand. Why does that matter? Yes, a credit card is an unchanging value. Temperature is a changing value. Why is that an important distinction? The AVERAGE of the temperature over a period of time is not a changing value. At the end of a year, there are 365 measurements of a temperature that has changed over time. But there is only one average.

      • Hey Steve,

        I hope you had a glorious Christmas and you’re getting ready for New Year ball madness.

        — I guess there is something here that I don’t understand. Why does that matter? Yes, a credit card is an unchanging value. Temperature is a changing value. Why is that an important distinction? The AVERAGE of the temperature over a period of time is not a changing value. At the end of a year, there are 365 measurements of a temperature that has changed over time. But there is only one average.

        Again, my understanding is as follows: yes, you’ve got only one average, made out of 365 daily mid-range values. Still, the uncertainty associated with each measurement, say +/- 0.5 deg C, persists and does not disappear as you go along with the averaging. So if you’ve got a trend of, say, +0.3 deg C, this trend is not detectable due to the +/- 0.5 deg uncertainty, even if your average number is quoted to, say, 8.231 deg, i.e. to thousandths of a degree Celsius.

        • “Still, the uncertainty associated with each measurement, say +/- 0.5 deg C, persists and does not disappear as you go along with the averaging.”

          Thanks, but the math says otherwise. As long as the +/- 0.5 deg C errors are symmetrically distributed, the error is reduced with sampling. When you are measuring the average temperature for an entire year, you have a LOT of measurements. The argument is that because the temperature is changing, the Central Limit Theorem does not apply, since you’re measuring something different each time. That argument is simply not correct.

          A secondary argument is being made that there is a systematic error in the measurements. That’s another topic, and it fails to refute the fact that the random errors go away with sampling. But the real topic here is changes over time. And even with a systematic error, you can measure changes over time with a high degree of accuracy as long as the systematic error remains constant. Without a reason to claim that the systematic error has changed, that’s a reasonable assumption to make. If there is a reason to claim that the systematic error has changed, then that has to be taken into account.
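          A quick sketch of that last claim, with invented numbers rather than any real record: a constant offset added to every observation shifts the level but leaves the fitted trend essentially untouched.

          import numpy as np

          rng = np.random.default_rng(4)
          years = np.arange(50)                    # 50 years of made-up annual values
          true_trend = 0.02                        # deg per year, an assumed value
          bias = 1.5                               # constant systematic error, deg
          noise = rng.normal(scale=0.3, size=years.size)

          observed = true_trend * years + bias + noise
          slope, intercept = np.polyfit(years, observed, 1)   # straight-line fit

          print("true trend:     ", true_trend)
          print("recovered trend:", round(slope, 4))   # close to 0.02; the bias only moves the intercept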

  74. Willis wrote:

    Now … raise your hand if you think that we’ve just accurately measured the length of the credit card to the nearest three thousandths of one millimeter.

    Of course not. And the answer would not be improved if we had a million measurements.

    Yet you provide no proof.

    I here provide proof of existence. It probably sits in your living room stereo if you have anything digital such as a CD player.

    I propose that if the person doing the measurement has Parkinson’s and shakes randomly with an oscillation that exceeds about 1 mm, then with a million measurements the precision would improve according to the formula you describe.

    Now, unless the calibrator of that ruler also shook randomly with an oscillation exceeding about 1 mm, the accuracy would not be very useful given that precision.

    This is precisely how delta-sigma converters work in your stereo system. They utilize noise above the final pass-band (20 Hz to 20 kHz) that works like the hypothetical Parkinson’s ruler user to provide that sigma/sqrt(n) improvement in the pass-band.

    The improvement for a well-engineered system is actually much better than 1/sqrt(n) because the noise is shaped. That’s pretty advanced math, so I’ll let you read up on it if you wish.

    That also brings up the point that the improvement can be worse than 1/sqrt(n) if the noise is correlated, which in nature is quite often the case. I’ll leave that math for another day.

    Nevertheless, your example is not correct unless it is clearly stipulated that there is no noise whose amplitude exceeds that of the one-measurement precision.

    best regards,

    Peter

    • There’s an experiment you can try yourself with just your finger.

      Put your finger on something with a slight amount of texture without moving it. Note to yourself how much texture you feel.

      Now move your finger. Notice how much more texture you feel.

      It requires a motion that exceeds the size of the texture (and the distance between your finger’s nerves), but you can notice the improvement with motion.

      • You keep changing the experiment. Deal with the experiment as it was originally stated.

        Perhaps a different example of what we are talking about will also help. We are going to discuss target shooting. I have a rifle that consistently shoots in a two-inch circle around the point of aim. We decide to shoot 100 times a day for 30 days and, guess what, that gun shoots in a two-inch circle around the point of aim. We then take each target, put a grid on it, and measure the distance from a fixed point to each hole in order to determine data points. You create a spreadsheet of the distances and determine the mean and the uncertainty of the mean. This should be the point of aim, and, guess what, you can calculate that out to 5 digits if you want. We can add more and more shots to get any uncertainty you want.

        Now we go out to shoot on the 31st day and I ask you to put a mark on the target where you think the first shot will go. Will you use your mean that is calculated out to 5 digits or will you use something else?

        The same thing occurs when you start to average temperatures. Do you still think the mean of several days, months, or years gives you any better idea of what the real temperature was? I think you’ll find the errors carry forward, and that is all the accuracy you can get.

        • “Now we go out to shoot on the 31st day and I ask you to put a mark on the target where you think the first shot will go. Will you use your mean that is calculated out to 5 digits or will you use something else? ”

          I forgot to add: “Let’s assume you choose to put a mark elsewhere; what is the probability the bullet will hit your mark? What if you choose the mean; what is the probability the bullet will hit there?”

          Same thing applies to temperatures. You can calculate a mean or average and a given uncertainty. But, what is the probability that mean is the actual value you should be using?

        • Your shooting experiment is not the same as the ruler or global temperature experiments, so “let’s not change the experiment”.

          If you asked me where the mean of the next 100 shots would be on the 31st day, I’d put a mark on the target at the mean for the prior 30 days. However, being an actual shooter, if the temperature or wind has changed (or I don’t have a $2000 scope), I’d not have much confidence in the prior mean.

          That being said, shooting is not the same experiment as the ruler, global temperatures, or sea levels from satellites. In fact, the closest related experiments in the set { ruler, shooting, global temps, sea levels } are global temperatures and sea levels. The ruler and shooting experiments have completely different sources of errors and completely different noise spectra.

      • You keep changing the experiment. Deal with the experiment as it was originally stated.

        but if we change the experimental conditions you get a different answer … ya think?

        The ruler thought experiment is a false analogy to global temperature or sea level measurements because, unlike the climate measurements, the ruler measurement has no noise. I instinctively changed the experiment to make a correct analogy, because I don’t like to see someone I respect using false analogies.

        With the global temperature and sea level measurements there’s high-frequency noise that provides the resolution not otherwise available with low-resolution instruments. Just like a delta-sigma converter gets 24 bits of resolution at 20 kHz on your stereo with one bit of output (but that 1 bit is oscillating in the 10 MHz+ range).

        That noise is in 3 dimensions: the location on Earth and the time. Both are auto-correlated, but that’s not what Willis was attempting to show.

        I do thank you for the link on how to determine N from the Hurst exponent. Good write-up. Adding it to my bookmarks library.

        • You didn’t address my questions. They are pertinent to determining the actual signal in temperature.

          The result is that the measurement error carries through to the end. If the variance you see is less than the measurement error of independent measurements of different things, then you simply cannot quote a figure with higher resolution than the errors in the independent measurements.

          Answer the questions about target shooting and you will see the problem.

    • Peter Sable December 23, 2018 at 10:36 am

      Willis wrote:

      Now … raise your hand if you think that we’ve just accurately measured the length of the credit card to the nearest three thousandths of one millimeter.

      Of course not. And the answer would not be improved if we had a million measurements.

      Yet you provide no proof.

      I here provide proof of existence. It probably sits in your living room stereo if you have anything digital such as a CD player.

      I propose that if the person doing the measurement has Parkinson’s and shakes randomly with an oscillation that exceeds about 1 mm, then with a million measurements the precision would improve according to the formula you describe.

      Nevertheless, your example is not correct unless it is clearly stipulated that there is no noise whose amplitude exceeds that of the one-measurement precision.

      best regards,

      Peter

      Thanks, Peter. I love how guys like you say well, yes, but if we change the experimental conditions you get a different answer … ya think?

      You’re right, Peter. I didn’t expressly rule out that the person doing the measuring has that rare and special kind of illness called “Absolutely Symmetrical Parkinson’s Disease”, the one where she shakes to the right just exactly as often and as much as she shakes to the left … how foolish of me.

      I am as aware as anyone of the Central Limit Theorem and how it can reduce uncertainty. I use it in my analyses all the time. I also know about the issues with autocorrelation, and I independently discovered the method of Koutsoyiannis to deal with it.

      However, I am also aware that in the real world of climate science, just as in the real world of Parkinson’s Disease, errors are rarely symmetrical around the true value.

      w.

  75. I have the same problem with the satellite measurements of atmospheric temperatures at various levels.
    What is the accuracy? +/- 1 deg C. So that is the accuracy of the global temperature at that level, yet it seems to be quoted to 0.01 deg in the averages.

  76. “Regardless of the number of measurements, you can’t squeeze more than one additional decimal out of an average of real-world observations.”

    If the math says that you can, why is it that you can’t? This seems to be the crux of much disagreement.

    The point regarding sea-level rise is that if we can assume that any systematic error is constant, then we can measure changes in sea level over time. For the record, because satellite measurements are diverging from the long-established tide level records, there may be reason to suspect that the systematic error is changing for some unknown reason. But the disagreements here aren’t about sea level measurements; they’re about applied math.

Comments are closed.