Durable Original Measurement Uncertainty

Guest Essay by Kip Hansen

 

Introduction:

Temperature and Water Level (MSL) are two hot topic measurements being widely bandied about and vast sums of money are being invested in research to determine whether, on a global scale, these physical quantities — Global Average Temperature and Global Mean Sea Level — are changing, and if changing, at what magnitude and at what rate. The Global Averages of these ever-changing, continuous variables are being said to be calculated to extremely precise levels — hundredths of a degree for temperature and millimeters for Global Sea Level — and minute changes on those scales are claimed to be significant and important.

In my recent essays on Tide Gauges, the question of the durability of original measurement uncertainty raised its toothy head in the comments section.

Here is the question I will try to resolve in this essay:

If original measurements are made to an accuracy of +/- X (some value in some units), does the uncertainty of the original measurement devolve on any and all averages – to the mean –  of these measurements?

 Does taking more measurements to that same degree of accuracy allow one to create more accurate averages or “means”?

My stated position in the essay read as follows:

If each measurement is only accurate to ± 2 cm,  then the monthly mean cannot be MORE accurate than that — it must carry the same range of error/uncertainty as the original measurements from which it is made.   Averaging does not increase accuracy.

It would be an understatement to say that there was a lot of disagreement from some statisticians and those with classical statistics training.

I will not touch on the subject of precision or the precision of means.  There is a good discussion of the subject on the Wiki page: Accuracy and precision .

The subject of concern here is plain vanilla accuracy:  “accuracy of a measurement is the degree of closeness of measurement of a quantity to that quantity’s true value.” [ True value means the actual real world value, not some cognitive construct of it. ]

 The general statistician’s viewpoint is summarized in this comment:

“The suggestion that the accuracy of the mean sea level at a location is not improved by taking many readings over an extended period is risible, and betrays a fundamental lack of understanding of physical science.”

I will admit that at one time, fresh from university, I agreed with the StatsFolk.  That is, until I asked a famous statistician this question and was promptly and thoroughly drummed into submission with a series of homework assignments designed to prove to myself that the idea is incorrect in many cases.

 First Example:

Let’s start with a simple example about temperatures.   Temperatures, in the USA, are reported and recorded in whole degrees Fahrenheit.  (Don’t ask why we don’t use the scientific standard.  I don’t know).  These whole Fahrenheit degree records are then machine converted into Celsius (centigrade) degrees to one decimal place, such as 15.6 °C.

This means that each and every temperature between, for example, 72.5 and 71.5 °F is recorded as 72 °F.  (In practice, one or the other of the precisely .5 readings is excluded and the other rounded up or down).  Thus an official report for the temperature at the Battery, NY at 12 noon of “72 °F” means, in the real world, that the temperature, by measurement, was found to lie in the range of 71.5 °F and 72.5 °F — in other words, the recorded figure represents a range 1 degree F wide.

In scientific literature, we might see this in the notation:  72 +/- 0.5 °F.  This then is often misunderstood to be some sort of “confidence interval”, “error bar”, or standard deviation.

It is none of those things in this specific example of temperature measurements.  It is simply a form of shorthand for the actual measurement procedure which is to represent each 1 degree range of temperature as a single integer — when the real world meaning is “some temperature in the range of 0.5 degrees above or below the integer reported”.

Any difference of the actual temperature, above or below the reported integer is not an error.   These deviations are not “random errors” and are not “normally distributed”.

Repeating for emphasis:  The integer reported for the temperature at some place/time is shorthand for a degree-wide range of actual temperatures, which though measured to be different, are reported with the same integer.  Visually:

[Figure: Temperature_72_plus, the range of actual temperatures represented by a reported 72 °F]

Even though the practice is to record only whole integer temperatures, in the real world, temperatures do not change in one-degree steps — 72, 73, 74, 72, 71, etc.  Temperature is a continuous variable.  Not  only is temperature a continuous variable, it is a constantly changing variable.  When temperature is measured at 11:00 and at 11:01, one is measuring two different quantities; the measurements are independent of one another.  Further, any and all values in the range shown above are equally likely — Nature does not “prefer” temperatures closer to the whole degree integer value.

[ Note:  In the U.S., whole degree Fahrenheit values are converted to Celsius values rounded to one decimal place –72°F is converted and also recorded as  22.2°C.  Nature does not prefer temperatures closer to tenths of a degree Celsius either. ]
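The conversion mentioned in the note can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the software actually used for the official records; the function name `fahrenheit_to_celsius` is made up for this example.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9

# Lower edge, reported integer, and upper edge of the range recorded as 72 °F.
for temp_f in (71.5, 72, 72.5):
    temp_c = round(fahrenheit_to_celsius(temp_f), 1)
    print(f"{temp_f} °F -> {temp_c} °C")
```

The one-degree Fahrenheit range survives the conversion; it is simply expressed in Celsius units.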

While the current practice is to report an integer to represent the range from integer-plus-half-a-degree to integer-minus-half-a-degree, this practice could have been some other notation just as well.  It might have been to report the integer to represent all temperatures from that integer to the next integer, as in 71 to mean “any temperature from 71 to 72”.  The current system of using the midpoint integer is better because the integer reported is centered in the range it represents; this practice, however, is easily misunderstood when notated 72 +/- 0.5.

Because temperature is a continuous variable, deviations from the whole integer are not even “deviations”: they are just the portion of the temperature measured in degrees Fahrenheit normally represented by the decimal fraction that would follow the whole degree notation, the “.4999” part of 72.4999°F.  These decimal portions are not errors; they are the unreported, unrecorded part of the measurement and, because temperature is a continuous variable, must be considered evenly spread across the entire scale.  In other words, they are not, not, not “normally distributed random errors”.  The only reason they are uncertain is that even when measured, they have not been recorded.

So what happens when we now find the mean of these records, which, remember, are short-hand notations of temperature ranges?

Let’s do a basic, grade-school level experiment to find out…

We will find the mean of a whole three temperatures; we will use these recorded temperatures from my living room:

11:00     71 degrees F
12:00     72 degrees F
13:00     73 degrees F

As discussed above, each of these recorded temperatures really represents any of the infinitely variable intervening temperatures; however, I will make this little boxy chart:

[Figure: GRID1, the three hourly temperatures shown as high, midpoint, and low values]

Here we see each hour’s temperature represented as the highest value in the range, the midpoint value of the range (the reported integer), and as the lowest value of the range.  [ Note: Between each box in a column, we must remember that there are an infinite number of fractional values, we just are not showing them at this time. ]   These are then averaged (the mean calculated) left to right:  the three hours’ highest values give a mean of 72.5, the midpoint values give a mean of 72, and the lowest values give a mean of 71.5.

The resultant mean could be written in this form:  72 +/- 0.5 which would be a short-hand notation representing the range from 71.5 to 72.5.

The accuracy of the mean, represented in notation as +/- 0.5, is identical to the original measurement accuracy — they both represent a range of possible values.
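For readers who would rather check the arithmetic than take the chart on faith, here is a minimal Python sketch of the three left-to-right means. The variable names are my own, and the half-degree offset is the recording range discussed above.

```python
from statistics import mean

recorded = [71, 72, 73]   # the 11:00, 12:00 and 13:00 records, in °F
half_width = 0.5          # each record stands for the integer +/- 0.5 °F

highs = [t + half_width for t in recorded]  # top row of the chart
lows = [t - half_width for t in recorded]   # bottom row of the chart

mean_high = mean(highs)
mean_mid = mean(recorded)
mean_low = mean(lows)
print(mean_high, mean_mid, mean_low)
```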

 Note:  This uncertainty does not stem from the actual instrumental accuracy of the original measurement.  That is a different issue and must be considered additive to the uncertainty discussed here, which arises solely from the fact that measured temperatures are recorded as one-degree ranges with the fractional information discarded and lost forever, leaving us with a lack of knowledge of what the actual measurement itself was.

Of course, the 11:00 actual temperature might have been 71.5, the 12:00 actual temperature 72, and the 13:00 temperature 72.5.  Or it may have been 70.5, 72, 73.5.

Finding the means kitty-corner (diagonally) gives us 72 for each corner to corner, and across the midpoints still gives 72.

Any combination of high, mid-, and low, one from each hour, gives a mean that falls between 72.5 and 71.5 — within the range of uncertainty for the mean.

[Figure: GRID23, further combinations of one value taken from each hourly column]

Even for these simplified grids, there are many possible combinations of one value from each column.  The means of any of these combinations falls between the values of 72.5 and 71.5.
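The combinations in the simplified grid can be enumerated mechanically. The sketch below, with names of my own choosing, takes one value (low, midpoint, or high) from each hourly column, finds the mean of every such combination, and reports the smallest and largest means found.

```python
from itertools import product
from statistics import mean

recorded = [71, 72, 73]       # the three hourly records, in °F
offsets = (-0.5, 0.0, 0.5)    # low, midpoint and high of each one-degree range

# One column of three candidate values per hourly record.
columns = [[t + d for d in offsets] for t in recorded]

# Mean of every combination of one value from each column.
combo_means = [mean(combo) for combo in product(*columns)]

print(len(combo_means), min(combo_means), max(combo_means))
```

Adding more offsets between -0.5 and 0.5 fills in more of the in-between values, but this sketch only covers the three shown in the grid.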

There are literally an infinite number of potential values between 72.5 and 71.5 (someone correct me if I am wrong, infinity is a tricky subject) as temperature is a continuous variable.  All possible values for each hourly temperature are just as likely to occur — thus all possible values, and all possible combinations of one value for each hour, must be considered. Taking any one possible value from each hourly reading column and finding the mean of the three gives the same result — all means have a value between 72.5 and 71.5, which represents a range of the same magnitude as the original measurement’s, a range one degree Fahrenheit wide.

The accuracy of the mean is exactly the same as the accuracy for the original measurement — they are both a 1-degree wide range.     It has not been reduced one bit through the averaging process.  It cannot be.

Note: Those who prefer a more technical treatment of this topic should read Clyde Spencer’s “The Meaning and Utility of Averages as it Applies to Climate” and my series “The Laws of Averages”.

And Tide Gauge Data?

It is clear that the original measurement accuracy’s uncertainty in the  temperature record arises from the procedure of reporting only whole degrees F or degrees C to one decimal place, thus giving us not measurements with a single value, but ranges in their places.

But what about tide gauge data?  Isn’t it a single reported value to millimetric precision, thus different from the above example?

The short answer is NO, but I don’t suppose anyone will let me get away with that.

What are the data collected by Tide Gauges in the United States (and similarly in most other developed nations)?

[Figure: sensor_specs_water_level, NOAA water level sensor specifications]

The Estimated Accuracy is shown as +/- 0.02 m (2 cm) for individual measurements and claimed to be +/- 0.005 m (5 mm) for monthly means. When we look at a data record for the Battery, NY tide gauge we see something like this:

Date      Time  Water Level (m)  Sigma
9/8/2017  0:00  4.639            0.092
9/8/2017  0:06  4.744            0.085
9/8/2017  0:12  4.833            0.082
9/8/2017  0:18  4.905            0.082
9/8/2017  0:24  4.977            0.18
9/8/2017  0:30  5.039            0.121

Notice that, as the spec sheet says, we have a record every six minutes (1/10th hr), water level is reported in meters to the millimeter level (4.639 m) and the “sigma” is given.  The six-minute figure is calculated as follows:

“181 one-second water level samples centered on each tenth of an hour are averaged, a three standard deviation outlier rejection test applied, the mean and standard deviation are recalculated and reported along with the number of outliers. (3 minute water level average)”
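As I read that description, the procedure can be sketched roughly as follows. This is my own illustration, not NOAA's code: the function name `six_minute_value` is hypothetical, I have assumed the sample standard deviation and a single rejection pass, and the readings are made-up numbers with one spike added.

```python
from statistics import mean, stdev

def six_minute_value(samples):
    """Average the samples, drop 3-sigma outliers, then recompute.

    Returns (mean, standard deviation, number of outliers rejected).
    Assumes a single rejection pass and the sample standard deviation.
    """
    first_mean = mean(samples)
    first_sigma = stdev(samples)
    kept = [s for s in samples if abs(s - first_mean) <= 3 * first_sigma]
    n_outliers = len(samples) - len(kept)
    return mean(kept), stdev(kept), n_outliers

# 181 made-up one-second readings in meters: 180 ordinary values plus one spike.
samples = [4.600 + 0.001 * (i % 11) for i in range(180)] + [9.999]

level, sigma, n_outliers = six_minute_value(samples)
print(round(level, 3), round(sigma, 3), n_outliers)
```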

Just to be sure we would understand this procedure, I emailed CO-OPS support [ @ co-ops.userservices@noaa.gov ]:

To clarify what they mean by accuracy, I asked:

When we say spec’d to the accuracy of +/- 2 cm we specifically mean that each measurement is believed to match the actual instantaneous water level outside the stilling well to be within that +/- 2 cm range.

 And received the answer:

That is correct, the accuracy of each 6-minute data value is +/- 0.02m (2cm) of the water level value at that time. 

 [ Note:  In a separate email, it was clarified that “Sigma is the standard deviation, essential the statistical variance, between these (181 1-second) samples.” ]

The question and answer verify that both the individual 1-second measurements and the 6-minute data value represent a range of water level 4 cm wide, 2 cm plus or minus of the value recorded.

This seemingly vague accuracy — each measurement actually a range 4 cm or 1 ½ inches wide — is the result of the mechanical procedure of the measurement apparatus, despite its resolution of 1 millimeter.  How so?

[Figure: tide_gauge_detail, NOAA illustration of the acoustic water level tide gauge at the Battery, NY]

NOAA’s illustration of the modern Acoustic water level tide gauge at the Battery, NY shows why this is so.  The blow-up circle to the top-left shows clearly what happens at the one second interval of measurement:  The instantaneous water level inside the stilling well is different than the instantaneous water level outside the stilling well.

This one-second reading, which is stored in the “primary data collection platform” and later used as part of the 181 readings averaged to get the 6-minute recorded value, will be different from the actual water level outside the stilling well, as illustrated.  Sometimes it will be lower than the actual water level, sometimes it will be higher.  The apparatus as a whole is designed to limit this difference, in most cases, at the one second time scale, to a range of 2 cm above or below the level inside the stilling well  — although some readings will be far outside this range, and will be discarded as “outliers” (the rule is to discard all 3-sigma outliers — of the set of 181 readings — from the set before calculating the mean which is reported as the six-minute record).

We cannot regard each individual measurement as measuring the water level outside the stilling well — they measure the water level inside the stilling well. These inside-the-well measurements are both very accurate and precise — to 1 millimeter. However, each 1-second record is a mechanical approximation of the water level outside the well — the actual water level of the harbor, which is a constantly changing continuous variable  — specified to the accuracy range of +/- 2 centimeters. The recorded measurements represent ranges of values.  These measurements do not have “errors” (random or otherwise) when they are different than the actual harbor water level.  The water level in the harbor or river or bay itself was never actually measured.

The data recorded as “water level” is a derived value – it is not a direct measurement at all.  The tide gauge, as a measurement instrument, has been designed so that it will report measurements inside the well that will be reliably within 2 cm, plus or minus, of the actual instantaneous water level outside the well – which is the thing we wish to measure.  After taking 181 measurements inside the well and throwing out any data that seems too far off, the remainder of the 181 are averaged and reported as the six-minute recorded value, with the correct accuracy notation of +/- 2 cm, the same accuracy notation as for the individual 1-second measurements.

The recorded value denotes a value range – which must always be properly noted with each value — in the case of water levels from NOAA tide gauges, +/- 2 cm.

NOAA quite correctly makes no claim that the six-minute records, which are the means of 181 1-second records, have any greater accuracy than the original individual measurements.

Why then do they make a claim that monthly means are then accurate to +/- 0.005 meters (5 mm)?    In those calculations, the original measurement accuracy is simply ignored altogether, and only the reported/recorded six-minute mean values are considered (confirmed by the author) — the same error that is made as with almost all other large data set calculations, applying the inapplicable Law of Large Numbers.

Accuracy, however, as demonstrated here, is determined by the accuracy of the original measurements when measuring a non-static, ever-changing,   continuously variable quantity and which is then recorded as a range of possible values — the range of accuracy specified for the measurement system —  and cannot be improved when (or by) calculating means.

Take Home Messages:

  1. When numerical values are ranges, rather than true discrete values, the width of the range of the original value (measurement in our cases) determines the width of the range of any subsequent mean or average of these numerical values.
  2. Temperatures from ASOS stations, however, are recorded and reported as ranges 1°F wide (0.56°C), and such temperatures are correctly recorded as “Integer +/- 0.5°F”. The means of these recorded temperatures cannot be more accurate than the original measurements: because the original measurement records themselves are ranges, the means must be denoted with the same +/- 0.5°F.
  3. The same is true of Tide Gauge data as currently collected and recorded. The primary record of 6-minute values, though recorded to millimetric precision, consists of ranges with an original accuracy of +/- 2 centimeters.  This is the result of the measurement instrument design and specification, which is that of a sort-of mechanical averaging system.  The means of tide gauge recorded values cannot be made more accurate than +/- 2 cm, which is far more accurate than needed for measuring tides and determining safe water levels for ships and boats.
  4. When original measurements are ranges, their means are also ranges of the same magnitude. This fact must not be ignored or discounted; doing so creates a false sense of the accuracy of our numerical knowledge.  Often the mathematical precision of a calculated mean overshadows its real world, far fuzzier accuracy, leading to incorrect significance being given to changes of very small magnitude in those over-confident means.

# # # # #

Author’s Comment Policy:

Thanks for reading — I know that this will be a difficult concept for some.   For those, I advise working through the example themselves.  Use as many measurements as you have patience for. Work out all the possible means of all the possible values of the measurements, within the ranges of those original measurements, then report the range of the means found.

I’d be glad to answer your questions on the subject, as long as they are civil and constructive.

# # # # #

 


514 thoughts on “Durable Original Measurement Uncertainty”

    • these physical quantities — Global Average Temperature and Global Mean Sea Level

      The first place to start is to point out that Global Average Temperature is NOT a “physical quantity”. You can not take the average of temperature, especially across vastly different media like land sea and ice. It’s scientific bullshit.

      Are land + sea temperature averages meaningful?
      https://judithcurry.com/2016/02/10/are-land-sea-temperature-averages-meaningful/

      Before you start arguing about uncertainty ( which is a very good argument to get into ) you need to make sure you are measuring something that is physically meaningful.

      • Greg, if you don’t think there is a physical “global temperature” what is your opinion of the global average of temperature anomalies? Ditto for sea surface levels.

      • This whole subject of uncertainty and measurement error is very complex outside a carefully constructed lab experiment. It is certainly key to the whole climate discussion and is something that Judith Curry has been pointing out for at least a decade now.

        However, this simplistic article by Kip does not really advance the discussion and sadly is unlikely to get advanced very much in an anarchic chain of blog posts.

        Kip clearly does not have the expertise to present a thorough discussion. It would be good if someone like his stats expert could have written it. This definitely does need a thorough treatment and the currently claimed uncertainties are farcical, I will second him on that point.

      • Greg. You won’t get any argument from me that “Global Average Temperature” isn’t a poor metric. It’s very sensitive to the constantly changing distribution of warm water in the Pacific Ocean basin. Why would anyone not working on ENSO want a temperature metric that behaves like that? But it really is a physical quantity — if an inappropriate one for the purposes it’s being used for. Don’t you think it was almost certainly lower at the height of the last glaciation, or higher during the Cretaceous?

      • “if you don’t think there is a physical “global temperature”” – It’s not an opinion. It stems from the definition of temperature. They do indeed extend the notion of temperature in some very special cases for systems out of thermodynamic equilibrium, but typically it’s for dynamical equilibrium and they do lead to nonsense when taken out of context (such as absolute negative temperature). But for systems that are not even in dynamical equilibrium, such as Earth, it’s pure nonsense to average an intensive value that can be defined only locally, due to quasi-equilibrium. It’s not only pure nonsense, but it’s very provable that if you still insist on using such nonsense, you’ll get the wrong physical results out of calculation, even for extremely simple systems.

      • Don , maybe you should read the link in my first comment. There is a whole article explaining why global mean temperature is not physically meaningful.

      • Greg ==> I don’t disagree about global means, but one has to call them something. They certainly are a hot topic of conversation and research, even if they don’t really exist.

      • Dr. Curry’s points are well taken; many people do not understand the differences between energy and temperature. I also point out that “average daily temperature,” which has been interpreted as the average of the daily maximum and minimum, is also misunderstood. We are now able to take temperature at the interval of our choice and come up with a weighted average. The average computed from just one daily maximum and one daily minimum assumes the temperatures spend equal amounts of time clustered around the average. This is clearly not the case. So when comparing historical temperatures to newer values, it is important to realize the differences.

      • just to be clear oeman50, that was my article that Judith Curry published on her site. Note the credit just below the title. ;)

      • The main problem with averaging anything globally is that no living thing on Earth actually experiences the global average. Additionally, the average temperature tells us nothing about the daily range of temperatures. If I experience a day which is 60 degrees in the morning, and 100 degrees in the afternoon, is it not hotter than a day which starts out at 75 and reaches a high of 95? Yet once averaged, the 95 degree day is reported as 5 degrees hotter than the 100 degree day. Of course it gets more complex, but it would be like calculating a globally averaged per capita crime rate. You could do it, but it would be a useless number because the only thing that is important is the crime rate where you are or plan to be. Same with temperature. If we experience a decade where the global average temperature goes up a small amount, was it higher daytime highs that caused it? Was it higher daytime lows that caused it? Was the range the same, but the heat lingered on a little longer after sunset? You can’t tell what is happening unless you look at local specifics, hour by hour. It would be like trying to tell me what song I’m thinking of if I just told you what the average musical note was. Meaning is in the details.
        In the same vein, I’ve always wondered why we track the CO2 content of the atmosphere without tracking all of the other greenhouse gases as closely. If CO2 concentration goes up, do we know for a fact that that increases the total amount of greenhouse gases? Could another gas like water vapor decrease at times to balance out or even diminish the total?
        It just seems to me that we are standing so far back trying to get the “big picture” that we are missing the details that would have told us the picture was a forgery.
        I’m no scientist, so blast me if I’m wrong, but the logic of it all seems to be lost.

      • Which is why only satellite, radiosonde and atmospheric reanalysis information [I hesitate to use “data.”] are appropriate for use in determining any averages, trends, etc.

        In a few [number of?] years ARGO may be useful. Early ARGO information shows no worrisome patterns.

      • @ Greg “This whole subject of uncertainty and measurement error is very complex”

        Yes it is: “In 1977, recognizing the lack of international consensus on the expression of uncertainty in measurement, the world’s highest authority in metrology, the Comité International des Poids et Mesures (CIPM), requested the Bureau International des Poids et Mesures (BIPM) to address the problem in conjunction with the national standards laboratories and to make a recommendation.”

        It took 18 years before the first version of a standard that deals with these issues in a successful way, was finally published. That standard is called: ´Guide to the expression of uncertainty in measurement´. There now exists only this one international standard for expression of uncertainty in measurement.

        “The following seven organizations supported the development of the Guide to expression of uncertainty, which is published in their name:
        BIPM: Bureau International des Poids et Mesures
        IEC: International Electrotechnical Commission
        IFCC: International Federation of Clinical Chemistry
        ISO: International Organization for Standardization
        IUPAC: International Union of Pure and Applied Chemistry
        IUPAP: International Union of Pure and Applied Physics
        OIML: International Organization of Legal Metrology ..”

        The standard is freely available. I think of it as a really good idea to use that standard for what should be obvious reasons. Even some climate scientists are now starting to realize that international standards should be used. See:
        Uncertainty information in climate data records from Earth observation:
        “The terms “error” and “uncertainty” are often unhelpfully conflated. Usage should follow international standards from metrology (the science of measurement), which bring clarity to thinking about and communicating uncertainty information.”

      • “Before you start arguing about uncertainty ( which is a very good argument to get into ) you need to make sure you are measuring something that is physically meaningful.”
        They are connected. The mean of an infinite number of measurements should give you the true value if individual measurements were only off due to random error. You need precise measurements to be sure that the distribution is perfect if you want others to believe that 10 000 measurements have reduced the error by a factor of 100 (√10 000). Even the act of rounding up or down means that you shouldn’t pretend that the errors were close to a symmetrical distribution and definitely not close enough to attribute meaning to a difference of 1/100th of the resolution. How anyone could argue against it is beyond me.
        To then do it for something that is not an intrinsic property is getting silly. I know what people are thinking but the air around a station in the morning is not the same as that around it when the max is read.

  1. I worked with IMD in Pune/India [prepared formats to transfer data on to punched cards as there was no computer to transfer the data directly]. There are two factors that affect the accuracy of data, namely:

    Prior to 1957 the unit of measurement was rainfall in inches and temperature in °F and from 1957 they are in mm and °C. Now, all these were converted in to mm and °C for global comparison.

    The second is correcting to first place of decimal while averaging: 34.15 is 34.1; 34.16 is 34.2; 34.14 is 34.1 and 34.25 is 34.3; 34.26 is 34.3; 34.24 is 34.2

    Observational error: Error in inches is higher than mm and Error in °C is higher than °F

    These are common to all nations defined by WMO

    Dr. S. Jeevananda Reddy

    • Dr. Reddy, Very interesting. By the way, you can use alt-248 to do the degree symbol, °.

      Take care,

  2. Thank you for this information. I have always suspected the reported accuracy of many averaged numbers were simply impossible. This helps to clarify my suspicions. I also do not understand how using 100 year old measurements mixed with modern ones can result in the high accuracy stated in many posts. They seem to just assume that a lot of values increases the final accuracy regardless of the origin and magnitude of the underlying uncertainties.

    • Only bullshit results. Even for modern measurements, it’s the hasty generalization fallacy to claim that it applies to the whole Earth. Statisticians call it a convenience sampling. And that is only for the pseudo-measurement that does not evolve radically over time. Combining all together is like comparing apples with pears to infer things about a coniferous forest.

      • Standard calculations in Chemistry carefully watch the significant digits. 5 grams per 7 milliliters is reported as 0.7 g/mL. Measuring several times with such low precision results in an answer with equally low precision. The extra digits spit out by calculators are fanciful in the real world.

    • People assume that modern digital instruments are inherently more accurate than old-style types. In the case of temperature at least this is not necessarily so. When temperature readings are collated and processed by software yet another confounding factor is introduced.
      With no recognition of humidity, differing and changing elevation, partial sampling and other data quality issues, the idea that we could be contemplating turning the world’s function inside out over a possible few hundredths of a degree in 60 years of the assumed process is plainly idiotic.
      AGW is an eco Socialist ghost story designed to destroy Capitalism and give power to those who can’t count and don’t want to work. I’m hardly a big fan of Capitalism myself but I don’t see anything better around. Socialism has failed everywhere it’s been tried.

  3. Kip says: “If each measurement is only accurate to ± 2 cm, then the monthly mean cannot be MORE accurate than that — it must carry the same range of error/uncertainty as the original measurements from which it is made. Averaging does not increase accuracy.”

    WRONG!

    the +/- 2cm is the standard deviation of the measurement. This value is “sigma of x ” in the equation for the standard error of the estimator of the mean:

    https://www.bing.com/images/search?view=detailV2&ccid=CYUOXtuv&id=B531D5E2BA00E15F611F3DAEC1B85110014F74C6&thid=OIP.CYUOXtuvcFogpL3jEnQw_gEsBg&q=standard+error&simid=608028072239301597&selectedIndex=1

    The error bars for the mean estimator depends on the sqrt of “N”
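The linked image is not reproduced here. The equation the commenter appears to mean is presumably the usual textbook form of the standard error of the mean, which applies to N independent measurements subject to random error with standard deviation sigma of x:

```latex
\sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{N}}
```

Whether that condition holds for measurements recorded as ranges is exactly what the essay and the replies below dispute.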

    • roflmao..

      You haven’t understood a single bit of what was presented, have you johnson

      You have ZERO comprehension when that rule can and can’t be used, do you. !!

      (Andy, you need to do better than this when you think Johnson or anyone else is wrong. Everyone here is expected to moderate themselves according to the BOARD rules of conduct. No matter if Johnson is right or wrong, being rude and confrontational without a counterargument is not going to help you) MOD

      • Andy, how about you drop the aggressive, insulting habit of addressing all your replies to “johnson”. If you don’t agree with him, make your point. Being disrespectful does not give more weight to your point of view.

        Also getting stroppy from the safety of your keyboard is a bit pathetic.

      • ROFL^2

        You are a bit rude, Andy, but you are right.

        Can we all TRY to be both polite and scientifically /mathematically correct please. It makes for a better blog all round.

      • “Greg October 15, 2017 at 12:32 am
        Andy, how about you drop the aggressive, insulting habit of addressing all your replies to “johnson”. If you don’t agree with him, make your point. Being disrespectful does not give more weight to your point of view.

        Also getting stroppy from the safety of your keyboard is a bit pathetic.”

        “MarkW October 15, 2017 at 7:25 am
        lighten up greg”

        “The Reverend Badger October 15, 2017 at 9:08 am
        ROFL^2

        You are a bit rude, Andy, but you are right.

        Can we all TRY to be both polite and scientifically /mathematically correct please. It makes for a better blog all round.”

        Is Andy any ruder than Johnson was?

        Especially when Johnson ignores facts, documentation and evidence presented in order to proclaim his personal bad statistics superior.
        Nor should one overlook Johnson’s thread bombings in other comment threads.

      • Sorry, but it is very obvious that mark DID NOT understand the original post.

        When their baseless religion relies totally on a shoddy understanding of mathematical principles, is it any wonder the AGW apostles will continue to dig deeper?

        “I know perfectly well when to use standard error for the estimator of the mean.”

        Again, it is obvious that you don’t !!

      • For those who are actually able to comprehend.

        Set up a spreadsheet and make a column as long as you like of uniformly distributed numbers between 0 and 1, use =rand(1)

        Now calculate the mean and standard deviation.

        The mean should obviously get close to 0.5..

        but watch what happens to the deviation as you make “n” larger.

        For uniformly distributed numbers, the standard deviation is actually INDEPENDENT of “n”

      • darn typo..

        formula is ” =rand()” without the 1, getting my computer languages mixed up again. !!

      • Furthermore, since ALL temperature measurements are uniformly distributed within the individual range used for each measurement, they can all be converted to a uniform distribution between 0 and 1 and the standard deviation remains INDEPENDENT OF “n”

      • Sorry you are having problems understanding, Mark.. Your problem, not mine.

        Another simple explanation for those with stuck and confused minds.

        Suppose you had a 1m diameter target, and, ignoring missed shots, the hits were randomly and uniformly distributed on the target.

        Now, the more shots you have, the closer the mean will be to bulls eye..

        But the error from that mean will ALWAYS be approximately +/- 0.5m uniformly distributed.

      • “The mean should obviously get close to 0.5.”
        “Obviously, that means that the standard error is also INDEPENDENT of n”
        Those statements are contradictory. Standard error is the error of the mean (which is what we are talking about). If it’s getting closer to 0.5 (true) then the error isn’t independent of n. In fact it is about sqrt(1/12/n).

        I did that test with R: g=numeric(10); for(i in 1:10)g[i]=mean(runif(1000))
        The numbers g were
        0.5002 0.5028 0.4956 0.4975 0.4824 0.5000 0.4865 0.5103 0.5106 0.5063
        Standard dev of those means is 0.00930. Theoretical is sqrt(1/12000)=0.00913
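
The two claims in this exchange can be checked side by side. The following is a minimal Python sketch, not code from either commenter: the function name `uniform_experiment`, the trial count and the seed are illustrative assumptions. It reports the spread of the individual readings (which stays near sqrt(1/12), about 0.289, whatever n is) next to the spread of the sample means (which shrinks roughly as sqrt(1/12/n)).

```python
import random
import statistics

def uniform_experiment(n, trials=200, seed=0):
    """Return (average SD of the readings, SD of the sample means) for U(0,1) draws."""
    rng = random.Random(seed)
    sample_sds = []
    sample_means = []
    for _ in range(trials):
        sample = [rng.random() for _ in range(n)]
        sample_sds.append(statistics.pstdev(sample))   # spread of individual readings
        sample_means.append(statistics.fmean(sample))  # one mean per trial
    return statistics.fmean(sample_sds), statistics.stdev(sample_means)

for n in (10, 100, 1000):
    spread, sd_of_means = uniform_experiment(n)
    theory = (1 / 12 / n) ** 0.5
    print(f"n={n:5d}  SD of readings={spread:.4f}  "
          f"SD of means={sd_of_means:.4f}  sqrt(1/12/n)={theory:.4f}")
```

Both numbers are standard deviations, but of different things. The first describes how far a single reading sits from the mean; the second describes how far the computed mean sits from 0.5. The sketch assumes independent draws with no systematic offset.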

    • Seems to me that no matter how data is treated or manipulated there is nothing that can be done to it which will remove the underlying inaccuracies of the original measurements.

      If the original measurements are +/- 2cm then anything resulting from averaging or mean is still bound by that +/- 2cm.

      Mark, could you explain why you believe that averaging or the mean is able to remove the original uncertainty? Because I can’t see how it can.

      • Btw I can see how a trend might be developed from data with a long enough time series – But until the Trend is greater than the uncertainty it cannot constitute a valid trend.

        e.g. In temperature a trend showing an increase of 1 deg C from measurements with a +/- 0.5 deg C (i.e. 1 deg C spread) cannot be treated as a valid trend until it is well beyond the 1 deg C, and even then it remains questionable.

        I’m no mathematician or statistician but to me that is plain commonsense despite the hard-wired predilection for humans to see trends in everything ………

      • Maybe someone here has experience with information theory. I did some work with this years ago in relation to colour TV transmissions and it is highly relevant to digital TV. All about resolution and what you need to start with to get a final result. I am quite rusty on it now but think it is very relevant here: the inability to get out more than you start with.

      • Old England:

        Consider this; you take your temperature several times a day for a period of time.
        Emulating NOAA, use a variety of devices from mercury thermometers, alcohol thermometers, cheap digital thermistors and infra red readers.

        Sum various averages from your collection of temperatures. e.g.;
        Morning temperature,
        Noon temperature,
        Evening temperature,
        Weekly temperature,
        Monthly temperature,
        Lunar cycle temperatures, etc.

        Don’t forget to calculate anomalies from each average set. With such a large set of temperatures you’ll be able to achieve several decimal places of precision, though of very dubious accuracy.

        Now when your temperature anomaly declines are you suffering hypothermia?
        When your temperature anomaly is stable are you healthy?
        When your temperature anomaly increases, are you running a fever or developing hyperthermia?

        Then after all that work, does calculating daily temperatures and anomalies to several decimal places really convey more information than your original measurement’s level of precision?

        Then consider; what levels of precision one pretends are possible within a defined database are unlikely to be repeatable for future collections of data.
        i.e. a brief window of data in a cycle is unlikely to convey the possibilities over the entire cycle.

        Nor do the alleged multiple decimals of precision ever truly improve the accuracy of the original half/whole degree temperature reading.

        Then, consider the accuracy of the various devices used; NOAA ignores error rates inherent from equipment, readings, handlings, adjustments and calculations.

    • “The error bars for the mean estimator depends on the sqrt of “N””

      Only true if the measured quantity consists of independent and identically distributed random variables. Amazing how few people seem to be aware of this.

      Good luck in proving that there is no autocorrelation between sea-level measurements Mark!
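
The independence condition can be illustrated with a simulated autocorrelated series. This is a sketch only: it uses a first-order autoregressive (AR(1)) process as a stand-in for correlated readings, and the function names, `phi = 0.9`, series length and seed are all assumptions chosen for illustration, not properties of any tide gauge record.

```python
import random
import statistics

def ar1_series(n, phi, rng):
    """Generate n values of a zero-mean AR(1) process x[t] = phi*x[t-1] + noise.

    The series starts from zero rather than from a stationary draw.
    """
    x = 0.0
    out = []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out

def compare_sem(n=500, phi=0.9, trials=300, seed=1):
    """Return (average naive s/sqrt(n), observed SD of the sample means)."""
    rng = random.Random(seed)
    naive, means = [], []
    for _ in range(trials):
        series = ar1_series(n, phi, rng)
        naive.append(statistics.stdev(series) / n ** 0.5)
        means.append(statistics.fmean(series))
    return statistics.fmean(naive), statistics.stdev(means)

naive_sem, actual_sd = compare_sem()
print(f"naive s/sqrt(n): {naive_sem:.3f}   observed SD of means: {actual_sd:.3f}")
```

With strongly correlated values the observed scatter of the means comes out several times larger than the naive s/sqrt(n) figure; setting `phi=0.0` brings the two back into rough agreement.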

    • Mark ==> You present exactly what I point out is the misunderstanding when a tide gauge measurement is presented as an integer — notated as 100 +/- 2cm. The +/-2cm is NOT a standard deviation, not an error bar, not a confidence interval — but it sure looks like one as they are all written in the same way. In actual fact, it is the uncertainty of the measurement, brought about by the physical design of the measurement instrument.

      • Kip/Nick: Actually a stated instrument MU is a confidence interval. It is defined in the ISO Guides and elsewhere (including NIST) as:

        Uncertainty (of measurement): parameter, associated with the result of a measurement, that characterizes the dispersion of the values that could reasonably be attributed to the measurand

        The default is a 95% confidence interval. Thus a measured value of 100 cm can be said to have a true value of between 98 and 102 cm with a 95% confidence if the instrument MU is +/- 2 cm. While it is indeed derived from the standard deviations of various factors that affect the measurement, it is actually a multiple of the combined SDs. Two times the SD for a 95% MU confidence. However, it is not related to the SD of multiple measurements of the measurand. This is a measure of the variability of the thing being measured and such variability is only partly the result of instrument MU. Proper choice of instruments should make instrument MU a negligible issue. Problems arise when the measurement precision required to make a valid determination is not possible with the equipment available. In short, if you want to measure sea level to +/- 1 mm you need a measuring device with an MU of less than 1 mm.

        Put another way, you can’t determine the weight of a truck to the nearest pound by weighing it on a scale with a 10 pound resolution no matter how many times you weigh it.
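
The truck example can be sketched in a few lines of Python. Everything here is hypothetical: the truck weight, the 10 pound resolution taken from the comment, and in particular the assumption that the scale's random noise (`noise_sd`) is much smaller than its resolution.

```python
import random
import statistics

def weigh(true_weight, resolution, noise_sd, rng):
    """One reading: true weight plus Gaussian noise, rounded to the scale resolution."""
    noisy = true_weight + rng.gauss(0.0, noise_sd)
    return round(noisy / resolution) * resolution

def average_reading(true_weight, resolution, noise_sd, n, seed=0):
    """Average of n repeated weighings of the same truck."""
    rng = random.Random(seed)
    return statistics.fmean(
        weigh(true_weight, resolution, noise_sd, rng) for _ in range(n)
    )

TRUE_WEIGHT = 12343.0  # hypothetical truck weight in pounds
for n in (1, 100, 10000):
    avg = average_reading(TRUE_WEIGHT, resolution=10.0, noise_sd=0.1, n=n)
    print(f"n={n:6d}  average reading = {avg:.1f} lb")
```

Under the low-noise assumption every reading lands on the same 10 pound step, so the average never moves toward the true weight. If `noise_sd` is raised until it is comparable with the resolution, the average does begin to track the true weight; the sketch illustrates only the case the comment describes.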

        Above I referred to multiple sources of MU that need to be combined. This is known as an uncertainty budget. As an example a simple screw thread micrometer includes the following items: repeatability, scale error, zero point error, parallelism of anvils, temperature of micrometer, temperature difference between micrometer and measured item. However, the vast majority of instrument calibrations are done by simple multiple comparisons of measured values of certified reference standards. In these calibrations there are always at least three sources of MU. The uncertainty of the reference standard, one half the instrument resolution and the standard deviation of the repeated comparison deviation from the reference value. In addition, to be considered adequate the Test Uncertainty Ratio (MU of device being calibrated divided by MU of reference) must be at least 4:1.
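
As a hedged illustration of how such a budget is usually combined, the three calibration sources named above can be written in the root-sum-of-squares form used by the ISO Guide. The symbol names are assumptions introduced here, and each term is taken to be already expressed as a standard uncertainty (one standard deviation).

```latex
u_c = \sqrt{u_{\mathrm{ref}}^{2} + u_{\mathrm{res}}^{2} + u_{\mathrm{rep}}^{2}},
\qquad
U = k\,u_c, \quad k = 2 \ \text{for roughly 95\% coverage}
```

Here u_ref is the contribution of the reference standard, u_res the contribution from instrument resolution, and u_rep the standard deviation of the repeated comparisons. How the half-resolution figure is converted to a standard uncertainty depends on the distribution assumed; a rectangular assumption, for example, would divide the half-width by \sqrt{3}.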

        This is all basic metrology that should be well understood by any scientist or engineer. But I know from experience that it is not as is clearly evident in these discussions.

      • Thanks again for your clear and well informed opinion on these matters.

        The problem with using S.D. as the basis for establishing “confidence intervals” is that it is based solely on statistics and addresses only the sampling error.

        If global mean SST is given as +/-0.1 deg C, and then a “correction” is made due to a perceived bias of 0.05 deg while the error bars stay the same (because the stats are still the same), then we realise that they are not including all sources of error and the earlier claimed accuracy was not correct.

        The various iterations of hadSST have not changed notably in their claimed confidence levels, yet at one point they introduced a -0.5 deg step change “correction”. This was later backed out and reintroduced as a progressive change, having come up with another logic to do just about the same overall change of 0.5 deg C.

        Variance derived confidence levels do NOT reflect the full range of uncertainty, only one aspect: sampling error.

      • Greg ==> Best stay clear of the Statistics Department at the local Uni….they don’t like that kind of talk. Here either…as you see.

    • Mark S, you missed the whole point of why this isn’t so in the case of temperatures and tide gauges. If you measure the length of a board a dozen times carefully, then you are right. But if the board keeps changing its own length, then multiple measurings are not going to prove more accurate or even representative of anything. I hope this helps.

    • If the measurement is made of the same thing, the different results can be averaged to improve the accuracy.
      Since the temperature measurements are being made at different times, they cannot be used to improve the accuracy.
      That’s basic statistics.

    • Mark S Johnson,
      You are quite wrong. If I handed you an instrument I calibrated to some specific accuracy, say plus or minus one percent of full scale for discussion purposes, you had better not claim any measurement made with it, or any averages of those values, is more accurate than what I specified. In fact, if the measurement involved safety of life, you must return the instrument for a calibration check to verify it is still in spec.

      Where anyone would come up with the idea that an instrument calibration sticker that says something like “+/- 2 cm” indicates a standard deviation, I cannot imagine. In the cal lab, there is no standard deviation scheme for specifying accuracy. When we wrote something like “+/- 2 cm”, we meant that exactly. That was the sum of the specified accuracy of the National Bureau of Standards standard plus the additional error introduced by the transfer reference used to calibrate the calibration instrument plus the additional error introduced by the calibration instrument used on your test instrument.

      Again, that calibration sticker does not say “+/- 2 cm” is some calculated standard deviation from true physical values. It means that at each calibration mark on the scale, the value will be within “+/- 2 cm” of true physical value. That does not, however, specify the Precision of the values you read. That is determined by the way the instrument presents its values. An instrument calibrated to “+/- 2 cm” could actually have markings at 1 cm intervals. In that case, the best that can be claimed for the indication is +/- 0.5 cm. The claimed value would then be +/- 0.5 cm plus the +/- 2 cm calibration accuracy. Claiming an accuracy of better than +/- 2.5 cm would in fact be wrong, and in some industries illegal. (Nuclear industry for example.)

      So drop the claims about standard deviation in instrument errors. It does not even apply to using multiple instruments reading the same process value at the same time. In absolutely no case can instrument reading values be assumed to be randomly scattered around true physical values within specified instrument calibration accuracy. Presenting theories about using multiple instruments from multiple manufacturers, each calibrated with different calibration standards by different technicians or some such similar example is just plain silly when talking about real world instrumentation use. You are jumping into the “How many angels can dance on the head of a pin” kind of argument.

      • Gary, they do not make an instrument that can measure “global temperature.”

        Measuring “global temperature” is a problem in sampling a population for the population mean. Once you understand this, you may be able to grasp the concept of “standard error” which is comprised of the standard deviation of the instrument used for measurement, divided by the sqrt of the number of obs.

        Now when/if they build an instrument that can measure the global temperature with one reading, then your argument might hold water.

      • Mark,

        Where above do I mention “global temperature”? My statements were about the use of instrument readings (or observations to the scientific folks.) I would suggest that however that “global temperature” is derived, it cannot claim an accuracy better than the calibration accuracy of the instrumentation used. Wishful thinking and statistical averaging cannot change that.

        Remember the early example of averages of large numbers was based upon farm folks at an agricultural fair guessing the weight of a bull. The more guesses that were accumulated, the closer the average came to the true weight. Somehow that has justified the use of averaging in many inappropriate situations. Mathematical proofs using random numbers do not justify or indicate the associated algorithms are universally applicable to real world situations.

      • Gary, the estimator of the population mean can be made more accurate with more observations. The standard error is inversely proportional to the sqrt of the number of obs.
        …..
        Here’s an example.
        ….
        Suppose you wanted to measure the average daily high temperature for where you live on Oct 20th. You measure the temp on Oct 20th next Friday.

        Is this measure any good?

        Now, suppose you do the same measurement 10/20/2017, 10/20/2018, 10/20/2019 and 10/20/2020, then take the average of the four readings.
        ..
        Which is more accurate?…..the single lone observation you make on Friday, or the average of the four readings you make over the next four years?
        ….
        If you are interested in the real climatic average for your location on Oct 20th, you really need 30 years of data to be precise.

      • Gary, RE: weight of bull.

        Here you go again with an incorrect analogy. The weight of an individual bull is not a population mean. Don’t confuse the two. The correct “bull” analogy would be to actually measure the weight of 100 bulls, to determine what the average weight of a bull is. The more bulls you measure, the closer you will get to what the “real” average bull weight is.

      • There will be some of us (like Gary and myself) on here who have regularly sent instruments away to be calibrated and had to carefully consider the results, check the certificates etc. We appear to know rather more about this than some contributors today. I find it interesting that a simple experience like this can help a lot in an important discussion.

      • “the estimator of the population mean can be made more accurate with more observations. The standard error is inversely proportional to the sqrt of the number of obs.”

        Two points here: 1. “estimator” means guess. 2. your estimator may be made more precise according to a specified estimation algorithm. That does not relate to its accuracy. Your comment about standard deviation only applies to how you derive your guess.

        “If you are interested in the real climatic average for your location on Oct 20th, you really need 30 years of data to be precise.”

        Good now we are on the same page. You are achieving a desired PRECISION. Accuracy, however remains no better than the original instrumentation accuracy and often worse depending upon how the data is mangled to fit your algorithm. (F to C etc.)

        “Here you go again with an incorrect analogy. The weight of an individual bull is not a population mean. Don’t confuse the two. The correct “bull” analogy would be to actually measure the weight of 100 bulls, to determine what the average weight of a bull is. The more bulls you measure, the closer you will get to what the “real” average bull weight is.”

        Nope, the exercise was to determine the accuracy of guesses about the weight of a single bull tethered to a post at the fair. A prize was awarded to the person who guessed the closest. It was not about guessing the weight of bulls as a population. The observation about that large number of guesses was that the average came closer to the true weight of the bull as the number of guesses increased, one guess per person. It was never claimed that random guesses about random bulls would average to any meaningful or useful number.

      • Guessing the weight of an individual bull is not the same as sampling a population. Hey…..ever hear about destructive testing? It’s what happens when running the test obliterates the item “measured.” For example, how would you insure the quality of 1000 sticks of dynamite? Would you test each one, or would you take a representative random sample and test the smaller number?

      • Mark S Johnson October 15, 2017 at 9:02 am
        “The weight of an individual bull is not a population mean. Don’t confuse the two.”

        He didn’t confuse anything. He said “The more guesses that were accumulated, the closer the average came to the true weight. Somehow that has justified the use of averaging in many inappropriate situations.” But you like to fly off on your own illogical tangent, which just gets in the way of those of us trying to understand the arguments.

    • Then explain how that applies if the measurements are not normally distributed? And if you have no idea if they are normally distributed? Let’s say the sides of the block of metal I have on my desk.

    • Just to clarify Andy’s concerns. Mark Johnson is confusing uncertainty of the estimate with accuracy of the measure; they’re two different things, something Kip attempts to point out in his essay and also something that anyone familiar with measurement theory and statistics would understand from his essay. It’s possible a person without much practical experience in numerical modeling might miss the distinction, but I can assure you it’s there.

      While the “law of large numbers” will reduce the error of estimate as Mark describes, it does nothing to increase accuracy of the measure.

      • Maybe another example is in order?

        If a single measure is accurate +/- 2cm, it has an uncertainty associated with it also, which may perhaps be +/- 5mm. As repeated measures are taken and averaged, the uncertainty (5mm) can be reduced arithmetically as Mark Johnson describes, but the result is a measure accurate +/- 2cm with a lower uncertainty (for example +/- .1 mm).

        I hope that resolves the conflicting views expressed here. I agree there’s no reason for ad hominem by either party. It’s a very confusing subject for most people, even some who’ve been involved with it professionally.

      • Mark S Johnson: The only person on this thread discussing measures of a population mean is you, and it’s almost certain the only training in statistics you’ve ever had involved SPSS.

        Error in a measure is assumed to be normally distributed, not the measure itself. You need to meditate on that. The accuracy of a measure has nothing to do with the uncertainty of the estimate. The “law of large numbers” doesn’t improve accuracy, it improves precision. You’re wrong to argue otherwise.

      • Bartleby,
        That is particularly true if there is a systematic error in the accuracy. If you have a roomful of instruments, all out of calibration because over time they have drifted in the same direction, using them to try to obtain an average will, at best, give you an estimate of what the average error is, but it will not eliminate the error. The only way that you are going to get the true value of the thing you are measuring is to use a high-precision, well-calibrated instrument.
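
The point about shared drift can be sketched numerically. In this Python sketch the true value, the common bias and the noise level are hypothetical numbers chosen for illustration, and the function name is an assumption; the only claim is the qualitative one made above.

```python
import random
import statistics

def mean_of_biased_readings(true_value, bias, noise_sd, n, seed=0):
    """Average n readings that share one systematic bias plus independent random noise."""
    rng = random.Random(seed)
    readings = [true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]
    return statistics.fmean(readings)

TRUE_VALUE = 100.0  # hypothetical true level, cm
BIAS = 1.5          # hypothetical drift shared by every instrument, cm
for n in (10, 1000, 100000):
    avg = mean_of_biased_readings(TRUE_VALUE, BIAS, noise_sd=0.5, n=n)
    print(f"n={n:6d}  mean={avg:.3f}  error vs true={avg - TRUE_VALUE:+.3f}")
```

As n grows the mean settles ever more tightly, but it settles on the true value plus the bias. The random part averages away; the shared offset does not.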

      • Certainly true if there is systemic error, which really means the measure is somehow biased (part of an abnormal distribution); unless the error of estimate is normal, the law of large numbers can’t be used at all. It can never be used to increase accuracy.

        The whole idea of averaging multiple measures of the same thing to improve precision is based on something we call a “normal error distribution” as you point out. We assume the instrument is true within the stated accuracy, but that each individual observation may include some additional error, and that error is normally distributed.

        So, by repeatedly measuring and averaging the result, the error (which is assumed normal) can be arithmetically reduced, increasing the precision of the estimate by a factor defined by the number of measures. This is the “Students T” model.

        But accuracy isn’t increased, only precision. 100 measures using a device accurate +/- 2cm will result in a more precise estimate that’s accurate to +/- 2cm.

        Accuracy and Precision are two very different things.

      • ‘The whole idea of averaging multiple measures of the same thing to improve precision is based on something we call a “normal error distribution”…’

        Normal (or Gaussian) distributions are not required, though a great many measurement error sets do tend to a Normal distribution due to the Central Limit Theorem.

        All that is required is that the error be equally distributed in + and – directions. Averaging them all together then means they will tend to cancel one another out, and the result will, indeed, be more accurate. Accuracy means that the estimate is closer to the truth. Precision means… well, a picture is worth a thousand words. These arrows are precise:

      • Bartleby,
        “100 measures using a device accurate +/- 2cm will result in a more precise estimate that’s accurate to +/- 2cm.

        Accuracy and Precision are two very different things.”
        Yes, if you are talking about a metrology problem, which is the wrong problem here. No-one has ever shown where someone in climate is making 100 measures of the same thing with a device. But there is one big difference between accuracy and precision, which is in the BIPM vocabulary of metrology, much cited here, but apparently not read. It says, Sec 2.13 (their bold):
        “NOTE 1 The concept ‘measurement accuracy’ is not a quantity and is not given a numerical quantity value. “

        Which makes sense. Accuracy is the difference between the measure and the true value. If you knew the true value, you wouldn’t be worrying about measurement accuracy. So that is the difference. If it has numbers, it isn’t accuracy.

      • Nick Stokes (perhaps tongue in cheek) writes: “So that is the difference. If it has numbers, it isn’t accuracy.”

        Nick, if it doesn’t have numbers, it isn’t science. :)

      • Nick, there’s an old, old saying in the sciences that goes like this:

        “If you didn’t measure it, it didn’t happen.”

        I sincerely believe that. So any “discipline” that spurns “numbers” isn’t a science. QED.

      • Bartleby,
        I’m not the local enthusiast for use of metrology (or BIPM) here. I simply point out what they say about the “concept ‘measurement accuracy’”.

      • Nick Stokes writes: “I’m not the local enthusiast for use of metrology (or BIPM) here. I simply point out what they say about the “concept ‘measurement accuracy’”

        OK. I don’t think that changes my assertion, that science is measurement based and so requires the use of numbers.

        I’m not sure if you’re trying to make an argument from authority here? If so, it really doesn’t matter what the “BIPM” defines; accuracy is a numerical concept and it requires use of numbers. There’s no alternative.

        If, in the terms of “metrology”, numbers are not required, then the field is no different from phrenology or astrology, neither of which is a science. Excuse me if you’ve missed that up until now. Numbers are required.

    • Mark S Johnson,

      We have a very different take on what Kip has written. My understanding is that the tide gauges can be read to a precision of 1mm, which implies that there is a precision uncertainty of +/- 0.5mm. HOWEVER, it appears that the builders of the instrumentation and site installation acknowledge that each and all of the sites may have a systematic bias, which they warrant to be no greater than 2 cm in either direction from the true value of the water outside the stilling well. We don’t know whether the inaccuracy is a result of miscalibration, or drift, of the instrument over time. We don’t know if the stilling well introduces a time-delay that is different for different topographic sites or wave conditions, or if the character of the tides has an impact on the nature of the inaccuracy. If barnacles or other organisms take up residence in the inlet to the stilling well, they could affect the operation and change the time delay.

      The Standard Error of the Mean, which you are invoking, requires the errors be random (NOT systematic!). Until such time as you can demonstrate, or at least make a compelling argument, that the sources of error are random, your insistence on using the Standard Error of the Mean is “WRONG!”

      I think that you also have to explain why the claimed accuracy is more than an order of magnitude less than the precision.

      • Clyde, a single well cannot measure global average sea level. It does not sample with respect to the geographic dimension. Again there is confusion here with the precision/accuracy of an individual instrument, and the measurement of an average parameter of a population. Apples and oranges over and over and over.

      • Mark S Johnson,

        I never said that a single well measured the average global sea level, and I specifically referred to the referenced inaccuracy for multiple instruments.

        You did not respond to my challenge to demonstrate that the probable errors are randomly distributed, nor did you explain why there is an order of magnitude difference between the accuracy and precision.

        You seem to be stuck on the idea that the Standard Error of the Mean can always be used, despite many people pointing out that its use has to be reserved for special circumstances. You also haven’t presented any compelling arguments as to why you are correct. Repeating the mantra won’t convince this group when they have good reason to doubt your claim.

      • Clyde the reason it’s called Standard Error of the Mean is because I’m talking about measuring the mean and am not talking about an individual measurement.

        This is not about measuring the same block of metal 1000 times to improve the measurement. It’s about measuring 1000 blocks coming off the assembly line to determine the mean value of the blocks you are making.

      • Mark S Johnson,

        You said, “…I’m talking about measuring the mean.” Do you own a ‘meanometer’? Means of a population are estimated through multiple samples, not measured.

        You also said, “This is not about measuring the same block of metal 1000 times to improve the measurement. It’s about measuring 1000 blocks coming off the assembly line to determine the mean value of the blocks you are making.”

        In the first case, you are primarily concerned about the accuracy and precision of the measuring instrument. Assuming the measuring instrument is accurate, and has a small error of precision, the Standard Error of the Mean can improve the precision. However, no amount of measuring will correct for the inaccuracy, which introduces a systematic bias. Although, if the electronic measuring instrument is wandering, multiple measurements may compensate for that if the deviations are equal or random at each event. But, if you have such an instrument, you’d be advised to replace it rather than try to compensate after the fact.

        In the second case, you have the same problems as case one, but you are also confronted with blocks that are varying in their dimensions. Again, if the measuring instrument is inaccurate, you cannot eliminate a systematic bias. While the blocks are varying, you can come up with a computed mean and standard deviation. However, what good is that? You may have several blocks that are out of tolerance and large-sample measurements won’t tell you that unless the SD gets very large; the mean may move very little if any. What’s worse, if the blocks are varying systematically over time, for example as a result of premature wear in the dies stamping them, neither your mean nor your SD is going to be very informative with respect to your actual rejection rate. They may provide a hint that there is a problem in the production line, but they won’t tell you exactly what the problem is or which items are out of tolerance. In any event, even if you can justify using the Standard Error of the Mean to provide you with a more precise estimate of the mean, what good does it do you in this scenario?
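
The second case can be sketched with made-up numbers. In the following Python sketch the nominal length, process spread, tolerance and the five shifted blocks are all hypothetical, as are the function names; it only illustrates that a tightly estimated mean says little about individual out-of-tolerance items.

```python
import random
import statistics

def make_blocks(n, nominal, process_sd, n_bad, bad_offset, seed=0):
    """Simulate n block lengths; the last n_bad blocks are shifted by bad_offset."""
    rng = random.Random(seed)
    blocks = [nominal + rng.gauss(0.0, process_sd) for _ in range(n)]
    for i in range(n - n_bad, n):
        blocks[i] += bad_offset
    return blocks

def summarize(blocks, nominal, tolerance):
    """Return (mean, SD, standard error of the mean, count of out-of-tolerance blocks)."""
    mean = statistics.fmean(blocks)
    sd = statistics.stdev(blocks)
    sem = sd / len(blocks) ** 0.5
    rejects = sum(1 for b in blocks if abs(b - nominal) > tolerance)
    return mean, sd, sem, rejects

good = make_blocks(1000, nominal=50.0, process_sd=0.05, n_bad=0, bad_offset=0.0)
worn = make_blocks(1000, nominal=50.0, process_sd=0.05, n_bad=5, bad_offset=0.5)
for label, blocks in (("good run", good), ("worn die", worn)):
    mean, sd, sem, rejects = summarize(blocks, nominal=50.0, tolerance=0.25)
    print(f"{label}: mean={mean:.4f} sd={sd:.4f} sem={sem:.4f} rejects={rejects}")
```

In this setup the two runs have means that differ by only a few thousandths, and both standard errors are small, yet one run contains rejects and the other should not. The summary statistics hint at a problem without identifying which items are out of tolerance.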

      • “In the second case, you have the same problems as case one, but you are also confronted with blocks that are varying in their dimensions.” In this case you shouldn’t be worrying about your instrument; your concern is your manufacturing process!

      • Clyde –

        You’re playing into the hands of someone ignorant. It’s a common fault on public boards like this.

        Both of you (by that I mean Johnson too) are freely exchanging the terms “accuracy” and “uncertainty”; they are not the same. Until you both work that out you’re going to argue in circles for the rest of eternity.

      • Nick Stokes ==> Said: October 16, 2017 at 10:11 pm

        But there is one big difference between accuracy and precision, which is in the BIPM vocabulary of metrology, much cited here, but apparently not read. It says, Sec 2.13 (their bold): “NOTE 1 The concept ‘measurement accuracy’ is not a quantity and is not given a numerical quantity value. “

        And the rest of the note? The very next sentence….. is!

        A measurement is said to be more accurate when it offers a smaller measurement error*.

        This is exactly what Kip Hansen has argued all along and exactly what Bartleby just wrote** and yet you have just gone out of your way to cherry pick the quote and completely butcher the context of the very definition you are referring to!

        *And measurement error is defined at 2.16 (3.10) thusly: “measured quantity value minus a reference quantity value”
        **Bartleby wrote: “100 measures using a device accurate +/- 2cm will result in a more precise estimate that’s accurate to +/- 2cm.”

      • SWB,
        “The very next sentence…”
        The section I quoted was complete in itself, and set in bold the relevant fact: “is not given a numerical quantity value”. Nothing that follows changes that very explicit statement. And it’s relevant to what Bartleby wrote: “a more precise estimate that’s accurate to +/- 2cm”. BIPM says that you can’t use a figure for accuracy in that way.

    • Mark ==> The +/- 2 cm is not the standard deviation. It is the original measurement accuracy specification, confirmed by NOAA CO-OPS. The “Sigma” is a different figure, provided by NOAA CO-OPS, as the standard deviation of the 181 1-second records being used to create a six-minute mean. That “sigma” was clarified by NOAA CO-OPS as: “Sigma is the standard deviation, essential[ly] the statistical variance, between these (181 1-second) samples.”

      Please re-read my email exchange with NOAA CO-OPS support:

      To clarify what they mean by accuracy, I asked:

      When we say spec’d to the accuracy of +/- 2 cm we specifically mean that each measurement is believed to match the actual instantaneous water level outside the stilling well to be within that +/- 2 cm range.

      And received the answer:

      That is correct, the accuracy of each 6-minute data value is +/- 0.02m (2cm) of the water level value at that time.

      +/- 2cm is the ACCURACY of the six-minute means, which are the only permanent record made by the Tide Gauge system from measurements.

      • Well said, Kip.

        Mark S: “the +/- 2cm is the standard deviation of the measurement”

        No, it is not the SD. The SD can only be calculated after a set of readings has been made. The 2cm uncertainty is a characteristic of the instrument, determined by some calibration exercise. It is not an ‘error bar’, it is an inherent characteristic of the apparatus. Being inherent, replicating measurements or duplicating the procedure will not reduce the uncertainty of each measurement.

        Were this not so, we would not strive to create better instruments.

        You make an additional error, I am afraid: each measurement stands alone, all of them. They are not repeat measurements of ‘the same thing’, for it is well known in advance that the level will have changed after the passage of a second. The concept you articulate relates to making multiple measurements of the same thing with the same instrument. An example of this is taking the temperature of a pot of water by moving a thermocouple to 100 different positions within the bulk of the water. The uncertainty of the temperature is affected by the uncertainty of each reading, again, inherent to the instrument and the SD of the data. One can get a better picture of the temperature of the water by making additional measurements, but the readings are no more accurate than before, and the average is not more accurate just because the number of readings is increased. Making additional measurements tells us more precisely where the middle of the range is, but does not reduce the range of uncertainty. This example is not analogous to measuring sea level 86,400 times a day as it rises and falls.

        Whatever is done using the 1-second measurements, however processed, the final answer is no more accurate than the accuracy of the apparatus, which is plus minus 20mm.

      • Nick Stokes==> October 18, 2017 at 12:58 am:

        The section I quoted was complete in itself, and set in bold the relevant fact: “is not given a numerical quantity value“. Nothing that follows changes that very explicit statement. And it’s relevant to what Bartleby wrote: “a more precise estimate that’s accurate to +/- 2cm”. BIPM says that you can’t use a figure for accuracy in that way.

        Talk about perversity – I can’t imagine it would be anything else – if you really are being intellectually honest!

        Here is the whole reference (Their bold):

        2.13 (3.5)
        measurement accuracy
        accuracy of measurement
        accuracy

        closeness of agreement between a measured quantity value and a true quantity value of a measurand

        NOTE 1 The concept ‘measurement accuracy’ is not a quantity and is not given a numerical quantity value. A measurement is said to be more accurate when it offers a smaller measurement error.

        NOTE 2 The term “measurement accuracy” should not be used for measurement trueness and the term “measurement precision” should not be used for ‘measurement accuracy’, which, however, is related to both these concepts.

        NOTE 3 ‘Measurement accuracy’ is sometimes understood as closeness of agreement between measured quantity values that are being attributed to the measurand.

        How could you completely miss the definition of Accuracy?

        It is defined as the “closeness of agreement between a measured quantity value and a true quantity value of a measurand.”

        It is very clear that the term is not numeric but ordinal and of course, ordinal quantities have mathematical meaning as you would well know!

      • “It is very clear that the term is not numeric but ordinal and of course, ordinal quantities have mathematical meaning as you would well know!”
        Yes. And what I said, no more or less, is that it doesn’t have a number. And despite all your huffing, that remains exactly true, and is the relevant fact. I didn’t say it was meaningless.

    • When I first considered the “law of large numbers” years ago, I applied an engineer’s mental test for myself. If I have a machine part that needs to be milled to an accuracy of .001 in, and a ruler that I can read to an accuracy of 1/16 in, could I just measure the part with a ruler 1000 times, average the result, and discard my micrometer? I decided that I would not like to fly in an aircraft assembled that way.

    • Mark, I am far from an expert but do remember a little of what I learned in my classes on stochastic processes. If I were able to assume that the distribution from which I was measuring was stationary, or at least wide sense stationary, then the process of multiple measurements as you imply could in fact increase the accuracy. This is actually how some old style analog to digital converters worked: by using a simple comparator and counting the level crossings in time, you can get extra bits of accuracy. This is similar to your assertion here.

      The main flaw here is that you must make the stationarity assumption. Sorry, but temperature measurements and tidal gauge measurements are far from stationary. In fact, the pdf is a continually varying parameter over time, so I have a hard time agreeing with your assertion about the improvement in accuracy.

      • Alan ==> Oh yes, they have the rule but forget the requirements for applying the rule. “Stationary” and “Static” and “Fixed”… those must be a feature of the thing being measured many times.

        The “mean” of an ever-changing, continuous variable, is not “one thing measured many times”

    • This is essentially about significant digits. Not the standard deviation of a sample of sample means. These two things are different. Ok? You cannot manufacture significant digits by taking samples. Period.

  4. It may be worth remembering – no calculated figure is entitled to more significant figures (accuracy) than the data used in the calculation.

    • In fact, the further your calculations get from the original measured number, the greater the uncertainty gets.

      • Three measurements, each with one digit of significance: 0.2, 0.3 and 0.5

        The calculated average is what?

        Is it 0?
        is it .33?
        or is it .33333 ?
        In fact the more digits you add, the closer you come to the real value, namely one third.

      • Mark, what you illustrate in your example is the reduction of uncertainty and convergence on the true value that can be accomplished when averaging multiple observations of the same thing using the same instrument (or instruments calibrated to the same accuracy). It assumes several things, the one thing not mentioned in Kip’s article or your example is that all measures come from a quantity that’s normally distributed. So there are at least three assumptions made when averaging a quantity and using the “law of large numbers” to reduce uncertainty in the measure;

        – That all measures are of the same thing.
        – That all measures have the same accuracy.
        – That the measures are drawn from an underlying normal distribution.

        All three assumptions must be met for the mean to have “meaning” :)

        Briefly, if you average the length of 100 tuna, and the length of 100 whale sharks, you won’t have a meaningful number that represents the average length of a fish. In fact, if you were to plot your 200 observations, you’d likely find two very distinct populations in your data, one for whale sharks and another for tuna. The data don’t come from a normal distribution. In this case, any measure of uncertainty is useless since it depends on the observations coming from a normal distribution. No increase in instrument accuracy can improve precision in this case.

        I’ll get to this again in my comment on Kip’s essay below.

      • Bartleby, I believe this is the crux of the wealth of misunderstanding here: “That all measures are of the same thing.”
        ….
        A population mean is not a “thing” in your analysis of measurement.

        You can’t measure a population mean with a single measure, you need to do random sampling of the population to obtain an estimator of the mean.

        This is not at all like weighing a beaker full of chemicals on a scale.

        You don’t conduct an opinion poll by going to the local bar and questioning a single patron….you need a much larger SAMPLE to get an idea of what the larger population’s opinion is. In the extreme case where N(number of obs) = population size, your measure of the average has zero error.
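The sample-size point can be put in numbers with the textbook standard error of a sample mean, including the finite-population correction. This is a minimal sketch rather than anything taken from the comment: the function name and the poll figures are illustrative assumptions, and it covers sampling error only, so it says nothing about any error in each individual response.

```python
import math

def standard_error_of_mean(sd, n, population_size):
    """Sampling standard error of a mean for n draws without replacement.

    sd is the population standard deviation. The square-root factor is the
    usual finite-population correction, which reaches zero when n equals
    the population size.
    """
    correction = math.sqrt((population_size - n) / (population_size - 1))
    return sd / math.sqrt(n) * correction

# Hypothetical poll: yes/no answers with sd 0.5 in a town of 10,000 people.
for n in (1, 100, 1000, 10_000):
    print(n, round(standard_error_of_mean(0.5, n, 10_000), 4))
```

The last line of output is zero only in the sampling sense: every member has been asked, but each answer is still assumed to be recorded without error.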

      • The “average” temperature is not of any real value, it is the change in temperature, and then, as a change in the equator-polar gradient that seems to matter in climate. Purporting to find changes to the nearest thousandth of a degree with instruments with a granularity of a whole degree appears to be an act of faith by the warmist community. Credo quia absurdum?

      • Mark S; You miss the point. What is the mean of 0.2 +- 0.5, 0.3 +- 0.5, and 0.5 +- 0.5, where the +- is uncertainty? Is it 0.3 +- 0.5? How will even an infinite number of measurements reduce the uncertainty?

        The range is going to be 0.8 to -0.5. You can say the mean is 0.3333, but I can say it is 0.565656 and be just as correct. Basically, just the mean without the uncertainty limits is useless.
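The two readings of "+- 0.5" that this thread keeps colliding over can be computed side by side. The sketch below is illustrative only: the first calculation assumes nothing about the errors and carries the full interval through the mean, while the second assumes (and this is exactly the disputed assumption) that each error is independent and uniformly distributed within its interval.

```python
import math

readings = [0.2, 0.3, 0.5]
half_width = 0.5                     # the stated +- on every reading
n = len(readings)

mean = sum(readings) / n

# View 1: worst case. Every reading could be off by the full half-width in
# the same direction, so the interval passes through the mean unchanged.
worst_low = sum(r - half_width for r in readings) / n
worst_high = sum(r + half_width for r in readings) / n

# View 2: independent uniform errors (an assumption, not a given).
# A uniform error of half-width a has standard deviation a / sqrt(3).
u_single = half_width / math.sqrt(3)
u_mean = u_single / math.sqrt(n)

print(f"mean = {mean:.4f}")
print(f"worst-case interval: [{worst_low:.4f}, {worst_high:.4f}]")
print(f"standard uncertainty: single {u_single:.4f}, mean of {n} {u_mean:.4f}")
```

The worst-case interval never narrows however many readings are added; the standard uncertainty narrows only to the extent that the independence assumption holds.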

      • “Bartleby, I believe this is the crux of the wealth of misunderstanding here: “That all measures are of the same thing.”
        ….
        A population mean is not a “thing” in your analysis of measurement.”

        Mark, you’ve been beaten enough. Go in peace.

    • Peter,

      The actual rule is that no calculated result is entitled to more significant figures than the LEAST precise multiplier in the calculation.

      I suspect that some mathematicians and statisticians unconsciously assume that all the numbers they are working with have the precision of Pi. Indeed, that might be an interesting test. Calculate PI many times using only measurements with one significant figure and see how close the result comes to what is known.

      • Clyde,
        “Calculate PI many times using only measurements with one significant figure”
        Something like this was done, by Buffon, in about 1733. Toss needles on floorboards. How often do they lie across a line. That is equivalent to a coarse measure. And sure enough, you do get an estimate of π.
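For anyone who wants to try the needle experiment numerically, here is a small simulation sketch. The function name, seed, and throw count are arbitrary choices for illustration. Each throw records only a yes/no outcome, and the needle direction is drawn by rejection sampling so that π is never used inside the simulation.

```python
import math
import random

def estimate_pi_buffon(n_throws, seed=0):
    """Estimate pi by dropping needles of length 1 on lines spaced 1 apart."""
    rng = random.Random(seed)
    crossings = 0
    for _ in range(n_throws):
        # Distance from the needle's centre to the nearest line.
        centre = rng.uniform(0.0, 0.5)
        # Random direction: pick a point in the unit disc by rejection
        # sampling, so no value of pi is needed to choose an angle.
        while True:
            u, v = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
            r2 = u * u + v * v
            if 0.0 < r2 <= 1.0:
                break
        sin_theta = abs(v) / math.sqrt(r2)
        # The only thing recorded per throw is whether the needle crosses.
        if centre <= 0.5 * sin_theta:
            crossings += 1
    # P(cross) = 2/pi when needle length equals line spacing.
    return 2.0 * n_throws / crossings

print(estimate_pi_buffon(200_000, seed=1733))
```

Note that the crossing test here is exact. An observer who misjudges crossings would need an additional error model, which this sketch does not include.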

      • Omg. Look, the example with needles just bakes perfect accuracy into the pie. Now let’s try marking needles as over a line or not with effing cataracts or something…good lord. I don’t understand why the idea of “your observations are fundamentally effing limited man!” is so hard to understand here. Nothing to do with minimizing random sampling error.

  5. Kip is correct if the temperature never deviates from 72degF +/- 0.5degF. You will just write down 72 degF and the error will indeed be as he indicates.

    Fortunately the temperature varies far more than that. One day, the temperature high/low is 72/45 from 71.5 true and 45.6 true, the next day it is 73/43 from 72.3 true and 44.8 true, the next day it is 79/48 from 79.4 true and 47.9 true, and so on. The noise that is the difference between the true and recorded measurement has an even distribution as he notes, but can be averaged as long as the underlying signal swings bigger than the resolution of 1degF.

    The Central Limit is a real thing. If you average together a bunch of data with rectangular distribution, you get a normal distribution. Go ahead and look at the distribution of a 6 sided die. With one die it’s rectangular. With two dice it’s a triangle. Add more and more dice and it’s a normal distribution.
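The dice claim can be checked exactly rather than by simulation. The sketch below (function name is illustrative) builds the distribution of the total for any number of fair dice by repeated convolution, using exact fractions, and prints a rough text histogram.

```python
from collections import defaultdict
from fractions import Fraction

def dice_sum_distribution(n_dice, sides=6):
    """Exact probability of each total when n_dice fair dice are summed."""
    dist = {0: Fraction(1)}
    for _ in range(n_dice):
        nxt = defaultdict(Fraction)
        for total, p in dist.items():
            for face in range(1, sides + 1):
                nxt[total + face] += p / sides
        dist = dict(nxt)
    return dist

for n in (1, 2, 4):
    dist = dice_sum_distribution(n)
    peak = max(dist.values())
    print(f"{n} dice")
    for total in sorted(dist):
        bar = "#" * round(30 * dist[total] / peak)
        print(f"  {total:2d} {bar}")
```

One die gives a flat profile, two give a triangle, and four already look bell-shaped.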

    Fortunately the signal varies by more than the 1 bit comparator window for the sigma-delta A/D and D/A converters in your audio and video systems, which operate on similar principles. It would be quite obvious to your ears if they failed to work. (yes, they do some fancy feedback stuff to make it better, but you can get a poor man’s version by simple averaging. I’ve actually designed and built the circuits and software to do so)

    Peter

    • You assume you know the “true” temperature. Let’s change that so that all you know is 72/45 +- 0.5, 73/43 +- 0.5, and 79/48 +- 0.5, where the +- is uncertainty. Does the mean also have an uncertainty of +- 0.5? If not, why not? Will 1000 measurements change the fact that each individual measurement has a specific uncertainty and you won’t really know the “true” measurement?

      • For 1,000 measurements the *difference* between the true and the measured will form a rectangular distribution. If that distribution is averaged, the average forms a normal distribution, per the central limit theorem. The mean of that distribution will be zero, and thus the mean of the written-down measurements will be the ‘true’ measurement.

        Try performing the numerical experiment yourself. It’s relatively easy to do in a spreadsheet.

        Or go listen to some music from a digital source. The same thing is happening.

      • Peter; The problem is that you don’t know the true value? It lies somewhere between +- 0.5 but where is unknown.

      • Peter; The problem is that you don’t know the true value? It lies somewhere between +- 0.5 but where is unknown.

        How odd that your digital sound system appears to know.

        You do know the true value for some period (integrating between t0 and t1) as long as the input signal varies by much greater than the resolution of your instrument. You do not know the temperature precisely at t0 or any time in between t0 and t1. But for the entire period you do know at a precision greater than that of your instrument. This is how most modern Analog to Digital measurement systems work.

        Whether a temperature average is a useful concept by itself is not for debate here (I happen to think it’s relatively useless). But it does have more precision than a single measurement.

        Nick Stokes posted an example above. Try running an example for yourself. It just requires a spreadsheet.

      • Peter; consider what you are integrating. Is it the recorded value or the maximum of the range or the minimum of the range or some variations of maximum, minimum, and recorded range?

        And I’m sorry but integrating from t0 to t1 still won’t give the ‘true’ value. It can even give you a value to a multitude of decimal places. But you still can’t get rid of the uncertainty of the initial measurement.

        Consider your analog to digital conversion. You have a signal that varies from +- 10.0 volts. However, your conversion apparatus is only accurate to +- 0.5 volts. How accurate will your conversion back to analog be?

      • Consider your analog to digital conversion. You have a signal that varies from +- 10.0 volts. However, your conversion apparatus is only accurate to +- 0.5 volts. How accurate will your conversion back to analog be?

        Do you mean accuracy or precision? I’ll try to answer both.

        If you mean precision:

        It depends on the frequency and input signal characteristics. In the worst case of a DC signal with no noise at any other frequency, the precision is +/- 0.5 volts.

        If however I’m sampling a 1 kHz signal at 1 MHz and there is other random noise at different frequencies in the signal, then my precision is 0.5V/sqrt(1000) = 0.016 volts @ 1 kHz. I can distinguish 0.016V changes in the 1 kHz signal amplitude by oversampling and filtering (averaging). I’m trading off time precision for voltage precision.

        If you mean accuracy:

        If you mean accuracy AT DC, do you mean the accuracy of the slope or the offset? A linear calibration metric is typically expressed in terms of y=mx+b, I don’t know if you are talking about m or b… Likely ‘b’, or you would have used a different metric than volts (you would use a relative metric, like percentage). e.g. “accuracy = 1% +/- 0.5V” is what you might see in a calibration specification.

        Assuming you are talking about b, then since amplitude is typically a delta measurement, the b is irrelevant (it cancels out), giving the same answer as above: you know the amplitude of the 1 kHz signal within 0.016V.
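Both halves of that answer can be illustrated with a toy simulation. This is a simplified sketch under stated assumptions, not a model of a real converter: the "+- 0.5 V" is treated as Gaussian random noise with a standard deviation of 0.5 V, the input is a constant level rather than a 1 kHz tone, and the offset value and all names are hypothetical.

```python
import random
import statistics

rng = random.Random(1)
NOISE_SD = 0.5      # volts; assumed random error on each raw sample
BLOCK = 1000        # raw samples averaged into one output value
OFFSET_B = 0.3      # volts; assumed fixed calibration offset (the 'b' in y = mx + b)

def averaged_reading(true_level):
    """Mean of BLOCK noisy, offset samples of a constant true level."""
    samples = [true_level + OFFSET_B + rng.gauss(0.0, NOISE_SD) for _ in range(BLOCK)]
    return statistics.fmean(samples)

low = [averaged_reading(2.0) for _ in range(300)]
high = [averaged_reading(2.1) for _ in range(300)]

spread = statistics.stdev(low)                           # scatter of the averaged values
absolute_error = statistics.fmean(low) - 2.0             # still carries the offset
delta = statistics.fmean(high) - statistics.fmean(low)   # the offset cancels here

print(f"expected spread {NOISE_SD / BLOCK ** 0.5:.4f}, simulated {spread:.4f}")
print(f"absolute error {absolute_error:.3f}, measured step {delta:.3f}")
```

Averaging shrinks the random scatter by roughly sqrt(1000), the fixed offset survives untouched in the absolute value, and it drops out of the difference between two levels. None of this applies to the noise-free DC case mentioned earlier in the comment, where there is nothing random to average away.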

        Getting back to climate, as long as ‘b’ does not vary, you get the same answer for the temperature trend, since it is also a delta measurement. IMHO ‘b’ does vary quite a bit over time, more than the BE or other folks are taking into account (see Anthony’s work), but that’s not Kip’s argument.

        Peter

  6. I’m also somewhat surprised that they do not use ‘banker’s rounding’ (google it). Not using BR adds an upwards bias with a large amount of data, which is why banks do use it.

    • Banker’s Rounding would sure explain a 0.5 degree increase in global temperature over the last 150 years. Thermometers 50 years ago were hardly accurate to even 1 degree when reading the scale on the glass, and the reading depended on where your eye level was relative to the thermometer, in what were fairly crude weather stations. The 1 degree C global temperature increase over the last 150 years claimed by Science must also fall “randomly” within the +/- 0.5 deviation, especially if there is an upward bias. So half of all global warming might just be banker’s rounding.

    • “Not using BR adds an upwards bias with a large amount of data”
      It’s one way of avoiding bias. Any pseudo random tie-break would also do, and that’s probably what they do use if rounding is an issue. But it’s not clear that it is an issue.

      • Nick,
        Here is a BOM comment on rounding and metrication.
        http://cawcr.gov.au/technical-reports/CTR_049.pdf
        “The broad conclusion is that a breakpoint in the order of 0.1 °C in Australian mean temperatures appears to exist in 1972, but that it cannot be determined with any certainty the extent to which this is attributable to metrication, as opposed to broader anomalies in the climate system in the years following the change. As a result, no adjustment was carried out for this change”
        When we are looking at a 20th century official warming figure of 0.9 deg C, the 0.1 degree errors should become an issue. Geoff

      • Geoff,
        “the 0.1 degree errors”
        They aren’t saying that there is such an error. They are saying that there seems to be a small climate shift of that order, and they can’t rule out metrication as a cause, even though they have no evidence that it caused changes.

        An awful lot of numbers were converted with variable skill, but those authors have no special knowledge to offer (and say so). I remember my first passport post-metrication; my height was 1.853412 m! At one stage I looked at old news readings in F to check against GHCN (in C); I never found a conversion error.

      • BR is symmetrical since half of the .5 values get rounded up , the other half get rounded down.

        What will introduce a bias is when temperatures were marked in whole degrees by truncation. When and where this was used and stopped being used will introduce a 0.5 F shift if not correctly known from meta data and corrected for.
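The size of the truncation shift is easy to check on made-up numbers. The sketch below uses an evenly spaced grid of hypothetical "true" temperatures (not real station data) and compares recording by truncation with recording to the nearest whole degree. Python's built-in `round()` happens to use round-half-to-even, i.e. banker's rounding.

```python
import math
import statistics

# Hypothetical "true" temperatures on an even grid from 60.000 to 79.999 °F.
true_values = [k / 1000 for k in range(60_000, 80_000)]

truncated = [math.floor(t) for t in true_values]   # keep whole degrees, drop the fraction
nearest = [round(t) for t in true_values]          # round-half-to-even

bias_truncated = statistics.fmean(truncated) - statistics.fmean(true_values)
bias_nearest = statistics.fmean(nearest) - statistics.fmean(true_values)

print(f"truncation bias: {bias_truncated:+.4f} °F")
print(f"nearest-degree bias: {bias_nearest:+.4f} °F")
```

A station that switched from truncation to rounding would therefore show a step of about half a degree that has nothing to do with the weather, which is why the metadata matters.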

      • A broader quotation from the BoM document cited by Geoff is:

        “All three comparisons showed mean Australian temperatures in the 1973-77 period were from 0.07 to 0.13°C warmer, relative to the reference series, than those in 1967-71. However, interpretation of these results is complicated by the fact that the temperature relationships involved (especially those between land and sea surface temperatures) are influenced by the El Niño-Southern Oscillation (ENSO), and the 1973-77 period was one of highly anomalous ENSO behaviour, with major La Niña events in 1973-74 and 1975-76. It was also the wettest five-year period on record for Australia, and 1973, 1974 and 1975 were the three cloudiest years on record for Australia between 1957 and 2008 (Jovanovic et al., 2011).

        The broad conclusion is that a breakpoint in the order of 0.1 °C in Australian mean temperatures appears to exist in 1972, but that it cannot be determined with any certainty the extent to which this is attributable to metrication, as opposed to broader anomalies in the climate system in the years following the change. As a result, no adjustment was carried out for this change”

        So several years of the wettest, cloudiest weather on record in Australia, linked to two major La Nina events, caused the mean temperature to increase by about 0.1C? And unworthy of adjustment?

        Really?

        More than 50% of Australian Fahrenheit temperatures recorded before 1972 metrication were rounded .0F. Analysis of the rounding influence suggests it was somewhere between 0.2C and 0.3C, which sits quite comfortably with an average 0.1C warming amid rainy, cloudy climate conditions you’d normally expect to cool by 0.1C.

        Corruption of the climate record continued with the 1990s introduction of Automatic Weather Stations. The US uses five minute running averages from its AWS network in the ASOS system to provide some measure of compatibility with older mercury thermometers. Australia’s average AWS durations are something of a mystery, anywhere from one to 80 seconds (see Ken Stewart’s ongoing analysis starting at https://kenskingdom.wordpress.com/2017/09/14/australian-temperature-data-are-garbage/).

        Comparing historic and modern temps in Australia is like comparing apples with oranges, both riddled with brown rot.

    • Jer0me,
      There are several rounding schemes that have been invented and many are still in use in specialized areas. However, the argument that makes the most sense to me is that in a decimal system of numbers the sets of {0 1 2 3 4} {5 6 7 8 9} are composed of 5 digits each, and exactly subdivide the interval before repeating. Thus, when rounding, one should round ‘down’ (retain the digit) if any of the digits in the position of uncertainty are in the first set, and one should round ‘up’ (increment the digit) if any of the digits are in the second set.

      • Not so, because you aren’t actually rounding down the zero, it’s already zero… and so there are actually 4 elements that are rounded downward and 5 elements that are rounded upward so the scheme is asymmetrical and upward biased.

      • Tim,
        No, the digit in the uncertain position has been estimated as being closer to zero than it is to 1 or nine. The zero has a meaning, unlike the absence of a number.

      • Clyde

        The zero has a meaning, unlike the absence of a number.

        And the meaning is the number you’re rounding to. Think of it this way…out of the set {0,1,2,3,4} in 4 of the 5 cases the rounding will produce a downward adjustment. Out of the set {5,6,7,8,9} all 5 of the cases produce an upward adjustment. That can’t be a symmetrical adjustment if each of the outcomes is equally probable.
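The counting argument can be settled by enumeration. This sketch assumes the situation under debate: values carry exactly one extra digit, each of the ten digits is equally likely, and the value is rounded to a whole number. Function names are illustrative.

```python
from fractions import Fraction

def half_up_adjustment(digit):
    """Change made when a value with this tenths digit is rounded to a whole number."""
    return Fraction(-digit, 10) if digit < 5 else Fraction(10 - digit, 10)

def half_even_adjustment(digit, whole_part):
    """Same, except an exact .5 goes to whichever neighbour is even."""
    if digit == 5:
        return Fraction(-1, 2) if whole_part % 2 == 0 else Fraction(1, 2)
    return half_up_adjustment(digit)

digits = range(10)
moved_down = sum(1 for d in digits if half_up_adjustment(d) < 0)   # digits 1-4
moved_up = sum(1 for d in digits if half_up_adjustment(d) > 0)     # digits 5-9
mean_half_up = sum(half_up_adjustment(d) for d in digits) / 10
mean_half_even = sum(half_even_adjustment(d, w) for w in (0, 1) for d in digits) / 20

print(moved_down, moved_up)            # 4 5
print(mean_half_up, mean_half_even)    # 1/20 0
```

Under these assumptions the {0-4}/{5-9} rule moves values up by 0.05 of a unit on average, and the whole of that comes from the exact .5 case. When the underlying values have many more digits, exact ties become rare and the bias shrinks accordingly.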

  7. “In scientific literature, we might see this in the notation: 72 +/- 0.5 °F. This then is often misunderstood to be some sort of “confidence interval”, “error bar”, or standard deviation”

    The confusion is understandable? It’s been sixty years, but I’m quite sure they taught me at UCLA in 1960 or so that the 72 +/- notation is used for both precision based estimates and for cases where the real error limits are somehow known. It’s up to the reader to determine which from context or a priori knowledge?

    I’d go off and research that, but by the time I got an answer (if I got an answer) this thread would be long since dead. Besides which, I’d rather spend my “How things work time” this week trying to understand FFTs.

    Anyway — thanks as usual for publishing these thought provoking essays.

  8. Kip,
    You do have over a century of scientific understanding against you. And you give almost no quantitative argument. And you are just wrong. Simple experiments disprove it.

    In the spirit of rounding, I took a century of Melbourne daily maxima (to 2012, a file I have on hand). They are given to 0.1°C. That might be optimistic, but it doesn’t matter for the demo. For each month, I calculated the average of the days. Then I rounded each daily max to the nearest °C, and again calculated the average. Here are the results:

    Month To 1 dp   To 0 dp   Diff
    Jan   26.0478   26.0545   -0.0067
    Feb   26.0595   26.0535   0.006
    Mar   24.0706   24.0652   0.0054
    Apr   20.3757   20.3803   -0.0046
    May   16.9282   16.9242   0.004
    Jun   14.2591   14.2597   -0.0006
    Jul   13.7578   13.7416   0.0162
    Aug   15.0923   15.0832   0.0091
    Sep   17.4591   17.4493   0.0098
    Oct   19.8232   19.8177   0.0055
    Nov   22.0112   22.0087   0.0025
    Dec   24.2994   24.2966   0.0028
    

    As you’ll see, despite the loss of accuracy in rounding (To 0 dp), the averages over those 100 years, about 3000 days per month, do not have an error of order 1. In fact, the theoretical error is about 0.28/sqrt(3000)= 0.0054°C, and the sd of the differences shown is indeed 0.0062. 0.28 is the approx sd of the unit uniform distribution.
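The Melbourne file is not reproduced here, so the following sketch repeats the same experiment on synthetic data. Everything about the data is an assumption made for illustration: daily maxima are drawn from a normal distribution centred on 20.0 °C with a 5 °C spread, stored to 0.1 °C, and about 3000 days go into each average.

```python
import random
import statistics

def mean_shift_from_rounding(n_days, rng):
    """Difference between the mean of 0.1-degree data and the mean after
    rounding every value to the nearest whole degree."""
    # Work in integer tenths so that values such as 20.5 are exact.
    tenths = [round(rng.gauss(200, 50)) for _ in range(n_days)]   # 203 means 20.3 °C
    fine = statistics.fmean([t / 10 for t in tenths])
    coarse = statistics.fmean([round(t / 10) for t in tenths])    # round-half-to-even
    return fine - coarse

rng = random.Random(2017)
diffs = [mean_shift_from_rounding(3000, rng) for _ in range(100)]

sd_of_diffs = statistics.stdev(diffs)
theory = (1 / 12 ** 0.5) / 3000 ** 0.5    # sd of a unit-wide uniform error, over sqrt(n)

print(f"largest shift in any average: {max(abs(d) for d in diffs):.4f}")
print(f"sd of shifts: {sd_of_diffs:.4f}, 1/sqrt(12)/sqrt(3000) = {theory:.4f}")
```

What this can and cannot show should be kept in view: it demonstrates that rounding error, which is independent from day to day here, averages down. It does not address an error shared by every reading, which would pass through both averages unchanged.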

      • What Nick’s example shows is that rounding error is approximately gaussian ( normally ) distributed , contrary to Kip’s assertion.

        That is only one very small part of the range of problems in assessing the uncertainty in global means. Sadly even this simple part Kip gets wrong from the start. The article is not much help.

      • “that rounding error is approximately gaussian”
        Actually, there’s no requirement of gaussian. It just comes from the additivity of variance (Bienaymé). If you add n uncorrelated variables, same variance, the sd of sum is σ*sqrt(n), and when you divide by n to get the average, you get the 1/sqrt(n) attenuation.

      • Thanks Nick. That article refers to “random” variables, how is that different to normally distributed?

        “of the same variance” is also key problem in global temps since SST in different regions do not have the same variance. That is without even talking about the illegitimate mixing with land temps, which vary about twice as quickly due to lesser specific heat capacity, which is why you can not even add them to sea temps, let alone the rest of the data mangling.

        You can not play with physical variables as freely as you can with stock market data.

      • Greg,
        “That article refers to “random” variables, how is that different to normally distributed?”
        Random variables can have all manner of distributions. Gaussian (normal), Poisson, uniform etc.

        ” is also key problem”
        Same variance here just simplifies the arithmetic. The variances still add, equal or not.

        My example just had Melbourne temperatures. Nothing about land/ocean.
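Written out, the additivity being referred to looks like this. It is the standard result for uncorrelated random variables $X_1, \dots, X_n$ with standard deviations $\sigma_1, \dots, \sigma_n$; the symbols are introduced here for illustration and do not come from the comments.

```latex
\operatorname{Var}\!\Big(\sum_{i=1}^{n} X_i\Big) = \sum_{i=1}^{n} \sigma_i^{2}
\qquad\Longrightarrow\qquad
\operatorname{Var}(\bar{X}) = \frac{1}{n^{2}} \sum_{i=1}^{n} \sigma_i^{2}

\text{and, if every } \sigma_i = \sigma:\qquad
\sigma_{\bar{X}} = \frac{\sigma \sqrt{n}}{n} = \frac{\sigma}{\sqrt{n}}
```

No distribution shape appears anywhere in this, only the requirement that the errors be uncorrelated. An error component common to every $X_i$ is perfectly correlated with itself and is not reduced by the division.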

      • Well done Nick.

        You have also highlighted your lack of comprehension of basic maths :-)

        “n” readings of +/- 0.5 uniformly distributed between 0 and 1.

        Standard deviation is INDEPENDENT of “n”

        “n” readings +/- 0.5 uniformly distributed from any 1 unit group eg (between 17.5 & 18.5)

        And suddenly you think the standard deviation becomes dependent on “n”? Really ?????

        Do you want to think about that…………… just once?

        No probably not. Just keep trotting out your statistical gibberish.

      • “And suddenly you think the standard deviation becomes dependent on “n”? “
        Where did I say that? The argument here is about standard error of the mean. Which is also related to the standard deviation of a set of realisations of the mean.

        I think you’re out of your depth here, Andy.

    • Nick. I’m sure you’re right. But, Kip has a point also. If I take a cheap Chinese surveying instrument that measures to the nearest 10cm and measure the height of the Washington Monument (169.046 m), I’m probably going to get an answer of 169.0m and averaging a million measurements isn’t going to improve whatever answer I get. (As long as the monument refrains from moving? Can I improve my measurement by jiggling my measuring instrument a bit while making a lot of observations?)

      I’m not quite clear on what the difference is between the two situations. Or even whether there is a difference.
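The "jiggling" question has a concrete answer that can be tried numerically. The sketch below is a toy model with assumed details: the instrument is perfect apart from reporting to the nearest 0.1 m, and the jiggle is a zero-mean uniform disturbance exactly one reporting step wide. Names and the sample count are illustrative.

```python
import random
import statistics

TRUE_HEIGHT = 169.046   # metres, the figure quoted in the comment
N = 100_000

def read_instrument(height):
    """Report the height to the nearest 0.1 m, as a coarse instrument would."""
    return round(height * 10) / 10

rng = random.Random(42)

# Fixed target, fixed instrument: every reading is identical.
plain = [read_instrument(TRUE_HEIGHT) for _ in range(N)]

# Same instrument, but each reading is disturbed by up to half a step either way.
dithered = [read_instrument(TRUE_HEIGHT + rng.uniform(-0.05, 0.05)) for _ in range(N)]

print(f"mean without jiggle: {statistics.fmean(plain):.4f}")
print(f"mean with jiggle:    {statistics.fmean(dithered):.4f}")
```

Without the disturbance, averaging returns the same rounded value every time. With it, the readings split between the two neighbouring steps in proportion to where the true value lies, and the mean recovers the finer digits. This only works because the added disturbance is zero-mean and uncorrelated with the quantity measured; it does nothing for an instrument that is simply miscalibrated.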

      • Don K,
        “I’m not quite clear on what the difference is between the two situations.”
        Mark Johnson has it right below. The difference is that one is sampling, and sampling error is what matters. In any of these geophysical situations, there aren’t repeated measures of the same thing. There are single measures of different things, from which you want to estimate a population mean.

        So why do measurement errors attenuate? It is because for any of those measures, the error may go either way, and when you add different samples, they tend to cancel. In Kip’s 72F example, yes, it’s possible that the three readings could all be down by 0.5, and so would be the average. But it’s increasingly unlikely as the number of samples increases, and extremely unlikely if you have, say, 10.

      • Thanks for trying Nick. As I say, I’m sure you are correct. But I also think Kip is probably correct for some situations. What I’m having trouble with is that it appears to me there are not two fundamentally different situations, but rather two situations connected by a continuous spectrum of intermediate situations. So, I’m struggling with what goes on in the transition region (if there is one) between the two situations. And how about things like quantization error? As usual, I’m going to have to go off and think about this.

      • Don K writes

        But I also think Kip is probably correct for some situations.

        Situations where there was a bias involved in the measurements for example…

      • “Situations where there was a bias involved in the measurements”
        No, Kip’s examples have nothing about bias. He said so here. You don’t see examples like this involving bias. They aren’t interesting, because once stated, the solution is obvious; remove or correct for the bias. There’s nothing else.

      • Nick writes

        They aren’t interesting, because once stated, the solution is obvious; remove or correct for the bias.

        Fair enough from Kip’s later comment, but practically speaking you can’t easily say you have no bias in your measurements, especially in measuring something as complex as GMST or GMSL.

      • But I also think Kip is probably correct for some situations.

        He’s correct for the situation which he carefully prepares above. If the signal you are sampling never deviates beyond the resolution of the instrument, you are stuck with the resolution of the instrument.

        Fortunately for your sound system and for temperature averages, the signal does deviate over time by more than the resolution, and thus you can get an accuracy greater than that of the resolution of the measurement instrument by averaging together multiple measurements.

        Your sound system in your stereo (unless you are an analog nut) samples at 10s of MHz frequencies using a 1-bit D/A (or A/D) and then “averages” down the signal to 192 kHz giving you nice 24-bit sound at 20 kHz. At least, that’s how the Burr-Brown converter in my expensive pre-amp works. I also helped design such systems…

        Peter

        (I put “averages” in quotes because it’s more sophisticated than that. In fact they purposefully introduce noise to force the signal to deviate by more than the resolution. The “averages” the climate folks use are boxcar averages which is probably the worst choice for a time series…)
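The two cases Peter describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an ideal quantizer with a 1-unit step, a hypothetical true value of 71.3, and Gaussian noise standing in for the deliberate dither. The names `quantize` and `mean_of_quantized` are made up for this example and do not describe how any real converter or thermometer works.

```python
import random
import statistics

def quantize(x, step=1.0):
    """Round x to the nearest multiple of step (an idealized instrument)."""
    return round(x / step) * step

def mean_of_quantized(true_value, noise_sd, n, step=1.0, seed=0):
    """Average n quantized readings of true_value plus zero-mean Gaussian noise."""
    rng = random.Random(seed)
    readings = [quantize(true_value + rng.gauss(0.0, noise_sd), step)
                for _ in range(n)]
    return statistics.fmean(readings)

TRUE_VALUE = 71.3  # hypothetical quantity, finer than the 1-unit resolution

# Case 1: the signal never moves relative to the resolution (no noise).
stuck = mean_of_quantized(TRUE_VALUE, noise_sd=0.0, n=10_000)

# Case 2: the signal wanders by about one step (assumed noise_sd = 1.0).
dithered = mean_of_quantized(TRUE_VALUE, noise_sd=1.0, n=10_000)

print(f"no noise:   mean = {stuck:.3f}, error = {abs(stuck - TRUE_VALUE):.3f}")
print(f"with noise: mean = {dithered:.3f}, error = {abs(dithered - TRUE_VALUE):.3f}")
```

With no noise every reading lands on the same step, so the mean stays 0.3 away from the true value however many readings are averaged. With noise of about one step, the mean of 10,000 readings comes out much closer to 71.3. The sketch assumes the noise is unbiased; it says nothing about whether real measurements meet that condition.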

      • Peter ==> If only they were finding the means for “water level at the Battery at 11 am 12 Sept 2017” they would get wonderfully precise and accurate means for that place and time with a thousand measurements. Digitizing music doesn’t attempt to reduce the entire piece of music to one single precise note.

      • Peter ==> If only they were finding the means for “water level at the Battery at 11 am 12 Sept 2017” they would get wonderfully precise and accurate means for that place and time with a thousand measurements. Digitizing music doesn’t attempt to reduce the entire piece of music to one single precise note.

        That’s an argument that the average sea level over some long period of time is not physically meaningful.

        That’s a different argument than what you discuss in the above article.

        As far as music, the single precise note is sampled thousands of times at low resolution and then averaged in a way that is physically meaningful to your ear. That was my point. If you want to argue that averaging the entire musical piece is not meaningful, well, I would agree with you. But I wouldn’t argue about the precision of that average, I would just argue that it’s not meaningful…

        Peter

      • Peter ==> Yes, quite right, a different subject than that of the essay. A-D is like finding the precisely right water level at a single time, sort of like NOAA CO-OPS does with the 181 1-second readings to get a six-minute mean, which is the only data actually permanently recorded.

        The attempt to use thousands of six-minute means to arrive at a very precise monthly mean is like reducing an entire piece of music to a single precise note — it is only the precision claimed that is meaningless — it is possible to get a very nice useful average mean sea level within +/- 2cm or maybe double that +/-4 cm with all other variables and source of uncertainty added in.

      • The attempt to use thousands of six-minute means to arrive at a very precise monthly mean is like reducing an entire piece of music to a single precise note — it is only the precision claimed that is meaningless — it is possible to get a very nice useful average mean sea level within +/- 2cm or maybe double that +/-4 cm with all other variables and source of uncertainty added in.

        It’s not quite so black and white. Consider music. If I averaged out the 10-20 kHz part of the signal I would certainly lose musical quality (although someone with hearing loss might not notice), but I would improve the precision at 100 Hz. I would still be able to hear and calculate the beats per minute of the music, for example.

        The same issue if I was trying to detect tides. If I average over 48 hours or monthly I’m not going to see the tides in my signal since the tides are ~6 hours peak-trough.

        If I’m interested in how the sea level is changing from decade to decade, however, averaging to a yearly level is perfectly reasonable, and you actually gain precision in doing so, since all the small perturbations are averaged out and additionally you trade decreased time precision for increased sea level precision. This is where we seem to disagree, and I’ll stand on 25 years of engineering experience (including as an engineer designing calibration equipment for electronics), plus can provide textbook references if you want. The Atmel data sheet I provided in a post above is one example.

        I think, however, that small long-term changes in the average surface temperature over the planet are not physically relevant. For the global average, I can change the time axis for an X-Y axis (making this a 3-D problem) and the above analysis about averaging and trading time precision for temperature precision applies; it’s just not physically relevant. The average global temperature in combination with time is not really physically relevant (whereas the monthly average temperature in the El Nino region IS physically relevant). I’d refine that argument and say a 1 degC change for global temperatures is not physically relevant, but 10 degC likely is. (-10 degC is an ice age.)

        I also believe there’s an issue with measuring long term temperature trends that only a few have addressed. From Nyquist we know that we cannot see a signal with a frequency greater than sample rate / 2, but few people realize Nyquist is symmetrical. We cannot see signals with a period LONGER than the window length / 2.
        So for example in a 120 year temperature record we cannot resolve anything longer than 60 year cycles. And it’s actually worse than this if you have multiple overlapping long cycles like say for example PDO and multiple friends out of phase with each other… (Numerical analysis suggests 5 cycles required, which also corresponds to the normal oversampling rate on digital oscilloscopes for similar reasons, based on professional experience). I’d like to see a temperature record of 350 years before drawing strong conclusions about long term climate trends….

        Peter

      • Peter ==> ” I’d like to see a temperature record of 350 years before drawing strong conclusions about long term climate trends….” You’ll have to wait another 300 years in that case — as the satellite record is at best 50 years long.

        The thermometer record before the digital age is vague and error prone, spatially sparse, and unsuited for the purpose of a global average, and it is subject to the limited accuracy of +/- 0.5°F (about 0.28°C) plus all of the known reading, recording, siting, etc. errors.

    • For a time series, an “average” is not an average. It is a smooth or a filter. When you “average” 30 days of temperature readings to obtain a monthly “average,” you are applying a 30-day smooth to the data by filtering out all wavelengths shorter than 30 days. It is a filter, not an average. Dividing by the square root of n does not apply to smooths. You know better. You are very knowledgeable. What you are doing in your chart is comparing two different ways to do a smooth. Again, it is not an average. The only way that you can apply the square root of n to claim an improvement in measurement uncertainty is if each measurement were of the same thing. However, every day when you take a temperature reading, you are measuring a property that has changed. You can take an infinite number of readings and the smooth of such readings will have the same uncertainty as the most uncertain of the readings. You do not get the benefit of claiming a statistical miracle. The problem arises by treating a time series as if it consisted of a collection of discrete measurements of the same thing. The average temperature of January 1 is not an estimate of the “average temperature” of the month of January. Same goes for each day of January. You do not have 30 measurements of the “average temperature” of January!

      • “You do not have 30 measurements of the “average temperature” of January!”
        No. I have 100. Each year’s 31-day average is a sample of a population of January averages. And they are literally averages; they do have filter properties too, though that is more awkward. But filtering also attenuates noise like measurement error or rounding.

      • When you are smoothing 30 days of temperature data, your “n” is still only 1! It is incorrect to claim that when smoothing 30 days of temperature data “n” equals 30. Thus the square root of n is 1, and not the square root of 30. Thus, you do not get the benefit of improved or reduced uncertainty. All you are doing is filtering out certain terms of a Fourier analysis of a time series, namely all wavelengths shorter than 30 days. When you remove terms of an equation, you are discarding information. So, in effect, you are claiming improved uncertainty by discarding information! Let us take your century of data. A century of data has 365.25 times 100 years of daily data or about 36,525 data points. By applying a 100 year smooth to this data, you are eliminating all wavelengths shorter than 100 years and you are left with a single statistic, the 100 year smooth of a century of daily temperature readings. You are then claiming that you know this smooth to an uncertainty of one over the square root of 36,525 or about 0.0052. That is absurd. The uncertainty of the smooth is the same as the largest uncertainty in your time series. If a single measurement has an uncertainty of plus-or-minus 10 degrees C and all the other measurements have an uncertainty of plus-or-minus 1 degree C, then your smooth will have an uncertainty of plus-or-minus 10 degrees C. Again, the “average” of a time series is not a “mean,” it is a smooth. You are discarding information and common sense should tell you that you do not improve your knowledge (i.e. reduce uncertainty) by discarding information.

      • Each year’s 31-day average is a sample of a population of January averages.

        NO. Each January is a smooth of something different. You are not taking one hundred measurements of a single hole’s diameter, so that you can divide by the square root of 100 and claim that you have an improved uncertainty of the diameter of that single hole. You are taking 100 measurements of the diameter of 100 different holes, because each January is different, so you do not get the benefit of dividing by the square root of 100.

      • Phil
        “When you remove terms of an equation, you are discarding information. So, in effect, you are claiming improved uncertainty by discarding information!”

        Of course averaging discards information. You end up with a single number. Anyone who has lived in Melbourne will tell you that the average Jan max of 26°C is not a comprehensive description of a Melbourne summer. It estimates an underlying constant that is common to January days. In Fourier terms, it is the frequency zero value of a spectrum. But by reducing a whole lot of information to a single summary statistic, we can at least say that we know that one statistic well.

      • Let me put it another way. You have a hole whose diameter is changing continuously. Measuring the diameter 100 times does not improve your uncertainty as to the diameter of the hole, because each time you measured it, the diameter had changed. When you apply a 30-day smooth to the series of diameter measurements, you are simply reducing the resolution of your time series data. This may be helpful in determining if the hole is getting bigger or smaller, but it does not improve the uncertainty of each diameter measurement, because each time you measure you are only sampling it once, so you have 100 measurements of sample size n, where n=1. You can only divide by the square root of 1. You cannot claim that your uncertainty is improved. You need to treat the series of measurements as a time series and only use statistical theorems appropriate for time series. Using statistical theorems applicable to non-time series data on time-series data will provide (respectfully) spurious results.

      • Phil
        “You have a hole whose diameter is changing continuously.”
        Well, an example is the ozone hole. We can check its maximum once a year. And as years accumulate, we have a better idea of the average. There it is complicated by the fact that we think there may be secular variation. But even so, our estimate of expected diameter improves.

      • (the average Jan max) estimates an underlying constant that is common to January days.

        Again, most respectfully, no. The average of Jan max is not an underlying constant. You may claim that the average of Jan max is a constant, but, in reality, the temperature is continuously changing. You may claim that the filtered data that you call “the average of Jan max” is not significantly different from zero from year to year based on certain statistical tests, but you cannot pretend that “the average of Jan max” is a constant. Temperature is changing continuously.

        Of course averaging discards information. You end up with a single number.

        Please do not confuse issues. Averaging (dividing the sum of 100 measurements by 100) 100 distinct measurements of a hole whose size does not change does not discard any information. In that instance, you can claim that you can improve on the uncertainty of just measuring it once, by dividing by the square root of 100. “Averaging” (dividing the sum of 100 sequential data points by 100) 100 measurements of a hole whose size is changing continuously is a mathematical operation on a time series called smoothing. The result is not the mean of a population. It is a filter which removes certain wavelengths and thus discards information. Although, the computational steps bear great similarity, the two operations are quite distinct mathematically and I think you know that.

        …by reducing a whole lot of information to a single summary statistic, we can at least say that we know that one statistic well

        Once again, I respectfully disagree. How well you know that “single summary statistic” depends not only on how you reduce the information but also on the nature of the information that you are reducing. When the “whole lot of information” consists of time-series data, and what you are measuring is changing from measurement to measurement, then you cannot claim that you “know” the “single summary statistic” any better than you know the least certain data point in the series of data points that mathematical operations are being performed on, because each time you measure this continuously changing thing, you are only measuring it once. The only exception I can think of is in certain high quality weather stations where three sensors are installed and temperature is measured simultaneously by all three. At those particular weather stations and ONLY at those particular weather stations can it be claimed that the sample size, n, is greater than 1. At those stations and ONLY at those stations is it appropriate to divide the uncertainty of the sensor by the square root of 3 to obtain an improved uncertainty of each temperature measurement by the system of three sensors at each particular time of measurement.

      • Well, an example is the ozone hole. We can check its maximum once a year. And as years accumulate, we have a better idea of the average. There it is complicated by the fact that we think there may be secular variation. But even so, our estimate of expected diameter improves.

        Let’s assume that each time the ozone hole is measured, the uncertainty of that measurement is, for the sake of argument, plus-or-minus one square mile. You cannot “average” the historical maximum ozone hole measurements and claim that you know the size of the ozone hole with an uncertainty less than the hypothetical plus-or-minus one square mile. You do not have a better idea of the average maximum ozone hole size as the years “accumulate.” As the years accumulate, the characteristics of the filter that you are using change so that for 10 years of history, you may reduce that to one statistic that would be a 10 year smooth, discarding all wavelengths shorter than 10 years in length. When you have 20 years of history, you may reduce that to a different statistic that would be a 20 year smooth, discarding all wavelengths shorter than 20 years in length, but the uncertainty of each smooth would remain the same at the hypothetical one square mile.

      • Phil,
        You said, “When you remove terms of an equation, you are discarding information.” I totally agree. An easy way to demonstrate this is to plot the daily temperatures and also plot the monthly temperatures and compare them. If one calculates the standard deviation of the annual data, I would expect that the standard deviation would be larger for the daily data than for the monthly data. Also, I would expect the daily data to have a larger range.

    • I set about to disprove Kip’s assertion, using Mathematica, and found a satisfying (to me) proof.
      Then I read the comments, and found the above comment by Nick Stokes.
      Although I am a warming skeptic, and Nick (I think) is not, I must concur with Nick.
      Since he said it well, I’ll not bother to discuss my simulation — it’s quite trivial.

      • Did you check the source code in Mathematica first? Did you even read (and understand) the manual thoroughly? Statistics/mathematics packages embody a whole lot of assumptions that the average user is almost never aware of. A lot of the bad statistics around these days are due to the fact that most people never actually learn the underlying theory any longer. They just follow the recipe without knowing if they have the right ingredients.

      • I did the same 2 years ago using Matlab. And since I’ve saved companies $millions by using statistics, I’m quite confident in the source code.

        (I was actually checking to see what the result of auto-correlation was for space-based averaging, such as what Berkeley Earth uses. They underestimate the std deviation by about 2.5x because they don’t take this into account… there’s also other issues with BE (their algorithm for determining whether to infill is likely too sensitive) but I digress)

      • You would be surprised how many people have not the slightest idea what autocorrelation is, though it is hard to think of any kind of climate data that are not autocorrelated.
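Why autocorrelation matters for an averaged series can be illustrated with a small simulation. The sketch below assumes a simple AR(1) process with an arbitrary coefficient of 0.7; that value is chosen only for illustration and is not estimated from Berkeley Earth or any other climate data set. It compares the textbook standard error (sd / sqrt(n), which assumes independent points) with the spread of the mean actually observed across many simulated series.

```python
import math
import random

PHI = 0.7      # assumed lag-1 autocorrelation (illustrative only)
N = 400        # points per simulated series
TRIALS = 1000  # number of simulated series

def ar1_series(n, phi, rng):
    """Generate n points of a zero-mean AR(1) process with unit-variance shocks."""
    # Start from the stationary distribution so there is no warm-up transient.
    x = rng.gauss(0.0, 1.0) / math.sqrt(1.0 - phi ** 2)
    series = []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        series.append(x)
    return series

def sample_sd(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    m = math.fsum(values) / len(values)
    return math.sqrt(math.fsum((v - m) ** 2 for v in values) / (len(values) - 1))

rng = random.Random(7)
means = []
naive_ses = []
for _ in range(TRIALS):
    s = ar1_series(N, PHI, rng)
    means.append(math.fsum(s) / N)
    naive_ses.append(sample_sd(s) / math.sqrt(N))  # pretends points are independent

actual_se = sample_sd(means)              # observed spread of the mean
naive_se = math.fsum(naive_ses) / TRIALS  # average "textbook" standard error
ratio = actual_se / naive_se
theory = math.sqrt((1 + PHI) / (1 - PHI))  # large-n inflation factor for AR(1)

print(f"naive SE  = {naive_se:.4f}")
print(f"actual SE = {actual_se:.4f}")
print(f"ratio     = {ratio:.2f} (large-n AR(1) theory: {theory:.2f})")
```

For an AR(1) process the large-sample inflation factor is sqrt((1 + phi) / (1 - phi)), roughly 2.4 for phi = 0.7, so the naive formula understates the uncertainty of the mean by about that factor in this toy case. Real climate series need not follow an AR(1) model.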

    • Nick, what you need to explain to me is how any treatment of data removes the original uncertainty – because whatever number you come up with it is still bound (caveatted) by the original +/- 0.1 deg or whatever the original uncertainty is; i.e. in your example 0.2 deg C.

      And remember in the series you have used that most of the numbers had a +/- 1 deg F before BOM played with them to reach temperatures to 4 decimal places from a 2 deg F range that must still apply.

    • Nick,

      Your exercise is wrong.
      Remember that a disproportionate number of original temperature readings were taken to the nearest whole degree F. If they later got some added figure after the decimal because of conversion from F to C, by dropping these off again for your exercise you are merely taking the data back to closer to where it started. Even post-decimal, if you think of a month when all the original observations were in whole degrees, you are merely going in a loop to no effect. It is unsurprising that you find small differences.
      To do the job properly, you need to examine the original distribution of digits after the decimal.
      ………
      But you are missing a big point from Kip’s essay. He postulates that observations of temperature need not follow a bell-shaped distribution about the mean/median or whatever, but are more often a rectangular distribution to which a lot of customary statistics are inapplicable. I have long argued that too much emphasis has been put on statistical treatments that do more or less follow normal distributions, with too little attention to bias errors in a lot of climate science.

      Early on, I owned an analytical chemistry lab, a place that lives or dies on its ability to handle bias errors. The most common approach to bias detection is by the conduct of analyses using other equipment, other methods with different physics, like X-ray fluorescence compared with atomic absorption spectrometry compared with wet chemistry with gravimetric finish. In whole rock analysis the aim is to control bias so that the sum of components of the rock specimen under test is 100%. Another way to test accuracy is to buy standard materials, prepared by experts and analysed by many labs and methods, to see if your lab gives the same answer. Another way is to be registered with a quality assurance group such as NATA which requires a path to be traced from your lab to a universal standard. Your balance reports a weight that can be compared with the standard kilogram in Paris.
      Having seen very little quality work in climate science aimed at minimising of bias error and showing the trace to primary standards, one might presume that the task is not routinely performed. There are some climate authors who are well aware of the bias problem and its treatment, but I do wish that they would teach the big residual of their colleagues to get the act right.
      It will be a happy future day when climate authors routinely quote a metrology measurement authority like BIPM (Bureau of Weights and Measures, Paris) in their lists of authors. Then a lot of crap that now masquerades as science would be rejected before publication and save us all a lot of time wading through sub-standard literature to see if any good material is there.
      Don’t you agree? Geoff.

      • Geoff,
        The history of the data here doesn’t matter. It’s about the arithmetic. It’s a data set with a typical variability. If the original figures were accurate, adding error in the form of rounding makes little difference to the mean. If they had been F-C conversion errors, measurement errors or whatever, they would have attenuated in the same way. The exception is if the errors had a bias. That’s what you need to study.

        That is the deal with homogenisation, btw. People focus on uncertainties that it may create. But it is an adjunct to massive averaging, and seeks to reduce bias, even at the cost of noise. As this example shows, that is a good trade.

        re BIPM – no, that misses the point. As Mark Johnson says elsewhere, it’s about sampling, not metrology.

      • Nick, the question is being asked badly and you have not answered it, so I will ask you directly. Can you always homogenize data? And let’s fire a warning shot to make you think: both Measured Sea Level and Global temperature are proxies. I have no issue with your statistics, but your group has a problem they are missing.

      • LdB
        “Can you always homogenize data”
        The question is, can you identify and remove bias, without creating excessive noise? That depends partly on scope of averaging, which will damp noise and improve the prospects. As to identifying bias, that is just something you need to test (and also to make sure you are not introducing any).

      • So basically you have a rather large gap in your science knowledge that you can’t homogenize everything.

      • It simply means that as with any numerical procedure, you have to check if it is working. With temperature homogenisation, that is done extensively, eg Menne and Williams.

      • I am less worried about the temperature readings than the tide gauges. Having seen many situations in which the Central Limit Theorem fails in signal processing, the tide gauge situation does have my alarm bells ringing. Do you know if anyone has tested it?

    • Nick ==> There is no question that when staying in the world of mathematics the difference is small. The problem, though, is not the mathematics — maths are always (nearly ) very neat and – surprise – comply with mathematical theories.

      This, however, is a pragmatic problem — the original measurement was strictly given as a range and all means of ranges are ranges of the same order.

      Do the simple experiment described in the Author’s Comment Policy section, and let me know what you find.

      • Kip
        “Do the simple experiment”
        Here is the fallacy in your first case. You can take it that the distribution of each range is uniform, range +-0.5. So the first reading looks like this:

        The variance is 1/12. But when you take the sum of 71 and 72, the probabilities are convolved:

        The range is +-1, but the variance is 1/12+1/12=1/6. When you sum all of them, the distribution is convolved again (with the running mean) and is

        The range is now +-1.5, and the variance 1/4. To get the average, you divide the x-axis by 3. That brings the range back to +-0.5, but the variance is now 1/36. The range is theoretically 1, but with very small probabilities at the ends.

        You can see that the distribution is already looking gaussian. This is the central limit theorem at work. The distribution gets narrower, and the “possible” range that you focus on becomes extreme outliers.
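The variance figures Nick quotes (1/12, 1/6, 1/4, then 1/36 for the average) can be checked with a short Monte Carlo run. This is a sketch under the same assumption he states, namely that each reading's error is independent and uniform on +-0.5; the sample size and seed are arbitrary choices for the example.

```python
import random
import statistics

rng = random.Random(42)
N = 50_000  # number of simulated draws (arbitrary)

def err():
    """One rounding error, assumed uniform on [-0.5, 0.5)."""
    return rng.uniform(-0.5, 0.5)

single = [err() for _ in range(N)]
sum2 = [err() + err() for _ in range(N)]
sum3 = [err() + err() + err() for _ in range(N)]
avg3 = [s / 3 for s in sum3]

var_single = statistics.pvariance(single)  # theory: 1/12
var_sum2 = statistics.pvariance(sum2)      # theory: 1/6
var_sum3 = statistics.pvariance(sum3)      # theory: 1/4
var_avg3 = statistics.pvariance(avg3)      # theory: 1/36

# Fraction of errors larger than 0.4 in magnitude, i.e. near the edge
# of the "possible" +-0.5 range.
tail_single = sum(abs(e) > 0.4 for e in single) / N
tail_avg3 = sum(abs(a) > 0.4 for a in avg3) / N

print(f"var single = {var_single:.4f}  (1/12 = {1/12:.4f})")
print(f"var sum2   = {var_sum2:.4f}  (1/6  = {1/6:.4f})")
print(f"var sum3   = {var_sum3:.4f}  (1/4  = {1/4:.4f})")
print(f"var avg3   = {var_avg3:.4f}  (1/36 = {1/36:.4f})")
print(f"P(|error| > 0.4): single = {tail_single:.3f}, avg of 3 = {tail_avg3:.3f}")
```

The average of three never leaves +-0.5, which is Kip's point about the range, but errors near the ends of that range become rare, which is Nick's point about the distribution. Both statements hold at once under the independence assumption; the simulation does not address bias or correlated errors.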

      • Kip – I think we have to keep pulling people back to considering the REALITY of what the ORIGINAL measurement purports to quantify.

      • Nick is correct. It is well established statistical theory that averaging of quantized signals can improve accuracy. The usual model is, however, based upon several assumptions. Rather than flat out rejection of the efficacy, which is well established, a counterargument should focus on the assumptions, and whether they are satisfied.

        Quantization Noise Assumptions:

        1) the data are homogeneous
        2) the measurements are unbiased
        3) the underlying signal is traversing quantization levels rapidly and independently

        Under these assumptions, one can model quantization as independent, zero mean, additive noise uniformly distributed between -Q/2 to +Q/2, where Q is the quantization interval. The RMS of the error is then Q/sqrt(12). Averaging N samples then reduces the RMS to Q/sqrt(12N), and the averages are reasonably close to being normally distributed for large N. Such averaging is routine in a wide variety of practical applications, and the results in those applications do generally adhere to the model.
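A minimal numerical check of the Q/sqrt(12) and Q/sqrt(12N) figures, assuming all three conditions listed above hold. The underlying signal here is simply drawn at random across many quantization levels (a stand-in for assumption 3), and the values of Q, N and the signal range are arbitrary choices for the example.

```python
import math
import random

Q = 1.0        # quantization interval (e.g. whole degrees)
N = 100        # readings per average
TRIALS = 2000  # number of independent averages

def rms(values):
    """Root-mean-square of a list of numbers."""
    return math.sqrt(math.fsum(v * v for v in values) / len(values))

rng = random.Random(1)
single_errors = []  # error of each individual quantized reading
mean_errors = []    # error of each N-reading average vs. the true average

for _ in range(TRIALS):
    # Assumed signal: crosses many quantization levels, independently each time.
    signal = [rng.uniform(50.0, 90.0) for _ in range(N)]
    readings = [round(s / Q) * Q for s in signal]
    errors = [r - s for r, s in zip(readings, signal)]
    single_errors.extend(errors)
    # Mean of readings minus mean of true values equals the mean of the errors.
    mean_errors.append(math.fsum(errors) / N)

rms_single = rms(single_errors)
rms_mean = rms(mean_errors)

print(f"RMS single-reading error = {rms_single:.4f} (Q/sqrt(12)  = {Q / math.sqrt(12):.4f})")
print(f"RMS error of {N}-mean    = {rms_mean:.4f} (Q/sqrt(12N) = {Q / math.sqrt(12 * N):.4f})")
```

This only confirms the arithmetic of the model when its assumptions are met. Whether homogeneity and absence of bias hold for real temperature or tide gauge data is a separate question that the simulation cannot answer.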

        To what degree are these assumptions satisfied for the temperature series? Well, the data are not homogeneous, because temperature is an intensive variable. Its physical significance varies with the local heat capacity. And, the likelihood that the measurements are unbiased is vanishingly small, due to the sparse sampling, the ad hoc methods employed to merge them together, and the issue of homogeneity referenced above.

        Assumption #3 is, in fact, the only assumption that likely does hold. Thus, I do not see the line of reasoning of this article as being particularly fruitful. It is attacking one of the stronger links in the chain, while the other links are barely holding together.

      • Thanks Nick. I like convolutions. Reminds me of a project I did on a scanning spectrometer once. Fine slits in the collimator give a sharp spectral resolution but sometimes light levels are too weak and you need to open up to wider slits. This convolves the spectral peaks (scanning wavelength) with your first graph and causes broadening, losing resolution. In fact both inlet and outlet slits are finite, leading to something like your second graph.

        If the slits are not equal this leads to an isosceles trapezoid form convoluted with the scan signal. The fun bit is to try to deconvolute to recover the original resolution. ;)

        The third one is quite a surprise. It’s obviously distorted but remarkably bell shaped. This implies that a three pole running mean would be a fairly well behaved filter, even with same window length each time, as opposed to the asymmetric triple RM I show here:

        https://climategrog.wordpress.com/gauss_r3m_freq_resp/

      • Nick ==> Again, you are talking statistical theory — and trying to find probabilities, reducing inaccuracy of measurement through the simple process of — in the end — long division.

        I don’t want to know the probability — I want to know the actual water level or the actual temperature — I want to know how close my measurement is to the real world true value -=- not theoretically how probably close my mean is to the actual measurement.

        My instrument (the tide gauge) only guesses (mechanically) at the water level outside. Most of its guesses are within 2 cm of the actual instantaneous water level (outside the instrument); some are not, but the kind folks at NOAA allow the system to throw out as many as necessary if they are more than 3-sigma different than the others in the same set (of 181 measurements), which allows me to meet the accuracy specification, which is, as discussed, +/- 2 cm.

        You may have as precise a MEAN as you wish, but you may not ignore the original measurement accuracy.
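The processing described here (181 one-second samples, outliers beyond 3 sigma discarded, then a mean) can be sketched as follows. This is a guess at the general shape of such a procedure, not NOAA's actual code; the function name `six_minute_value`, the single rejection pass, and the synthetic samples are all assumptions made for illustration.

```python
import random
import statistics

def six_minute_value(samples, n_sigma=3.0):
    """Return (mean, sigma, n_rejected) after one pass of n-sigma outlier rejection."""
    mean = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    # Keep only samples within n_sigma standard deviations of the first-pass mean.
    kept = [s for s in samples if abs(s - mean) <= n_sigma * sigma]
    return statistics.fmean(kept), statistics.pstdev(kept), len(samples) - len(kept)

# Synthetic data: 180 readings scattered around 4.6 plus one wild reading.
rng = random.Random(3)
samples = [4.6 + rng.gauss(0.0, 0.02) for _ in range(180)]
samples.append(6.0)

value, sigma, rejected = six_minute_value(samples)
print(f"6-minute value = {value:.3f}, sigma = {sigma:.3f}, rejected = {rejected}")
```

The sigma returned is simply the scatter of the retained samples. Whether and how it relates to the +/- 2 cm specification is the point under dispute in this thread.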

      • Kip,
        “You may have as precise a MEAN as you wish, but you may not ignore the original measurement accuracy.”
        You were wilfully misreading the advice from NOAA when you said:
        “The question and answer verify that both the individual 1-second measurements and the 6-minute data value represents a range of water level 4 cm wide, 2 cm plus or minus of the value recorded.”
        That wasn't what they said at all. They spelled it out:
        “Sigma is the standard deviation, essential the statistical variance, between these (181 1-second) samples.”
        That is statistical. It says that the probability of being in that range is about 68% (one standard deviation, assuming a normal distribution). You can’t get away from probability in that definition. And the probability reduces with averaging.

        And you have the wrong measurement accuracy. The only thing here that could be called that is the 1mm resolution of the instrument. The +-20mm is a statistical association between the water inside and that outside. It is a correlation, and certainly is capable of being improved by better sampling.

      • I don’t think one can determine sea level to closer than 20mm, as at least with the Pacific, there is always more chop than that.

      • Nick Stokes ==> The sigma portion of the answer refers to the chart shown in the essay as:

        When we look at a data record for the Battery, NY tide gauge we see something like this:
        Date      Time  Water Level  Sigma
        9/8/2017  0:00  4.639        0.092
        9/8/2017  0:06  4.744        0.085
        9/8/2017  0:12  4.833        0.082
        9/8/2017  0:18  4.905        0.082
        9/8/2017  0:24  4.977        0.18
        9/8/2017  0:30  5.039        0.121

        which is a tiny segment of the official tide gauge report for the Battery for the 8th Sept 2017.

        The rest of their answer is in specific answer to my specific question, exactly as quoted. I’ll forward you the whole email if it will help you understand. I have put NOAA’s portions in bold. This is an email thread, so the parts are in reverse time order, latest at the top.

        For the time being, here is the text of the email string:

        Dear Sir/Madam,

        Your message has been forwarded to my address for response.

        To address each of the questions:
        1. The sensor reports 1 second measurements to single mm units — reports to X.XXX mm. These measurements are spec’d to be accurate to +/- 2 cm.
        When we say spec’d to the accuracy of +/- 2 cm we specifically mean that each measurement is believed to match the actual instantaneous water level outside the stilling well to be within that +/- 2 cm range.

        That is correct, the accuracy of each 6-minute data value is +/- 0.02m (2cm) of the water level value at that time.

        3. Then the monthly means are calculated from all of these 6 minute (tenth hourly) averages?
        Or are there other steps in between …daily average mean sea level, weekly average, then monthly average?

        We do not calculate a “daily mean sea level” or “weekly mean sea level”.
        The data products generated from observed data include:
        – 6-minute interval data
        – hourly interval data
        – high / low tide data
        – monthly means

        Is there anyway to access the 1 second data via the internet?

        No. The 1-second data is collected within the internal data of the station and sensors. That 1-second interval data is used to compute the 6-minute interval data value. The 1-second data is not stored.


        I hope that this response has answered your questions.
        If not, please let me know.
        Should you have additional questions, please contact our office.
        Phone: (301)713-2815
        E-mail: Tide.Predictions@noaa.gov

        Regards,

        Todd Ehret
        Oceanographer
        User Services
        Center for Operational Oceanographic Products and Services
        Web: tidesandcurrents.noaa.gov

        On Tue, Oct 10, 2017 at 4:16 PM, Kip Hansen wrote:

        Mr. Ehret,

        Thank you for your very useful answer to my question and the link to the specs document. I have bolded my remaining questions below.

        As I understand the document:

        1. The sensor reports 1 second measurements to single mm units — reports to X.XXX mm. These measurements are spec’d to be accurate to +/- 2 cm.
        When we say spec’d to the accuracy of +/- 2 cm we specifically mean that each measurement is believed to match the actual instantaneous water level outside the stilling well to be within that +/- 2 cm range.
        If that last statement is incorrect in any way, please clarify.
        2. These 1 sec values, normally 181 of them centered on the tenth-hour, are averaged (after throwing out any 3 sigma outliers — which I assume are considered to be in error.) Creating the value recorded.
        3. Then the monthly means are calculated from all of these 6 minute (tenth hourly) averages?
        Or are there other steps in between …daily average mean sea level, weekly average, then monthly average?

        Is there anyway to access the 1 second data via the internet?

        It is very helpful to have such a responsive agency.

        Thanks again,

        Kip Hansen
        kiphansen@gmail.com

        On 10/10/2017 9:51 AM, Co-ops userservices – NOAA Service Account wrote:
        > Dear Sir/Madam,
        >
        > Your message has been forwarded to my address for response.
        >
        > Sigma — Standard deviation of 1 second samples used to compute the water level height
        >
        > As is noted in the CO-OPS Sensor Specifications and Measurement Algorithms, each 6-minute data value is an average of the 181 1-second data samples collected, centered on that time. The Sigma is the standard deviation, essential the statistical variance, between these samples.
        >
        > I hope that this response has answered your questions.
        > If not, please let me know.
        > Should you have additional questions, please contact our office.

        On Mon, Oct 9, 2017 at 10:56 PM, Kip Hansen wrote:

        Sirs,
        I have a question I can’t resolve.
        I am looking at this page:

        https://tidesandcurrents.noaa.gov/waterlevels.html?id=8518750&units=standard&bdate=20171008&edate=20171009&timezone=GMT&datum=MLLW&interval=6&action=
        which shows this image (cropped here):
        [tide chart for the Battery]

        Then look at the .csv file for the same data with this extract:

        CO-OPS__8518750__wl.csv

        Date       Time   Water Level  Sigma
        10/9/2017  12:00  3.015        0.174

        The “water level” shown is, of course, the Preliminary shown in the image.

        What is the “Sigma”? Is this meant to be the “standard deviation”? If so, of what data set? Or an indicator of the error bar? a confidence interval?

        Appreciate the clarification. I did try to find the answer on the pages describing the data products but was unsuccessful.

        Thank you,

        Kip Hansen
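The processing described in the email thread above (181 one-second samples, 3-sigma outliers discarded, the remainder averaged, and a Sigma reported alongside) can be sketched roughly as follows. This is only an illustration under stated assumptions, not NOAA's actual algorithm: the function name `six_minute_value`, the single rejection pass, the use of the population standard deviation, and the synthetic samples are all assumptions made here.

```python
import random
import statistics


def six_minute_value(samples, n_sigma=3.0):
    """Average 1-second samples after one pass of n-sigma outlier rejection.

    Returns (mean of kept samples, standard deviation of kept samples,
    number of samples kept). Whether the real system uses one pass, and
    which samples its reported Sigma covers, is an assumption here.
    """
    mean = statistics.fmean(samples)
    sd = statistics.pstdev(samples)
    kept = [s for s in samples if abs(s - mean) <= n_sigma * sd]
    return statistics.fmean(kept), statistics.pstdev(kept), len(kept)


# Synthetic data only: 180 readings scattered around 3.0 plus one spurious spike.
rng = random.Random(0)
samples = [3.0 + rng.gauss(0.0, 0.05) for _ in range(180)] + [10.0]

level, sigma, n_kept = six_minute_value(samples)
print(f"water level = {level:.3f}, sigma = {sigma:.3f}, samples kept = {n_kept}")
```

Note that the Sigma computed this way describes the scatter among the one-second samples. It is a different quantity from the +/- 2 cm accuracy specification discussed in the emails.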

      • Kip,
        Yes, it’s clear in that expanded string that their reference to sigma was to the variation of six-minute readings, not the 2cm estimate. Sorry to have misunderstood that. But I still think the 2cm (or 5mm) measures have to be considered as standard deviations. That is normal notation and practice. I see that the NOAA guy didn’t explicitly contradict you on that, but I don’t think he was confirming. Here is what NIST says about the expression of measurement uncertainty:

        “Standard Uncertainty
        Each component of uncertainty, however evaluated, is represented by an estimated standard deviation, termed standard uncertainty with suggested symbol ui, and equal to the positive square root of the estimated variance”

      • Nick ==> It isn’t that I don’t understand that there is such a thing as “estimated standard deviations” or that they use them to make a standard statement about variation in measurements (of a static quantity measured many times).

        If NOAA CO-OPS had meant “ui” or “SD” or “1 sigma” or some other thing, then I would expect the specification sheet, and the support team there, to say so, and not repeatedly use the term “accuracy”.

        I hope you realize that I am not arguing against the concept, when used in its proper place.

        It simply can not be applied to a non-static, constantly changing, continuous variable measured at many different times with results reported knowingly as a range. The range used is the original measurement uncertainty and applies to all subsequent calculations.

      • Nick’s examples are exercises in the world of sampling statistics where his probabilities are fixed to some theoretical distribution (he’s using the normal distribution a lot) the parameters of which depend massively on the size of each sample. That should be the first clue as to how ‘magic’ the error reduction gets when he increases the size of the sample. Also enlightening is that these kinds of exercises assume you take the same-sized sample each time. The next step in his lecture series should now be on how one comes up with the parameters for a distribution of sample means when the sizes of the samples differ. At least that will make this more applicable to the tidal gauge / temp measure issue. But one still cannot overcome the limits of the observation. Kip doesn’t care about sampling error.

      • “he’s using the normal distribution a lot”
        No, I’m not at all. The only theoretical number I used was the sd of the uniform distribution (sqrt(1/12)). It’s true that the central limit theorem works, but I’m not relying on it. It’s just using additivity of variance, and that will work equally for different sized samples, and for non-normal distributions.

      • Nick. When one invokes the central limit theorem one invokes the Normal distribution, because the latter is used to approximate the distribution of sample means. The standard deviation of that distribution gets smaller and smaller as you increase the size of each sample, and the shape of that distribution will look more and more Normal as you increase the number of samples. In a single sample, the standard error of the mean (SEM) is a sample estimate of the standard deviation described above. It is different from the sample standard deviation (SD). By itself, the SEM is a long-run estimate of the precision of the means of lots of samples. Both the SD and the SEM vary from sample to sample due to, at the very least, random sampling error. Under ideal circumstances, the sample mean is an unbiased estimator of the population mean. Under those circumstances, the sample mean will still not hit the population mean (because of random sampling error), and the SEM provides an expectation of how closely the sample means should cluster together if you took a pile of additional samples each the same size as the first. Again. Precision. The mean of a sampling distribution of means will equal the population mean if both are distributed Normal. The central limit theorem is invoked to assume that the distribution of sample means is Normal, even though the samples are drawn from a population that is non-Normal. This sets up valid null hypothesis tests that concern the means of sampling distributions of means and, say, a single sample mean. It does not necessarily allow for unbiased estimation of the population mean using the mean of the sampling distribution of means, let alone our lonely single sample mean. So you are invoking the Normal distribution, a lot, when you refer to the central limit theorem. You’re dealing with sampling distributions.

      • “Nick. When one invokes the central limit theorem”
        But I didn’t. I observed its effect. All my example did was to take means of various samples of daily maxima, and compare with the identically calculated means of data that had been rounded to integer values. No assumptions. The differences were small. I compared them to what is expected by additive variance (it matched) but that is not essential to the conclusion. I showed that the difference in means was nothing like the 0.29°C effect of rounding on individual data.

        But in all this, I haven’t heard you support Kip’s view that measurement error passes through to the sample mean without reduction. You do say that it somehow adds differently, but don’t say how. How would you calculate the effect of observation uncertainty on the mean?
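The rounding comparison described in the comment above can be imitated on synthetic numbers. The sketch below does not use the Melbourne record: the normally distributed "daily maxima", the seed, and the sample size of 3000 are assumptions chosen only for illustration. It compares the mean of readings kept to one decimal place with the mean of the same readings rounded to whole degrees, and prints the spread that additivity of variance would suggest for the difference.

```python
import math
import random
import statistics

rng = random.Random(42)
n = 3000

# Synthetic "daily maxima" recorded to 0.1 degree (not real station data).
readings = [round(rng.gauss(26.0, 4.0), 1) for _ in range(n)]

# The same readings degraded to whole degrees.
rounded = [float(round(x)) for x in readings]

mean_1dp = statistics.fmean(readings)
mean_0dp = statistics.fmean(rounded)
diff = mean_0dp - mean_1dp

# sd of a unit-width uniform rounding error is sqrt(1/12), roughly 0.29;
# for n independent rounding errors the sd of their average is that over sqrt(n).
expected_sd = math.sqrt(1.0 / 12.0) / math.sqrt(n)

print(f"mean at 1 dp    : {mean_1dp:.4f}")
print(f"mean at 0 dp    : {mean_0dp:.4f}")
print(f"difference      : {diff:+.4f}")
print(f"expected sd     : {expected_sd:.4f}")
```

This only addresses the effect of rounding on the mean. It says nothing by itself about an instrument error that is shared by every reading, which is the point disputed elsewhere in the thread.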

    • But this confuses the issue completely. The posting is not about removing the error from rounding, but from uncertainty in measurement. Your argument is utterly irrelevant to the question at hand. The post is addressing the physical fact that using a ruler that only measures accurately in millimetres twice won’t make it give you a measurement in picometers. You can’t use a high school ruler a million times to measure the size of an atom. Measurement accuracy does not improve with repeated samples.

    • Nick,

      You are missing Kip’s point. His assertion is that your January reading should be 26.0478 +/- 0.1.

      • +-0.05, I think. And he would assert that after rounding it should be +-0.5. But it clearly isn’t. I actually did it, for 12 different months. And nothing like that error is present in the means.

      • @Nick Stokes

        You’re still missing the point. Why would the error be present in the means? There is no there there to begin with, in the means or otherwise. How can you say something is or isn’t present if it was never measured in the first place?

        We are not discussing errors in means, we are discussing errors in measurement.

    • “Nick Stokes October 14, 2017 at 11:51 pm
      Kip,
      You do have over a century of scientific understanding against you. And you give almost no quantitative argument. And you are just wrong. Simple experiments disprove ”

      Pure hand waving, Nick.

      Explain how century old temperatures, eyeball read from mounted shaded thermometers can be added to modern, never certified or recertified for accuracy, temperature thermistors?

      Then an alleged average calculated out to four decimal places? Which by sheer absurdity only appears accurate.
      e.g. Jan maxima average is 26°C, period.

      Calculation of an alleged four decimal place version and/or difference does not represent greater accuracy than January’s 26°C.
      It is all pretense, not reality.

      Then you want everyone to accept that mishandling a century of data accurately represents the entire and all potential weather cycles?

      Hand waving, Nick.

      • “Nick Stokes October 15, 2017 at 10:09 am
        “Hand waving”
        No, it’s an introduction to a concrete example with real data.”

        Real data!?
        You call four decimal place numbers from “0.n” maximum 1 decimal place physical measurements, “real data”?

        That claim is a mathematical shell game using an imaginary pea.
        Yes, you are hand waving.

      • “You call four decimal place numbers from…”
        No, I call them calculated results. I need the decimals to show what the difference is. But the robustness of the calculation. To at least two decimals, you get the same result if you reduce data from 1 dp to 0dp.

      • “Nick Stokes October 16, 2017 at 1:10 am
        “You call four decimal place numbers from…”
        No, I call them calculated results. I need the decimals to show what the difference is. But the robustness of the calculation. To at least two decimals, you get the same result if you reduce data from 1 dp to 0dp.”

        You claim false value for your imaginary four decimal places.
        Nor can you prove four decimal place value when using integers and single decimal place recorded numbers as data.

        You use “robustness” just as the climate team does when they’re skating bad research or bad mathematics past people.

    • Nick Stokes ==> So let me get this right — you are saying that it does not matter at all what the original measurement accuracy is, because “Long Division will always reduce inaccuracies in measurement to negligible sizes if we just make a sufficient number of inaccurate, vague measurements.”

      If we measured tide gauge water level only to the nearest foot (or meter), would you still like to insist that we can derive mean sea level to millimetric precision and accuracy? If so, why not go for an even tinier number, say 10,000ths of a meter? How low can you go with this? How about if we measured temperature to the nearest 10 degrees? Still get a perfectly defensible mean to hundredths of a degree?

      Is your claim that measurement accuracy means nothing if you just have enough numbers to churn?

      • Get an eight foot pole that has markings at 1,2,3….8 feet.
        Use this pole to measure 10,000 adult American males randomly selected. Each measurement is to the nearest foot.

        When you sum all the measurements it will be roughly 58300 to 58400.

        When you divide the sum by 10,000, you’ll get 5.83 to 5.84

        Congratulations, you just measured the average height of an American male to less than the nearest inch. Pretty amazing considering your pole only has markings at one foot intervals!!!
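The pole procedure above can be tried numerically. The sketch below is a simulation, not a measurement: it assumes heights are normally distributed with mean 5.83 ft and standard deviation 0.4 ft, and that the pole reads the nearest whole foot with perfect calibration. The seed and variable names are illustrative.

```python
import random
import statistics

rng = random.Random(1)
n = 10_000

# Assumed population: normal, mean 5.83 ft, sd 0.4 ft (illustrative only).
heights = [rng.gauss(5.83, 0.4) for _ in range(n)]

# Each person is recorded to the nearest foot mark on the pole.
pole_readings = [round(h) for h in heights]

total = sum(pole_readings)
mean_pole = total / n
mean_true = statistics.fmean(heights)

print(f"sum of pole readings : {total}")
print(f"mean from pole       : {mean_pole:.3f} ft")
print(f"mean of true heights : {mean_true:.3f} ft")
print(f"difference           : {(mean_pole - mean_true) * 12:+.2f} inches")
```

The result depends on the spread of heights being comparable to the spacing of the marks and on the pole being correctly calibrated. Both are assumptions of the sketch.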

      • Mark S Johnson,

        Well if you have any stock in companies that manufacture highly accurate and highly precise measuring instruments you had better sell it. You have just let the cat out of the bag that anyone can get by with much cheaper, crude instrumentation if they just measure 10,000 samples.

        Based on your remarks, I don’t believe that you have read my article that preceded the one Kip cited. Let me then share a quote from it:
        “Furthermore, Smirnoff (1961) cautions, ‘… at a low order of precision no increase in accuracy will result from repeated measurements.’ He expands on this with the remark, ‘…the prerequisite condition for improving the accuracy is that measurements must be of such an order of precision that there will be some variations in recorded values.’” But, most importantly, you must be measuring the same thing!

      • Clyde Spencer: “Well if you have any stock in companies that manufacture highly accurate and highly precise measuring instruments you had better sell it. You have just let the cat out of the bag that anyone can get by with much cheaper, crude instrumentation if they just measure 10,000 samples.” When you are measuring the height of only one person, 10,000 samples are going to agree, and be up to 6 inches off with 95% chance of being up to 5.7 inches off when done with Mark S. Johnson’s 8-foot pole with perfect calibration and resolution of 1 foot. But if you are looking for an average height among 10,000 persons, Mark S. Johnson’s measuring pole can determine that with a much smaller +/- with 95% confidence. And if Mark S. Johnson’s pole has all of its markings being incorrect by the same amount or the same percentage, it can still be used to track growth or shrinkage of a large random population to the nearest inch if that changes by more than an inch, with high confidence.
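The contrast drawn above, between measuring one person many times and averaging over a population, can be illustrated with a small sketch. It also touches the condition in the Smirnoff quote, that recorded values must show some variation before repetition helps. The single height of 5.83 ft, the added random variation of 0.5 ft, and the helper `read_pole` are hypothetical values chosen for illustration.

```python
import random
import statistics


def read_pole(height_ft):
    """Reading from a pole marked only at whole feet (nearest mark)."""
    return round(height_ft)


person_ft = 5.83  # hypothetical single person
n = 10_000

# Case 1: the same person measured repeatedly, with no variation at all.
static = [read_pole(person_ft) for _ in range(n)]
static_mean = statistics.fmean(static)

# Case 2: the same person, but each reading is disturbed by random
# variation comparable to the mark spacing (an assumption, sd 0.5 ft).
rng = random.Random(3)
varied = [read_pole(person_ft + rng.gauss(0.0, 0.5)) for _ in range(n)]
varied_mean = statistics.fmean(varied)

print(f"no variation   : mean {static_mean:.3f} ft, error {static_mean - person_ft:+.3f} ft")
print(f"with variation : mean {varied_mean:.3f} ft, error {varied_mean - person_ft:+.3f} ft")
```

In the first case every reading is identical, so the average is no better than a single reading however many are taken. In the second case the average can land closer to the true value, but only because the variation is random and centred on it.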

      • Is your claim that measurement accuracy means nothing if you just have enough numbers to churn?

        It is a question of quantisation or resolution, i.e. precision, not accuracy. You should not use the two terms interchangeably. They have precise and different meanings.

        It is not that the precision “means nothing” but less precision can be compensated by more readings.

      • Mark S Johnson writes

        Congratulations, you just measured the average height of an American male to less than the nearest inch.

        Except you’re an inch out on the true average and you couldn’t do it at all if the markings were at 3 foot intervals. You seem to want to ignore the measurements themselves when arguing how accurate you can be. It’s a fatal mistake.

      • Mark S Johnson October 15, 2017 at 12:42 pm
        “Congratulations, you just measured the average height of an American male to less than the nearest inch. Pretty amazing considering your pole only has markings at one foot intervals!!!”

        What’s even more amazing is that you also got the height of Australian males to the nearest inch. I’m really impressed.

      • Well if you have any stock in companies that manufacture highly accurate and highly precise measuring instruments you had better sell it.

        Too late. I helped develop such a system in 1995 at an electronics test and measurement company. The technique was developed many decades before that but only became economically viable in the 1990s due to the newer CMOS manufacturing capabilities.

        Currently I have a Burr-Brown 24-bit ADC (59 ppb precision) with a 1 bit (+/- 50%) sampler in my stereo pre-amp. It sounds so good I run my analog record player through it. In 1995 we were happy to get 18 bits using the same technique for a digital multi-meter.

        Your 1-foot interval for the American male population won’t work because the signal (actual heights) doesn’t vary by more than a foot. However, if you want 1/10th of an inch precision then measuring each male to 1-2 inches precision is quite sufficient. Just make sure when you calibrate your stick you calibrate your 1 inch tickmarks to 1/10th of an inch precision.

        Peter

      • Peter Sable says: “Your 1-foot interval for the American male population won’t work because the signal (actual heights) doesn’t vary by more than a foot. ”

        Nope, it will work because there are 6foot 3 inch males in the population, and there are 5 foot 2 inch males in the population. There are even some 4 foot 4 inch males and some 7 foot inch ones.

        The key fact you don’t understand is that some males will be smaller than 5 foot 6 inches, and some will be larger. It’s the relative proportion of each that determines the average.

      • I agree with Peter there. Calculating the average is trying to estimate ∫hP(h) dh where h is height, P is the pdf. The coarse ruler is like trying to evaluate the integral with quantiles. You can get a good approx with 1″ intervals, which is less than 1/10th of the range. But when you get intervals close to the range, the integration is likely inaccurate.

      • Nope, it will work because there are 6foot 3 inch males in the population, and there are 5 foot 2 inch males in the population. There are even some 4 foot 4 inch males and some 7 foot inch ones.

        There aren’t enough in the population sample to span the range of 1 foot. You are right that if you happen to know the exact mean of the population you could use an “are you taller or shorter” measurement and estimate the mean from that.

        For an analog input signal to a 1-bit DAC it’s possible to know (or rather calibrate) the true mean of the population, and then the proportion gives you the sample average as you indicate. I don’t think you know that mean a priori with a population. Also, your population had better have an even distribution. I suspect there are more 6’6″ males in the population than 4’6″ males.

        When the variance of the signal approaches the precision of the instrument, then the devil is in the details. We’re talking about 1degC precision with a 10degC diurnal variation, so not apples-apples to your yardstick example.

      • Nick & Peter….

        Sorry to inform both of you, but, the numerical PROPORTION of 5 foot measures to 6 foot measures will contribute the most to determine the average when the sum of the measures is divided by 10,000. There will be some 4-foot measurements, and there will be some 7 foot measurements, but their numbers will be relatively small.

        What makes any argument against my “8 foot pole” example fail, is that we know prior to executing my procedure, what the average is. Also known is how height is distributed. With these two facts, you will have a hard time showing my example failing.

      • Peter, the analogy of DAC is inappropriate. DAC sampling does not measure a population mean. It approximates an instantaneous value which is the antithesis of a population value.

      • OK, I tried it, and Mark’s method did still do well, with 1′ intervals. I assumed heights normally distributed, mean 5.83, sd 0.4. Centered, the expected numbers were

        Bin center (ft)   4.5    5.5    6.5    7.5
        Expected count    190    6456   3337   17

        Weighted average is 5.818, so it is within nearest inch.
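The expected counts and weighted average quoted above can be recomputed from the stated assumptions (normal heights, mean 5.83 ft, sd 0.4 ft, one-foot bins represented by their centers). The bin edges at 4, 5, 6, 7 and 8 ft and the helper `normal_cdf` are assumptions made here to match the "centered" values; heights outside 4 to 8 ft are ignored as negligible.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


mean_ft, sd_ft, n = 5.83, 0.4, 10_000
edges = [4, 5, 6, 7, 8]  # one-foot bins, assumed from the bin centers quoted

counts = {}
for lo, hi in zip(edges[:-1], edges[1:]):
    p = normal_cdf(hi, mean_ft, sd_ft) - normal_cdf(lo, mean_ft, sd_ft)
    counts[lo + 0.5] = n * p  # expected number of people in this bin

weighted_mean = sum(center * c for center, c in counts.items()) / sum(counts.values())

for center, c in counts.items():
    print(f"bin center {center:.1f} ft: expected count {c:7.1f}")
print(f"weighted average: {weighted_mean:.3f} ft")
print(f"offset from assumed mean: {(weighted_mean - mean_ft) * 12:+.2f} inches")
```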

    • You are missing the point. What is the uncertainty of each of the daily maxima? Run your averages where the measurements are all at the top of range of uncertainty and then again when they are all at the bottom of the range. Now tell us what the “real” value is. If there are uncertainties, you just can’t assume the middle of the range is the correct reading.
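The check proposed above, averaging once with every reading at the top of its uncertainty range and once with every reading at the bottom, is simple to write down. The readings and the half-width `u` below are hypothetical. The sketch shows the bounding case only, in which every reading is off by the full amount in the same direction; it makes no statement about how likely that case is.

```python
import statistics

# Hypothetical daily maxima and a hypothetical +/- half-width for each reading.
readings = [25.3, 26.1, 27.4, 24.8, 26.9]
u = 0.5

mean_mid = statistics.fmean(readings)
mean_low = statistics.fmean([r - u for r in readings])   # all readings at bottom of range
mean_high = statistics.fmean([r + u for r in readings])  # all readings at top of range

print(f"mean of recorded values : {mean_mid:.2f}")
print(f"all-low bound           : {mean_low:.2f}")
print(f"all-high bound          : {mean_high:.2f}")
print(f"width of bounds         : {mean_high - mean_low:.2f}")
```

The bounds on the mean are exactly as wide as the bounds on a single reading. Whether those bounds or a narrower probabilistic interval is the right description of the mean is the question the thread is arguing about.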

    • Nick, we already went through this once and you haven’t learned how this works.

      “As you’ll see, despite the loss of accuracy in rounding (To 0 dp), the averages of those 100 years, about 3000 days, does not have an error of order 1. In fact, the theoretical error is about 0.28/sqrt(3000)= 0.0054°C, and the sd of the differences shown is indeed 0.0062. 0.28 is the approx sd of the unit uniform distribution.”

      You are making the same mistake as last time – you are leaving out the uncertainty of the readings, and treating them as if they are gold. You have calculated the centre of the range of uncertainty and called your construct the ‘theoretical error’. The uncertainty of each reading is 20mm up or down and you have shown nothing that reduces it.

      You have provided an SD based on the data, but forgot to add the uncertainty for each reading, for which a different formula applies. You are trying to sell the idea that 3000 readings makes the result ‘more accurate’. The accuracy of the result is determined (only) by the instrument, which is why we rate the accuracy of instruments so we can pick one appropriate for the task at hand. You can’t just leave out the instrumental uncertainty because you have 3000 readings. They are 3000 uncertain readings and that uncertainty propagates.

      It is a surprise to me that so many contributors do not understand this. Kip wrote it out in plain bold letters: measuring 1000 things once each with an inaccurate instrument does not provide a less-inaccurate result. That is the property of measurement systems – uncertainties propagate through all formulae including the one you show.

      Measuring with a plus-minus 20mm tide gauge 1000 times over a 4000mm range does not provide an average that is known to better than plus-minus 20mm because that is the accuracy of the readings. Any claim for a more accurate result is false.

      If you used the same equipment to measure the water level in a lake with waves on it, knowing that the level does not change, is a different matter in terms of how stats can be applied because that is taking multiple measures of the same thing with the same instrument. That still wouldn’t increase the accuracy, but the stats that can be applied are different. It certainly wouldn’t make the result more precise either because the precision remains 1mm. Your formula estimates quite precisely where the centre of the error range is located. Nothing more. The ‘real answer’ lies somewhere within that range, not necessarily in the middle as you imply. That is why it is called a “range”.

      • Crispin (wherever you are) ==> It nearly brings tears to my eyes to see that someone understands the issue so clearly.

        Yours:

        The uncertainty of each reading is 20mm up or down and you have shown nothing that reduces it.
        You have provided an SD based on the data, but forgot to add the uncertainty for each reading, for which a different formula applies.

      • Crispin
        “You have calculated the centre of the range of uncertainty and called your construct the ‘theoretical error’. The uncertainty of each reading is 20mm up or down and you have shown nothing that reduces it.

        You have provided an SD based on the data, but forgot to add the uncertainty for each reading, for which a different formula applies.”
        My example was of temperatures in Melbourne. But how do you “add the uncertainty”? What different arithmetic would be done? There seems to be a view that numbers are somehow endowed with original sin, which cannot be erased and has to be carried in the calculation. But how?

        In fact all my example did was to take a set of readings with high nominal precision, sacrifice that with rounding, and show that the average so calculated is different to a small and predictable extent. Any “original sin” derived from measurement uncertainty would surely be swamped by the rounding to 1C, or if not, I could round to 2C, still with little change. If the exact readings could have been obtained, they would be a very similar series before rounding, and would change in the same way.

        One test of these nonsense claims about irreducible error is to actually calculate a result (protagonists never do) and show the error bars. They will extend far beyond the range of the central values calculated. That does not make nonsense of the calculation. It makes nonsense of the error bars. If they claim to show a range over which the thing calculated could allegedly vary, and it never does, then they are wrong.

      • Nick, the errors at the different levels (observation vs. random sampling) will sum to give you the true estimate of error. If the errors are correlated (unlikely) then they sum but are also influenced by the direction and magnitude of the correlation between them. It is like Kip said, this isn’t typical undergrad stats, unfortunately (which is more a dig at oversimplified undergrad stats).

      • “Nick, the errors at the different levels (observation vs. random sampling) will sum to give you the true estimate of error. “
        So how would you sum the observation errors? Say they amount to 0.5C per observation. Why would that sum differently than, say, 0.5C of rounding?
        Kip wants to say that 0.5C observation error means 0.5C error in mean of 1000 observations. Do you believe that?

      • No, Nick, Kip Hansen is stating that the average does not mean anything without an error band of .5C., if the data going into the average had that error band.

      • Nick. Kip already mentioned it. The errors are essentially fixed, the observations finite and known. Therefore the SD will be +/- 0.5. (was it cm?) Var=(n/n)E{0.5^2}. SD = Var^0.5. This is your first level variance. Sum it with variance from each additional level of estimation. With all the different sites of measuring water level, each probably exposed to different factors which probably overlap sometimes from site to site, I would guess that sea level would be considered a random effect if this were a meta analysis. Variability (precision) within each site and variability in sea level between sites would need to be taken into account as well in order to get the ‘true’ uncertainty in the uber average.

      • RW,
        “Var=(n/n)E{0.5^2}”
        Do you mean 1/n? I can’t figure the second term, but it sounds a lot like you’re agreeing with Mark Johnson and me that the std error of the mean drops as sqrt(1/n). What you’re saying doesn’t sound at all like Kip’s
        ” the means must be denoted with the same +/- 0.5°F”

        And what do you make of Kip’s insistence that ranges, not moments, are what we should be dealing with?

      • Nick. Yes 1/n like you are thinking but because the error is 0.5 for each observation the equation becomes n/n …0.5^2 ‘n’ times…I just pulled the n out of the summation (‘E’) per summation rules to make it easier for you to see that it has no effect at that level. We are back to what Kip said originally. We have also established that the 0.5 +/- is a standard deviation as I think was said by someone already (you?).

        The SEM is not SD/(n-1)^0.5 as someone else wrote, it is simply SD/n^0.5 . The n-1 only comes with the calculation of sample variance. Here, we use n for variance because we have the population of observations. We are not generalizing to a population of observations.

        “because the error is 0.5 for each observation the equation becomes n/n …0.5^2 ‘n’ times…I just pulled the n out of the summation (‘E’) per summation rules to make it easier for you to see that it has no effect at that level. “
        You’ll need to spell that out in more detail. If you are summing n variances, the summands are, after scaling by the 1/n factor of the average, (0.5/n)^2. So the thing in front should be (n/n^2).

        As for “We are back to what Kip said originally.”, no, Kip is very emphatic that 0.5 is not a sd, and we should not think of probability (what else?):
        “In scientific literature, we might see this in the notation: 72 +/- 0.5 °F. This then is often misunderstood to be some sort of “confidence interval”, “error bar”, or standard deviation.”
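The scaling argument in the comment above (each summand carries the factor 1/n from the average, giving n/n^2 in front) can be written out as follows. It treats the 0.5 as a standard deviation σ and assumes the n observation errors are independent; both are assumptions of this sketch, and the first is exactly what is disputed in the thread. The last line gives the standard result for equally correlated errors with correlation ρ, included only to show how the two positions relate.

```latex
% Independent errors, each with standard deviation \sigma = 0.5
\operatorname{Var}(\bar{x})
  = \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)
  = \frac{1}{n^{2}}\sum_{i=1}^{n}\sigma^{2}
  = \frac{n}{n^{2}}\,\sigma^{2}
  = \frac{\sigma^{2}}{n},
\qquad
\operatorname{SD}(\bar{x}) = \frac{\sigma}{\sqrt{n}} = \frac{0.5}{\sqrt{n}}

% Equally correlated errors with pairwise correlation \rho
\operatorname{Var}(\bar{x}) = \frac{\sigma^{2}}{n}\,\bigl[\,1 + (n-1)\rho\,\bigr]
```

With ρ = 0 the second expression reduces to σ²/n, and with ρ = 1 (every reading sharing the same error) it stays at σ² however large n becomes.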

      • Nick ==> Do you think there is anything you and I can agree on on this very narrow specific point? If so, pass it by me.

      • Kip,
        I think no agreement is possible because you reject probability as a basis for quantifying uncertainty, and I insist there is nothing else. People here like quoting the JCGM guide; here is one thing it says:

        3.3.4 The purpose of the Type A and Type B classification is to indicate the two different ways of evaluating uncertainty components and is for convenience of discussion only; the classification is not meant to indicate that there is any difference in the nature of the components resulting from the two types of evaluation. Both types of evaluation are based on probability distributions (C.2.3), and the uncertainty components resulting from either type are quantified by variances or standard deviations.

        You like intervals. But
        1) meaningful intervals rarely exist in science. Numbers lie within a range as a matter of probability; extremes of any order can’t be ruled out absolutely. If an interval is expressed, it is a confidence interval, perhaps implying that the probability of going beyond can be ignored. But not zero, and the bounds are arbitrary, depending on what you think can be ignored, which may differ for various purposes, and may be a matter of taste.
        2) Intervals do not combine in the way you like to think. Science or Fiction set out some of the arithmetic, as did I and others. When you combine in an average, the only way the ends of an interval can stay populated is if all the measures are at that end. So it is one-sided, and takes an extraordinary coincidence.

        You don’t have absolutes in science. Heisenberg insists that you might be on Mars. All the oxygen molecules in your room might by chance absent themselves. One does not think about these things because the probabilities are extremely low. But you can’t get away from probability.

        The practical problem with your musings is that they describe a notion of uncertainty which is not that of a scientific audience, as the JCGM note shows. So it doesn’t communicate. I also believe that it just isn’t one that you could quantify or use systematically. That is what StatsFolk have learnt to do.

      • Nick ==> Well, I tried.

        I wonder what’s wrong with me and all those engineers and other scientists that agree with me?

    • This shows nothing aside from how the number of significant digits you use has little influence on the standard deviation of a sample of sample means (i.e. the standard error of the mean). You are talking inferential sample statistics. All the gains you are referring to combat random sampling error. The post concerns uncertainty in the measurements themselves. These are different things. The former is hugely helped by taking more samples and/or increasing the n in each sample, whereas the latter is not overcome by this.

      • “You are talking inferential sample statistics. All the gains you are referring to combat random sampling error. The post concerns uncertainty in the measurements themselves. These are different things. “
        They are. And the post is talking about the wrong one. In climate, many different kinds of measurement are combined. The post imagines that somehow the measurement uncertainty of each aligns, and can be added with no effect of cancellation. It doesn’t explain how.

        There may indeed be some alignment; that would create a bias. An example is TOBS. People make great efforts to adjust for changes in that.

      • Nick writes

        There may indeed be some alignment; that would create a bias. An example is TOBS.

        Another might be how the satellite chases the tidal bulge around the earth when doing sea level measurements such that month averages have biases.

  9. Is temperature truly infinite in the continuum like time is, or does it have quanta associated with it like radiation?
    I found a sample document to read, but I haven’t extensively studied quantum mechanics yet :
    TEMPERATURE IN QUANTUM DYNAMICS
    Alessandro Sergi

    ABSTRACT. What is the meaning of the thermodynamical temperature in quantum mechanics? What is its role in the classical limit? What can we say about the interplay between quantum and thermal fluctuations? Can we impose a constant-temperature constraint within dynamical simulations on quantum systems as we do in simulations of classical systems?

    https://www.scribd.com/mobile/document/40884849/Temperature-in-quantum-mechanics

    • You are talking at a theoretical level there; you don’t measure at a theoretical level, you measure with an instrument. The instrument has its own characteristics, which often don’t precisely match the quantity being measured, and it will shock many on this site, because of their level of science, that temperature is one of them.

      So let’s do this as basic as we can: in quantum mechanics, temperature is a “made up” statistic; you can’t isolate it as a pure quantity. You actually need to combine several real quantities in QM to make what you measure as temperature. Temperature in the classical sense was something that made a fluid inside a tube move up or down past some marks on the device. Later it got turned into roughly the movement speed of the molecules in classical physics. The problem comes with QM that you can have movement and momentum which can’t be isolated to our 3-dimensional world but can be shown to be QM quantities.

      So what the article is dealing with is you need to be very careful when trying to have temperature arguments in QM because you need to clearly isolate what you are calling temperature, it isn’t clear cut like in the classical sense. You see this in that QM can take temperatures below absolute zero, they aren’t breaking the laws of physics it’s just the thing you call temperature isn’t a pure thing and they are showing that by using QM techniques.

      All of that is outside what is being discussed, you have a device which is measuring classical temperature. I am sort of having fun watching all sides try and follow thru the argument. No one has got it completely right and there is a big thing missing which is discussion of the measurement device itself.

      I hope that explaining the QM basics first makes the parties aware that they need to think about the device. The article looks at the Sea Level device and it is on the right track. Nick, Rick and a few others are coming at it from statistics but they haven’t thought about the device itself. Kip is right in asking the question of whether you are entitled to use the statistics; you need to work that out for yourself, along with what the underlying assumptions become.

      • “All of that is outside what is being discussed”
        Kip (the author) calls temperature a continuum, using the word infinity or infinite. The article I linked to mentions kinetics as part of what temperature is at the atomic level which therefore indicates to me that temperature is indeed a continuum and not discrete (quanta / quantum).

        I have a high interest in knowing the extreme details of temperature constructs because of my work involved in the Wattson Project, which isn’t scheduled for public introduction until January 2019.

      • The thing you are calling temperature is a continuum in classical physics. It is not anything in QM; it is a made up thing to match what you measure in classical physics. I can’t be any more blunt.

        There is nothing to understand about temperature in QM; it is simply a construct of some quantities to match what classical physics describes. If you like, it is like trying to measure a rainbow.

  10. “[ Note: In a separate email, it was clarified that “Sigma is the standard deviation, essential the statistical variance, between these (181 1-second) samples.” ]”

    My guess is this isn’t as odd as it sounds. But “essential” was probably intended to be “essentially”? And the standard deviation is the square root of the variance? My guess is that “variance” really should have been the less rigorous term “variation”. Note that the statistical property “variance” has units of the variable under discussion squared and is often a disturbingly large number compared to the actual size of the errors.

    • Don ==> Yes, I should have added “(sic)” to the quote of the email — the correct word should have been “essentially”. The point of including this statement was for clarity. NOAA CO-OPS notes the sigma of the six minute readings as very small, but the “accuracy” as +/- 2 cm, for the same recorded “measurement”.

  11. I don’t want to be pedantic, but this is a subject that I taught to laboratory technicians and engineers for many years.

    Sorry MSJ, Kip is correct and you are wrong. A measurement uncertainty specification states the size of an interval around the indicated value within which the true value is thought to lie. Properly stated, the MU specified should indicate the distribution type used to determine it (i.e. normal, rectangular or triangular) and a confidence level; see the ISO Guide to the Expression of Uncertainty in Measurement (GUM). It is typically 2 times the standard uncertainty derived from calibration comparisons to a primary reference along with evaluation of additional sources of MU. There is always more than one.

    When multiple measurements of something are made with a precise enough instrument they will invariably differ by some amount. The differences are considered random error and this can be reduced by averaging a number of measurements. But the random error is a source of uncertainty that is in addition to the instrument MU.

    So, if I take 100 measurements with a +/- 2 cm instrument and get an average of 50 cm with a standard deviation of 1 cm the overall MU is +/- 2.32 cm at a 95% confidence level. [Note: there is math involved in this calculation: MU = (sqrt((2/sqrt(3))^2 + (1/sqrt(n))^2))*2].

    In short, no matter how many measurements you make, the MU of the average is always greater than the MU of your instrument.
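The arithmetic in the note above can be reproduced with a short sketch. The function and parameter names are invented here for illustration, and reading the division by sqrt(3) as a rectangular distribution of half-width 2 cm is an interpretation of the quoted formula, not something the comment states outright.

```python
import math

def expanded_mu(instrument_half_width, sample_sd, n, coverage_factor=2.0):
    """Combine instrument and random components as in the quoted formula."""
    # Assumed rectangular distribution: standard uncertainty = half-width / sqrt(3).
    u_instrument = instrument_half_width / math.sqrt(3)
    # Random component: standard deviation of the mean of n readings.
    u_random = sample_sd / math.sqrt(n)
    # Add in quadrature, then expand by the coverage factor (2 in the comment).
    return coverage_factor * math.sqrt(u_instrument**2 + u_random**2)

mu = expanded_mu(2.0, 1.0, 100)
print(round(mu, 2))  # 2.32
```

With these inputs the instrument term dominates: however large n becomes, the result cannot drop below 2 × 2/sqrt(3), roughly 2.31 cm.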

    • Rick C PE, you do not understand the difference between measuring an individual item and sampling a population mean. There is no instrument capable of measuring the average monthly temperature of anything. The only way this can be done is by using a multitude of individual measurements arithmetically combined to yield an “average.” Hence the mathematics of statistical sampling must be invoked to determine the confidence interval of the SAMPLING.

      You are at the mercy of the sqrt(N) where N= the number of observations used to determine the population mean.

      • Mark, what you have just done is to say that because “there is no instrument capable of measuring the average monthly temperature of anything” we will ignore measurement error.

        Imagine that you take such a series of measurements, do the stats, and state your uncertainty as +/-0.1 degree. You then check the manufacturer’s spec for the thermometer and find that it was only calibrated to be within +/- 1 degree of the true temperature.

        This is part of measurement uncertainty which is not reflected by your statistics and never can be.

      • Mark

        The sqrt(N) is a theoretical construct only valid when the individual sample uncertainty is negligible.

        Ever heard of Nyquist or signal to noise ratios? Statisticians often forget these ideas and even the field of metrology. Or how about the basic Scientific Method?

        The idea is that you design the tools to fulfil the job. So if you require a certain maximum uncertainty you use tools that can give you that with multiple sampling.

        Temperature measurements and even sea level heights were not recorded with instruments and processes designed to give real uncertainties of a tenth of expected values.

        For example, the typical variation of temperature anomalies is 0.1 K per decade. So you need to design your system ideally with a systematic uncertainty of around 0.02 K or less, for decent signal to noise. 10 to 1 is better if that data is then processed and other results derived from it.

      • Mark: It is you who apparently does not understand that measurement uncertainty and sampling theory are two different things. The issue is the erroneous assumption that the error in instrumental measurement is random and symmetrically distributed about the ‘true’ value. This can never be known as there is always some possible bias. An instrument with a stated calibration uncertainty of +/- 2 cm could, for example, always be off by +1.5 cm and be considered acceptable (fit for purpose). Thus, no matter how many readings are averaged the answer will still have a +1.5 cm bias error. Here it should be noted that bias by definition is not known, or it would be corrected in the calibration process – e.g. “subtract 1.5 from the measured value”.
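A small simulation can illustrate the bias point made above. It is a sketch under stated assumptions: the true level, the +1.5 cm bias and the 1 cm random noise are all hypothetical values chosen for the demonstration, and Gaussian noise is assumed only for convenience.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

TRUE_LEVEL_CM = 50.0  # hypothetical true value
BIAS_CM = 1.5         # hypothetical constant instrument bias, within a +/- 2 cm spec
NOISE_SD_CM = 1.0     # hypothetical random reading noise

def mean_of_readings(n):
    """Average n simulated readings that all share the same bias."""
    readings = [TRUE_LEVEL_CM + BIAS_CM + rng.gauss(0.0, NOISE_SD_CM)
                for _ in range(n)]
    return statistics.fmean(readings)

for n in (10, 1_000, 100_000):
    # Error of the average relative to the true level.
    print(n, round(mean_of_readings(n) - TRUE_LEVEL_CM, 3))
```

As n grows the random part shrinks, and the error of the average settles near the bias rather than near zero.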

        Sampling is actually a quite complicated issue. Key issues include assuring randomness of samples, number of samples relative to population size, consistent measurement instruments and technique, etc. From what I’ve seen in climate science, sampling is far from adequate to justify the precision typically claimed. Even in well controlled laboratory settings, assuring that samples are truly random and properly represent the population being studied is often difficult. In many cases either the value of samples that may be destroyed in the analysis process or the cost of making the measurements themselves make statistically proper sampling infeasible.

        The application of normal statistics to estimate the range of a population mean from a sample mean (dividing the sample SD by the square root of n) is based on an inherent assumption that the measurement errors are random and normally distributed about the true value and that the sample is truly representative of the population from which the sample is drawn. I don’t think any of these conditions are met in the evaluation of annual mean temperatures or sea level.

        One final thought. In the laboratory we often make measurements of samples to determine some specific property – e.g. measure the tensile strength of 30 samples taken from a coil of steel. Each measurement may have an uncertainty of 100 psi, but the SD of the sample results may be over 1000 psi. In such cases the MU is of little consequence. Thus, we always want to use instruments that are at least 4 to 10 times more precise than the inherent variability of what is being measured. If you want to measure mean air temperature to an accuracy of +/- 0.1 C, your thermometer should have an uncertainty of less than 0.025 C.

      • Rick C PE ==> Yours: “I don’t think any of these conditions are met in the evaluation of annual mean temperatures or sea level.”

        Yes, that is exactly right. The StatsFolk insist on throwing out the original measurement uncertainty (accuracy range of the measuring instrument) without first making sure that the requirements for applying their statistical analysis approach are met by the physical system and its measurements.

        They are entitled to find as “precise” a Mean as they wish — but they may not discard the original measurement uncertainty in doing so. If they insist that they can validly do so, then we could measure water levels to the nearest foot, or meter, or mile, throw out the known range of measurement uncertainty, and still derive a mean to millimetric precision.

    • Rick, perhaps you could write an article on this. You seem to be a lot more knowledgeable and qualified on the subject. This whole subject of uncertainty of measurement and claimed uncertainty is fundamental and has remained largely unchallenged for decades.

      Sadly I doubt much will come out of the sporadic flow of comments here.

    • Rick C PE ==> Thank you for the support….few and far between.

      The original measurement uncertainty is normally simply chucked out the window and the recorded figures treated as discrete measurements.

      So refreshing to hear from someone who has a solid professional grasp of the topic. I have sympathy for those with stats training who “know the rule” but not the exceptions.

      • When I did the “mathematics and statistics” courses in about 1976 this stuff was treated rigorously. I do remember it was a hard course requiring lots of homework and clear thinking, and the teachers were really strict about getting it right. How very sad to think that the standards seem to have dropped somewhat.

      • As you say Kip,

        Some people have done a reasonable level of maths and comprehend the difference.

        Others don’t comprehend and probably never will.

        Pointless to argue something that is beyond their ability, or willingness, to comprehend.

    • Rick. Perhaps you could answer this, then, which seems to be at the heart of the issue:
      If I have some number of years of Jan 1 noon temperature measurements, each accurate (as per the article) to +/- 0.5 deg, and I plot them on a chart and fit a line and observe a slope to that line, how many measurements should I take to be sure of a rising (or falling) trend? Or, how many +/- 20mm tidal measurements must I have to declare a 1mm/yr sea level rise?

      • Paul Blase ==> I’ll take a pass at your question while we wait upon Rick — I don’t mean to answer for him.

        For your temperature example, two. It only takes two (2) data points to draw a line. That line is the “trend” of those two points. If the known measurement uncertainty is +/- 0.5°F, the difference between the first and second measurements would have to exceed that 0.5°F, to be certain that the measurements were, in fact, different at all. What you wouldn’t have is an understanding of anything as a result. You’d have just two dots, one different from the other.

        To derive meaning from those two dots requires understanding the physical system, the underlying causes-and-effects that produce “temperatures”: what the relationships are between (in your example) annual same-day temperatures; what the variations or differences in these measured values mean and don’t mean; whether my two temperatures are different because of something in particular, or just different. Would it make sense to expect them to be the same? (This questioning goes on for some time.) The same reasoning would apply if you had 100 years of Jan 1s, or a thousand years of Jan 1s.

      • Thank you. I suppose that we could flip the question around and say that IF a rise of (say) .1 deg per year or 1mm per year is important, how long until we can be sure that we actually have one?

      • You can’t catch up to find a trend outside the uncertainty. You may be able to see a trend by looking at the top, recorded, or lower range lines. However, the value of the temps in the trend will lie somewhere inside and you have no way to know an exact value. That is why using an average out to 1/100th or even 1/1000th is ridiculous. Throw a dart until it lands inside the range and you’ll be as accurate as any scientist!

      • First, I agree with Kip’s answer. But, while the question seems simple, the answer is quite complicated. In fact, I’m pretty sure there are many textbooks and scholarly papers on the question of trend analysis and in particular time series. The key issue is whether a trend calculated from a particular data set is an effect of some cause or a result of some inherent random variability – i.e. due to chance alone. Error in data measurement primarily adds to the potential for either missing a real trend or incorrectly concluding there is one. If we see a 0.3 C difference between years measured with an uncertainty of +/- 0.5 C each time, how do we know if it is a real difference or not?

        But, even if we can conclude that an observed trend is likely real with a high confidence, this knowledge is of little value unless it has predictive value. Casinos love folks who have a system that they think can predict the next roll of the dice or roulette number based on the trends they see in the previous 10 or 20 trials. The stock markets have been trending up for some time; should you mortgage your house and invest based on this trend? Maybe buy stocks that have had the largest upward trend in the last 6 months?

        Doing regression analysis (curve fitting) on time series data effectively makes the passage of time the “independent” variable. But the data being analyzed is typically not a function of time. Temperature and sea level are clearly influenced by many independent variables that change over time (and yes CO2 concentration is one of them). Many seem to cycle up and down at varying rates. Those who frequent this site can easily list many of them. The real question is does anyone have an adequate understanding of how all of these variables affect temperature and/or sea level to be able to accurately predict future climate? My own conclusion is that the earth’s climate system is an excellent example of a chaotic system and that prediction of future states is not possible.

        By the way, we do know CO2 has increased quite steadily over the past 50+ years. However, if CO2 were indeed a primary control of temperature, there should be a strong correlation between them. But the data I’ve seen shows an R-squared of near zero. In my experience R-squared of less than 0.7 should be taken as a poor indicator of causation (more likely an indication of some unknown variable affecting both).

    • Kip: “So, if I take 100 measurements with a +/- 2 cm instrument and get an average of 50 cm with a standard deviation of 1 cm the overall MU is +/- 2.32 cm at a 95% confidence level.” What if an inaccuracy of determining a baseline measurement is not important and one is concerned about change from the baseline? Global average temperature is not as easy to determine as how much a determination of global average temperature changes. (The part of the world without official thermometers probably has a different average temperature than the part with official thermometers, but both parts of the world have high expectations of changing similarly in temperature when global average surface temperature changes.)

      Or, suppose that the 2 cm instrument does not have an error of 2 cm or up to 2 cm, but merely rounds its output to the nearest 2 or 4 cm but is otherwise accurate, like Mark S. Johnson’s 8-foot measuring pole that measures the height of American adult males to the nearest foot? Or that you have a few hundred of these instruments with biases ranging from 2 cm upward to 2 cm downward in a random manner known to have a 99% chance of having average bias of no more than a few millimeters, or their biases are known to not vary from year to year. What do these mean for how little global sea level can change from one year or decade to the next and there is 95% confidence that sea level changed in the indicated direction +/- 99.9%? Or a change twice as great is known to be +/- 50% with 95% confidence?

    • Rick C PE

      Thank you. Cogent and correct. You didn’t mention it but the ISO GUM “x 2” is there to create a confidence envelope around the measured value – a high confidence.

      This bears repeating because this is how to propagate uncertainties through a calculation:

      “In short, no matter how many measurements you make, the MU of the average is always greater than the MU of your instrument”.

      Crispin
      ISO TC-285

  12. Kip — I think you/we are sort of skating on thin ice over a sea of statistical sampling theory here. Sampling theory is a serious, and very complex field that is very important to industry. It’s widely used to determine things like how many samples from a production run need to be tested to have a certain degree of confidence that the run meets predefined quality standards.

    I know just enough about sampling theory to know that it’s really important and mostly beyond my abilities.

    In your/our case, the issue is how to sample temperature/sea level so as to get a useful/meaningful estimate of values and changes in values.

    • Don K ==> Yes and no….it is not an inability to get a useful “estimate” — finding means is simple…and the rules and methods clear…..it is the meaning of the apparent precision of those means that is the question.

  13. “We cannot regard each individual measurement as measuring the water level outside the stilling well”

    Yes and no. Conceptually, the stilling well is just a mechanical low pass filter that removes high frequency noise caused by waves, wakes, etc. Hopefully, water level measurements made within the stilling well will yield the same value as would measurements made outside the well. With a LOT less measurement and computation.

    • Don K ==> The pragmatic, real world fact is that the measurement inside the stilling well is known and acknowledged to be different from that outside the well — at each 1-second instant that readings are made. NOAA is very careful in their illustration to make this point, and the CO-OPS support agent very careful to point out that each measurement is in reality ONLY expected to be within 2 cm (+ or -) of the water level outside the well.

      You are exactly right that the stilling well is a mechanical “averager” or “filter” for the vast array of different waves and ripples and boat wakes and other disturbances of the water surface in a harbor (and having lived aboard boats and ships half of my adult life, I am far too familiar with the topic). But the supposition that this mechanical device does some sort of mathematical or statistical magic would be an error — it may be “like” some concept, but an engineer would explain that it simply does what it does, to the accuracy specified and tested in real use. That’s why I queried NOAA CO-OPS — and they confirmed this point.

      Tide Gauges return instantaneous readings of the level INSIDE the stilling well to am accuracy of 1mm, and report a range of water level OUTSIDE the stilling well to +/- 2cm (usually — some of the 1-second readings are far off and must be discarded).

      • Kip,
        You said, “Tide Gauges return instantaneous readings of the level INSIDE the stilling well to am [sic] ACCURACY of 1mm,…” I think that you mean a PRECISION of 1mm.

      • Clyde ==> I concede — NOAA refers to it as the “Resolution” of the inside-the-stilling-well measurement.

  14. Kip is right. Simple thought experiment. Suppose you took any large number of daily readings of an actual daily temperature of 70.4999 degrees, every reading the same, due to the constant theoretical climate. Each would be reported as 70 degrees, and any number of days averaged would result in an average of 70, when the real average is actually almost half a degree distant.

    • In this very artificial example, the variability is so small that the roundings are totally correlated; the error far exceeds the variability. But with any real situation of temperatures or tide gauges, it is the opposite. If the error far exceeds the range of the data, measurement tells you nothing useful. But if you have a temperature range of say 10°, then rounding errors will no longer be correlated.
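The contrast drawn in this exchange, between a constant 70.4999 and temperatures spread over a 10° range, can be sketched numerically. The uniform spread between 65 and 75 degrees is an assumption made here for illustration; real temperature records are not uniformly distributed.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

def mean_rounding_error(true_values):
    """Mean of (rounded reading - true value) over a set of true values."""
    return statistics.fmean([round(t) - t for t in true_values])

# Case 1: the thought experiment, every day exactly 70.4999 degrees.
constant = [70.4999] * 10_000
# Case 2: hypothetical true values spread over a 10 degree range.
varying = [rng.uniform(65.0, 75.0) for _ in range(10_000)]

err_constant = mean_rounding_error(constant)
err_varying = mean_rounding_error(varying)
print(err_constant, err_varying)
```

In the first case every rounding error is identical, so averaging removes none of it. In the second the rounding errors take both signs and largely cancel in the mean. The sketch covers rounding only; it says nothing about calibration bias or other instrument error.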

      • Yes, this is like measuring something with a quantised digital sensor, e.g. an analogue-to-digital converter on some physical sensor. The results are quantised into steps. One technique in instrumentation design is to deliberately add a small amount of noise, like +/- one quantum step, BEFORE sampling. This means that you can average groups of ten samples and the quantisation of your results is 1/10 of the previous step quantisation.

        As long as the noise injected is normally distributed you have gained in resolution of the instrument at the cost of sampling ten times more often.

        It should be noted that here you are measuring the SAME physical quantity at an interval where it is assumed not to have changed physically. You are not mixing ten different sensors !
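The dithering technique described above can be sketched as follows. All values are hypothetical: a step of 1 unit, a constant true value of 70.3, and Gaussian noise with a standard deviation of one step. The sketch averages far more than ten samples purely so that the result is stable from run to run.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

STEP = 1.0         # one quantisation step of the hypothetical converter
TRUE_VALUE = 70.3  # hypothetical quantity, assumed constant while sampling
N = 10_000         # number of samples averaged

def quantise(x):
    """Round x to the nearest multiple of STEP."""
    return round(x / STEP) * STEP

# Without dither every sample lands on the same step.
plain = statistics.fmean([quantise(TRUE_VALUE) for _ in range(N)])

# With dither: add noise BEFORE quantising, then average.
dithered = statistics.fmean(
    [quantise(TRUE_VALUE + rng.gauss(0.0, STEP)) for _ in range(N)]
)
print(plain, dithered)
```

The undithered average stays stuck on the quantisation step, while the dithered average recovers a value between the steps. This works only because the same unchanged quantity is sampled repeatedly by one sensor, which is the caveat the comment itself raises.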

    • Not the point. The argument is over whether measuring one thing once, then measuring a different thing another time, and averaging the two gives you greater accuracy of “something” than the accuracy you had of the original things you measured.

      I believe Phil (above ) is right: I base that on logic rather than stats. If today’s temperature is measured as 1 and yesterday’s as 2, then averaging them gives me 1.5. But that 1.5 is not a measurement of anything that actually happened, so the whole idea that it is “accurate” is nonsensical. I cannot measure something that does not and has not existed.

      As Phil argues, smoothing a time series is not the same as averaging measurements of the same, unvarying thing.

  15. I have to agree with Nick here. When you round to the nearest integer you will adjust the measurements up or down by up to +/- 0.4999 depending on what side of the zero point they were. Over multiple measurements each individual will fall “randomly” within the +/- 0.5 interval.

    • MrZ

      “Over multiple measurements each individual will fall “randomly” within the +/- 0.5 interval.”

      It might, or it might not. You have no idea, and that’s the point. You can’t just say it and then it is true.

      • Crispin ==> But “just say it and then it is true” is the basis of the strength of the StatsFolk argument. They have a standard method that has been drilled into them in Uni course after course, repeated until it dances before their eyes as a Universal Truth. But, like so so many things, the Devil is in the details — they know the rule but forget [or ignore] the exceptions and the required conditions for applying the rule.

        I appreciate your enlightened input here.

  16. Question: How much does the world’s river water run-off, and the soil it carries, raise the oceans’ level over a year’s time?

    Does the increase in soil deposits around the measurement site affect the readings? The deposits have to increase the water level over time.

    • “Question: How much does the world’s river water run-off, and the soil it carries, raise the oceans’ level over a year’s time?”

      Quick answer: not very much. More detailed answer: MOST of the water in run-off is derived from precipitation which, in turn, is ultimately fueled by evaporation … from the oceans. Some is derived from storage in aquifers, but that is thought currently to be roughly offset by storage of water in new reservoirs.

      “Does the increase in soil deposits around the measurement site affect the readings?”

      Conceptually, it doesn’t affect the readings because the measuring hardware is fixed to the old bottom.

      In the grand scheme of things, sedimentation does represent a transfer of “dirt” from the land to the sea floor and thus raises sea level. But on a human timescale, the effect is surely negligible.

    • Additional note: Some older tide gauges using a mechanical float are said to have had a problem with sediment in the bottom of the stilling well causing the float to bottom out at low tide. That causes the gauge not to record low water points properly.

      • Don K ==> Yes, that is an acknowledged problem. The PSMSL has a set of standards for records that it maintains, and I think that the old float gauges are excluded now. (I’d double check if it is important.)

  17. Nice cutting of a Gordian knot, John. Isn’t it fascinating how counterintuitive statistics can sometimes be, giving rise to the rather intemperate divergences of opinion seen on this thread?
    For the record I am with Kip, as a physicist recalling ancient lab work.

    • Alastair gray ==> Thank you sir. I am puzzled by the insistence of so many others on the use of statistical theory over practical real world example, use which ignores the rules for when those theories can be applied.

    • At least half the contributors on here would really benefit from being locked in the ancient lab. I can supervise Monday and Wednesdays for the rest of 2017. Any other volunteers? The project may well run into 2018 unfortunately.

      • Badger

        It is clear who here works in a lab with real instruments and who does not. Thanks too, Alastair. Can you imagine claiming to report the bulk temperature of the oceans measured occasionally to 0.06 C and claiming it has risen by 0.001? The cheek!

  18. When you start making claims on ‘world wide’ you really must face the reality of not just accuracy but ‘range’. It is simply not enough to throw computing power, through models, at the issue. If you need a thousand or ten thousand measurement locations to deal with the scale, to ‘know’ rather than ‘guess’, then that is what you should have.
    And when it comes to weather or climate on a world wide scale we seem to be not even close to the coverage needed to make these measurements in a manner that supports the scientific value they are often attributed with.

  19. One of the key statistical sleight of hand tricks here is to pretend that thousands of measurements of SST are repeated measurements of the same thing, like it was some lab experiment.

    In the context of AGW and attribution we are interested in assessing and attributing effects of changes in radiation flows on the surface temperature, ie for lack of better metrics we are using the ocean mixed layer as a calorimeter to measure heat energy. If we have sufficient data we can try the same thing on all water down to a certain depth and arguably get a more meaningful result.

    So SST is not tens of thousands of measurements of some fixed quantity, the “global temperature”, since no such thing exists. There are many temperatures at different places for very good physical reasons. The “global mean temperature” is just that, it is a mean value : a statistic, it is not some physical quantity which we are taking thousands of independent measures of. In fact it is the sum which we are interested in, not the mean. The statistical confidence levels for the mean indicate how well the mean represents the sample and the confidence we can have that any given individual measurement will lie within one or two std devs of the mean. This is not our confidence in the sum.

    What we have is a global array of little boxes of sea water for which we have a temperature, from which we want to estimate heat energy content. Now if we want the total energy content we will have several thousand individual measurements, each with its own measurement error, that we add together to get the global heat content.

    There are many contributions to the uncertainty of such measurements: some will average out, others may be independent and thus considered orthogonal but will not average out, others will be systematic and will not reduce at all.

    Then we have to add changes over time of the measuring system, which itself is largely unknowable at this stage.

    There is no simplistic answer like Kip’s +/-0.5 nor the usual +/- 2 s.d. which also ignores the nature of much of the uncertainty.

    • You are the first to get one big piece of the puzzle. Let’s extrapolate the problem: for the tidal gauges the wave background etc may be very different at different locations. Each gauge at each location may have very different accuracy. In science it’s called the calibration problem.

      What you want to ask the statistics group is what is the calibration and discrimination on their statistics.

    • Greg ==> My essay is about the original measurement uncertainty — of each and every individual measurement or its recorded value.

      All the other stuff (and there is a lot of other stuff) is additive to the +/- 0.5° that derives from the original measurement uncertainty.

      I have said nothing about standard deviations, which are not part of the story in my essay.

      • thanks Kip. You seem to be missing the point a bit.

        There is a quantisation error in the recording method but there is also a lot of noise in the signal. With many measurements, averaging breaks down the quantisation. See my explanation of this effect and how it is dealt with in instrumentation design.

        Leo makes a similar point later using an audio example.

        Recording nearest degrees will cause a problem if the recording method changes and is not documented thoroughly and adjusted for. It is not a problem in itself if methods are consistent.

        You are mistaken in asserting that the result can never be more accurate than a single reading.

        Maybe you should try it. Take a dataset , do some averaging then truncate or round or whatever and show some figures. The very limited numbers you use in the essay do not examine the effect of averaging.
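The experiment suggested above can be sketched in a few lines. This is only an illustration under stated assumptions, not anyone's actual instrument model: one fixed quantity (the hypothetical `TRUE_VALUE`) is read repeatedly, each reading carries zero-mean Gaussian noise of about one recording step, and each reading is rounded to the nearest integer before averaging. It says nothing about readings of different quantities or about systematic error.

```python
import random
import statistics

def mean_of_rounded(true_value, noise_sd, n, rng):
    """Average n readings of true_value, each perturbed by Gaussian noise
    and then rounded to the nearest integer before it is recorded."""
    readings = [round(true_value + rng.gauss(0.0, noise_sd)) for _ in range(n)]
    return statistics.fmean(readings)

rng = random.Random(42)      # fixed seed so the run is repeatable
TRUE_VALUE = 20.3            # hypothetical fixed quantity being measured

# Noise of about one recording step: the rounded readings scatter over
# several integers, and their mean lands close to the true value.
noisy = mean_of_rounded(TRUE_VALUE, 1.0, 10_000, rng)

# No noise at all: every reading is recorded as 20, so the mean is stuck at 20.
quiet = mean_of_rounded(TRUE_VALUE, 0.0, 10_000, rng)

print(f"mean of rounded noisy readings: {noisy:.3f}")
print(f"mean of rounded quiet readings: {quiet:.3f}")
```

Under these assumptions the noise is what lets the average resolve below the recording step; without it, no amount of averaging moves the result off the integer.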

      • Greg ==> You are speaking only of mathematical results, and not real world measurement results, and your example works, of course, because that’s how long division works. It is a circular argument, in a sense.

        Work out the little exercise in Author’s Comment Policy section — share your results below. You will find that the range of the mean is the same as the range of the original measurement. It can not be different.

      • Kip

        I will go out on a limb here and say that the 20mm is based on observations in typical wavy tides. I am willing to bet a coffee that in windy places where the waves are higher, the 20mm claim is unsustainable. Just as one finds when taking measurements in the lab, the instrument can only perform well within a certain range of conditions. If they did some calibration exercise and arrived at the round number of 20mm, it is probably only reasonably true, and only in ‘standard conditions’. The reason is that as discussed above, the stilling well level is known to deviate from the water outside, and it cannot possibly have a perfectly linear error up and down over a range of sea conditions. On a calm day, it might be 8mm and in stormy weather, 50mm. I am sure they use 20mm as a “general case” and live with the consequences. The fact that there are ‘outliers’ indicates there are generalisations and assumptions behind that 20mm figure.

      • Crispin ==> Tide Gauge specs are created by technical bureaucrats and committees, International bodies of sea level and tide people — I bet I could find, given a few hours and the patience to spend them, the very technical paper that describes the hows and whys of the spec at ±2cm. I may give it a go later this week.

  20. Re the “stilling” chamber. As Don K says there is a “low pass filter” which hopefully eliminates the effect of high frequency noise from waves and wakes. Note that depending on the location, and the design, it is unlikely to eliminate the low frequency noise from swell – easily 9 second period and in a bad location more. Further, because of the friction in the filter, the level inside will be different from that outside – when the tide is rising the interior level will be lower, and when the tide is falling the level will be higher.

    As a real – ie geographical – low pass filter, I refer you to the Rip at Port Phillip Heads. Here a narrow channel connects the open ocean and Port Phillip. Consider starting from an instant when the levels inside and outside are the same. Then as the tide outside rises, water pours in through the Rip, and the tide rises inside Port Phillip. When high tide is reached outside, the water is still pouring in, and continues to do so until the outer water level has fallen to the level inside. At this time, high tide is reached inside Port Phillip, while the outside water level is approaching half tide. As the tide level outside continues to fall, water commences to pour out, and the tide level inside Port Phillip starts falling. Given the range of tide outside, and the tidal range inside, the six hourly period (approximately) between high and low tide outside is offset by nearly three hours (approximately) inside.

    This is on a far greater scale than that in the tide gauge, nevertheless, a similar situation. Hopefully the difference between inside and outside is rather lower!
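The lag described above can be illustrated with the simplest possible model. This is a sketch under a loud assumption: the inside level is taken to follow d(h_in)/dt = (h_out - h_in)/tau, a linear first-order low-pass filter, whereas real flow through an orifice or a channel is not linear. The time constants `tau_h` are hypothetical, and the 12.4 hour period is only the approximate semidiurnal tide.

```python
import math

def linear_basin_response(period_h, tau_h):
    """Steady-state response of d(h_in)/dt = (h_out - h_in) / tau to a
    sinusoidal outside tide of the given period.

    Returns (inside/outside amplitude ratio, lag of inside high water in hours).
    """
    omega = 2.0 * math.pi / period_h
    ratio = 1.0 / math.sqrt(1.0 + (omega * tau_h) ** 2)
    lag_h = math.atan(omega * tau_h) / omega
    return ratio, lag_h

PERIOD_H = 12.4  # approximate semidiurnal tidal period, hours

# Hypothetical time constants: small for a tide-gauge well, large for a bay.
results = {tau_h: linear_basin_response(PERIOD_H, tau_h)
           for tau_h in (0.1, 1.0, 5.0, 20.0)}

for tau_h, (ratio, lag_h) in results.items():
    print(f"tau = {tau_h:5.1f} h   amplitude ratio = {ratio:.3f}   lag = {lag_h:.2f} h")
```

In this linear model the lag can never exceed a quarter of the tidal period (about 3.1 hours here), and it only approaches that limit when the inside range is much smaller than the outside range, which is consistent with the "nearly three hours" quoted for Port Phillip.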

    Re the accuracy of readings. As a cadet on a cadet ship in the 1950s we had to take sea water temperatures using the bucket method. We had to read – estimate – the temperature to the nearest 0.1 degree. I cannot recall if it were in Fahrenheit or Celsius – as the thermometers were supplied by the UK Met Office I suspect that Celsius was used – I remember that in the coded message we had to insert a C or F for the readings of all temperatures. Note exactly on a degree line is easy to observe, and exactly half way between is also easy. A third or a quarter is also easy to estimate, and this gives 0.3 or 0.7, and it is fairly easy to see if it is a little more than a quarter – hence also 0.3 or 0.7, or a little less, hence 0.2 or 0.8, and if it is a tad more or less than the actual mark, then it is either 0.1 or 0.9. In good weather, the reading would be good. With a strong breeze blowing, and rain or sleet, it would be difficult to get better than to the nearest 1 degree. And in a howling gale, often it would not be possible to get the bucket in the water – it would be blown sideways so much it never reached the water.

    Presume the Met Office found our reports valuable – there weren’t too many ships which reported and did so regularly every 6 hours, at 0000, 0600, 1200 and 1800 GMT.

  21. Cook-ery with numbers
    97% (+/- 98.8%) with very high confidence (because ‘Climate Science™’ is all about confidence not verification or proofs).

  22. Your position does not seem to agree with the international guideline: Evaluation of measurement data — Guide to the expression of uncertainty in measurement

    See section:
    “4.4.5 For the case illustrated in Figure 2 a), it is assumed that little information is available about the input quantity t and that all one can do is suppose that t is described by a symmetric, rectangular a priori probability distribution …”
    and
    “4.2.3 The best estimate of $\sigma^2(\bar{q}) = \sigma^2/n$, the variance of the mean, is given by …”

    It seems as if the origin of the disagreement is related to:
    “G.2.1 If $Y = c_1 X_1 + c_2 X_2 + \ldots + c_N X_N = \sum_{i=1}^{N} c_i X_i$ and all the $X_i$ are characterized by normal distributions, …
    then the resulting convolved distribution of $Y$ will also be normal. However, even if the distributions of the $X_i$ are not normal, the distribution of $Y$ may often be approximated by a normal distribution because of the Central Limit Theorem. …”

    It would be interesting if you, with reference to this guideline, could identify exactly the source of the disagreement. After all, if there is an error in this standard, you need to identify that error quite precisely:
    “The following seven organizations supported the development of the Guide to expression of uncertainty, which is published in their name:
    BIPM: Bureau International des Poids et Mesures
    IEC: International Electrotechnical Commission
    IFCC: International Federation of Clinical Chemistry
    ISO: International Organization for Standardization
    IUPAC: International Union of Pure and Applied Chemistry
    IUPAP: International Union of Pure and Applied Physics
    OIML: International Organization of Legal Metrology ..”

    • Science or Fiction ==> Did you personally undertake the simple exercise described in the Author’s Comment section?

      • I did more than that:
        I generated 100 sets of 2000 random temperatures between 0,00 and 10,00 (Unit doesn´t matter)
        (Let us call these sets ´real temperatures´.)
        I calculated the average of each set.

        For each temperature I then:
        Rounded the temperature to the nearest integer.
        (Let us call the resulting sets ´rounded temperatures´)
        Calculated the average of each set of rounded temperatures.

        I then:
        Calculated the standard uncertainty of the average value of the rounded data set in accordance with the Guide to the expression of uncertainty: Result: 0,006
        Which means that for the hundred sets I would expect to see a maximum difference between the averages in order of magnitude 3 – 4 standard uncertainties = 0,018 to 0,024

        For the 100 sets, I then calculated the difference between the average of the ´real temperatures´ and the ´rounded temperatures´.
        I then found the maximum difference between the two average values for the 100 sets:
        Result: 0,021.
        I repeated the test two times more –
        Results: 0,023 and 0,021

        The result was exactly as I expected on the basis of the ´Guide to the expression of uncertainty in measurement´.

        More than that, in terms of your terminology:
        The accuracy of the mean, represented in notation as +/- 0.5, is not identical to the original measurement accuracy. [of each reading]

        Hence, the results of this test did not support your position.
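The test described above is easy to repeat. The following is a minimal sketch, assuming that "random temperatures" means independent draws from a uniform distribution on 0 to 10 and that the GUM calculation treats each rounding error as rectangular with full width 1; the function name and the seed are illustrative choices, not the commenter's.

```python
import math
import random
import statistics

def max_mean_shift(n_sets, n_per_set, rng):
    """Largest |mean(rounded) - mean(real)| found over n_sets data sets."""
    worst = 0.0
    for _ in range(n_sets):
        real = [rng.uniform(0.0, 10.0) for _ in range(n_per_set)]
        rounded = [round(t) for t in real]          # record to the nearest integer
        shift = abs(statistics.fmean(rounded) - statistics.fmean(real))
        worst = max(worst, shift)
    return worst

# A single rounding error is rectangular with full width 1, so its standard
# uncertainty is 1/sqrt(12); for the mean of n independent errors, divide by sqrt(n).
u_mean = (1.0 / math.sqrt(12.0)) / math.sqrt(2000)

rng = random.Random(2017)    # fixed seed so the run is repeatable
worst = max_mean_shift(100, 2000, rng)

print(f"standard uncertainty of the mean: {u_mean:.4f}")
print(f"largest shift of the mean over 100 sets: {worst:.4f}")
```

The computed standard uncertainty rounds to the 0,006 quoted above, and the largest shift over 100 sets should come out at a few multiples of it rather than anywhere near 0.5.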

      • S or F ==> Try the experiment as I defined it …. see if the range of possible means is the same as range of the original measurement uncertainty. It is, every time.

        You are working a different problem, about probability and the precision of means.

      • “Try the experiment as I defined it …. see if the range of possible means is the same as range of the original measurement uncertainty. It is, every time.”

        You are right! The range of possible errors of the mean is the same as the range of possible errors for the individual measurements. :)

        However, that is also a useless conclusion.

        I will explain why.
        Let us say the resolution is 0,1 DegC.
        For the average to be 0,5 DegC too high, all measurements have to be 0,5 DegC too high
        The probability that one measurement is 0,5 DegC too high is 1/10
        The probability that two measurements are both 0,5 DegC too high is 1/10*1/10=1/100
        The probability that n measurements are all 0,5 DegC too high is 1/(1*10^n)
        The probability that 80 measurements are all 0,5 DegC too high is 1/(1*10^80)

        Given that 1*10^80 has also been proposed as an estimate for the number of particles in the Universe, I think the conclusion is pretty clear. The probability that the error of the mean is equal to the error of each measurement is vanishingly small – as in completely unlikely, for a relevant number of measurements.

        Hence, your point that:
        “The accuracy of the mean, represented in notation as +/- 0.5, is identical to the original measurement accuracy — they both represent a range of possible values.”
        is right, but absolutely useless.
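The arithmetic above can be checked exactly. This is a sketch of the commenter's own simplification, not a general result: it assumes the errors are independent and that each one falls into any of ten equally likely bins, so that a single reading sits in the extreme bin with probability 1/10. The function name is hypothetical.

```python
from fractions import Fraction

def prob_all_in_extreme_bin(n, bins=10):
    """Probability that n independent errors all fall in the same extreme bin,
    when each error is equally likely to land in any of `bins` bins."""
    return Fraction(1, bins) ** n

for n in (1, 2, 10, 80):
    p = prob_all_in_extreme_bin(n)
    print(f"n = {n:2d}   probability = 1/{p.denominator}")
```

The worst-case bound on the error of the mean stays at the full half-width however large n is; what shrinks, under these assumptions, is the probability of getting anywhere near that bound.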

      • S or F ==> Thank you for acknowledging that the central point of the essay is correct — even if you don’t agree with the rest. That’s a sign of intellectual honesty, and I recognize and appreciate it.

        “Usefulness” is a complex and complicated topic and may be beyond the bounds of this particular essay, except to acknowledge that the true accuracy of the “overly precise” means in many fields of research is omitted from the results — nearly always in press releases, often in published conclusions, and very often in future application.

        One way to look at the temperature data is to first imagine that we use the current system of using integers to represent the 1°-range. We then look at a million measurements in the range, and you say that the chance, the probability, of all measurements being “0.5 too high” is infinitesimal. However, the same is exactly as true for all measurements being very near the integer.

        Say we then change our approach and set the recorded values as “integer.5” to represent the range from “integer” to “integer+1” — the same 1° range, but centered on the .5. Now it would appear that values near the integers would be outliers, and the values near the .5 are “more probable”.

        The probability estimation is not true by either scheme — in other words, not true in the real world.

        Just a thought.

      • «We then look at a million measurements in the range, and you say that the chance, the probability, of all measurements being “0.5 too high” is infinitesimal.»
        To be precise, 80 measurements is more than enough to make that conclusion

        “However, the same is exactly as true for all measurements being very near the integer.»
        I´m not so sure about that. If we define the error of the average of the measured temperatures as: the difference between the average of the real temperatures and the average of the measured temperatures:
        The probability for the error being in the 0,1 DegC interval around the true average is actually higher than the probability for the error being in any other 0,1 DegC interval.

        To understand that, I think we first have to take a look at the following statements in your article:
        «Any difference of the actual temperature, above or below the reported integer is not an error.»
        I would say that if a high precision thermometer displays 0,5 DegC, and the measurement result is reported as 1 DegC, an error of 0,5 DegC is committed in the measurement process.
        That interpretation is also in accordance with the definitions in Guide to the expression of uncertainty (GUM): «B.2.19 error (of measurement): result of a measurement minus a true value of the measurand»

        The next statement we need to have a closer look at is:
        «These deviations are not “random errors” … »
        If the real temperature is a random and continuous variable that can take on any value within a range that is larger than the magnitude of the rounding (see the definition in GUM C.2.2), then the error, by rounding, that is made in each measurement is indeed random within a 1 DegC range.

        The second part of that sentence is correct:
        «These deviations …. are not “normally distributed”.»
        But we can specify the distribution. Actually, the probability distribution for the error is termed a rectangular distribution (GUM 4.3.7 and 4.4.5 ). I can not even think of a more perfect example of a rectangular distribution.

        And now comes the crux.
        A combination of a little handful of measurements having equal rectangular distribution will approximately have a normal distribution (GUM G.2.2 ):
        «EXAMPLE The rectangular distribution (see 4.3.7 and 4.4.5) is an extreme example of a non-normal distribution, but the convolution of even as few as three such distributions of equal width is approximately normal.»

        This I am not good at explaining, but it has been demonstrated. See (GUM G.2 Central Limit Theorem) or Wikipedia on ´Central Limit Theorem´. (I don´t think William Connolley has messed up that article yet. :) :)

        As
        the central limit theorem postulates that the combination of a handful or more of rectangular distributions of equal width will approach a normal distribution,
        and
        a normal distribution has highest probability density at a range around the central value
        then

        the probability for the error of the average being in the 0,1 DegC interval around the true average is actually higher than the probability for the error being in any other 0,1 DegC interval.

        I believe that another way to think of it is that the range near the real average is the range that has the largest number of possible combinations of individual measurements. I believe that follows from the central limit theorem.
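The claim above can be checked by simulation. This is a minimal sketch, assuming each rounding error is independent and uniform on [-0.5, 0.5] and taking only three readings per average, the case the GUM example mentions; the bin layout (0,1 wide, centred on multiples of 0,1) and the function names are illustrative choices.

```python
import random
from collections import Counter

def mean_rounding_error(n, rng):
    """Mean of n independent rounding errors, each uniform on [-0.5, 0.5]."""
    return sum(rng.uniform(-0.5, 0.5) for _ in range(n)) / n

def bin_counts(n, trials, rng, width=0.1):
    """Count how often the mean error lands in each bin of the given width.
    Bin k covers the interval centred on k * width."""
    counts = Counter()
    for _ in range(trials):
        counts[round(mean_rounding_error(n, rng) / width)] += 1
    return counts

TRIALS = 200_000
rng = random.Random(7)       # fixed seed so the run is repeatable
counts = bin_counts(n=3, trials=TRIALS, rng=rng)

for k in sorted(counts):
    print(f"bin centred on {k * 0.1:+.1f}: fraction {counts[k] / TRIALS:.3f}")
```

With even three readings the bin centred on zero error collects the largest share and the shares fall away towards the extremes. The full range of possible errors is still plus or minus 0.5; the simulation only speaks to how the probability is spread inside that range, and only under the independence assumption.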

      • Central limit theorem with square or ‘rectangular’ distribution is ‘easy’ to explain. The key is in taking the average. The average will always ‘pull’ towards that middle value (along the x axis in that distribution graph). It’s more likely to fall into the middle values than the extreme ones. Think of it this way, each observation could yield a value from the left or right of the middle x value. Averaging them together just brings you to the middle. Think of a coin toss. 2 flips. Heads 0 Tails 1. 2 of those outcomes yields .5, 1 yields 0, and the remaining one yields 1. Voila, kinda normal-looking already. Do 100 flips and so on and eventually you’re into Normal curve land.
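The coin-toss count can be enumerated directly. A small sketch, assuming a fair coin with heads scored 0 and tails scored 1 as above; the function name is illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def average_distribution(n_flips):
    """Count how many of the 2**n_flips equally likely sequences give each
    possible average (heads = 0, tails = 1)."""
    return Counter(Fraction(sum(seq), n_flips)
                   for seq in product((0, 1), repeat=n_flips))

for n in (2, 4, 10):
    dist = average_distribution(n)
    print(n, {float(avg): count for avg, count in sorted(dist.items())})
```

For two flips this gives one sequence averaging 0, two averaging 0.5 and one averaging 1, exactly as described; with more flips the counts pile up around 0.5 in the familiar bell shape.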

        But this is just sampling statistics in a fun world where the theoretical distribution is known and you pluck observations from it pretending like you didn’t just fundamentally constrain the entire process to give you a fixed result as the central limit theorem indicates.

        The magic of the central limit theorem…

        …Is also a sort of unrealistic aspect of the central limit theorem. The averages of many variables are Normal distributed when you take an ‘approaching infinity’ number of samples, but the averages of other variables are not distributed Normal when you do the same procedure. In the real world, we often do not know the distribution of the variable we measure, and the number of samples we take is limited, so we cannot really test whether or not the central limit theorem holds. Invoking the central limit theorem to conjure a Normal distribution is invoking an assumption that is probably not even remotely testable. So it is a big qualifier in many cases – and statisticians are explicit when they invoke it.

        But this all doesn’t do anything for getting rid of the uncertainty in the observation. That kind of random error and random sampling error are different sources of error.

  23. While the accuracy discussion is interesting, it is not important to the issue of WARMING, which is the determination of rate of change. The confidence in the slope of the temperature line is certainly related to the precision, not the accuracy, and it is here where sampling makes a difference.

    • You are sort of dancing with the second problem the stats group haven’t addressed. So hold your thought about the rate of change, and let’s divide the space in two, one half increasing at rate X and one at rate Y.
      So what does your average rate represent?

      Okay now divide the space into 4: X & Y remain the same but U is not changing and W is going down at a slow rate. Now what does your average rate represent?

      You can be certain about the average rate, but what does that really mean to any one of the 4 sections of sample space? You need to do something very important as the next step in science. I am interested to see if you know what that is.

      • BRILLIANT ! Thanks LdB, you are making me work hard to retrieve stuff in my brain from over 40y ago,

        I love this blog – good job I am half (+/- 0.125) retired!

    • Roger ==> As to the “slope of the trend line”…..the original measurement accuracy comes into play when the uncertainty range is applied to the resultant means. If the means are all within the uncertainty of the original measurement, then assigning a difference to the various means is a dicey proposition. If the accuracy of temperature measurements is in the +/- 0.5°F range, and all the means fall within this range, then there is uncertainty that the means are actually different.

  24. Enthalpy is a measurement of energy in a thermodynamic system. It is the thermodynamic quantity equivalent to the total heat content of a system. It is equal to the internal energy of the system plus the product of pressure and volume.

    https://en.wikipedia.org/wiki/Enthalpy

    It is the change in total energy of the earth (system) that is important. Not including atmospheric pressure and water content (humidity) in calculations is another source of error.

    • rovingbroker,
      Which raises another question. Because temperature is serving as a proxy for energy, how certain are we that there haven’t been long-term changes in humidity?

  25. Well this is an interesting topic!

    Thanks to Kip for a very clear and well thought out explanation. Easy peasy or so I thought. Just reminded me about all the maths and statistics I studied when I was 17/18/19 and like Kip showed us I used to work out homework type examples myself. Pencil and paper, basic calculator. I didn’t even see a scientific calculator until 1977.

    Well I thought all this was basic, elementary, simple. Easy to grasp. Foundational stuff for any STEM degree course. Ingrained in the brains of all those who graduated, known by all PhDs. FUN-DER-MENTAL.

    Apparently NOT !!!!!!!!!!!!!!!!!

    Where oh where to start? I literally have no idea now. If only Nick were one of my students, we will have him stay behind and sit with all the others (it’s going to be a big room) with just a pencil and paper and a basic calculator. Unfortunately I expect the homework will not be even attempted as the students will just start arguing with the teacher (again).

    I’ll have a look through my library and see if I can find some of the older books on metrology, probably still got a few somewhere. May be coming back later with some references after the weekend.

    TLDR Kip right, all the “others” SO SO WRONG.

  26. Apologies for this being OT, but I thought it may predict exact path of hurricane Ophelia,
    [snip – you thought wrong -mod]

  27. ” and vast sums of money are being invested in research to determine whether, on a global scale, these physical quantities — Global Average Temperature and Global Mean Sea Level — are changing, and if changing, at what magnitude and at what rate”

    Ah no. Not vast sums. Hardly anything at all. GISS spends less than a 1/4 man year on temps.
    last I looked CRU was maybe a Post doc.
    Cowtan and Way.. volunteer.
    Berkeley earth, all volunteer.

    Not vast sums at all.

    The other efforts “re analysis” which Judith Curry takes as the gold standard, is also cheap
    and some folks even make money off it.

    Kip, you have no valid points. I only pray the Red team asks you to Join.
    That would doom it.

      • No one has ever accused Steven of impartiality, objectivity, (or civility). The issue is the extent to which he is correct.

        In this case, I think he has simply misunderstood what Kip said. I mean, what is climate modeling, but an attempt to project future temps? I’m told that the modeling is not cheap. Likewise, the principal use of satellite Radar Altimetry seems to be projecting sea level rise. At a first approximation, nothing associated with satellites is cheap. Ever.

      • He is deliberately missing the point. If GISS spend half a man year trying to ‘adjust’ the temperature record to fit their climate models, they still need the data collection and that is a global network of 100,000s of meteo stations, deployment and maintenance of floating and anchored sensors, ships’ records etc. etc.

        Before Climategate came out CRU were being paid $1 million per year to maintain the land surface record and that was just archiving ( which they failed to do and the cat got it ) and processing.

        He also wilfully ignored the whole question of MSL despite having included it in the snip he quoted.

    • lol @ Mosher. Someone was describing your more recent posts the other day as drive-bys. Pretty lazy drive-by. I hope you and kip know one another and that you’re just razzing him.

      • RW ==> Mosher is himself (which is often not a compliment in his case). He has nothing to say about the content of the essay — so he goes off about how much money is being spent on determining global means….apparently, though he is on the BEST team, he doesn’t get paid (or paid much). Fair enough, he may complain all he wants about that.

        Of course, I don’t get paid either….but I don’t complain.

  28. Thanks, Kip. Re temperature: I think it is important for everyone to understand that in this particular post you are only addressing a small part of the total problem. [You know this of course, and you have addressed some other issues in other posts.]. Correct me if I have got it wrong, but …..
    1. Your analysis addresses only the temperature measurements that are made. It makes no allowance for the temperature measurements that are not made. Temperature measurements that are not made include missing entries from an existing station and missing stations. By “missing stations” I mean the areas in which there are no stations at all, areas that are too different to their nearby stations’ locations to be represented by those stations, and changes in the set of stations over time. All of those temperature measurements that are not made have to be estimated, and that introduces significant further error.
    2. The temperatures being measured are not necessarily the temperatures that are required for climate purposes. For example, all temperature measurements in urban areas have to be corrected for UHE (Urban Heat Effect). This again introduces significant further error, because UHE is not fully understood, the factors required for accurate correction are not available, and some of the methods being used by the providers of some temperature sets are quite simply wrong. Some argue that UHE is insignificant because urban areas are such a small part of the total surface area, but this argument is incorrect because urban stations’ readings are used to estimate the temperature measurements that are not made as per 1 above. UHE is only one such source of error, other sources include land-use changes, aircraft movements at airports, pollution, poor station siting, etc.
    3. The temperature measurements that are made are not necessarily correct, ie. not necessarily within the 1 deg F range that you describe. Depending on whether a station is or was automated, there could be human or equipment error. This tends to be dealt with by ignoring outliers, but this simply adds to the set of temperatures that are not made as per 1 above, it does not trap those errors which leave readings within the accepted range, and it risks eliminating genuinely unusual temperatures.
    4. The temperature inside the station may vary from the temperature outside, if for example there is a fault with the station’s design or siting or changes in its condition.

    The end result of all of the above is that the (in)accuracy ranges that you describe are only a small part of the total inaccuracy.

    NB. This is not in any way a criticism of Kip’s post. Kip’s post addressed one particular issue only. All I am doing is making sure that others understand that the issue addressed by Kip in this post relates to just a small part of the total temperature error.

    • Mike Jonas ==> Yes, Mike, all those other sources of inaccuracy and uncertainty are ADDITIVE to the most basic of inaccuracy, the original measurement inaccuracy or uncertainty.

  29. Kip – you have a difficult row to hoe, here’s the first two sentences from the Executive Summary from Chapter five of the IPCC Fourth Assessment Report: Climate Change 2007 (AR4):

    The oceans are warming. Over the period 1961 to 2003, global ocean temperature has risen by 0.10°C from the surface to a depth of 700 m.

    Really? Two place accuracy for the entire globe over a 42 year period? When you’re dealing with people who are in charge and also write that sort of nonsense it gives you a hopeless feeling. How many reviewers with a PhD put their stamp of approval on that over the top sophistry?

    • That IS an interesting paper

      “The uncertainty of GPS position doesn’t seem to decrease simply as root n:”

      One quick answer is (probably) that some of the errors are due to things like mis-estimates of ionospheric delay and satellite position that tend to average out over time. In technospeak, observations that are close together in time tend to have errors that are correlated.

      I expect that’s not the full story.
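The effect of correlated errors on the "root n" rule can be put in numbers. This is a sketch under an assumption that is not taken from the paper under discussion: the errors are modelled as equally spaced in time with correlation rho**k between readings k steps apart (a first-order autoregressive structure). The function name and the parameter values are illustrative.

```python
import math

def sd_of_mean_ar1(n, rho, sigma=1.0):
    """Standard deviation of the mean of n equally spaced errors, each with
    standard deviation sigma, whose correlation at lag k is rho**k."""
    corr_sum = sum((1.0 - k / n) * rho ** k for k in range(1, n))
    return sigma * math.sqrt((1.0 + 2.0 * corr_sum) / n)

for n in (10, 100, 1000):
    independent = sd_of_mean_ar1(n, 0.0)
    correlated = sd_of_mean_ar1(n, 0.9)
    print(f"n = {n:4d}   independent: {independent:.4f}   rho = 0.9: {correlated:.4f}")
```

With rho set to zero the formula reduces to sigma divided by root n. With strongly correlated errors the uncertainty of the mean still falls as n grows, but it stays several times larger than the root n figure, which is one way the behaviour described above could arise.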

    • Chas- as far as I know, the GPS system delivers a coordinate with a fixed variability due to the way the signals have to be analysed. Before GPS went beyond the US military the position was “fuzzed” so it did not represent a random distribution around a centerpoint but a level probability for around the point of something like 6-10 meters. Now they aren’t fuzzing the output and the coordinates represent a point anywhere between 6-10 centimeters. You can’t query the position several times and the coordinates returned don’t have a random Gaussian probability of being anywhere in the box. There is no statistical centerpoint as when a measurement has a Gaussian distribution probability.

      Very similar to what Kip is talking about.

      Most of the methods used by the Standards organizations above deal with measurements where individual measures can be expected to have a Gaussian distribution- chemical tests, electrical measurements, conventional surveying, engineering measurements, etc. That doesn’t apply when the measurement is rounded to an arbitrary figure. There are several good posts on how the Australian BOM goofed when they introduced electronic thermometers. The WMO standard requires averaging measurements over 10 minutes, to mimic the previously used mercury thermometers. The BOM was picking the highest reading found within 1 second and using that as the average. Then, for a while, they had low limits programmed in the data gathering that were well above reasonable limits. -10.5C was automatically reported at -10.C in several areas that had routinely in the past reported -12,-13, -14.

  30. Comment on Temperature:

    Temperature provides very limited information about the energy state of any system not in complete thermal equilibrium. A temperature reading is a highly localized measurement of instantaneous kinetic energy. But the very existence of weather proves the Earth we are measuring is not in thermal equilibrium. To approximate the actual energy state of the large volume of atmosphere or water represented by a single thermometer, we would have to know a lot more about the heat capacities and thermal conductivities and thermal gradients present throughout that volume. And to have any hope of accurately approximating with a single pseudo temperature value the energy state of the dynamically changing entire Earth’s surface at any single moment, we should need a much more uniform and dense distribution of thermometers than we have today.

    Comment on Accuracy:

    I would reframe the debate above in the following terms.

    1. Consensus does not necessarily equal truth.

    2. Measurements are analogous to opinions: they each have some degree of truth and some degree of ignorance/error.

    3. Averaging ignorant opinions leads to consensus. Averaging erroneous measurements leads to consensus.

    4. Averaging more ignorant opinions or erroneous measurements firms the consensus but does not force the consensus to converge toward truth.

    5. The a priori assumption that ignorance or error is random and self-cancelling rather than correlated/biased is unscientific, and likely ignorant and erroneous in its own right.

    Kip is essentially right. Rick C PE is more precisely correct. Mind your significant figures.

  31. Late to the discussion and perhaps (most likely I think) my understanding of Stats has declined considerably since my graduate work in that area 40 years ago, but I’m having trouble accepting this statement by Kip as valid.
    “When temperature is measured at 11:00 and at 11:01, one is measuring two different quantities; the measurements are independent of one another.”

    I may be misunderstanding the concept of ‘independence’ as used. I think that the temperature at 11:01 is dependent to some extent on the temperature at 11:00, at least in the physical world. There has to be some physical limit to how much the temperature can change in one minute. If the measurements are dependent it makes things mathematically very ugly, very quickly.

    • There are two ideas here. Repeatedly measuring a board deals only with measurement error, presumed normally distributed and very tractable via the law of large numbers. Measuring temperature at a place at different times is NOT ‘the same board’, so errors do NOT wash out. Kip’s point. But you are correct the measurements will be autocorrelated. Many time series in economics are autocorrelated. This introduces a number of complications into the statistics concerning them. For one example, see McKitrick’s paper on the statistical significance of the various pauses in temperature anomaly time series.

      • DC Cowboy and ristvan ==> Ristvan is correct — it is simply the point that one is not measuring one thing many times — which then allows the Law of Large Numbers and other statistical ideas to be applied to the multiple results. One measurement — one result — no large numbers.

      • “Repeatedly measuring a board deals only with measurement error, presumed normally distributed and very tractable via the law of large numbers. Measuring temperature at a place at different times is NOT ‘the same board’, so errors do NOT wash out.”

        Not to be a jerk to the contrarians, but I’m truly amazed that this is so difficult to grasp! If you follow along the thought process in the real world with real instruments measuring real changing things, it’s pretty simple to see just WHY the errors don’t wash out. Divorce the numbers from context and I suppose it becomes harder to see why it doesn’t work.

    • It’s not the temps that are independent, it is the measurements. You are not measuring the same thing at both times.

      • Jim Gorman ==> I concede — the temperatures are not truly independent at that scale — it is the measurements that are.

    • You are right. The measurements are not independent measurements of the same quantity. They are strongly autocorrelated.

  32. We seem to be discussing the resolution of gauges. There are also conversion errors, observation errors, recording errors, and the biases and prejudices of recording personnel, to name a few. Together with siting problems, I would think that surface gauge averages indicating 0.85 degrees of warming are totally meaningless. Surface observations such as frost-free periods or river ice breakup dates are probably more reliable indicators if records were available for long enough periods. Glacial melt-backs and sea ice extent do not seem to me to be reliable indicators. After all, the ice in my drink will continue to melt without warming the room as long as the room temperature is above the freezing point. In fact, the melting ice would cool the room a tiny amount.
    https://wattsupwiththat.com/2011/01/22/the-metrology-of-thermometers/

  33. I will quietly say it one more time.
    If the measurements of a single physical quantity are randomly distributed about the ‘true’ value, the error in the mean of the results is the error of the individual samples divided by the square root of the number of samples.

    This fact is actually used to make digital audio recording sound better. Digital audio sampling is accurate to the last bit, and values near that bit will be consistently too low, or too high. By adding about a bit of random noise, the samples AVERAGE OUT to the correct value with a greater precision than is possible using a single digital sample on its own.

    Simple thought experiment. You want to represent 7.5 using only integers and averages.

    One sample of 8 and one sample of 7 gives you 7.5

    In fact any real (decimal) number can be represented by the average of an (infinite, if needs be) sum of integers. That’s a similar case to the ‘fractions versus decimals’ argument.

    Going back to the audio example, we have a CONSISTENT error. And that means that without the addition of randomness to randomize the error, trying to measure 7.49 for example, nets us 7, all the time, forever.
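    The 7.49 example can be tried out directly. What follows is a minimal Python sketch, not anyone's production audio code: the function names, the uniform noise one quantization step wide, and the sample count are all illustrative assumptions.

```python
import math
import random

def quantize(x):
    """Round a value to the nearest integer (ties go up)."""
    return math.floor(x + 0.5)

def mean_of_readings(true_value, n, dither_width=0.0, seed=0):
    """Average n integer readings of one fixed true value.

    dither_width is the full width of the uniform noise added before
    quantizing (an assumed noise model); 0.0 means no dither at all.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(n):
        noise = rng.uniform(-dither_width / 2, dither_width / 2)
        total += quantize(true_value + noise)
    return total / n

plain = mean_of_readings(7.49, 10_000)                       # every reading is 7
dithered = mean_of_readings(7.49, 10_000, dither_width=1.0)  # a mix of 7s and 8s
print(plain, dithered)
```

    Without dither the average stays at 7 however many readings are taken. With dither it drifts toward 7.49, but only because the added noise is random and the quantity being measured is held fixed.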

    And that seems to be the key misunderstanding. Consistent error and normal probability errors. Consistent errors will stay the same no matter how many samples we take. If the thermometer is mis-calibrated and is reading a degree low, no amount of readings will improve the result.

    On the other hand a sample of 1000 thermometers dipped in the same bucket, as long as they have a random error distribution, will.
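    The bucket example can be sketched the same way. This is a toy Python model under loud assumptions: every thermometer error is independent and Gaussian (exactly the assumption other commenters dispute), and the bias, noise level and seed are made-up values.

```python
import random

def average_reading(true_temp, n, bias=0.0, noise_sd=0.5, seed=1):
    """Mean of n readings, each = true value + shared bias + random error.

    The Gaussian, independent error model is an assumption made for
    illustration, not a property of any real thermometer.
    """
    rng = random.Random(seed)
    readings = [true_temp + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]
    return sum(readings) / n

true_temp = 20.0
random_only = average_reading(true_temp, 1000)           # random error only
with_bias = average_reading(true_temp, 1000, bias=-1.0)  # all read a degree low
print(random_only, with_bias)
```

    In this model the random part shrinks as readings are added, while the one-degree bias comes through the average untouched.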

    The ‘average temperature of the earth’ has meaning because we give it meaning. It probably means ‘to a very close approximation the average over a year of many perfect thermometers readings taken every ten seconds, at 1km intervals over the surface of the earth’.

    The fewer readings there are and the more imperfect the thermometers, the less meaningful that average is, which is expressed as a rise in error bar size.

    BUT to deny that that average is more meaningful than a single measurement taken once is, um, Scientific and Mathematical Denialism, frankly.

    Statistics is hard. I hated it more than any other maths. It’s regularly abused, BUT it’s hugely useful if you know what you are doing. Unfortunately most people don’t, and I don’t exclude myself.

    But I do know the very basics, and that’s what I have tried to illustrate here. The difference between consistent bias on all measurements, and random error probability. One can be averaged out, the other cannot be.

    Kip seems to confuse the two.

    • Leo: “On the other hand a sample of 1000 thermometers dipped in the same bucket, as long as they have a random error distribution, will.”

      But what we have is 1000 thermometers dipped into 1000 different buckets. So the calibration and construction errors will still average out but are no longer measuring the same thing.

      We have several thousand gridcells with variable numbers of readings done by varying methods.

      How does that affect the stats?

      The “meaning” we attribute to the mean is not arbitrary; it is being taken as a cast-iron indication of the supposed effect of GHE, i.e. it is not just a question of how good the mean is as a mean (an “expectation value”); it is implicitly a calorimeter: a measure of the total heat energy.

      • Greg:

        “So the calibration and construction errors will still average out but are no longer measuring the same thing. ”

        We have no idea if it is true they will average out. Manufacturers are under no obligation to create instruments that report results that are randomly distributed between the error limits. It is far more likely they tune it from one side, stop tuning when it gets inside the limits, then move on to the next one.

    • Your evidence that the error distribution for thermometer measurements is random? Not saying it isn’t, just that I’ve seen no proof that it is or is not. That would be an interesting study.

    • Leo ==> You cannot average out the fact that the original measurement record is a range. While you may get as precise an answer as you wish through averaging (long division), you do not eliminate the original range. Do the little experiment described in the Author’s Comment section.

      This is not a matter of statistics — which deals with probabilities. This is a matter of measurement.

      • But what we have is 1000 thermometers dipped into 1000 different buckets. So the calibration and construction errors will still average out but are no longer measuring the same thing.

        Yes, they are measuring something that is the same thing – the signal over that time frame of 1,000 buckets. That’s exactly how your sound system works. Take 1,000 samples of the sound with an 8-bit A/D. Then average them. You get *one* output sample that has 13 bits of resolution (8 + log2(sqrt(oversampletimes))). Do that at 20 MHz and suddenly your terrible 8-bit A/D is not so bad for 20 kHz signals.
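        The arithmetic behind the “13 bits” figure can be checked with a few lines of Python. The function name is hypothetical, and the formula is simply the one quoted in this comment; it only holds when the noise between samples is uncorrelated and at least roughly one least-significant bit in size.

```python
import math

def effective_bits(adc_bits, oversample_factor):
    """Resolution in bits after averaging, using the rule quoted above:
    adc_bits + log2(sqrt(oversample_factor))."""
    return adc_bits + math.log2(math.sqrt(oversample_factor))

print(round(effective_bits(8, 1000), 2))  # just under 13 bits
print(effective_bits(8, 4))               # each 4x oversampling adds one bit
```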

        You are trading off time resolution with measurement resolution. This is standard signal processing work. If you want formal proof, it’s done in the first year of EE graduate school in typically the digital signal processing class. Hope you like math…

        Or you could look at some pictures. This datasheet from Atmel explains it fairly well: http://www.atmel.com/Images/doc8003.pdf

        Peter.

  34. To throw something into the discussion, it would appear Nick Stokes and Mark Johnson are reifying “average temperature” and “average sea level,” presuming that the concepts have a sort of Platonic essence outside the procedure used to derive them. Reification has occurred in psychology with terms like “intelligence quotient”, where practitioners fall down a metaphoric rabbit hole when they forget that the number is the result of tests with some repeatability with a given subject, and some correlation with other tests purportedly measuring the same thing.
    Nick S and MSJ, one is measuring different things multiple times, not the same thing multiple times. So I agree with Kip Hansen, that the average has no more precision than any one of the separate measurements. Plato is treacherous as a guide to reality.

    • +1

      MSJ and Stokes-as-he-ever-was need the Central Limit Theorem to apply to temperatures and sea level data, so they just keep saying it does, no matter whether it makes any sense or not. If you have measured air temperature at one time and one place, you have measured one temperature, one time. You cannot measure it again, as it is not the same next time.

      • Peter Sable,

        time series (noun, Statistics): a series of values of a quantity obtained at successive times, often with equal intervals between them.

        Thermometers and tide gauges do not measure “a quantity.” They measure different things each reading. Flail away at it all you like, no CLT…

    • +10 you have realized one of the two real problems the Stokes group haven’t addressed. No-one asked them for their calibration because they are dealing with things that are proxies which is what you are trying to describe.

      I don’t agree with your conclusion however, because I simply don’t know or see the answer to how much has been done. I am fence sitting and I am interested to see what the scientists did with this problem because it’s technical it’s not something you will find in the general media.

      Even out in the real world it is solvable; you see, LIGO has to do this sort of thing. I am genuinely interested to see how climate science has dealt with it.

      • The Central Limit Theorem as applied to time series signals is simply trading time-resolution for measurement resolution. If you have 365 days worth of temperature, you know the year’s average temperature (one number) to a pretty good degree of measurement accuracy – better than the individual measurements. You know nothing extra about what the temperature was on Feb 2nd.

        If you don’t believe this, then you must sell any digital sound or video equipment you have and go find some vintage analog gear, because this principle is how all modern digital A/V gear works…

      • to a pretty good degree of measurement accuracy

        Ooops, editor escape. Should have said precision.

        Accuracy is a different problem, though I’ll argue you get the same central limit theorem effect by independently calibrating 1000 thermometers. Unless one can prove a bias effect in such calibration that varies over time… (but that’d be a different topic)

  35. Kip, You correctly point out that the probability is flat across each original data point interval and that when you average a number of points that possible outcomes span the same width interval. What you left out is that the probability is no longer flat, as the extreme cases only occur for one possible combination of values whereas the middle can be found from many combinations.

    • MikeP ==> Probabilities are the subject of statistics. Measurement is measurement — the extreme values are just as possible and probable as any other value in the range — and only appear to be extreme because the integer that has been selected as the record is already the mean of the range, before any measurements have taken place.

      • Kip,
        Your tide example involved sigma’s, which you wrongly interpreted as ranges. They are probability moments. In science, you can’t get away from them. You normally don’t have ranges at all, and they are not what proper scientists mean when they talk about error. They mean standard deviation, or standard error.

      • sorry, but when you average, the only way to get an extreme result is if every measurement happens to be at the same extreme end. You can get the middle result a multitude of ways. The more data points averaged, the closer the average is to Gaussian, even though every single point being averaged represents an interval.
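        The counting argument can be made concrete with a small enumeration in Python. The setup is an assumption chosen for illustration: each reading's error is one of five equally likely offsets, readings are independent, and three readings are averaged.

```python
from collections import Counter
from itertools import product

def sum_counts(offsets, n):
    """Count how many combinations of n offsets give each possible sum."""
    return Counter(sum(combo) for combo in product(offsets, repeat=n))

offsets = [-2, -1, 0, 1, 2]   # flat: each single-reading error equally likely
counts = sum_counts(offsets, 3)
for total in sorted(counts):
    print(f"mean error {total / 3:+.2f}: {counts[total]} of {len(offsets) ** 3}")
```

        The possible mean errors still span the full -2 to +2 range, but only 1 of the 125 combinations lands on each extreme, while 19 land exactly on zero.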

      • Nick Stokes ==> Yours:

        Your tide example involved sigma’s, which you wrongly interpreted as ranges. They are probability moments.

        I admire your knowledge and usually your opinion, but in this case, you are in error.

        The tide example does not deal with sigmas at all (except that NOAA CO-OPS lists a sigma for each of the 181 1-second measurement sets. )

        The range of the tide gauge data stems from the specification and design of the tide gauge instrument itself, and its accuracy is specified by NOAA in its official documents. In order to ensure that they were not using sigmas or SDs or CIs, I specifically queried this point before writing here….I will repeat my question and their answer;

        The six-minute figure is calculated as follows:

        “181 one-second water level samples centered on each tenth of an hour are averaged, a three standard deviation outlier rejection test applied, the mean and standard deviation are recalculated and reported along with the number of outliers. (3 minute water level average)” (from NOAA’s spec sheet)
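        The procedure in that quote can be sketched in Python. This is an illustration only, not NOAA's code: the function name is made up, the single rejection pass and the use of the population standard deviation are assumptions the spec sheet does not settle, and the sample values are invented.

```python
import statistics

def six_minute_value(samples):
    """Return (mean, sigma, n_outliers) after one 3-sigma rejection pass.

    Assumes a single pass and the population standard deviation; the
    quoted spec does not say which variant is actually used.
    """
    mean = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    kept = [s for s in samples if abs(s - mean) <= 3 * sigma]
    n_outliers = len(samples) - len(kept)
    return statistics.fmean(kept), statistics.pstdev(kept), n_outliers

# 181 invented one-second levels in metres: 180 ordinary values and one spike.
samples = [1.00, 1.02] * 90 + [2.00]
mean, sigma, n_outliers = six_minute_value(samples)
print(mean, sigma, n_outliers)
```

        Note that the sigma returned here describes only the spread among the 181 samples; it says nothing by itself about the +/- 2 cm instrument specification.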

        Just to be sure we would understand this procedure, I emailed CO-OPS support [ @ co-ops.userservices@noaa.gov ]:

        To clarify what they mean by accuracy, I asked:

        “When we say spec’d to the accuracy of +/- 2 cm we specifically mean that each measurement is believed to match the actual instantaneous water level outside the stilling well to be within that +/- 2 cm range.”

        And received the answer:

        “That is correct, the accuracy of each 6-minute data value is +/- 0.02m (2cm) of the water level value at that time. “

        [ Note: In a separate email, it was clarified that “Sigma is the standard deviation, essential the statistical variance, between these (181 1-second) samples.” ]

        The +/- 2cm is NOT a sigma, nor is it a Standard Deviation, a Confidence Interval, or “error bars”. It is the accuracy of the original measurement, in the first instance, and attached to the six-minute mean as the accuracy of those means. They say accuracy, and they mean accuracy.

      • Kip,
        “They say accuracy, and they mean accuracy.”
        Well, they say estimated accuracy. But there is nothing in what they say (as opposed to what you put to them) that is inconsistent with the normal understanding that if a number is expressed a±b, then b is a sigma – a normal deviation within which 2/3 of occurrences would be found. From your email chain, I agree that their explicit remark about sigma was about the 6 minute numbers, not the 2cm estimate. Sorry for getting that wrong.

        The fact that they do reduce 2cm to 5mm for the monthly average is consistent with that a±b interpretation. You insist that they mean intervals and so get it wrong. I think the evidence for that is weak; much more likely is that they use the standard interpretation as a sigma, and it reduces as the number of readings averaged grows.

      • Nick ==> In actual fact, they arrive at the monthly reading through simple “add and divide” finding of the mean using all the six-minute records for the month (I confirmed this by test). What they do as well is ignore the original measurement uncertainty altogether, and treat each six-minute mean as if it were a discrete, exactly accurate measurement and decide upon SD through standard statistical theory.

        Of course, my view is that this is incorrect — some of the professors and engineers reading here agree with me — some others with you.

        I am positive that your view represents the standard statistical approach — I just believe it is inapplicable to these types of situations, creating inappropriately “precise” means and changes in them that are not physically supportable.

  36. I am considering coding up a computer model to exercise the various concepts of temperature measurement discussed in this article. With an idealized model, probably consisting of various continuum math equations so that an exact answer is known a priori, and then testing various simulated temperature-taking scenarios, I hope to definitively confirm if Kip is right or if his detractors have a case.
    I will have to sketch out on paper for a while how to proceed. How to simulate someone measuring a temperature within my model. What various forms of varying temperature models to use and how statistics might be used.
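    One possible starting point for such a model, sketched in Python. Everything here is an assumption for illustration: the sine-wave "day", its 15.3 degree mean, the one-reading-per-minute schedule, and whole-degree recording. It is scaffolding for experiments, not a verdict on the dispute.

```python
import math

def true_temp(hour):
    """Idealized daily cycle whose exact mean is known to be 15.3 degrees."""
    return 15.3 + 6.0 * math.sin(2 * math.pi * hour / 24)

def simulate(readings_per_day=24 * 60):
    """Compare the mean of exact values with the mean of rounded records."""
    hours = [24 * i / readings_per_day for i in range(readings_per_day)]
    exact = [true_temp(h) for h in hours]
    recorded = [math.floor(t + 0.5) for t in exact]   # whole-degree records
    true_mean = sum(exact) / len(exact)
    recorded_mean = sum(recorded) / len(recorded)
    return true_mean, recorded_mean

true_mean, recorded_mean = simulate()
print(f"true mean {true_mean:.3f}, mean of rounded records {recorded_mean:.3f}")
```

    The gap between the two means will depend on the signal chosen, so swapping in other temperature curves, adding instrument bias, or changing the recording resolution are the obvious next experiments.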

    • “I hope to definitively confirm if Kip is right or if his detractors have a case.”

      Bad objective, likely to lead to confirmation bias or non reporting of a negative result. This is exactly what is wrong with climatology.

      I would suggest you set out to improve your understanding of how errors accumulate and combine. If you think you will find +/-0.5 degrees I can tell you now you won’t, but I would say do your test, fully document your model and preferably make it available for others to play with. There may be improvements to be made to how you simulate uncertainties and measure errors. I do think you will find current estimations of uncertainty are optimistic, so the effort could be worthwhile.

      • Sorry, I misread what I quoted. Hoping to have a positive indication one way or the other is fine. Like I said it is well worth looking into.

  37. Nick Stokes is talking about purely random error. Systematic error does not reduce by the number of readings. There is also the problem of averaging readings from different areas, each with its own error.

    • Also, if you have two random errors which are orthogonal ( independent causes ) you still have two errors whose uncertainties need to be combined.
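      For independent error sources, the usual way to combine standard uncertainties is root-sum-square. A short Python sketch follows; the function name and the example values are illustrative, and the rule does not apply to correlated or systematic errors.

```python
import math

def combined_uncertainty(*components):
    """Root-sum-square of independent standard uncertainties."""
    return math.sqrt(sum(u * u for u in components))

# Two independent sources of 3 mm and 4 mm combine to 5 mm, not 7 mm.
print(combined_uncertainty(3.0, 4.0))
```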

      • OR you could have a number of localized results distorting the average. What they have not talked about is any sort of main effect testing which has stunned me given they are statistics peeps. They have said absolutely nothing about the behaviour of the sample space.

    • “Nick Stokes is talking about purely random error.”
      So is Kip. There is no systematic error in either of his examples.

      The thing is, it isn’t easy to be systematic. In any large dataset, the errors are apt to be uncoordinated. It would be hard to coordinate. If there is systematic error, that is bias. And yes, it doesn’t reduce under averaging, which is why people make strenuous efforts to identify and remove it.

      • NS,
        The stated range of +/- 2cm implies that each and every site has its unique systematic bias. The manufacturers are warranting that none in the array exceed that value. That is the only reasonable explanation for tide gauges that have a precision of 1 mm, but an accuracy of 20 mm.

      • Nick and Clyde ==> Actually, I am not really talking about BIAS or ERROR. The mechanical device, the tide gauge itself, the stilling well design and real world function, its actual performance, means that even though the acoustic sensor inside the stilling well can discern the height of water inside the stilling well to 1 mm precision and accuracy, the overall design can only return an instantaneous accuracy of the water level outside the stilling well to +/- 2 cm. NOAA tries to limit the inaccuracies by finding the MEAN of 181 1-second measurements, throwing away 3-sigma outliers, and recalculating the Mean and the Sigma of that 181 (minus outliers) data set. They properly assign an ACCURACY of the six-minute recorded values at the same +/- 2 cm.

        My guess is that certain water conditions bias high, others bias low, currents one way bias high in one direction and low in another, or some other design consideration that limits the overall ACCURACY of the instrument to +/- 2 cm. I would expect that the manufacturers of this equipment must prove through actual field testing, that their tide gauge meets the specification of returning an actual-in-use accuracy of +/- 2 cm (after being allowed to discard all 3-sigma 1-second measurements).

      • Kip,
        I would agree that one could have a bias that is dependent on the tide. Someone else has pointed out that because of time delay, the water in the stilling well probably lags what the water outside is doing.

      • Clyde ==> We could make one of these stilling wells with transparent tubing and put it in our bathtub or swimming pool, get some kids to splash about, and we would see just what NOAA illustrates in the image used in the essay. Mechanically, the water inside must lag the water outside. So, yes, that’s my guess as part of the reason NOAA specs the whole system at +/- 2 cm for instantaneous measurements, and not the resolution of the acoustic sensor — the whole sensor system is in fact 40 times less accurate than the sensor itself.

    • We term these ‘systematic errors’ and ‘experimental errors’. One expects that experimental errors are normally distributed unless there is good cause to say otherwise. Systematic errors are irreducible because they are built into every measurement.

      If a thermocouple is mis-calibrated by 2 degrees, all the readings are out by two degrees and are not made ‘more correct’ by taking many more readings. Similarly a thermocouple that is ‘within spec’ is not necessarily giving randomly varying readings around a perfect result, it is giving results within the specific limits.

      Expecting sea level readings to be normally distributed around the true level is like expecting waves on the ocean to be sine waves. Given the 20mm range in the original, that 5mm long term range is unsupportable.

  38. Global temps are continually “corrected” yet the uncertainty is always the same. This means that the method of assessing the uncertainty is not accounting for all the errors. The “adjustments” which were deemed necessary were not in the original error model !

  39. This discussion is fun. In Physical Chemistry 101 we had a lab, naturally. One of the experiments was to measure the acceleration of gravity with paper tape apparatus. There were 200+ students split into teams of 3 or 4. The focus was on experimental error. We measured the distance between dots on the tape with a steel rule graduated in 2 mm ticks, and timed with a stopwatch showing full seconds and tenths. We had to estimate the accuracy of each measurement by taking multiple measurements of two marks and of the time. The overall results for the class were quite good, something like 9.65-9.9 m/sec^2 with a standard deviation of […]. The range between teams was not very good, something like 8.9-10.3 m/sec^2. The reports had to include a total estimated error: the final range of possible error summed according to the equation. Of all the individual acceleration measurements only about half fell within +/- 1 std deviation (fat tails), but every measurement but one fell within the total error estimates.

    The Prof specifically said when reviewing the results that the point was not to be too focused on the result but to have a realistic estimate of how far wrong an experiment could be. The other point made was that a few teams had results within 0.1 m/sec^2, while others made as many tries and got a spread of 0.5 m/sec^2: the difference between accuracy and precision.

    More or less to the point Kip was making. Measurements are not the same as what you are measuring, and it matters how measurements are made and what they actually represent. Measuring m/sec^2 is pretty trivial. Trying to estimate something as insubstantial as the global average temperature trend is getting to the point of meaningless since we can’t even begin to know how all the variables involved affect the result, and the GAT, which is an extensive measurement, has a very convoluted and undefined relationship to what the climate actually does.

    • Suppose this experiment with the same many measuring teams was repeated in a place where gravity was .2 m/sec^2 stronger or weaker, or the local gravity changed by .2 m/sec^2 due to something that is a matter of fiction. What is the expectation of this hypothetical change of gravity being detected, and with how high % of confidence? How much would the gravity in your lab have to change for the sum of your student teams to have detected a change with 95% confidence that the change occurred within +/- 99.9%, +200/-99.9%, or +/- 50% of the gravity change indicated by your teams? With consideration that your low accuracy high precision teams are probably in good shape to detect a change in what they are measuring inaccurately with high precision? (This reminds me of a story by someone young trying out shooting a gun at a gun range, misinterpreting how to use the old tech sight, and shooting a tight small cluster whose size was smaller than its center’s distance from the center of the target.)

  40. Kip, you have a gift to make complicated things simple. I dare say your classic stat
    critic has stepped back. It’s not that he or she wouldn’t understand, of course, but rather in a negligent reading of your point, saw only a violation of a well established principle that actually wasn’t at issue in the case you were describing.

    The main criticism I would have with the unhappy statistician is, having read your piece negligently, this person then resorts to an insult re your competence in science. I’m not sure I would have confidence in a statistician who has a habit of not reading carefully before arriving at such an outrageous conclusion, unless you have since received a sincere apology.

  41. The argument is about measurements which vary in time and value such as temperature versus measurements which vary in value only such as hole size.
    One measurement has two dimensions, time and temperature. The other measurement has one dimension, size.
    When measuring temperature it’s impossible to increase accuracy by taking multiple measurements since the temperature varies in time. The only way to increase accuracy is by taking multiple measurements using multiple thermometers at exactly the same time and averaging them. This eliminates the time dimension and makes it possible to use single dimension tools like averaging and standard deviation to estimate the error.
    When an item is measured in a metrology lab, extreme measures are taken to keep things like the temperature of the measuring instrument and the item being measured at the same temperature. Air drafts are blocked and everything is handled as little as possible so that there are no variations of dimension over time. With the time dimension eliminated, it is now possible to make repeated single dimension measurements at different intervals using a single measuring instrument.
    The only way that single thermometer measurements can be averaged is when the temperature is known to be stable over the measuring interval such as when a thermometer is calibrated using a known stable temperature such as the triple point of water. Repeated measurements can be made and single dimension average and standard deviation can be calculated.
    Which brings up another point. Assume that you are calibrating a digital thermometer which reads to 1 degree and it’s calibrated using a precision temperature source accurate to 1/100th degree. The readings are 100, 100, 100, 100, 100, 100, etc. What is the accuracy of the thermometer?
    You have a measuring stick which has 10 foot units and use it to measure the average population height. The readings are 0, 0, 0, 0, 0, etc. What is the average population height and standard deviation?
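    The measuring-stick question can be played out in a few lines of Python. The population here is hypothetical (heights drawn around 5.6 feet), and the stick is assumed to count whole 10-foot units only.

```python
import math
import random
import statistics

def read_with_stick(height_ft, unit_ft=10.0):
    """Reading in feet, counting only whole stick units (assumed behaviour)."""
    return unit_ft * math.floor(height_ft / unit_ft)

rng = random.Random(42)
heights = [rng.gauss(5.6, 0.3) for _ in range(1000)]   # hypothetical population
readings = [read_with_stick(h) for h in heights]

print(statistics.fmean(readings), statistics.pstdev(readings))
```

    Every reading is 0, so both the average and the standard deviation come out as 0: the statistics look perfectly tight while telling us nothing about the real heights.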

    • AZeeman,
      You asked, “What is the average population height and standard deviation?” The obvious conclusion is that there is a minimum measurement interval that is required to improve measurements. That minimum is what will produce a different value each time. The implication is also that even with that minimum fiducial increment, there is a limit to how much the accuracy or precision can be improved. One cannot take an infinite number of samples or measurements and get infinite accuracy or precision!

  43. Can you measure “global temps” by measuring averaged air temperatures or even combined air and sea temperatures? (Notice the use of the word “combined” rather than “averaged.”) It seems all global temps are good for is observing trends. The true temperature of the Earth’s surface is much lower than what air temperatures indicate due to the extreme cold of the ocean abyss. Trenberth has acknowledged this when he claims that deficiencies in models are due to heat hiding in the oceans.

  44. Readers ==> I am not ignoring your comments — but I have other obligations Sunday mornings (Eastern Time). I will be with you in a couple hours and try to address your concerns.

    I have a few minutes now, and will start at the top, working my way down. Thanks for your patience. — kh

    • Kip, here’s a thought experiment that might help. Say I’m trying to get the average height of a human male, and I measure 1000 randomly selected males, with an accuracy of 1 inch. Averaging all those measurements gives me a ‘suspected’ average height. Now I measure 1000 males a second time, but I can’t guarantee the same set of 1000 is in my second sample average. I keep repeating this 100 times. If I average those 100 measures, can I say the result is any more accurate? Intuitively, I don’t think so, since each time I’m not measuring the same thing. On the other hand, if I measured the SAME 1000 males 100 times, I could potentially feel better about the result, actual measurement errors tending to cancel out.

      • Taylor

        I would say that your accuracy remains the same, and the experimental error is reduced. What some above refuse to accept is that these two things are additive. Vastly reducing the experimental error does nothing to reduce the uncertainty about the measurement, which is an inherent property of the apparatus.

      • Taylor ==> Thanks for joining in — my view would be that you can get better and better, more precise averages, but would need to note that your more precise average was only accurate to +/- 1 inch.
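    The distinction drawn in the replies above, between random experimental error and a fixed error belonging to the apparatus, can be sketched in Python. This is a sketch under stated assumptions: one quantity measured repeatedly with one instrument, a fixed offset of 0.7 inch (inside the +/- 1 inch accuracy), and independent Gaussian scatter. All numbers and names are hypothetical.

```python
import random
import statistics

TRUE_HEIGHT = 69.0   # hypothetical true value, in inches
BIAS = 0.7           # assumed fixed instrument offset, within +/- 1 inch
NOISE_SD = 0.5       # assumed random reading scatter, in inches

rng = random.Random(1)


def mean_of_readings(n):
    """Mean of n readings that share one bias but have independent noise."""
    return statistics.fmean(TRUE_HEIGHT + BIAS + rng.gauss(0, NOISE_SD)
                            for _ in range(n))


errors = {n: mean_of_readings(n) - TRUE_HEIGHT for n in (10, 1_000, 100_000)}
for n, err in errors.items():
    print(f"n = {n:>7}: error of the mean = {err:+.3f} inches")
```

    Under these assumptions the scatter of the mean shrinks as readings are added, while the error of the mean settles on the shared offset rather than on zero.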

  45. Is the ‘global surface temperature’ construct valid or meaningful at all?

    A crude example:

    Air temperature in Canada increases this year by 1 degree, air temperature in Australia decreases by 1 degree, global average temperature remains unchanged.

    Canada is predominantly forest and plains and cool, Australia is predominantly desert and hot.

    My crude example exposes the myth that a single figure for a worldwide temperature can represent something useful.

    Increasing the number of temperature sensors and adding increasingly devious ways to merge temperature together does not make the single figure more useful.

    Trying to say anomalies gets round the problem, is a bit like saying ‘I know the answer is wrong, but I get the same result no matter what method I use, therefore it doesn’t matter how we do the sums…’.

    I’m with Kip.

    • Steve Richards,
      I have previously advocated monitoring climate zones for changes. They can be aggregated subsequently with weighted averages, but we could see more readily if all areas are changing, and if they are, which are changing most. It would certainly give us a better understanding of what is happening than using a single number!

    • The global temperature can be measured with great precision. The Central Limit Theorem ensures this. Kip’s above analysis is completely flawed in this regard, which is too bad because he does other good work.

      Global temperature cannot be measured with great accuracy over spans of centuries. Far too many biases, known and unknown. I believe the satellite data, that’s about it, and given the recent changes, I keep increasing the error bars in my head… if y’all would stop adjusting things, your error bars would be far more believable.

      As to whether it matters whether the global temperature changes by 1.0 degC? Not at all noticeable. If it were to change by 10 degC we would likely all notice. I note this scale of ‘notice-ability’ is non-linear. Which is why the small changes we see are useless metrics to look at, and are being extrapolated to the tune of billions of dollars of mis-spent money.

      • Peter ==> “The global temperature can be measured with great precision. ” That part is true, then add back on the original measurement uncertainty to get a valid picture of the accuracy of that global mean.

        All this signifies is that while they might be able to detect changing precise means in the hundredths of a degree, the accuracy being +/-0.5°F, we can’t be sure that the temperature has actually changed until the change at least exceeds the uncertainty in measurement.

        This is why Mosher is right: we know it is warmer now than in the depths of the LIA, maybe warmer now than in 1960, but not sure if it is warmer now than in 1998.

  46. Reductio ad absurdum.

    Let us assume that taking multiple low accuracy measurements and applying some mathematical treatment to the results CAN derive a figure more accurate than the individual low accuracy measurements. A figure closer to the actual reality. If this is so then the increase in the accuracy must bear some relationship to the number of measurements taken for a given mathematical treatment.

    We assume the mathematical treatment is fixed. We simply increase the number of measurements. If the accuracy increases, does it increase forever? In other words, if we apply a massive number of readings, say a trillion, will we achieve, for example for temperature, a resolution of 0.0000001 degree? If this were true then we would not have to waste time and energy in designing more accurate instruments. We could simply continue to use the low accuracy ones and take more readings.

    Want to know the length of a piece of steel to 1/10thou of inch but only have a school boy’s ruler? Simple, just take 10,000,000 readings and use the special app on your iPhone. Nonsense? Yes, I think we will agree on that example. So therefore the conclusion is, assuming we CAN derive greater accuracy via increased number of readings, that there MUST be a limit. If there is a limit then there is a certain type of mathematical relationship between the increase in the accuracy and the number of readings.

    This relationship must be a graph, showing increase in accuracy v number of readings. The graph must be something fundamental, not dependent on the thing being measured or physical units, it MUST be something mathematical. So for a given mathematical/statistical treatment of any measurement it would be possible to derive an exact number for the number of measurements necessary to increase the accuracy by, say, double. Let this number be N.

    So we take Y measurements with an accuracy of, say, +/- 10%, i.e. the reading taken may differ by up to 10% from the REALITY.
    By then taking (N x Y) readings we can, via mathematical treatment of them, derive a figure which is better, only +/- 5% variation from reality.

    However my colleague who had more time to waste actually did take (N x Y) Original readings with the 10% accuracy instrument . So all he has to do is take N^2 x Y readings to double his accuracy via mathematics.

    However if we now plot these 2 examples on our graph we have a straight line, indicating that you can increase the accuracy ad infinitum, which we have already agreed is ridiculous.

    Reductio ad Absurdum./ QED

    • Reverend,
      I don’t have a mathematical proof or citation to provide. However, I suspect that the practical improvement of precision is one, or at most two, orders of magnitude because of the requirement to have measuring increments that will result in getting different values each time a measurement is made. That is to say, if one has a measurement of a fixed value that has one uncertain figure beyond the last significant figure, that uncertainty can be resolved by multiple measurements. That suggests that 100 measurements is the practical limit for improving precision. On the other hand, for a very large population that is sampled to estimate the mean of a variable with a large range (e.g. temperature), where the measurements define a probability distribution of the value of the mean, probably the estimate of the mean can be improved with more than 100 measurements. However, in a practical sense, it seems to me that the standard deviation is more informative of the behavior of the variable than an estimate of the mean with improved accuracy. I don’t believe that one is justified in assigning more precision to the estimate of the mean of the variable than the precision of the original measuring instrument. I do address this issue with the Empirical Rule in one of my previous posts.
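    The graph asked for in the comment above can be sketched in Python under the textbook assumptions, which the thread itself disputes: the random error of each reading is independent with standard deviation `random_sd`, any error component shared by all readings (`systematic`) does not average out, and the two are combined in quadrature. The function name and the example values are hypothetical.

```python
import math


def uncertainty_of_mean(random_sd, systematic, n):
    """Combined standard uncertainty of the mean of n readings.

    random_sd  -- standard deviation of the independent random error
    systematic -- uncertainty component shared by every reading
    """
    return math.sqrt(random_sd ** 2 / n + systematic ** 2)


for n in (1, 4, 16, 100, 10_000, 1_000_000):
    purely_random = uncertainty_of_mean(1.0, 0.0, n)
    with_floor = uncertainty_of_mean(1.0, 0.2, n)
    print(f"n = {n:>9}: random only {purely_random:.4f}, "
          f"with shared 0.2 term {with_floor:.4f}")
```

    On these assumptions the curve is not a straight line: halving the random part takes four times as many readings, and any shared component sets a floor that no number of readings removes.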

  47. “When temperature is measured at 11:00 and at 11:01, one is measuring two different quantities; the measurements are independent of one another.”

    Independent? This is time-series data with the expectation of serial correlation over time. If temperature is measured in various places, there will be serial correlation in space, including all three dimensions.

    Temperature measurements are serially correlated in four dimensions implying dependence in four dimensions.

  48. Speaking about unrealized uncertainty,

    Consider that the consensus sensitivity of 0.8C +/- 0.4C per W/m^2 is expressed with +/-50% uncertainty and this doesn’t even include the additional 50% uncertainty added by the RCP scenarios. We know that intelligent life exists elsewhere in the Universe with far more certainty than this (I would put the probability at well over 99%). Which of these two is settled?

  49. Kip,

    I am a scientist and therefore a skeptic about climate science and many other things. When there is an easy way to test something I like to try to do it myself. And in this case there is an easy empirical test. We do not have to believe the statisticians’ theories or assumptions. (Mainly the assumptions are the problem, but not in this case.)

    So I was curious about what you are saying and decided to build a simulation, using the program Mathematica, which enables one to do very large simulations quickly. I wrote the following code (see below), which simulates taking temperature measurements every 1/10 of an hour, for 365 days.

    I first generate a simulated “actual” temperature pattern varying sinusoidally for each day, and over one year to simulate the seasons. Then I add some “random walk” noise to that, which can easily add more than a few degrees to the deterministic “actual temps” (simulating weather variations over the day and year). I thought this might be important to capture any problems with the integer rounding of each measurement done by meteorologists.

    I then find the Tmax and Tmin for each day from the integer “measurement” data (i.e., the “actual temp” rounded to the nearest integer), and average those to get the “measured daily mean” temperature. The program then compares that to the mean that is measured using the “raw” (not rounded) data, where the actual daily mean uses all 240 measurements made during each day.

    To my surprise the actual and measured daily means are very close to each other (usually within 2 or 3 decimal places, sometimes 4 decimal places) when averaged over the whole year (i.e., the yearly mean, found by taking the mean of all 365 daily means).
    This simulation takes 10 seconds on my Mac, so I ran it a dozen times or more.

    However, of course, the daily measured mean is off up to +/- 1 degree, and on average is off by +/- 0.5.

    I do remain skeptical of measured temperatures, and global averages, due to measurement environments warming (Anthony’s work shows it) and temperature adjustment biases of some of the climate scientists.

    Here is the code; it is easy to follow. “rhv” is the name I gave to the “random walk” variation added, where each 0.1 hour the temp can increase or decrease randomly by up to 0.1 degree. Since it is cumulative, it can easily accumulate variations of several degrees and is not bounded in how far it can vary from the sum of the two sinusoidal variations, which have excursions of +/- 12 degrees (daily) and +/- 18 degrees (yearly), together allowing deterministic variations of +/- 30 degrees. 0 degrees is assumed to be the actual mean over all time in the simulation. The final comparison is calculated to 8 digits. Mathematica does arbitrarily large precision.

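    (* Time step t is 0.1 hour: 240 steps per day, 240*365 steps per year. *)
    dailyt = {}; actualdailymean = {}; dailyhigh = {}; dailylow = {};
    measdailymean = {}; rhv = 0;
    Do[
      (* Random-walk "weather" term: up to +/- 0.1 degree per step, cumulative. *)
      rhv = rhv + RandomReal[{-.1, .1}];
      (* Daily sinusoid (+/- 12) plus yearly sinusoid (+/- 18) plus random walk. *)
      temp = 12*Sin[2*Pi*t/240] + rhv + 18*Sin[2*Pi*t/(240*365)];
      AppendTo[dailyt, temp];
      (* At the end of each day, record the true and the "measured" daily mean. *)
      If[Mod[t, 240] == 0,
        AppendTo[actualdailymean, Mean[dailyt]];
        rounddailyt = Round[dailyt];  (* readings rounded to whole degrees *)
        tmax = Max[rounddailyt];
        tmin = Min[rounddailyt];
        AppendTo[dailyhigh, tmax];
        AppendTo[dailylow, tmin];
        AppendTo[measdailymean, (tmax + tmin)/2];
        dailyt = {}],
      {t, 1, 10*24*365}];
    (* Compare the yearly means of the measured and actual daily means. *)
    N[{Mean[measdailymean], Mean[actualdailymean]}, 8]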

    • I averaged the AC voltage coming out of my electrical outlet and it was zero. Since the voltage is zero that means it’s safe to touch. Right?
      I have a square hole and a round hole and make multiple measurements of diameter of each hole and average them out until I get results precise to 10 decimal places. Both diameters measure exactly the same therefore the holes are identical.
      Averaging throws away information and reduces dimensionality. AC voltages have dimensions of frequency and voltage. Averaging throws away the frequency information leaving only the DC voltage information which is useless.
      Averaging works very well for measuring DC since it is known that there are no frequency components which contain information and all frequencies can be filtered out. There is only one dimension to measure:voltage.
      With AC there are at least three dimensions, voltage, phase and frequency. This is why all the “over unity” generators use AC measurements. By reducing three dimensions to one, the loss of information hides the fact the no energy is actually created.
      When making AC or time varying measurements, information on the frequency must be known in advance so that all other frequencies can be filtered out from the measurement. What precise frequencies need to be filtered out to make daily temperature measurements? Averaging doesn’t work since it only removes higher frequencies; it doesn’t remove lower frequencies, which will affect measurement precision. Errors caused by low frequency noise are indistinguishable from measurement error unless the precise frequency values are known and filtered out.
      Climate “science” is like trying to find the pea under the cup. The trick is not trying to follow the cups, but realizing that sleight of hand is used to hide the pea so that it isn’t under any of the cups. Extremely high dimensional computer models are collapsed down to two dimensions, time and temperature, throwing away massive amounts of information in an attempt to fool the common herd.
      What’s important is not what is shown, but what is hidden.

      • Yes, there is obviously a huge loss of information when averaging. No one debates that. But the average (i.e., the mean) is mostly unaffected, even by coarse measurement and a simplified method of finding the mean (e.g., averaging Tmax and Tmin for the day), if the coarse measurement is sufficiently fine (i.e., 10x to 30x smaller than the true variation in the quantity being measured) and if the coarse measurement is done consistently over time, with all the daily coarse measurements of (Tmax+Tmin)/2 averaged over all of the days of a year.

        Not an obvious result, but not all that surprising either.

        That result would clearly NOT be true if there is a consistent asymmetry in the daily variation of temperature (i.e., more time spent near Tmin than near Tmax). In fact that may occur in actual temps in some (or many) locations on the earth (e.g., where the ocean keeps temps near the Tmin for 10 to 20 hours/day, but peak solar heat (1 pm to 2 pm?), with only brief periods of no clouds in the afternoon, determines the Tmax).

        In that case taking (Tmin + Tmax)/2 to be the daily mean temp is a BIG MISTAKE, which should be obvious to all meteorologists, and grade school children as well.
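    The asymmetry described above can be made concrete with a small Python sketch. The temperature profile is entirely hypothetical: a day that sits near 10 degrees with one brief afternoon peak of about 20 degrees, sampled every 0.1 hour as in the simulation earlier in this thread.

```python
import math

STEPS_PER_DAY = 240  # one reading every 0.1 hour


def temperature(hour):
    """Hypothetical asymmetric day: 10 degrees most of the time, with a
    brief afternoon peak of about 20 degrees centered on 14:00."""
    return 10 + 10 * math.exp(-((hour - 14) / 2) ** 2)


readings = [temperature(i / 10) for i in range(STEPS_PER_DAY)]

true_mean = sum(readings) / len(readings)
midrange = (max(readings) + min(readings)) / 2

print(f"mean of all readings: {true_mean:.2f}")
print(f"(Tmax + Tmin) / 2:    {midrange:.2f}")
```

    For this made-up profile the (Tmax + Tmin) / 2 figure lands several degrees above the mean of all readings, and because the offset has the same sign every day it would not average away over a year.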

    • Thanks TA. Someone above suggested he may do a similar thing but it looks like you’ve knocked it out in no time.

      For those who cannot see the logic or refuse to see it, that pretty much knocks it on the head. Nice work.

    • You have a sinusoidal process with random noise. Now try that with an added constant, linear slope, say +.01 deg per day, and see what happens.

    • On initial review, I believe your method is a tautology.

      You have assumed the result.

      There is no “actual temperature” series – only temperature intervals (as represented by the “actual temperature” series). Your process reproduces (approx.) the original series – as it must with your underlying model assumptions.

    • Thin Air,
      One thing that comes to mind is whether the random walk of 0.1 deg is sufficient. When a cold front moves through, tens of degrees change may be observed in a matter of minutes. Similarly, when a dense cloud moves overhead, degrees of change can be expected rather than 0.1 deg. Gusts of wind over a water surface are more likely to provide cooling in the range of a few tenths of a degree. Certainly, when a heat wave descends on an area, changes of much more than 0.1 degree can be expected. I’m curious what would happen if you increased the random walk increment, or had two different random walks of different magnitude and different periods, with the longer period RW having the larger magnitude.
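    The two variations suggested in the replies above (an added linear slope and a larger random-walk step) can be tried with a Python sketch modeled on the simulation earlier in this thread. It is not the original Mathematica code. It assumes one daily cycle per 240 steps and one yearly cycle per 365 days, and the parameter names `step` and `trend_per_day` are hypothetical.

```python
import math
import random

STEPS_PER_DAY = 240  # one reading every 0.1 hour


def simulate_year(step=0.1, trend_per_day=0.0, days=365, seed=0):
    """Return (mean of measured daily means, mean of actual daily means).

    step          -- assumed maximum random-walk change per 0.1 hour
    trend_per_day -- assumed constant linear slope, in degrees per day
    """
    rng = random.Random(seed)
    walk = 0.0
    measured, actual = [], []
    for day in range(days):
        readings = []
        for i in range(1, STEPS_PER_DAY + 1):
            t = day * STEPS_PER_DAY + i          # time in 0.1-hour steps
            walk += rng.uniform(-step, step)     # cumulative "weather" noise
            temp = (12 * math.sin(2 * math.pi * t / STEPS_PER_DAY)
                    + 18 * math.sin(2 * math.pi * t / (STEPS_PER_DAY * days))
                    + trend_per_day * t / STEPS_PER_DAY
                    + walk)
            readings.append(temp)
        actual.append(sum(readings) / len(readings))
        rounded = [round(x) for x in readings]   # whole-degree "measurements"
        measured.append((max(rounded) + min(rounded)) / 2)
    return sum(measured) / days, sum(actual) / days


for label, kwargs in [("baseline", {}),
                      ("with +0.01 deg/day trend", {"trend_per_day": 0.01}),
                      ("with 1.0 deg random-walk step", {"step": 1.0})]:
    meas, act = simulate_year(**kwargs)
    print(f"{label}: measured {meas:.4f}, actual {act:.4f}, "
          f"diff {meas - act:+.4f}")
```

    Changing `seed` gives a different random walk. Like the original, the sketch only tests rounding and the (Tmax + Tmin) / 2 shortcut, not calibration error or an asymmetric daily cycle.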

    • Nicely done. Haven’t seen that in that few lines of code.

      The arguers here cannot seem to separate the idea that precision (or accuracy due to lack of precision) is a completely separate argument about whether the mean temperature over a year is physically meaningful (or meaningful across geographies).

      It’s completely valid (and correct) to argue that precision is not an issue, but the physical meaning of the average temperature of a year is a legitimate issue. It’s also very bad to argue as Kip has done about precision and its influence on accuracy (i.e. incorrectly); it just taints the rest of the valid arguments.

      BTW the simulation needs to go a bit further. Though I don’t think anyone has addressed it, the calibration source and precision’s influence on accuracy needs to also be taken into account, because Kip is correct if we are talking about a single thermometer from a single calibration source. Again the CLT applies but only if there are many thermometers calibrated by different operators. I’d be curious if anyone has studied long term changes in calibration methods.

  50. Superb !!

    This piece should be required reading for every voter in the U.S., the E.U. and every other country.

  51. If you have one perfectly calibrated thermometer whose output is integers reading 72 degrees F, the distribution of possible actual temperature is flat, with all temperatures between 71.5 and 72.5 being equally probable. If you have two such thermometers in two different places both reading 72 degrees F, then the distribution of possible actual average temperatures of both places, although it ranges from 71.5 to 72.5 degrees F, is not flat, but with probability of the actual average temperature of both places being zero for 71.5 or 72.5 F, and half as great for 71.675 or 72.375 F as for 72 F. Increasing the number of thermometers does not reduce the maximum possible difference between indicated and actual average temperatures of the region they measure, but the probability of any given actual temperature gets concentrated towards the indicated temperature.

    • Oh my, another assertion fan. If the actual temp is 71.5 and you have one perfect thermometer telling you it’s 72, you believe it could be 71.5. But if you have two telling you the same thing you now state that there is zero probability that the real temp is 71.5 even though we know it is.

      Seems like it really is the twilight of the age of reason. :(

      • Nick: I found it necessary to click on your highlighted “here” to see what you posted.

        Also, my original statement is incorrect in minor ways. I meant to say that with two thermometers rounding to the nearest degree F but otherwise perfect and in two different places and agreeing, that the actual average temperature of the two places had 50% probability of being 3/8 degree off, but I typo’ed one of my figures by .05 degree F off. The second way was my failure to see correctly the probability of various ways for two rolled dice to add up to various numbers – I assumed a parabolic arch, and now that I think about this more I see this as with upsloping and downsloping straight lines. (With a flat region in the center if the two dice have an even and finite number of sides, getting smaller as the number of sides that the dice have increases.)
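    The “upsloping and downsloping straight lines” mentioned in the correction above can be written out in Python. The sketch assumes each thermometer’s rounding error is independent and uniform on [-0.5, 0.5] degree, the same flat distribution described in the original comment. The function names are hypothetical.

```python
def density_of_average_error(a):
    """Probability density of the average of two independent rounding
    errors, each uniform on [-0.5, 0.5]. The result is triangular."""
    return max(0.0, 2.0 * (1.0 - 2.0 * abs(a)))


def prob_within(d):
    """Probability that the average error is within +/- d (0 <= d <= 0.5)."""
    return 1.0 - (1.0 - 2.0 * d) ** 2


for offset in (0.0, 0.25, 0.375, 0.5):
    print(f"offset {offset:5.3f}: "
          f"density {density_of_average_error(offset):.2f}, "
          f"P(|error| <= offset) = {prob_within(offset):.4f}")
```

    Under the independence assumption the worst case stays at +/- 0.5 degree, but the density falls linearly from its peak at zero error to zero at the extremes.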

  52. In order to give their published results the appearance (illusion?) of greater credibility, a statistical trick is often employed by scientists. The ‘trick’ used by many scientists involves . Many of them misuse it in ignorance of its purpose and meaning. When replicate measurements are made on a quantity it is trivial to calculate the mean and standard deviation. Often this provides a rough idea of the magnitude, but the SD may produce what look like unsavory ‘error bars’. To circumvent this embarrassment, the scientists may focus on the “Probable Error of the Mean”, a quantity that is smaller than the SD by a factor of the square root of (N-1), where N is the number of measurements. If 100 measurements were made of the quantity, then the PEM will be nearly an order of magnitude smaller than the SD, a much more ‘confident’ result.
    The problem is that this is ONLY valid when the replicate measurements are made on exactly the same sample, with the same techniques, and within a brief time span during which the sample cannot be expected to change.
    I once published my direct measurements by mass spectrometry of the helium-3 content of atmospheric air in a peer-reviewed journal. Not long after publication I had another researcher contact me and practically BEG me to report the PEM rather than the SD. I would not do this as the measurements were taken on a large number of discrete samples. He was somehow disappointed.
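    The two quantities contrasted above can be computed side by side in Python. The replicate values are hypothetical. The quantity described above as the SD divided by the square root of (N-1) is treated here as what is commonly called the standard error of the mean, which is an interpretation rather than something the comment states.

```python
import math
import statistics

# Hypothetical replicate measurements of one quantity (arbitrary units).
values = [7.21, 7.25, 7.19, 7.30, 7.24, 7.22, 7.27, 7.20, 7.26, 7.23]
n = len(values)

sample_sd = statistics.stdev(values)        # divides by N - 1
population_sd = statistics.pstdev(values)   # divides by N

# Two equivalent ways to get the standard error of the mean.
sem_from_sample_sd = sample_sd / math.sqrt(n)
sem_from_population_sd = population_sd / math.sqrt(n - 1)

print(f"SD  = {sample_sd:.4f}")
print(f"SEM = {sem_from_sample_sd:.4f}")
```

    The SD describes the spread of the individual measurements. The smaller figure describes only how well the mean of replicate measurements of one unchanging sample is pinned down, which is the restriction the comment insists on.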

  53. Nice post Kip.
    Naturally assuming precision greater than the noise is a pathway to error.
    Leads to proclaiming a signal smaller than that noise to be significant.
    Apparently the next statistical skill is to pronounce the average of this imaginary signal to significant digits well beyond physical meaning.
    If you take these statisticians’ arguments seriously, then no measurements would lead to absolute accuracy.
    The willful blindness of some of the critics is astounding.
    But that is Climate Science, an ideology wearing the cloak of science.

    WUWT’s past post on methodology of measurement needs reposted.
    I think something was lost from general reasoning when we transitioned from slide rules to calculators.

    Some of the arguments against your simple statement are amazing, brings to mind Douglas Adams and “Six Impossible Things Before Breakfast.”

  54. Measurement instruments must be accurate.

    They must be checked at least every year to verify accuracy.

    There must be enough measurements, well distributed and sited.

    The people collecting and compiling the data have to be trustworthy.

    The people compiling the global average have to be trustworthy.

    A global average must be a useful statistic that represents the climate.

    There are no real time temperature data for 99.999% of earth’s history.

    There is little data for the Southern Hemisphere before 1940.

    No one knows what average temperature is “normal”, or if there is a “normal”.

    No one can identify any problems specifically caused by the average temperature rising +1 degree C. since 1880 … +/- 1 degree C., in my opinion

    No one knows if the 1 degree C. range of average temperature since 1880 is unusually large, or unusually small.

    The data infilling and “adjustments” are done by people who expect to see global warming and have predicted global warming in the past — are they trustworthy?

    I consider myself to be a logical person — after 20 years of reading about climate change, I believe the +1 degree rise of the average temperature, mainly at night, and adding CO2 to the air, ARE BOTH GOOD NEWS.

    So, long before we get to debating math and statistics, why not debate whether CO2 controls the climate (no evidence of that, IMHO), and whether adding CO2 to the air is beneficial because it greens the earth (I think so).

    Climate blog for non-scientists:
    http://www.elOnionBloggle.Blogspot.com

    • Very good Richard, and I 100% agree that BEFORE anything else it would be, to say the very least, useful to answer the CO2 question. This would however rather reduce this blog to a one horse race and Anthony won’t be going down that route. It does rear its head from time to time but that particular elephant only pops its head round the door or sits in the corner of the room farting. We all know he is there and some of us comment on the smell. Usually ctm manages to lead it out.

      So if you want to start a debate about “Why it’s not Carbon Dioxide” you will have to do it elsewhere. You know all the places (or should do) and I would be happy to join you for my $0.02 worth as a few others might BUT it’s generally a mighty quiet corner with very few active participants compared to WUWT.

    • RG,
      You offered the advice, “They must be checked at least every year to verify accuracy.” That depends on the application and the consequences of being out of calibration. If you have a device that goes out of calibration the first month after installation, and it isn’t checked again until after a year’s production, you might have to throw everything away and declare bankruptcy. Or if you are monitoring toxins that accumulate in the body, a year might be far too long. That advice should be qualified by what it is that is being measured or monitored. A year might even be too long for use in an airport.

  55. Thank you for the very nice detailed analysis of measurements. I like it. I confess I did not delve too deeply into the weeds of the article, but I believe I got the gist of it.

    I would make a similar assertion with respect to long-term temperature estimates. For example, consider the oft-cited spaghetti chart of time-temperature series of over 100 GCM runs with some kind of an “average line” snaking through them. The chart does not indicate a best estimate with a range of uncertainty. The chart, at best, might define an infinite series of rectangular probability distributions along the time axis with ranges equal to the extreme high and low temperature estimates for all times. All temperature values within the range for any time, t, have the same probability of occurring. The analysis represents one bar of an incomplete probability bar graph. The bar’s fraction of the total probability of 1.0 is unknown.

    This is a quick reaction. I may have gone into the abyss on this one. Any comments?

    • I’ll pull you up to the edge of the abyss, the rest is up to you.

      My wife and all the other members of the coven have had a bake-in for charity. They have made a veritable ensemble of cakes of numerous varieties and flavours. I want a HOT sausage roll.

  56. Kip,

    Fine article. You address a crucial point in the climate debate: How accurate are the measurements and, therefore, how accurate can predictions (guesses) be?

    You hit on a question I’ve puzzled about often: “When temperature is measured at 11:00 and at 11:01, one is measuring two different quantities; the measurements are independent of one another.”

    Are the observations of a time series like the temperature of a particular weather station independent? On the face of it, I’d say no. They’re highly correlated. If the temperature at 10:00 AM is 20 degrees, it’s unlikely to be -40 at 10:01 and even less likely to be -40 at 10:00:01 and less so at 10:00:00.01 etc.

    How should this autocorrelation be properly handled?

    Pat

    • Pat ==> If one is doing something reasonable, like deciding if it is too early in the year to plant potatoes, one needn’t worry about it. The same is true if you are deciding whether to take a sweater on your hour-long walk at 5 in the evening. We know that the temperature “moves” (changes) from one temperature to another, moving through all the intervening infinitesimal steps. Temperatures are certainly auto-correlated on the basis of a day and seasons.
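    One common way to handle the autocorrelation Pat asks about is to discount the number of readings to an “effective” count before quoting a standard error. The Python sketch below uses a widely used approximation that assumes the readings follow a first-order autoregressive, AR(1), process. That assumption, the function name and the example values are illustrative and are not taken from the thread.

```python
def effective_sample_size(n, rho):
    """Approximate number of independent values among n readings whose
    errors follow an AR(1) process with lag-1 autocorrelation rho."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must be strictly between -1 and 1")
    return n * (1.0 - rho) / (1.0 + rho)


n_readings = 1440  # one reading per minute for a day
for rho in (0.0, 0.5, 0.9, 0.99):
    n_eff = effective_sample_size(n_readings, rho)
    print(f"rho = {rho:4.2f}: about {n_eff:7.1f} independent readings "
          f"out of {n_readings}")
```

    With strongly correlated minute-by-minute readings, the effective count can be a small fraction of the nominal count, so a standard error computed from the nominal count would be too optimistic.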

  57. The bottom line in this article, and others that attempt to explain measurement, especially of temperature and water levels, is how temperature and water levels are finally reported to the world in general. I doubt there is a single environmental or science reporter who could understand any of the issues associated with measurement precision and accuracy or means/averages. I asked someone just today if they believed in CAGW and if so why. They said because “scientists” were telling us. I then asked why do you trust what scientists tell you, after all they are only human. The response was “Ah, gee, good point.” We then discussed briefly measuring things, means, accuracy and precision. Ask the next reporter you talk to to explain anomalies.

  58. Now, a real world problem.

    What is the “proper” way to get an “average” hourly weather (2 meter temperature, wind speed, wet bulb temperature (thus relative humidity) and pressure) for each day of the year at sea level, for 83 north latitude?

    I’ve got 4 years of measurements for each hour at Kap Jessup Greenland. What I need is the average “weather” for each hour of day of the year to determine (approximate really) the heat loss from the ocean at that assumed air temperature, humidity, and pressure.

    Thus, the “average” 12:00 weather over the 4 years could simply be “Average the 2010, 2011, 2012, 2013 12:00 readings for each day-of-year.” Plot all average temperatures, develop an equation (or set of equations) that curve-fit the daily cycle from the 5:00 AM low to the 2:00 PM high and the yearly cycle from a winter’s low to the mid-summer high. Trust the curve-fitting and the 4 data points for each hour to smooth out storms and clear periods.

    But, is that a valid, adequately correct “average 12:00 temperature, pressure, humidity, wind speed” for 12 August? 12 Feb? 12 Dec?

    “Weather” for 12:00 o’clock (on average for a yearly model) you expect to change slowly over the year’s time, but very rapidly over a 3-4 day period as storms go through. Should the storms and clear periods be “averaged through” as if they did not exist? What if May 12 had storms (high winds, high humidity, near-zero sunlight) 3 years of the 4?
    “Weather” data for 12:00 should be close to that of 11:00 and 13:00. Should data for those hours be used to smooth the 12:00 information, or does that confuse it?
    If one assumes 12 Aug 12:00 is the average of 4 yearly 12 August 12:00 records, should 11 Aug 12:00 and 13 Aug 12:00 data be included in the average of 12 Aug to “get more data points”? After all, the expected daily change from noon on 11 Aug to noon on 13 Aug should be very small compared to the difference between 12 Jan and 12 Aug?
    Should that be expanded to successive noon readings for the 4 years of noon records for 10 Aug, 11 Aug, 12 Aug, 13 Aug, and 14 Aug?
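
    The choices in the questions above (same day only, or neighbouring days pooled in) can be written as one function with an adjustable window. This is only a minimal sketch: the function name `windowed_hourly_mean`, the `half_width` parameter, and the `(year, day_of_year, hour)` dictionary layout are all assumptions made for illustration, not anything from the actual Kap Jessup data set.

```python
from statistics import mean


def windowed_hourly_mean(readings, day_of_year, hour, half_width=0, days_in_year=365):
    """Pool all years' readings for `hour` on days within +/- half_width of day_of_year.

    `readings` is a hypothetical dict keyed by (year, day_of_year, hour).
    half_width=0 reproduces the plain "average the four yearly readings" case.
    Returns (mean value, number of readings used).
    """
    # Days to include, wrapping around the end of the year (leap days ignored).
    wanted_days = {
        (day_of_year - 1 + offset) % days_in_year + 1
        for offset in range(-half_width, half_width + 1)
    }
    values = [
        value
        for (year, doy, hr), value in readings.items()
        if hr == hour and doy in wanted_days
    ]
    if not values:
        raise ValueError("no readings for that day and hour")
    return mean(values), len(values)
```

    With four years of data, `half_width=0` uses 4 readings per hour, `half_width=1` uses 12, and `half_width=2` uses 20. The wider window buys more data points at the cost of blending in a little of the seasonal change, and it does nothing to settle whether storms should be averaged through in the first place.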

    • “heat loss from the ocean at that assumed air temperature, humidity, and pressure. ”

      and wind speed ??

      IIRC evaporation is proportional to square of wind speed. S-B is T^4 . Any non linearity will mean that using averages is not correct. Whatever formula you come up with you should apply it directly and average ( or sum ) the resulting heat losses.
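
      A small numerical illustration of why the order of operations matters for a non-linear formula, using only the T^4 dependence mentioned above. The four temperatures are made-up values, and a bare black-body term stands in for whatever full heat-loss formula is eventually used.

```python
from statistics import mean

SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant, W m^-2 K^-4


def blackbody_flux(temp_kelvin):
    """Radiated flux for an ideal black body at the given temperature."""
    return SIGMA * temp_kelvin ** 4


# Hypothetical hourly readings in kelvin (illustrative values only).
temps_k = [243.15, 253.15, 263.15, 273.15]

# Option A: average the temperatures first, then apply the formula once.
flux_of_mean = blackbody_flux(mean(temps_k))

# Option B: apply the formula to each reading, then average the results.
mean_of_flux = mean(blackbody_flux(t) for t in temps_k)

print("flux of mean temperature:", flux_of_mean)
print("mean of individual fluxes:", mean_of_flux)
print("difference (B - A):", mean_of_flux - flux_of_mean)
```

      Because T^4 curves upward, option B always comes out at least as large as option A, and the gap grows with the spread of the readings. Option B is the one recommended in the comment above.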

      • Well, the idea of calculating each hour’s data independently has merit.
        Each of the four losses varies differently: convection losses and evaporation losses are directly proportional to ocean surface temperature, air temperature, relative humidity and wind speed squared. (Evaporation losses are zero if the surface is ice-covered.) Long wave radiation losses (from open ocean or from ice-covered surfaces) vary by ocean surface temperature^4, air temperature^4, ice surface temperature^4 and relative humidity. Ice surface temperature adds in conduction losses from below, proportional to ice thickness.

      • Evaporation as a function of wind speed is a sublinear function, at least once the wind speed is great enough for turbulence to develop or if convection is occurring. Expect something similar to the difference between temperature and windchill as a function of wind speed.

  59. This is a very interesting discussion. It is like trying to assess the accuracy of the statement that the average American family has 2.5 children. While mathematically it may be accurate, it still represents an impossible representation of any actual family.

    • For the purpose of adding a touch of humor to this discussion Hoyt……..there is a small time window, when a mother in labor is delivering her 3rd child, that an actual family can have 2.5 children. It’s a very fleeting thing, but entirely possible.

      • Mark S Johnson,
        Until the child is completely out of the birth canal (born), and has taken its first breath, it is not actually counted as a living child. Indeed, for purposes of expected longevity from birth, unless the baby survives the first year, it is not included in actuarial tables.

    • Hoyt,
      So, probably what should be stated is that the average American family typically has two or three children. Short of a Solomon-like decision to cut one baby in half, humans come in whole numbers and it makes more sense to describe an average family in the units they come in.

      • Ah, but what if the reality is that families typically have either one child, or a larger family of four children? This is where averaging fails us. As I posted earlier, meaning is in the details. Averaging obscures details.

  60. “if the coarse measurement is sufficiently fine (i.e., 10x to 30x smaller that the true variation in the quantity being measured)”

    And there’s the rub! We are asked to accept fractions of a degree Celsius per decade as measurable fact. Would anybody actually claim that the individual “coarse” measurements are better resolved than 0.1°C over decades at any one spot? They are good enough to approximate the absolute temperature (which swings by 15-20 deg. or more daily in most places, thus 10+ times the accuracy of the measurement), but hardly for detecting minuscule long-time drifts.

  61. I once witnessed a group of ME PhDs doing vibration analysis on an exotic machine. They were taking data samples that were in the 10’s of minutes duration with A/Ds running at 4-8 kHz. These were 14 bit PXI modules I think. They seriously flubbed up the grounding and had to deal with some line noise as well but they learned to ignore the 60 Hz and harmonics lines. One day one of them was pondering a sharp spike at 1 Hz, and some lesser ones at 3, 5, and 7. The spike was several bits below the resolution of the A/D, low microvolts. I personally was surprised that you could even see anything but noise at that level but it was a clear spike. When I offered that the source of the anomaly was the blinking light on the front panel of the signal conditioner, and offered an argument for the plausibility of my theory, I was laughed out of the room. If they figured it out I never knew about it.

    Moral, it’s hard to argue an expert out of a position if he thinks he is more expert than you. Can be wrong on both counts but no matter.

    • Phil ==> (all of you) — In truth, that’s why it is simply better to use full real names — there are other “Kip Hansen”s in the world — at least one in the Hollywood movie business — but only one that writes and comments here and in the NY Times.

  62. The way stats work, I’ve been told, is that averaging the quantum doesn’t change the accuracy, but doing the same for the anomaly does.

    The problem I see is the item to be measured changes. So there is no fixed item being changed. Which means, I think, the error bar doesn’t change even for anomalies.

    If an item is fixed but the measurement fluctuates around the correct number in a random way, I see repeated measurements averaging to the “real” value. But if each surface or SST spot and time is different, why would the anomalies average to a more accurate value?

    The assumption for a global value has to be that WITHOUT external, i.e. CO2 forcing, the temperatures averaged over the planet would be unchanging. If there were even a cyclic variation of years in length, this assumption would be invalid. Which would mean variations in the global temperature would have larger uncertainties than presented.

    I would like to see a discussion of the statistical probability of the actual temperature history within the error bars. We see the center line but could the wander be reasonably ANYTHING within a 1.0 r value?

    Every interpretation of our global temperature changes relies on an expectation of thermal stability without the influence of our forcings of interest. It’s good to say, because of a-CO2 it is X. But if Gore and Mann were to show a “non-CO2” history, I’ll bet we’d be skeptical. There would be too much stability these days shown for the average citizen to believe.

    • Douglas ==> This:
      “The problem I see is the item to be measured changes. So there is no fixed item being changed. Which means, I think, the error bar doesn’t change even for anomalies.”
      is exactly our point of agreement.

  63. Bartemis October 15, 2017 at 11:43 am
    “Nick is correct. It is well established statistical theory that averaging of quantized signals can improve accuracy.”
    …………………………
    Bartemis, the correct sentence is that ” … that averaging of quantized signals can improve PRECISION.”
    You, Nick and others should not be propagating this fallacy when all it shows is that there is a problem of which you are a part. It is time for you to learn.

    Example 1. The Hubble Telescope went into orbit with an error. Wrong mathematics gave a mirror with wrong curvature. Operators could take as many repeat measurements of an object as they wanted, with NO IMPROVEMENT to accuracy. To correct the accuracy, another optical element was sent to space and fitted.

    Example 2. The measurement of radiation balance at the top of the atmosphere has been performed by several satellites. See the problem –

    Simple eyeballing shows that there are accuracy problems from one satellite to another. The precision seems quite high. Over a short time, if the signal does not vary, there appears to be a small amount of noise and repeated sampling along the X-axis is doing just what I claim in the correct sentence above. Precision is being improved by repeated sampling of data from any one satellite. Accuracy is untouched by repeated sampling.

    These examples are not hard to digest. Why, then, is there such a problem for stats fanciers to get their brains around the analogs to these examples when dealing with Kip’s examples of ground thermometry and sea level measurement? I have given vivid pictures to help comprehension and I now repeat what I wrote above, for emphasis. There is NO WAY I can be shown to be wrong, but contaminated minds will possibly try.

    Repeated quote: “It will be a happy future day when climate authors routinely quote a metrology measurement authority like BIPM (Bureau of Weights and Measures, Paris) in their lists of authors. Then a lot of crap that now masquerades as science would be rejected before publication and save us all a lot of time wading through sub-standard literature to see if any good material is there.”
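
    The distinction drawn in the two examples above can be sketched numerically. This is a toy model only, assuming a reading equals the true value plus a fixed instrument offset plus zero-mean Gaussian noise. The numbers and the helper name `simulate_readings` are invented for illustration and describe no real instrument.

```python
import random
from statistics import mean, stdev


def simulate_readings(true_value, bias, noise_sd, n, rng):
    """n readings = true value + fixed offset (bias) + zero-mean Gaussian noise."""
    return [true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]


rng = random.Random(0)   # fixed seed so the run is repeatable
TRUE_VALUE = 20.0        # hypothetical true quantity
BIAS = 0.5               # hypothetical uncorrected calibration offset
NOISE_SD = 0.2           # hypothetical random scatter of a single reading

for n in (10, 100, 10000):
    readings = simulate_readings(TRUE_VALUE, BIAS, NOISE_SD, n, rng)
    avg = mean(readings)
    sem = stdev(readings) / n ** 0.5  # standard error of the mean
    print(f"n={n:6d}  error of mean={avg - TRUE_VALUE:+.4f}  standard error={sem:.4f}")
```

    In this model the standard error shrinks as more readings are averaged, while the error of the mean settles near the offset rather than near zero. Repeated sampling tightens the scatter but leaves the offset untouched.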

    • Geoff – The Hubble inaccuracy could not be corrected because correcting it required knowledge that was unavailable, i.e., a full mathematical description of the optical aberrations.

      For this: “Simple eyeballing shows that there are accuracy problems from one satellite to another.”

      It shows there are biases in each measurement set. Recall the assumptions of the model:

      1) the data are homogeneous
      2) the measurements are unbiased
      3) the underlying signal is traversing quantization levels rapidly and independently

      Bias runs afoul of assumption #2. If different types of instruments were used, that runs afoul of assumption #1.

      Under the assumptions that I outlined and repeated above, the nature of quantization error is known, and accuracy can be improved. It is the assumptions that are the problem in the application at hand, not the process.

      Here is a practical example. I have a constant quantity I want to estimate based on measurements. Let’s make it

      K = 100.3

      I am measuring this signal with a measurement that is polluted by a sinusoidal signal of amplitude 20, and a period of 20 samples, and then quantized to the nearest whole number.

      The sinusoid will ensure that quantization levels are traversed rapidly, so that the error model holds reasonably well. I will average measurements 20 samples at a time to ensure that the sinusoidal signal is suppressed.

      In MATLAB, I will construct my data set as follows:

      x=round(100.3 + 20*sin((pi/10)*(ones(1000,1)*(1:20)+2*pi*rand(1000,1)*ones(1,20))));

      The random phase ensures I have a different sample set for each row-wise trial.

      I take the mean over the rows of this matrix:

      y=mean(x,2);

      The mean of y should be close to 100.3:

      mean(y) = 100.2817

      The estimated standard deviation is

      std(y) = 0.0698

      I expect the estimated standard deviation to be near the expected value

      1/sqrt(20*12) = 0.0645

      and, it is.
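
      For readers without MATLAB, the same experiment can be sketched in plain Python. This is a port of the one-liner above, not a new method. The seed, the function name `quantized_trial`, and the use of `floor(x + 0.5)` to mimic MATLAB's `round` on these positive values are choices made here, so the printed figures will differ slightly from the ones quoted.

```python
import math
import random
from statistics import mean, stdev


def quantized_trial(rng, k=100.3, amplitude=20.0, samples=20):
    """One row of the MATLAB matrix: 20 quantized samples, averaged."""
    # One random offset per trial, scaled exactly as in the MATLAB expression.
    offset = 2.0 * math.pi * rng.random()
    readings = [
        # floor(x + 0.5) rounds to the nearest whole number for positive x.
        math.floor(k + amplitude * math.sin((math.pi / 10.0) * (n + offset)) + 0.5)
        for n in range(1, samples + 1)
    ]
    return sum(readings) / samples


rng = random.Random(2017)  # arbitrary seed, for repeatability only
trial_means = [quantized_trial(rng) for _ in range(1000)]

print("mean of trial means:", mean(trial_means))
print("std of trial means: ", stdev(trial_means))
print("1/sqrt(20*12):      ", 1.0 / math.sqrt(20 * 12))
```

      Note what the sketch does and does not contain: the only errors are the deliberate sinusoid and the rounding step. There is no calibration offset in it, so it speaks to quantization error under the three stated assumptions and not to instrument bias.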

      • What is the uncertainty range of each measurement? Do you get measurements ranging from say 101 to 99. That is the point we are making. How do you tell which ones are correct? Your little experiment doesn’t seem to have any uncertainty built into it.

      • Bartemis,

        You claimed, “The Hubble inaccuracy could not be corrected because correcting it required knowledge that was unavailable, i.e., a full mathematical description of the optical aberrations.”

        Perhaps you could explain how the corrective mirror was manufactured if they didn’t have “…a full mathematical description of the optical aberrations.”

      • ‘Perhaps you could explain how the corrective mirror was manufactured if they didn’t have “…a full mathematical description of the optical aberrations.”’

        Good point. It appears I was hasty in my dismissal. Correcting the images via deconvolution with the optical response might have been possible. Indeed, googling “correcting hubble images via deconvolution” gives a plethora of references, and it apparently was done with some success.

        A workshop paper from 1990 concluded:

        The fundamental loss of HST imaging science as a result from spherical aberration is not a loss of resolution; rather, it is a loss in the ability to detect faint objects, especially in crowded fields.

        So, the COSTAR was not so much an issue of correcting the images as it was of enhancing the SNR. Keep in mind, the servicing mission was in 1993, which was 24 years ago. There have been significant advances in algorithms and computing power since then. Perhaps, if it happened today, it could all be fixed with software.

    • Bart, there was never any “inaccuracy” in the Hubble telescope. It was designed to be “nearsighted” from the start. The first few years of operation of the Hubble was not to peer out into outer space, but rather to be used as the best spy satellite ever put into orbit. After the “spooks” had their fill of using it, they initiated the designed “repair” with the Shuttle to restore the device to be a true astronomical telescope.

  64. Kip,
    You, as many contributors to WUWT, have once again provided me with real world information to present to my classroom full of young minds. This information about accuracy and precision plays right into a topic I spend at least 5 weeks on in high school junior level Chemistry. While my students won’t get to see much in the way of biochemistry or nuclear chemistry, they will be able to calculate moles/atoms/grams, convert anything into a corresponding unit, and comprehend the periodic table and periodicity.

    You are absolutely correct that this topic is woefully underrepresented in STEM courses, because it does not appear to be fully understood by upper elementary and jr. high math teachers, and isn’t taught with any rigor if at all. High school STEM teachers assume our students have been taught this subject so it is glossed over. I made that mistake my first semester, but will not do it again. I tell my students that we will not move on until they have mastered the various sub-sets of scientific measurements and calculations. In order to motivate the learning, they are shown how the grade system will be enforced, and how consequential it will be to their final grade of the semester.

    We toured a petrochemical refinery two weeks ago. As the HSE Super was driving my group around the refinery and talking about what is done here and there I pointed out various capacity placards and asked my students if they could do the calculations if they knew the dimensions of the vessels and piping and could they then calculate the weight of the contents if they knew the density of the liquids, etc. It nearly brought tears to my eyes as I watched the connections to the classroom instruction click into place and my students were going, “Yeah! Mr. Dees, we could do that using…., then … , and the sig figs should be about….”

    This is yet another article from WUWT that I’ll be shamelessly borrowing to drive home another topic in science to encourage critical thinking skills.

    Though I am usually in agreement with the articles presented here, I do carefully select those that are straight up science and evidence based. I will not use articles which contain excessive bias or opinionated editorial in my classroom. As much as I’d like to shape the opinions of my students to match my own, that isn’t my job. My job is to build critical thinking skills using the scientific method and grow an understanding of chemistry, physics, and environmental science.

    Thank you Anthony Watts and Kip Hansen.

    PRDJ

    • PRDJ ==> Very high praise indeed. I shamelessly admit to having planned to be a HS Science teacher myself when I was a youth. Maybe this is my second chance….in a way.

      • I’m not exactly a youth myself. I’ve entered the classroom after spending two decades in water treatment for drinking, wastewater, and power plants (pre-treatment, water/steam, and discharges). I reckoned it was time to pass on the wisdom and try to effect a change in the classroom curricula to better reflect what these kids need in the “real world”.

        The first discussion of this paper went fairly well. Two more class periods today and then finish up tomorrow. I’ll let you know how it goes.

      • Here’s the report, as succinct as possible.

        The students with strong reading comprehension and vocabulary understood the concepts and were able to report out in a cohesive and comprehensive manner.

        Those that struggle with reading comprehension and have weak vocabulary… well, they struggled.

        Ultimately, I asked them to agree or disagree with the message. Their choice did not determine their grade. What determines their final assignment grade is how strongly they defended their position. I directed them to put in the title and your name to access a digital copy of the paper then access and peruse the embedded links to assist in defending their chosen position.

        While it may sound like a wishy-washy assignment, I find these students to be ingrained in rote regurgitation of information and weak in critical thinking skills. Therefore, I have assignments such as this after building their knowledge base about specific topics.

        I’m still grading their papers, but wanted to report out as promised.

        Thanks again, Kip, for your contributions here.

      • PRDJ ==> Thanks for the update, very interesting. I am a huge fan of teaching critical thinking skills over the memorizing of facts — and a huge fan of practicing mental math skills (quick down-and-dirty estimation of numerical data).

  65. If an item is fixed but the measurement fluctuates around the correct number in a random way, I see repeated measurements averaging to the “real” value. But if each surface or SST spot and time is different, why would the anomalies average to a more accurate value?

    From time to time, I have made this point on Dr Spencer’s blog. I consider that people fail to appreciate the significance of the fact that at no time during the time series thermometer reconstruction, are we repeatedly measuring the same data, or even from the same site, or even using the same equipment, or even using the same practice and procedure of measurement…

    The stations that composed the sample set in 1880 are not the very same stations as composed the sample set in 1900, which in turn are not the very same stations as composed the sample set in 1920, which in turn are not the very same stations as composed the sample set in 1940, which in turn are not the very same stations as composed the sample set in 1960, and so on and so forth.

  66. To quote from the “Scientist and Engineer’s Guide to Digital Signal Processing”, by S.W. Smith:

    “Accuracy is the difference between the true value and the mean of the underlying process that generates the data. Precision is the spread of the values, specified by the standard deviation, the signal-to-noise ratio or the CV.”

    If the ‘underlying process that generates the data’ is random error (i.e. the uncertainty is aleatoric), then, in accordance with the Central Limit Theorem, the spread of values of an increasing number of measurements will decrease, resulting in increased precision. Whether or not this also results in an increase in accuracy, however, is a matter of calibration (only for a perfectly calibrated instrument will the mean value of repeated measurements tend to the true value and, hence, precision equate to accuracy). The problem with calibration uncertainty is that it is epistemic rather than aleatoric. Therefore, no amount of repeated measurement will shed any light on its magnitude. Consequently, the extent to which a given level of precision can be taken as a measure of accuracy remains an open question. Certainly, if the level of uncertainty associated with calibration error exceeds the uncertainty associated with imprecision then you may be in deep doo-doo. In view of the controversy surrounding the calibration of the temperature and sea-level measurements, I would have thought the levels of precision being quoted were highly dubious.

    • John ==> I think that’s what I said….:-)

      The StatsFolk (and the signal processing folks) are entitled to all the precision they can squeeze out of long divisions to many decimal places, averaging their heads off to their hearts content.

      But if the original measurements are only vaguely accurate, then the means, though precise, are also vaguely accurate.

    • I’ve just returned to my comment and realise I had made a simple gaffe. I had meant to say that the spread of values of the mean will decrease, i.e. the standard error of the mean decreases. I also misrepresented the relevance of the central limit theorem. However, my main point remains. One can expect random errors to cancel out but systematic errors will not. Uncertainty is often underestimated because the epistemic component is overlooked. I am not sure whether I am agreeing with Kip because I am unsure what is meant by the phrase ‘vaguely accurate’.

  67. Fascinating conversation. It seems like both “sides” of this conversation have excellent points to make, and although I am not a statistics expert, I would like to make an analogy to something that I do somewhat understand and ask wiser heads than mine if it applies to this conversation as well…….

    The world of digital video is filled with Analog-to-Digital converters to capture image “data” and then Digital-to-Analog converters that do the reverse, so that we can see the original source with as much fidelity as possible.

    The quality of the “data” – and the ultimate reproduction of it – depend greatly on the frequency of the samples being taken and precision of the sample being generated. There is also a spatial component in this process, with the largest number of “pixels” (or sample locations) producing the best data and ultimately the best reproduction.

    It seems to me like this is almost exactly the same thing as trying to store and then “recreate” climate data so it can be used to “estimate” things like the average global atmospheric temperatures or average global sea height.

    In the real world, these three variables together can generate many objectionable artifacts if the sampling method is not matched to the data trying to be captured and reproduced.

    To state the obvious, 12-bit video sampling yields data that is far more “accurate” than 8-bit samples.
    And 240 Hertz sampling is more accurate on fast moving objects than 60 Hertz sampling.
    And 4K spatial sampling reveals more detail than a 640×480 sample on a complex image.

    The analogies to this topic seem straightforward.

    The number of bits used to sample video is equivalent to whether you are using whole numbers (or not) to measure temperature. The frame rate in video is equivalent to how often you measure the temperature (by the second, minute, hour, day, week, etc.). The spatial distribution (what video calls resolution) is equivalent to how many different locations you sample the temperature at.

    My experience has been that it is absolutely possible to “improve” the accuracy of the reproduction by “averaging” many samples over time. After all, this is how digital time exposures work. Therefore, it should be possible (?) to more accurately observe climate trends than the accuracy of the initial measurements would imply.

    The devil is in the details, of course, and some algorithms used to “improve” picture quality can easily create objectionable artifacts in the picture that are not really there.

    Thoughts?

    • Mike Fayette ==> Appreciate your input. The trouble with analogies…..is that they are analogies.

      In A-to-D video conversion, you are capturing one “screenful” of image data, one bit (loosely) or pixel at a time, trying to find the best estimate for that pixel from the still analog image, selecting the stills at some frequency (more being better). You do this pixel by pixel at your chosen resolution by some sort of averaging technique. Once you’ve got all the pixels for a screenful, you can save it or throw it at a video screen, and we have a still image. Enough still images delivered quickly enough gives us a motion picture.

      I hope I have that right — been a while since I digitized live video coverage at the Masters Golf Tournament (Tiger Woods’ first year there) for internet broadcast (local only, for Lou Gerstner, IBM CEO). [ Gerstner’s comment upon seeing my demo of the tech? “That’s going to cost me a lot of money.”]

      Trying to calculate global mean sea level is more like trying to capture an accurate screenful of data with ONE average value — and results in “lite gray” very precisely, but not a very accurate representation of the still image, no less a motion picture.

    • Here is the difference. Suppose your sampling device delivered ‘color’ from green to yellow and you don’t know which is accurate at each sample. This is what measurement uncertainty means. You can average the values but are you sure you’re getting the right color when you do this?

    • “Thoughts?”

      Yes, it’s a similar problem, but much, much more difficult.

      In the case you describe, there’s a right answer, but with analogues (proxies) like temperature and sea level, there’s no reference. No “right” answer to compare with. There are calibrated instruments that measure these things but none of them purport to measure global values. As a result, there’s nothing to compare them with.

      The basic ideas behind global average temperature or global sea level, depend on the belief there is such a thing, and that remains to be seen. What does “global temperature” mean? How should it be measured? By averaging local temps? Willfully combining apples and oranges? Many think not. From a measurement theory perspective, it just doesn’t make a lot of sense, but it’s the best we can do with the tools we have.

      I think it’s more important to question precision, as Kip has done. We have no way to measure “global temperature” or “global sea surface” with the precision that’s being presented to us. It simply can’t be done. The underlying data prevent it.

  68. Kip, it would help a lot if you stopped using the word uncertainty to mean accuracy; they’re different. I thought you understood that and that it was the point of your essay?

    Accuracy follows individual measurements, the mean of a repeated measure is only as accurate as the individual measure. Accuracy isn’t improved by arithmetic operations, but precision may be, assuming the measurement error is normally distributed.

    No matter how many folks like Nick confuse the correct treatment of accuracy and precision those facts don’t change. You can try to improve that situation by encouraging use of the right words.

    • Bartleby ==> In a perfect world, and in a perfect language, each and every word would have only one very accurate and precise meaning. You are right, of course, that not everyone means the same thing when they use the word uncertainty.

      However, even in the rarefied field of science, there are multiple meanings for individual words.

      It is quite right to speak of “measurement uncertainty” — under these definitions:

      “Error is the difference between the measured value and the ‘true value’ of the thing being measured.
      Uncertainty is a quantification of the doubt about the measurement result.”

      – source: Measurement Good Practice Guide No. 11, A Beginner’s Guide to Uncertainty of Measurement. Tech. rep., National Physical Laboratory, 1999.

      I have tried to point out that temperature measurements expressed as an integer only are in reality a RANGE, and thus “we don’t know” the value that existed, may have been measured, but was not recorded to the right of the decimal point.

      From the same document above:
      “two numbers are really needed in order to quantify an uncertainty. One is the width of the margin, or interval. The other is a confidence level, and states how sure we are that the ‘true value’ is within that margin.”

      For temperatures as recorded modernly, we have an INTERVAL of 1 degree (in Fahrenheit, about 0.56 degrees in Celsius) to very near a 100% certainty (ignoring all the variables in instrument error). The INTERVAL is the RANGE, and represents the uncertainty, because we don’t know exactly where in the interval the actual measurement lies.

      For Tide Gauges, the INTERVAL results from the instrument itself — as explained in the essay, with a bit of the confidence removed by the method of averaging 181 1-second values, tossing out 3-sigma outliers, and re-calculating a mean.
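
      The averaging step just described (181 one-second values, drop 3-sigma outliers, recompute the mean) can be sketched as follows. This is a reading of that one-sentence description only: the function name `trimmed_mean_3sigma`, the single rejection pass, and the use of the sample standard deviation are all assumptions, and the actual gauge firmware may differ on each point.

```python
from statistics import mean, stdev


def trimmed_mean_3sigma(samples, n_sigma=3.0):
    """Mean of `samples` after one pass of n-sigma outlier rejection.

    Returns (recomputed mean, number of samples rejected).
    """
    first_mean = mean(samples)
    spread = stdev(samples)  # sample standard deviation (an assumption)
    kept = [s for s in samples if abs(s - first_mean) <= n_sigma * spread]
    return mean(kept), len(samples) - len(kept)


# Hypothetical 181 one-second water levels in metres, with one wild value.
levels = [1.000 + 0.001 * (i % 5) for i in range(180)] + [5.0]
level, rejected = trimmed_mean_3sigma(levels)
print("reported level:", level, "rejected samples:", rejected)
```

      The step removes wild readings and tightens the reported value. It has no way of knowing, or correcting, where within the instrument's stated range each individual one-second value actually fell.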

      • Kip –

        OK, if you’d like to distinctly use the words “Uncertainty” and “Confidence” to describe the two, rather than “Accuracy” and “Precision” I suppose I can support it. I’d encourage you to print a small thesaurus making the choice clear. I disagree that’s a good choice since throughout statistics texts you’ll find the terms “precision”, “uncertainty” and “confidence” freely interchanged, while “accuracy” means only one thing.

        I use the more common terms “accuracy” and “precision” or “confidence”. It means that I consider “Accuracy” the best term to use to describe the known error and either “precision” or “confidence” to describe the error of estimate, the observational error using a calibrated instrument.

        We can know (ironically through repeated measure) the accuracy of an instrument, such as a tide gauge or a thermometer. We assume that accuracy is fixed and that if the instrument is used precisely, the observations made using it will always fall within the range described, usually with a 68% probability (1 sigma).

        Further, we assume there will be error introduced in any single observation using that instrument. In most cases, the observational error is assumed infinite for any single measure and is reduced as a function of the number of measures made. So, while “accuracy” of the instrument is never improved by multiple observation, the “precision” of the measure is increased with the count of observations. This assumption is based on the idea that observation error is normally distributed and multiple observations, when averaged, will tend towards the true value or “mean” (average).

        As long as you make a very large point of distinguishing “accuracy” from “precision” (or in your case “confidence”) in some way there’s no problem. But make sure you communicate the difference to your readers and also make all efforts to be consistent in your usage?

      • Bart ==> Actually, I chose the words in the essay — and used far too many examples already. Those who read the essay should find it easy to understand as long as they are not blindered by too much brainwashing with “only one definition allowed”.

        My editor already eliminated about half the redundant use of verbiage originally included to make sure everyone understood the concepts I was trying to get across, even if they may have been indoctrinated with a contrary idea.

        If you’ve read all 389 previous comments above — you would find that further discussion of the same point(s) is not going to add to the conversation.

        (Sorry — just tired of the repeated insistence that I use one particular set of nomenclature from one narrow field.)

      • Bartleby,
        For me, the word “precision” implies the number of significant figures that can be assigned to a measurement. Whereas, “uncertainty” implies a range such as +/- 1 SD. “Confidence,” to me, brings in a subjective probability assessment that tells us something is highly probable, highly improbable, or somewhere in between.

    • There exists one international standard for expression of uncertainty in measurement:

      “The following seven organizations supported the development of the Guide to expression of uncertainty, which is published in their name:
      BIPM: Bureau International des Poids et Mesures
      IEC: International Electrotechnical Commission
      IFCC: International Federation of Clinical Chemistry
      ISO: International Organization for Standardization
      IUPAC: International Union of Pure and Applied Chemistry
      IUPAP: International Union of Pure and Applied Physics
      OIML: International Organization of Legal Metrology ..”

      The standard is freely available. I think of it as a really good idea to use that standard for what should be obvious reasons. Even some climate scientists are now starting to realize that international standards should be used. See:
      Uncertainty information in climate data records from Earth observation:
      “The terms “error” and “uncertainty” are often unhelpfully conflated. Usage should follow international standards from metrology (the science of measurement), which bring clarity to thinking about and communicating uncertainty information.”

      • Well, all I can say is that “certainty” and “uncertainty” in statistics are terms derived arithmetically and shouldn’t be confused with “accuracy”. They are used to establish accuracy in an instrument, but they’re distinct (operationally) from the confidence or precision used to describe the use of such an instrument.

        I wrote a description of this difference in a direct reply, I hope I made it clear.

        Essentially, “accuracy” of an instrument is pretty much its known “precision”, if that makes any sense. Use of the instrument makes up the measure of its observational “error” or “experimental precision” if that helps.

        The accuracy of a calibrated instrument reflects its ability to, when correctly used, produce an observation “precise” to the stated range.

        The precision of a measurement taken using such a device is determined by observational error, which is reduced by repeated measure.

        I don’t know how else to say it.

      • S or F ==> They don’t have a single word for the concept I am trying to communicate specifically. If you find one in there, you let me know.

        The lack of an Internationally agreed-upon assigned nomenclature for the concept changes nothing.

      • It is fully legitimate to identify a shortcoming or a flaw in Guide to the expression of uncertainty in measurement.

        Personally, I am aware of one issue with it that should be taken care of: The way it kind of recognizes ´subjective probability´. I have plans for writing a post on that issue.

        However, I think you will need to pay extreme attention to definitions and clarity of your argument to be able to do that. If you use terminology that is already defined in that standard while meaning something else, you will run into problems.

        This post contains some principles that I think you will need to observe closely to succeed with your arguments:
        https://dhf66.wordpress.com/2017/08/06/principles-of-science-and-ethical-guidelines-for-scientific-conduct-v8-0/

        Wish you the best of luck with your effort. :) SorF

      • Bart ==> Apparently you haven’t read all 390 (now) comments above — and if I were writing a statistics essay, I would use their particular, somewhat peculiar, nomenclature. Your definitions are quite correct, of course, and repeated ad nauseam by stats students young and old above.

        I am, however, writing about measurement, not statistics. Statistics is about probability. Think “engineering”.

        I’m afraid that this horse has been beaten far beyond death — into horse puree — and must now be left for the street sweeper to clean up.

      • Kip,

        I’m aware of the difficulty, I’m retired now but was once a practicing statistician involved in the design of experiments so I have a practical understanding of exactly what you’re trying to say.

        I can’t claim to have read all 360 comments above, but I did read quite a few and some are mine. I only brought this up after reading an extended (and rather pointless I might add) debate with Nick Stokes, who apparently can’t differentiate between the concept of an accurate measure and a precise one. As mentioned, unless you agree on terms you’ll end with “horse puree” as you say :)

      • For those with a strong interest in this topic, I think that the Guide to Expression of Uncertainty (the link to which is provided above courtesy of Science or Fiction), is well worth the time to read it.

        Probably the most salient point to be found in it is as follows:
        “4.2.7 If the random variations in the observations of an input quantity are correlated, for example, in time, the mean and experimental standard deviation of the mean [AKA standard error of the mean] as given in 4.2.1 and 4.2.3 MAY BE INAPPROPRIATE ESTIMATORS (C.2.25) of the desired statistics (C.2.23). In such cases, the observations should be analysed by statistical methods specially designed to treat a series of correlated, randomly-varying measurements.”

        It is generally acknowledged that both temperature time-series and sea level time-series are auto-correlated. Thus, the caution should be taken to heart by those who are defending the position that the best estimate of uncertainty for temperature and sea level is the standard deviation divided by the square root of the number of observations.
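
        The caution in 4.2.7 can be illustrated with a small simulation. This is only a sketch under stated assumptions: the AR(1) model, the coefficient `phi = 0.8`, and the function names are illustrative choices, not anything taken from the Guide or from a real temperature or sea level record. It compares the spread actually seen in the mean of an autocorrelated series with the naive standard deviation divided by the square root of n.

```python
import math
import random
import statistics


def ar1_series(n, phi, sigma, rng):
    """Generate an AR(1) series: x[t] = phi * x[t-1] + gaussian noise (hypothetical model)."""
    # Start from the stationary distribution so there is no warm-up transient.
    x = [rng.gauss(0.0, sigma / math.sqrt(1.0 - phi ** 2))]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0.0, sigma))
    return x


def compare_sem(n=200, phi=0.8, sigma=1.0, trials=500, seed=0):
    """Return (observed std of the sample mean, average naive SD/sqrt(n))."""
    rng = random.Random(seed)
    means, naive = [], []
    for _ in range(trials):
        x = ar1_series(n, phi, sigma, rng)
        means.append(statistics.fmean(x))
        naive.append(statistics.stdev(x) / math.sqrt(n))
    return statistics.stdev(means), statistics.fmean(naive)


actual_spread, naive_sem = compare_sem()
print(f"observed std of the mean:  {actual_spread:.3f}")
print(f"average naive SD/sqrt(n):  {naive_sem:.3f}")
```

        Under these assumed settings the naive figure comes out several times smaller than the spread the mean actually shows, which is the sense in which the simple formula "may be inappropriate" for correlated observations.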

      • “Nick Stokes, who apparently can’t differentiate between the concept of an accurate measure and a precise one”
        No quotes given. I haven’t talked about either very much. The frustration of these threads is that people persistently talk about the wrong problem. And it’s obviously the wrong problem. There is no citation of someone actually making repeated measurements of the same thing to try to improve precision. What is in fact happening is the use of repeated single measurements of different samples to estimate a population mean. And the right language there is sampling error and bias, and that is what I’ve been talking about.

      • Mr Stokes, the issue is not measuring an unchanging population mean, but a changing mean temperature, derived from instruments with a limited degree of both accuracy and precision. The dispute is over whether one can claim greater resolution in determining that change than is inherent in the instruments.

      • “the issue is not measuring an unchanging population mean”
        Often it is measuring a mean that changes slowly relative to the measurement frequency. You can see it as a version of smoothing. In any case, the issues are known. What it isn’t is a metrology problem. It is a matter of combining separately measured samples (which have expected variability) to estimate some kind of expected mean or smoothed value.

      • Nick ==> And you are entitled to “estimate some kind of expected mean or smoothed value” in the way you describe. What the method can’t do is claim to find a highly accurate mean far beyond the accuracy of the original measurements.

  69. With all the statistical nitpicking of commenters above, Kip Hansen makes the extremely valid point that if someone tries to claim that sea level is rising at a rate of 2 mm per year or 4 mm per year, what people are really debating is whether the “average” sea level rose one increment of the instrument (20 mm) in 10 years or in 5 years.

    Of course, the data cited by Kip showed that the water level at Battery Park rose 0.40 meter in a half hour, meaning that over the 6 hours or so between low and high tide, the water level could rise 2 meters or more (then return back to the original value 6 hours or so later).

    If the long-term “average” sea level is rising at 2 to 4 mm/year, it might take several decades to sort out the slow-rising “signal” from the huge “noise” of twice-daily fluctuations two orders of magnitude larger, and twice-monthly fluctuations (due to phases of the moon) in the amplitude of the twice-daily fluctuations.

    Then, if it is found that a 1-meter sea-level rise would flood a coastal city, in a few decades we could find out whether the city has 250 years or 500 years to build a 1-meter high seawall to protect the city. Most cities could afford to wait, in order to determine whether the investment is necessary.
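
    The arithmetic behind those figures is simple enough to write out. A minimal sketch, taking the 20 mm instrument increment, the 1-meter seawall height and the 2 and 4 mm/year rates from the comment above; the function name `years_to_rise` is made up for illustration and a constant rate is assumed:

```python
def years_to_rise(rise_mm, rate_mm_per_year):
    """Years needed to accumulate a given rise at a constant rate."""
    return rise_mm / rate_mm_per_year


for rate in (2.0, 4.0):
    one_increment = years_to_rise(20.0, rate)   # one 20 mm instrument increment
    one_metre = years_to_rise(1000.0, rate)     # a 1-meter rise
    print(f"{rate} mm/yr: {one_increment:.0f} years per 20 mm, {one_metre:.0f} years per metre")
```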

    • Steve –

      I think it’s a bit more complicated than that.

      An instrument like a thermometer or a tide gauge has a known accuracy, one that’s based on repeated measures using the device. So a tide gauge might be said to be accurate +/- 2cm. The resulting measure taken with the instrument is said to fall somewhere between +2 and -2 cm of any observation. The actual number is presumed to be somewhere within that range.

      But the value itself is assumed to come from a normal distribution. So if it’s reported as “x” cm, it’s expected to lie somewhere between x-2 and x+2 cm, with a 68% probability. As the measure is repeated, confidence the value falls within that range goes up and we say the estimate of the true value is more “precise”. So after some number of repeated measures we can have more confidence (perhaps as much as 98%) that the value is correct, but that doesn’t change the range at all. We’re more confident the value is x, but only within the range of x +/- 2cm.

      This is hard to do in English. I hope that makes sense.

      • Maybe an example:

        We use an instrument with an accuracy of +/- 2cm once to measure a value. The value is assumed (according to the accuracy of the device) to fall within the stated range with a 68% probability; it will be within that range 68% of the time.

        We repeat the measure and collect the readings, then average them. As the number of observations increases, our confidence in the observations increases and the value of x changes. In the end, after many repeated measures, we can say we’re confident the value “x” is within the range of x+/- 2cm with more than 68% confidence. The accuracy of the measure hasn’t changed, but its precision has.
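
        A rough way to put numbers on that example, assuming (and this is a strong assumption) that the only error is independent, zero-mean and normally distributed with a standard deviation of 2 cm. Any fixed calibration offset is outside this sketch and would not be touched by averaging. The function name `coverage` is hypothetical.

```python
import math


def coverage(n, half_width=2.0, sigma=2.0):
    """Probability that the average of n readings lies within +/- half_width
    of the true value, for independent zero-mean normal errors of std sigma."""
    sigma_mean = sigma / math.sqrt(n)
    return math.erf(half_width / (sigma_mean * math.sqrt(2.0)))


for n in (1, 4, 9):
    print(f"n = {n}: P(|mean - true| <= 2 cm) = {coverage(n):.4f}")
```

        Under those assumptions the +/- 2 cm range stays the same while the probability attached to it climbs from about 68% for a single reading to about 95% for the average of four.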

      • Kip it’s difficult for me to compare my attempts to convey basic measurement theory with the strength and necessity of your essay.

        Anything I can do to help. I seriously appreciate your efforts.

      • Kip – I forgot to thank you for putting up with the typos :) There’s no “edit” button on this board and I have learned to depend on post hoc editing. What can I say? I’m a bad, bad typist…

  70. EPILOGUE:

    Terrific discussions with a lot of good input from readers with a lot of different scientific backgrounds.

    Appreciate all the intelligent questions and help in clarifying the main issue — and especially the support for measurement pragmatism over statistical pedantry.

    I did not really expect to win over those whose professional status depends somewhat on the StatsFolk view being applicable in all cases and it was heartening to have some help in the effort.

    The essay presents a valuable lesson for those who are without vested interest in the opposing view.

    Again, if you have an unresolved question or something you just have to tell me — leave a further comment and I’ll try to attend to it. If not, feel free to email me at my first name at the domain i4 decimal net.

    — kh

  71. Firstly, thank you Kip Hansen for a great post and for taking the time to respond to so many comments in this long thread.

    I totally agree with Kip and appreciate his very well written and simple explication of this aspect of metrology.

    I’ve come to this post late and have read all the comments above but only wanted to interject, if I wasn’t repeating what had been said already.

    Kip describes what is, to me, an obvious truth: a well-known, well-documented and well-understood issue for practitioners in the real world.

    I’m a visual person and the commonly used image of a target to symbolise the – accuracy/precision – issue, is the first thing I thought of. But when doing this mental exercise, one can easily see that there is a need for a third term! I drew pictures of each “target” with the aim of making a cartoon example that would make it very clear to a layman. I was sure that anybody who had given this more than a moment’s thought would also have come up with the same result, so I did just a little research:

    According to ISO 5725-1 and VIM*, the general term ”accuracy”, is used to describe the closeness of a measurement to the true value. When the term is applied to sets of measurements of the same measurand, it involves a component of random error and a component of systematic error. In this case trueness is the closeness of the mean of a set of measurement results to the actual (true) value and precision is the closeness of agreement among a set of results.

    According to ISO 5725-1 and VIM, Accuracy consists of Trueness (proximity of measurement results to the true value) and Precision (repeatability or reproducibility of the measurement) – Wikipedia

    Hope this helps to visualise the issues.

    cheers,

    Scott

    *BIPM International Vocabulary of Metrology (VIM)
    **I’ve adapted my graphic from the Wikipedia commons image.

    • Scott ==> Beauty. From the standard you quote:

      “When the term is applied to sets of measurements of the same measurand, “

      The “measurand” is “A quantity intended to be measured. An object being measured.”
      The only time “accuracy” applies to the mean of a set of measurements is when they are of the same measurand.
      As you can tell from this and other essays, I use a lot of images to help communicate the issue I write about.
      Like your modified Accuracy/Precision trio.

      • @Kip – While I´m well aware that it would take you weeks to digest all comments, I´ll make a comment nevertheless. Take your time, if interested. :)

        I would stick to the standard; Wikipedia is not the standard. Definitions matter – a lot:

        «1.2 This Guide is primarily concerned with the expression of uncertainty in the measurement of a well-defined physical quantity — the measurand — that can be characterized by an essentially unique value. If the phenomenon of interest can be represented only as a distribution of values or is dependent on one or more parameters, such as time, then the measurands required for its description are the set of quantities describing that distribution or that dependence.» – GUM

        To investigate the issue of rounding, I think the measurand can be defined as:
        The average of the temperature measurements by 2000 thermometers,
        where:
        these thermometers have a resolution of 0,10 DegC,
        where:
        these thermometers have an uncertainty of 0,01 DegC at 95% confidence level
        where:
        the thermometers are not drifting,
        where:
        these thermometers are used to measure a continuous variable,
        where:
        the variable is random in the measurement range of the thermometer,
        where:
        all the measurements are uncorrelated
        and where:
        each reading is rounded to the nearest integer,

        In that case, rounding of each measurement to the nearest integer does not significantly increase the uncertainty of the average, relative to the average that would be obtained from the original unrounded readings of the thermometers.

        If one condition or premise is added or changed, the conclusion may no longer be valid and will have to be reconsidered.
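
        The hypothetical above can be tried out numerically. This is a sketch only: the 0 to 30 degree range, the seed and the function name `rounding_effect` are assumptions made for illustration, and the readings are taken as independent and uniformly spread over the range, as in the listed conditions. The second call narrows the range to show how the conclusion can change when one premise is altered.

```python
import random
import statistics


def rounding_effect(n=2000, low=0.0, high=30.0, seed=1):
    """Return (mean of unrounded readings, mean of readings rounded to integers)."""
    rng = random.Random(seed)
    readings = [rng.uniform(low, high) for _ in range(n)]
    # Python's round() goes to the nearest integer (ties go to the even integer).
    rounded = [round(r) for r in readings]
    return statistics.fmean(readings), statistics.fmean(rounded)


wide_raw, wide_rounded = rounding_effect()
narrow_raw, narrow_rounded = rounding_effect(low=20.1, high=20.3)

print(f"wide range:   difference of means = {abs(wide_raw - wide_rounded):.4f}")
print(f"narrow range: difference of means = {abs(narrow_raw - narrow_rounded):.4f}")
```

        With readings spread over many degrees the rounding errors largely cancel in the average. With every reading between 20.1 and 20.3 they all round to 20, so the rounded average is off by roughly 0.2 degrees no matter how many thermometers are used.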

        Whether the average temperature is representative for the average temperature of the earth is an entirely other question that certainly deserves some consideration. And whether the so-called raw measurements really are unadjusted, and whether the adjustments to the temperature data are valid, are serious questions.

        «D.1.1 The first step in making a measurement is to specify the measurand — the quantity to be measured; the measurand cannot be specified by a value but only by a description of a quantity. However, in principle, a measurand cannot be completely described without an infinite amount of information. Thus, to the extent that it leaves room for interpretation, incomplete definition of the measurand introduces into the uncertainty of the result of a measurement a component of uncertainty that may or may not be significant relative to the accuracy required of the measurement.» – GUM

        Actually, global average temperature is not defined in the Paris agreement. And climate scientists keep changing their measurement of ´it´ all the time without properly defining their products.

      • Science or Fiction,

        You remarked,
        “these thermometers are used to measure a continuous variable,
        where:
        the variable is random in the measurement range of the thermometer,…”

        These two statements strike me as being contradictory. Am I misunderstanding what you meant?

        You go on to offer, “all the measurements are uncorrelated”. This seems to be at odds with your first statement. For a time-series, at a particular site, there will be autocorrelation. That is, one does not expect temperature to change instantaneously some tens of degrees, which would be the case if the variable were truly random. Any particular site can be considered a sample, which is composited for the global average. Therefore, all the sites will exhibit a degree of autocorrelation, which will vary depending on the season and location.

      • Kip==> Thank you again for taking the time to respond. You say:

        The “measurand” is “A quantity intended to be measured. An object being measured.” The only time “accuracy” applies to the mean of a set of measurements is when they are of the same measurand.

        Sure, I do agree with you but wanted to tease out the very first principles.

        If you aim at one target and take one shot – one measurement of one measurand – you can only talk about the result in terms of trueness (the proximity of the measurement result to the true value) or how close the bullet hole is to the centre of the target. You can not yet talk about accuracy because that requires a knowledge of precision. Precision is the repeatability of the measurement, it equates to a cluster of shots – or “the closeness of agreement among a set of results”.

        I simply meant, that more than one shot is required either of the same target or single shots on multiple targets before you can determine precision. Once you have a “measure” of precision and trueness you can begin to talk about accuracy.

        The reason precision is a necessary condition for accuracy is because it is impossible to separate random error – or a random truth in this case – from systematic error. A gun bolted to a test bench hitting the bullseye in one shot can not speak to accuracy because precision hasn’t been tested. The next several shots hitting all over the target or all shots going through the hole in the bullseye will illuminate the situation however!
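
        The two quantities can be computed separately from a set of results. A minimal sketch, using made-up "shots" against a known true value of 10; the data and the function name `trueness_and_precision` are purely illustrative, and in real measurement the true value is of course not known.

```python
import statistics


def trueness_and_precision(results, true_value):
    """Return (bias, spread): bias is the mean minus the true value (trueness),
    spread is the sample standard deviation of the results (precision)."""
    bias = statistics.fmean(results) - true_value
    spread = statistics.stdev(results)
    return bias, spread


tight_but_off = [10.9, 11.0, 11.1, 11.0, 11.0]       # precise, poor trueness
scattered_but_centred = [8.0, 12.0, 9.0, 11.0, 10.0]  # good trueness, imprecise

for name, shots in (("tight but off", tight_but_off),
                    ("scattered but centred", scattered_but_centred)):
    bias, spread = trueness_and_precision(shots, true_value=10.0)
    print(f"{name}: bias = {bias:+.2f}, spread = {spread:.2f}")
```

        A single shot gives a bias but no spread at all (the standard deviation needs at least two results), which is the point about one shot saying nothing of precision.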

        If there are two sides to the argument in the thread above, I think it is because people are talking at cross purposes. One side is arguing precision and the other trueness and they are confusing either term with accuracy.

        To restate, the trueness of your tide gauge example equates to the target’s bullseye – the instantaneous water level outside the stilling well. While the 181 (1 millimetre resolution) “cluster” provides precision, the +/-2cm calibrated range of the apparatus added to the recorded value represents its real accuracy and true range.

        And I agree, averaging large numbers of these values will only produce a spurious accuracy because, although a quasi-precision might be gained, trueness is lowered and thus accuracy lost!

        cheers,

        Scott

      • @ Clyde Spencer
        I was just trying to define a hypothetical set of conditions to make it easier to understand that rounding of the readings per se will not be a problem. However, as you point out, the conditions are not true for a real world attempt to measure global average temperature by a number of thermometers. There is a large number of issues with an attempt to measure global average temperature. I´m most concerned about the adjustments. How adjustments for the urban heat effect apparently increase the temperature trend rather than reduce it is one of the things that makes me wonder. I don´t think rounding per se is the most significant problem. Even though rounding to the nearest integer could be a significant problem if the temperature range is small compared to the size of the rounding.

  72. I’m no expert in measurement, just thinking:

    First the instrument has to be calibrated. This process probably has a normally distributed outcome, but to the end user of an instrument this is meaningless. So the instrument has a fixed accuracy x+-e_cal

    The instrument will make measurements with errors normally distributed with respect to the calibration error.
    Repeated measurements of the same quantity with this instrument will make it possible to reduce this error (e_norm).
    The absolute error will be x+-e_cal+-e_norm and cannot be reduced below +-e_cal even if e_norm were averaged to 0.0.

    If you use all calibrated instruments simultaneously to measure the same quantity, the sum of the calibration error and e_norm should be reduced by averaging because the calibration process was assumed to have normal errors. The same should apply if you were measuring different quantities and calculated their average.

    The example with Fahrenheit is different since this deals with the quantisation error which indeed cannot be reduced for one and the same instrument by averaging.
    But if you assume the calibration process to have normally distributed errors, using an ensemble of those instruments should make it possible to overcome this threshold.

    While I consider myself a “sceptic”, so far these musings leave me in principle on the side of those who claim the error can be reduced by averaging (considering the scenario of global temperature measurement (whatever that may mean in the end)).
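
    The two cases described above can be sketched as a simulation. Everything here rests on the comment's own assumptions: calibration offsets are zero-mean, normally distributed and independent between instruments, and the reading-to-reading error is independent normal noise. The values `e_cal = 1.0`, `e_norm = 1.0`, the count of 100 and the function names are arbitrary illustrative choices.

```python
import random
import statistics


def mean_error_single(n_readings, e_cal, e_norm, rng):
    """Error of the average of n readings from ONE instrument:
    the calibration offset is drawn once and then stays fixed."""
    offset = rng.gauss(0.0, e_cal)
    return statistics.fmean([offset + rng.gauss(0.0, e_norm) for _ in range(n_readings)])


def mean_error_ensemble(n_instruments, e_cal, e_norm, rng):
    """Error of the average of one reading from each of n instruments,
    each with its own independently drawn calibration offset."""
    return statistics.fmean(
        [rng.gauss(0.0, e_cal) + rng.gauss(0.0, e_norm) for _ in range(n_instruments)]
    )


def rms(values):
    """Root-mean-square of a list of errors."""
    return (sum(v * v for v in values) / len(values)) ** 0.5


rng = random.Random(42)
trials = 2000
single = rms([mean_error_single(100, 1.0, 1.0, rng) for _ in range(trials)])
ensemble = rms([mean_error_ensemble(100, 1.0, 1.0, rng) for _ in range(trials)])

print(f"one instrument, 100 readings:   RMS error of mean = {single:.3f}")
print(f"100 instruments, 1 reading each: RMS error of mean = {ensemble:.3f}")
```

    In this model the single-instrument average stays stuck near e_cal, while the ensemble average falls roughly as one over the square root of the number of instruments. If the calibration errors shared a common bias, that part would not average away in either case.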

    • Did the thermometer read 0.0 deg C in a slush of distilled/deionized water at SLP (sea level pressure)?
      Did the thermometer read 100.0 deg C in rolling/boiling distilled/deionized water at SLP?

      If these two calibration points are a perfect fit (or not, record the error and note it on any subsequent measurements made), then the thermometer is “calibrated”. This applies to checking the accuracy of Hg or alcohol thermometers but can be used to calibrate “thermistors” and RTD’s.

      Question: When I had to make water/wastewater outfall temperature checks, one must use a “calibrated” and/or “traceable” thermometer that is verified/calibrated yearly by ASTM standards. If I fail to do so, my records are tossed out, my company is fined (NOV), I might lose my job and possibly my license to treat water or wastewater or in cases of “pencil whipping” one could lose their freedom. I practiced due diligence for 20 years and left water/wastewater for the classroom with a clean record.

      When a climate scientist is feeding data into policy decisions which affect hundreds of millions or billions of people…. who is checking their calibration, measurement consistency, etc.? What is the consequence of failure of due diligence? We see from the response to Mr. Watts, et al., site checking project that the policy makers and the policy feeders want little oversight. Too bad those that should be providing the oversight are also the ones creating the records.

    • But if you assume the calibration process to have normally distributed errors, using an ensemble of those instruments should make it possible to overcome this threshold.

      You are entirely correct, and thanks for pointing this out.

      I just realized about Kip’s exact example above – one Tide Station – he is entirely correct that for that one tide station you cannot exceed the precision of that tide station because it’s ONE station and the calibration accuracy cannot be better than the precision of the instrument calibrated. So I apologize to Kip here, I was wrong about that detail. It takes at least 30 tide stations independently calibrated to exceed the accuracy and precision of the tidal measurement instrument (or the temperature measurement instrument).

      I still strongly believe (from professional opinion) that as long as the calibration sources have a precision and accuracy better than that of the measurement instruments, and the calibrations are independent, that the accuracy and precision of the global average temperature and tide levels exceeds that of individual instruments.

      Here’s a new problem though: from a signal processing interpretation, the calibration interval induces a noise at the frequency of the inverse of that interval, and the noise has a level corresponding to the instrument precision. And that calibration interval could alias with the horrible boxcar averaging methods used by climate scientists, further creating errors in the data. Yikes.

      Peter

  73. krmmtoday writes:

    While I consider myself a “sceptic”, so far these musings leave me in principle on the side of those who claim the error can be reduced by averaging (considering the scenario of global temperature measurement (whatever that may mean in the end)).

    Although I agree, you should take note of how little has been accomplished in this thread despite the considerable praise heaped upon the essay. The OP expends many words upon a triviality, namely that multiple measurements where true values are confined to within a finite interval around the reported value will average to an estimate necessarily confined to an interval equal to the average of the raw measurement intervals. (the intervals need not be equal, though they are in KH’s example). This could be deduced from a single line of mathematics. The discussion goes off the rails when KH makes the following claim:

    “If each measurement is only accurate to ± 2 cm, then the monthly mean cannot be MORE accurate than that — it must carry the same range of error/uncertainty as the original measurements from which it is made. Averaging does not increase accuracy.”

    The closing claim, that averaging does not increase accuracy, is an equivocation between the total range of possible mean values vs the accuracy (i.e. the probable range) of deviation of the mean from that estimated by averaging recorded results. Imagine the counterpart of the above statement for measurements collected with measurement errors known a priori to be additive gaussian random variables. Since the “possible range” of a gaussian rv is +/- infinity none of the measurements are informative according to the “possible range” criterion and neither is the average as it is also gaussian by implication. Sound persuasive?

    In order to talk meaningfully about the accuracy of a statistical estimate we have to talk probabilities and it doesn’t answer the mail to make statements like “this article is about measurements not probabilities”. To say something informative about an estimated mean value we have to know something meaningful about the probability distribution of the error of the estimated average. That’s what probabilities are for … to make quantitative statements about achieved real world values in the presence of various forms of uncertainty including sampling and measurement errors. You cannot pinch-hit for probabilities with interval arithmetic.

    To estimate parameters of the probability distribution of the error of a data average we need to consult some representation of the joint probability distribution of the measurement errors. That is what we need to know and it is all we need to know. But how can we make meaningful statements about something as potentially complicated as the multivariate probability distribution of instrument sampling errors? The answer arises from instrument design and testing, followed by calibration and data quality assurance procedures in instrument use.
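
    The distinction between the possible range and the probable range of the mean can be made concrete with a sketch. It assumes, purely for illustration, that each reading's error is independent and uniformly distributed within +/- 2 cm; independence is exactly the premise in dispute in this thread, and a shared offset between readings would not shrink this way. The function name `mean_error_quantile` is hypothetical.

```python
import random
import statistics


def mean_error_quantile(n, half_width=2.0, trials=3000, q=0.95, seed=7):
    """Empirical q-quantile of |error of the mean| when each of n readings has an
    independent error drawn uniformly from [-half_width, +half_width]."""
    rng = random.Random(seed)
    abs_errors = sorted(
        abs(statistics.fmean([rng.uniform(-half_width, half_width) for _ in range(n)]))
        for _ in range(trials)
    )
    return abs_errors[int(q * trials) - 1]


q1 = mean_error_quantile(1)
q100 = mean_error_quantile(100)
print(f"n = 1:   95% of mean errors fall within +/- {q1:.2f} cm")
print(f"n = 100: 95% of mean errors fall within +/- {q100:.2f} cm")
```

    Interval arithmetic gives the same worst case of +/- 2 cm for both rows. What changes with n, under the independence assumption, is how much of that interval the mean error is at all likely to occupy.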

    The instrument designer strives to: 1. Reduce systematic error and drift by design and calibration and 2. Increase sample to sample measurement error independence and decorrelation of random uncontrollable errors, often by adjusting the sample rate to conform to typical variations in signal to levels comparable to sampling error levels. These efforts can succeed in making the usual 1/n reduction in estimated variance a viable approximation but inevitably run afoul of any residual instrument bias and, if n is pushed too far, e.g. by oversampling, might materially underestimate error variance from residual correlations of measurement error. Fortunately the former error can be estimated from repeated measurement of known constant signals from lab standards and the latter can be checked against observed variances of signal to ensure that the variance of the estimated mean is dominated by signal variation rather than sampling noise. A typical example of joint compromise would be to add a random dither signal to avoid quantization biases from signals whose multi-sample variation is below the LSB of a digitizer at the expense of an increase (but useful increase) in random error variance etc. In any case use of the statistical formulae requires knowledge of the signals being interrogated, the instruments employed, adequate support from testing and, in the end, a non-conspiratorial attitude toward the actual errors encountered in practice. This latter attitude, although it routinely is and should be under constant surveillance, is altogether customary in science and engineering generally, but appears to be a very “hard sell” for many on this board.
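
    The dither example lends itself to a short sketch. The numbers are assumptions chosen for illustration: a constant signal of 0.3 LSB, an idealised quantizer that rounds to the nearest integer, and uniform dither of +/- 0.5 LSB added before quantization. The function names are hypothetical and no particular digitizer is being modelled.

```python
import math
import random
import statistics


def quantize(x):
    """Idealised quantizer: round to the nearest integer number of LSBs."""
    return math.floor(x + 0.5)


def averaged_reading(signal, n, dither_amplitude, rng):
    """Average of n quantized samples of a constant signal, with uniform
    dither in [-dither_amplitude, +dither_amplitude] added before quantizing."""
    samples = [
        quantize(signal + rng.uniform(-dither_amplitude, dither_amplitude))
        for _ in range(n)
    ]
    return statistics.fmean(samples)


rng = random.Random(3)
no_dither = averaged_reading(0.3, 10000, 0.0, rng)
with_dither = averaged_reading(0.3, 10000, 0.5, rng)

print(f"true signal: 0.3 LSB")
print(f"average without dither: {no_dither:.3f}")
print(f"average with dither:    {with_dither:.3f}")
```

    Without dither every sample quantizes to the same code, so averaging cannot recover the sub-LSB value. With dither each sample is noisier, but the average converges toward the true signal, which is the "useful increase" in random error variance mentioned above.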

    I have no illusions that anything that I may write will change the evident opinion of many readers of WUWT i.e. that numerous practitioners of “climate science” (I hate that term) are utterly sunk in confirmation bias. But skeptics need to acknowledge that Judith Curry’s “uncertainty monster” cuts both ways and the only way out of the morass is the concrete accomplishments of well founded science that can achieve real, not manufactured, consensus amongst competent and well informed participants. I don’t think that the OP of this thread has materially advanced that cause.

    • Carl:

      I see you have some signal processing background and I agree with what you say, but I disagree that the article has done nothing. Despite the article being wrong, it’s inspired some of us to think about all sorts of interesting sources of errors.

      Also, in regards to a single station, Kip is right, though he doesn’t state correctly WHY he is right. He’s correct because the accuracy of calibration cannot exceed the precision of the instrument being calibrated.

      Of course, often calibration is done with finer-grained measurements than station reporting, and we can also independently calibrate multiple stations and thus get a gaussian curve of calibration errors and then get more accuracy (approaching the accuracy of the calibration source), but that’s not discussed in the article either. Has anyone looked into calibration methods for thermometers and tide stations 100 years ago?

      Furthermore, instruments are re-calibrated at some interval (such as yearly), which introduces noise at the level of the precision of the instrument and at the frequency inverse to the calibration interval. Which WILL affect low frequency noises and potentially introduce aliasing, false positives on step change detection, etc. Changes in calibration methods will induce even lower frequency noise, which will show up as a trendline.

      Don’t get me started on how horrible it is to draw a trend line on time-series data. It violates Nyquist and should not be done.

      Peter

  74. The root cause of so much advocating to and fro is simply that averaging can improve accuracy AND it can’t. BUT it depends on the kind of error, the kind of data, and the context of use.

    In general use it is best to NOT use an average to attempt to remove error as it gives false accuracy if used wrongly and most people don’t know when to use it, and not. Thus the rules taught to me in chemistry class to carry error range forward unchanged.

    Yet IFF you have an extensive property (and temperature is NOT one) you can use an average of measurements OF THE SAME THING OR EVENT (and sequential temperature readings over time or from different places are not the same air mass) to remove RANDOM error (but not systematic error).

    https://chiefio.wordpress.com/2011/07/01/intrinsic-extrinsic-intensive-extensive/

    So both sides are technically correct that sometimes averaging can increase accuracy, but also most times it can not.

    Averaging temperature data is fundamentally broken due to the three things already listed. Temperature is an intrinsic property. The measurements are of different air masses so not measuring the same air with 10 thermometers to remove the random errors between them. Then the third problem is that much of the error is systematic (humidity problem in electronic sensors of one type, change from one class of instrument (LIG) to another (fast response electronic), change from whitewash to latex paint, aging of latex paint, etc.) so not removed by averaging anyway.

    In short, under very limited circumstances and done with great care, some types of measurement can be improved by averaging. BUT, unfortunately, temperature data is not that type, the measurements are not of the same thing, and the dominant errors are systematic so not improved anyway.

    The use of averages to “improve” temperature accuracy is hopelessly broken and wrong, but getting warmers to see why is nearly impossible, in part due to the existence of examples where other data can be improved (like gravity (one thing, an extrinsic property) measured many times to remove random errors between the measurements). That the specific does not generalize to temperatures at different times and places with systematic equipment error escapes them.

    • “Temperature is an intrinsic property.”
      “gravity (one thing, an extrinsic property) measured many times to remove random errors”
      Actually, gravity is intensive, or intrinsic. But the point of the distinction is that an intensive property, when integrated over space, becomes extensive. Density ρ is intensive, but when integrated over space becomes mass, extensive. The product ρg is intensive, but integrated becomes weight (force), extensive.

      The process of averaging temperature is integrating over a surface, so the integral is extensive (not quite, perhaps, because it isn’t over a volume, but if you took it as representing a surface layer it would be extensive).

      “The measurements are of different air masses “
      The whole point is to measure different air masses, to get an estimate of an extensive property by sampling. Think of trying to estimate a gold ore body. You drill for samples, and measure gm/ton, an intensive property. You get as many as you can, at different and known spatial locations. Then you integrate that over space to see how much gold you have. People invest in that. The more samples, the better coverage you have, but also the more the effect of incidental errors of each local collection of ore are reduced in the final figure.

      Systematic error is usually reduced by averaging. Only some have aged latex, so the effect on the average is reduced. But a bias remains and affects the average. People make a lot of effort to identify and remove that bias.

      • NS,

        The point of drilling as many cores as can be afforded reasonably is to be sure that the volume is not seriously undersampled. Assuming that the gold has a normal distribution (probably an invalid assumption) a single sample could represent any point on the probability distribution curve. With more samples, it is more likely that the samples will fall within the +/- 1 SD interval, and the average will give an accurate estimate of the mean. That speaks to the accuracy of the volume estimate, which is of concern to investors. In this analogy, precision is of lesser concern than accuracy.

        The gold analogy breaks down with respect to temperatures and sea levels because in the gold case the attempt is to estimate the total FIXED quantity of gold in the ore body. The point of contention in the climatology issues is whether the annual average value changes over time are real, or are an artifact of sampling error. Because the annual average changes are typically to the right of the decimal point, the precision becomes important. Before one can even apply the standard error of the mean (IF it IS valid!) there has to be agreement on what the standard deviation of the sampled population is. The approach of using monthly averages to calculate annual averages strongly filters out extreme values, which affect the standard deviation. I have made the argument that based on the known range of Earth land temperatures, the standard deviation of diurnal variations is highly likely to be several tens of degrees.

        Climatologists are worried about the weight of the fleas on a dog when they aren’t sure of the weight of the dog without the fleas.

      • NS,
        You said, “Systematic error is usually reduced by averaging.” I seriously question the validity of that claim. I can imagine situations where it might happen. However, one of the most serious examples of systematic bias in climatology is the orbital decay of satellites, resulting in the temperatures being recorded at increasingly earlier times, which were not random! Systematic error can be corrected if its presence is identified, and can be attributed to some measurable cause. The point of calling it “systematic” is that it is NOT random and generally not amenable to correction by averaging.

      • ” The point of calling it “systematic” is that it is NOT random and generally not amenable to correction by averaging.”
        The way averaging reduces error is by cancellation. The only way it can completely fail to do that is if there is no cancellation – ie all errors are the same, in the same direction. With a small number of sensors, as with satellite, that can happen. But systematic error due say to aged latex, some will have it and some not. If you average a set of readings where half are affected by aging, the average will reflect about half the effect.
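The arithmetic in this exchange can be sketched as follows, assuming a hypothetical network of 1000 stations and a made-up paint-aging bias of 0.4 degrees. Random error is deliberately left out so that only the systematic part shows. The function name and numbers are illustrative only.

```python
import statistics

def network_mean_error(n_stations, affected_fraction, bias):
    """Error in the network average when a fraction of stations share a fixed bias.

    Random error is omitted so only the systematic contribution is visible.
    """
    n_affected = round(n_stations * affected_fraction)
    errors = [bias] * n_affected + [0.0] * (n_stations - n_affected)
    return statistics.fmean(errors)

half_affected = network_mean_error(1000, 0.5, 0.4)  # half the stations carry the bias
all_affected = network_mean_error(1000, 1.0, 0.4)   # every station carries the bias

print(half_affected, all_affected)
```

With half the stations affected the average carries about half the bias, which is dilution rather than cancellation. With every station affected nothing is removed, however many stations are averaged.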

      • Then you integrate that over space to see how much gold you have.

        Correct, but one would never plot a trendline on the gold sample data and use that analysis to make an investment on the next plot over, which is what the climate scientists are asking us to do…

      • NS,

        You said, ” People make a lot of effort to identify and remove that bias.” The classic study done by Anthony demonstrates that they either failed frequently in their effort, or didn’t make the effort as you claim.

        If temperature data that are systematically high, because of poor siting, are averaged with other data, the bias is diluted. [Does homogenization propagate this bias beyond a single site?] However, the other side of the coin is that the ‘good’ data are corrupted. Neither is the same as “cancelling,” as when the variations are random.

        In the situation of the aging latex paint, that means EVERY station is subject to a degradation which is ongoing and continuous, with it being worse for old stations. That is, there is an increasing bias or trend built into every site!

      • Stokes and others,

        I’d like to draw your attention to a quote from “An Introduction to Error Analysis,” (Taylor, 1982), p.95:

        “We saw in Section 4.4 that the standard deviation of the mean [aka standard error of the mean] {sigma sub bar x} approaches zero as the number of measurements N is increased. This result suggested that, if you have the patience to make an enormous number of measurements, then you can reduce the uncertainties indefinitely, without having to improve your equipment or technique. We can now see that this is not really so. Increasing N can reduce the RANDOM component, {delta k sub random} = {sigma sub bar k} indefinitely. But any given apparatus has SOME systematic uncertainty, which is NOT reduced as we increase N. It is clear from (4.25) that little is gained from further reduction of {delta k random} once {delta k random} is smaller than {delta k systematic.} In particular, the total {delta k} can never be made less than {delta k systematic}. This simply confirms what we already guessed, that in practice a large reduction of the uncertainty requires improvements in techniques or equipment in order to reduce both the random and the systematic errors in each single measurement.”

        Basically, without a careful definition of the measurand, and identification of the types and magnitude of all the uncertainties, and rigorous assessment of the calculated statistics and their uncertainties, one is not justified in categorically stating that the accuracy and precision of the mean annual global temperature or mean sea level is simply the (unstated) standard deviation divided by the square root of the number of measurements.
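A short sketch of the point in the Taylor quotation, assuming that the total uncertainty is the quadrature sum of a random part, sigma divided by the square root of N, and a fixed systematic part. That reading of (4.25) is an assumption here, and the values of 2.0 and 0.5 are invented for illustration.

```python
import math

def total_uncertainty(sigma, n, systematic):
    """Combine the random part (sigma / sqrt(n)) and a fixed systematic part in quadrature."""
    random_part = sigma / math.sqrt(n)
    return math.sqrt(random_part ** 2 + systematic ** 2)

# Hypothetical single-reading scatter of 2.0 and systematic uncertainty of 0.5.
for n in (1, 100, 10_000, 1_000_000):
    print(n, round(total_uncertainty(2.0, n, 0.5), 4))
```

Under these assumptions the total falls quickly at first and then flattens just above the systematic value, so further increases in N gain very little.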

      • Clyde ==> “But any given apparatus has SOME systematic uncertainty, which is NOT reduced as we increase N. ”

        When measurements are intentionally given as ranges, that IS the systematic uncertainty, by definition. The system is to state the measurement as a range, within which the true value certainly resides — equally certain at ANY point inside the range.
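One way to make the range view concrete is worst-case interval arithmetic, sketched below with made-up readings and a half-width of 0.5. This is only an illustration of the interval treatment; it assumes nothing about how likely any point inside a range is, which is exactly where it parts company with the probabilistic treatment argued elsewhere in the thread.

```python
def mean_interval(readings, half_width):
    """Average readings that are each reported as [x - half_width, x + half_width].

    Returns the (low, high) bounds of the mean under worst-case interval arithmetic.
    """
    n = len(readings)
    low = sum(x - half_width for x in readings) / n
    high = sum(x + half_width for x in readings) / n
    return low, high

low, high = mean_interval([71.0, 72.0, 70.0, 73.0], 0.5)  # hypothetical readings
print(low, high, high - low)
```

In this treatment the interval around the mean is as wide as the interval around each reading, whatever the number of readings.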

      • Clyde,
        “I’d like to draw your attention to a quote “
        This seems to go on endlessly. That is the wrong problem!!! It is not the situation in any kind of climate/sea level context that has been raised. The quote concerns the problem of trying to improve the accuracy of a single measurement by repetition. The climate problem is the estimation of a population mean by averaging many single measurements of different things. OK, they may be all temperatures. And at one site, they may be measured on different days with the same instrument. But measuring today’s max, and then tomorrow’s, is not a repeated measure in the sense of Taylor. No-one expects those measures to get the same result.

        Taylor’s text is here. The text starts by specifying that repeated measurement of the same thing is his topic.

        The important difference to climate is that there is now not just one “systematic uncertainty”. There are thousands, and they themselves will be in different directions and will be much reduced in the final average. There may be a residual component common to all the samples. That is the bias.

        In global temperature, the main defence against bias is the formation of anomalies. That removes consistent bias. Then you only have to worry about circumstances in which the bias changes. That is what homogenisation is all about.
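The anomaly argument can be sketched with a single hypothetical station, assuming the anomaly is each value minus that station's own mean over a baseline period. The temperatures, the 0.8 degree offset and the drift rate are invented for illustration.

```python
import statistics

def anomalies(series, baseline_len):
    """Subtract the mean of the first baseline_len values from every value."""
    base = statistics.fmean(series[:baseline_len])
    return [x - base for x in series]

true_temps = [14.0, 14.1, 13.9, 14.0, 14.3, 14.5]                 # hypothetical values
constant_bias = [t + 0.8 for t in true_temps]                     # fixed offset at this site
drifting_bias = [t + 0.1 * i for i, t in enumerate(true_temps)]   # offset grows over time

a_true = anomalies(true_temps, 4)
a_const = anomalies(constant_bias, 4)
a_drift = anomalies(drifting_bias, 4)

print(a_true[-1], a_const[-1], a_drift[-1])
```

A constant offset drops out of the anomalies entirely, while a bias that changes over time survives as a spurious trend. That second case is the one both sides of this exchange are arguing about.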

      • NS,

        I think it is you who does not understand. You said, “The quote concerns the problem of trying to improve the accuracy of a single measurement by repetition.” That is wrong. The quote is about estimating the value of a fixed parameter by taking multiple measurements. It boils down to “…the estimation of a population mean by averaging many single measurements…” The diameter of a ball bearing and a fictitious representative global temperature are analogous problems except that, in the case of a variable, the parameter supposedly being measured is changing and becomes part of the systematic component, increasing the inherent uncertainty. Also, the simple SDOM is not appropriate for data sets, such as time-series, that are correlated.

        I think that you have lost sight of the fact that the point of contention (such as claimed by Mark S Johnson) is whether or not the SDOM can be reduced indefinitely in order to provide sufficient precision to say that the annual temperature difference between year 1 and year 2 is statistically significant.

        You claim, “There are thousands, and they themselves will be in different directions and will be much reduced in the final average.” That is an unproven assumption, and without quantitative proof.

        You further claim, “In global temperature, the main defence against bias is the formation of anomalies.” Anomalies correct for elevation and climate differences. However, without determining what all the systematic errors are, and quantifying them, you are on thin ice to claim that they will all cancel out. They could just as easily be additive. You simply don’t know, and are hoping that they cancel out.

        Thank you for the link. It looks like a different edition of Taylor than what I have in my library and I will compare the two.

  75. There’s a pernicious inability here to distinguish between the inherent measurement error that persists in individual data points and the strongly reduced effect manifest in temporal or aggregate averages.

    • To expand upon the above point in the relevant context of continuous-time geophysical signals, consider the measurement M(t) at any time t to consist of the true signal value s(t) plus the signal-independent measurement error or noise n(t). At no instant of time will the noise disappear from the measurements. According to the Parseval Theorem, the total variance of the measurements will always be the sum of the respective variances of signal and noise.

      But if we take the temporal average of M(t) = s(t) + n(t), the contribution of the noise term will tend to zero for unbiased, zero-mean noise, thereby greatly improving the available statistical estimate of signal mean value. A similar reduction in variance takes place whenever the averaging is done over the aggregate of sampled time-series (station records) within a spatially homogeneous area.

      What seems to confuse signal analysis novices is the categorical difference between instantaneous measurements and statistical constructs such as the mean of measurements–which is always the result not of direct measurement but of statistical estimation.
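A rough numerical sketch of the signal-plus-noise argument, under assumptions the comment does not state: a sinusoidal signal with a mean of 5.0, and white Gaussian noise of standard deviation 2.0 that is independent of the signal. Correlated or biased noise would not average away in the same manner.

```python
import math
import random
import statistics

rng = random.Random(1)   # fixed seed so the run is repeatable
n = 20_000

signal = [math.sin(2 * math.pi * k / 500) + 5.0 for k in range(n)]  # s(t), mean 5.0
noise = [rng.gauss(0.0, 2.0) for _ in range(n)]                     # n(t), zero mean
measured = [s + e for s, e in zip(signal, noise)]                   # M(t) = s(t) + n(t)

# Variances of independent signal and noise add (up to sampling fluctuation).
var_sum = statistics.pvariance(signal) + statistics.pvariance(noise)
var_measured = statistics.pvariance(measured)

# The noise stays in every individual sample but nearly vanishes from the mean.
mean_error = statistics.fmean(measured) - statistics.fmean(signal)

print(var_sum, var_measured, mean_error)
```

Each individual M(t) is still off by about two units on average, yet the error in the time-average is far smaller, which is the distinction between instantaneous measurements and statistical estimates drawn above.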

    • Moderator:

      Something went totally awry in trying to post my additional comment ten minutes ago. A mathematical formula was strangely distorted in the preview window and all subsequent text was overstruck. Please post anyway and I’ll make necessary corrections subsequently.

      • The correct reading of the beginning of the second paragraph is:

        But if we take the temporal average of M(t) = s(t) + n(t), the contribution …

        All of the subsequent overstriking should be ignored. To clarify the second paragraph further, append the sentence:

        Such aggregate averaging of even a handful of station records invariably improves the estimate of the homogeneous anomalies of s(t) at all available times t.

  76. POSTSCRIPT:

    It is unfortunate that despite over 400 comments back and forth, with some engineers and scientists siding with my viewpoint and others siding with Nick Stokes and the StatsFolk, no resolution has been possible.

    This final exchange between myself and Nick Stokes, whom I greatly admire but do not agree with, illustrates the hard nut at the crux of the problem:

    • Kip Hansen
    Nick ==> Do you think there is anything you and I can agree on, on this very narrow specific point? If so, pass it by me.

    Nick Stokes
    Kip, I think no agreement is possible because you reject probability as a basis for quantifying uncertainty, and I insist there is nothing else. (see link for the long version of Nick’s reply.)

    Perhaps some other professional statisticians will weigh in with an alternate view to Nick’s from a statistical viewpoint.

    • Kip,

      I share your respect for Nick and I am impressed by his understanding of the probabilistic quantification of uncertainty (though I do not always share his use of terminology). In contrast, I am no statistician (as my previous fumbled posts ably demonstrate). Nevertheless, I do know that probability theory is most definitely not the only mathematical instrument available for the quantification of uncertainty. See, for example, Info-gap analysis and possibility theory. I could add Dempster-Shafer Theory (although this is strictly-speaking an extension of subjective probabilities) and fuzzy logic (although its advocates claim that it embraces probability theory). I would be interested to hear Nick’s views on this subject area. For example, the following paper is offered as an example of the approaches that are being developed in order to address uncertainties that are not amenable to probability theory:

      https://link.springer.com/article/10.1007/s11023-017-9428-3

      I don’t think the debate is settled by deciding between probabilistic and non-probabilistic approaches. Both are needed for a comprehensive treatment of uncertainty.

      • John ==> Thanks for the link — very interesting — it is fascinating to me that anyone would honestly think, in the real world, that the only way to think about, deal with, or quantify uncertainty is with probability theory. It shakes my faith in sanity and common sense.

      • John,
        “See, for example, Info-gap analysis and possibility theory”
        Info-gap analysis is more commonly called info-gap decision theory. It does not quantify uncertainty, but the costs (or benefits) of uncertainty. You can then rig up a what-if sequence to calculate some kind of worst case exposure, with no associated likelihood. Possibility theory is more like Kip’s interval notion, but with fractional values. Wiki describes it as an extension of fuzzy logic. But the important thing is that by itself, it is not useful. If all things are possible, we’ve learnt nothing. To make it useful, you need a second number, the necessity. A key quote from Wiki:
        “The intersection of the last two cases is nec(U) = 0 and pos(U) = 1, meaning that I believe nothing at all about U. Because it allows for indeterminacy like this, possibility theory relates to the graduation of a many-valued logic, such as intuitionistic logic, rather than the classical two-valued logic.”
        We’re not in Kansas any more.

        But the main thing is, they aren’t quantifying uncertainty, but something else. If you want to press the relevance here, you would have to show a scientific problem involving averaging on which it gave sensible results.

      • Nick,

        Okay, point taken Nick. To be precise, I should have said that Info-gap Decision Theory is a technique that models uncertainty in a non-probabilistic manner in order to determine a robust strategy. My point still stands, however, that probability theory is not the only game in town and there are circumstances (many of which are highly relevant to climate change) when such non-probabilistic techniques are more applicable for calculating how to proceed under uncertainty. It is not always possible (or, at least, it may sometimes be inadvisable) to model uncertainty using probability theory.

        As far as possibility theory is concerned, I am not sure what point you are trying to make with your wiki quote. At the end of the day, possibility theory is a non-probabilistic technique. It does not employ a probability density function (pdf) but a so-called possibility density function (πdf). Probability plays no role in the way in which uncertainty is modelled. Pointing out possibility theory’s kinship with fuzzy logic is simply to compare it to another non-probabilistic technique. Also, I presume you had meant to say that, by itself, ‘possibility’ is useless – not ‘possibility theory’.

        I think your claim that possibility theory ‘does not quantify uncertainty but something else’ depends upon whether you see confidence as the key indicator of uncertainty. In possibility theory, confidence in the predicate A, for the proposition ‘x is A’, is given by the difference between the possibility of A and the possibility of the complement of A. Given the relationship between possibility and necessity, this works out as:

        Confidence(A) = Possibility(A) + Necessity(A) – 1

        I’m happy to read that as a quantification of uncertainty.
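The confidence formula can be tried on a small finite example. The sketch below assumes the usual finite definitions, where the possibility of an event is the largest possibility degree among its outcomes and necessity is one minus the possibility of the complement. The three-outcome distribution is made up for illustration.

```python
def possibility(dist, event):
    """Possibility of an event: the largest possibility degree of any outcome in it."""
    return max((dist[w] for w in event), default=0.0)

def necessity(dist, event):
    """Necessity of an event: one minus the possibility of its complement."""
    complement = set(dist) - set(event)
    return 1.0 - possibility(dist, complement)

def confidence(dist, event):
    """Confidence as given in the comment: Possibility(A) + Necessity(A) - 1."""
    return possibility(dist, event) + necessity(dist, event) - 1.0

# Hypothetical possibility distribution over three outcomes (largest degree is 1).
dist = {"low": 0.2, "medium": 1.0, "high": 0.6}
event = {"medium", "high"}

print(possibility(dist, event), necessity(dist, event), confidence(dist, event))
```

If every outcome is given degree 1, the same functions return possibility 1, necessity 0 and confidence 0 for any proper, non-empty event, which matches the total-ignorance case in the Wiki passage quoted earlier.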

        I wonder if your insistence that possibility theory would have to find application in ‘a scientific problem involving averaging on which it gave sensible results’ betrays a probabilistic bias in your definition of uncertainty. Nevertheless, I offer the following two links that I trust will satisfy your curiosity:
        https://link.springer.com/chapter/10.1007/3-540-34777-1_40?no-access=true

        http://home.iitk.ac.in/~partha/possibility

        The first links to a research paper that uses possibility theory to analyse uncertainties associated with parameter perturbation in climate modelling. The second link proposes applications within transport analysis.

        Best regards

      • John,
        Thanks for the links. They do seem to be trying to get to useful applications. I haven’t been able to get the full text of the first, but it seems to be using possibility language for a bayesian outcome. I’ll read the second more carefully, and keep trying to get the full text of the first.

        My Kansas comment referred to the suggestion that possibility only made sense in a multi-valued logic. That’s a big switch for ordinary thinking about uncertainty.

  77. Kip,

    You say that it shakes your faith in sanity and common sense but, to be fair, before probability theory came along, no-one was thinking methodically about uncertainty at all. Since then it has enjoyed enormous success, to the extent that many would have their faith in sanity and common sense shaken to hear you express your views :-)

    Nevertheless, despite probability theory’s success, practitioners and philosophers alike are still unsure just what probability is! Thankfully, a more mature view of uncertainty is fast emerging. It’s just a shame that the revolution hasn’t reached the IPCC yet. Here’s another link that I think you will love:

    https://link.springer.com/article/10.1007%2Fs10670-013-9518-4

    • John ==> There is nothing wrong with probability theory that applying it only where it is correctly applicable (even under its own rules) doesn’t solve….it is not a universal panacea.

      It is this: “and I insist there is nothing else. ” [but probability theory — to deal with original measurement uncertainty] that gives me the intellectual heebie-jeebies.
