What do you mean by “mean”: an essay on black boxes, emulators, and uncertainty

Guest post by Richard Booth, Ph.D.

References:

[1] https://wattsupwiththat.com/2019/09/07/propagation-of-error-and-the-reliability-of-global-air-temperature-projections-mark-ii/

[2] https://wattsupwiththat.com/2019/10/15/why-roy-spencers-criticism-is-wrong  

A. Introduction

I suspect that we can all remember childish arguments of the form “person A: what do you mean by x, B: oh I really mean y, A: but what does y really mean, B: oh put in other words it means z, A: really (?), but in any case what do you even mean by mean?”  Then in adult discussion there can be so much interpretation of words and occasional, sometimes innocent, misdirection, that it is hard to draw a sound conclusion.  And where statistics are involved, it is not just “what do you mean by mean” (arithmetic? geometric? root mean square?) but “what do you mean by error”, “what do you mean by uncertainty”, etc., etc.?

Starting in the late summer of 2019 there were several WUWT postings on the subject of Dr. Pat Frank’s paper [1], and they often seemed to get bogged down in these questions of meaning and understanding.  A good deal of progress was made, but some arguments were left unresolved, so in this essay I revisit some of the themes which emerged.  Here is a list of sections:

B. Black Box and Emulator Theory – 2.3 pages (A4)

C. Plausibility of New Parameters – 0.6 pages

D. Emulator Parameters – 1.5 pages

E. Error and Uncertainty – 3.2 pages

F. Uniform Uncertainty (compared to Trapezium Uncertainty) – 2.5 pages

G. Further Examples – 1.2 pages

H. The Implications for Pat Frank’s Paper – 0.6 pages

I. The Implications for GCMs – 0.5 pages

Some of those sections are quite long, but each has a summary at its end, to help readers who are short of time and/or do not wish to wade through a deal of mathematics.  The length is unfortunately necessary to develop interesting mathematics around emulators and errors and uncertainty, whilst including examples which may shed some light on the concepts.  There is enough complication in the theory that I cannot guarantee that there isn’t the odd mistake.  When referring to [1] or [2] including their comments sections, I shall refer to Dr. Frank and Dr. Roy Spencer by name, but put the various commenters under the label “Commenters”.

I am choosing this opportunity to “come out” from behind my blog name “See – Owe to Rich”.  I am Richard Booth, Ph.D., and author of “On the influence of solar cycle lengths and carbon dioxide on global temperatures”.  Published in 2018 by the Journal of Atmospheric and Solar-Terrestrial Physics (JASTP), it is a rare example of a peer-reviewed connection between solar variations and climate which is founded on solid statistics, and is available at https://doi.org/10.1016/j.jastp.2018.01.026 (paywalled)  or in publicly accessible pre-print form at https://github.com/rjbooth88/hello-climate/files/1835197/s-co2-paper-correct.docx . I retired in 2019 from the British Civil Service, and though I wasn’t working on climate science there, I decided in 2007 that as I had lukewarmer/sceptical views which were against the official government policy, alas sustained through several administrations, I should use the pseudonym on climate blogs whilst I was still in employment.

B. Black Box and Emulator Theory

Consider a general “black box”, which has been designed to estimate some quantity of interest in the past, and to predict its value in the future.  Consider also an “emulator”, which is an attempt to provide a simpler estimate of the past black box values and to predict the black box output into the future.  Last, but not least, consider reality, the actual value of the quantity of interest.

Each of these three entities,

  •  black box
  • emulator
  • reality

can be modelled as a time series with a statistical distribution.  They are all numerical quantities (possibly multivariate) with uncertainty surrounding them, and the only successful mathematics which has been devised for analysis of such is probability and statistics.  It may be objected that reality is not statistical, because it has a particular measured value.  But that is only true after the fact, or as they say in the trade, a posteriori.  Beforehand, a priori, reality is a statistical distribution of a random variable, whether the quantity be the landing face of the die I am about to throw or the global HadCRUT4 anomaly averaged across 2020.

It may also be objected that many black boxes, for example Global Circulation Models, are not statistical, because they follow a time evolution with deterministic physical equations.  Nevertheless, the evolution depends on the initial state, and because climate is famously “chaotic”, tiny perturbations to that state lead to sizeable divergence later.  The chaotic system tends to revolve around a small number of attractors, and the breadth of orbits around each attractor can be studied by computer and matched to statistical distributions.

The most important parameters associated with a probability distribution of a continuous real variable are the mean (measure of location) and the standard deviation (measure of dispersion).  So across the 3 entities there are 6 important parameters; I shall use E[] to denote expectation or mean value, and Var[] to denote variance which is squared standard deviation.  What relationships between these 6 allow the defensible (one cannot assert “valid”) conclusion that the black box is “good”, or that the emulator is “good”? 

In general, since the purpose of an emulator is to emulate, it should do that with as high a fidelity as possible.  So for an emulator to be good, it should, like the Turing Test of whether a computer is a good emulator of a human, be able to reproduce the spread/deviation/range of the black box output as well as its mean/average component.  Ideally one would not be able to tell the output of one from that of the other.

To make things more concrete, I shall assume that the entities are each a uniform discrete time series, in other words a set of values evenly spaced across time with a given interval, such as a day, a month, or a year.  Let:

  X(t) be the random variable for reality at integer time t;

  M(t) be the random variable for the black box Model;

  W(t) be the random variable for some emulator (White box) of the black box

  Ri(t) be the random variable for some contributor to an entity, possibly an error term.

 Now choose a concrete time evolution of W(t) which does have some generality:

  (1)  W(t) = (1-a)W(t-1) + R1(t) + R2(t) + R3(t), where 0 ≤ a ≤ 1

The reason for the 3 R terms will become apparent in a moment.  First note that the new value W(t) is partly dependent on the old one W(t-1) and partly on the random Ri(t) terms.  If a=0 then there is no decay, and a putative flap of a butterfly’s wings contributing to W(t-1) carries on undiminished in perpetuity.  In Section C I describe how the decaying case a>0 is plausible.

R1(t) is to be the component which represents changes in major causal influences, such as the sun and carbon dioxide.  R2(t) is to be a component which represents a strong contribution with observably high variance, for example the Longwave Cloud Forcing (LCF).  Some emulators might ignore this, but it could have a serious impact on how accurately the emulator follows the black box.  R3(t) is a putative component which is negatively correlated with R2(t) with coefficient -r, with the potential (dependent on exact parameters) to mitigate the high variance of R2(t).  We shall call R3(t) the “mystery component”, and its inclusion is justified in Section C.

Equation (1) can be “solved”, i.e. the recursion removed, but first we need to specify time limits.  We assume that the black box was run and calibrated against data from time 0 to the present time P, and then we are interested in future times P+1, P+2,… up to F. The solution to Equation (1) is

  (2)  W(t) = ∑_{i=0}^{t-1} (1-a)^i (R1(t-i) + R2(t-i) + R3(t-i)) + (1-a)^t W(0)

The expectation of W(t) depends on the expectations of each Rk(t), and to make further analytical progress we need to make assumptions about these.  Specifically, assume that

  (3)  E[R1(t)] = bt + c, E[R2(t)] = d, E[R3(t)] = 0

Then a modicum of algebra derives

  (4)  E[W(t)] = b(at + a - 1 + (1-a)^(t+1))/a^2 + (c+d)(1 - (1-a)^t)/a + (1-a)^t W(0)

In the limit as a tends to 0, we get the special case

  (5)  E[W(t)] = bt(t+1)/2 + (c+d)t + W(0)

Next we consider variance, with the following assumptions:

  (6)  Var[Rk(t)] = sk^2, Cov[R2(t),R3(t)] = -r s2 s3, and all other covariances, within or across time, are 0, so
  (7)  Var[W(t)] = (s1^2 + s2^2 + s3^2 - 2r s2 s3)(1 - (1-a)^(2t))/(2a - a^2)

and as a tends to zero the final factor (1 - (1-a)^(2t))/(2a - a^2) tends to t (implying that the variance increases linearly with t).
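As a concreteness check, here is a minimal Python sketch (the parameter values are my own illustrative choices, not anyone's emulator) which simulates Equation (1) many times and compares the sample mean and variance of W(t) with Equations (4) and (7).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (my own choices, not from the essay)
a, b, c, d = 0.2, 0.05, 0.1, 0.3     # decay and mean parameters
s1, s2, s3, r = 0.5, 1.0, 1.0, 0.95  # standard deviations and anti-correlation
W0, T, runs = 0.0, 50, 20000

# Simulate Equation (1): W(t) = (1-a)W(t-1) + R1(t) + R2(t) + R3(t)
W = np.full(runs, W0)
for t in range(1, T + 1):
    R1 = rng.normal(b * t + c, s1, runs)
    # R2 and R3 jointly normal with correlation -r
    z1 = rng.normal(size=runs)
    z2 = rng.normal(size=runs)
    R2 = d + s2 * z1
    R3 = s3 * (-r * z1 + np.sqrt(1 - r**2) * z2)
    W = (1 - a) * W + R1 + R2 + R3

# Theoretical mean and variance at t = T, Equations (4) and (7)
q = 1 - a
mean_theory = (b * (a * T + a - 1 + q**(T + 1)) / a**2
               + (c + d) * (1 - q**T) / a + q**T * W0)
var_theory = (s1**2 + s2**2 + s3**2 - 2 * r * s2 * s3) * (1 - q**(2 * T)) / (2 * a - a**2)

print("mean: simulated %.3f, Equation (4) %.3f" % (W.mean(), mean_theory))
print("var : simulated %.3f, Equation (7) %.3f" % (W.var(), var_theory))
```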

Summary of section B:

  • A good emulator can mimic the output of the black box.
  • A fairly general iterative emulator model (1) is presented.
  • Formulae are given for expectation and variance of the emulator as a function of time t and various parameters.
  • The 2 extra parameters, a, and R3(t), over and above those of Pat Frank’s emulator, can make a huge difference to the evolution.
  • The “magic” component R3(t) with anti-correlation -r to R2(t) can greatly reduce model error variance whilst retaining linear growth in the absence of decay.
  • Any decay rate a>0 completely changes the propagation of error variance from linear growth to convergence to a finite limit.

C. Plausibility of New Parameters

The decaying case a>0 may at first sight seem implausible.  But here is a way it could arise.  Postulate a model with 3 main variables, M(t) the temperature, F(t) the forcing, and H(t) the heat content of land and oceans.  Let

  M(t) = b + cF(t) + dH(t-1)

(Now by the Stefan-Boltzmann equation M should be related to F^(1/4), but locally it can be linearized by a binomial expansion.)  The theory here is that temperature is fed both by instantaneous radiative forcing F(t) and by previously stored heat H(t-1).  (After all, climate scientists are currently worrying about how much heat is going into the oceans.)  Next, the heat changes by an amount dependent on the change in temperature:

  H(t-1) = H(t-2) + e(M(t-1)-M(t-2)) = H(0) + e(M(t-1)-M(0))

Combining these two equations we get

  M(t) = b + cF(t) + d(H(0) + e(M(t-1)-M(0))) = f + cF(t) + (1-a)M(t-1)

where a = 1-de, f = b+dH(0)-deM(0).  This now has the same form as Equation (1); there may be some quibbles about it, but it shows a proof of concept of heat buffering leading to a decay parameter.
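For anyone who wants to see the substitution in action, here is a tiny numerical sketch with arbitrary made-up constants (not taken from any model): it iterates the temperature/heat pair and the combined recursion side by side and confirms they produce the same trajectory.

```python
import numpy as np

# Arbitrary illustrative constants (not from any model)
b, c, d, e = 1.0, 0.8, 0.05, 4.0
a = 1 - d * e                 # implied decay parameter
M0, H0 = 10.0, 50.0
f = b + d * H0 - d * e * M0   # implied constant in the combined recursion
F = 2.0 + 0.1 * np.arange(1, 21)   # made-up forcing series F(1..20)

# Version 1: explicit temperature/heat-content pair
M1, H = [M0], H0
for Ft in F:
    M1.append(b + c * Ft + d * H)
    H = H + e * (M1[-1] - M1[-2])   # heat changes with the temperature change

# Version 2: combined recursion M(t) = f + c F(t) + (1-a) M(t-1)
M2 = [M0]
for Ft in F:
    M2.append(f + c * Ft + (1 - a) * M2[-1])

print("max difference between the two formulations:",
      max(abs(x - y) for x, y in zip(M1, M2)))
```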

For the anti-correlated R3(t), consider reference [2]. Roy Spencer, who has serious scientific credentials, had written “CMIP5 models do NOT have significant global energy imbalances causing spurious temperature trends because any model systematic biases in (say) clouds are cancelled out by other model biases”.  This means that in order to maintain approximate Top Of Atmosphere (TOA) radiative balance, some approximate cancellation is forced, which is equivalent to there being an R3(t) with high anti-correlation to R2(t).  The scientific implications of this are discussed further in Section I.

Summary of Section C:

  • A decay parameter is justified by a heat reservoir.
  • Anti-correlation is justified by GCMs’ deliberate balancing of TOA radiation.

D. Emulator Parameters

Dr. Pat Frank’s emulator falls within the general model above.  The constants from his paper, 33K, 0.42, 33.3 Wm^-2, and +/-4 Wm^-2, the latter being from errors in LCF, combine to give 33*0.42/33.3 = 0.416 and 0.416*4 = 1.664, which are used here.  So we can choose a = 0, b = 0, c+d = 0.416 F(t) where F(t) is the new GHG forcing (Wm^-2) in period t, s1 = 0, s2 = 1.664, s3 = 0, and then derive

  (8)  W(t) = (c+d)t + W(0) +/- sqrt(t) s2

(I defer discussion of the meaning of the +/- sqrt(t) s2, be it uncertainty or error or something else, to Section E.  Note that F(t) has to be constant to directly use the theory here.)

But by using more general parameters it is possible to get a smaller value of the +/- term.  There are two main ways to do this – by covariance or by decay, each separately justified in Section C.

In the covariance case, choose s3 = s2 and r = 0.95 (say).  Then in this high anti-correlation case, still with a = 0, Equation (7) gives

  (9)  Var[W(t)] = 0.1 s2^2 t  (instead of s2^2 t)

In the case of decay but no anti-correlation, a > 0 and s3 = 0 (so R3(t) = 0 with probability 1).  Now, as t gets large, we have

  (10)  Var[W(t)] = (s1^2 + s2^2)/(2a - a^2)

so the variance does not increase without limit as in the a =0 case.  But with a > 0, the mean also changes, and for large t Equation (4) implies it is

  (11)  E[W(t)] ~ bt/a + (b + c + d - b/a)/a

Now if we choose b = a(c+d) then that becomes (c+d)(t+1), which is fairly indistinguishable from the (c+d)t in Equation (8) derived from a=0, so we have derived a similar expectation but a smaller variance in Equation (10).
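The following sketch puts numbers on this comparison.  It uses Pat Frank's s2 = 1.664, but the constant forcing increment and the decay rate a = 0.2 are illustrative choices of mine; it evaluates the +/- one-standard-deviation term for the plain a = 0 case (Equation (8)), the anti-correlated case (Equation (9)), and the decay case (Equation (10)).

```python
import numpy as np

s2 = 1.664            # from Pat Frank's LCF-derived term
cd = 0.416 * 0.5      # (c+d): illustrative constant forcing of 0.5 Wm^-2 per period
W0 = 0.0
t = np.array([1, 10, 40, 80])

# Case A: a = 0, no R3 (Equation (8)): sd grows like sqrt(t)
sd_A = np.sqrt(t) * s2

# Case B: a = 0, s3 = s2, r = 0.95 (Equation (9)): variance 0.1 s2^2 t
sd_B = np.sqrt(0.1 * s2**2 * t)

# Case C: decay a = 0.2, s3 = 0 (exact finite-t form of Equation (10))
a = 0.2
sd_C = np.sqrt(s2**2 * (1 - (1 - a)**(2 * t)) / (2 * a - a**2))

mean_A = cd * t + W0               # Equation (8) mean
for i, ti in enumerate(t):
    print(f"t={ti:3d}  mean~{mean_A[i]:5.2f}K  "
          f"sd: a=0 {sd_A[i]:5.2f}  anticorr {sd_B[i]:5.2f}  decay {sd_C[i]:5.2f}")
```

At t = 80 the a = 0 case gives roughly +/-15, whereas the other two parameterisations give far smaller spreads for an almost identical mean.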

To streamline the notation, now let the parameters a, b, c, d, r be placed in a vector u, and let

  (12)  E[W(t)] = m_w(t;u),  Var[W(t)] = s_w^2(t;u)

(I am using a subscript ‘w’ for statistics relating to W(t), and ‘m’ for those relating to M(t).)  With 4 parameters (a, b, c+d, r) to set here, how should we choose the “best”?  Well, comparisons of W(t) with M(t) and X(t) can be made, the latter just in the calibration period t = 1 to t = P.  The nature of comparisons depends on whether or not just one, or many, observations of the series M(t) are available.

Case 1: Many series

With a deterministic black box, many observed series can be created if small perturbations are made to initial conditions and if the evolution of the black box output is mathematically chaotic.  In this case, a mean m_m(t) and a standard deviation s_m(t) can be derived from the many series.  Then curve fitting can be applied to m_w(t;u) - m_m(t) and s_w(t;u) - s_m(t) by varying u.  Something like Akaike’s Information Criterion (AIC) might be used for comparing competing models.  But in any case it should be easy to notice whether s_m(t) grows like sqrt(t), as in the a=0 case, or tends to a limit, as in the a>0 case.
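A sketch of how such a fit might be mechanised, using synthetic data in place of real black box runs (the "black box" below is simply the emulator model itself with hidden parameters, and the scipy optimiser is just one convenient choice):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
T, runs = 60, 500

def simulate(a, cd, s, W0=0.0):
    """Many runs of a simplified Equation (1): W(t) = (1-a)W(t-1) + cd + noise(sd=s)."""
    W = np.full(runs, W0)
    out = np.empty((T, runs))
    for t in range(T):
        W = (1 - a) * W + cd + rng.normal(0, s, runs)
        out[t] = W
    return out

# Pretend black-box ensemble with hidden parameters
ens = simulate(a=0.15, cd=0.3, s=1.2)
mm, sm = ens.mean(axis=1), ens.std(axis=1)      # m_m(t), s_m(t)

# Emulator mean/sd as functions of t and u = (a, cd, s), from Equations (4) and (7) with b = 0
t = np.arange(1, T + 1)
def emu_stats(u):
    a, cd, s = u
    mw = cd * (1 - (1 - a)**t) / a
    sw = np.sqrt(s**2 * (1 - (1 - a)**(2 * t)) / (2 * a - a**2))
    return mw, sw

def loss(u):
    if not (0 < u[0] < 1 and u[2] > 0):
        return 1e9
    mw, sw = emu_stats(u)
    return np.sum((mw - mm)**2) + np.sum((sw - sm)**2)

fit = minimize(loss, x0=[0.5, 0.1, 1.0], method="Nelder-Mead")
print("fitted (a, c+d, s):", np.round(fit.x, 3), " hidden values: [0.15 0.3 1.2]")
```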

Case 2: One series

If chaotic evolution is not sufficient to randomize the black box, or if the black box owner cannot be persuaded to generate multiple series, there may be only one observed series m(t) of the random variable M(t).  In this case Var[M(t)] cannot be estimated unless some functional form, such as g+ht, is assumed for m_m(t), when (m(t)-g-ht)^2 becomes a single observation estimate of Var[M(t)] for each t, allowing an assumed constant variance to be estimated.  So some progress in fitting W(t;u) to m(t) may still be possible in this case.

Pat Frank’s paper effectively uses a particular W(t;u) (see Equation (8) above) which has fitted m_w(t;u) to m_m(t), but ignores the variance comparison.  That is, s2 in (8) was chosen from an error term from LCF without regard to the actual variance of the black box output M(t).

Summary of section D:

  • Pat Frank’s emulator model is a special case of the models presented in Section B, where error variance is given by Equation (7).
  • More general parameters can lead to lower propagation of error variance over time (or indeed, higher).
  • Fitting emulator mean to black box mean does not discriminate between emulators with differing error variances.
  • Comparison of emulator to randomized black box runs can achieve this discrimination.

E. Error and Uncertainty

In the sections above I have made scant reference to “uncertainty”, and a lot to probability theory and error distributions.  Some previous Commenters repeated the mantra “error is not uncertainty”, and this section addresses that question.  Pat Frank and others referred to the following “bible” for measurement uncertainty

https://www.bipm.org/utils/common/documents/jcgm/JCGM_100_2008_E.pdf .  That document is replete with references to probability theory.  It defines measurement uncertainty as a parameter which is associated with the result of a measurement and that characterizes the dispersion of the values that could reasonably be attributed to the measurand.  It acknowledges that the dispersion might be described in different ways, but gives standard deviations and confidence intervals as principal examples.  The document also says that that definition is not inconsistent with two other definitions of uncertainty, which include the difference between the measurement and the true value.

Here I explain why they might be thought consistent, using my notation above.  Let M be the measurement, and X again be the true value to infinite precision (OK, perhaps only to within Heisenberg quantum uncertainty.)  Then the JCGM’s main definition is a parameter associated with the statistical distribution of M alone, generally called “precision”, whereas the other two definitions are respectively a function of M-X and a very high confidence interval for X.  Both of those include X, and are predicated on what is known as the “accuracy” of the measurement of M.  (The JCGM says this is unknowable, but does not consider the possibility of a different and highly accurate measurement of X.)  Now, M-X is just a shift of M by a constant, so the dispersion of M around its mean is the same as the dispersion of M-X around its mean.  So provided that uncertainty describes dispersion (most simply measured by variance) and not location, they are indeed the same.  And importantly, the statistical theory for compounding variance is the same in each case.

Where does this leave us with respect to error versus uncertainty?  Assuming that X is a single fixed value, then prior to measurement, M-X is a random variable representing the error, with some probability distribution having mean m_m - X and standard deviation s_m.  b = m_m - X is known as the bias of the measurement, and +/-s_m is described by the JCGM 2.3.1 as the “standard” uncertainty parameter.  So standard uncertainty is just the s.d. of error, and more general uncertainty is a more general description of the error distribution relative to its mean.

There are two ways of finding out about sm: by statistical analysis of multiple measurements (if possible) or by appealing to an oracle, such as the manufacturer of the measurement device, who might supply information over and beyond the standard deviation.  In both cases the output resolution of the device may have some bearing on the matter. 

However, low uncertainty is not of much use if the bias is large.  The real error statistic of interest is E[(M-X)^2] = E[((M-m_m)+(m_m-X))^2] = Var[M] + b^2, covering both a precision component and an accuracy component.
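A one-minute numerical illustration of that decomposition, with made-up bias and precision values:

```python
import numpy as np

rng = np.random.default_rng(2)
X = 20.0                      # true value (fixed)
bias, sd = 0.3, 0.5           # made-up instrument bias and precision
M = rng.normal(X + bias, sd, 1_000_000)   # many hypothetical measurements

mse = np.mean((M - X)**2)
print("E[(M-X)^2]      :", round(mse, 4))
print("Var[M] + bias^2 :", round(M.var() + (M.mean() - X)**2, 4))
```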

Sometimes the uncertainty/error in a measurement is not of great consequence per se, but feeds into a parameter of a mathematical model and thence into the output of that model.  This is the case with LCF feeding into radiative forcings in GCMs and then into temperature, and likewise with Pat Frank’s emulator of them.  But the theory of converting variances and covariances of input parameter errors into output error via differentiation is well established, and is given in Equation (13) of the JCGM.

To illuminate the above, we now turn to some examples, principally provided by Pat Frank and Commenters.

Example 1: The 1-foot end-to-end ruler

In this example we are given a 1-foot ruler with no gaps at the ends and no markings, and the manufacturer assures us that the true length is 12”+/-e”; originally e = 1 was chosen, but as that seems ridiculously large I shall choose e = 0.1 here.  So the end-to-end length of the ruler is in error by up to 0.1” either way, and furthermore the manufacturer assures us that any error in that interval is equally likely. I shall repeat a notation I introduced in an earlier blog comment, which is to write 12+/-_0.1 for this case, where the _ denotes a uniform probability distribution, instead of a single standard deviation for +/-.  (The standard deviation for a random variable uniform in [-a,a] is a/sqrt(3) = 0.577a, so b +/-_ a and b +/- 0.577a are loosely equivalent, except that the implicit distributions are different.  This is covered in the JCGM, where “rectangular” is used in place of “uniform”.)

Now, I want to build a model train table 10 feet long, to as high an accuracy as my budget and skill allow.  If I have only 1 ruler, it is hard to see how I can do better than get a table which is 120+/-_1.0”.  But if I buy 10 rulers (9 rulers and 1 ruler to rule them all would be apt if one of them was assured of accuracy to within a thousandth of an inch!), and I am assured by the manufacturer that they were independently machined, then by the rule of addition of independent variances, the uncertainty in the sum of the lengths is sqrt(10) times the uncertainty of each.

So using all 10 rulers placed end to end, the expected length is 120” and the standard deviation (uncertainty) gets multiplied by sqrt(10) instead of 10 for the single ruler case, an improvement by a factor of 3.16.  The value for the s.d. is 0.0577 sqrt(10) = 0.183”.   

To get the exact uncertainty distribution we would have to do what is called convolving of distributions to find the distribution of ∑_{i=1}^{10} (X_i - 12).  It is not a uniform distribution, but looks a little like a normal distribution, as suggested by the Central Limit Theorem.  Its “support” is of course not infinite but the interval (-1”,+1”), though it tails off smoothly at the edges.  (In fact, recursion shows that the probability of it being less than (-1+x), for 0<x<0.2, is (5x)^10/10!   That ! is a factorial, and with -1+x = -0.8 it gives the small probability of 2.76e-7, a tiny chance of it being in the extreme 1/5 of the interval.)
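Readers can reproduce these numbers with a few lines of Monte Carlo (nothing here goes beyond what the text states):

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(3)
n_rulers, trials = 10, 200_000

errors = rng.uniform(-0.1, 0.1, size=(trials, n_rulers))
total_error = errors.sum(axis=1)                 # error in the 120" length

print("sd of total error:", round(total_error.std(), 4),
      ' (theory 0.0577*sqrt(10) = 0.1826")')

# P[total error < -1 + x] = (5x)^10 / 10!  for 0 < x < 0.2; e.g. x = 0.2
x = 0.2
print("P[total < -0.8]:", (5 * x)**10 / factorial(10), " ~ 2.76e-7")
```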

Now that seemed like a sensible use of the 10 rulers, but oddly enough it isn’t the best use.  Instead, sort them by length, and use the shortest and longest 5 times over.  We could do this for any number n of rulers bought, not just 10.  We know by symmetry that the shortest plus longest has a mean error of 0, but calculating the variance is more tricky.

The error of the ith shortest ruler, plus 0.1, times 5, say Yi, has a Beta distribution (range from 0 to 1) with parameters (i, n+1-i).  The variance of Yi is i(n+1-i)/((n+1)^2(n+2)), which can be found at https://en.wikipedia.org/wiki/Beta_distribution .  Now

  Var[Y1 + Yn] = 2(Var[Y1] + Cov[Y1,Yn]) by symmetry.

Unfortunately that Wikipedia page does not give that covariance, but I have derived this to be

  (13)  Cov[Yi,Yj] = i(n+1-j) / [(n+2)(n+1)^2] if i <= j, so
  (14)  Var[Y1 + Yn] = 2(n+1) / [(n+2)(n+1)^2] = 2 / [(n+2)(n+1)]

Using the two rulers 5 times multiplies the variance by 25, but removing the scaling of 5 in going from ruler to Yi cancels this.  So (14) is also the variance of the error of our final measurement.

Now take n = 10 and we get uncertainty = square root of variance = sqrt(2/132) = 0.123”, which is less than the 0.183” from using all 10 rulers.  But if we were lavish and bought 100 rulers, it would come down to sqrt(2/10302) = 0.014”.

Having discovered this trick, it would be tempting to extend it and use (Y1 + Y2 + Yn-1 + Yn)/2.  But this doesn’t help, as the variance for that is (5n+1)/[2(n+2)(n+1)^2], which is bigger than (14). 

I confess it surprised me that it is better to use the extremal rulers rather than the mean of them all.  But I tested the mathematics both by Monte Carlo and by comparing the variance of the sum of the n sorted rulers, calculated via (13), with that of the sum of n unsorted rulers, and for n=10 they agreed exactly.  I think the method is effective because the variance of the extremal rulers is small: their lengths bump up against the hard limits of the uniform distribution.
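For those who wish to repeat the check, here is a minimal Monte Carlo sketch of the shortest-plus-longest trick, compared against Equation (14):

```python
import numpy as np

rng = np.random.default_rng(4)
trials = 200_000

def extremal_variance(n):
    """Variance of the 10-foot measurement built from the shortest and
    longest of n rulers, each used 5 times (errors uniform on +/-0.1 inch)."""
    errors = rng.uniform(-0.1, 0.1, size=(trials, n))
    shortest = errors.min(axis=1)
    longest = errors.max(axis=1)
    total_error = 5 * (shortest + longest)
    return total_error.var()

for n in (10, 100):
    theory = 2 / ((n + 1) * (n + 2))     # Equation (14), already on the measurement scale
    print(f"n={n:3d}  Monte Carlo var {extremal_variance(n):.5f}"
          f"  theory {theory:.5f}  sd {np.sqrt(theory):.3f} inch")
```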

That inference is confirmed by Monte Carlo experiments with, in addition to the uniform, a triangular and a normal distribution for Yi, still wanting a total length of 10 rulers, but having acquired n=100 of them.  The triangular has the same range as the uniform, and half the variance, and the normal has the same variance as the uniform, implying that the endpoints of the uniform represent +/-sqrt(3) standard deviations for the normal, covering 92% of its distribution.

In the following table 3 subsets of the 100 are considered, pared down from a dozen or so experiments.  Each subset is optimal, within the experiments tried, for one or more distribution (starred).  A subset a,b,c,… means that the a-th shortest and longest rulers are used, and the b-th shortest and longest etc. The fraction following the distribution is the variance of a single sample.  The decimal values are variances of the total lengths of the selected rulers then scaled up to 10 rulers.

                              variances

dist \ subset      1           1,12,23,34,45    1,34
U(0,1)     1/12    0.00479*    0.0689           0.0449
N(0,1/12)  1/12    0.781       0.1028*          0.2384
T(0,1)     1/24    0.0531      0.0353           0.0328*

We see that by far the smallest variance, 0.00479, occurs if we are guaranteed a uniform distribution, by using a single extreme pair, but that strategy isn’t optimal for the other 2 distributions.  5 well-spaced pairs are best for the normal, and quite good for the triangular, though the latter is slightly better with 2 well-spaced pairs.
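A sketch of the kind of Monte Carlo behind the table (my own seeds and trial counts, so the values will differ slightly from those above); each subset is used symmetrically, rank r together with rank n+1-r, and each selected ruler is reused enough times to make up ten ruler-lengths, which is my reading of the construction described in the text.

```python
import numpy as np

rng = np.random.default_rng(5)
n, trials = 100, 100_000

def sample(dist):
    if dist == "U":   # uniform on (0,1), variance 1/12
        return rng.uniform(0, 1, size=(trials, n))
    if dist == "N":   # normal with the same variance 1/12
        return rng.normal(0.5, np.sqrt(1 / 12), size=(trials, n))
    if dist == "T":   # symmetric triangular on (0,1), variance 1/24
        return rng.triangular(0, 0.5, 1, size=(trials, n))

subsets = {"{1}": [1], "{1,12,23,34,45}": [1, 12, 23, 34, 45], "{1,34}": [1, 34]}

for dist in ("U", "N", "T"):
    sorted_rulers = np.sort(sample(dist), axis=1)
    row = []
    for name, ranks in subsets.items():
        idx = [r - 1 for r in ranks] + [n - r for r in ranks]   # rank r and its mirror
        weight = 10 / len(idx)                                  # scale up to 10 rulers
        total = weight * sorted_rulers[:, idx].sum(axis=1)
        row.append(f"{name} {total.var():.4f}")
    print(dist, " ".join(row))
```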

Unless the manufacturer can guarantee the shape of the error distribution, the assumption that it is uniform would be quite dangerous in terms of choosing a strategy for the use of the available rulers. 

Summary of Section E:

  • Uncertainty should properly be thought of as the dispersion of a distribution of random variables, possibly “hidden”, representing errors, even though that distribution might not be fully specified.
  • In the absence of clarification, a +/-u uncertainty value should be taken as one standard deviation of the error distribution.
  • The assumption, probably through ignorance, that +/-u represents a sharply bounded uniform (or “rectangular”) distribution, allows clever tricks to be played on sorted samples yielding implausibly small variances/uncertainties.
  • The very nature of errors being compounded from multiple sources supports the idea that a normal error distribution is a good approximation.

F. Uniform Uncertainty (compared to Trapezium Uncertainty)

As an interlude between examples, in this section we study further implications of a uniform uncertainty interval, most especially for a digital device.  By suitable scaling we can assume that the possible outputs are a complete range of integers, e.g. 0 to 1000.  We use Bayesian statistics to describe the problem.

Let X be a random variable for the true infinitely precise value which we attempt to measure.

Let x be the value of X actually occurring at some particular time.

Let M be our measurement, a random variable but including the possibility of zero variance.  Note that M is an integer.

Let D be the error, = M – X.

Let f(x) be a chosen (Bayesian) prior probability density function (p.d.f.) for X, P[X’=’x].

Let g(y;x) be a probability function (p.f.) for M over a range of integer y values, dependent on x, written g(y;x) = P[M=y | X’=’x]  (the PRECISION distribution).

Let c be a “constant” of proportionality, determined in each separate case by making relevant probabilities add up to 1.  Then after measurement M, the posterior probability for X taking the value z is, by Bayes’ Theorem,

  (15)  P[X’=’x | M=y]  =  P[M=y | X’=’x] P[X’=’x] / c = g(y;x) f(x) / c

Usually we will take f(x) = P[X ‘=’ x] to be an “uninformative” prior, i.e. uniform over a large range bound to contain x, so it has essentially no influence.  In this case,

  (16)  P[X’=’x | M=y] = g(y;x)/c, where c = ∫ g(y;x) dx   (the UNCERTAINTY distribution)

Then P[D=z | M=y] = P[X=M-z | M=y] = g(y;y-z)/c.  Now assume that g() is translation invariant, so g(y;y-z) = g(0;-z) =: c h(z) defines the function h(), and ∫ h(z) dz = 1.  Then

  (17)  P[D=z | M=y] = h(z), independent of y   (ERROR DISTRIBUTION = shifted u.d.)

In addition to this distribution of error given observation, we may also be interested in the distribution of error given the true (albeit unknown) value.  (It took me a long time to work out how to evaluate this.)

Let A be the event {D = z}, B be {M = y}, C be {X = x}.  These events have a causal linkage, which is that they can simultaneously occur if and only if z = y-x.  And when that equation holds, so z can be replaced by y-x, then given that one of the events holds, either both or none of the other two occur, and therefore they have equal probability.  It follows that:

  P[A|C] = P[B|C] = P[C|B]P[B]/P[C] = P[A|B]P[B]/P[C]

  (18)  P[D = z = y-x | X = x] = P[D = y-x | M = y] P[M = y] / P[X = x]

Of the 3 terms on the RHS, the first is h(y-x) from Equation (17), the third is f(x) from Equation (15), and the second is a new prior.  This prior must be closely related to f(), which we took to be uninformative, because M is an integer value near to X.  The upshot is that under these assumptions the LHS is proportional to h(y-x), so

  (19)  P[D = y-x | X = x] = h(y-x) / ∑_i h(i-x)

Let x’ be the nearest integer to x, and a = x’-x, lying in the interval [-1/2,1/2).  Then y-x = y+a-x’ = a+k where k is an integer.  Then the mean m and variance s2 of D given X=x are:

  (20)  m = ∑_k (a+k) h(a+k) / ∑_k h(a+k);   s^2 = ∑_k (a+k-m)^2 h(a+k) / ∑_k h(a+k)

A case of obvious interest would be an uncertainty interval which was +/-e uniform.  That would correspond to h(z) = 1/(2e) for b-e < z < b+e and 0 elsewhere, where b is the bias of the error.  We now evaluate the statistics for the case b = 0 and e ≤ 1.  The symmetry in a means that we need only consider a > 0.  -e < a+k < e implies that -e-a < k < e-a.  If e < ½ then there is an a slightly bigger than e such that no integer k is in the interval, which is impossible, so e is at least ½.  Since h(z) is constant over its range, in (20) cancellation allows us to replace h(a+k) with 1.

  (21)  If a < 1-e then only k=0 is possible, and m = a, s^2 = 0.
  (22)  If a > 1-e then k=-1 and k=0 are both possible, and m = a - ½, s^2 = ¼.

When s^2 is averaged over all a we get 2(e - ½)(¼) = (2e-1)/4.
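The same statistics can be evaluated numerically from Equation (20); the sketch below uses the illustrative value e = 0.75 and confirms Equations (21) and (22) and the (2e-1)/4 average.

```python
import numpy as np

def stats_given_x(a, h, k_range=range(-5, 6)):
    """Mean and variance of D = M - X given X = x, via Equation (20), with a = x' - x."""
    ks = np.array(list(k_range))
    w = np.array([h(a + k) for k in ks])
    w = w / w.sum()
    m = np.sum((a + ks) * w)
    s2 = np.sum((a + ks - m)**2 * w)
    return m, s2

e = 0.75
h_uniform = lambda z: 1.0 / (2 * e) if -e < z < e else 0.0

# Two illustrative values of a: below and above 1-e = 0.25 (Equations (21) and (22))
for a in (0.1, 0.4):
    print("a =", a, "-> (m, s^2) =",
          tuple(round(v, 3) for v in stats_given_x(a, h_uniform)))

grid = np.linspace(0, 0.4999, 2000)
avg = np.mean([stats_given_x(a, h_uniform)[1] for a in grid])
print("average s^2 over a:", round(avg, 4), " theory (2e-1)/4 =", (2 * e - 1) / 4)
```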

It is not plausible for e to be ½, for then s^2 would be 0 whatever the fractional part a of x was.  Since s^2 is the variance of M-X given X=x, that implies that M is completely determined by X.  That might sound reasonable, but in this example it means that as X changes from 314.499999 to 314.500000, M absolutely has to flip from 314 to 315, and that implies that the device, despite giving output resolution to an integer, actually has infinite precision, and is therefore not a real device.

For e > ½, s^2 is zero for a in an interval of width 2-2e, and non-zero in two intervals of total width 2e-1.  In these intervals for a (translating to x), it is non-deterministic as to whether the output M is 314, say, or 315.

In Equations (21) and (22) there is a disconcerting discontinuity in the expected error from 1-e at a = (1-e)- to (1/2-e) at a = (1-e)+.  This arises from the cliff edge in the uniform h(z).  More sophisticated functions h(z) do not exhibit this feature, such as a normal distribution, a triangle distribution, or a trapezium distribution such as:

  (23)  h(z) = { 2(z+3/4)  for -3/4 < z < -1/4
               { 1         for -1/4 < z < 1/4
               { 2(3/4-z)  for  1/4 < z < 3/4

For this example we find

  (24)  If 0 < a < 1/4, m = a and s^2 = 0;
        if 1/4 < a < 3/4, m = 1/2-a and s^2 = 4(a-1/4)(3/4-a) <= 1/4.

Note that the discontinuity previously noted does not occur here, as m is a continuous function of a even at a=1/4.  The averaged s^2 is 1/12, less than the 1/8 from the U[-3/4,3/4] distribution. 
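The corresponding numerical check for the trapezium density (23), evaluated through Equation (20):

```python
import numpy as np

def h_trap(z):
    """Trapezium density of Equation (23)."""
    if -0.75 < z < -0.25:
        return 2 * (z + 0.75)
    if -0.25 <= z <= 0.25:
        return 1.0
    if 0.25 < z < 0.75:
        return 2 * (0.75 - z)
    return 0.0

def s2_given_a(a, h):
    # Variance of D given X = x, Equation (20)
    ks = np.arange(-5, 6)
    w = np.array([h(a + k) for k in ks])
    w = w / w.sum()
    m = np.sum((a + ks) * w)
    return np.sum((a + ks - m)**2 * w)

grid = np.linspace(-0.5, 0.4999, 4000)
avg = np.mean([s2_given_a(a, h_trap) for a in grid])
print("averaged s^2 for the trapezium:", round(avg, 4),
      " (text: 1/12 =", round(1 / 12, 4), ")")
```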

All the above is for a device with a digital output, presumed to change slowly enough to be read reliably by a human.  In the case of an analogue device, like a mercury thermometer, then a human’s reading of the device provides an added error/uncertainty.  The human’s reading error is almost certainly not uniform (we can be more confident when the reading is close to a mark than when it is not), and in any case the sum of instrument and human error is almost certainly not uniform.

Summary of section F:

  • The PRECISION distribution, of an output given the true state, induces an ERROR distribution given some assumptions on translation invariance and flat priors.
  • The range of supported values of the error distribution must exceed the output resolution width, since otherwise infinite precision is implied.
  • Even when that criterion is satisfied, the assumption of a uniform ERROR distribution leads to a discontinuity in mean error as a function of the true value.
  • A corollary is that if your car reports ambient temperature to the nearest half degree, then sometimes, even in steady conditions, its error will exceed half a degree.

G. Further Examples

Example 2: the marked 1-foot ruler

In this variant, the rulers have markings and an indeterminate length at each end.  Now multiple rulers cannot usefully be laid end to end, and the human eye must be used to judge and mark the 12” positions.  This adds human error/uncertainty to the measurement process, which varies from human to human, and from day to day.  The question of how hard a human should try in order to avoid adding significant uncertainty is considered in the next example.

Example 3: Pat Frank’s Thermometer

Pat Frank introduced the interesting example of a classical liquid-in-glass (LiG) thermometer whose resolution is +/-0.25K.  He claimed that everything inside that half-degree interval was a uniform blur, but went on to explain that the uncertainty was due to at least 4 things: the thermometer capillary is not of uniform width; the inner surface of the glass is not perfectly smooth and uniform; the liquid inside is not of constant purity; and the entire thermometer body is not at constant temperature.  He did not include the fact that during calibration human error in reading the instrument may have been introduced.  So the summation of 5 or more errors implies (except in mathematically “pathological” cases) that the sum is not uniformly distributed.  In fact a normal distribution, perhaps truncated if huge errors with infinitesimal probability are unpalatable, makes much more sense.

The interesting question arises as to what the (hypothetical) manufacturers meant when they said the resolution was +/-0.25K.  Did they actually mean a 1-sigma, or perhaps a 2-sigma, interval?  For deciding how to read, record, and use the data from the instrument, that information is rather vital.

Pat went on to say that a temperature reading taken from that thermometer and written as, e.g., 25.1 C, is meaningless past the decimal point.  (He didn’t say, but presumably would consider 25.5 C to be meaningful, given the half-degree uncertainty interval.)  But this isn’t true; assuming that someone cares about the accuracy of the reading, it doesn’t help to compound instrumental error with deliberate human reading error.  Suppose that the quoted +/-0.25K actually corresponds to a 2-sigma bound, as the manufacturer wanted to give a reasonably firm limit; then the instrument’s error variance is ((1/2)(1/4))^2 = 1/64.  If t^2 is the error variance of the observer, then the final variance is 1/64 + t^2.

The observer should not aim for a ridiculously low t, even if achievable, and perhaps a high t is not so bad if the observations are not that important.  But beware: observations can increase in importance beyond the expectations of the observer.  For example we value temperature observations from 1870 because they tell us about the idyllic pre-industrial, pre-climate change, world!  In the present example, I would recommend trying for t^2 = 1/100, or as near as can be achieved within reason.  Note that if the observer can manage to read uniformly within +/-0.1 C, then that means t^2 = 1/300.  But if instead she reads to within +/-0.25, t^2 = 1/48 and the overall variance is multiplied by (1+64/48) = 7/3, i.e. the standard deviation is multiplied by sqrt(7/3) ~ 1.53, which is a significant impairment of precision. 
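To make the arithmetic explicit, a tiny sketch (using the 2-sigma reading of the manufacturer's +/-0.25K assumed above) comparing the two observer strategies:

```python
# Combined reading-error variance for the LiG thermometer example (illustrative arithmetic only)
instrument_var = (0.25 / 2) ** 2        # +/-0.25K taken as a 2-sigma bound -> sigma = 0.125, var = 1/64

for label, half_width in (("read to +/-0.1 C", 0.1), ("read to +/-0.25 C", 0.25)):
    observer_var = half_width**2 / 3    # uniform reading error on +/-half_width
    total = instrument_var + observer_var
    print(f"{label:18s} t^2 = 1/{round(1 / observer_var)}  "
          f"sd inflation factor = {(total / instrument_var) ** 0.5:.2f}")
```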

The moral is that it is vital to know what uncertainty variance the manufacturer really believes to be the case, that guidelines for observers should then be appropriately framed, and that sloppiness has consequences.

Summary of Section G:

  • Again, real life examples suggest the compounding of errors, leading to approximately normal distributions.
  • Given a reference uncertainty value from an analogue device, if the observer has the skill and time and inclination then she can reduce overall uncertainty by reading to a greater precision than the reference value.

H. The Implications for Pat Frank's Paper

The implication of Section B is that a good emulator can be run with pseudorandom numbers and give output which is similar to that of the black box.  The implication of Section E is that uncertainty analysis is really error analysis and good headway can be made by postulating the existence of hidden random variables through which statistics can be derived.  The implication of Section C is that many emulators of GCM outputs are possible, and just because a particular one seems to fit mean values quite well does not mean that the nature of its error propagation is correct.  The only way to arbitrate between emulators would be to carry out Monte Carlo experiments with the black boxes and the emulators.  This might be expensive, but assuming that emulators have any value at all, it would increase this value.

Frank’s emulator does visibly give a decent fit to the annual means of its target, but that isn’t sufficient evidence to assert that it is a good emulator.  Frank’s paper claims that GCM projections to 2100 have an uncertainty of +/- at least 15K.  Because, via Section E, uncertainty really means a measure of dispersion, this means that Equation (1) with the equivalent of Frank’s parameters, using many examples of 80-year runs, would show an envelope where a good proportion would reach +15K or more, and a good proportion would reach -15K or less, and a good proportion would not reach those bounds.  This is just the nature of random walks with square root of time evolution. 
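To visualise what a +/-15K dispersion would look like if it were realised as actual run-to-run spread, the sketch below treats Equation (8) as a random walk with Frank's s2 = 1.664 and counts how many 80-year runs stray beyond 15K of the mean; this illustrates the dispersion claim only, and is not a simulation of any GCM.

```python
import numpy as np

rng = np.random.default_rng(6)
s2, years, runs = 1.664, 80, 10_000

# Equation (8) with the +/- term realised as an independent error accumulating each year
final = rng.normal(0, s2, size=(runs, years)).sum(axis=1)   # departure from the mean trend

print("sd after 80 years:", round(final.std(), 2),
      " (sqrt(80)*1.664 =", round(np.sqrt(80) * s2, 2), ")")
print("fraction of runs ending beyond +/-15K of the mean:",
      round(float(np.mean(np.abs(final) > 15)), 3))
```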

But the GCM outputs represented by CMIP5 do not show this behaviour, even though, climate being chaotic, different initial conditions should lead to such variety.  Therefore Frank’s emulator is not objectively a good one.  And the reason is that, as mentioned in Section C, the GCMs have corrective mechanisms to cancel out TOA imbalances except for, presumably, those induced by the rather small increase of greenhouse gases from one iteration to the next.

However the real value in Frank’s paper is first the attention drawn to the relatively large annual errors in the radiation budget arising from long wave cloud forcing, and second the revelation through comments on it that GCMs have ways of systematically squashing these errors.

Summary of Section H:

  • Frank’s emulator is not good in regard to matching GCM output error distributions.
  • Frank’s paper has valuable data on LCF errors.
  • Thereby it has forced “GCM auto-correction” out of the woodwork.

I. The Implications for GCMs

The “systematic squashing” of the +/-4 W/m^2 annual error in LCF inside the GCMs is an issue of which I for one was unaware before Pat Frank’s paper. 

The implication of comments by Roy Spencer is that there really is something like a “magic” component R3(t) anti-correlated with R2(t), though the effect would be similar if it was anti-correlated with R2(t-1) instead, which might be plausible with a new time step doing some automatic correction of overshooting or undershooting on the old time step.  GCM experts would be able to confirm or deny that possibility.

In addition, there is the question of a decay rate a, so that only a proportion (1-a) of previous forcing carries into the next time step, as justified by the heat reservoir concept in Section C.  After all, GCMs presumably do try to model the transport of heat in ocean currents, with concomitant heat storage.

It is very disturbing that GCMs have to resort to error correction techniques to achieve approximate TOA balance.  The two advantages of doing so are that they are better able to model past temperatures, and that they do a good job in constraining the uncertainty of their output to the year 2100.  But the huge disadvantage is that it looks like a charlatan’s trick; where is the vaunted skill of these GCMs, compared with anyone picking their favourite number for climate sensitivity and drawing straight lines against log(CO2)?  In theory, an advantage of GCMs might be an ability to explain regional differences in warming.  But I have not seen any strong claims that that is so, with the current state of the science.

Summary of Section I:

  • Auto-correction of TOA radiative balance helps to keep GCMs within reasonable bounds.
  • Details of how this is done would be of great interest; the practice seems dubious at best because it highlights shortcomings in GCMs’ modelling of physical reality.


184 Comments
Beta Blocker
February 7, 2020 6:48 am

Given the many assumptions made in the GCMs about physical processes and their many interactions with each other, does there come a point where developing truly useful mathematical definitions for terms such as ‘range of error’, ‘uncertainty’, ‘error propagation’ becomes impossible, for all practical purposes?

Reply to  Beta Blocker
February 7, 2020 10:35 am

I think that physics depends on mathematics and we have to do the best we can with the maths available. Its utility, especially in regard to “practical purposes”, can always be debated.

RJB

JRF in Pensacola
February 7, 2020 6:56 am

Dr. Booth:
You’ve provided much to consider here. Some points I don’t understand:

“So standard uncertainty is just the s.d. of error, and more general uncertainty is a more general description of the error distribution relative to its mean.” What is the difference between “standard” and “general” uncertainty and is “standard” uncertainty so easily defined by the standard deviation of the error?

“In the absence of clarification, a +/-u uncertainty value should be taken as one standard deviation of the error distribution.” Why should we assume that “a +/-u uncertainty value should be taken as one standard deviation of the error distribution”?

I’m not challenging either of these, just looking for help understanding both.

Carlo, Monte
Reply to  JRF in Pensacola
February 7, 2020 8:16 am

He is confusing standard uncertainty (u) and expanded uncertainty (+/-U), as they are defined in the GUM. (u) is standard deviation, while (U) is (u) multiplied by a coverage factor.

JRF in Pensacola
Reply to  Carlo, Monte
February 7, 2020 9:25 am

Thanks for your expanded reply that follows. I responded more there.

Reply to  JRF in Pensacola
February 7, 2020 10:46 am

JRF: I am just interpreting the JCGM at that point, and “standard uncertainty” is a term they use for a +/-1 s.d. interval for the error. I use “general uncertainty” to include a better description of the error distribution.

If someone says “the uncertainty is +/-3 widgets”, then unless they are more specific the most reasonable assumption is that they are using standard uncertainty, which is a +/-1 sigma (s.d.) bound. Does that help?

RJB

JRF in Pensacola
Reply to  See - owe to Rich
February 7, 2020 12:31 pm

Thank you. Helpful but see my further reply to CM below about uncertainty and “ignorance”.

Rick C PE
Reply to  See - owe to Rich
February 7, 2020 3:29 pm

This is why the GUM specifies that measurement uncertainty statements include either not just the +/- U but also the standard uncertainty and the coverage factor (K). The following is an example provided by NIST.

ms = (100.02147 ± 0.00070) g, where the number following the symbol ± is the numerical value of an expanded uncertainty U = k*uc, with U determined from a combined standard uncertainty (i.e., estimated standard deviation) uc = 0.35 mg and a coverage factor k = 2. Since it can be assumed that the possible estimated values of the standard are approximately normally distributed with approximate standard deviation uc, the unknown value of the standard is believed to lie in the interval defined by U with a level of confidence of approximately 95%.

Carlo, Monte
Reply to  Rick C PE
February 8, 2020 11:01 am

Which goes to the subtitle of the document: “Guide to the expression of uncertainty in measurement”: it is a standard way of expressing uncertainty.

Kevin kilty
February 7, 2020 7:11 am

This is a lot to digest in any small amount of time, but isn’t one of the most important points, or even the most important point, summarized right here?

The real error statistic of interest is E[(M-X)^2] = E[((M-m_m)+(m_m-X))^2] = Var[M] + b^2, covering both a precision component and an accuracy component.

For climate models we do not have a credible estimate of the “uncertainty” of M-X, first, because important drivers of climate are involved and no matter how stable the governing differential equations, these will not damp away (they may also be misrepresented with a set of differential equations missing terms or with erroneous values of coefficients), and second, we don’t have a handle on “b”. I would hate to have our economic future decided by “b”.

February 7, 2020 7:48 am

Alice sparring with the caterpillar.
Words meant whatever he wanted.

February 7, 2020 7:59 am

If the GCMs do have a negatively correlated feedback parameter that constrains the models to match the past and not fly apart in the future… How would that parameter distinguish between natural and anthropogenic forcing?

As the natural forcing dwarfs the anthropogenic forcing (especially in the past) surely this negatively correlated feedback parameter must make the anthropogenic forcing irrelevant.

Unless the negatively correlated feedback parameter was very finely chosen.

Carlo, Monte
February 7, 2020 8:10 am

>In the absence of clarification, a +/-u uncertainty value should be
>taken as one standard deviation of the error distribution.

This is not how uncertainty is treated in the GUM, quoting:

“uncertainty (of measurement)
parameter, associated with the result of a measurement, that characterizes the
dispersion of the values that could reasonably be attributed to the measurand”

“2.3.1
standard uncertainty
uncertainty of the result of a measurement expressed as a standard deviation”

“2.3.4
combined standard uncertainty
… positive square root of a sum of [individual standard uncertainty] terms…”

“2.3.5
expanded uncertainty
quantity defining an interval about the result of a measurement that may be expected
to encompass a large fraction of the distribution of values that could reasonably be
attributed to the measurand”

Standard uncertainty has the symbol u [lowercase]; it is standard deviation, and does not have +/- attached to it.

Expanded uncertainty is standard uncertainty (u) multiplied by a coverage factor (k), and has the symbol U [uppercase]:

U = k * u

Because of the expansion by the statistical coverage, U is expressed as +/-[value].

The coverage factor is associated with Student’s t, and in practice is nearly always simply assumed to be k = 2 for “95% coverage” (in this respect, k = 3 would correspond to 99% coverage). However, even though it is standard practice to use k = 2, if the variability (i.e. sampling) distribution is not normal, which is often the case, the real coverage percentage cannot be assumed.

Reply to  Carlo, Monte
February 8, 2020 2:08 am

“expanded uncertainty quantity defining an interval about the result of a measurement”
What does “the result of a measurement” mean? Is this an expression differentiating between the taking of a measurement and the values obtained by taking the measurement, or something more exotic? If the former, why label it “the result of a measurement” rather than just “a measurement”?

“Standard uncertainty has the symbol u [lowercase]; it is standard deviation, and does not have +/- attached to it.”
Does that mean the value expressed in this “standard uncertainty” is twice the numerical +/- uncertainty?
If that question isn’t clear:
there has to be a measurement m to which the uncertainty, however it is expressed, is related.
While “standard uncertainty” has no +/- attached to it,
it must represent some range in which the ‘real’ value exists.
Is that range evenly distributed around the measurement m?
If so, then it could also be expressed as
m +/- 0.5u,
otherwise it has to actually be
m +/- (the value of u).
Which is correct?

Carlo, Monte
Reply to  AndyHce
February 8, 2020 4:40 pm

A result is a numerical value, at the end of a measurement procedure; the GUM gives guidance about performing a formal uncertainty analysis (UA) on the measurement procedure in order to quantify the uncertainty to be attached to the result. As is quite common in standards writing, the authors are strictly adhering to the terms defined in order to minimize misunderstandings, at the expense of more words. But to most people a “result of a measurement” is simply a measurement. The GUM identifies the measurement procedure as:

Y = f(X1, X2, X3, …), where the Xs might be multiple secondary measurements needed to obtain Y.

>Does that mean value expressed in this “standard uncertainty” is twice
>the numerical +/= uncertainty?

No, remember that the GUM is intended to be a standard way of expressing uncertainty; prior to its arrival, there were lots of different ways, each with their own terminology and largely incompatible with each other. Many of these would apply (+/-) to standard deviations, so confusion here is quite understandable.

An easy way to think of standard uncertainty is as the root-sum-square of individual uncertainty components (or sources of error), all expressed as standard deviations (s). The s values can come from statistical analysis, and/or from estimations of the ranges of errors along with their probability distributions. Then (for many measurement procedures):

u = sqrt[ s1^2 + s2^2 + s3^2 + … + sn^2]
U = k * u

>While“standard uncertainty” has no +/- attached to it,
>it must represent some range in which the ‘real’ value exists.

Keep in mind that this elusive animal may not even lie within Y +/- U, and the GUM does not require this. The terminology discussion is too long for here; studying the GUM terminology a bit should help.

>Is that range evenly distributed around the measurement m?
>If so, then it could also be expressed as
>m +/- 0.5u,
>otherwise it has to actually be
>m +/- (the value of u).
>Which is correct?

For me, I have come to view (+/-U) as the “fuzzyness” attached to measurement. The GUM does not require any particular probability distribution, these are aspects of an individual measurement procedure that are investigated as part of an uncertainty analysis.

An example UA: a liquid metal thermometer is calibrated at a range of temperatures in a bath by a cal lab to get a series of T_therm versus T_bath data points. Standard regression gives the calibration curve T_bath = a * T_therm + b, with lots of regression statistics, especially the standard deviation of T_bath as a function of T_therm. This is a GUM Type A standard uncertainty, s1.

There is also the thermometer scale, which is graduated in 0.5C increments, and the human task of reading the scale is an error source. For the UA, it is estimated that the temperature can be anywhere in an interval of +/-0.25C, with triangular probability distribution. Following the GUM, a Type B uncertainty is calculated from this interval, giving s2.

s1 and s2 are then combined to get the standard uncertainty u. Note that for this example, s2 could be much larger than s1, so that the task of reading the thermometer is the dominant error source. Also, note that s2 appears twice, once in the cal lab during calibration, and once again in use.
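For concreteness, a bare-bones numeric sketch of that combination; the s1 value below is an invented placeholder, and the Type B term uses the +/-0.25 C triangular interval mentioned above:

```python
import math

# Invented example values for the thermometer UA described above
s1 = 0.08                       # Type A: sd of T_bath about the calibration fit (C)
s2 = 0.25 / math.sqrt(6)        # Type B: +/-0.25 C interval, triangular distribution -> a/sqrt(6)
u = math.sqrt(s1**2 + s2**2)    # combined standard uncertainty
U = 2 * u                       # expanded uncertainty with coverage factor k = 2

print(f"s2 = {s2:.3f} C, combined u = {u:.3f} C, expanded U (k=2) = +/-{U:.3f} C")
```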

JRF in Pensacola
February 7, 2020 9:20 am

CM:
Correct me if I’m misunderstanding:
“uncertainty” is an undefined dispersion of values associated with a measurement.
“standard uncertainty” is the standard deviation (of error?) associated with a measurement (and is, therefore, a statistic?)
“expanded uncertainty” encompasses a large fraction of “uncertainty” (but not all?) but is multiplied by a selected k of interest if a normal distribution but k becomes unknown if the distribution is not normal (and, therefore “uncertainty” becomes truly uncertain).

So, “uncertainty” can be a statistic based on a standard deviation in a normal distribution and is only truly uncertain in a distribution that is not normal (?).

I was thinking that “uncertainty”, or “Uncertainty” was based on more than just statistical variation. But, I will say my confusion on this topic is quite high.

Carlo, Monte
Reply to  JRF in Pensacola
February 7, 2020 10:51 am

JRF:

Uncertainty analysis (UA) is a large topic and it takes time to get ones’ head wrapped around the GUM, my explanation was a very bare-bones. The GUM is written around the task of performing a formal UA for a given measurement, which can be thought of as some process that produces a numerical result, such as using a liquid thermometer to measure water temperature.

The metrological vocabulary used in the GUM Annex B came from the “International vocabulary of metrology — Basic and general concepts and associated terms (VIM)”, also produced by the JCGM.

Another quote from the GUM, which may help:

‘2.2.1 The word “uncertainty” means doubt, and thus in its broadest sense
“uncertainty of measurement” means doubt about the validity of the result of a
measurement. Because of the lack of different words for this general concept of
uncertainty and the specific quantities that provide quantitative measures of the
concept, for example, the standard deviation, it is necessary to use the word
“uncertainty” in these two different senses.’

>“uncertainty” is an undefined dispersion of values associated with a measurement.

Yes, but an uncertainty analysis is an attempt to quantify the dispersion. +/-U is a way of expressing the fuzziness associated with a numeric measurement.

>“standard uncertainty” is the standard deviation (of error?) associated
>with a measurement (and is, therefore, a statistic?)

Yes, the “standard” adjective refers to standard deviation, and u should be thought of as standard deviation. However, u values can come from many different sources, and the GUM has a lot of text for quantifying them.

In a UA, the combined standard uncertainty is calculated as the root-sum-square of the individual standard deviations.

An example: a thermometer calibration done by reading it while immersed in a fluid, over a range of temperatures. The calibration is then obtained as a linear regression X-Y fit of temperature versus temperature. Each individual temperature measurement has its own u, and additional uncertainty comes from the regression statistics, such as the standard deviation of the slope of the line. All of these are then used to get the combined uncertainty.

u is statistical, but not always. The GUM divides u as Type A or Type B:

‘2.3.2
Type A evaluation (of uncertainty)
method of evaluation of uncertainty by the statistical analysis of series
of observations’

‘2.3.3
Type B evaluation (of uncertainty)
method of evaluation of uncertainty by means other than the statistical
analysis of series of observations’

The distributions of “series of observations” can be normal, or non-normal (even unknown). A Type B uncertainty is typically a judgement that a measurement X can vary between X-a and X+a, i.e. an interval, with an assumed probability distribution. Going back to the thermometer, if it has 0.5 degree gradations, the Type B uncertainty associated with reading the scale could be expressed with a = 0.5 or a = 0.25 (a judgement call), and a uniform or triangular distribution (another judgement call). The GUM tells how to calculate a standard deviation from these estimates.

Another judgement call is what value of k to use; the common pitfall is to assume that k=2 automatically means 95% of all measurements will be within +/-U. This is only true if the distribution is normal. It is common for UAs done for laboratory accreditation purposes to be required to quote U with k=2, without regard to any real statistical distribution.

JRF in Pensacola
Reply to  Carlo, Monte
February 7, 2020 12:29 pm

CM:
Thank you for your extended comments! When Dr. Pat Frank posted his article about uncertainty (and I’m not sure whether to capitalize or not) I thought, “This is good, this is important”, and I continue to think that because, in my mind, he introduced (and others reiterated) the importance of incorporating “ignorance” into the discussion of models and analysis, “ignorance” being defined as that which we don’t know but which can be described, to some degree, mathematically. Dr. Booth’s article seemed to move away from “ignorance” and more to statistics (and I may be wrong about that, I definitely worry about my own ignorance!). Type A UA, as you describe it, is a statistical exercise (?) but you (or the citation) describe Type B as involving judgement calls. Are the judgement calls accounting for variation, “ignorance”, or both?

I’m almost harping on “ignorance” because I don’t think modelers (of anything) consider their “ignorance” enough when evaluating their models, particularly if the number of variables is high and even unknowns must be estimated.

Thanks, again!

JRF in Pensacola
Reply to  JRF in Pensacola
February 7, 2020 2:31 pm

CM, I appreciate your time and I won’t impose further. I’m digging into the GUM for further education!

Carlo, Monte
Reply to  JRF in Pensacola
February 8, 2020 4:44 pm

No problem, having done a number of formal UAs for work, I had to become familiar with the innards of the GUM on at least a basic level. I am by no means an expert, nor am I a statistician.

Reply to  JRF in Pensacola
February 7, 2020 11:36 am

It is a complicated issue. The GUM basically deals with measuring one thing with one device. If you assume that measurement errors are random, and you take a number of independent measurements (that is, for example, using different people), then the “true value” will be surrounded by small errors. If they are random, there will be as many short measurements as there are long measurements and they will have a “normal” distribution. The mean of that distribution will be the “true value”.

That doesn’t necessarily mean accurate, but the measurements of that device should be pretty repeatable. Accuracy and precision are a whole different subject in the GUM.

The GUM also doesn’t deal with how to handle trending temperature measurements from different locations. That is a whole different area of statistics.

February 7, 2020 9:30 am

The deliberate balancing of TOA radiation in the GCMs as discussed here raises a new point about using GCMs at all to diagnose the impact of past and future greenhouse gas emissions and concentrations. Now it seems even worse than we thought.

Reply to  David Dibbell
February 10, 2020 4:21 am

I realize this thread is getting stale, but nevertheless I’m replying to my own comment here, having concluded that no, the GCMs do not actively drive the calculations to achieve a prescribed TOA radiative balance. It kept recurring to me, “That can’t be right.”
But there is certainly tuning in the development of the model which drives the TOA balance toward more or less stable results over time. And there also is the necessity for a conservation-of-energy “fix” to counter any residual imbalance from the operation of the model itself. See the research article linked both here and farther below in another comment. The article addresses these considerations for the GFDL’s new climate model CM4.0.

https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2019MS001829

It remains plain to me that a GCM, even this new one, has nowhere near the resolving power to diagnose or project the temperature response to incremental forcing attributed to CO2, or from human causes in total.

February 7, 2020 10:21 am

Author’s comment: I have noticed that in translation from “Word” to “WordPress”, the section letters got turned into bullet points. Fortunately the section summaries can help people to keep track of the lettering of sections.

Richard Booth

Reply to  See - owe to Rich
February 7, 2020 3:40 pm

Rich,
So apt that WordPress created an error about an essay on error. Next, how do you quantify the WordPress error, to better cope with it in the future?
Jokes aside, yours is an important essay. Pat Frank and I have corresponded for years.
Both of us have experience with analytical chemistry. That is a discipline in which you live or die by your ability to both quantify analytical uncertainty limits and maintain your daily output inside those limits.
You arrive at conclusions. One of these suggests that it is not possible to calculate valid limits unless you have identified the existence of, plus the weight of, EVERY perturbing variable; and that you have the math and logic skills to process the weights into an acceptable summary form.
If you accept that proposition, it follows that you cannot assign valid, overall uncertainties to GCM outputs. Those who model GCMs must know this. I have to conclude that they have devised ways to quell their rebellious consciences and motor on, knowing they are spreading scientific porkies. Geoff S

Phoenix44
February 7, 2020 10:27 am

 “where is the vaunted skill of these GCMs, compared with anyone picking their favourite number for climate sensitivity and drawing straight lines against log(CO2)?”

Exactly. We can model future temperatures in exactly that way and use lots of different sensitivity figures to give us the range of possible temperatures. We do not need these huge models at all. But the modelers claim that the “actual” sensitivity is an emergent property of the models, which is nonsense for the reasons you mention – sensitivity to startup conditions (fundamentally unknowable for the model) and artificial limiting of the model in particular.

Unfortunately the sensitivity figure is vital to the claims about Climate Change and admitting it is unknown simply collapses the whole scare. And so we have the pretence that models that cannot tell us the figure can actually do so. But if they can, then we can predict future temperatures without the models. But we cannot, which proves the models are not able to tell us what we need to know. In other words, as long as we need the models, we should not use the models.

Editor
February 7, 2020 10:51 am

Dr. Booth, in response I offer this from Freeman Dyson, his discussion with Enrico Fermi:

“When I arrived in Fermi’s office, I handed the graphs to Fermi, but he hardly glanced at them. He invited me to sit down, and asked me in a friendly way about the health of my wife and our new-born baby son, now fifty years old. Then he delivered his verdict in a quiet, even voice.
 
“There are two ways of doing calculations in theoretical physics”, he said. “One way, and this is the way I prefer, is to have a clear physical picture of the process that you are calculating. The other way is to have a precise and self-consistent mathematical formalism. You have neither.”
 
I was slightly stunned, but ventured to ask him why he did not consider the pseudoscalar meson theory to be a self-consistent mathematical formalism. He replied,
 
“Quantum electrodynamics is a good theory because the forces are weak, and when the formalism is ambiguous we have a clear physical picture to guide us. With the pseudoscalar meson theory there is no physical picture, and the forces are so strong that nothing converges. To reach your calculated results, you had to introduce arbitrary cut-off procedures that are not based either on solid physics or on solid mathematics.”
 
In desperation I asked Fermi whether he was not impressed by the agreement between our calculated numbers and his measured numbers. He replied, “How many arbitrary parameters did you use for your calculations?” I thought for a moment about our cut-off procedures and said, “Four.” He said,
 
“I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”
 
With that, the conversation was over. I thanked Fermi for his time and trouble, and sadly took the next bus back to Ithaca to tell the bad news to the students.

I looked at your model. You have no “clear physical picture of the process that you are calculating”. You also have no “precise and self-consistent mathematical formalism”.

Instead, you have five arbitrary parameters. As a result, I fear that it is no surprise that you can make the elephant wiggle his trunk.

My best to you,

w.

Reply to  Willis Eschenbach
February 7, 2020 4:04 pm

Willis, by my model, I presume you mean my Equation (1). I have not in fact fitted anything to that, so have not tried to make the veritable elephant wiggle her trunk. The point about my model (1) is that it generalizes Pat Frank’s and, with some parameters, outputs pretty much the same mean values but leads to very different evolution of error/uncertainty over time (which may have some chance of emulating GCM errors).

And anyway, in Section C “Plausibility of New Parameters” I do describe a physical picture, namely storage of heat, which can lead to a decay parameter ‘a’ and bounded uncertainty.

Best to you too, and it has been good to see you writing more on WUWT again.
Rich.

Jeff Alberts
Reply to  See - owe to Rich
February 7, 2020 5:36 pm

“the veritable elephant wiggle her trunk”

It is a well-known fact that only elephants of the male sex (not gender) wiggle their trunks. The females are watching for the wiggle.

p.s. I made that up.

Reply to  See - owe to Rich
February 7, 2020 5:45 pm

See – owe to Rich February 7, 2020 at 4:04 pm

Willis, by my model, I presume you mean my Equation (1). I have not in fact fitted anything to that, so have not tried to make the veritable elephant wiggle her trunk.

Rich, thanks for your kind words. I had a gallbladder operation so I was recuperating for a bit, but I’m back to full strength. Well, at least something near full strength.

Regarding your post, I do mean Equation (1).

By my count, the free parameters are k, the “0” and the “2” in the ∑ limits, the ∑ variable “i”, the “11” in the middle, S, g, and the “9” in the final denominator.

That’s eight freely chosen or tuned parameters, and you are in the middle of an elephant circus …

Best regards,

w.

Reply to  Willis Eschenbach
February 8, 2020 1:36 am

Willis, ah, too many Equation (1)’s! The title of this posting, and the ensuing content, has nothing about the sun or carbon dioxide. It is about matters arising from Pat Frank’s very interesting paper of about 6 months ago.

The equation you quote is from my paper, which I was merely mentioning in my “coming out” speech as part of my credentials, not as subject matter here. Still, since you raise the point, I did try to write a WUWT posting on that paper in April 2018, but one of Anthony’s contacts dissuaded him from publishing. Who knows, it might even have been you.

Anyway, there are not 8 free continuous and effective parameters in that equation, there are 5, and only 4 if b_2 is set to 0 as commonly happens in the paper. To see this, consider the 4 free continuous parameters b_0, b_1, b_2, S. Once those are chosen, we can subtract out the terms with known values L(n-i) and C(n-g), leaving x = k+11(b_0+b_1+b_2)-S log2(C(9)). But that is only one free parameter: whatever you choose for the first 4, k can be chosen to give whatever value of x you want to use.

I didn’t explain that in the paper, because it is well known to statisticians as “confounding of parameters”. As for g, it is either 0 or 1, with pretty similar results, and isn’t a continuous parameter.
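A minimal numerical sketch of that confounding (the C(9) value and the target for x are invented purely for illustration): however b_0, b_1, b_2 and S are chosen, k can always be back-solved so that the constant part x comes out the same, so these terms contribute only one effective free parameter.

```python
import math

# Invented constant for illustration only (stands in for the known CO2 value C(9)).
log2_C9 = math.log2(340.0)

def x_value(k, b0, b1, b2, S):
    # The constant part discussed above: x = k + 11*(b0+b1+b2) - S*log2(C(9))
    return k + 11 * (b0 + b1 + b2) - S * log2_C9

x_target = 14.0   # arbitrary target value for x

# Two quite different choices of (b0, b1, b2, S) ...
for b0, b1, b2, S in [(0.1, 0.2, 0.0, 1.5), (-0.3, 0.05, 0.4, 2.8)]:
    # ... and in each case k is back-solved to hit the same x, so the four
    # parameters plus k act as a single effective free parameter here.
    k = x_target - 11 * (b0 + b1 + b2) + S * log2_C9
    print(b0, b1, b2, S, "->", round(x_value(k, b0, b1, b2, S), 6))
```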

Rich.

Reply to  See - owe to Rich
February 9, 2020 1:48 pm

Thanks, Rich. You say:

“Anyway, there are not 8 free continuous and effective parameters in that equation, there are 5, and only 4 if b_2 is set to 0 as commonly happens in the paper. To see this, consider the 4 free continuous parameters b_0, b_1, b_2, S. Once those are chosen, we can subtract out the terms with known values L(n-i) and C(n-g), leaving x = k+11(b_0+b_1+b_2)-S log2(C(9)). But that is only one free parameter: whatever you choose for the first 4, k can be chosen to give whatever value of x you want to use.”

True. Sorry, I missed that. However, that still leaves k, the “0” and the “2” in the ∑ limits, g, b_0, b_1, b_2, and the “9” in the final denominator.

My point is simple. Whether you have four or eight tunable parameters, I’d be absolutely shocked if you could NOT fit it to the data. That’s the point of what “Johnny” von Neumann said. With that many free parameters and a totally free choice of the equation, you can fit it to anything.

Finally, I’ve never understood how the length of a sunspot cycle could possibly affect global temperatures when the amplitude of said cycles doesn’t affect the temperatures. What is the possible physical connection between the two?

Best regards,

w.

Reply to  See - owe to Rich
February 10, 2020 2:46 am

Willis (Feb 9 1:48pm): I’ll reply with a specific point and a general point.

Specific: k – no, that is not extra because x = k+11(b_0+b_1+b_2)-S log2(C(9)) covers it. 0 and 2 in the limits – no, they are not extra because they merely serve to enumerate b_0, b_1, b_2 which are already in there. 9 – no that is not extra because, again, x covers it.

General: the theory of hypothesis testing in the General Linear Model. Von Neumann’s quip is amusing but actually unfair. We use mathematical statistics to determine whether a parameter has a significant effect – we don’t just look at the data, as often happens on blogs, and say “look at how well that fits”. Consider tide predictions, where I believe there are many more than 4 parameters in the models; presumably von Neumann would complain about those, and yet because of the wealth of data each one has a certain amount of value and a certain statistical significance.

With global temperatures averaged over 14 11-year intervals, there isn’t that much information relatively, so it is hard to find significant effects. The mathematical analysis in the paper shows that S is easily significant because of the overall upward trend, b_0 isn’t significant and gets dropped, b_1 is significant and gets retained, and b_2 is just outside significance and either gets dropped on strict criteria or retained on aesthetic criteria. In the strict case the free parameters are k, S and b_1 so those 3 parameters in any case meet von Neumann’s pedagogy.

Rich.

Reply to  Willis Eschenbach
February 10, 2020 10:13 am

See – owe to Rich February 10, 2020 at 2:46 am

Willis (Feb 9 1:48pm): I’ll reply with a specific point and a general point.

Specific: k – no, that is not extra because x = k+11(b_0+b_1+b_2)-S log2(C(9)) covers it. 0 and 2 in the limits – no, they are not extra because they merely serve to enumerate b_0, b_1, b_2 which are already in there. 9 – no that is not extra because, again, x covers it.

Thanks, Rich. The variables k, 11, and S are, as you pointed out, confounded parameters. However, they are only confounded if you specify the rest. But the ones that you specify include “9”. So that one is indeed one of the tunable parameters. And we have to include the confounded parameter (made up of k, 11, and S as you stated elsewhere). I included it as “k”, and although you can give it any name it is a tunable parameter.

So that still leaves what I’ll call C (the confounding of k, 11, and S), g, b_0, b_1, b_2, and the “9” in the final denominator. With six tunable parameters, how well your model fits the data is MEANINGLESS. Seriously. I know you did it with solar cycle lengths as input (but without any explanation how a cycle that lasts a year longer has some magical effect).

But I could do the same with say global population or the price of postage stamps or money spent on pets or a hundred other input variables.

So what?

Seriously, so what? I know this is hard to accept, just as it was hard for Freeman Dyson to accept. And I’m sorry to be the one to burst your bubble.

But if you can’t fit a simple temperature curve given the free choice of equation, variables, and six tunable parameters, you should hang up your tools and go home. It’s nothing more than a futile exercise in tuning.

General: the theory of hypothesis testing in the General Linear Model. Von Neumann’s quip is amusing but actually unfair. We use mathematical statistics to determine whether a parameter has a significant effect – we don’t just look at the data, as often happens on blogs, and say “look at how well that fits”. Consider tide predictions, where I believe there are many more than 4 parameters in the models; presumably von Neumann would complain about those, and yet because of the wealth of data each one has a certain amount of value and a certain statistical significance.

Let me start by saying that Neumann’s statement was not an “amusing quip”. We know that because hearing it caused one of the best scientists of the century, Freeman Dyson, to throw away a year’s work by him and his students. So no, you can’t pretend it’s just something funny that “Johnny” said. It is a crucial principle of model building.

Next, tides are something that I know a little about. I used to run a shipyard in the Solomon Islands. The Government there was the only source of tide tables at the time, and they didn’t get around to printing them until late in the year, September or so. As a result, I had to make my own. The only thing I had for data was a printed version of the tide tables for the previous year.

What I found out then was that for any location, the tides can be calculated as a combination of “tidal constituents” of varying periods. As you might imagine, the strongest tidal constituents are half-daily, daily, monthly, and yearly. These represent the rotations of the earth, sun, and moon. There’s a list of some 37 tidal constituents here, none of which are longer than a year.

But the reason Neumann wouldn’t object to them is that they are backed by a clear physical theory. You’re overlooking the first part of Fermi’s discussion with Dyson, viz:

Then he delivered his verdict in a quiet, even voice. “There are two ways of doing calculations in theoretical physics”, he said. “One way, and this is the way I prefer, is to have a clear physical picture of the process that you are calculating. The other way is to have a precise and self-consistent mathematical formalism. You have neither.”

For the tides, we indeed have an extremely clear physical picture of the process we’re calculating. So the question of tuning never arises.

With global temperatures averaged over 14 11-year intervals, there isn’t that much information relatively, so it is hard to find significant effects. The mathematical analysis in the paper shows that S is easily significant because of the overall upward trend, b_0 isn’t significant and gets dropped, b_1 is significant and gets retained, and b_2 is just outside significance and either gets dropped on strict criteria or retained on aesthetic criteria. In the strict case the free parameters are k, S and b_1 so those 3 parameters in any case meet von Neumann’s pedagogy.

With six tunable parameters fitting only fourteen data points, you have almost half as many parameters as data points. I’m sorry, but that is truly and totally meaningless. You desperately need to be as honest as Dyson was. He didn’t complain and claim that maybe it was four tunable parameters, not five. Instead:

I thanked Fermi for his time and trouble, and sadly took the next bus back to Ithaca to tell the bad news to the students.

You need to do what Dyson did, accept the bad news, put your model on the shelf, and move on to a more interesting problem of some kind.

Sadly indeed,

w.

Reply to  Willis Eschenbach
February 11, 2020 3:32 am

Willis Feb 10 10:13am:

Willis, I’ll prepend your comments with ‘W’, mine with ‘R’, followed by my unmarked replies.

R: Specific: k – no, that is not extra because x = k+11(b_0+b_1+b_2)-S log2(C(9)) covers it. 0 and 2 in the limits – no, they are not extra because they merely serve to enumerate b_0, b_1, b_2 which are already in there. 9 – no that is not extra because, again, x covers it.

W: Thanks, Rich. The variables k, 11, and S are, as you pointed out, confounded parameters. However, they are only confounded if you specify the rest. But the ones that you specify include “9”. So that one is indeed one of the tunable parameters. And we have to include the confounded parameter (made up of k, 11, and S as you stated elsewhere). I included it as “k”, and although you can give it any name it is a tunable parameter.

W: So that still leaves what I’ll call C (the confounding of k, 11, and S), g, b_0, b_1, b_2, and the “9” in the final denominator.

No, it is not S which is confounded with k, but log2(C(9)), which is a constant, and if I choose a different time index instead of 9, say 3, then in place of k I use k’ = k-S log2(C(9))+S log2(C(3)) and get the same value x = k’+11(b_0+b_1+b_2)-S log2(C(3)).

And as I explained earlier, g only takes 2 possible values, with minor effects on the fit. g=0 means no lag between CO2 rise and temperature rise, and g=1 means an 11-year lag. In the paper I mention that 11 years is roughly consistent with other published estimates, so I could simply have taken g=1 as my given parameter.

The continuously tunable parameters are k, b_0, b_1, b_2, S, which is 5, but as noted later they get reduced to 3: k, b_1, S.

W: With six tunable parameters, how well your model fits the data is MEANINGLESS. Seriously. I know you did it with solar cycle lengths as input (but without any explanation how a cycle that lasts a year longer has some magical effect).

You haven’t read the explanation in Section 6.2, which does nevertheless leave SCLs as a proxy deserving of further research.

W: But I could do the same with say global population or the price of postage stamps or money spent on pets or a hundred other input variables.

Nice joke, but I do think the sun and CO2 have more to do with climate than those.

W: So what?
Seriously, so what? I know this is hard to accept, just as it was hard for Freeman Dyson to accept. And I’m sorry to be the one to burst your bubble.

W: But if you can’t fit a simple temperature curve given the free choice of equation, variables, and six tunable parameters, you should hang up your tools and go home. It’s nothing more than an futile exercise in tuning.

Well, those 6 were really 5 and turned out to be 3 (again, see below). Given the amount of noise in temperature data, it would have been remarkable if all 5 had shone through with statistical significance.

R: General: the theory of hypothesis testing in the General Linear Model. Von Neumann’s quip is amusing but actually unfair. We use mathematical statistics to determine whether a parameter has a significant effect – we don’t just look at the data, as often happens on blogs, and say “look at how well that fits”. Consider tide predictions, where I believe there are many more than 4 parameters in the models; presumably von Neumann would complain about those, and yet because of the wealth of data each one has a certain amount of value and a certain statistical significance.

W: Let me start by saying that Neumann’s statement was not an “amusing quip”. We know that because hearing it caused one of the best scientists of the century, Freeman Dyson, to throw away a year’s work by him and his students. So no, you can’t pretend it’s just something funny that “Johnny” said. It is a crucial principle of model building.

Yes, but it was being applied to theoretical physics, where we expect things to be more cut and dried, with some undoubted surprises along the way like quantum mechanics.

W: Next, tides are something that I know a little about. I used to run a shipyard in the Solomon Islands. The Government there was the only source of tide tables at the time, and they didn’t get around to printing them until late in the year, September or so. As a result, I had to make my own. The only thing I had for data was a printed version of the tide tables for the previous year.

W: What I found out then was that for any location, the tides can be calculated as a combination of “tidal constituents” of varying periods. As you might imagine, the strongest tidal constituents are half-daily, daily, monthly, and yearly. These represent the rotations of the earth, sun, and moon. There’s a list of some 37 tidal constituents here, none of which are longer than a year.

I rest my case m’lud.

W: But the reason Neumann wouldn’t object to them is that they are backed by a clear physical theory. You’re overlooking the first part of Fermi’s discussion with Dyson, viz:
Then he delivered his verdict in a quiet, even voice. “There are two ways of doing calculations in theoretical physics”, he said. “One way, and this is the way I prefer, is to have a clear physical picture of the process that you are calculating. The other way is to have a precise and self-consistent mathematical formalism. You have neither.”

W: For the tides, we indeed have an extremely clear physical picture of the process we’re calculating. So the question of tuning never arises.

But the estimation of 37 parameters is tuning, and as I said, there is such a vast quantity of data available that statistical analysis can show the relative and non-zero importance of each of those.

R: With global temperatures averaged over 14 11-year intervals, there isn’t that much information relatively, so it is hard to find significant effects. The mathematical analysis in the paper shows that S is easily significant because of the overall upward trend, b_0 isn’t significant and gets dropped, b_1 is significant and gets retained, and b_2 is just outside significance and either gets dropped on strict criteria or retained on aesthetic criteria. In the strict case the free parameters are k, S and b_1 so those 3 parameters in any case meet von Neumann’s pedagogy.

W: With six tunable parameters fitting only fourteen data points, you have almost half as many parameters as data points. I’m sorry, but that is truly and totally meaningless.

Your comprehension rather failed you there. 3 does not equal six. It isn’t even an adjacent integer.

W: You desperately need to be as honest as Dyson was. He didn’t complain and claim that maybe it was four tunable parameters, not five. Instead: I thanked Fermi for his time and trouble, and sadly took the next bus back to Ithaca to tell the bad news to the students. You need to do what Dyson did, accept the bad news, put your model on the shelf, and move on to a more interesting problem of some kind.

I shall be honest when the time comes, but not in response to your rather feeble criticisms and appeal to the omniscience of von Neumann. That time will be when, and if, new data falsifies my model. I think that the current and next solar cycles may indeed yield temperatures which require changes to my model parameters beyond the point of breaking. For example, b_1, currently standing at an 80:1 chance of its value occurring at random, might decline to 10:1, considered non-significant. Or the Durbin-Watson test on the residuals may become so significant that the model becomes irretrievable. If that happens, the past good fit of the model will have to be put down to spurious correlation. Or perhaps I’ll get lucky and a new Hiatus will occur this decade!

Rich.

February 7, 2020 11:26 am

I haven’t had time to digest all of this.

One comment on the 10 ruler (or 100 ruler) case. Your case relies upon a statistical distribution of how to achieve the best serial measurement. That certainly appears ok at first blush. However, my understanding of Pat Frank’s paper is that you only have one ruler. The ruler has a given uncertainty and is used in serial measurements. I don’t believe your calculations covered that case adequately.

It is simple logic that if a ruler is 0.1 inch short and you use it serially 10 times, your measurement will be short by one inch. Likewise, if it is 0.1 inch long and used 10 times, your measurement will be one inch long. This doesn’t even cover random errors like pencil width, parallax, etc.

Even if the uncertainty is expressed as a standard deviation, that only applies to what the measurement may be for one measurement. What is the uncertainty distribution after using it serially 10 times? This is what his paper was about. If you use a piece of data with an uncertainty at the input of a GCM, it will flow through in some fashion to the output. If you then take that output and feed it back into another run, the uncertainty will compound again. Each time you run the GCM, feeding the output of the last run into the input of the next, the uncertainty will compound.
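A minimal sketch of the two growth laws being contrasted here (all numbers invented): a single biased ruler used serially accumulates error linearly in the number of uses, whereas independent zero-mean random errors grow only as the root-sum-square, i.e. as sqrt(n):

```python
import random, statistics

random.seed(1)
n_uses = 10          # the ruler is laid end-to-end 10 times
bias = -0.1          # a single ruler that is 0.1 inch short, every time
sigma = 0.1          # spread of an independent random error per placement

# Case 1: one biased ruler used serially -- the error adds up linearly.
systematic_total_error = n_uses * bias        # -1.0 inch after 10 uses

# Case 2: independent zero-mean random errors -- the *spread* of the total
# grows only as sqrt(n) (root-sum-square), not as n.
trials = [sum(random.gauss(0, sigma) for _ in range(n_uses)) for _ in range(100_000)]

print("systematic total error:", systematic_total_error)
print("random case: mean of total ~", round(statistics.mean(trials), 3))
print("random case: sd of total   ~", round(statistics.stdev(trials), 3),
      "(compare sigma*sqrt(10) =", round(sigma * 10 ** 0.5, 3), ")")
```

Run as written, it contrasts a fixed total error of -1.0 inch with a random-case spread of roughly 0.32 inch.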

If you have programmed it in such a way as to cancel this type of error, what you have done is chosen what you want the output to be.

Barry Hoffman
Reply to  Jim Gorman
February 7, 2020 12:41 pm

…… And when your calculations ultimately produce a potential “Temperature” variance far beyond what one observes in the real world, the only conclusion that can be deduced is that the model is fatally flawed.

“It doesn’t matter how beautiful your theory is, it doesn’t matter how smart you are. If it doesn’t agree with experiment, it’s wrong” – Feynman

Jeff Alberts
Reply to  Barry Hoffman
February 7, 2020 5:38 pm

“If it doesn’t agree with experiment, it’s wrong”

What if it’s a bad experiment?

Clyde Spencer
Reply to  Jeff Alberts
February 7, 2020 8:19 pm

Jeff
You asked, “What if it’s a bad experiment?” Then you resort to statistics. 🙂

Barry Hoffman
Reply to  Jeff Alberts
February 8, 2020 6:25 am

The “experiment” is the collection of real world raw data. Not tree ring proxy or any other proxy. When the product of a model’s output is re-entered as viable data for the next model run, and the uncertainty values exceed known parameters, the only possible conclusion is the model is worthless.

Paul Penrose
Reply to  Jim Gorman
February 7, 2020 9:48 pm

Jim,
I was also going to object to this part of the analysis, but I would also point out that it is not reasonable to expect a manufacturer to specify an error distribution. In fact, I think you will find that outside the area of high-precision lab equipment, they don’t. All they will tell you is that any products outside the specified parameters (e.g. +/- 0.1 inch) are rejected (not sold). But they don’t make any attempt to describe the error distribution; it is unknown and unknowable. Without a full analysis of the GCM codes to account for all the floating point errors (both representational and rounding) and other potential sources of error (coding “bugs”) we can’t assume any particular error model. Any analysis of error propagation must deal with this issue, and I don’t believe that Dr. Booth has.

Reply to  Paul Penrose
February 8, 2020 1:49 am

Paul, a point I make is that when numerous errors combine, their sum tends to normality by the Central Limit Theorem. It is then reasonable to use standard deviations of that in the error propagation. But correlation of errors does need to be addressed in the propagation; nevertheless apart from my R_2(t) versus R_3(t) I follow Pat Frank in assuming no correlation.
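A minimal sketch of that CLT point (the component error distributions are invented and deliberately non-normal): the sum of the components is close to normal, and its variance is simply the sum of the component variances, with no reduction:

```python
import random, statistics

random.seed(0)

def one_combined_error():
    # Three independent, non-normal error sources (shapes chosen arbitrarily).
    e1 = random.uniform(-0.5, 0.5)          # variance = 1/12
    e2 = random.triangular(-0.3, 0.3)       # variance = 0.3**2 / 6
    e3 = random.choice([-0.2, 0.2])         # +/-0.2 coin flip, variance = 0.04
    return e1 + e2 + e3

samples = [one_combined_error() for _ in range(200_000)]
print("empirical variance of the sum:", round(statistics.variance(samples), 4))
print("sum of component variances   :", round(1/12 + 0.3**2/6 + 0.2**2, 4))
# A histogram of 'samples' would look close to a bell curve even though
# none of the three components is normal -- that is the CLT at work.
```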

RJB

Reply to  See - owe to Rich
February 8, 2020 7:39 am

This assumption about the CLT only applies when errors can be shown to be random, i.e., a normal distribution. Further assumptions are that you are measuring the same thing with the same device. In other words, repeatable measurements.

Each temperature measurement that is recorded is a non-repeatable measurement. It is a one time, one location measurement. You cannot reduce the uncertainty of a single measurement by using measurements at different times or locations with different devices, because you can’t assume a normal distribution of errors. Therefore, it is the only measurement you will ever have, with whatever uncertainty budget applies to single measurements. In essence you have a mean with a variance for each measurement. When you average measurements you must also combine the variances.

Reply to  Jim Gorman
February 9, 2020 8:47 am

Jim, the concept of reducing uncertainty through repeated statistically independent samples does not depend on the presence of a normal distribution. It is simple mathematics from calculating variances of means of random variables.

But anyway my comment at Feb 8 1:49am didn’t involve a mean. Rather, it was pointing out that if an overall error arises from several independent sources, then by the CLT the error distribution tends to normality, whereas Paul Penrose seemed to be saying it was completely unknown. But that doesn’t _reduce_ the error variance, which is the sum of all the component variances.

RJB

Reply to  Jim Gorman
February 9, 2020 1:24 pm

See –> You missed the point. When taking a measurement, the population of the multiple random readings of the same thing must approximate a normal distribution if you wish to simply average the readings to find a “true value”. From the GUM:

“3.1.4 In many cases, the result of a measurement is determined on the basis of series of observations obtained under repeatability conditions (B.2.15, Note 1).

NOTE 1 The experimental standard deviation of the arithmetic mean or average of a series of observations (see 4.2.3) is not the random error of the mean, although it is so designated in some publications. It is instead a measure of the uncertainty of the mean due to random effects. The exact value of the error in the mean arising from these effects cannot be known.

NOTE 2 In this Guide, great care is taken to distinguish between the terms “error” and “uncertainty”. They are not synonyms, but represent completely different concepts; they should not be confused with one another or misused.”

The CLT will allow you to take samples from a population and use sample means to determine the mean of that population. The sample mean will tend toward a normal distribution but that does nothing to change the precision, variance, or uncertainty of the population. A lot of people believe the “error of the mean” of a sample mean distribution means the mean of the population gains all of these. It does not. It only describes how close the sample mean is to the mean of the population.

Too many people use the CLT and error of the mean to justify adding digits of precision to averages. You cannot do that. Significant digits are still important.

Reply to  Jim Gorman
February 10, 2020 1:54 pm

Jim (Feb 9 1:24pm): I deferred replying to your comment as it took a little more thought than some of the others.

The JCGM does not say anything about measurements having to be from a normal distribution in order for their mean and standard deviation to be useful. Sections 4.2.2 and 4.2.3 define s^2(q_k) and s^2(q*) as the variances of the observations and of the mean (I am using q* as more convenient here than q with a bar on top). If there are n observations then s^2(q*) = s^2(q_k)/n, and the (standard) uncertainty of the mean is defined to be u(q*) = s(q*), which does decrease (on average) as n grows, in a sqrt(1/n) fashion.
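Expressed as a minimal sketch (the observation values are invented), the GUM 4.2.2/4.2.3 quantities are just:

```python
import statistics, math

# A hypothetical series of n repeated observations q_k (values are made up).
q = [20.3, 20.1, 20.4, 20.2, 20.5, 20.3, 20.0, 20.4]
n = len(q)

q_star   = statistics.mean(q)       # q*, the arithmetic mean
s2_qk    = statistics.variance(q)   # s^2(q_k), experimental variance of the observations
s2_qstar = s2_qk / n                # s^2(q*) = s^2(q_k)/n
u_qstar  = math.sqrt(s2_qstar)      # u(q*) = s(q*), standard uncertainty of the mean

print(f"mean q* = {q_star:.3f}")
print(f"s(q_k)  = {math.sqrt(s2_qk):.3f}")
print(f"u(q*)   = {u_qstar:.3f}   (smaller by a factor of sqrt(n) = {math.sqrt(n):.2f})")
```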

So I don’t see in what sense I “missed the point”.

Rich.

Reply to  Jim Gorman
February 12, 2020 6:37 pm

Rich –> Look at what you are dividing by n. You are dividing the population standard deviation squared (the variance). That means you end up with a standard deviation that is smaller than the population standard deviation. It tells you that the sample mean is closer and closer to the population mean. In other words, the sample mean distribution becomes tighter and tighter around the population mean. When n = infinity, the sample mean would be the same as the population mean.

Please note, this calculation has nothing to do with the accuracy, precision, or variance of the population. It only tells you how close you have approximated the mean.

Paul Penrose
Reply to  See - owe to Rich
February 11, 2020 10:27 am

Dr. Booth,
The statement, “when numerous errors combine, their sum tends to normality by the Central Limit Theorem” is only true if you are talking about true random errors. But what if what you are calling “errors” are really biases and are not random at all? In that case I don’t think your statement is true any longer. And if, as I assert, these “errors” are unknown, then it is equally likely that they are biases versus random noise. In the face of this unknown, don’t we have to proceed along the worst case path, which is to say, the “errors” propagate forward as a simple sum?

Paul Penrose
Reply to  Paul Penrose
February 11, 2020 10:38 am

I meant to say “propagate forward as a root sum square”, not a “simple sum”.

Lance Wallace
February 7, 2020 1:35 pm

Dr. Booth, in your published article Section 4.10, you apply your model to the warming between 1980 and 2003 studied by Benestad (2009). The period from 1980 on also coincides with the satellite observations of global temperature. HadCRUT4 depends on sparse, uncertain local ground and sea temperature measurements, whereas the satellite observations are the only true measure of global temperature (although themselves subject to different interpretations of the data: e.g. the UAH and RSS data series).

Would you consider repeating your effort in section 4.10 using the satellite observations instead of HADCRUT4? Preferably both RSS and UAH? If the results stay much the same, you have at least verified your approach with two (or three) independent sets of observations.

Reply to  Lance Wallace
February 8, 2020 1:54 am

Lance, I would love to be able to do that, but the statistical signal is too weak to turn into a significant result over just 40 years. With the HadCRUT4 data, the warming from 1910 to 1940 and then slight cooling gives the model part of its traction.

But anyway that paper isn’t what this WUWT article is about.

Thanks for the idea,
RJB

February 7, 2020 1:57 pm

“The implication of comments by Roy Spencer is that there really is something like a “magic” component”
It isn’t. The implication is simply that GCMs are based on conservation laws. In particular they conserve energy. So it has to add up. Where they make some local error, it is the overall conservation requirement that ensures TOA balance. Not “GCMs have to resort to error correction techniques”.

” the practice seems dubious at best because it highlights shortcomings in GCMs’ modelling of physical reality.”

No, it is GCM’s correct modelling of physical reality, in which energy is conserved. A reality totally missing from Pat Frank’s toy model.

Reply to  Nick Stokes
February 7, 2020 3:51 pm

Nick, do you admit that the GCMs are subject to errors of +/-4W/m^2 in LCF (Longwave Cloud Forcing)? If they are, which other parameters of the GCMs automatically adjust to correct that error? Or, does LCF not even enter into the GCMs, for example because they use a fixed amount of radiation input, plus a bit for the annual increase in GHG LW downwelling?

I admit I am showing my ignorance about GCMs here 🙂

RJB

Reply to  See - owe to Rich
February 7, 2020 5:24 pm

“are subject to errors of +/-4W/m^2 in LCF”
That number has been grossly misused; it is actually a spatial variability. But insofar as the GCMs do get cloud opacity wrong, they simply give a consistent solution for a more or less cloudy world. Energy is still conserved.

Jeff Alberts
Reply to  Nick Stokes
February 7, 2020 5:39 pm

“they simply give a consistent solution for a more or less cloudy world. Energy is still conserved.”

In other words, they make stuff up.

Michael S. Kelly
Reply to  Nick Stokes
February 7, 2020 9:04 pm

“Energy is still conserved.”

You keep saying this, and it isn’t true. The mathematically written-out Navier-Stokes (NS) equations are based on conservation of mass, momentum, and energy. Those aren’t the equations that any general circulation model uses. Instead, they use a bastardized algebraic version of the NS equations, with any of a variety of discrete approximations representing the partial derivatives, where the spatial components are represented by computational grid-points surrounding the Earth.

Modern computers have allowed modelers to use ~10E5 grid points to model the entire atmosphere. Even so, the spatial resolution is on the order of hundreds of kilometers, which doesn’t amount to anything a reasonable person would consider “resolution.” It is insufficient to resolve a thunderstorm. Heck, it is insufficient to resolve a hurricane, really.

On top of that, these models are expected to integrate reliably 100 years into the future, and do so in less than 100 years of run time. So even with the gigantic spacing of the grid points, the relatively small minimum time step (less than 5 minutes) demanded of an explicit solution method dictates that either implicit or “spectral” solution methods be employed. Each permits longer time steps, with the trade-off of a loss of accuracy at each step. Implicit methods, for example, are solved iteratively. The iteration at each grid point and time step is stopped when a pre-set error value is achieved, otherwise the computation would go on forever.

The errors at every single one of those 10E5 grid points are small, but finite. And they serve as an erroneous set of initial conditions for the next time step. The errors do include errors in each of the “conserved” quantities, including energy. And they accumulate over time.

All of this is besides the fact that, in order to handle turbulence (which dominates atmospheric physics), modelers have to employ Reynolds averaging. Even if one believed that the discretized NS equations were the same as the parent partial differential equations (they aren’t), the introduction of turbulence models to close the Reynolds averaged equations renders them non-physical. They are not physics-based any longer.

Get over it.

Reply to  Michael S. Kelly
February 8, 2020 12:10 am

“Instead, they use a bastardized algebraic version of the NS equations, with any of a variety of discrete approximations representing the partial derivatives”

The NS equations are always expressed with algebra. And all CFD (and all applied continuum mechanics) will use discrete approximations. The main deviation of GCMs is in using the hydrostatic approximation for vertical pressure. That means dropping terms in acceleration and vertical viscous shear. You can check whether that is justified, and make corrections (updraft etc) when not.

As with all CFD, you can express the discretisation in conserved form. That is, you actually do the accounting for mass, momentum and energy in each cell, with boundary stresses etc. You aren’t then relying on partial derivatives.

As for integrating 100 years into the future, well, it does. That is the point of the fading effect of initial conditions. GCMs, like CFD, express the chaotic attractor. As long as it maintains stability, it will keep doing that; the ability to do so doesn’t wear out.

“(less than 5 minutes) demanded of an explicit solution method dictates that either implicit or “spectral” solution methods be employed”
Spectral is really just a better organised explicit. I think the time step is more than 5 minutes (the Courant condition is based on gravity waves which are a bit slower than sound), and basically, that is what they use. AFAIK, they don’t use implicit.

“And they accumulate over time.”
Actually that is what has sometimes upset Willis. They use energy “fixers” which basically add a global conservation equation. That makes the system slightly overdetermined, but stops the accumulation.

“the introduction of turbulence models to close the Reynolds averaged equations”
Well, all CFD does that. It makes the momentum more diffusive, as it should be, possibly not to the correct extent. But it doesn’t break the conservation.

Reply to  Michael S. Kelly
February 8, 2020 1:41 pm

“vertical pressure” I meant vertical momentum.

Reply to  Michael S. Kelly
February 9, 2020 1:55 pm

Nick Stokes February 8, 2020 at 12:10 am

“And they accumulate over time.”

 
Actually that is what has sometimes upset Willis. They use energy “fixers” which basically add a global conservation equation. That makes the system slightly overdetermined, but stops the accumulation.

STOP WITH THE HANDWAVING ACCUSATIONS!!! Provide a quote where I said I was “upset” by the accumulation of energy imbalance in the models. You can’t, because I never said that, it’s just another of your endless lies about me.

What I actually said, AS I HAD TO REMIND YOU BEFORE, was that I was upset that Gavin didn’t put a Murphy Gauge on the method to determine if the error was either big or small. Here’s the quote:

Willis Eschenbach January 18, 2020 at 12:24 pm

Nick, please learn to read before attacking at random. I didn’t say he was a lousy programmer for how he dealt with the energy imbalance. I specifically said that was OK.

I said he was a lousy programmer for not putting a Murphy Gauge on the amount of energy re-distributed, so he could see when and where it went off the rails.

Stop your god-damned lying about what I said, Nick. That’s twice you’ve tried the same lie. It is destroying what little is left of your reputation.

w.

Michael S. Kelly
Reply to  Michael S. Kelly
February 9, 2020 6:59 pm

“The NS equations are always expressed with algebra.”

Well, no, they’re always expressed as a set of non-linear partial differential equations in their native form. Solution techniques are usually (though not always) expressed in algebraic form involving discrete approximations of the derivatives, partial or otherwise.

“And all CFD (and all applied continuum mechanics) will use discrete approximations.” I never distinguished GCMs from CFD. It’s CFD and its pretensions with which I have the basic problem. It’s a tautology to say that all CFD will use discrete approximations. Discrete approximations – and the rise of the digital computer to calculate them – are the only reason there is such a thing as CFD, which is the art of approximating solutions of continuous differential equations through the use of discrete (algebraic) equations. Not all applied continuum mechanics uses discrete approximations, however, not even fluid dynamics. The application of finite calculus to fluid dynamics is popular because math is hard. It isn’t impossible in all cases. (see https://cds.cern.ch/record/485770/files/0102002.pdf, for example)

“As with all CFD, you can express the discretisation in conserved form.”

I thought that was the whole point. My point is that any numerical method run on a digital computer is subject to error, and integration of differential equations is subject to error in initial conditions, truncation error (pertaining to truncation of the series representing the solution function, whether Taylor, binomial, “spectral”, or whatever), and roundoff error (related but not limited to machine word length). For climate models, we can also add a class of error that does exist but is completely unquantified (and ignored): bit errors not detected by standard error correcting techniques. With the stupendous number of computations required for a 100-year climate run, these must have a substantial effect. These errors occur in every computed dependent variable at each time step, and though their order can be estimated, their magnitude and sign are completely unknown, and all of them contribute to the error in energy. There isn’t any way to correct for energy error in any way that can be proven consistent with reality.

“As for integrating 100 years into the future, well, it does.”

I can numerically integrate the equations of motion for the planets in the solar system (a much simpler problem) for 100 years, too. And the results will be wrong. But they will be results. In fact, I can keep it going for 100 million years. The results will be wronger, but they will still be results. It is thought that we can get realistic results for planetary positions over the span of a million years, but nobody really knows. Now, we can integrate the equations of motion of an ICBM and show that it can hit a target 6,000 nautical miles away with a 50% probability of hitting within 500 feet. Test flights verify that ability. But they’re 30 minutes in duration. Getting to the Moon takes 2 1/2 days. We can’t do that without course corrections – measuring actual state vectors in flight via radar and stellar updates, and computing a new trajectory to correct the one that we thought was correct in the first place, just so we don’t miss by a hundred kilometers. It strains credulity to think that an integration of phenomena whose physics are far less well understood than those of celestial mechanics could give any kind of meaningful results, particularly one involving a vastly larger number of computations at an accuracy order far lower than that of the integration schemes used for space flight. It doesn’t just strain it, it tears it limb from limb.

On CFD as providing chaotic attractors, that’s one of my beefs with the discrete time derivative. Attractors involve a fixed point, a point at which f(x) = x. Finite difference approximations of the time derivative make that possible computationally, when it might not be possible for an analytic solution. It’s a problem I’ve been studying, but I can provide no definitive answers yet.

Spectral methods just replace the linear spatial interpolation functions, which are series-derived (Taylor or other) with series using orthogonal basis functions. The time derivative is still series-derived, and the time steps AFAIK are taken by implicit methods to avoid the Courant limit. And yes, they are much longer than 5 minutes, or we’d have 100 year simulations that took 1,000 years to run.

The part about the turbulence models was off topic, but I am always frustrated when people assert that CFD is “physics.” And not all CFD uses them. Direct Numerical Simulation doesn’t, but it also would be incapable of running a 100 year simulation of the climate with enough nodes to capture turbulence at all significant levels.

Reply to  Michael S. Kelly
February 9, 2020 11:14 pm

Willis,
“Provide a quote where I said I was “upset” by the accumulation of energy imbalance in the models. “

OK, from here

“A decade or more ago, I asked Gavin how they handled the question of the conservation of energy in the GISS computer ModelE. He told me a curious thing. He said that they just gathered up any excess or shortage of energy at the end of each cycle, and sprinkled it evenly everywhere around the globe …
As you might imagine, I was shocked … rather than trying to identify the leaks and patch them, they just munged the energy back into balance.”

Well, you said “shocked”; I said “upset”. I can remember a rather more extensive disapproval, which I can’t currently locate.

Reply to  Michael S. Kelly
February 10, 2020 12:09 am

Michael Kelly,
“It isn’t impossible in all cases. (see…”
Well, I said applied continuum mechanics, which this really isn’t. And I don’t think there is much done in structural mechanics, elasticity etc that isn’t discretised.

“for 100 years, too. And the results will be wrong”
Well, not that wrong. The planets will still be in much the same orbit. Kepler’s laws etc. will still be pretty much followed. You’ll get some phase discrepancies. Kind of like getting the climate right but the weather wrong.

“are taken by implicit methods to avoid the Courant limit. And yes, they are much longer than 5 minutes, or we’d have 100 year simulations that took 1,000 years to run”

Actually implicit methods are slower. One of my beefs here is that implicit methods aren’t really an improvement. They require an iterative solver, which basically reenacts internally the many steps that an explicit method would use.

“And yes, they are much longer than 5 minutes”
Well, somewhat, because the layered atmosphere isn’t as stiff as a block of air would be (on compression it can rise). Here is a practical discussion:
“The experimental cases have GCM time steps of 600, 900, and 3600 s (fscale is higher with a smaller time step).”
I think 3600s was coarse resolution.

Reply to  Michael S. Kelly
February 10, 2020 12:48 am

Nick Stokes February 9, 2020 at 11:14 pm

Willis,

“Provide a quote where I said I was “upset” by the accumulation of energy imbalance in the models. “

OK, from here

“A decade or more ago, I asked Gavin how they handled the question of the conservation of energy in the GISS computer ModelE. He told me a curious thing. He said that they just gathered up any excess or shortage of energy at the end of each cycle, and sprinkled it evenly everywhere around the globe …
As you might imagine, I was shocked … rather than trying to identify the leaks and patch them, they just munged the energy back into balance.”

Well, you said “shocked”; I said “upset”. I can remember a rather more extensive disapproval, which I can’t currently locate.

Nice try, but no cigar. You said I was upset by the accumulation of energy. But as your quote itself proves, now that you’ve bothered to quote it, I was NOT shocked by the accumulation.

Instead, I was shocked by the fact that rather than try to identify any possible leaks, they made no effort to see if anything was wrong or to even to see if the amount accumulating was unreasonably large.

Instead they just sprinkled it around the planet. The accumulation and the sprinkling didn’t bother me as you falsely claimed. The lack of any attempt to monitor the process did, and I went on to talk about Murphy gauges.

w.

Reply to  Nick Stokes
February 7, 2020 5:31 pm

TOA balance does not imply a discrete climate state, much less a physically accurate climate state.

Climate models make very large errors in the distribution of energy among the climate sub-states.

From Zanchettin, et al., (2017) Structural decomposition of decadal climate prediction errors: A Bayesian approach. Scientific Reports 7(1), 12862

[L]arge systematic model biases with respect to observations … affect all of mean state, seasonal cycle and interannual internal variability. Decadal climate forecasts based on full-field initialization therefore unavoidably include a growing systematic error

Model drifts and biases can result from the erroneous representation of oceanic and atmospheric processes in climate models, but more generally they reflect our limited understanding of many of the interactions and feedbacks in the climate system and approximations and simplifications inherent to the numerical representation of climate processes (so-called parameterizations).

All these errors in climate sub-states are present despite the imposed TOA balance.

But for you, Nick, “limited understanding” evidently achieves a “correct modelling of physical reality.”

There is zero reason to think GCMs deploy a “correct modelling of physical reality.”

Second, my emulation equation is not a “toy model.” Toy model is just another instance of you being misleading (again), Nick.

Eqn. 1 demonstrates that GCMs project air temperature merely as a linear extrapolation of CO2 forcing. The paper explicitly disclaims any connection between the emulator and the physical climate.

Thus, “This emulation equation is not a model of the physical climate. It is a model of how GCMs project air temperature.”

You knew that Nick, when you chose to mislead.

Reply to  Nick Stokes
February 7, 2020 7:09 pm

Nick,
The presence of conservation laws and all energies adding up is not really relevant.
Permit a simpler analogy, whole rock analysis in analytical chemistry. The sum of all of the analyses of the chemical elements has to add up to 100% of the weight of the rock. (Conservation of Mass?). This says nothing much about the size and location of errors. In practice, the larger errors tend to go with the abundant elements, like oxygen, silicon, aluminium. These elements are typically in the tens of % range. Then, there are trace elements like (say) mercury, where we are in the parts per billion range. The error assigned to (say) oxygen analysis is far removed from the error associated with mercury analysis. Large errors in trace mercury analysis have next to no effect on the total mass balance.
Back to the GCM case, the errors associated with the larger energy components will generally dominate the overall error analysis. Given the central role of Top of Atmosphere energy balance, I show again this figure from about 2011.
http://www.geoffstuff.com/toa_problem.jpg from Kopp & Lean
http://onlinelibrary.wiley.com/doi/10.1029/2010GL045777/full

Here we have the classic problem of subtracting two large numbers (energies in and out) to get a tiny difference whose even smaller variation has significance for the problem at hand. But, the TOA energy balance shown by the responses of various satellite instruments in the figure above is heavily dependent on the subjective act of adjustment and aligning of the satellite data in the absence of an absolute comparator.
Which leads to a more general question: is it valid to apply classic error analysis methods to numbers that are invented or subjectively adjusted as opposed to measured? For example, how does one calculate the useful error of historic gridded surface sea temperatures when a large % of them are invented by interpolation?
Geoff S

Reply to  Geoff Sherrington
February 7, 2020 8:14 pm

Geoff,
“The sum of all of the analyses of the chemical elements has to add up to 100% of the weight of the rock.”
The difference is that the whole of mechanics can be derived from the conservation laws. There isn’t anything else. And that is the basis for solution. It is really built in.

“For example, how does one calculate the useful error of historic gridded surface sea temperatures when a large % of them are invented by interpolation?”
The entire field outside the points actually measured is “invented” by interpolation. We know about temperature by sampling, as we do throughout science. Why are you analysing those rock samples? Because you want to know the properties of a whole mass of rock. Fortunes depend on it. You “invent by interpolation”. Suppose you do mine and crush it. How do you know the value of what you produced? Again, you analyse samples. Hopefully with good statistical advice. It’s all you can do.

Clyde Spencer
Reply to  Nick Stokes
February 7, 2020 8:46 pm

Stokes
You said, “It’s all you can do.” No! If the analyses prove to be wrong then you can try to determine why they are wrong. You might want to alter the sampling procedure or alter your model of mineralization. You might want to look for a different assayer or chemist. You might want to look at your ore-processing stream to see if you are losing things of value.

It isn’t sufficient to say that because your approach is based on conservation laws it has to be right, and then ignore surprises.

Reply to  Clyde Spencer
February 7, 2020 11:48 pm

“If the analyses prove to be wrong then you can try to determine why they are wrong. “

It isn’t an issue of whether the analysis of those samples is accurate. The issue is that you then have to make inferences about all the rock you didn’t sample. Same as with temperature, and just about any practical science. How do you know the strength of materials in your bridge? You measure samples. Or look up documentation, which is based on samples of what you hope are similar materials. How do you know the salinity of the sea? You measure samples. The safety of your water supply? Samples.

Clyde Spencer
Reply to  Clyde Spencer
February 8, 2020 10:03 am

Stokes
As usual, you either missed the point or chose to construct a strawman. You said, “How do you know the strength of materials in your bridge? You measure samples.” The issue is that if your bridge fails, you ask why. It may well turn out that the sampling was done incorrectly. It is also possible that the formulas used were wrong or calculated incorrectly. In any event, a good engineer doesn’t resort to defending the design by claiming that it was based on “conservation laws.” They try to determine why the bridge failed and make appropriate corrections to avoid repeating the mistake(s).

Clyde Spencer
Reply to  Geoff Sherrington
February 7, 2020 8:40 pm

Geoff
Further to your remarks, when an analysis doesn’t add up to 100% (all the time!) an assumption that is made is that the error is proportional to the calculated oxide percentage. Thus, every oxide reported is adjusted in proportion to the raw percentage. That is not an unreasonable assumption, but it may not be true. Most of the error may actually be associated with just one oxide. Therefore, if the assumption of proportionality is not true, then error is introduced into all the calculated amounts.

The same thing could be true about conservation of energy for TOA. If the GCM step-output is scaled back to 100%, without determining WHY it is in error, then an error is retained and propagated instead of being expunged. So, while Stokes and others think that they are keeping the calculations on track, they may just be propagating an unknown error and blithely go on their way convinced that they are doing “science based” modeling. In reality, they are pretending that the uncertainty doesn’t propagate and it becomes a self-fulfilling belief because they constrain the variation over time with a rationalization.

Reply to  Clyde Spencer
February 8, 2020 6:20 pm

Thank you Clyde,
You are showing that you know what I mean.
The analytical chemist, faced with a discovery of unacceptable errors, either withdraws the results thought to be wrong, or embarks on new investigations using a variety of available techniques.
The GCM modeller seems to merely double down and try to bullshit a way through the wrong results. I admit that the modeller has fewer options about going to other investigations, but the alternative seems to be ignored. The alternative is to withdraw the results known to be in error.
It verges on the criminal to continue to push results known to be so wrong.
Geoff S

February 7, 2020 5:00 pm

Rich, “… this means that Equation (1) with the equivalent of Frank’s parameters, using many examples of 80-year runs, would show an envelope where a good proportion would reach +15K or more, and a good proportion would reach -15K or less,”

If your equation (1) means that, then it has no correspondence with my emulation eqn. (1), nor with the uncertainty calculation following from my eqn. (5).

The predictive uncertainty following from model calibration error says nothing whatever about the magnitude of model output.

Reply to  Pat Frank
February 7, 2020 5:42 pm

Rich, “But the GCM outputs represented by CMIP5 do not show this behaviour, …”

Yet once again, the same bloody mistake: that predictive uncertainty is identical to physical output error.

Correcting this mistake is, for some people, evidently a completely hopeless project.

The ±15 C predictive uncertainty says nothing whatever about the magnitude of GCM outputs, Rich. It implies nothing about model behavior.

Reply to  Pat Frank
February 8, 2020 2:33 am

Pat, let’s face it, we are never going to agree about “uncertainty” (or perhaps even what we mean by “mean”!). I have stated pretty clearly, supported I believe by the JCGM, how uncertainty relates to model error. That then says a lot about model behaviour, and it is model behaviour we are interested in, because global policy is mistakenly based on it.

If the GCMs did not use “conservation of energy” to correct for error, then I have absolutely no doubt that they would indeed wander off to +/-15 degrees or more by 2100. Your paper has been invaluable, and I give that credit in Section H, in extracting the admission of auto-correction, which means the models probably don’t care too much about LCF errors except perhaps in respect of regional distribution of temperature (GCM experts: I’d still like to hear more on that subject).

Rich.

Pat Frank
Reply to  See - owe to Rich
February 11, 2020 7:56 am

Rich, “… then I have absolutely no doubt that they would indeed wander off to +/-15 degrees or more by 2100.”

But that’s not the meaning of my analysis, nor what I or my paper are saying.

You insist upon an incorrect understanding of predictive uncertainty. That mistake is fatal to your whole argument.

I’ll have more to say later.

Paul Penrose
Reply to  Pat Frank
February 7, 2020 10:11 pm

Pat,
I don’t see how people keep getting this point wrong. What your “uncertainty envelope” means is that in order to have any confidence in the model outputs, they would have to be outside that envelope. Why? Because any results within the envelope are just as likely to be caused by random errors of various kinds, versus valid expressions of the underlying physical theories. Of course, if they were outside the envelope they would be equally unbelievable. So the only conclusion is that the model outputs beyond a few months (days?) from the starting time are simply useless.

Reply to  Paul Penrose
February 9, 2020 8:58 am

Paul–> What you are saying is what was mentioned above. One must resolve the reasons for the uncertainties until the “output” uncertainty is small enough to allow one to make the conclusion that the calculated result exceeds the uncertainty interval. If the interval continues to be too large, keep searching and revising until you can legitimately reduce it further.

Reply to  Pat Frank
February 8, 2020 3:05 am

Pat, can you prove mathematically that my Equation (1) has no correspondence with your (1) or your (5)? My explanation of equivalence is in Section D Equation (8).

Rich.

Clyde Spencer
February 7, 2020 8:07 pm

“Instead, sort them by length, and use the shortest and longest 5 times over. We could do this even if we bought n rulers, not equal to 10. We know by symmetry that the shortest plus longest has a mean error of 0, …”

The statements are true only if the error about the nominal value is symmetrical. A manufacturing error could introduce a bias such that the arithmetic mean was 12″ but the mode was some other length.

Ken Irwin
Reply to  Clyde Spencer
February 8, 2020 12:22 am

I once had a problem with a manufacturer of ground steel bar – which was consistently oversized.
The diameter distribution data – although well within the capability limits – was always biased to the oversize end of the limits? Curious.
When I visited the factory and spoke to the operating staff I was told “The boss tells us he sells by the kilogram and we must work toward the upper limit” – Ah-Haa – stupid but understandable.

The assumption of errors stacking up to a normal distribution does not hold if there is some socio-political bias in the way the end to end rulers are stacked – i.e. a bias to long or short.

I fear climate modelling is fraught with such biases and the distribution curve moves and expands in one direction – increasing with every addition.

“It is beyond coincidence that all these errors should be in the same direction.”
Dr. Matt Ridley – Angus Millar lecture to the Royal Society – Edinburgh – Nov 1st 2011

Reply to  Ken Irwin
February 8, 2020 4:37 am

Ken (Feb 8 12:22am), actually the socio-political bias does not prevent errors stacking up to a normal distribution. It just means that the arithmetic mean (see, owe to Rich that he is saying what he means by “mean”) is not the target diameter that you expected it to be. So the amount of uncertainty, i.e. dispersion of error, may be as expected, while the mean is not. This points up the importance of verification, which is what you admirably did. But if the bias didn’t take the results outside the tolerance quoted, if any, by the manufacturer, then you couldn’t sue them.

RJB

Reply to  Clyde Spencer
February 8, 2020 2:09 am

Clyde, very true. But my analysis was predicated on a uniform distribution, which is symmetrical about its central value, whether biased or not. I did this because some previous Commenters had said that given an uncertainty interval one should assume any value within it is equally likely. All sorts of bizarre conclusions flow from that assumption, which, ahead of time, one might not have expected mathematically.

RJB

February 7, 2020 8:12 pm

“I talk to the trees/ But they don’t listen to me … “ Lyrics from “Paint Your Wagon”, 1969.
“I talk to the trees/That’s why they put me away …” Lyrics by Spike Milligan, soon after.
The Mann, Bradley & Hughes “hockeystick” paper of 1998 used properties of tree rings to derive an alleged temperature/time history of the Northern hemisphere. It is plausible that these authors knew that temperature was not the only variable able to influence such tree ring properties. Moisture, fertilizer and insect damage are three further variables of influence.
In a proper error analysis, estimates of error of each variable are made, then combined to give an overall error. For moisture, there was some historic data and some prior work linking moisture levels to ring properties. There was much less historic data about fertilization effects, both natural and man-made. There was effectively no useful historic data to quantify insect effects on ring growth.
It follows that it was improper to calculate error envelopes. Those shown in this much-criticized hockey stick paper must have an unscientific element of invention.
http://www.geoffstuff.com/mbh98.jpg
Geoff S

David Stone
February 8, 2020 12:57 am

I found the paper most interesting.
I compare the results of our weather forecasting models to the “climate” ones. Weather has much higher resolution and gives fairly good results 1 to 2 weeks ahead. Beyond this they say little that is very useful, and accuracy falls very rapidly. Temperature expectations become +- several degrees and may well be hugely out. Paths of weather fronts are fairly unpredictable a week in the future. If I reduce the input data sampling to 100km squares the results are useless. Why is a similar model attempt, with even lower resolution, supposed to predict the “climate” many years into the future? One may say climate is better understood, but this is untrue. Huge effort has gone into weather prediction for many years, largely to get the error terms under control. The fact that the climate models do not agree with reality, says that they are useless. Why does anyone still believe anything they say, particularly after reading the analysis above!

Reply to  David Stone
February 8, 2020 8:26 am

David –> I see you quoted “climate”. This is a bugaboo of mine. The earth doesn’t have a climate! It has a myriad of climates that result in biomes. The GCM’s would better be classified as GTM’s, Global Temperature Models.

John Dowser
February 8, 2020 1:09 am

But to criticize any GCM with this logic flies in the face of the reality of the broader CFD (Computational Fluid Dynamics) based design methods, which are used in exactly the same way to do actual engineering and where results from the runs can be tested against reality again and again. They all prove without doubt that errors get squashed, not multiplied or propagated as suggested by reducing the model to something simple.

So my advice is to work with people versed in CFD to explain how the number of solutions is generally limited (steady states), which influences both error rates and uncertainty, certainly in the long run.

Reply to  John Dowser
February 8, 2020 6:15 am

I think your advice is good; however, you talk of users of CFD testing against reality, whereas creators of GCMs do not, and cannot, test their creations against reality. Herein lies the problem.

Testing the global temperature anomaly output from a GCM against a measured!! global temperature anomaly is pointless since Australia could be a little warmer and Canada a little cooler and the measured temperature remains constant. The solution to this is to compare every grid-cell temperature from the model with the corresponding real grid cell in reality.

Impossible and therefore a waste of time, money and intellectual capital.

Clyde Spencer
February 8, 2020 10:20 am

Steve
I have long advocated that instead of distilling all temperature measurements into a single number, the measurements should be aggregated and averaged for each and every Köppen climate zone. That would tell us if warming is uniform (which it almost certainly isn’t), and would make it clear what climate zones are most poorly instrumented, and therefore have the greatest uncertainty, and provide us with a more sensitive measure of the regional changes for those zones that have the highest quality measurements. Yet, climatologists insist on using a single number, and don’t acknowledge the very large annual variance associated with a global average.

Reply to  Clyde Spencer
February 9, 2020 9:06 am

Clyde –> You have hit a nail on the head. Let me add that we should also begin to include humidity levels. Enthalpy is what is important. One can have two biomes with similar temperatures but a massive difference in humidity and consequently heat. Heat is what we should be dealing with, not temperature.

Clyde Spencer
Reply to  Jim Gorman
February 12, 2020 7:44 pm

Jim
Precipitation is taken into account in defining the Köppen climate zones.

February 8, 2020 5:27 pm

For those interested, I found this recent research article by Held, et al. “Structure and Performance of GFDL’s CM4.0 Climate Model” in the Journal of Advances in Modeling Earth Systems, November 2019.

https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2019MS001829

See figure 15 in particular – “Root mean square errors (RMSE) in net, net shortwave, and outgoing longwave radiation (in W/m2) at top‐of‐atmosphere (TOA) for the annual mean and the individual seasons…”

See also figure 4. In a nearby paragraph there is mention of “a uniformly distributed energy‐conservation fix.”

Reply to  David Dibbell
February 9, 2020 8:50 am

David, thanks for spotting that. It sounds very interesting because I would love to get to the bottom of how the energy conservation thing is enacted in practice. However, at the moment I get “This site is currently unavailable. We apologize for the inconvenience while we work to restore access as soon as possible.” I hope I have better luck later.

RJB

Reply to  See - owe to Rich
February 9, 2020 10:42 am

“I would love to get to the bottom of how the energy conservation thing is enacted in practice”
It’s fairly simple. I mentioned it here. You have a whole lot of equations that are meant to conserve energy locally. There are always small errors, which will mostly cancel. But sometimes there is a bias, which leads to a drift in total energy. So you add an equation requiring total energy to be conserved. As good a way as any of enforcing that is to measure the discrepancy and redistribute it as an addition (or removal) of small amounts of energy uniformly. The corrections are much smaller than the local error level, but they counter the bias. Something similar is done with mass, which means the mass of the various components.
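
A toy sketch of that kind of fixer (purely illustrative; this is not code from any actual GCM, and the array sizes and error magnitudes are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_energy_fix(cell_energy, expected_total):
    """Spread the global energy discrepancy evenly over every cell."""
    discrepancy = expected_total - cell_energy.sum()
    return cell_energy + discrepancy / cell_energy.size

cells = np.full(1000, 1.0e9)     # per-cell energy, arbitrary units
expected = cells.sum()           # closed toy system: the total should not drift
for _ in range(10000):
    # small local errors with a slight bias (+1 per cell per step) plus noise
    cells += rng.normal(loc=1.0, scale=100.0, size=cells.size)
    cells = uniform_energy_fix(cells, expected)

print(cells.sum() - expected)    # ~0: the bias-driven drift in the total is removed
```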

Reply to  Nick Stokes
February 9, 2020 1:11 pm

Nick (Feb 9 10:42am): thanks, without calling upon your name I was hoping you might contribute. The extra detail you give here is useful: the discrepancy in energy is redistributed over all the places/variables which relate to energy. But how does a GCM decide what the correct total energy flux should be? It can’t be a fixed amount, slowly increasing each year with added CO2, because then feedback effects from, say, Arctic ice melting would be overruled and become ineffective.

Any clarification on that would be welcome.

Rich.

Reply to  See - owe to Rich
February 9, 2020 1:51 pm

Rich
“But how does a GCM decide what the correct total energy flux should be?”
It doesn’t. That is a different issue. The fixer just ensures that the total energy in the system is conserved. That means that if there has been an outflux, the amount remaining is what it should be (ie after deducting what was emitted). As to what that outflux should be, it is determined by the radiative transfer equations.

Reply to  See - owe to Rich
February 10, 2020 3:07 am

Nick (Feb 9 1:51pm): OK, can we take an example for the flux? Suppose the radiative transfer equations, presumably including albedo and LCF (longwave cloud forcing), say that there is a net influx of 3.142 W/m^2, which may have arisen mostly out of LCF variation, and should correspond to temporary global warming. You are saying, I think, that if the GCM instead adds up to 2.782 W/m^2 then the difference of 0.360 W/m^2 gets ploughed back in to make the RTEs correct. Now, what time interval are we talking about? And more importantly, how will that net 3.142 W/m^2 affect the state of the world on the next time step?

I am assuming that +/-4 W/m^2 cannot randomly add to the overall energy budget, since if it did the GCMs would indeed wander to +/-15K by the end of this century. Does it just add a small amount of energy to the land and oceans?

Rich.

Reply to  See - owe to Rich
February 10, 2020 12:43 pm

Rich
“Now, what time interval are we talking about? And more importantly, how will that net 3.142 W/m^2 affect the state of the world on the next time step?”
The time interval is probably every timestep, which would be 10-30 minutes. The role of LCF or whatever is not special; it just does a general accounting check for global energy. I would expect the discrepancy is much less than you describe.

In terms of global effect, well, we are trying to solve conserved equations, so the effect should be to get it right. The better view is to ask what would be the effect of not correcting. In CFD, energy either runs down and everything goes quiet, or the opposite. Actually in my experience it is more often mass conservation that fails, or that is noticed first.

A worry might be that the added energy is not put in the right place. But the discrepancy correction is a slow process, and the general processes that move energy around (mix) are fast – think weather. So it really doesn’t matter.

Reply to  See - owe to Rich
February 11, 2020 5:09 am

Nick Feb 10 12:43pm:

I should have realized that GCM time steps, or “ticks” as in computer parlance, would be relatively small. Suppose for arithmetic convenience we take there to be 40000 ticks in a year, so a tick is 13.14 minutes. Then extrapolating Pat Frank’s +/-4 W/m^2 per year, taken as gospel truth for the moment, down in scale to 1 tick, gives us +/-4/sqrt(40000) = +/-0.02 W/m^2 per tick.

So, could LCF in a GCM actually randomly walk at +/-0.02 W/m^2 at each tick? If so, after 81 years it would reach +/-36 W/m^2 which I think we’ll agree is a large range.
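
(The arithmetic behind those two numbers, as a tiny script; the 40000 ticks per year and the +/-4 W/m^2 per year are just the assumptions stated above.)

```python
import math

ticks_per_year = 40000
sigma_per_year = 4.0                              # +/-4 W/m^2 per year, taken as given
sigma_per_tick = sigma_per_year / math.sqrt(ticks_per_year)
print(sigma_per_tick)                             # 0.02 W/m^2 per tick

years = 81
n_ticks = years * ticks_per_year
print(sigma_per_tick * math.sqrt(n_ticks))        # 36.0 W/m^2 after 81 years of random walk
```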

Or, is there something constraining how far LCF can go? Lindzen would say “yes, the iris hypothesis” and Eschenbach would say “yes, tropical clouds”. (Or are those short-wave cloud effects rather than long-wave?)

As for whether the +/-4 W/m^2 is gospel truth I’ll study David Dibbell’s link for further information on that.

Rich.

Clyde Spencer
Reply to  Nick Stokes
February 12, 2020 7:54 pm

Stokes
I am reminded of the TV show with Neil deGrasse Tyson where he was trying to illustrate the difference between climate and weather by walking a dog on a long leash on a beach. Actually, he had it backwards because where the dog (weather) could go was controlled by Tyson (climate) and the length of the leash. If the dog had been free to run where it wanted, and Tyson had to chase after the dog, it would have been a better analogy.

However, adjusting the GCMs to conserve energy is a bit like trying to keep the dog from breaking the leash and making Tyson chase after it.

Reply to  See - owe to Rich
February 9, 2020 10:51 am

Dr. Booth, I checked just now and the link works. But just in case there is some other reason it is not working for you, here is a link to a pdf of the article.
https://www.dropbox.com/s/iat07c0369paba1/Held_et_al-2019-Journal_of_Advances_in_Modeling_Earth_Systems.pdf?dl=0

Best to you.
DD
P.S. – Pat Frank’s analysis and conclusions about uncertainty and reliability seem intuitive to me, even while model outputs remain stable. In this new-and-improved climate model, from figure 15, the RMSE of the annual mean of the outgoing longwave TOA is about 6 W/m^2, over a hundred times what would be required for the model to “see” the reported or projected annual increase in anthropogenic forcing.

February 10, 2020 1:11 pm

Let me offer this comment on error and the central limit theory. Taking samples of a population and using the central limit theory to determine a more and more accurate value for the mean does *NOT*, and let me emphasize the *NOT*, in any way affect the variance and standard deviation of the overall population. The variance and standard deviation of the population will remain the same no matter how many times you sample the population and use the central limit theory to calculate a more accurate mean.

If that overall population is made up of data points that have their own, individual variance and standard deviation then those variances add directly to determine the overall variance. No amount of calculating a more accurate mean using the central limit theory will change that simple fact.

If you have a linear relationship of y = x1 + x2 + x3, and each of these contributing factors are independent random variables with individual standard deviations and variances then the variance of y is the sum of the variances of x1, x2, and x3. There is no dividing by the population size or multiplying by the population size or anything like that. The variances simply add. Then you take the square root of the sum of the variances to determine the standard deviation of y.
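
A quick numerical check of that rule (a sketch only, using simulated independent normal variables; the standard deviations here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
x1 = rng.normal(0.0, 2.0, n)   # variance 4
x2 = rng.normal(0.0, 3.0, n)   # variance 9
x3 = rng.normal(0.0, 6.0, n)   # variance 36
y = x1 + x2 + x3

print(y.var())   # ~49 = 4 + 9 + 36: the variances add
print(y.std())   # ~7  = sqrt(49): root-sum-square of the standard deviations
```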

The ten ruler examples of Dr. Booth are not combining things with their own variance and standard deviation. Each ruler is a specific length whether you know exactly what it is or not. You can then use standard statistical methods to determine the variance and standard deviation for that population of rulers. But this hypothetical really doesn’t have much to do with adding variances of independent random variables.

If the standard deviation of each individual random variable is a measure of uncertainty, error, or a combination of both, then when combined those standard deviations will become a Root-Sum-Square of the associated variances.

No amount of sampling or central limit theory applications can change the variance or standard deviation of each individual random variable or the standard deviation of the combined population. The sampling and central limit theory can only give you a more accurate calculation for the mean but that mean will still be subject to the same uncertainty or error calculated by the sum of the variances of each member of the population.

This is what Pat Frank tried to show with his calculations and which so many people seem to keep getting confused about. You can get your mean calculation as accurate as you want but it won’t change the uncertainty or error associated with the combined populations. It doesn’t matter if you have multiple inputs of independent random variables with individual standard deviations or if you have an iterative process where each step provides an individual output with a standard deviation that becomes the input to the next iteration. The combination of these will still see a Root-Sum-Square increase in the overall standard deviation.

Now, let me comment on the usefulness of calculating an arbitrarily precise mean. In the real world this is a waste of time. In the real world the mean simply can’t have any more significant digits than the inputs used to calculate the mean. In the real world there is no reason to have any more significant digits in the calculated mean than the significant digits used to calculate the mean. Take, for example, a carpenter picking 8’x2″x4″ boards from a pile of 1000 to use in building a stud wall for an inside wall of a house. There is simply no use in calculating a mean out to 10 digits when he can, at best, measure only to the nearest tenth of an inch. It’s a useless exercise. In fact, the mean is a useless number to him other than in finding the pile with 8’x2″x4″ boards. He will have to sort through the pile until he finds enough members that will fit his stud wall. If he doesn’t pay attention and gets some that are too short and tries to use them then he will wind up with wavy drywall in the ceiling and perhaps even cracked drywall somewhere down the timeline. If he gets some too long then he will waste wood in cutting it to the proper length.

If you think about it enough, this applies to the global average annual temperature record as well. You can calculate that mean to any precision you want, but it won’t matter if you can’t measure any individual temperature average to that precision. And the uncertainty in each individual average used in the global average carries over into the global average by the rule of root-sum-square.

No amount of statistical finagling will change the fact that models like the GCM’s have uncertainty and that the total uncertainty is the root-sum-square of all the factors making up the GCM output.

February 10, 2020 2:18 pm

Tim, I mostly agree with what you say. But one has to be very careful about how the measurements M_i are combined. In my 10 1-ft rulers example (based on an old one of yours) it is the sum of the M_i which matters, not the mean. In your building of a stud wall, I take it that the boards are being erected parallel to each other. Again, the mean of the board lengths is irrelevant. But approximate sorting of the boards by length would at least allow the carpenter to choose a sample with somewhat similar lengths, which might be practically more important than the actual length.

But taking a global average temperature does use a mean, M* = sum_1^n M_i/n. And now the uncertainty (that is, standard deviation of the error) of M* is the quotient by sqrt(n) of the uncertainty of each M_i, so if n is large more significant digits can certainly be quoted for it than for the individual measurements.
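
A minimal simulated check of that claim (assuming, purely for illustration, n = 100 independent measurements each with standard uncertainty 0.5):

```python
import numpy as np

rng = np.random.default_rng(2)
u, n, trials = 0.5, 100, 50_000

errors = rng.normal(0.0, u, size=(trials, n))
print(errors.sum(axis=1).std())    # ~ u*sqrt(n) = 5.0: uncertainty of the sum grows
print(errors.mean(axis=1).std())   # ~ u/sqrt(n) = 0.05: uncertainty of the mean shrinks
```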

Rich.

Reply to  See - owe to Rich
February 10, 2020 4:45 pm

“But taking a global average temperature does use a mean, M* = sum_1^n M_i/n. And now the uncertainty (that is, standard deviation of the error) of M* is the quotient by sqrt(n) of the uncertainty of each M_i, so if n is large more significant digits can certainly be quoted for it than for the individual measurements”

The mean is made up of individual members whose measurements can only have so many significant digits because of the resolution of the measuring device. Trying to calculate a mean with more significant digits than the individual members of the population is a waste of time. You’ll never be able to confirm if any member of the population is actually of the length you calculate for the mean.

This is why claiming that year X is 0.01deg or 0.001deg hotter than year Y is meaningless when your temperature data is only good out to the tenth of a degree (and that is a *stretch*). You simply cannot realistically gain significant digits in the mean. That’s a pipe dream of mathematicians and computer programmers. You simply cannot average 10.1deg and 10.4deg and say that the mean is 10.25 deg. You have absolutely no way to know what the mean is past the tenth of a degree. That 10.25deg has to be rounded using whatever rules you want to follow. It will either be 10.2deg or 10.3deg. You cannot artificially gain precision by using an average, not in the real world.

“And now the uncertainty (that is, standard deviation of the error) of M* is the quotient by sqrt(n) of the uncertainty of each M_i”

What is your “n” variable? The population size? If so then this is wrong. The variances, i.e. the uncertainty, of each M_i simply add. There is no sqrt(n) quotient involved. The standard deviation becomes the sqrt(u_1 + u_2 + u_3 …. + u_n)

I think you are stuck in the rut of trying to calculate the variance of a population which is the [Sum(x_i – x_avg)^2]/n. This is true when the value of each member of the population is assumed to be perfectly accurate and you want to know the variance of the population. This is how you use the central limit theory to calculate the mean of a population more and more accurately, except what you are doing is actually finding the standard deviation around the mean. The more samples you take the smaller that standard deviation becomes. But that doesn’t decrease the standard deviation of the population itself, it only tells you how accurate your calculation of the mean is.

But if you are combining independent random variables, i.e. each with their own standard deviation and variance, then you simply add the variances of each independent random variable to get the variance of the combination. And the individual temperatures you are combining to get the annual average global temperature are actually individual random variables each with their own standard deviation and variance. The standard deviation and variance of those random variables are what makes up the uncertainty associated with each. You may choose a value for each individual random variable to use in calculating a mean of the combination but that in no way lessens the uncertainty you wind up with.

Take a pile of 1000 8’x2″x4″ boards. Assume you have a 100% accurate measurement device. Give it to a worker and have him measure each and every board. You can then take those measurements and calculate a mean. You can then use that mean and those measurements to determine the variance and standard deviation associated with that pile. Now you get another pile of 1000 8’x2″x4″ boards from a different supplier. You go through the process of determining the variance and standard deviation of the new pile.

Now have your forklift operator combine the two piles. How do you calculate the variance and standard deviation for the combined pile? You simply add the variances of the two individual piles. The square root of that gives your standard deviation. var(y) = var(pile1) + var(pile2).

It’s been 50 years since I had probability and statistics while getting my engineering degree but I’m pretty sure combining the variances of independent random variables hasn’t changed since then.

February 11, 2020 4:27 am

Tim Feb 10 4:45pm:

First, I’ll repeat my comment from Feb 10 1:54pm:

“The JCGM does not say anything about measurements having to be from a normal distribution in order for their mean and standard deviation to be useful. Sections 4.2.2 and 4.2.3 define s^2(q_k) and s^2(q*) as the variances of the observations and of the mean (I am using q* as more convenient here than q with a bar on top). If there are n observations then s^2(q*) = s^2(q_k)/n, and the (standard) uncertainty of the mean is defined to be u(q*) = s(q*), which does decrease (on average) as n grows, in a sqrt(1/n) fashion.”

Next I’ll make more specific points in response.

1. The above means that the Central Limit Theorem, which covers convergence of sums of i.i.d. variables to normality, is not the correct thing to quote regarding the standard deviations of sums and means.

2. I agree with your statement “The more samples you take the smaller that standard deviation becomes. But that doesn’t decrease the standard deviation of the population itself, it only tells you how accurate your calculation of the mean is. ”

3. “Standard uncertainty” is equivalent to standard deviation. In any specific case the question is what is the variable of interest. If it is a sample value, then the uncertainty is s(q_k) which does not decrease with the number n of samples taken. If it is the mean value, then the uncertainty is s(q*) which does decrease, like 1/sqrt(n), with the number of samples.

4. In the case of global mean temperatures, they are not i.i.d, because q_k depends on the location and time of the measurement. Nevertheless, it is a not unreasonable assumption that each q_k is some value m_k plus an error term e_k where all the e_k’s ARE i.i.d. In this case the uncertainties rest in the identical distributions of the e_k’s, and s(q*) = s(e*) so the uncertainty of q* again diminishes with sample size.

5. So yes, if you “are combining independent random variables, i.e. each with their own standard deviation and variance, then you simply add the variances of each independent random variable to get the variance of the combination” then the assertion is true provided that the combination is summation. But if it is averaging, then by dividing that sum by n the variance is then divided by n^2, and that is what gives the reduction in the uncertainty of the mean.

Rich.

Reply to  See - owe to Rich
February 11, 2020 5:53 am

Rich,

You seem to recognize what the issue is in your No. 2 statement but then go on to ignore it. A more accurate calculation of the mean does *not* reduce the variance and standard deviation of the population. And it is the variance and standard deviation of the population that determines the uncertainty. It truly is that simple.

As I tried to point out with the two piles of 2″x4″‘s, when you combine them the overall variance is the sum of their variances. No amount of calculating the mean of the combination more accurately will change that combined variance. And it is that combined variance that determines the uncertainty associated with that more accurate mean.

It simply doesn’t matter how accurately you calculate the mean of a population. That doesn’t make it the “true value” in any way, shape, or form thus it doesn’t decrease the population uncertainty surrounding that mean as defined by the standard deviation of the population.

Averaging the means of independent random variables does not reduce the variance and standard deviation of the combination of those independent random variables. It only determines the mean of the means. You can reduce the variance of your calculation of the mean, i.e. you can make it more accurate but you don’t reduce the variance of the population.

Let me repeat one more time: if the standard deviation of the population is +/- u, then calculating the mean more accurately won’t change that standard deviation of the population to +/- (u/n). And it is +/- u that is the uncertainty. Standard-error-of-the-mean is not the same thing as the uncertainty of the population.

If you are measuring one thing with one device then you can tabulate the data points and use the central limit theorem and calculate a more accurate measurement of that one thing. But you are not combining independent random variables when you do this. You *are* combining independent random variables when you calculate the mean of numerous temperature means which each have their own standard deviation. It’s why uncertainty propagates as the root-sum-square – because that uncertainty tells you that you have a random variable, not an individual measurement that represents a true value.

Reply to  Tim Gorman
February 11, 2020 6:59 am

Again I am in agreement with much of what you say, in particular your “Let me repeat one more time: if the standard deviation of the population is +/- u, then calculating the mean more accurately won’t change that standard deviation of the population to +/- (u/n). And it is +/- u that is the uncertainty. Standard-error-of-the-mean is not the same thing as the uncertainty of the population.”

But you have ignored my statement 5: it does matter how you “combine” the data points (and please, don’t quote the CLT again as it’s not relevant). Suppose I have uncorrelated variables M_1,…,M_n each with uncertainty +/-u_i (not yet assuming those are all equal). What is the uncertainty of f(M_1,…,M_n)? It is, by Equation (10) in the JCGM 5.1.2,

sqrt( sum_{i=1}^n (df/dM_i)^2 u_i^2 )

Now let f(M_1,…,M_n) = (M_1+…+M_n)/n, so df/dM_i = 1/n and you have uncertainty of the mean is

sqrt( sum u_i^2/n^2 ) = u_1/sqrt(n) if all the u_i’s are equal.
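
For anyone who wants to play with that Equation (10) rule directly, here is a generic sketch using numerical partial derivatives (the function and numbers are invented for illustration):

```python
import numpy as np

def combined_uncertainty(f, M, u, h=1e-6):
    """JCGM-style propagation: sqrt( sum_i (df/dM_i)^2 * u_i^2 ),
    with each df/dM_i estimated by a central difference."""
    M = np.asarray(M, dtype=float)
    u = np.asarray(u, dtype=float)
    grads = np.empty_like(M)
    for i in range(M.size):
        step = np.zeros_like(M)
        step[i] = h
        grads[i] = (f(M + step) - f(M - step)) / (2 * h)
    return np.sqrt(np.sum(grads**2 * u**2))

M = np.array([10.1, 10.4, 9.8, 10.0])   # four measurements (hypothetical values)
u = np.array([0.5, 0.5, 0.5, 0.5])      # each with standard uncertainty 0.5

print(combined_uncertainty(lambda m: m.sum(),  M, u))   # 1.0  = 0.5*sqrt(4) for the sum
print(combined_uncertainty(lambda m: m.mean(), M, u))   # 0.25 = 0.5/sqrt(4) for the mean
```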

If the mean of a set of measurements is the value of interest, as with the Global Average Surface Temperature, then its uncertainty does decrease as n increases.

Yes, we agree that the population uncertainty does not decrease, but often (not so much in the case of planks) we are actually interested in the population mean.

Rich.

Reply to  See - owe to Rich
February 11, 2020 8:39 am

Rich, “If the mean of a set of measurements is the value of interest, as with the Global Average Surface Temperature, then its uncertainty does decrease as n increases.”

Except that field calibration experiments show that the error distribution around each temperature measurement is not normal. iid statistics do not apply.

Let’s also note that the uncertainty in a measurement mean cannot be less than the resolution limit of the instruments, no matter how many individual measurements go into the mean. Likewise regarding the resolution limit of a physical model.

I have published (869.8 KB pdf) on the problem of temperature measurement error, and will have much more to say about it in the future.

Also here.

The people who compile the GASAT record are as negligent about accounting for error as are climate modelers.

Reply to  Pat Frank
February 11, 2020 8:40 am

Sorry about the html error

Reply to  Pat Frank
February 11, 2020 11:19 am

“the uncertainty in a measurement mean cannot be less than the resolution limit of the instruments”

The mathematics says it can, and will be, less than that, if the instrumental errors are not correlated and there are sufficiently many measurements. u_1/sqrt(n), derived from the JCGM, does decrease with n.

Rich.

Reply to  Pat Frank
February 11, 2020 2:24 pm

Rich,

“The mathematics says it can, and will be, less than that, if the instrumental errors are not correlated and there are sufficiently many measurements. u_1/sqrt(n), derived from the JCGM, does decrease with n.”

Once again you are trying to justify using the central limit theorem to say that you can increase the resolution of an instrument. You can’t!

I tried to explain that to you with one of my examples with 1000 8’x2″x4″ boards. If I can only measure the boards to a resolution of 1/8″, then no matter how many measurements I take and average together, it cannot, in the real world at least, give me a resolution finer than 1/8″. I don’t care how many digits you calculate the mean out to, I will never be able to find a board of that resolution because I simply can’t measure with that resolution! If I tell you a board is 7′ 11 7/8″ long how do you know if it is 7′ 11 13/16″ long or 7′ 11 15/16″ long? You can use it to calculate the mean down to the 1/16″, the 1/32″, or the 1/64″, but you don’t decrease the uncertainty in any way because the measurements I gave you don’t support anything past 1/8″. Anything past that will remain forever uncertain.

It violates both the rules of significant digits as well as the rules for uncertainty.

Reply to  Pat Frank
February 12, 2020 5:07 am

Tim Feb11 2:24pm

Tim, I apologize. I had failed to distinguish between two cases.

The first case is repeated measurements of the same variable (measurand) under apparently identical conditions. Then, as you say, it is true that the mean squared error cannot be driven down below a certain limit related to the resolution, no matter how many samples are taken.

The second case is measurements of different variables, one each, for example temperatures at different places and times. In this case I believe that the rounding errors can partially cancel out, so while the error of the sum of them increases, the error of the mean of them decreases.
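
Here is the kind of simulation I have in mind for the second case (a sketch, assuming the true lengths are spread smoothly relative to a 1/8-inch reading resolution):

```python
import numpy as np

rng = np.random.default_rng(3)
res = 0.125                                    # 1/8 inch reading resolution
true = rng.uniform(95.0, 96.5, size=10_000)    # different boards, different true lengths
measured = np.round(true / res) * res          # every reading quantised to 1/8 inch

print(measured.mean() - true.mean())   # error of the mean: far below the 1/8" resolution
print(measured.sum() - true.sum())     # error of the sum: grows roughly like sqrt(n)
```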

I have been working on this with the formulation in my Section F, and I intend to share some results when I have more time.

Rich.

Reply to  See - owe to Rich
February 11, 2020 1:54 pm

Rich,

“Now let f(M_1,…,M_n) = (M_1+…+M_n)/n, so df/dM_i = 1/n and you have uncertainty of the mean is”

Once again you are trying to equate the uncertainty of the mean with the standard deviation of the population.

M_1 … M_n are not random variables with a probability distribution function. They are uncertainty intervals. You keep trying to formulate them as random variables where you can calculate the most probable outcome by finding the mean with less and less standard error of the mean.

“If the mean of a set of measurements is the value of interest, as with the Global Average Surface Temperature, then its uncertainty does decrease as n increases.”

The mean of the set of measurements is *NOT* the only value of interest since it is *not* the most likely value in a probability distribution function. The other value of interest is the uncertainty of that set of measurements. And that uncertainty interval does not go down by 1/sqrt(n).

February 11, 2020 6:35 am

Re Nick Stokes recent and David Dibbell Feb 8 5:27pm:

OK, I’ve looked at David’s reference, and here’s a thing. David says “See figure 15 in particular – “Root mean square errors (RMSE) in net, net shortwave, and outgoing longwave radiation (in W/m2) at top‐of‐atmosphere (TOA) for the annual mean and the individual seasons…”. But in the text above that figure it says “Figure 15 shows the RMS biases in the net TOA fluxes…”. Now “biases” is an interesting operative word. Recall that in my Section E I wrote “The real error statistic of interest is E[(M-X)^2] = … Var[M] + b^2”, and b is the bias E[M]-X. So b may be a significant player in the data portrayed by Figure 15.

If so, suppose an annual RMSE of 4 W/m^2, i.e. an MSE of 16 W^2/m^4, consists of 15 for the squared bias and 1 for the variance Var[M]. Then, since bias is not a random component leading to a random walk through time, but uncertainty = sqrt(Var[M]) is, the uncertainty which gets propagated is now +/-1 W/m^2 not 4, much lower than in Pat Frank’s paper. We cannot tell from Figure 15 what proportion of the RMSE is contributed by the bias, but perhaps some GCM aficionado could root out the underlying data to find out. This seems important to me.
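
In numbers, using the hypothetical split above:

```python
import math

rmse = 4.0                 # annual calibration RMSE, W/m^2
mse = rmse**2              # 16 W^2/m^4
bias_sq = 15.0             # hypothetical: most of the MSE is a fixed bias
var = mse - bias_sq        # leaves Var[M] = 1
print(math.sqrt(var))      # +/-1 W/m^2 would be the part that propagates, not +/-4
```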

Rich.

Reply to  See - owe to Rich
February 11, 2020 8:27 am

How do you know what the TOA biases are in the simulation of a future climate, Rich?

If you don’t know the discrete TOA biases in every step of the simulated future climate — and you don’t — then how do you estimate the reliability of the predicted climate?

Reply to  Pat Frank
February 11, 2020 11:12 am

Pat, to be honest I’m just worrying about the biases and uncertainties in the past (i.e. calibration runs) for now. When I understand that, I’ll address the future. But one possibility would be that the biases only change slowly. Who knows, without the data?

Rich.

Reply to  See - owe to Rich
February 11, 2020 7:43 pm

The air temperature measurement bias errors change pretty much across every day, Rich, because wind speed and irradiance vary within every day, and between days.

Take a look at K. G. Hubbard and X. Lin, Realtime data filtering models for air temperature measurements. Geophys. Res. Lett., 2002. 29(10): p. 1425 1-4;
doi: 10.1029/2001GL013191.

Monthly means are badly ridden with systematic error, and no one knows the magnitude of the biases for any of the individual temperature measurements that go into a mean.

H&L 2002 above show the MMTS sensor measurements average about ±0.34 C of non-normally distributed error — and that’s for a well-maintained and calibrated sensor operating under ideal field conditions.

Typically, the measurement errors are much larger than any random jitter (which typically arises in the electronics and wiring of a modern sensor), the magnitude of which (typically ±0.1-0.2 C) can be determined in the lab.

With respect to GCMs, we know the uncertainties from the past because calibration runs are available. Those uncertainties that arise from within the models signify errors that are injected into simulations of the future climate. In every single step of a simulation.

There’s no valid ignoring of them, or assuming them away, or wishing them away.

Reply to  See - owe to Rich
February 11, 2020 8:29 am

Rich,

“Then, since bias is not a random component leading to a random walk through time, but uncertainty = sqrt(Var[M]) is”

What makes you think uncertainty creates a random walk? I still don’t think we have a common understanding of what uncertainty is. That is probably my fault. While I often use the terminology of a random variable to demonstrate how to handle combinations of values with an uncertainty interval, that doesn’t mean uncertainty *is* a random variable.

Random variables are typically defined as having a population whose members can take on different values. A frequency plot of how often those different values occur creates your probability distribution function, i.e. it’s what defines a normal distribution, a poisson distribution, etc. That probability distribution function will have a standard deviation and variance associated with it. A random variable can create a random walk, simply by definition. The random variable will create values around the mean.

An uncertainty interval is not a probability distribution function. In no way does the uncertainty interval try to define the probability of any specific value occurring in the population. It merely says the true value will probably be somewhere in the interval. You may have a nominal value associated with what you are discussing but that is not a mean which is defined as being the most likely value to be found in a probability distribution function.

An uncertainty interval around a nominal value cannot create a random walk since it doesn’t define any specific values or the probability of those specific values happening. A narrow uncertainty interval tells you that the nominal value is close to the true value. A wide uncertainty interval tells you that the nominal value is questionable. But neither tells you what the true value is (unless the uncertainty interval is zero).

A thermometer with an uncertainty of X +/- u where u is small, e.g. 70deg +/- 0.001deg, is pretty accurate and probably gives a good representation of the true value. A thermometer with an uncertainty of X +/- v where v is large, e.g. 72deg +/- 0.5deg, gives a temperature far more questionable. But neither uncertainty interval tells you anything about what the probability of any specific value might be. Therefore neither can generate a random walk. The only way the nominal value can be the true value is if u or v is equal to zero.

Since an +/- uncertainty interval about a nominal value looks exactly like the +/- standard deviation about a mean, the general rule is to treat them the same. When combining values with independent uncertainties you add them (root-sum-square) just like you add variances of random variables.

Using the example above of two different thermometers you can certainly average the two nominal values and get 71deg. But the uncertainty becomes sqrt( u^2 + v^2) or +/- 0.500001. The uncertainty will never go down, it will only go up. If you have ten thermometers with an uncertainty of +/- 0.5deg then when combined the uncertainty (root-sum-square) becomes sqrt(10 * .25) = +/- 1.58deg.

This is why trying to calculate the mean out to an arbitrary number of digits is useless. The uncertainty will overwhelm whatever difference you think you are calculating. Suppose you have 10 thermometers with an uncertainty of 0.001. The uncertainty of the combination becomes sqrt(10 * 1e-06) = +/- 0.003deg. If your difference from one year to another is less than 0.003deg then you simply don’t know if that difference is real or not. If you are combining 1000 thermometers with an uncertainty of +/- 0.01 then the combined uncertainty becomes sqrt(1000 * 1e-04) = +/- 0.32deg. Any difference from one period to another that is less than 0.32deg is questionable.

How many thermometers around the world have an uncertainty of +/- 0.001deg? +/- 0.01deg?

It’s why I say the global annual average temperature is pretty much a joke. You never get an uncertainty interval given for that global annual average temperature but it’s going to be huge. It gets even more humorous when you are trying to compare to a base consisting of records from the late 19th century and early 20th century. It doesn’t matter how accurately you calculate the mean of those nominal values having an uncertainty, you can’t decrease the uncertainty.

Reply to  Tim Gorman
February 11, 2020 11:08 am

Tim Feb 11 8:29am: I did mention random walks in my Section H, but should probably have included it in Section B too. From that section:

(1) W(t) = (1-a)W(t-1) + R1(t) + R2(t) + R3(t) where 0 ≤ a ≤ 1

I am happy to take a=0 for now. This white box model of how a black box GCM works is iterative, with random errors around the means of R_i(t). The standard deviations of those errors are the (standard) uncertainties. The model can be run with Monte Carlo values for those errors, and the evolution of W(t) has an uncertainty, i.e. standard deviation of its output over many runs, which grows proportionally to sqrt(t), as Pat has been maintaining (though he has just referred to uncertainty rather than to Monte Carlo outputs).
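
To make that concrete, here is a Monte Carlo sketch of equation (1) with a = 0 (the standard deviations of the R_i are invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
runs, steps = 2000, 400
sigmas = (1.0, 0.5, 0.25)      # assumed std devs of R1, R2, R3 (illustrative only)
a = 0.0                        # take a = 0, as above

W = np.zeros((runs, steps + 1))
for t in range(1, steps + 1):
    R = sum(rng.normal(0.0, s, runs) for s in sigmas)
    W[:, t] = (1 - a) * W[:, t - 1] + R

step_sd = np.sqrt(sum(s**2 for s in sigmas))       # per-step (standard) uncertainty
for t in (100, 400):
    print(t, W[:, t].std(), step_sd * np.sqrt(t))  # simulated spread vs sqrt(t) growth
```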

HTH, Rich.

Reply to  See - owe to Rich
February 11, 2020 2:13 pm

Rich,

You are still conflating random errors with uncertainty. Random errors imply a probability distribution function, uncertainty does not.

If I tell you that a temperature measurement is 72deg with an uncertainty of +/- 0.5deg, exactly what makes you think there is a probability distribution function associated with the uncertainty? All I am telling you is that I am uncertain what the true value is. That implies no gaussian, poisson, or any other kind of probability distribution function. I don’t know if the nominal value of 72 is the most likely value of a distribution nor do I know if it is the true value. When I combine that measurement with another measurement from a completely independent thermometer that has its own uncertainty range then calculating the mean of the two, to any number of digits you like, *still* doesn’t tell me that the mean that is calculated is the true value either. All you can say is that the true value lies somewhere in the combination of the two uncertainty intervals, i.e. sqrt(u_1^2 + u_2^2). That uncertainty interval certainly can’t be decreased merely by dividing by the sqrt(2).

Again, as you keep agreeing, the standard error of the mean is meaningless when applied to the entire population.

I’m not sure why you think Monte Carlo runs will help anything. If the GCM output is determinative and linear, which it apparently is, then varying the inputs will only give you a sensitivity measurement for the model; it won’t help define an uncertainty interval.

If the model is *not* determinative and linear, and the output can vary over several runs for the same inputs, then the model has a basic uncertainty built into it that has to be added into any uncertainty calculations.

Reply to  Tim Gorman
February 12, 2020 6:50 am

Tim Feb11 2:13pm

I’ll address your points (prepended by T:) individually.

T: You are still conflating random errors with uncertainty. Random errors imply a probability distribution function, uncertainty does not.

Please read my Section E again for my position on uncertainty, derived from the JCGM. It is a measure of dispersion of the measurement M, which is the same as the dispersion of the error M-X (where X is the true value). “Standard” uncertainty takes the standard deviation of M, or M-X, as the value for the uncertainty. This implies that there is indeed a probability distribution function underlying, because otherwise the s.d. can’t be calculated. Moreover, without s.d.’s, there is no mathematics to justify the addition rule of independent uncertainties. We don’t necessarily know the probability distribution, but we can try different examples and see what the implications are. That is what I do in Section F.

T: If I tell you that a temperature measurement is 72deg with an uncertainty of +/- 0.5deg, exactly what makes you think there is a probability distribution function associated with the uncertainty? All I am telling you is that I am uncertain what the true value is. That implies no gaussian, poisson, or any other kind of probability distribution function. I don’t know if the nominal value of 72 is the most likely value of a distribution nor do I know if it is the true value. When I combine that measurement with another measurement from a completely independent thermometer that has its own uncertainty range then calculating the mean of the two, to any number of digits you like, *still* doesn’t tell me that the mean that is calculated is the true value either. All you can say is that the true value lies somewhere in the combination of the two uncertainty intervals, i.e. sqrt(u_1^2 + u_2^2). That uncertainty interval certainly can’t be decreased merely by dividing by the sqrt(2).

See my previous answer. And if you have written +/-0.5 without further clarification, other users of the JCGM will take that to be standard uncertainty, i.e. an s.d. of 0.5. Saying +/-0.5 doesn’t just say that you are uncertain, but by how much. If you actually thought the error was equally likely to be anywhere in (-0.5,+0.5) then the s.d. is 1/sqrt(12) = 0.289 and you will have been misleading people who thought you meant standard uncertainty.

T: Again, as you keep agreeing, the standard error of the mean is meaningless when applied to the entire population.

It’s good we agree on something!

T: I’m not sure why you think Monte Carlo runs will help anything. If the GCM output is determinative and linear, which it apparently is, then varying the inputs will only give you a sensitivity measurement for the model; it won’t help define an uncertainty interval.
If the model is *not* determinative and linear, and the output can vary over several runs for the same inputs, then the model has a basic uncertainty built into it that has to be added into any uncertainty calculations.

I agree with the “determinative”, or “deterministic” as I would say, but can you explain in what respect the GCM output is linear, and is that important? I believe that small perturbations to the GCM initial conditions lead, via chaos theory, to very different outputs, which are best treated statistically, and that a good emulator for the GCM should come somewhere near to matching those statistics. But I’ll admit to +/-1 million neurons’ uncertainty on this issue!

Reply to  Tim Gorman
February 12, 2020 7:34 am

Rich, “This implies that there is indeed a probability distribution function underlying, because otherwise the s.d. can’t be calculated. Moreover, without s.d.’s, there is no mathematics to justify the addition rule of independent uncertainties.”

Empirical standard deviations are calculated regularly in the experimental sciences (and engineering), without too much concern about whether all the statistical ducks are in line, Rich. The reason is because empirical error SDs give a useful estimate of the reliability of a result.

Mere calculation of an empirical SD says nothing about the error distribution. It certainly does not imply a normal distribution.

Likewise the propagation of calibration error to yield an uncertainty envelope. The underlying statistical iid assumptions are typically not met, but the empirical approach nevertheless yields a useful estimate of reliability.

I’ve finally had time for a more detailed look at your analysis, Rich. So far, it lacks coherence. Your emulator is bereft of circumstantial relevance.

Reply to  Tim Gorman
February 12, 2020 12:09 pm

Rich,

“This implies that there is indeed an underlying probability distribution function, because otherwise the s.d. can’t be calculated.”

“Moreover, without s.d.’s, there is no mathematics to justify the addition rule of independent uncertainties.”

What happens if you consider the uncertainty interval to be a uniform continuous probability function with every point in the uncertainty interval having the same probability? In this case the mean (a+b)/2 has the same probability of being the true value as any other point in the distribution.

Thus the variance of each becomes (b-a)^2/12. When combining the two the variance becomes 2(b-a)^2/12 or twice the variance of each. The standard deviation becomes sqrt(2) times the individual standard deviation. If you combine three individual independent temperatures with the same uncertainty interval you get sqrt(3) times the individual intervals as being the combined intervals. And again, your calculated combined mean has no more chance of being the true value than any other point in the interval.

This is no different than what Pat Frank came up with: root-sum-square of an uncertainty interval when used iteratively over and over.
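Here is a minimal Python sketch of that scaling, assuming (purely for illustration) uniform errors of half-width 0.5: the standard deviation of the sum of n such independent errors grows as sqrt(n) times the individual standard deviation.

# Root-sum-square scaling: the s.d. of a sum of n independent uniform errors
# is sqrt(n) times the s.d. of one of them.
import numpy as np

rng = np.random.default_rng(1)                 # arbitrary seed
single_sd = 1 / np.sqrt(12)                    # s.d. of one uniform(-0.5, +0.5) error
for n in (2, 3, 10):
    sums = rng.uniform(-0.5, 0.5, (1_000_000, n)).sum(axis=1)
    print(n, round(sums.std(), 4), round(np.sqrt(n) * single_sd, 4))  # two columns agree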

In fact, I would offer that the reality is even worse than this. If you try to average two temperatures, 60deg +/- 0.5deg and 72deg +/- 0.5deg, you come up with a far different calculation.

S_c^2 = { (b-a)[S_1^2 + (X_1 - X_c)^2] + (b-a)[S_2^2 + (X_2 - X_c)^2] } / 2(b-a)

b-a = 1 (+/- 0.5)
X_c = (60+72)/2 = 66
X_1 = 60, X_2 = 72
S_1 = S_2 = 0.5
S_1^2 = S_2^2 = 0.25

Factor out (b-a) and you get [ (.25+36) + (.25+36) ] / 2 = 72.5/2 = 36.25 = S_c^2
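A short Python check of that arithmetic; the Monte Carlo part treats each reading as a normal spread of s.d. 0.5, which is an assumption made purely for illustration:

# Combined variance of two equally weighted readings, 60 +/- 0.5 and 72 +/- 0.5.
import numpy as np

S1 = S2 = 0.5
X1, X2 = 60.0, 72.0
Xc = (X1 + X2) / 2                             # 66
Sc2 = ((S1**2 + (X1 - Xc)**2) + (S2**2 + (X2 - Xc)**2)) / 2
print(Sc2)                                     # 36.25, i.e. s.d. ~ 6.02

rng = np.random.default_rng(2)                 # arbitrary seed
pooled = np.concatenate([rng.normal(X1, S1, 500_000),
                         rng.normal(X2, S2, 500_000)])
print(pooled.var())                            # ~36.25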

Consider, I go to Africa and measure the heights of 1000 pygmies. I use a yardstick with a resolution of 1/4″. I note with each measurement whether the subject was slouching, was standing flat footed, or was standing on tip-toe. At the end I have 300 subjects that were slouching, 200 that were on tip-toe, and 500 that were flat footed. I then calculate the mean for all of the recorded heights. Just how sure can you be that the calculated mean is *really* the true mean when the actual heights of 500 of the 1000 subjects are questionable?

Now, I do the same thing for 1000 Watusis. Then I combine the data from the two populations and calculate a new mean. Does the variance and standard deviation of the combined population increase or decrease? Does the uncertainty of the true value of the mean increase or decrease?

Does the mean of the combined data actually tell you anything? If I order 2000 pairs of pants sized to the mean of the combined data, just how many of the subjects will those pants actually fit? Now make the pygmies your minimum temperatures and the Watusis your maximum temperatures. Does the mean of those actually tell you anything? Or is it about as useless as the pants ordered above?

“See my previous answer. And if you have written +/-0.5 without further clarification, other users of the JCGM will take that to be standard uncertainty, i.e. an s.d. of 0.5. Saying +/-0.5 doesn’t just say that you are uncertain, it says by how much. If you actually thought the error was equally likely to be anywhere in (-0.5,+0.5) then the s.d. is 1/sqrt(12) = 0.289, and you will have been misleading people who thought you meant standard uncertainty.”

The standard deviation is not 1/sqrt(12), it is sqrt[ (b-a)^2/12 ]. See above. For the case where the interval is +/- 0.5 the two coincide, because (b-a) = 1 and 1^2 = 1. If the interval were +/- 0.4 you would have a totally different situation. You would also have a totally different situation if each measurement device had a different uncertainty interval.

“I agree with the “determinative”, or “deterministic” as I would say, but can you explain in what respect the GCM output is linear, and is that important?”

The general case of combining the variances of two or more inputs is:

(S_y)^2 = Sum [ (df/dx_i)^2 * (s_xi)^2 ]

If f(x) = y = x_1 + x_2 then

df/dx_1 = 1
df/dx_2 = 1

and the variances just add.

If f(x) = y = (x_1)^2 + (x_2)^2

then df/dx_1 = 2x_1 and df/dx_2 = 2x_2

and the combination becomes 4(x_1)^2 (s_x1)^2 + 4(x_2)^2 (s_x2)^2
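A minimal Python illustration of that propagation formula; the nominal values and uncertainties are chosen arbitrarily for the sketch:

# Propagation of uncertainty: (S_y)^2 = sum_i (df/dx_i)^2 * (s_xi)^2.
import numpy as np

x1, x2 = 3.0, 4.0                              # arbitrary nominal values
s1, s2 = 0.1, 0.2                              # their standard uncertainties

# y = x1 + x2: df/dx1 = df/dx2 = 1, so the variances simply add.
sy_sum = np.sqrt(s1**2 + s2**2)

# y = x1^2 + x2^2: df/dx1 = 2*x1, df/dx2 = 2*x2.
sy_sq = np.sqrt((2*x1)**2 * s1**2 + (2*x2)**2 * s2**2)

# Monte Carlo cross-check with normal perturbations (first-order agreement).
rng = np.random.default_rng(3)                 # arbitrary seed
X1 = rng.normal(x1, s1, 1_000_000)
X2 = rng.normal(x2, s2, 1_000_000)
print(round(sy_sum, 4), round((X1 + X2).std(), 4))        # ~0.2236 both
print(round(sy_sq, 4), round((X1**2 + X2**2).std(), 4))   # ~1.71 both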

Clyde Spencer
Reply to  Tim Gorman
February 12, 2020 8:44 pm

Tim
I think that you are being inconsistent and careless in your use of terms such as “uncertainty,” “accuracy,” and “significant figures.”

One can have a thermometer (or more likely thermocouple) that can be read to 0.001 degree F. If what is being measured is the temperature of an ice-water bath and the nominal temperature is 32 degrees, then one can say that it is very precise with 5 significant figures. However, if it reads 33 degrees it is not accurate! Precision and accuracy are not independent. One cannot have high accuracy with low precision, but one can have low accuracy with high precision. To make headway, there has to be agreement on the definition of the terms used. Now, to complicate things, if one has a large number of high-precision ‘thermometers,’ with variable accuracy, I don’t think that the Law of Large numbers will compensate for the variable accuracy. Indeed, a few badly calibrated ‘thermometers’ will skew the distribution of readings and possibly turn a normal distribution into a non-normal one. But, determining that may be next to impossible because a single temperature (GMST) is not what is being measured! Instead, it is tens of thousands of different temperatures (which, incidentally, have a standard deviation of several tens of degrees on an annual basis!)

Where the improvement in precision, with precision-limited instruments, has a long history is in surveying. There, the same instrument is used to measure the same angle over and over. Thus, the random errors in reading the scale, vernier inscribing errors, and eccentricity in the ring, cancel out by the sqrt(n) principle.
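A tiny Python sketch of that sqrt(n) effect, assuming for illustration a purely random reading error on repeated measurements of the same angle with the same instrument (the angle and error size are invented for the sketch):

# Repeated readings of the same quantity with the same instrument:
# the spread of the mean shrinks as 1/sqrt(n).
import numpy as np

rng = np.random.default_rng(4)                 # arbitrary seed
true_angle, reading_sd = 30.0, 0.01            # illustrative values only
for n in (1, 4, 16, 64):
    means = rng.normal(true_angle, reading_sd, (100_000, n)).mean(axis=1)
    print(n, round(means.std(), 5), round(reading_sd / np.sqrt(n), 5))  # two columns agree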

In climatology, one is NOT using the same thermometer over and over, and one is NOT measuring the same temperature. There is an old joke about the NTSC TV standard standing for “Never Twice the Same Color.” Here we have a situation where no two temperatures are ever exactly the same, and no two thermometers are exactly the same with respect to accuracy and the attainable precision. What is being measured, and what is being used to measure it, appear similar. However, one has to take into account that they are really all different. Thus, one has to determine what the uncertainty in both accuracy AND precision are, to be able to say anything intelligent about what an average of all the readings means.

Reply to  Tim Gorman
February 13, 2020 6:13 am

Tim Feb12 12:09pm

“What happens if you consider the uncertainty interval to be a uniform continuous probability function with every point in the uncertainty interval having the same probability? In this case the mean (a+b)/2 has the same probability of being the true value as any other point in the distribution.”

Correct; and I cover this case in my Section F preceding Equations (21) and (22).

“Thus the variance of each becomes (b-a)^2/12. When combining the two the variance becomes 2(b-a)^2/12 or twice the variance of each. The standard deviation becomes sqrt(2) times the individual standard deviation. If you combine three individual independent temperatures with the same uncertainty interval you get sqrt(3) times the individual intervals as being the combined intervals.”

Correct insofar as the standard deviation is now (b-a)sqrt(3/12) = (b-a)/2. But the combined intervals now span 3(b-a), which is not sqrt(12) times the s.d. (as it would be for a uniform distribution), which serves to prove false your next statement:

“And again, your calculated combined mean has no more chance of being the true value than any other point in the interval.”

Incorrect. The convolution of 3 uniform distributions is not uniform. Here is a relevant paragraph from my Example 1, for the case of 10 uniforms with a=-0.1, b=+0.1:

“To get the exact uncertainty distribution we would have to do what is called convolving of distributions to find the distribution of the sum_1^10 (X_i-12). It is not a uniform distribution, but looks a little like a normal distribution under the Central Limit Theorem. Its “support” is not of course infinite, but is the interval (-1”,+1”), but it does tail off smoothly at the edges. (In fact, recursion shows that the probability of it being less than (-1+x), for 0<x<0.2, is (5x)^10/10!. That ! is a factorial, and with -1+x = -0.8 it gives the small probability of 2.76e-7, a tiny chance of it being in the extreme 1/5 of the interval.)”
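For those who like to check such numbers, here is a short Python confirmation of that tail probability:

# P(sum of 10 uniforms on (-0.1, +0.1) < -0.8) = (5x)^10/10! with x = 0.2, i.e. 1/10!.
from math import factorial

n, x = 10, 0.2
print((5 * x)**n / factorial(n))               # 2.7557e-07
# Equivalently: rescaling each uniform to (0,1), the event becomes
# "sum of 10 U(0,1) < 1", whose probability is 1/10! (the Irwin-Hall lower tail).
print(1 / factorial(10))                       # same value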

BTW I have said that I will give more detail on the reduction, or not, of uncertainty when means are used. I have been delayed in this because I managed to confuse myself over whether it was my Equation (16) or Equation (19) that was important, and of course it is (16). I think that's what is known as "full disclosure".

Rich.

Reply to  Tim Gorman
February 13, 2020 11:42 am

Clyde,

Basically you are repeating exactly what I have been saying over and over.

“I think that you are being inconsistent and careless in your use of terms such as “uncertainty,” “accuracy,” and “significant figures.””

That is exactly the point. These three terms go together. No matter the accuracy of your thermometer, it will always have a resolution limit and therefore some uncertainty. And the resolution limit determines how many significant figures you can actually use.

“I don’t think that the Law of Large numbers will compensate for the variable accuracy.”

Of course it won’t. The law of large numbers only applies when you make multiple measurements of the same thing using the same device, exactly like your transit.

It’s why you have to combine uncertainties when you are combining data from various instruments. No amount of calculating the mean with larger and larger quantities of values that have an inbuilt uncertainty can lessen the uncertainty.

“But, determining that may be next to impossible because a single temperature (GMST) is not what is being measured! Instead, it is tens of thousands of different temperatures (which, incidentally, have a standard deviation of several tens of degrees on an annual basis!)”

Exactly! It’s why it becomes impossible to say that this year is 0.001deg or 0.01deg hotter than last year. Your overall uncertainty interval is wider than that. So you really don’t know!

“Thus, one has to determine what the uncertainty in both accuracy AND precision are, to be able to say anything intelligent about what an average of all the readings means.”

Even the ARGO floats, with thermistors capable of discerning differences in temperature of 0.001deg, have calibration curves. The thermistors themselves are not perfectly linear, and individual elements can age differently depending on the environment they are subjected to. In addition, the actual temperature readings are dependent on several things such as the rate of flow of water past the thermistor, so if the water path encounters any changes (e.g. algae growth, etc.) that throws off the calibration. Even the salinity level or pollution level of the water being measured can throw off the calibration.

If I confuse terms it is in hopes of trying to explain how uncertainty can be understood by standard statistical methods. An uncertainty interval is not a probability function with a standard deviation, but it can be treated mathematically in the same manner. Just as you add the variances of a population with a normal probability curve, as described by a standard deviation, you can add uncertainty intervals by considering them to be standard deviations. But you can only take this so far. You can say the uncertainty intervals of two different thermometers are like a uniform continuous probability function in order to show how to combine the uncertainty intervals, but they are *not* uniform continuous probability functions normalized to the interval (0,1), so you can’t combine them through convolution to get a triangle function which supposedly more accurately defines a mean value with a higher probability of occurrence.
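For reference, here is a minimal Python sketch of the triangle shape that convolving two unit-width uniform densities would produce, the construction referred to above; the unit width and grid spacing are assumptions made only for illustration:

# Numerical convolution of two uniform densities on (-0.5, +0.5):
# the result is a triangular density on (-1, +1), peaking at the centre.
import numpy as np

dx = 0.001
x = np.arange(-0.5, 0.5, dx)
uniform_pdf = np.ones_like(x)                          # density of U(-0.5, +0.5)
triangle = np.convolve(uniform_pdf, uniform_pdf) * dx  # density of the sum of two
print(round(triangle.max(), 3))                        # ~1.0 at the centre
print(triangle[0], triangle[-1])                       # ~0 at the ends of the support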

This whole issue gets even worse when you consider that most of these statistical methods are based on samples taken from the same population group. When you try to calculate variances and standard deviation for two totally independent population groups, e.g. minimum temp +/- u_min and maximum temp +/- u_max, it gets even more complicated than what we’ve discussed here. I tried to show that with my pygmy and Watusi population combination. You can wind up with a meaningless mean and crazy variances and standard deviations.

Yet none of the climate studies or climate models seem to take any of this into consideration. As Pat Frank showed, the climate models are simply black boxes with a linear transfer function, no matter how complicated their underlying differential equations are. And the models can’t even adequately treat the uncertainty associated with that simple setup; they just ignore it totally!

Reply to  See - owe to Rich
February 11, 2020 2:10 pm

Dr. Booth – two points here. First, about what the “bias” is. From Figure 16 and its caption, it looks like this “bias” is the difference of annual means, gridpoint by gridpoint, subtracting the CERES values averaged over the period 2001-2015 from the CM4.0 values averaged over the period 1980-2014. (If I have misunderstood this, I welcome a correction.) Second, then, I’m not suggesting the RMSE 6 W/m^2 value from Figure 15 corresponds somehow to the +/- 4 W/m^2 value appearing in Pat Frank’s paper. They are quite different.
DD