What do you mean by “mean”: an essay on black boxes, emulators, and uncertainty

Guest post by Richard Booth, Ph.D

References:

[1] https://wattsupwiththat.com/2019/09/07/propagation-of-error-and-the-reliability-of-global-air-temperature-projections-mark-ii/

[2] https://wattsupwiththat.com/2019/10/15/why-roy-spencers-criticism-is-wrong  

A. Introduction

I suspect that we can all remember childish arguments of the form “A: what do you mean by x? B: oh, I really mean y. A: but what does y really mean? B: oh, put in other words it means z. A: really (?), but in any case what do you even mean by mean?”  Then in adult discussion there can be so much interpretation of words and occasional, sometimes innocent, misdirection, that it is hard to draw a sound conclusion.  And where statistics are involved, it is not just “what do you mean by mean” (arithmetic? geometric? root mean square?) but “what do you mean by error”, “what do you mean by uncertainty”, and so on.

Starting in the late summer of 2019 there were several WUWT postings on the subject of Dr. Pat Frank’s paper [1], and they often seemed to get bogged down in these questions of meaning and understanding.  A good deal of progress was made, but some arguments were left unresolved, so in this essay I revisit some of the themes which emerged.  Here is a list of sections:

B. Black Box and Emulator Theory – 2.3 pages (A4)

C. Plausibility of New Parameters – 0.6 pages

D. Emulator Parameters – 1.5 pages

E. Error and Uncertainty – 3.2 pages

F. Uniform Uncertainty (compared to Trapezium Uncertainty) – 2.5 pages

G. Further Examples – 1.2 pages

H. The Implications for Pat Frank’s Paper – 0.6 pages

I. The Implications for GCMs – 0.5 pages

Some of those sections are quite long, but each has a summary at its end, to help readers who are short of time and/or do not wish to wade through a deal of mathematics.  The length is unfortunately necessary to develop interesting mathematics around emulators and errors and uncertainty, whilst including examples which may shed some light on the concepts.  There is enough complication in the theory that I cannot guarantee that there isn’t the odd mistake.  When referring to [1] or [2] including their comments sections, I shall refer to Dr. Frank and Dr. Roy Spencer by name, but put the various commenters under the label “Commenters”.

I am choosing this opportunity to “come out” from behind my blog name “See – Owe to Rich”.  I am Richard Booth, Ph.D., and author of “On the influence of solar cycle lengths and carbon dioxide on global temperatures”.  Published in 2018 by the Journal of Atmospheric and Solar-Terrestrial Physics (JASTP), it is a rare example of a peer-reviewed connection between solar variations and climate which is founded on solid statistics, and is available at https://doi.org/10.1016/j.jastp.2018.01.026 (paywalled)  or in publicly accessible pre-print form at https://github.com/rjbooth88/hello-climate/files/1835197/s-co2-paper-correct.docx . I retired in 2019 from the British Civil Service, and though I wasn’t working on climate science there, I decided in 2007 that as I had lukewarmer/sceptical views which were against the official government policy, alas sustained through several administrations, I should use the pseudonym on climate blogs whilst I was still in employment.

B. Black Box and Emulator Theory

Consider a general “black box”, which has been designed to estimate some quantity of interest in the past, and to predict its value in the future.  Consider also an “emulator”, which is an attempt to provide a simpler estimate of the past black box values and to predict the black box output into the future.  Last, but not least, consider reality, the actual value of the quantity of interest.

Each of these three entities,

  •  black box
  • emulator
  • reality

can be modelled as a time series with a statistical distribution.  They are all numerical quantities (possibly multivariate) with uncertainty surrounding them, and the only successful mathematics which has been devised for analysis of such is probability and statistics.  It may be objected that reality is not statistical, because it has a particular measured value.  But that is only true after the fact, or as they say in the trade, a posteriori.  Beforehand, a priori, reality is a statistical distribution of a random variable, whether the quantity be the landing face of the die I am about to throw or the global HadCRUT4 anomaly averaged across 2020.

It may also be objected that many black boxes, for example Global Circulation Models, are not statistical, because they follow a time evolution with deterministic physical equations.  Nevertheless, the evolution depends on the initial state, and because climate is famously “chaotic”, tiny perturbations to that state lead to sizeable divergence later.  The chaotic system tends to revolve around a small number of attractors, and the breadth of orbits around each attractor can be studied by computer and matched to statistical distributions.

The most important parameters associated with a probability distribution of a continuous real variable are the mean (measure of location) and the standard deviation (measure of dispersion).  So across the 3 entities there are 6 important parameters; I shall use E[] to denote expectation or mean value, and Var[] to denote variance which is squared standard deviation.  What relationships between these 6 allow the defensible (one cannot assert “valid”) conclusion that the black box is “good”, or that the emulator is “good”? 

In general, since the purpose of an emulator is to emulate, it should do that with as high a fidelity as possible.  So for an emulator to be good, it should, as in the Turing Test of whether a computer is a good emulator of a human, be able to display a spread/deviation/range similar to that of the black box, as well as a similar mean/average component.  Ideally one would not be able to tell the output of one from that of the other.

To make things more concrete, I shall assume that the entities are each a uniform discrete time series, in other words a set of values evenly spaced across time with a given interval, such as a day, a month, or a year.  Let:

  X(t) be the random variable for reality at integer time t;

  M(t) be the random variable for the black box Model;

  W(t) be the random variable for some emulator (White box) of the black box;

  R_i(t) be the random variable for some contributor to an entity, possibly an error term.

 Now choose a concrete time evolution of W(t) which does have some generality:

  (1)  $W(t) = (1-a)\,W(t-1) + R_1(t) + R_2(t) + R_3(t)$, where $0 \le a \le 1$

The reason for the 3 R terms will become apparent in a moment.  First note that the new value W(t) is partly dependent on the old one W(t-1) and partly on random Ri(t) terms.  If a=0 then there is no decay, and a putative flap of a butterfly’s wings contributing to W(t-1) carries on undiminished to perpetuity.  In Section C I describe how the decaying case a>0 is plausible.

R1(t) is to be the component which represents changes in major causal influences, such as the sun and carbon dioxide.  R2(t) is to be a component which represents a strong contribution with observably high variance, for example the Longwave Cloud Forcing (LCF).  Some emulators might ignore this, but it could have a serious impact on how accurately the emulator follows the black box.  R3(t) is a putative component which is negatively correlated with R2(t) with coefficient -r, with the potential (dependent on exact parameters) to mitigate the high variance of R2(t).  We shall call R3(t) the “mystery component”, and its inclusion is justified in Section C.

Equation (1) can be “solved”, i.e. the recursion removed, but first we need to specify time limits.  We assume that the black box was run and calibrated against data from time 0 to the present time P, and then we are interested in future times P+1, P+2,… up to F. The solution to Equation (1) is

  (2)  $W(t) = \sum_{i=0}^{t-1} (1-a)^i \left(R_1(t-i) + R_2(t-i) + R_3(t-i)\right) + (1-a)^t\,W(0)$

The expectation of W(t) depends on the expectations of each Rk(t), and to make further analytical progress we need to make assumptions about these.  Specifically, assume that

  (3)  $E[R_1(t)] = bt + c$, $E[R_2(t)] = d$, $E[R_3(t)] = 0$

Then a modicum of algebra derives

  (4)  $E[W(t)] = b\,\dfrac{at + a - 1 + (1-a)^{t+1}}{a^2} + (c+d)\,\dfrac{1 - (1-a)^t}{a} + (1-a)^t\,W(0)$

In the limit as a tends to 0, we get the special case

  (5)  $E[W(t)] = \dfrac{b\,t(t+1)}{2} + (c+d)\,t + W(0)$

Next we consider variance, with the following assumptions:

  (6)  $\mathrm{Var}[R_k(t)] = s_k^2$, $\mathrm{Cov}[R_2(t), R_3(t)] = -r\,s_2 s_3$, all other covariances, within or across time, are 0, so
  (7)  $\mathrm{Var}[W(t)] = (s_1^2 + s_2^2 + s_3^2 - 2r\,s_2 s_3)\,\dfrac{1 - (1-a)^{2t}}{2a - a^2}$

and as a tends to zero the ratio of the last two parentheses tends to t (implying that the variance increases linearly with t).
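
As a sanity check on the algebra, the following Python sketch propagates the mean and variance step by step through the recursion in Equation (1) and compares the result with the closed forms (4), (5) and (7). The function names and the numerical parameter values are illustrative assumptions only, and W(0) is treated as a known constant with zero variance.

```python
def mean_closed(t, a, b, c, d, w0):
    """Closed-form E[W(t)]: Equation (4), or Equation (5) when a == 0."""
    if a == 0:
        return b * t * (t + 1) / 2 + (c + d) * t + w0
    q = 1 - a
    return (b * (a * t + a - 1 + q ** (t + 1)) / a ** 2
            + (c + d) * (1 - q ** t) / a
            + q ** t * w0)


def var_closed(t, a, s1, s2, s3, r):
    """Closed-form Var[W(t)]: Equation (7), with its a -> 0 limit."""
    per_step = s1 ** 2 + s2 ** 2 + s3 ** 2 - 2 * r * s2 * s3
    if a == 0:
        return per_step * t
    return per_step * (1 - (1 - a) ** (2 * t)) / (2 * a - a ** 2)


def mean_var_iterated(t, a, b, c, d, w0, s1, s2, s3, r):
    """Propagate mean and variance one step at a time through Equation (1)."""
    mean, var = w0, 0.0  # assumption: W(0) is known exactly
    per_step = s1 ** 2 + s2 ** 2 + s3 ** 2 - 2 * r * s2 * s3
    for k in range(1, t + 1):
        mean = (1 - a) * mean + (b * k + c) + d  # E[R1(k)] + E[R2(k)] + E[R3(k)]
        var = (1 - a) ** 2 * var + per_step      # no covariance across time
    return mean, var


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to exercise the formulae.
    params = dict(a=0.1, b=0.02, c=0.3, d=0.1, w0=1.0)
    noise = dict(s1=0.2, s2=1.0, s3=0.5, r=0.9)
    for t in (1, 10, 100):
        m_it, v_it = mean_var_iterated(t, **params, **noise)
        print(t, m_it, mean_closed(t, **params),
              v_it, var_closed(t, a=params["a"], **noise))
```

The iterated and closed-form columns should agree to rounding error, both for a > 0 and in the a = 0 limit.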

Summary of section B:

  • A good emulator can mimic the output of the black box.
  • A fairly general iterative emulator model (1) is presented.
  • Formulae are given for expectation and variance of the emulator as a function of time t and various parameters.
  • The 2 extra parameters, a, and R3(t), over and above those of Pat Frank’s emulator, can make a huge difference to the evolution.
  • The “mystery” component R3(t) with anti-correlation -r to R2(t) can greatly reduce model error variance whilst retaining linear growth in the absence of decay.
  • Any decay rate a>0 completely changes the propagation of error variance from linear growth to convergence to a finite limit.

C. Plausibility of New Parameters

The decaying case a>0 may at first sight seem implausible.  But here is a way it could arise.  Postulate a model with 3 main variables, M(t) the temperature, F(t) the forcing, and H(t) the heat content of land and oceans.  Let

  M(t) = b + cF(t) + dH(t-1)

(Now by the Stefan-Boltzmann equation M should be related to $F^{1/4}$, but locally it can be linearized by a binomial expansion.)  The theory here is that temperature is fed both by instantaneous radiative forcing F(t) and by previously stored heat H(t-1).  (After all, climate scientists are currently worrying about how much heat is going into the oceans.)  Next, the heat changes by an amount dependent on the change in temperature:

  H(t-1) = H(t-2) + e(M(t-1)-M(t-2)) = H(0) + e(M(t-1)-M(0))

Combining these two equations we get

  M(t) = b + cF(t) + d(H(0) + e(M(t-1)-M(0))) = f + cF(t) + (1-a)M(t-1)

where a = 1-de, f = b+dH(0)-deM(0).  This now has the same form as Equation (1); there may be some quibbles about it, but it shows a proof of concept of heat buffering leading to a decay parameter.
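
The reduction can be checked numerically. The sketch below iterates the two-variable temperature/heat model and the combined single-variable form side by side; the coefficient values and the forcing series are hypothetical, chosen only so that 0 < a < 1.

```python
def simulate_heat_buffer(forcing, b, c, d, e, m0, h0):
    """Iterate M(t) = b + c F(t) + d H(t-1), updating H from changes in M."""
    temps, heat = [m0], h0  # heat holds H(t-1), starting from H(0)
    for f_t in forcing:     # forcing[0] is F(1), forcing[1] is F(2), ...
        m_t = b + c * f_t + d * heat
        heat += e * (m_t - temps[-1])  # H(t) = H(t-1) + e (M(t) - M(t-1))
        temps.append(m_t)
    return temps


def simulate_combined(forcing, b, c, d, e, m0, h0):
    """Iterate the combined form M(t) = f + c F(t) + (1-a) M(t-1)."""
    a = 1 - d * e
    f = b + d * h0 - d * e * m0
    temps = [m0]
    for f_t in forcing:
        temps.append(f + c * f_t + (1 - a) * temps[-1])
    return temps


if __name__ == "__main__":
    # Hypothetical values: d*e = 0.6, so the decay parameter is a = 0.4.
    coeffs = dict(b=0.5, c=0.4, d=0.3, e=2.0, m0=1.0, h0=2.0)
    forcing = [1.0 + 0.1 * t for t in range(1, 21)]
    two_var = simulate_heat_buffer(forcing, **coeffs)
    one_var = simulate_combined(forcing, **coeffs)
    print(max(abs(x - y) for x, y in zip(two_var, one_var)))
```

The printed maximum difference should be at the level of floating-point rounding, confirming that the heat reservoir acts exactly like a decay term in the temperature recursion.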

For the anti-correlated R3(t), consider reference [2]. Roy Spencer, who has serious scientific credentials, had written “CMIP5 models do NOT have significant global energy imbalances causing spurious temperature trends because any model systematic biases in (say) clouds are cancelled out by other model biases”.  This means that in order to maintain approximate Top Of Atmosphere (TOA) radiative balance, some approximate cancellation is forced, which is equivalent to there being an R3(t) with high anti-correlation to R2(t).  The scientific implications of this are discussed further in Section I.

Summary of Section C:

  • A decay parameter is justified by a heat reservoir.
  • Anti-correlation is justified by GCMs’ deliberate balancing of TOA radiation.

D. Emulator Parameters

Dr. Pat Frank’s emulator falls within the general model above.  The constants from his paper, 33 K, 0.42, 33.3 W m⁻², and +/-4 W m⁻², the latter being from errors in LCF, combine to give 33*0.42/33.3 = 0.416 and 0.416*4 = 1.664 used here.  So we can choose a = 0, b = 0, c+d = 0.416 F(t) where F(t) is the new GHG forcing (W m⁻²) in period t, s1 = 0, s2 = 1.664, s3 = 0, and then derive

  (8)  $W(t) = (c+d)\,t + W(0) \pm \sqrt{t}\,s_2$

(I defer discussion of the meaning of the +/- sqrt(t) s2, be it uncertainty or error or something else, to Section E.  Note that F(t) has to be constant to directly use the theory here.)

But by using more general parameters it is possible to get a smaller value of the +/- term.  There are two main ways to do this – by covariance or by decay, each separately justified in Section C.

In the covariance case, choose s3 = s2 and r = 0.95 (say).  Then in this high anti-correlation case, still with a = 0, Equation (7) gives

  (9)  $\mathrm{Var}[W(t)] = 0.1\,s_2^2\,t$  (instead of $s_2^2\,t$)

In the case of decay but no anti-correlation, a > 0 and s3 = 0 (so R3(t) = 0 with probability 1).  Now, as t gets large, we have

  (10)  $\mathrm{Var}[W(t)] = \dfrac{s_1^2 + s_2^2}{2a - a^2}$

so the variance does not increase without limit as in the a =0 case.  But with a > 0, the mean also changes, and for large t Equation (4) implies it is

  (11)  $E[W(t)] \sim bt/a + (b + c + d - b/a)/a$

Now if we choose b = a(c+d) then that becomes (c+d)(t+1), which is fairly indistinguishable from the (c+d)t in Equation (8) derived from a=0, so we have derived a similar expectation but a smaller variance in Equation (10).
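
To put numbers on the three cases, the sketch below evaluates the standard deviation of W(t) from Equation (7) for Pat Frank's parameters, for the anti-correlated case with r = 0.95, and for a decaying case. The decay rate a = 0.1 is purely an illustrative assumption, as are the function names.

```python
import math

# Constants quoted in the text from Pat Frank's paper.
coeff = 33 * 0.42 / 33.3  # about 0.416 K per W m^-2
s2 = coeff * 4.0          # about 1.66 K, from the +/-4 W m^-2 LCF error


def sd_no_decay(t, s2, s3=0.0, r=0.0):
    """s.d. of W(t) from Equation (7) with a = 0 and s1 = 0."""
    return math.sqrt((s2 ** 2 + s3 ** 2 - 2 * r * s2 * s3) * t)


def sd_with_decay(t, s2, a):
    """s.d. of W(t) from Equation (7) with a > 0 and s1 = s3 = 0."""
    return math.sqrt(s2 ** 2 * (1 - (1 - a) ** (2 * t)) / (2 * a - a * a))


if __name__ == "__main__":
    for t in (1, 25, 100):
        print(t,
              round(sd_no_decay(t, s2), 2),            # Equation (8)
              round(sd_no_decay(t, s2, s2, 0.95), 2),  # Equation (9)
              round(sd_with_decay(t, s2, a=0.1), 2))   # tends to Equation (10)
```

The first column grows like sqrt(t), the second grows in the same way but is smaller by a factor sqrt(0.1), and the third levels off at s2/sqrt(2a - a^2).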

To streamline the notation, now let the parameters a, b, c, d, r be placed in a vector u, and let

  (12)  $E[W(t)] = m_w(t;u)$,  $\mathrm{Var}[W(t)] = s_w^2(t;u)$

(I am using a subscript ‘w’ for statistics relating to W(t), and ‘m’ for those relating to M(t).)  With 4 parameters (a, b, c+d, r) to set here, how should we choose the “best”?  Well, comparisons of W(t) with M(t) and X(t) can be made, the latter just in the calibration period t = 1 to t = P.  The nature of comparisons depends on whether or not just one, or many, observations of the series M(t) are available.

Case 1: Many series

With a deterministic black box, many observed series can be created if small perturbations are made to initial conditions and if the evolution of the black box output is mathematically chaotic.  In this case, a mean $m_m(t)$ and a standard deviation $s_m(t)$ can be derived from the many series.  Then curve fitting can be applied to $m_w(t;u) - m_m(t)$ and $s_w(t;u) - s_m(t)$ by varying u.  Something like Akaike’s Information Criterion (AIC) might be used for comparing competing models.  But in any case it should be easy to notice whether $s_m(t)$ grows like sqrt(t), as in the a=0 case, or tends to a limit, as in the a>0 case.
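
The diagnostic in the last sentence can be illustrated by simulation. In the sketch below the "black box" is stood in for by the emulator recursion itself with Gaussian noise, which is an assumption made purely for illustration; a real study would use perturbed runs of the actual model. All names and parameter values are hypothetical.

```python
import random
import statistics


def run_series(steps, a, drift, sigma, rng):
    """One realisation of W(t) = (1-a) W(t-1) + drift + noise, with W(0) = 0."""
    w, path = 0.0, []
    for _ in range(steps):
        w = (1 - a) * w + drift + rng.gauss(0.0, sigma)
        path.append(w)
    return path


def ensemble_sd(n_runs, steps, a, drift, sigma, seed=0):
    """Ensemble standard deviation at each time t = 1..steps across n_runs series."""
    rng = random.Random(seed)
    runs = [run_series(steps, a, drift, sigma, rng) for _ in range(n_runs)]
    return [statistics.pstdev(column) for column in zip(*runs)]


if __name__ == "__main__":
    no_decay = ensemble_sd(2000, 100, a=0.0, drift=0.4, sigma=1.0)
    decay = ensemble_sd(2000, 100, a=0.2, drift=0.4, sigma=1.0)
    for t in (25, 50, 100):
        print(t, round(no_decay[t - 1], 2), round(decay[t - 1], 2))
```

With a = 0 the ensemble spread should roughly double between t = 25 and t = 100, whereas with a = 0.2 it should settle near sigma/sqrt(2a - a^2).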

Case 2: One series

If chaotic evolution is not sufficient to randomize the black box, or if the black box owner cannot be persuaded to generate multiple series, there may be only one observed series m(t) of the random variable M(t).  In this case Var[M(t)] cannot be estimated unless some functional form, such as g+ht, is assumed for $m_m(t)$, when $(m(t)-g-ht)^2$ becomes a single observation estimate of Var[M(t)] for each t, allowing an assumed constant variance to be estimated.  So some progress in fitting W(t;u) to m(t) may still be possible in this case.

Pat Frank’s paper effectively uses a particular W(t;u) (see Equation (8) above) which has fitted $m_w(t;u)$ to $m_m(t)$, but ignores the variance comparison.  That is, $s_2$ in (8) was chosen from an error term from LCF without regard to the actual variance of the black box output M(t).

Summary of section D:

  • Pat Frank’s emulator model is a special case of the models presented in Section B, where error variance is given by Equation (7).
  • More general parameters can lead to lower propagation of error variance over time (or indeed, higher).
  • Fitting emulator mean to black box mean does not discriminate between emulators with differing error variances.
  • Comparison of emulator to randomized black box runs can achieve this discrimination.

E. Error and Uncertainty

In the sections above I have made scant reference to “uncertainty”, and a lot to probability theory and error distributions.  Some previous Commenters repeated the mantra “error is not uncertainty”, and this section addresses that question.  Pat Frank and others referred to the following “bible” for measurement uncertainty:

https://www.bipm.org/utils/common/documents/jcgm/JCGM_100_2008_E.pdf

That document is replete with references to probability theory.  It defines measurement uncertainty as a parameter which is associated with the result of a measurement and that characterizes the dispersion of the values that could reasonably be attributed to the measurand.  It acknowledges that the dispersion might be described in different ways, but gives standard deviations and confidence intervals as principal examples.  The document also says that that definition is not inconsistent with two other definitions of uncertainty, which include the difference between the measurement and the true value.

Here I explain why they might be thought consistent, using my notation above.  Let M be the measurement, and X again be the true value to infinite precision (OK, perhaps only to within Heisenberg quantum uncertainty.)  Then the JCGM’s main definition is a parameter associated with the statistical distribution of M alone, generally called “precision”, whereas the other two definitions are respectively a function of M-X and a very high confidence interval for X.  Both of those include X, and are predicated on what is known as the “accuracy” of the measurement of M.  (The JCGM says this is unknowable, but does not consider the possibility of a different and highly accurate measurement of X.)  Now, M-X is just a shift of M by a constant, so the dispersion of M around its mean is the same as the dispersion of M-X around its mean.  So provided that uncertainty describes dispersion (most simply measured by variance) and not location, they are indeed the same.  And importantly, the statistical theory for compounding variance is the same in each case.

Where does this leave us with respect to error versus uncertainty?  Assuming that X is a single fixed value, then prior to measurement, M-X is a random variable representing the error, with some probability distribution having mean $m_m - X$ and standard deviation $s_m$.  The quantity $b = m_m - X$ is known as the bias of the measurement, and $\pm s_m$ is described by the JCGM 2.3.1 as the “standard” uncertainty parameter.  So standard uncertainty is just the s.d. of error, and more general uncertainty is a more general description of the error distribution relative to its mean.

There are two ways of finding out about $s_m$: by statistical analysis of multiple measurements (if possible) or by appealing to an oracle, such as the manufacturer of the measurement device, who might supply information over and beyond the standard deviation.  In both cases the output resolution of the device may have some bearing on the matter.

However, low uncertainty is not of much use if the bias is large.  The real error statistic of interest is $E[(M-X)^2] = E[((M-m_m)+(m_m-X))^2] = \mathrm{Var}[M] + b^2$ (the cross term vanishes because $E[M-m_m] = 0$), covering both a precision component and an accuracy component.
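
The decomposition holds exactly for any finite set of repeated measurements when the population (divide-by-n) variance is used. A minimal sketch, with made-up readings and a made-up true value:

```python
def error_decomposition(measurements, true_value):
    """Return (mean squared error, variance, bias) for repeated measurements."""
    n = len(measurements)
    mean = sum(measurements) / n
    bias = mean - true_value
    variance = sum((m - mean) ** 2 for m in measurements) / n  # population variance
    mse = sum((m - true_value) ** 2 for m in measurements) / n
    return mse, variance, bias


if __name__ == "__main__":
    readings = [20.3, 20.1, 20.4, 20.2, 20.5]  # hypothetical thermometer readings
    mse, variance, bias = error_decomposition(readings, true_value=20.0)
    print(mse, variance + bias ** 2)  # the two numbers agree up to rounding
```

Here the readings are tightly clustered (small variance) but sit about 0.3 above the true value, so the bias term dominates the mean squared error.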

Sometimes the uncertainty/error in a measurement is not of great consequence per se, but feeds into a parameter of a mathematical model and thence into the output of that model.  This is the case with LCF feeding into radiative forcings in GCMs and then into temperature, and likewise with Pat Frank’s emulator of them.  But the theory of converting variances and covariances of input parameter errors into output error via differentiation is well established, and is given in Equation (13) of the JCGM.

To illuminate the above, we now turn to some examples, principally provided by Pat Frank and Commenters.

Example 1: The 1-foot end-to-end ruler

In this example we are given a 1-foot ruler with no gaps at the ends and no markings, and the manufacturer assures us that the true length is 12”+/-e”; originally e = 1 was chosen, but as that seems ridiculously large I shall choose e = 0.1 here.  So the end-to-end length of the ruler is in error by up to 0.1” either way, and furthermore the manufacturer assures us that any error in that interval is equally likely. I shall repeat a notation I introduced in an earlier blog comment, which is to write 12+/-_0.1 for this case, where the _ denotes a uniform probability distribution, instead of a single standard deviation for +/-.  (The standard deviation for a random variable uniform in [-a,a] is a/sqrt(3) = 0.577a, so b +/-_ a and b +/- 0.577a are loosely equivalent, except that the implicit distributions are different.  This is covered in the JCGM, where “rectangular” is used in place of “uniform”.)

Now, I want to build a model train table 10 feet long, to as high an accuracy as my budget and skill allow.  If I have only 1 ruler, it is hard to see how I can do better than get a table which is 120+/-_1.0”.  But if I buy 10 rulers (9 rulers and 1 ruler to rule them all would be apt if one of them was assured of accuracy to within a thousandth of an inch!), and I am assured by the manufacturer that they were independently machined, then by the rule of addition of independent variances, the uncertainty in the sum of the lengths is sqrt(10) times the uncertainty of each.

So using all 10 rulers placed end to end, the expected length is 120” and the standard deviation (uncertainty) gets multiplied by sqrt(10) instead of 10 for the single ruler case, an improvement by a factor of 3.16.  The value for the s.d. is 0.0577 sqrt(10) = 0.183”.   

To get the exact uncertainty distribution we would have to do what is called convolving of distributions to find the distribution of $\sum_{i=1}^{10} (X_i - 12)$.  It is not a uniform distribution, but looks a little like a normal distribution under the Central Limit Theorem.  Its “support” is of course not infinite, being the interval (-1”,+1”), but it does tail off smoothly at the edges.  (In fact, recursion shows that the probability of it being less than (-1+x), for 0<x<0.2, is $(5x)^{10}/10!$.  That ! is a factorial, and with -1+x = -0.8 it gives the small probability of 2.76e-7, a tiny chance of it being in the lowest tenth of the interval.)
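
The arithmetic of this example is easy to reproduce. The sketch below assumes, as in the text, independent rulers whose errors are uniform on [-0.1”, +0.1”]; the variable and function names are illustrative.

```python
import math

e = 0.1                                  # half-width of the uniform error, inches
sd_one = e / math.sqrt(3)                # s.d. of a single ruler's error
sd_one_ruler_10x = 10 * sd_one           # the same ruler laid down ten times
sd_ten_rulers = math.sqrt(10) * sd_one   # ten independent rulers end to end


def lower_tail(x, n=10, width=0.2):
    """P[total error < -n*e + x] for 0 <= x <= width, with width = 2e.

    For x no larger than one ruler's error range, the sum of n independent
    uniform errors has the simplex-volume tail (x/width)^n / n!.
    """
    return (x / width) ** n / math.factorial(n)


if __name__ == "__main__":
    print(round(sd_one, 4), round(sd_one_ruler_10x, 3), round(sd_ten_rulers, 3))
    print(lower_tail(0.2))  # chance that the total error is below -0.8 inches
```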

Now that seemed like a sensible use of the 10 rulers, but oddly enough it isn’t the best use.  Instead, sort them by length, and use the shortest and longest 5 times over.  We could do this even if we bought n rulers, not equal to 10.  We know by symmetry that the shortest plus longest has a mean error of 0, but calculating the variance is more tricky.

The error of the ith shortest ruler, plus 0.1, times 5, say $Y_i$, has a Beta distribution (range from 0 to 1) with parameters (i, n+1-i).  The variance of $Y_i$ is $i(n+1-i)/((n+1)^2(n+2))$, which can be found at https://en.wikipedia.org/wiki/Beta_distribution .  Now

  $\mathrm{Var}[Y_1 + Y_n] = 2(\mathrm{Var}[Y_1] + \mathrm{Cov}[Y_1, Y_n])$ by symmetry.

Unfortunately that Wikipedia page does not give that covariance, but I have derived this to be

  (13)  $\mathrm{Cov}[Y_i, Y_j] = \dfrac{i(n+1-j)}{(n+2)(n+1)^2}$ if $i \le j$, so
  (14)  $\mathrm{Var}[Y_1 + Y_n] = \dfrac{2(n+1)}{(n+2)(n+1)^2} = \dfrac{2}{(n+2)(n+1)}$

Using the two rulers 5 times multiplies the variance by 25, but removing the scaling of 5 in going from ruler to Yi cancels this.  So (14) is also the variance of the error of our final measurement.

Now take n = 10 and we get uncertainty = square root of variance = sqrt(2/132) = 0.123”, which is less than the 0.183” from using all 10 rulers.  But if we were lavish and bought 100 rulers, it would come down to sqrt(2/10302) = 0.014”.

Having discovered this trick, it would be tempting to extend it and use $(Y_1 + Y_2 + Y_{n-1} + Y_n)/2$.  But this doesn’t help, as the variance for that is $5(n+1)/[2(n+2)(n+1)^2] = 5/[2(n+2)(n+1)]$, which is bigger than (14).

I confess it surprised me that it is better to use the extremal rulers rather than the mean of them all.  But I tested the mathematics both by Monte Carlo and by comparing the variance of the sum of n sorted rulers, calculated via (13), with that of the sum of n unsorted rulers, and for n=10 they agreed exactly.  I think the method is effective because the variance of the extremal rulers is small, as those lengths bump up against the hard limit of the uniform distribution.
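
The exact check described above can be reproduced with rational arithmetic. The sketch below sums the covariances of Equation (13) over chosen ranks; summing over all n ranks should recover n/12, the variance of n unsorted U(0,1) variables. The function names are illustrative.

```python
from fractions import Fraction


def cov_order_stats(i, j, n):
    """Cov[Y_i, Y_j] for order statistics of n U(0,1) draws, Equation (13)."""
    i, j = min(i, j), max(i, j)
    return Fraction(i * (n + 1 - j), (n + 2) * (n + 1) ** 2)


def var_of_sum(ranks, n):
    """Variance of the sum of the order statistics with the given ranks."""
    ranks = list(ranks)
    return sum(cov_order_stats(i, j, n) for i in ranks for j in ranks)


def extreme_pair_sd(n):
    """s.d. of Y_1 + Y_n, the square root of Equation (14)."""
    return float(var_of_sum([1, n], n)) ** 0.5


if __name__ == "__main__":
    n = 10
    print(var_of_sum(range(1, n + 1), n), Fraction(n, 12))  # sorted vs unsorted
    print(round(extreme_pair_sd(10), 3), round(extreme_pair_sd(100), 3))
    # Two extreme pairs, each ruler used 2.5 times, versus one extreme pair.
    print(var_of_sum([1, 2, n - 1, n], n) / 4 > var_of_sum([1, n], n))
```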

That inference is confirmed by Monte Carlo experiments with, in addition to the uniform, a triangular and a normal distribution for Yi, still wanting a total length of 10 rulers, but having acquired n=100 of them.  The triangular has the same range as the uniform, and half the variance, and the normal has the same variance as the uniform, implying that the endpoints of the uniform represent +/-sqrt(3) standard deviations for the normal, covering 92% of its distribution.

In the following table 3 subsets of the 100 are considered, pared down from a dozen or so experiments.  Each subset is optimal, within the experiments tried, for one or more distribution (starred).  A subset a,b,c,… means that the a-th shortest and longest rulers are used, and the b-th shortest and longest etc. The fraction following the distribution is the variance of a single sample.  The decimal values are variances of the total lengths of the selected rulers then scaled up to 10 rulers.

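Variances of the total length (scaled to 10 rulers); * marks the best subset tried for each distribution

distribution   single-sample variance   subset 1    subset 1,12,23,34,45   subset 1,34
U(0,1)         1/12                     0.00479*    0.0689                 0.0449
N(0,1/12)      1/12                     0.781       0.1028*                0.2384
T(0,1)         1/24                     0.0531      0.0353                 0.0328*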

We see that by far the smallest variance, 0.00479, occurs if we are guaranteed a uniform distribution, by using a single extreme pair, but that strategy isn’t optimal for the other 2 distributions.  5 well-spaced pairs are best for the normal, and quite good for the triangular, though the latter is slightly better with 2 well-spaced pairs.

Unless the manufacturer can guarantee the shape of the error distribution, the assumption that it is uniform would be quite dangerous when choosing a strategy for the use of the available rulers.
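
A Monte Carlo experiment of this kind can be sketched as follows. This is one reading of the procedure described above, not the code actually used for the table: for each trial, n = 100 ruler errors are drawn and sorted, the a-th shortest and a-th longest are summed for each a in the subset, and the sum is scaled up to 10 rulers. The seed, the number of trials and all names are assumptions, so the figures will differ somewhat from those tabulated.

```python
import random
import statistics


def total_length_variance(draw, ranks, n=100, target=10, trials=4000, seed=1):
    """Monte Carlo variance of a 10-ruler length built from selected rulers.

    draw(rng) returns one ruler error; ranks lists which a-th shortest and
    a-th longest rulers are used. The summed error is scaled to `target` rulers.
    """
    rng = random.Random(seed)
    scale = target / (2 * len(ranks))
    totals = []
    for _ in range(trials):
        errs = sorted(draw(rng) for _ in range(n))
        picked = sum(errs[a - 1] + errs[n - a] for a in ranks)
        totals.append(scale * picked)
    return statistics.pvariance(totals)


SD = (1 / 12) ** 0.5  # s.d. of the normal, matched to the U(0,1) variance
DISTRIBUTIONS = {
    "uniform": lambda rng: rng.random(),
    "normal": lambda rng: rng.gauss(0.0, SD),
    "triangular": lambda rng: rng.triangular(0.0, 1.0, 0.5),
}

if __name__ == "__main__":
    subsets = {"1": [1], "1,12,23,34,45": [1, 12, 23, 34, 45], "1,34": [1, 34]}
    for name, draw in DISTRIBUTIONS.items():
        row = [total_length_variance(draw, ranks) for ranks in subsets.values()]
        print(name, [round(v, 4) for v in row])
```

For the uniform case with the single extreme pair, the result can be compared with 25 times Equation (14) at n = 100, which is about 0.00485.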

Summary of Section E:

  • Uncertainty should properly be thought of as the dispersion of a distribution of random variables, possibly “hidden”, representing errors, even though that distribution might not be fully specified.
  • In the absence of clarification, a +/-u uncertainty value should be taken as one standard deviation of the error distribution.
  • The assumption, probably through ignorance, that +/-u represents a sharply bounded uniform (or “rectangular”) distribution, allows clever tricks to be played on sorted samples yielding implausibly small variances/uncertainties.
  • The very nature of errors being compounded from multiple sources supports the idea that a normal error distribution is a good approximation.

F. Uniform Uncertainty (compared to Trapezium Uncertainty)

As an interlude between examples, in this section we study further implications of a uniform uncertainty interval, most especially for a digital device.  By suitable scaling we can assume that the possible outputs are a complete range of integers, e.g. 0 to 1000.  We use Bayesian statistics to describe the problem.

Let X be a random variable for the true infinitely precise value which we attempt to measure.

Let x be the value of X actually occurring at some particular time.

Let M be our measurement, a random variable but including the possibility of zero variance.  Note that M is an integer.

Let D be the error, = M – X.

Let f(x) be a chosen (Bayesian) prior probability density function (p.d.f.) for X, P[X’=’x].

Let g(y;x) be a probability function (p.f.) for M over a range of integer y values, dependent on x, written g(y;x) = P[M=y | X’=’x]  (the PRECISION distribution).

Let c be a “constant” of proportionality, determined in each separate case by making relevant probabilities add up to 1.  Then after measurement M, the posterior probability for X taking the value x is, by Bayes’ Theorem,

  (15)  $P[X\,\text{‘=’}\,x \mid M=y] = P[M=y \mid X\,\text{‘=’}\,x]\;P[X\,\text{‘=’}\,x]/c = g(y;x)\,f(x)/c$

Usually we will take f(x) = P[X ‘=’ x] to be an “uninformative” prior, i.e. uniform over a large range bound to contain x, so it has essentially no influence.  In this case,

  (16)  $P[X\,\text{‘=’}\,x \mid M=y] = g(y;x)/c$ where $c = \int g(y;x)\,dx$ (the UNCERTAINTY distribution).

Then P[D=z | M=y] = P[X=M-z | M=y] = g(y;y-z)/c.  Now assume that g() is translation invariant, so g(y;y-z) = g(0;-z) =: c h(z) defines function h(), and $\int h(z)\,dz = 1$.  Then

  (17)  $P[D=z \mid M=y] = h(z)$, independent of y (ERROR DISTRIBUTION = shifted u.d.).

In addition to this distribution of error given observation, we may also be interested in the distribution of error given the true (albeit unknown) value.  (It took me a long time to work out how to evaluate this.)

Let A be the event {D = z}, B be {M = y}, C be {X = x}.  These events have a causal linkage, which is that they can simultaneously occur if and only if z = y-x.  And when that equation holds, so z can be replaced by y-x, then given that one of the events holds, either both or none of the other two occur, and therefore they have equal probability.  It follows that:

  P[A|C] = P[B|C] = P[C|B]P[B]/P[C] = P[A|B]P[B]/P[C]

  (18)  $P[D = z = y-x \mid X = x] = P[D = y-x \mid M = y]\;P[M = y]/P[X = x]$

Of the 3 terms on the RHS, the first is h(y-x) from Equation (17), the third is f(x) from Equation (15), and the second is a new prior.  This prior must be closely related to f(), which we took to be uninformative, because M is an integer value near to X.  The upshot is that under these assumptions the LHS is proportional to h(y-x), so

  (19)  $P[D = y-x \mid X = x] = h(y-x) / \sum_i h(i-x)$

Let x’ be the nearest integer to x, and a = x’-x, lying in the interval [-1/2,1/2).  Then y-x = y+a-x’ = a+k where k is an integer.  Then the mean m and variance $s^2$ of D given X=x are:

  (20)  $m = \dfrac{\sum_k (a+k)\,h(a+k)}{\sum_k h(a+k)}$;  $s^2 = \dfrac{\sum_k (a+k-m)^2\,h(a+k)}{\sum_k h(a+k)}$

A case of obvious interest would be an uncertainty interval which was +/-e uniform.  That would correspond to h(z) = 1/(2e) for b-e < z < b+e and 0 elsewhere, where b is the bias of the error.  We now evaluate the statistics for the case b = 0 and e ≤ 1.  The symmetry in a means that we need only consider a > 0.  The condition -e < a+k < e implies that -e-a < k < e-a.  If e < ½ there would be an a slightly bigger than e such that no integer k is in the interval, which is impossible, so e is at least ½.  Since h(z) is constant over its range, in (20) cancellation allows us to replace h(a+k) with 1.

  (21)  If a < 1-e then only k=0 is possible, and $m = a$, $s^2 = 0$.
  (22)  If a > 1-e then k=-1 and k=0 are both possible, and $m = a - \tfrac{1}{2}$, $s^2 = \tfrac{1}{4}$.

When $s^2$ is averaged over all a we get 2(e-1/2)(1/4) = (2e-1)/4.

It is not plausible for e to be ½, for then $s^2$ would be 0 whatever the fractional part a of x was.  Since $s^2$ is the variance of M-X given X=x, that implies that M is completely determined by X.  That might sound reasonable, but in this example it means that as X changes from 314.499999 to 314.500000, M absolutely has to flip from 314 to 315, and that implies that the device, despite giving output resolution to an integer, actually has infinite precision, and is therefore not a real device.

For e > ½, $s^2$ is zero for a in an interval of width 2-2e, and non-zero in two intervals of total width 2e-1.  In these intervals for a (translating to x), it is non-deterministic as to whether the output M is 314, say, or 315.
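
The uniform case can be evaluated directly from Equation (20). The sketch below enumerates the integers k for which h(a+k) is non-zero, computes m and s² for a given offset a, and averages s² over a fine grid of a values; the choice e = 0.75 and the grid size are illustrative assumptions.

```python
def error_stats_uniform(a, e):
    """Mean and variance of D = M - X from Equation (20), uniform h, bias 0.

    a is the offset of the nearest integer from x, in [-1/2, 1/2);
    e is the half-width of the uniform error interval, with 1/2 < e <= 1.
    """
    ks = [k for k in range(-2, 3) if -e < a + k < e]  # integers with h(a+k) > 0
    m = sum(a + k for k in ks) / len(ks)
    s2 = sum((a + k - m) ** 2 for k in ks) / len(ks)
    return m, s2


def average_variance(e, n_grid=10000):
    """Average s^2 over a midpoint grid of a values in [-1/2, 1/2)."""
    total = 0.0
    for j in range(n_grid):
        a = -0.5 + (j + 0.5) / n_grid
        total += error_stats_uniform(a, e)[1]
    return total / n_grid


if __name__ == "__main__":
    e = 0.75
    print(error_stats_uniform(0.1, e))   # a < 1-e: Equation (21)
    print(error_stats_uniform(0.4, e))   # a > 1-e: Equation (22)
    print(average_variance(e), (2 * e - 1) / 4)
```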

In Equations (21) and (22) there is a disconcerting discontinuity in the expected error from 1-e at a = (1-e)- to (1/2-e) at a = (1-e)+.  This arises from the cliff edge in the uniform h(z).  More sophisticated functions h(z) do not exhibit this feature, such as a normal distribution, a triangle distribution, or a trapezium distribution such as:

  (23)  $h(z) = \begin{cases} 2(z + 3/4) & \text{for } -3/4 < z < -1/4 \\ 1 & \text{for } -1/4 < z < 1/4 \\ 2(3/4 - z) & \text{for } 1/4 < z < 3/4 \end{cases}$

For this example we find

  (24)  if 0<a<1/4, $m = a$ and $s^2 = 0$,

           if 1/4<a<3/4, $m = 1/2 - a$, $s^2 = 4(a - 1/4)(3/4 - a) \le 1/4$

Note that the discontinuity previously noted does not occur here, as m is a continuous function of a even at a=1/4.  The averaged $s^2$ is 1/12, less than the 1/8 from the U[-3/4,3/4] distribution.
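
The trapezium results can be checked in the same way, by feeding the density of Equation (23) into Equation (20). In this sketch the function names, the range of k and the grid size are illustrative assumptions.

```python
def h_trapezium(z):
    """Trapezium error density of Equation (23)."""
    z = abs(z)
    if z < 0.25:
        return 1.0
    if z < 0.75:
        return 2.0 * (0.75 - z)
    return 0.0


def error_stats(a, h, k_range=range(-3, 4)):
    """Mean and variance of D given X = x, evaluated from Equation (20)."""
    weights = [(a + k, h(a + k)) for k in k_range]
    norm = sum(w for _, w in weights)
    m = sum(z * w for z, w in weights) / norm
    s2 = sum((z - m) ** 2 * w for z, w in weights) / norm
    return m, s2


def average_variance(h, n_grid=10000):
    """Average s^2 over a midpoint grid of a values in [-1/2, 1/2)."""
    total = 0.0
    for j in range(n_grid):
        a = -0.5 + (j + 0.5) / n_grid
        total += error_stats(a, h)[1]
    return total / n_grid


if __name__ == "__main__":
    print(error_stats(0.1, h_trapezium))     # flat part: m = a, s^2 = 0
    print(error_stats(0.4, h_trapezium))     # sloping part of Equation (24)
    print(error_stats(0.2499, h_trapezium)[0],
          error_stats(0.2501, h_trapezium)[0])  # m is continuous at a = 1/4
    print(average_variance(h_trapezium), 1 / 12)
```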

All the above is for a device with a digital output, presumed to change slowly enough to be read reliably by a human.  In the case of an analogue device, like a mercury thermometer, then a human’s reading of the device provides an added error/uncertainty.  The human’s reading error is almost certainly not uniform (we can be more confident when the reading is close to a mark than when it is not), and in any case the sum of instrument and human error is almost certainly not uniform.

Summary of section F:

  • The PRECISION distribution, of an output given the true state, induces an ERROR distribution given some assumptions on translation invariance and flat priors.
  • The range of supported values of the error distribution must exceed the output resolution width, since otherwise infinite precision is implied.
  • Even when that criterion is satisfied, the assumption of a uniform ERROR distribution leads to a discontinuity in mean error as a function of the true value.
  • A corollary is that if your car reports ambient temperature to the nearest half degree, then sometimes, even in steady conditions, its error will exceed half a degree.

G. Further Examples

Example 2: the marked 1-foot ruler

In this variant, the rulers have markings and an indeterminate length at each end.  Now multiple rulers cannot usefully be laid end to end, and the human eye must be used to judge and mark the 12” positions.  This adds human error/uncertainty to the measurement process, which varies from human to human, and from day to day.  The question of how hard a human should try in order to avoid adding significant uncertainty is considered in the next example.

Example 3: Pat Frank’s Thermometer

Pat Frank introduced the interesting example of a classical liquid-in-glass (LiG) thermometer whose resolution is (+/-)0.25K.  He claimed that everything inside that half-degree interval was a uniform blur, but went on to explain that the uncertainty was due to at least 4 things, namely the thermometer capillary is not of uniform width, the inner surface of the glass is not perfectly smooth and uniform, the liquid inside is not of constant purity, the entire thermometer body is not at constant temperature.  He did not include the fact that during calibration human error in reading the instrument may have been introduced.  So the summation of 5 or more errors implies (except in mathematically “pathological” cases) that the sum is not uniformly distributed.  In fact a normal distribution, perhaps truncated if huge errors with infinitesimal probability are unpalatable, makes much more sense.

The interesting question arises as to what the (hypothetical) manufacturers meant when they said the resolution was +/-0.25K.  Did they actually mean a 1-sigma, or perhaps a 2-sigma, interval?  For deciding how to read, record, and use the data from the instrument, that information is rather vital.

Pat went on to say that a temperature reading taken from that thermometer and written as, e.g., 25.1 C, is meaningless past the decimal point. (He didn’t say, but presumably would consider 25.5 C to be meaningful, given the half-degree uncertainty interval.)  But this isn’t true; assuming that someone cares about the accuracy of the reading, it doesn’t help to compound instrumental error with deliberate human reading error.  Suppose that the quoted +/-0.25K actually corresponds to a 2-sigma bound on the instrument error, as the manufacturer wanted to give a reasonably firm bound; then the variance is ((1/2)(1/4))^2 = 1/64.  If t^2 is the error variance of the observer, and the two errors are independent, then the final variance is 1/64+t^2.

The observer should not aim for a ridiculously low t, even if achievable, and perhaps a high t is not so bad if the observations are not that important.  But beware: observations can increase in importance beyond the expectations of the observer.  For example we value temperature observations from 1870 because they tell us about the idyllic pre-industrial, pre-climate change, world!  In the present example, I would recommend trying for t^2 = 1/100, or as near as can be achieved within reason.  Note that if the observer can manage to read uniformly within +/-0.1 C, then that means t^2 = 1/300.   But if instead she reads to within +/-0.25, t^2 = 1/48 and the overall variance is multiplied by (1+64/48) = 7/3 ~ 1.5^2, which is a significant impairment of precision.
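The arithmetic in this example can be reproduced with a short sketch. It assumes, as in the text, that the instrument's +/-0.25 is a 2-sigma bound, that the observer's reading error is uniform on the stated interval, and that the two errors are independent so that their variances add. The helper name uniform_var is illustrative.

```python
import math

# 2-sigma = 0.25 K, so sigma = 0.125 K and the instrument variance is 1/64
instrument_var = (0.5 * 0.25) ** 2


def uniform_var(half_width):
    """Variance of a reading error uniform on [-half_width, +half_width]."""
    return half_width ** 2 / 3.0


for half_width in (0.1, 0.25):
    t2 = uniform_var(half_width)
    total = instrument_var + t2          # independent errors: variances add
    ratio = total / instrument_var
    print(f"+/-{half_width}: t^2 = {t2:.5f}, total variance = {total:.5f}, "
          f"variance ratio = {ratio:.3f}, s.d. ratio = {math.sqrt(ratio):.3f}")
```

The variance ratio for the +/-0.25 reader is 7/3, so the standard deviation grows by roughly a factor of 1.5 relative to the instrument alone.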

The moral is that it is vital to know what uncertainty variance the manufacturer really believes to be the case, that guidelines for observers should then be appropriately framed, and that sloppiness has consequences.

Summary of Section G:

  • Again, real life examples suggest the compounding of errors, leading to approximately normal distributions.
  • Given a reference uncertainty value from an analogue device, if the observer has the skill and time and inclination then she can reduce overall uncertainty by reading to a greater precision than the reference value.

H. The Implications for Pat Frank’s Paper

The implication of Section B is that a good emulator can be run with pseudorandom numbers and give output which is similar to that of the black box.  The implication of Section D is that uncertainty analysis is really error analysis and good headway can be made by postulating the existence of hidden random variables through which statistics can be derived.  The implication of Section C is that many emulators of GCM outputs are possible, and just because a particular one seems to fit mean values quite well does not mean that the nature of its error propagation is correct.  The only way to arbitrate between emulators would be to carry out Monte Carlo experiments with the black boxes and the emulators.  This might be expensive, but assuming that emulators have any value at all, it would increase this value.

Frank’s emulator does visibly give a decent fit to the annual means of its target, but that isn’t sufficient evidence to assert that it is a good emulator.  Frank’s paper claims that GCM projections to 2100 have an uncertainty of +/- at least 15K.  Because, via Section D, uncertainty really means a measure of dispersion, this means that Equation (1) with the equivalent of Frank’s parameters, using many examples of 80-year runs, would show an envelope where a good proportion would reach +15K or more, and a good proportion would reach -15K or less, and a good proportion would not reach those bounds.  This is just the nature of random walks with square root of time evolution. 
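To illustrate what such an envelope looks like, here is a minimal sketch. It does not implement Equation (1); it assumes a plain random walk with independent normal annual steps, scaled so that the standard deviation after 80 years is 15K, and it only looks at where each run ends. All names and parameter values are illustrative.

```python
import math
import random


def final_values(n_runs=10_000, n_years=80, sd_final=15.0, seed=1):
    """End points of simple random walks whose s.d. after n_years is sd_final."""
    rng = random.Random(seed)
    step_sd = sd_final / math.sqrt(n_years)   # s.d. grows like sqrt(time)
    return [sum(rng.gauss(0.0, step_sd) for _ in range(n_years))
            for _ in range(n_runs)]


ends = final_values()
above = sum(x >= 15.0 for x in ends) / len(ends)
below = sum(x <= -15.0 for x in ends) / len(ends)
print(f"fraction ending at +15K or more: {above:.3f}")
print(f"fraction ending at -15K or less: {below:.3f}")
print(f"fraction ending inside:          {1.0 - above - below:.3f}")
```

Under these assumptions roughly one run in six ends beyond each bound, which is the kind of spread a +/-15K one-sigma uncertainty would imply.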

But the GCM outputs represented by CMIP5 do not show this behaviour, even though, climate being chaotic, different initial conditions should lead to such variety.  Therefore Frank’s emulator is not objectively a good one.  And the reason is that, as mentioned in Section C, the GCMs have corrective mechanisms to cancel out TOA imbalances except for, presumably, those induced by the rather small increase of greenhouse gases from one iteration to the next.

However the real value in Frank’s paper is first the attention drawn to the relatively large annual errors in the radiation budget arising from long wave cloud forcing, and second the revelation through comments on it that GCMs have ways of systematically squashing these errors.

Summary of Section H:

  • Frank’s emulator is not good in regard to matching GCM output error distributions.
  • Frank’s paper has valuable data on LCF errors.
  • Thereby it has forced “GCM auto-correction” out of the woodwork.

I. The Implications for GCMs

The “systematic squashing” of the +/-4 W/m^2 annual error in LCF inside the GCMs is an issue of which I for one was unaware before Pat Frank’s paper. 

The implication of comments by Roy Spencer is that there really is something like a “magic” component R3(t) anti-correlated with R2(t), though the effect would be similar if it was anti-correlated with R2(t-1) instead, which might be plausible with a new time step doing some automatic correction of overshooting or undershooting on the old time step.  GCM experts would be able to confirm or deny that possibility.

In addition, there is the question of a decay rate a, so that only a proportion (1-a) of previous forcing carries into the next time step, as justified by the heat reservoir concept in Section C.  After all, GCMs presumably do try to model the transport of heat in ocean currents, with concomitant heat storage.
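The effect of such a decay rate on uncertainty growth can be sketched with a simplified recursion. This is an assumption standing in for Equation (1), not a reproduction of it: W(t) = (1-a) W(t-1) + R(t), with independent R(t) of variance sigma2 and W(0) = 0, so that the variance obeys V(t) = (1-a)^2 V(t-1) + sigma2.

```python
def variance_path(a, sigma2=1.0, n_steps=80):
    """Variance of W(t) = (1 - a) * W(t-1) + R(t), with Var[R] = sigma2, W(0) = 0."""
    v, path = 0.0, []
    for _ in range(n_steps):
        v = (1.0 - a) ** 2 * v + sigma2
        path.append(v)
    return path


for a in (0.0, 0.1, 0.5):
    v80 = variance_path(a)[-1]
    limit = float("inf") if a == 0.0 else 1.0 / (1.0 - (1.0 - a) ** 2)
    print(f"a = {a}: variance after 80 steps = {v80:.3f}, limiting variance = {limit:.3f}")
```

With a = 0 the variance grows linearly with time (a random walk), whereas any a > 0 gives a variance that levels off at sigma2 / (1 - (1-a)^2).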

It is very disturbing that GCMs have to resort to error correction techniques to achieve approximate TOA balance.  The two advantages of doing so are that they are better able to model past temperatures, and that they do a good job in constraining the uncertainty of their output to the year 2100.  But the huge disadvantage is that it looks like a charlatan’s trick; where is the vaunted skill of these GCMs, compared with anyone picking their favourite number for climate sensitivity and drawing straight lines against log(CO2)?  In theory, an advantage of GCMs might be an ability to explain regional differences in warming.  But I have not seen any strong claims that that is so, with the current state of the science.

Summary of Section I:

  • Auto-correction of TOA radiative balance helps to keep GCMs within reasonable bounds.
  • Details of how this is done would be of great interest; the practice seems dubious at best because it highlights shortcomings in GCMs’ modelling of physical reality.

184 thoughts on “What do you mean by “mean”: an essay on black boxes, emulators, and uncertainty”

  1. Given the many assumptions made in the GCMs about physical processes and their many interactions with each other, does there come a point where developing truly useful mathematical definitions for terms such as ‘range of error’, ‘uncertainty’, ‘error propagation’ becomes impossible, for all practical purposes?

    • I think that physics depends on mathematics and we have to do the best we can with the maths available. Its utility, especially in regard to “practical purposes”, can always be debated.

      RJB

  2. Dr. Booth:
    You’ve provided much to consider here. Some points I don’t understand:

    “So standard uncertainty is just the s.d. of error, and more general uncertainty is a more general description of the error distribution relative to its mean.” What is the difference between “standard” and “general” uncertainty and is “standard” uncertainty so easily defined by the standard deviation of the error?

    “In the absence of clarification, a +/-u uncertainty value should be taken as one standard deviation of the error distribution.” Why should we assume that “a +/-u uncertainty value should be taken as one standard deviation of the error distribution”?

    I’m not challenging either of these, just looking for help understanding both.

    • He is confusing standard uncertainty (u) and expanded uncertainty (+/-U), as they are defined in the GUM. (u) is standard deviation, while (U) is (u) multiplied by a coverage factor.

    • JRF: I am just interpreting the JCGM at that point, and “standard uncertainty” is a term they use for a +/-1 s.d. interval for the error. I use “general uncertainty” to include a better description of the error distribution.

      If someone says “the uncertainty is +/-3 widgets”, then unless they are more specific the most reasonable assumption is that they are using standard uncertainty, which is a +/-1 sigma (s.d.) bound. Does that help?

      RJB

      • This is why the GUM specifies that measurement uncertainty statements include not just the +/- U but also the standard uncertainty and the coverage factor (k). The following is an example provided by NIST.

        ms = (100.02147 ± 0.00070) g, where the number following the symbol ± is the numerical value of an expanded uncertainty U=k * uc with U determined from a combined standard uncertainty (i.e., estimated standard deviation ) uc= 0.35 mg and a coverage factor k=2. Since it can be assumed that the possible estimated values of the standard are approximately normally distributed with approximate standard deviation uc, the unknown value of the standard is believed to lie in the interval defined by U with a level of confidence of approximately 95%.

        • Which goes to the subtitle of the document: “Guide to the expression of uncertainty in measurement”: it is a standard way of expressing uncertainty.

  3. This is a lot to digest in any small amount of time, but isn’t one of the most important points, or even the most important point, summarized right here?

    The real error statistic of interest is E[(M-X)^2] = E[((M-m_m)+(m_m-X))^2] = Var[M] + b^2, covering both a precision component and an accuracy component.

    For climate models we do not have a credible estimate of the “uncertainty” of M-X, first, because important drivers of climate are involved and no matter how stable the governing differential equations, these will not damp away (they may also be misrepresented with a set of differential equations missing terms or with erroneous values of coefficients), and second, we don’t have a handle on “b”. I would hate to have our economic future decided by “b”.
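The decomposition quoted in this comment can be checked on simulated readings. The sketch below assumes a fixed true value x and readings M = x + b + noise; the values of x, b and sd are hypothetical and chosen only for illustration.

```python
import random
import statistics

rng = random.Random(0)
x = 20.0            # hypothetical true value
b, sd = 0.3, 0.5    # hypothetical bias (accuracy) and s.d. (precision)
readings = [x + b + rng.gauss(0.0, sd) for _ in range(20_000)]

mse = statistics.fmean((m - x) ** 2 for m in readings)
var_m = statistics.pvariance(readings)
bias = statistics.fmean(readings) - x

print(f"mean squared error = {mse:.4f}")
print(f"Var[M] + b^2       = {var_m + bias ** 2:.4f}")
```

The two printed numbers agree, and both are close to sd^2 + b^2. The point made in the comment still stands: without an independent handle on b, the precision term alone says nothing about the total error.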

  4. If the GCMs do have a negatively correlated feedback parameter that constrains the models to match the past and not fly apart in the future… How would that parameter distinguish between natural and anthropogenic forcing?

    As the natural forcing dwarfs the anthropogenic forcing (especially in the past) surely this negatively correlated feedback parameter must make the anthropogenic forcing irrelevant.

    Unless the negatively correlated feedback parameter was very finely chosen.

  5. >In the absence of clarification, a +/-u uncertainty value should be
    >taken as one standard deviation of the error distribution.

    This is not how uncertainty is treated in the GUM, quoting:

    “uncertainty (of measurement)
    parameter, associated with the result of a measurement, that characterizes the
    dispersion of the values that could reasonably be attributed to the measurand”

    “2.3.1
    standard uncertainty
    uncertainty of the result of a measurement expressed as a standard deviation”

    “2.3.4
    combined standard uncertainty
    … positive square root of a sum of [individual standard uncertainty] terms…”

    “2.3.5
    expanded uncertainty
    quantity defining an interval about the result of a measurement that may be expected
    to encompass a large fraction of the distribution of values that could reasonably be
    attributed to the measurand”

    Standard uncertainty has the symbol u [lowercase]; it is standard deviation, and does not have +/- attached to it.

    Expanded uncertainty is standard uncertainty (u) multiplied by a coverage factor (k), and has the symbol U [uppercase]:

    U = k * u

    Because of the expansion by the statistical coverage, U is expressed as +/-[value].

    The coverage factor is associated with Student’s t, and in practice is nearly always simply assumed to be k = 2 for “95% coverage” (in this respect, k = 3 would correspond to 99% coverage). However, even though it is standard practice to use k = 2, if the variability (i.e. sampling) distribution is not normal, which is often the case, the real coverage percentage cannot be assumed.

    • “expanded uncertainty quantity defining an interval about the result of a measurement”
      What does “the result of a measurement” mean? Is this an expression differentiating between the taking of a measurement and the values obtained by taking the measurement, or something more exotic? If the former, why label it “the result of a measurement” rather than just “a measurement”?

      “Standard uncertainty has the symbol u [lowercase]; it is standard deviation, and does not have +/- attached to it.”
      Does that mean the value expressed in this “standard uncertainty” is twice the numerical +/- uncertainty?
      If that question isn’t clear:
      there has to be a measurement m to which the uncertainty, however it is expressed, is related.
      While “standard uncertainty” has no +/- attached to it,
      it must represent some range in which the ‘real’ value exists.
      Is that range evenly distributed around the measurement m?
      If so, then it could also be expressed as
      m +/- 0.5u,
      otherwise it has to actually be
      m +/- (the value of u).
      Which is correct?

      • A result is a numerical value, at the end of a measurement procedure; the GUM gives guidance about performing a formal uncertainty analysis (UA) on the measurement procedure in order to quantify the uncertainty to be attached to the result. As is quite common in standards writing, the authors are strictly adhering to the terms defined in order to minimize misunderstandings, at the expense of more words. But to most people a “result of a measurement” is simply a measurement. The GUM identifies the measurement procedure as:

        Y = f(X1, X2, X3, …), where the Xs might be multiple secondary measurements needed to obtain Y.

        >Does that mean value expressed in this “standard uncertainty” is twice
        >the numerical +/- uncertainty?

        No, remember that the GUM is intended to be a standard way of expressing uncertainty; prior to its arrival, there were lots of different ways, each with their own terminology and largely incompatible with each other. Many of these would apply (+/-) to standard deviations, so confusion here is quite understandable.

        An easy way to think of standard uncertainty is as the root-sum-square of individual uncertainty components (or sources of error), all expressed as standard deviations (s). The s values can come from statistical analysis, and/or from estimations of the ranges of errors along with their probability distributions. Then (for many measurement procedures):

        u = sqrt[ s1^2 + s2^2 + s3^2 + … + sn^2]
        U = k * u

        >While “standard uncertainty” has no +/- attached to it,
        >it must represent some range in which the ‘real’ value exists.

        Keep in mind that this elusive animal may not even lie within Y +/- U, and the GUM does not require this. The terminology discussion is too long for here; studying the GUM terminology a bit should help.

        >Is that range evenly distributed around the measurement m?
        >If so, then it could also be expressed as
        >m +/- 0.5u,
        >otherwise it has to actually be
        >m +/- (the value of u).
        >Which is correct?

        For me, I have come to view (+/-U) as the “fuzziness” attached to a measurement. The GUM does not require any particular probability distribution; these are aspects of an individual measurement procedure that are investigated as part of an uncertainty analysis.

        An example UA: a liquid metal thermometer is calibrated at a range of temperatures in a bath by a cal lab to get a series of T_therm versus T_bath data points. Standard regression gives the calibration curve T_bath = a * T_therm + b, with lots of regression statistics, especially the standard deviation of T_bath as a function of T_therm. This is a GUM Type A standard uncertainty, s1.

        There is also the thermometer scale, which is graduated in 0.5C increments, and the human task of reading the scale is an error source. For the UA, it is estimated that the temperature can be anywhere in an interval of +/-0.25C, with triangular probability distribution. Following the GUM, a Type B uncertainty is calculated from this interval, giving s2.

        s1 and s2 are then combined to get the standard uncertainty u. Note that for this example, s2 could be much larger than s1, so that the task of reading the thermometer is the dominant error source. Also, note that s2 appears twice, once in the cal lab during calibration, and once again in use.
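A minimal sketch of this example UA follows. The divisors sqrt(3) and sqrt(6) are the usual GUM Type B conversions for uniform and triangular intervals; the value of s1 is a hypothetical placeholder, since no regression figure is given above, and the helper name type_b_sd is illustrative.

```python
import math


def type_b_sd(half_width, shape):
    """Standard deviation of a symmetric interval +/-half_width (GUM Type B)."""
    divisor = {"uniform": math.sqrt(3.0), "triangular": math.sqrt(6.0)}[shape]
    return half_width / divisor


s1 = 0.05                                    # hypothetical regression s.d. from the cal lab, in C
s2 = type_b_sd(0.25, "triangular")           # reading the 0.5 C scale
u = math.sqrt(s1 ** 2 + s2 ** 2 + s2 ** 2)   # s2 enters twice: calibration and use
U = 2.0 * u                                  # expanded uncertainty with k = 2
print(f"s2 = {s2:.3f} C, u = {u:.3f} C, U = +/-{U:.3f} C")
```

With these numbers the scale-reading term is about twice the assumed regression term, so it dominates the combined result.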

  6. CM:
    Correct me if I’m misunderstanding:
    “uncertainty” is an undefined dispersion of values associated with a measurement.
    “standard uncertainty” is the standard deviation (of error?) associated with a measurement (and is, therefore, a statistic?)
    “expanded uncertainty” encompasses a large fraction of “uncertainty” (but not all?) but is multiplied by a selected k of interest if a normal distribution but k becomes unknown if the distribution is not normal (and, therefore “uncertainty” becomes truly uncertain).

    So, “uncertainty” can be a statistic based on a standard deviation in a normal distribution and is only truly uncertain in a distribution that is not normal (?).

    I was thinking that “uncertainty”, or “Uncertainty” was based on more than just statistical variation. But, I will say my confusion on this topic is quite high.

    • JRF:

      Uncertainty analysis (UA) is a large topic and it takes time to get one’s head wrapped around the GUM; my explanation was very bare-bones. The GUM is written around the task of performing a formal UA for a given measurement, which can be thought of as some process that produces a numerical result, such as using a liquid thermometer to measure water temperature.

      The metrological vocabulary used in the GUM Annex B came from the “International vocabulary of metrology — Basic and general concepts and associated terms (VIM)”, also produced by the JCGM.

      Another quote from the GUM, which may help:

      ‘2.2.1 The word “uncertainty” means doubt, and thus in its broadest sense
      “uncertainty of measurement” means doubt about the validity of the result of a
      measurement. Because of the lack of different words for this general concept of
      uncertainty and the specific quantities that provide quantitative measures of the
      concept, for example, the standard deviation, it is necessary to use the word
      “uncertainty” in these two different senses.’

      >“uncertainty” is an undefined dispersion of values associated with a measurement.

      Yes, but an uncertainty analysis is an attempt to quantify the dispersion. +/-U is a way of expressing the fuzziness associated with a numeric measurement.

      >“standard uncertainty” is the standard deviation (of error?) associated
      >with a measurement (and is, therefore, a statistic?)

      Yes, the “standard” adjective refers to standard deviation, and u should be thought of as standard deviation. However, u values can come from many different sources, and the GUM has a lot of text for quantifying them.

      In a UA, the combined standard uncertainty is calculated as the root-sum-square of the individual standard deviations.

      An example: a thermometer calibration done by reading it while immersed in a fluid, over a range of temperatures. The calibration is then obtained as a linear regression X-Y fit of temperature versus temperature. Each individual temperature measurement has its own u, and additional uncertainty comes from the regression statistics, such as the standard deviation of the slope of the line. All of these are then used to get the combined uncertainty.

      u is statistical, but not always. The GUM divides u as Type A or Type B:

      ‘2.3.2
      Type A evaluation (of uncertainty)
      method of evaluation of uncertainty by the statistical analysis of series
      of observations’

      ‘2.3.3
      Type B evaluation (of uncertainty)
      method of evaluation of uncertainty by means other than the statistical
      analysis of series of observations’

      The distributions of “series of observations” can be normal, or non-normal (even unknown). A Type B uncertainty is typically a judgement that a measurement X can vary between X-a and X+a, i.e. an interval, with an assumed probability distribution. Going back to the thermometer, if it has 0.5 degree gradations, the Type B uncertainty associated with reading the scale could be expressed with a = 0.5 or a = 0.25 (a judgement call), and a uniform or triangular distribution (another judgement call). The GUM tells how to calculate a standard deviation from these estimates.

      Another judgement call is what value of k to use; the common pitfall is to assume that k=2 automatically means 95% of all measurements will be within +/-U. This is only true if the distribution is normal. It is common for UAs done for laboratory accreditation purposes to be required to quote U with k=2, without regard to any real statistical distribution.
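The k=2 pitfall can be made concrete by computing the exact coverage of +/-k*u for three error distributions, each with standard deviation u. This is a sketch using standard closed forms for the normal, uniform and triangular cases; the function name coverage is illustrative.

```python
import math


def coverage(k, shape):
    """Probability that an error lies within +/- k*u, where u is its standard deviation."""
    if shape == "normal":
        return math.erf(k / math.sqrt(2.0))
    if shape == "uniform":          # half-width a = u * sqrt(3)
        return min(1.0, k / math.sqrt(3.0))
    if shape == "triangular":       # half-width a = u * sqrt(6)
        r = min(1.0, k / math.sqrt(6.0))
        return 1.0 - (1.0 - r) ** 2
    raise ValueError(f"unknown shape: {shape}")


for shape in ("normal", "uniform", "triangular"):
    print(f"{shape:10s} k=2 coverage = {coverage(2.0, shape):.4f}")
```

Only the normal case gives roughly 95%; for a uniform error the k=2 interval already contains every possible value, and for a triangular error it covers a little under 97%.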

      • CM:
        Thank you for your extended comments! When Dr. Pat Frank posted his article about uncertainty (and I’m not sure whether to capitalize or not) I thought, “This is good, this is important” and I continue to think that because, in my mind, he introduced (and others reiterated) the importance of incorporating “ignorance” into the discussion of models and analysis (“ignorance” being defined as that which we don’t know but which can be described, to some degree, mathematically). Dr. Booth’s article seemed to move away from “ignorance” and more to statistics (and I may be wrong about that, I definitely worry about my own ignorance!). Type A UA, as you describe it, is a statistical exercise (?) but you (or the citation) describe Type B as involving judgement calls. Are the judgement calls accounting for variation, “ignorance”, or both?

        I’m almost harping on “ignorance” because I don’t think modelers (of anything) consider their “ignorance” enough when evaluating their models, particularly if the number of variables is high and even unknowns must be estimated.

        Thanks, again!

        • CM, I appreciate your time and I won’t impose further. I’m digging into the GUM for further education!

          • No problem, having done a number of formal UAs for work, I had to become familiar with the innards of the GUM on at least a basic level. I am by no means an expert, nor am I a statistician.

    • It is a complicated issue. The GUM basically deals with measuring one thing with one device. If you assume that measurement errors are random, and you take a number of independent measurements (that is, for example, using different people) then the “true value” will be surrounded by small errors. If they are random, there will be as many short measurements as there are long measurements and they will have a “normal” distribution. The mean of that distribution will be the “true value”.

      That doesn’t necessarily mean accurate, but the measurements of that device should be pretty repeatable. Accuracy and precision are a whole different subject in the GUM.

      The GUM also doesn’t deal with how to handle trending temperature measurements from different locations. That is a whole different area of statistics.

  7. The deliberate balancing of TOA radiation in the GCM’s as discussed here raises a new point about using GCM’s at all to diagnose the impact of past and future greenhouse gas emissions and concentrations. Now it seems even much worse than we thought.

    • I realize this thread is getting stale, but nevertheless I’m replying to my own comment here, having concluded that no, the GCM’s do not actively drive the calculations to achieve a prescribed TOA radiative balance. It kept recurring to me, “That can’t be right.”
      But there is certainly tuning in the development of the model which drives the TOA balance toward more or less stable results over time. And there also is the necessity for a conservation-of-energy “fix” to counter any residual imbalance from the operation of the model itself. See the research article linked both here and farther below in another comment. The article addresses these considerations for the GFDL’s new climate model CM4.0.

      https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2019MS001829

      It remains plain to me that a GCM, even this new one, has nowhere near the resolving power to diagnose or project the temperature response to incremental forcing attributed to CO2, or from human causes in total.

  8. Author’s comment: I have noticed that in translation from “Word” to “WordPress”, the section letters got turned into bullet points. Fortunately the section summaries can help people to keep track of the lettering of sections.

    Richard Booth

    • Rich,
      So apt that WordPress created an error about an essay on error. Next, how do you quantify the WordPress error, to better cope with it in the future?
      Jokes aside, yours is an important essay. Pat Frank and I have corresponded for years.
      We both have experience with analytical chemistry. That is a discipline in which you live or die by your ability to both quantify analytical uncertainty limits and maintain your daily output inside those limits.
      You arrive at conclusions. One of these suggests that it is not possible to calculate valid limits unless you have identified the existence of, plus the weight of, EVERY perturbing variable; and that you have the math and logic skills to process the weights into an acceptable summary form.
      If you accept that proposition, it follows that you cannot assign valid, overall uncertainties to GCM outputs. Those who model GCMs must know this. I have to conclude that they have devised ways to quell their rebellious consciences and motor on, knowing they are spreading scientific porkies. Geoff S

  9.  “where is the vaunted skill of these GCMs, compared with anyone picking their favourite number for climate sensitivity and drawing straight lines against log(CO2)?”

    Exactly. We can model future temperatures in exactly that way and use lots of different sensitivity figures to give us the range of possible temperatures. We do not need these huge models at all. But the modelers claim that the “actual” sensitivity is an emergent property of the models, which is nonsense for the reasons you mention – sensitivity to startup conditions (fundamentally unknowable for the model) and artificial limiting of the model in particular.

    Unfortunately the sensitivity figure is vital to the claims about Climate Change and admitting it is unknown simply collapses the whole scare. And so we have the pretence that models that cannot tell us the figure can actually do so. But if they can, then we can predict future temperatures without the models. But we cannot, which proves the models are not able to tell us what we need to know. In other words, as long as we need the models, we should not use the models.

  10. Dr. Booth, in response I offer this from Freeman Dyson, his discussion with Enrico Fermi:

    “When I arrived in Fermi’s office, I handed the graphs to Fermi, but he hardly glanced at them. He invited me to sit down, and asked me in a friendly way about the health of my wife and our new-born baby son, now fifty years old. Then he delivered his verdict in a quiet, even voice.
     
    “There are two ways of doing calculations in theoretical physics”, he said. “One way, and this is the way I prefer, is to have a clear physical picture of the process that you are calculating. The other way is to have a precise and self-consistent mathematical formalism. You have neither.”
     
    I was slightly stunned, but ventured to ask him why he did not consider the pseudoscalar meson theory to be a self-consistent mathematical formalism. He replied,
     
    “Quantum electrodynamics is a good theory because the forces are weak, and when the formalism is ambiguous we have a clear physical picture to guide us. With the pseudoscalar meson theory there is no physical picture, and the forces are so strong that nothing converges. To reach your calculated results, you had to introduce arbitrary cut-off procedures that are not based either on solid physics or on solid mathematics.”
     
    In desperation I asked Fermi whether he was not impressed by the agreement between our calculated numbers and his measured numbers. He replied, “How many arbitrary parameters did you use for your calculations?” I thought for a moment about our cut-off procedures and said, “Four.” He said,
     
    “I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”
     
    With that, the conversation was over. I thanked Fermi for his time and trouble, and sadly took the next bus back to Ithaca to tell the bad news to the students.

    I looked at your model. You have no “clear physical picture of the process that you are calculating”. You also have no “precise and self-consistent mathematical formalism”.

    Instead, you have five arbitrary parameters. As a result, I fear that it is no surprise that you can make the elephant wiggle his trunk.

    My best to you,

    w.

    • Willis, by my model, I presume you mean my Equation (1). I have not in fact fitted anything to that, so have not tried to make the veritable elephant wiggle her trunk. The point about my model (1) is that it generalizes Pat Frank’s and with some parameters output pretty much the same mean values but will lead to very different evolution of error/uncertainty over time (which may have some chance of emulating GCM errors).

      And anyway, in Section C “Plausibility of New Parameters” I do describe a physical picture, namely storage of heat, which can lead to a decay parameter ‘a’ and bounded uncertainty.

      Best to you too, and it has been good to see you writing more on WUWT again.
      Rich.

      • “the veritable elephant wiggle her trunk”

        It is a well-known fact that only elephants of the male sex (not gender) wiggle their trunks. The females are watching for the wiggle.

        p.s. I made that up.

      • See – owe to Rich February 7, 2020 at 4:04 pm

        Willis, by my model, I presume you mean my Equation (1). I have not in fact fitted anything to that, so have not tried to make the veritable elephant wiggle her trunk.

        Rich, thanks for your kind words. I had a gallbladder operation so I was recuperating for a bit, but I’m back to full strength. Well, at least something near full strength.

        Regarding your post, I do mean Equation (1).

        By my count, the free parameters are k, the “0” and the “2” in the ∑ limits, the ∑ variable “i”, the “11” in the middle, S, g, and the “9” in the final denominator.

        That’s eight freely chosen or tuned parameters, and you are in the middle of an elephant circus …

        Best regards,

        w.

        • Willis, ah, too many Equation (1)’s! The title of this posting, and the ensuing content, has nothing about the sun or carbon dioxide. It is about matters arising from Pat Frank’s very interesting paper of about 6 months ago.

          The equation you quote is from my paper, which I was merely mentioning in my “coming out” speech as part of my credentials, not as subject matter here. Still, since you raise the point, I did try to write a WUWT posting on that paper in April 2018, but one of Anthony’s contacts dissuaded him from publishing. Who knows, it might even have been you.

          Anyway, there are not 8 free continuous and effective parameters in that equation, there are 5, and only 4 if b_2 is set to 0 as commonly happens in the paper. To see this, consider the 4 free continuous parameters b_0, b_1, b_2, S. Once those are chosen, we can subtract out the terms with known values L(n-i) and C(n-g), leaving x = k+11(b_0+b_1+b_2)-S log2(C(9)). But that is only one free parameter: whatever you choose for the first 4, k can be chosen to give whatever value of x you want to use.

          I didn’t explain that in the paper, because it is well known to statisticians as “confounding of parameters”. As for g, it is either 0 or 1, with pretty similar results, and isn’t a continuous parameter.

          Rich.
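The "confounding of parameters" point above can be illustrated with a generic least-squares sketch. This is not the equation from the paper: the regressor `z`, the noise level, and the coefficient values are hypothetical, chosen only to show that a second constant column in the design matrix adds no fitting freedom beyond the intercept.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 14 points, one real regressor z plus noise.
z = np.linspace(0.0, 1.0, 14)
y = 0.3 + 0.8 * z + rng.normal(0.0, 0.05, z.size)

# Design A: intercept, a second constant column (e.g. a fixed reference level), and z.
X_redundant = np.column_stack([np.ones_like(z), np.full_like(z, 11.0), z])
# Design B: intercept and z only.
X_reduced = np.column_stack([np.ones_like(z), z])

rank_redundant = np.linalg.matrix_rank(X_redundant)
rank_reduced = np.linalg.matrix_rank(X_reduced)

# lstsq returns the minimum-norm solution when the design is rank-deficient.
beta_full, *_ = np.linalg.lstsq(X_redundant, y, rcond=None)
beta_red, *_ = np.linalg.lstsq(X_reduced, y, rcond=None)

fit_full = X_redundant @ beta_full
fit_reduced = X_reduced @ beta_red

print("rank with redundant constant:", rank_redundant)
print("rank without it:", rank_reduced)
print("fitted values identical:", np.allclose(fit_full, fit_reduced))
```

Both designs have the same rank and give the same fitted values, so in this sketch the two constant terms count as a single effective parameter.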

          • Thanks, Rich. You say:

            “Anyway, there are not 8 free continuous and effective parameters in that equation, there are 5, and only 4 if b_2 is set to 0 as commonly happens in the paper. To see this, consider the 4 free continuous parameters b_0, b_1, b_2, S. Once those are chosen, we can subtract out the terms with known values L(n-i) and C(n-g), leaving x = k+11(b_0+b_1+b_2)-S log2(C(9)). But that is only one free parameter: whatever you choose for the first 4, k can be chosen to give whatever value of x you want to use.”

            True. Sorry, I missed that. However, that still leaves k, the “0” and the “2” in the ∑ limits, g, b_0, b_1, b_2, and the “9” in the final denominator.

            My point is simple. Whether you have four or eight tunable parameters, I’d be absolutely shocked if you could NOT fit it to the data. That’s the point of what “Johnny” von Neumann said. With that many free parameters and a totally free choice of the equation, you can fit it to anything.

            Finally, I’ve never understood how the length of a sunspot cycle could possibly affect global temperatures when the amplitude of said cycles doesn’t affect the temperatures. What is the possible physical connection between the two?

            Best regards,

            w.

          • Willis (Feb 9 1:48pm): I’ll reply with a specific point and a general point.

            Specific: k – no, that is not extra because x = k+11(b_0+b_1+b_2)-S log2(C(9)) covers it. 0 and 2 in the limits – no, they are not extra because they merely serve to enumerate b_0, b_1, b_2 which are already in there. 9 – no that is not extra because, again, x covers it.

            General: the theory of hypothesis testing in the General Linear Model. Von Neumann’s quip is amusing but actually unfair. We use mathematical statistics to determine whether a parameter has a significant effect – we don’t just look at the data, as often happens on blogs, and say “look at how well that fits”. Consider tide predictions, where I believe there are many more than 4 parameters in the models; presumably von Neumann would complain about those, and yet because of the wealth of data each one has a certain amount of value and a certain statistical significance.

            With global temperatures averaged over 14 11-year intervals, there isn’t that much information relatively, so it is hard to find significant effects. The mathematical analysis in the paper shows that S is easily significant because of the overall upward trend, b_0 isn’t significant and gets dropped, b_1 is significant and gets retained, and b_2 is just outside significance and either gets dropped on strict criteria or retained on aesthetic criteria. In the strict case the free parameters are k, S and b_1 so those 3 parameters in any case meet von Neumann’s pedagogy.

            Rich.

        • See – owe to Rich February 10, 2020 at 2:46 am

          Willis (Feb 9 1:48pm): I’ll reply with a specific point and a general point.

          Specific: k – no, that is not extra because x = k+11(b_0+b_1+b_2)-S log2(C(9)) covers it. 0 and 2 in the limits – no, they are not extra because they merely serve to enumerate b_0, b_1, b_2 which are already in there. 9 – no that is not extra because, again, x covers it.

          Thanks, Rich. The variables k, 11, and S are, as you pointed out, confounded parameters. However, they are only confounded if you specify the rest. But the ones that you specify include “9”. So that one is indeed one of the tunable parameters. And we have to include the confounded parameter (made up of k, 11, and S as you stated elsewhere). I included it as “k”, and although you can give it any name it is a tunable parameter.

          So that still leaves what I’ll call C (the confounding of k, 11, and S), g, b_0, b_1, b_2, and the “9” in the final denominator. With six tunable parameters, how well your model fits the data is MEANINGLESS. Seriously. I know you did it with solar cycle lengths as input (but without any explanation how a cycle that lasts a year longer has some magical effect).

          But I could do the same with say global population or the price of postage stamps or money spent on pets or a hundred other input variables.

          So what?

          Seriously, so what? I know this is hard to accept, just as it was hard for Freeman Dyson to accept. And I’m sorry to be the one to burst your bubble.

          But if you can’t fit a simple temperature curve given the free choice of equation, variables, and six tunable parameters, you should hang up your tools and go home. It’s nothing more than a futile exercise in tuning.

          General: the theory of hypothesis testing in the General Linear Model. Von Neumann’s quip is amusing but actually unfair. We use mathematical statistics to determine whether a parameter has a significant effect – we don’t just look at the data, as often happens on blogs, and say “look at how well that fits”. Consider tide predictions, where I believe there are many more than 4 parameters in the models; presumably von Neumann would complain about those, and yet because of the wealth of data each one has a certain amount of value and a certain statistical significance.

          Let me start by saying that Neumann’s statement was not an “amusing quip”. We know that because hearing it caused one of the best scientists of the century, Freeman Dyson, to throw away a year’s work by him and his students. So no, you can’t pretend it’s just something funny that “Johnny” said. It is a crucial principle of model building.

          Next, tides are something that I know a little about. I used to run a shipyard in the Solomon Islands. The Government there was the only source of tide tables at the time, and they didn’t get around to printing them until late in the year, September or so. As a result, I had to make my own. The only thing I had for data was a printed version of the tide tables for the previous year.

          What I found out then was that for any location, the tides can be calculated as a combination of “tidal constituents” of varying periods. As you might imagine, the strongest tidal constituents are half-daily, daily, monthly, and yearly. These represent the rotations of the earth, sun, and moon. There’s a list of some 37 tidal constituents here, none of which are longer than a year.

          But the reason Neumann wouldn’t object to them is that they are backed by a clear physical theory. You’re overlooking the first part of Fermi’s discussion with Dyson, viz:

          Then he delivered his verdict in a quiet, even voice. “There are two ways of doing calculations in theoretical physics”, he said. “One way, and this is the way I prefer, is to have a clear physical picture of the process that you are calculating. The other way is to have a precise and self-consistent mathematical formalism. You have neither.”

          For the tides, we indeed have an extremely clear physical picture of the process we’re calculating. So the question of tuning never arises.

          With global temperatures averaged over 14 11-year intervals, there isn’t that much information relatively, so it is hard to find significant effects. The mathematical analysis in the paper shows that S is easily significant because of the overall upward trend, b_0 isn’t significant and gets dropped, b_1 is significant and gets retained, and b_2 is just outside significance and either gets dropped on strict criteria or retained on aesthetic criteria. In the strict case the free parameters are k, S and b_1 so those 3 parameters in any case meet von Neumann’s pedagogy.

          With six tunable parameters fitting only fourteen data points, you have almost half as many parameters as data points. I’m sorry, but that is truly and totally meaningless. You desperately need to be as honest as Dyson was. He didn’t complain and claim that maybe it was four tunable parameters, not five. Instead:

          I thanked Fermi for his time and trouble, and sadly took the next bus back to Ithaca to tell the bad news to the students.

          You need to do what Dyson did, accept the bad news, put your model on the shelf, and move on to a more interesting problem of some kind.

          Sadly indeed,

          w.

          • Willis Feb 10 10:13am:

            Willis, I’ll prepend your comments with ‘W’, mine with ‘R’, followed by my unmarked replies.

            R: Specific: k – no, that is not extra because x = k+11(b_0+b_1+b_2)-S log2(C(9)) covers it. 0 and 2 in the limits – no, they are not extra because they merely serve to enumerate b_0, b_1, b_2 which are already in there. 9 – no that is not extra because, again, x covers it.

            W: Thanks, Rich. The variables k, 11, and S are, as you pointed out, confounded parameters. However, they are only confounded if you specify the rest. But the ones that you specify include “9”. So that one is indeed one of the tunable parameters. And we have to include the confounded parameter (made up of k, 11, and S as you stated elsewhere). I included it as “k”, and although you can give it any name it is a tunable parameter.

            W: So that still leaves what I’ll call C (the confounding of k, 11, and S), g, b_0, b_1, b_2, and the “9” in the final denominator.

            No, it is not S which is confounded with k, but log2(C(9)), which is a constant, and if I choose a different time index instead of 9, say 3, then in place of k I use k’ = k-S log2(C(9))+S log2(C(3)) and get the same value x = k’+11(b_0+b_1+b_2)-S log2(C(3)).
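The change of reference index described above can be checked term by term. The following uses only the symbols already given in this comment, with x' denoting the combined constant after switching from index 9 to index 3:

```latex
\begin{aligned}
k' &= k - S\log_2 C(9) + S\log_2 C(3),\\
x' &= k' + 11\,(b_0+b_1+b_2) - S\log_2 C(3)\\
   &= k - S\log_2 C(9) + S\log_2 C(3) + 11\,(b_0+b_1+b_2) - S\log_2 C(3)\\
   &= k + 11\,(b_0+b_1+b_2) - S\log_2 C(9) = x .
\end{aligned}
```

The S log2(C(3)) terms cancel, so the choice of reference index is absorbed into k and does not change x.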

            And as I explained earlier, g only takes 2 possible values, with minor effects on the fit. g=0 means no lag between CO2 rise and temperature rise, and g=1 means an 11-year lag. In the paper I mention that 11 years is roughly consistent with other published estimates, so I could simply have taken g=1 as my given parameter.

            The continuously tunable parameters are k, b_0, b_1, b_2, S, which is 5, but as noted later they get reduced to 3: k, b_1, S.

            W: With six tunable parameters, how well your model fits the data is MEANINGLESS. Seriously. I know you did it with solar cycle lengths as input (but without any explanation how a cycle that lasts a year longer has some magical effect).

            You haven’t read the explanation in Section 6.2, which does nevertheless leave SCLs as a proxy deserving of further research.

            W: But I could do the same with say global population or the price of postage stamps or money spent on pets or a hundred other input variables.

            Nice joke, but I do think the sun and CO2 have more to do with climate than those.

            W: So what?
            Seriously, so what? I know this is hard to accept, just as it was hard for Freeman Dyson to accept. And I’m sorry to be the one to burst your bubble.

            W: But if you can’t fit a simple temperature curve given the free choice of equation, variables, and six tunable parameters, you should hang up your tools and go home. It’s nothing more than a futile exercise in tuning.

            Well, those 6 were really 5 and turned out to be 3 (again, see below). Given the amount of noise in temperature data, it would have been remarkable if all 5 had shone through with statistical significance.

            R: General: the theory of hypothesis testing in the General Linear Model. Von Neumann’s quip is amusing but actually unfair. We use mathematical statistics to determine whether a parameter has a significant effect – we don’t just look at the data, as often happens on blogs, and say “look at how well that fits”. Consider tide predictions, where I believe there are many more than 4 parameters in the models; presumably von Neumann would complain about those, and yet because of the wealth of data each one has a certain amount of value and a certain statistical significance.

            W: Let me start by saying that Neumann’s statement was not an “amusing quip”. We know that because hearing it caused one of the best scientists of the century, Freeman Dyson, to throw away a year’s work by him and his students. So no, you can’t pretend it’s just something funny that “Johnny” said. It is a crucial principle of model building.

            R: Yes, but it was being applied to theoretical physics, where we expect things to be more cut and dried, with some undoubted surprises along the way like quantum mechanics.

            W: Next, tides are something that I know a little about. I used to run a shipyard in the Solomon Islands. The Government there was the only source of tide tables at the time, and they didn’t get around to printing them until late in the year, September or so. As a result, I had to make my own. The only thing I had for data was a printed version of the tide tables for the previous year.

            W: What I found out then was that for any location, the tides can be calculated as a combination of “tidal constituents” of varying periods. As you might imagine, the strongest tidal constituents are half-daily, daily, monthly, and yearly. These represent the rotations of the earth, sun, and moon. There’s a list of some 37 tidal constituents here, none of which are longer than a year.

            I rest my case m’lud.

            W: But the reason Neumann wouldn’t object to them is that they are backed by a clear physical theory. You’re overlooking the first part of Fermi’s discussion with Dyson, viz:
            Then he delivered his verdict in a quiet, even voice. “There are two ways of doing calculations in theoretical physics”, he said. “One way, and this is the way I prefer, is to have a clear physical picture of the process that you are calculating. The other way is to have a precise and self-consistent mathematical formalism. You have neither.”

            W: For the tides, we indeed have an extremely clear physical picture of the process we’re calculating. So the question of tuning never arises.

            But the estimation of 37 parameters is tuning, and as I said, there is such a vast quantity of data available that statistical analysis can show the relative and non-zero importance of each of those.

            R: With global temperatures averaged over 14 11-year intervals, there isn’t that much information relatively, so it is hard to find significant effects. The mathematical analysis in the paper shows that S is easily significant because of the overall upward trend, b_0 isn’t significant and gets dropped, b_1 is significant and gets retained, and b_2 is just outside significance and either gets dropped on strict criteria or retained on aesthetic criteria. In the strict case the free parameters are k, S and b_1 so those 3 parameters in any case meet von Neumann’s pedagogy.

            W: With six tunable parameters fitting only fourteen data points, you have almost half as many parameters as data points. I’m sorry, but that is truly and totally meaningless.

            Your comprehension rather failed you there. 3 does not equal six. It isn’t even an adjacent integer.

            W: You desperately need to be as honest as Dyson was. He didn’t complain and claim that maybe it was four tunable parameters, not five. Instead: I thanked Fermi for his time and trouble, and sadly took the next bus back to Ithaca to tell the bad news to the students. You need to do what Dyson did, accept the bad news, put your model on the shelf, and move on to a more interesting problem of some kind.

            I shall be honest when the time comes, but not in response to your rather feeble criticisms and appeal to the omniscience of von Neumann. That time will be when, and if, new data falsifies my model. I think that the current and next solar cycles may indeed yield temperatures which require changes to my model parameters beyond the point of breaking. For example, b_1, currently standing at an 80:1 chance of its value occurring at random, might decline to 10:1, considered non-significant. Or the Durbin-Watson test on the residuals may become so significant that the model becomes irretrievable. If that happens, the past good fit of the model will have to be put down to spurious correlation. Or perhaps I’ll get lucky and a new Hiatus will occur this decade!

            Rich.

  11. I haven’t had time to digest all of this.

    One comment on the 10 ruler (or 100 ruler) case. Your case relies upon a statistical distribution of how to achieve the best serial measurement. That certainly appears ok at first brush. However, my understanding of Pat Frank’s paper is that you only have one ruler. The ruler has a given uncertainty and is used in serial measurements. I don’t believe your calculations covered that case adequately.

    It is simple logic that if a ruler is 0.1 inch short and you use it serially 10 times, your measurement will be short by one inch. Likewise, if it is 0.1 inch long and used 10 times, your measurement will be one inch long. This doesn’t even cover random errors like pencil width, parallax, etc.
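The distinction drawn here between a fixed ruler error and random placement errors can be sketched numerically. The 0.1 inch bias and the 10 serial uses come from the comment; the 0.1 inch random standard deviation per placement is an assumption added purely for illustration.

```python
import numpy as np

n_uses = 10
bias = -0.1    # inches per use: the ruler's fixed error (from the comment)
sigma = 0.1    # inches per use: hypothetical random error (pencil width, parallax)

# A fixed bias adds linearly over serial uses.
systematic_total = n_uses * bias
# Independent random errors add in quadrature.
random_total_sd = np.sqrt(n_uses) * sigma

# Monte Carlo check: many repetitions of a 10-step serial measurement.
rng = np.random.default_rng(2)
trials = 100_000
errors = bias + rng.normal(0.0, sigma, size=(trials, n_uses))
totals = errors.sum(axis=1)
sim_mean = totals.mean()
sim_sd = totals.std(ddof=1)

print("total systematic error:", systematic_total)
print("predicted sd of random part:", random_total_sd)
print("simulated mean and sd of total error:", sim_mean, sim_sd)
```

In this sketch the bias grows in proportion to the number of uses while the random spread grows only with its square root, which is why the two kinds of error have to be propagated differently.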

    Even if the uncertainty is expressed as a standard deviation, that only applies to what the measurement may be for one measurement. What is the uncertainty distribution after using it serially for 10 times? This is what his paper was about. If you use a piece of data with an uncertainty at the input of a GCM, it will flow through in some fashion to the output. If you then take that output and feed it back in to another run, the uncertainty will compound again. Each time you run the GCM, feeding the output of the last run into the input of the next, the uncertainty will compound.

    If you have programmed it in such a way as to cancel this type of error, what you have done is chosen what you want the output to be.

    • …… And when your calculations ultimately produce a potential “Temperature” variance far beyond what one observes in the real world, the only conclusion that can be deduced is that the model is fatally flawed.

      “It doesn’t matter how beautiful your theory is, it doesn’t matter how smart you are. If it doesn’t agree with experiment, it’s wrong” – Feynman

        • The “experiment” is the collection of real world raw data. Not tree ring proxy or any other proxy. When the product of a model’s output is re-entered as viable data for the next model run, and the uncertainty values exceed known parameters, the only possible conclusion is the model is worthless.

    • Jim,
      I was also going to object to this part of the analysis, but I would also point out that it is not reasonable to expect a manufacturer to specify an error distribution. In fact, I think you will find that outside the area of high-precision lab equipment, they don’t. All they will tell you is that any products outside the specified parameters (e.g. +/- 0.1 inch) are rejected (not sold). But they don’t make any attempt to describe the error distribution; it is unknown and unknowable. Without a full analysis of the GCM codes to account for all the floating point errors (both representational and rounding) and other potential sources of error (coding “bugs”) we can’t assume any particular error model. Any analysis of error propagation must deal with this issue, and I don’t believe that Dr. Booth has.

      • Paul, a point I make is that when numerous errors combine, their sum tends to normality by the Central Limit Theorem. It is then reasonable to use standard deviations of that in the error propagation. But correlation of errors does need to be addressed in the propagation; nevertheless apart from my R_2(t) versus R_3(t) I follow Pat Frank in assuming no correlation.

        RJB

        • This assumption about the CLT only applies when errors can be shown to be random, i.e., a normal distribution. Further assumptions are that you are measuring the same thing with the same device. In other words, repeatable measurements.

          Each temperature measurement that is recorded is a non-repeatable measurement. It is a one time, one location measurement. You can not reduce uncertainty of a single measurement by using measurements at different times or locations with different devices because you can’t assume a normal distribution of errors. Therefore, it is the only measurement you will ever have with whatever uncertainty budget applies to single measurements. In essence you have a mean with a variance for each measurement. When you average measurements you must also combine the variances.

          • Jim, the concept of reducing uncertainty through repeated statistically independent samples does not depend on the presence of a normal distribution. It is simple mathematics from calculating variances of means of random variables.

            But anyway my comment at Feb 8 1:49am didn’t involve a mean. Rather, it was pointing out that if an overall error arises from several independent sources, then by the CLT the error distribution tends to normality, whereas Paul Penrose seemed to be saying it was completely unknown. But that doesn’t _reduce_ the error variance, which is the sum of all the component variances.

            RJB
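The two claims in this comment, that a sum of independent errors tends toward normality and that its variance is the sum of the component variances, can be checked with a small simulation. The six uniform error sources and their half-widths are hypothetical; the sketch also assumes independence, which is exactly the condition other commenters dispute.

```python
import numpy as np

rng = np.random.default_rng(3)
trials = 200_000

# Hypothetical half-widths of six independent, uniformly distributed error sources.
half_widths = np.array([0.5, 1.0, 1.5, 2.0, 0.8, 1.2])
components = rng.uniform(-half_widths, half_widths, size=(trials, half_widths.size))
total = components.sum(axis=1)

# Variance of uniform(-a, a) is a^2 / 3; variances of independent terms add.
var_expected = (half_widths ** 2 / 3.0).sum()
var_observed = total.var(ddof=1)

def excess_kurtosis(x):
    """Fourth standardised moment minus 3; zero for a normal distribution."""
    c = x - x.mean()
    return (c ** 4).mean() / (c ** 2).mean() ** 2 - 3.0

k_single = excess_kurtosis(components[:, 0])  # one uniform source on its own
k_total = excess_kurtosis(total)              # the combined error

print("expected vs observed variance:", var_expected, var_observed)
print("excess kurtosis, single source vs sum:", k_single, k_total)
```

The excess kurtosis of the sum moves toward zero, the normal value, while the variance of the sum matches the sum of the variances, so the combined error becomes more normal in shape without becoming any smaller.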

          • See –> You missed the point. When taking a measurement, the population of the multiple random readings of the same thing must approximate a normal distribution if you wish to simply average the readings to find a “true value”. From the GUM:

            “3.1.4 In many cases, the result of a measurement is determined on the basis of series of observations obtained under repeatability conditions (B.2.15, Note 1).

            NOTE 1 The experimental standard deviation of the arithmetic mean or average of a series of observations (see 4.2.3) is not the random error of the mean, although it is so designated in some publications. It is instead a measure of the uncertainty of the mean due to random effects. The exact value of the error in the mean arising from these effects cannot be known.

            NOTE 2 In this Guide, great care is taken to distinguish between the terms “error” and “uncertainty”. They are not synonyms, but represent completely different concepts; they should not be confused with one another or misused.”

            The CLT will allow you to take samples from a population and use sample means to determine the mean of that population. The sample mean will tend toward a normal distribution but that does nothing to change the precision, variance, or uncertainty of the population. A lot of people believe the “error of the mean” of a sample mean distribution means the mean of the population gains all of these. It does not. It only describes how close the sample mean is to the mean of the population.

            Too many people use the CLT and error of the mean to justify adding digits of precision to averages. You can not do that. Significant digits are still important.

          • Jim (Feb 9 1:24pm): I deferred replying to your comment as it took a little more thought than some of the others.

            The JCGM does not say anything about measurements having to be from a normal distribution in order for their mean and standard deviation to be useful. Sections 4.2.2 and 4.2.3 define s^2(q_k) and s^2(q*) as the variances of the observations and of the mean (I am using q* as more convenient here than q with a bar on top). If there are n observations then s^2(q*) = s^2(q_k)/n, and the (standard) uncertainty of the mean is defined to be u(q*) = s(q*), which does decrease (on average) as n grows, in a sqrt(1/n) fashion.

            So I don’t see in what sense I “missed the point”.

            Rich.
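The relation s^2(q*) = s^2(q_k)/n quoted above can be tried on a deliberately non-normal population. This sketch draws from an exponential distribution with standard deviation 1 (an arbitrary choice, not taken from the thread) and compares the spread of the sample means with 1/sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(1)

def std_of_sample_means(n, trials=20_000):
    """Empirical standard deviation of the mean of n independent exponential draws."""
    samples = rng.exponential(scale=1.0, size=(trials, n))
    return samples.mean(axis=1).std(ddof=1)

results = {n: std_of_sample_means(n) for n in (1, 4, 16, 64)}

for n, s in results.items():
    print(f"n={n:3d}  sd of mean={s:.3f}  1/sqrt(n)={1 / np.sqrt(n):.3f}")
```

The spread of the mean shrinks as 1/sqrt(n) even though the population is skewed. The spread of the individual observations stays at 1 throughout, which is the distinction the reply that follows is drawing.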

          • Rich –> Look at what you are dividing by 1/sqrt n. You are dividing the population standard deviation squared. That means you end up with a standard deviation that is smaller than the population mean. It tells you that the sample mean is closer and closer to the population mean. In other words, the sample mean distribution becomes tighter and tighter around population mean. When n = infinity, the sample mean would be the same as the population mean.

            Please note, this calculation has nothing to do with the accuracy, precision, or variance of the population. It only tells you how close you have approximated the mean.

        • Dr. Booth,
          The statement, “when numerous errors combine, their sum tends to normality by the Central Limit Theorem” is only true if you are talking about true random errors. But what if what you are calling “errors” are really biases and are not random at all? In that case I don’t think your statement is true any longer. And if, as I assert, these “errors” are unknown, then it is as equally likely that they are biases versus random noise. In the face of this unknown, don’t we have to proceed along the worst case path, which is to say, the “errors” propagate forward as a simple sum?

  12. Dr. Booth, in your published article Section 4.10, you apply your model to the warming between 1980 and 2003 studied by Benestad (2009). The period from 1980 on also coincides with the satellite observations of global temperature. HADCRUT4 depends on local sparse uncertain ground and sea temperature measurements, whereas the satellite observations are the only true measure of global temperature (although themselves subject to different interpretations of the data: eg the UAH and RSS data series).

    Would you consider repeating your effort in section 4.10 using the satellite observations instead of HADCRUT4? Preferably both RSS and UAH? If the results stay much the same, you have at least verified your approach with two (or three) independent sets of observations.

    • Lance, I would love to be able to do that, but the statistical signal is too weak to turn into a significant result over just 40 years. With the HadCRUT4 data, the warming from 1910 to 1940 and then slight cooling gives the model part of its traction.

      But anyway that paper isn’t what this WUWT article is about.

      Thanks for the idea,
      RJB

  13. “The implication of comments by Roy Spencer is that there really is something like a “magic” component”
    It isn’t. The implication is simply that GCMs are based on conservation laws. In particular they conserve energy. So it has to add up. Where they make some local error, it is the overall conservation requirement that ensures TOA balance. Not “GCMs have to resort to error correction techniques”.

    “the practice seems dubious at best because it highlights shortcomings in GCMs’ modelling of physical reality.”

    No, it is GCMs’ correct modelling of physical reality, in which energy is conserved. A reality totally missing from Pat Frank’s toy model.

    • Nick, do you admit that the GCMs are subject to errors of +/-4W/m^2 in LCF (Longwave Cloud Forcing)? If they are, which other parameters of the GCMs automatically adjust to correct that error? Or, does LCF not even enter into the GCMs, for example because they use a fixed amount of radiation input, plus a bit for the annual increase in GHG LW downwelling?

      I admit I am showing my ignorance about GCMs here 🙂

      RJB

      • “are subject to errors of +/-4W/m^2 in LCF”
        That number has been grossly mis-used; it is actually a spatial variability. But insofar as the GCMs do get cloud opacity wrong; they simply give a consistent solution for a more or less cloudy world. Energy is still conserved.

        • “they simply give a consistent solution for a more or less cloudy world. Energy is still conserved.”

          In other words, they make stuff up.

        • “Energy is still conserved.”

          You keep saying this, and it isn’t true. The mathematically written-out Navier Stokes (NS) equations are based on conservation of mass, momentum, and energy. Those aren’t the equations that any general circulation model uses. Instead, they use a bastardized algebraic version of the NS equations, with any of a variety of discrete approximations representing the partial derivatives, where the spatial components are represented by computational grid-points surrounding the Earth.

          Modern computers have allowed modelers to use ~10E5 grid points to model the entire atmosphere. Even so, the spatial resolution is on the order of hundreds of kilometers, which doesn’t amount to anything a reasonable person would consider “resolution.” It is insufficient to resolve a thunderstorm. Heck, it is insufficient to resolve a hurricane, really.

          On top of that, these models are expected to integrate reliably 100 years into the future, and do so in less than 100 years run time. So even with the gigantic spacing of the grid points, the relatively small minimum time step (less than 5 minutes) demanded of an explicit solution method dictates that either implicit or “spectral” solution methods be employed. Each permits longer time steps, with the trade off of loss of accuracy for each step. Implicit methods, for example, are solved iteratively. The iteration at each grid point and time step is stopped when a pre-set error value is achieved, otherwise the computation would go on forever.

          The errors at every single one of those 10E5 grid points are small, but finite. And they serve as an erroneous set of initial conditions for the next time step. The errors do include errors in each of the “conserved” quantities, including energy. And they accumulate over time.

          All of this is besides the fact that, in order to handle turbulence (which dominates atmospheric physics), modelers have to employ Reynolds averaging. Even if one believed that the discretized NS equations were the same as the parent partial differential equations (they aren’t), the introduction of turbulence models to close the Reynolds averaged equations renders them non-physical. They are not physics-based any longer.

          Get over it.

          • “Instead, they use a bastardized algebraic version of the NS equations, with any of a variety of discrete approximations representing the partial derivatives”

            The NS equations are always expressed with algebra. And all CFD (and all applied continuum mechanics) will use discrete approximations. The main deviation of GCMs is in using the hydrostatic approximation for vertical pressure. That means dropping terms in acceleration and vertical viscous shear. You can check whether that is justified, and make corrections (updraft etc) when not.

            As with all CFD, you can express the discretisation in conserved form. That is, you actually do the accounting for mass, momentum and energy in each cell, with boundary stresses etc. You aren’t then relying on partial derivatives.

            As for integrating 100 years into the future, well, it does. That is the point of the fading effect of initial conditions. GCMs, like CFD, express the chaotic attractor. As long as it maintains stability, it will keep doing that; the ability to do so doesn’t wear out.

            “(less than 5 minutes) demanded of an explicit solution method dictates that either implicit or “spectral” solution methods be employed”
            Spectral is really just a better organised explicit. I think the time step is more than 5 minutes (the Courant condition is based on gravity waves which are a bit slower than sound), and basically, that is what they use. AFAIK, they don’t use implicit.

            “And they accumulate over time.”
            Actually that is what has sometimes upset Willis. They use energy “fixers” which basically add a global conservation equation. That makes the system slightly overdetermined, but stops the accumulation.

            “the introduction of turbulence models to close the Reynolds averaged equations”
            Well, all CFD does that. It makes the momentum more diffusive, as it should be, possibly not to the correct extent. But it doesn’t break the conservation.

          • Nick Stokes February 8, 2020 at 12:10 am

            “And they accumulate over time.”

             
            Actually that is what has sometimes upset Willis. They use energy “fixers” which basically add a global conservation equation. That makes the system slightly overdetermined, but stops the accumulation.

            STOP WITH THE HANDWAVING ACCUSATIONS!!! Provide a quote where I said I was “upset” by the accumulation of energy imbalance in the models. You can’t, because I never said that, it’s just another of your endless lies about me.

            What I actually said, AS I HAD TO REMIND YOU BEFORE, was that I was upset that Gavin didn’t put a Murphy Gauge on the method to determine if the error was either big or small. Here’s the quote:

            Willis Eschenbach January 18, 2020 at 12:24 pm

            Nick, please learn to read before attacking at random. I didn’t say he was a lousy programmer for how he dealt with the energy imbalance. I specifically said that was OK.

            I said he was a lousy programmer for not putting a Murphy Gauge on the amount of energy re-distributed, so he could see when and where it went off the rails.

            Stop your god-damned lying about what I said, Nick. That’s twice you’ve tried the same lie. It is destroying what little is left of your reputation.

            w.

          • “The NS equations are always expressed with algebra.”

            Well, no, they’re always expressed as a set of non-linear partial differential equations in their native form. Solution techniques are usually (though not always) expressed in algebraic form involving discrete approximations of the derivatives, partial or otherwise.

            “And all CFD (and all applied continuum mechanics) will use discrete approximations.” I never distinguished GCMs from CFD. It’s CFD and its pretensions with which I have the basic problem. It’s a tautology to say that all CFD will use discrete approximations. Discrete approximations – and the rise of the digital computer to calculate them – are the only reason there is such a thing as CFD, which is the art of approximating solutions of continuous differential equations through the use of discrete (algebraic) equations. Not all applied continuum mechanics uses discrete approximations, however, not even fluid dynamics. The application of finite calculus to fluid dynamics is popular because math is hard. It isn’t impossible in all cases. (see https://cds.cern.ch/record/485770/files/0102002.pdf, for example)

            “As with all CFD, you can express the discretisation in conserved form.”

            I thought that was the whole point. My point is that any numerical method run on a digital computer is subject to error, and integration of differential equations is subject to error in initial conditions, truncation error (pertaining to truncation of the series representing the solution function, whether Taylor, binomial, “spectral”, or whatever), and roundoff error (related but not limited to machine word length). For climate models, we can also add a class of error that does exist but is completely unquantified (and ignored): bit errors not detected by standard error correcting techniques. With the stupendous number of computations required for a 100-year climate run, these must have a substantial effect. These errors occur in every computed dependent variable at each time step, and though their order can be estimated, their magnitude and sign are completely unknown, and all of them contribute to the error in energy. There isn’t any way to correct for energy error in any way that can be proven consistent with reality.
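
            A minimal sketch of the roundoff component described above: the same sum is computed once with ordinary floating-point additions and once with Python's compensated `math.fsum`. The choice of 0.1 and of one million terms is arbitrary; the example only shows that per-operation rounding accumulates, and says nothing about the size of such errors in a real climate run.

```python
import math

# Add 0.1 one million times. 0.1 has no exact binary representation, so each
# addition rounds, and the rounding errors accumulate in the running total.
n = 1_000_000
naive = 0.0
for _ in range(n):
    naive += 0.1

# math.fsum tracks the lost low-order bits and returns a correctly rounded sum.
compensated = math.fsum(0.1 for _ in range(n))

print(f"naive sum    = {naive!r}")
print(f"math.fsum    = {compensated!r}")
print(f"naive - fsum = {naive - compensated:.3e}")
```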

            “As for integrating 100 years into the future, well, it does.”

            I can numerically integrate the equations of motion for the planets in the solar system (a much simpler problem) for 100 years, too. And the results will be wrong. But they will be results. In fact, I can keep it going for 100 million years. The results will be wronger, but they will still be results. It is thought that we can get realistic results for planetary positions over the span of a million years, but nobody really knows. Now, we can integrate the equations of motion of an ICBM and show that it can hit a target 6,000 nautical miles away with a 50% probability of hitting within 500 feet. Test flights verify that ability. But they’re 30 minutes in duration. Getting to the Moon takes 2 1/2 days. We can’t do that without course corrections – measuring actual state vectors in flight via radar and stellar updates, and computing a new trajectory to correct the one that we thought was correct in the first place, just so we don’t miss by a hundred kilometers. It strains credulity to think that an integration of phenomena whose physics are far less well understood than those of celestial mechanics could give any kind of meaningful results, particularly one involving a vastly larger number of computations at an accuracy order far lower than that of the integration schemes used for space flight. It doesn’t just strain it, it tears it limb from limb.
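
            To make the orbit example concrete, here is a minimal sketch of a single body on a circular orbit in scaled units (GM = 1, radius 1), integrated with two schemes. The step size and step count are arbitrary assumptions, and neither scheme is claimed to be what ephemeris or climate codes use. It shows only that how fast a conserved quantity drifts depends strongly on the integration scheme.

```python
import math

def accel(x, y):
    """Inverse-square acceleration toward the origin (GM = 1)."""
    r3 = (x * x + y * y) ** 1.5
    return -x / r3, -y / r3

def energy(x, y, vx, vy):
    """Specific orbital energy: kinetic plus potential."""
    return 0.5 * (vx * vx + vy * vy) - 1.0 / math.hypot(x, y)

def euler(steps, dt):
    """Explicit (forward) Euler: first-order, not energy conserving."""
    x, y, vx, vy = 1.0, 0.0, 0.0, 1.0
    for _ in range(steps):
        ax, ay = accel(x, y)
        x, y = x + dt * vx, y + dt * vy
        vx, vy = vx + dt * ax, vy + dt * ay
    return x, y, vx, vy

def leapfrog(steps, dt):
    """Velocity Verlet (kick-drift-kick): second-order and symplectic."""
    x, y, vx, vy = 1.0, 0.0, 0.0, 1.0
    ax, ay = accel(x, y)
    for _ in range(steps):
        vx += 0.5 * dt * ax
        vy += 0.5 * dt * ay
        x += dt * vx
        y += dt * vy
        ax, ay = accel(x, y)
        vx += 0.5 * dt * ax
        vy += 0.5 * dt * ay
    return x, y, vx, vy

e0 = energy(1.0, 0.0, 0.0, 1.0)          # exactly -0.5 for the circular orbit
e_euler = energy(*euler(10_000, 0.01))    # roughly 16 orbital periods
e_leap = energy(*leapfrog(10_000, 0.01))

print(f"initial energy       = {e0:.6f}")
print(f"Euler energy drift   = {e_euler - e0:+.6f}")
print(f"leapfrog energy drift= {e_leap - e0:+.6f}")
```

            In this toy the Euler orbit gains energy and spirals outward, while the leapfrog orbit keeps its energy close to the starting value but can still accumulate an error in phase.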

            On CFD as providing chaotic attractors, that’s one of my beefs with the discrete time derivative. Attractors involve a fixed point, a point at which f(x) = x. Finite difference approximations of the time derivative make that possible computationally, when it might not be possible for an analytic solution. It’s a problem I’ve been studying, but can provide no definitive answers yet.

            Spectral methods just replace the linear spatial interpolation functions, which are series-derived (Taylor or other) with series using orthogonal basis functions. The time derivative is still series-derived, and the time steps AFAIK are taken by implicit methods to avoid the Courant limit. And yes, they are much longer than 5 minutes, or we’d have 100 year simulations that took 1,000 years to run.

            The part about the turbulence models was off topic, but I am always frustrated when people assert that CFD is “physics.” And not all CFD uses them. Direct Numerical Simulation doesn’t, but it also would be incapable of running a 100 year simulation of the climate with enough nodes to capture turbulence at all significant levels.

          • Willis,
            “Provide a quote where I said I was “upset” by the accumulation of energy imbalance in the models. “

            OK, from here

            “A decade or more ago, I asked Gavin how they handled the question of the conservation of energy in the GISS computer ModelE. He told me a curious thing. He said that they just gathered up any excess or shortage of energy at the end of each cycle, and sprinkled it evenly everywhere around the globe …
            As you might imagine, I was shocked … rather than trying to identify the leaks and patch them, they just munged the energy back into balance.”

            Well, you said “shocked”; I said “upset”. I can remember a rather more extensive disapproval, which I can’t currently locate.

          • Michael Kelly,
            “It isn’t impossible in all cases. (see…”
            Well, I said applied continuum mechanics, which this really isn’t. And I don’t think there is much done in structural mechanics, elasticity etc that isn’t discretised.

            “for 100 years, too. And the results will be wrong”
            Well, not that wrong. The planets will still be in much the same orbit. Kepler’s laws etc will still be pretty much followed. You’ll get some phase discrepancies. Kind of like getting the climate right but the weather wrong.

            “are taken by implicit methods to avoid the Courant limit. And yes, they are much longer than 5 minutes, or we’d have 100 year simulations that took 1,000 years to run”

            Actually implicit methods are slower. One of my beefs here is that implicit methods aren’t really an improvement. They require an iterative solver, which basically reenacts internally the many steps that an explicit method would use.

            “And yes, they are much longer than 5 minutes”
            Well, somewhat, because the layered atmosphere isn’t as stiff as a block of air would be (on compression it can rise). Here is a practical discussion:
            “The experimental cases have GCM time steps of 600, 900, and 3600 s (fscale is higher with a smaller time step).”
            I think 3600s was coarse resolution.

          • Nick Stokes February 9, 2020 at 11:14 pm

            Willis,

            “Provide a quote where I said I was “upset” by the accumulation of energy imbalance in the models. “

            OK, from here

            “A decade or more ago, I asked Gavin how they handled the question of the conservation of energy in the GISS computer ModelE. He told me a curious thing. He said that they just gathered up any excess or shortage of energy at the end of each cycle, and sprinkled it evenly everywhere around the globe …
            As you might imagine, I was shocked … rather than trying to identify the leaks and patch them, they just munged the energy back into balance.”

            Well, you said “shocked”; I said “upset”. I can remember a rather more extensive disapproval, which I can’t currently locate.

            Nice try, but no cigar. You said I was upset by the accumulation of energy. But as your quote itself proves, now that you’ve bothered to quote it, I was NOT shocked by the accumulation.

            Instead, I was shocked by the fact that rather than try to identify any possible leaks, they made no effort to see if anything was wrong or even to see if the amount accumulating was unreasonably large.

            Instead they just sprinkled it around the planet. The accumulation and the sprinkling didn’t bother me as you falsely claimed. The lack of any attempt to monitor the process did, and I went on to talk about Murphy gauges.

            w.

    • TOA balance does not imply a discrete climate state, much less a physically accurate climate state.

      Climate models make very large errors in the distribution of energy among the climate sub-states.

      From Zanchettin, et al., (2017) Structural decomposition of decadal climate prediction errors: A Bayesian approach. Scientific Reports 7(1), 12862

      [L]arge systematic model biases with respect to observations … affect all of mean state, seasonal cycle and interannual internal variability. Decadal climate forecasts based on full-field initialization therefore unavoidably include a growing systematic error

      Model drifts and biases can result from the erroneous representation of oceanic and atmospheric processes in climate models, but more generally they reflect our limited understanding of many of the interactions and feedbacks in the climate system and approximations and simplifications inherent to the numerical representation of climate processes (so-called parameterizations).

      All these errors in climate sub-states are present despite the imposed TOA balance.

      But for you, Nick, “limited understanding” evidently achieves a “correct modeling of physical reality.”

      There is zero reason to think GCMs deploy a “correct modelling of physical reality.”

      Second, my emulation equation is not a “toy model.” Toy model is just another instance of you being misleading (again), Nick.

      Eqn. 1 demonstrates that GCMs project air temperature merely as a linear extrapolation of CO2 forcing. The paper explicitly disclaims any connection between the emulator and the physical climate.

      Thus, “This emulation equation is not a model of the physical climate. It is a model of how GCMs project air temperature.”

      You knew that Nick, when you chose to mislead.

    • Nick,
      The presence of conservation laws and all energies adding up is not really relevant.
      Permit a simpler analogy, whole rock analysis in analytical chemistry. The sum of all of the analyses of the chemical elements has to add up to 100% of the weight of the rock. (Conservation of Mass?). This says nothing much about the size and location of errors. In practice, the larger errors tend to go with the abundant elements, like oxygen, silicon, aluminium. These elements are typically in the tens of % range. Then, there are trace elements like (say) mercury, where we are in the parts per billion range. The error assigned to (say) oxygen analysis is far removed from the error associated with mercury analysis. Large errors in trace mercury analysis have next to no effect on the total mass balance.
      Back to the GCM case, the errors associated with the larger energy components will generally dominate the overall error analysis. Given the central role of Top of Atmosphere energy balance, I show again this figure from about 2011.
      http://www.geoffstuff.com/toa_problem.jpg from Kopp & Lean
      http://onlinelibrary.wiley.com/doi/10.1029/2010GL045777/full

      Here we have the classic problem of subtracting two large numbers (energies in and out) to get a tiny difference whose even smaller variation has significance for the problem at hand. But, the TOA energy balance shown by the responses of various satellite instruments in the figure above is heavily dependent on the subjective act of adjustment and aligning of the satellite data in the absence of an absolute comparator.
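
      The arithmetic behind "subtracting two large numbers" can be sketched as follows. The fluxes and their uncertainties are hypothetical round numbers chosen only to show the effect; they are not measurements and are not taken from the Kopp & Lean figure. The errors are also assumed independent, which is itself a simplification.

```python
import math

# Hypothetical round numbers chosen only to illustrate the arithmetic;
# they are not measurements and not values from any cited paper.
flux_in, u_in = 340.0, 1.0     # W/m^2, with an assumed standard uncertainty
flux_out, u_out = 339.0, 1.0   # W/m^2, with an assumed standard uncertainty

net = flux_in - flux_out
# For independent errors, standard uncertainties add in quadrature.
u_net = math.hypot(u_in, u_out)

print(f"relative uncertainty of each flux : {100 * u_in / flux_in:.2f} %")
print(f"net imbalance                     : {net:.2f} +/- {u_net:.2f} W/m^2")
print(f"relative uncertainty of the net   : {100 * u_net / net:.0f} %")
```

      Under these assumptions each flux is known to a fraction of a percent, yet the uncertainty of their difference is larger than the difference itself.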
      Which leads to a more general question, is it valid to apply classic error analysis methods to numbers that are invented or subjectively adjusted as opposed to measured? For example, how does one calculate the useful error of historic gridded surface sea temperatures when a large % of them are invented by interpolation?
      Geoff S

      • Geoff,
        “The sum of all of the analyses of the chemical elements has to add up to 100% of the weight of the rock.”
        The difference is that the whole of mechanics can be derived from the conservation laws. There isn’t anything else. And that is the basis for solution. It is really built in.

        “For example, how does one calculate the useful error of historic gridded surface sea temperatures when a large % of them are invented by interpolation?”
        The entire field outside the points actually measured is “invented” by interpolation. We know about temperature by sampling, as we do throughout science. Why are you analysing those rock samples? Because you want to know the properties of a whole mass of rock. Fortunes depend on it. You “invent by interpolation”. Suppose you do mine and crush it. How do you know the value of what you produced? Again, you analyse samples. Hopefully with good statistical advice. It’s all you can do.

        • Stokes
          You said, “It’s all you can do.” No! If the analyses prove to be wrong then you can try to determine why they are wrong. You might want to alter the sampling procedure or alter your model of mineralization. You might want to look for a different assayer or chemist. You might want to look at your ore-processing stream to see if you are losing things of value.

          It isn’t sufficient to say that because your approach is based on conservation laws it has to be right, and then ignore surprises.

          • “If the analyses prove to be wrong then you can try to determine why they are wrong. “

            It isn’t an issue of whether the analysis of those samples is accurate. The issue is that you then have to make inferences about all the rock you didn’t sample. Same as with temperature, and just about any practical science. How do you know the strength of materials in your bridge? You measure samples. Or look up documentation, which is based on samples of what you hope are similar materials. How do you know the salinity of the sea? You measure samples. The safety of your water supply? Samples.

          • Stokes
            As usual, you either missed the point or chose to construct a strawman. You said, “How do you know the strength of materials in your bridge? You measure samples.” The issue is that if your bridge fails, you ask why. It may well turn out that the sampling was done incorrectly. It is also possible that the formulas used were wrong or calculated incorrectly. In any event, a good engineer doesn’t resort to defending the design by claiming that it was based on “conservation laws.” They try to determine why the bridge failed and make appropriate corrections to avoid repeating the mistake(s).

      • Geoff
        Further to your remarks, when an analysis doesn’t add up to 100% (all the time!) an assumption that is made is that the error is proportional to the calculated oxide percentage. Thus, every oxide reported is adjusted in proportion to the raw percentage. That is not an unreasonable assumption, but it may not be true. Most of the error may actually be associated with just one oxide. Therefore, if the assumption of proportionality is not true, then error is introduced into all the calculated amounts.
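
        A small worked example of that proportionality assumption, using made-up oxide percentages: the whole analytical error is placed in one oxide, and the raw values are then rescaled so that they sum to 100%. The composition and the size of the error are hypothetical and chosen for easy arithmetic.

```python
# Hypothetical oxide percentages (true values sum to 100).
true = {"SiO2": 60.0, "Al2O3": 20.0, "FeO": 10.0, "CaO": 10.0}

# Assume the whole analytical error sits in one oxide: SiO2 reads 2 points low.
measured = dict(true)
measured["SiO2"] -= 2.0          # raw total is now 98.0

# Proportional normalization: scale every oxide so the total is 100 again.
total = sum(measured.values())
normalized = {name: value * 100.0 / total for name, value in measured.items()}

for name in true:
    print(f"{name:6s} true={true[name]:6.2f} measured={measured[name]:6.2f} "
          f"normalized={normalized[name]:6.2f} "
          f"error={normalized[name] - true[name]:+.2f}")
```

        After rescaling, the total is 100% again, but the three oxides that were measured correctly are now each slightly too high, and the one that was wrong is still too low.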

        The same thing could be true about conservation of energy for TOA. If the GCM step-output is scaled back to 100%, without determining WHY it is in error, then an error is retained and propagated instead of being expunged. So, while Stokes and others think that they are keeping the calculations on track, they may just be propagating an unknown error and blithely go on their way convinced that they are doing “science based” modeling. In reality, they are pretending that the uncertainty doesn’t propagate and it becomes a self-fulfilling belief because they constrain the variation over time with a rationalization.

        • Thank you Clyde,
          You are showing that you know what I mean.
          The analytical chemist, faced with a discovery of unacceptable errors, either withdraws the results thought to be wrong, or embarks on new investigations using a variety of available techniques.
          The GCM modeller seems to merely double down and try to bullshit a way through the wrong results. I admit that the modeller has fewer options about going to other investigations, but the alternative seems to be ignored. The alternative is to withdraw the results known to be in error.
          It verges on the criminal to continue to push results known to be so wrong.
          Geoff S

  14. Rich, “… this means that Equation (1) with the equivalent of Frank’s parameters, using many examples of 80-year runs, would show an envelope where a good proportion would reach +15K or more, and a good proportion would reach -15K or less,”

    If your equation (1) means that, then it has no correspondence with my emulation eqn. (1), nor with the uncertainty calculation following from my eqn. (5).

    The predictive uncertainty following from model calibration error says nothing whatever about the magnitude of model output.

    • Rich, “But the GCM outputs represented by CMIP5 do not show this behaviour, …”

      Yet once again, the same bloody mistake: that predictive uncertainty is identical to physical output error.

      Correcting this mistake is, for some people, evidently a completely hopeless project.

      The ±15 C predictive uncertainty says nothing whatever about the magnitude of GCM outputs, Rich. It implies nothing about model behavior.

      • Pat, let’s face it, we are never going to agree about “uncertainty” (or perhaps even what we mean by “mean”!). I have stated pretty clearly, supported I believe by the JCGM, how uncertainty relates to model error. That then says a lot about model behaviour, and it is model behaviour we are interested in, because global policy is mistakenly based on it.

        If the GCMs did not use “conservation of energy” to correct for error, then I have absolutely no doubt that they would indeed wander off to +/-15 degrees or more by 2100. Your paper has been invaluable, and I give that credit in Section H, in extracting the admission of auto-correction, which means the models probably don’t care too much about LCF errors except perhaps in respect of regional distribution of temperature (GCM experts: I’d still like to hear more on that subject).

        Rich.

        • Rich, “… then I have absolutely no doubt that they would indeed wander off to +/-15 degrees or more by 2100.”

          But that’s not the meaning of my analysis, nor what I or my paper are saying.

          You insist upon an incorrect understanding of predictive uncertainty. That mistake is fatal to your whole argument.

          I’ll have more to say later.

    • Pat,
      I don’t see how people keep getting this point wrong. What your “uncertainty envelope” means is that in order to have any confidence in the model outputs, they would have to be outside that envelope. Why? Because any results within the envelope are just as likely to be caused by random errors of various kinds, versus valid expressions of the underlying physical theories. Of course, if they were outside the envelope they would be equally unbelievable. So the only conclusion is that the model outputs beyond a few months (days?) from the starting time are simply useless.

      • Paul–> What you are saying is what was mentioned above. One must resolve the reasons for the uncertainties until the “output” uncertainty is small enough to allow one to conclude that the calculated result exceeds the uncertainty interval. If the interval continues to be too large, keep searching and revising until you can legitimately reduce it further.

    • Pat, can you prove mathematically that my Equation (1) has no correspondence with your (1) or your (5)? My explanation of equivalence is in Section D Equation (8).

      Rich.

  15. “Instead, sort them by length, and use the shortest and longest 5 times over. We could do this even if we bought n rulers, not equal to 10. We know by symmetry that the shortest plus longest has a mean error of 0, …”

    The statements are true only if the error about the nominal value is symmetrical. A manufacturing error could introduce a bias such that the arithmetic mean was 12″ but the mode was some other length.
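
    A minimal simulation of that caveat, under assumed distributions: batches of 10 ruler errors are drawn, and the average of the shortest and longest is recorded. The symmetric case uses a uniform distribution; the skewed case uses an exponential distribution shifted to have mean 0, which is purely an illustrative choice and not a model of any real manufacturing process.

```python
import random

def mean_midrange_error(draw, n_rulers=10, trials=20_000, seed=1):
    """Average of (shortest + longest) / 2 over many simulated batches."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        errors = [draw(rng) for _ in range(n_rulers)]
        total += 0.5 * (min(errors) + max(errors))
    return total / trials

# Symmetric case: errors uniform on [-1, 1] (arbitrary units), mean 0.
symmetric = mean_midrange_error(lambda rng: rng.uniform(-1.0, 1.0))

# Skewed case (assumed for illustration): exponential shifted to mean 0,
# so the arithmetic mean is still on target but the mode is not.
skewed = mean_midrange_error(lambda rng: rng.expovariate(1.0) - 1.0)

print(f"symmetric errors: mean of (shortest+longest)/2 = {symmetric:+.4f}")
print(f"skewed errors   : mean of (shortest+longest)/2 = {skewed:+.4f}")
```

    In the symmetric case the shortest and longest errors cancel on average; in the skewed case they do not, even though the individual errors still average to zero.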

    • I once had a problem with a manufacturer of ground steel bar – which was consistently oversized.
      The diameter distribution data – although well capable within limits – was always biased to the oversize end of the limits? Curious.
      When I visited the factory and spoke to the operating staff I was told “The boss tells us he sells by the kilogram and we must work toward the upper limit” – Ah-Haa – stupid but understandable.

      The assumption of errors stacking up to a normal distribution does not hold if there is some socio-political bias in the way the end to end rulers are stacked – i.e. a bias to long or short.

      I fear climate modelling is fraught with such biases and the distribution curve moves and expands in one direction – increasing with every addition.

      “It is beyond coincidence that all these errors should be in the same direction.”
      Dr. Matt Ridley – Angus Millar lecture to the Royal Society – Edinburgh – Nov 1st 2011

      • Ken (Feb 8 12:22am), actually the socio-political bias does not prevent errors stacking up to a normal distribution. It just means that the arithmetic mean (see, owe to Rich that he is saying what he means by “mean”) is not the target diameter that you expected it to be. So the amount of uncertainty, i.e. dispersion of error, may be as expected, while the mean is not. This points up the importance of verification, which is what you admirably did. But if the bias didn’t take the results outside the tolerance quoted, if any, by the manufacturer, then you couldn’t sue them.

        RJB

    • Clyde, very true. But my analysis was predicated on a uniform distribution, which is symmetrical about its central value, whether biased or not. I did this because some previous Commenters had said that given an uncertainty interval one should assume any value within that is equally likely. All sorts of bizarre conclusions flow from that assumption, which ahead of time, one might not have expected mathematically.

      RJB

  16. “I talk to the trees/ But they don’t listen to me … “ Lyrics from “Paint Your Wagon”, 1969.
    “I talk to the trees/That’s why they put me away …” Lyrics by Spike Milligan, soon after.
    The Mann, Bradley & Hughes “hockeystick” paper of 1998 used properties of tree rings to derive an alleged temperature/time history of the Northern hemisphere. It is plausible that these authors knew that temperature was not the only variable able to influence such tree ring properties. Moisture, fertilizer, insect damage are 3 further properties of influence.
    In a proper error analysis, estimates of error of each variable are made, then combined to give an overall error. For moisture, there was some historic data and some prior work linking moisture levels to ring properties. There was much less historic data about fertilization effects, both natural and man-made. There was effectively no useful historic data to quantify insect effects on ring growth.
    It follows that it was improper to calculate error envelopes. Those shown in this much-criticized hockey stick paper must have an unscientific element of invention.
    http://www.geoffstuff.com/mbh98.jpg
    Geoff S

  17. I found the paper most interesting.
    I compare the results of our weather forecasting models to the “climate” ones. Weather has much higher resolution and gives fairly good results 1 to 2 weeks ahead. Beyond this they say little that is very useful, and accuracy falls very rapidly. Temperature expectations become +- several degrees and may well be hugely out. Paths of weather fronts are fairly unpredictable a week in the future. If I reduce the input data sampling to 100km squares the results are useless. Why is a similar model attempt, with even lower resolution, supposed to predict the “climate” many years into the future? One may say climate is better understood, but this is untrue. Huge effort has gone into weather prediction for many years, largely to get the error terms under control. The fact that the climate models do not agree with reality, says that they are useless. Why does anyone still believe anything they say, particularly after reading the analysis above!

    • David –> I see you quoted “climate”. This is a bugaboo of mine. The earth doesn’t have a climate! It has a myriad of climates that result in biomes. The GCM’s would better be classified as GTM’s, Global Temperature Models.

  18. But to criticize any GCM with this logic flies in the face of the reality of the broader CFD (Computational Fluid Dynamics) based design methods, which are used in exactly the same way to do actual engineering and where results from the runs can be tested against reality again and again. They all prove without doubt that errors get squashed, not multiplied or propagated as suggested by simplifying the model to something simple.

    So my advice is to work with people versed in CFD to explain how the number of solutions is generally limited (steady states), which influences both error rates and uncertainty, in the long run for sure.

    • I think your advice is good; however, you talk of users of CFD testing against reality. Creators of GCMs do not, and cannot, test their creations against reality. Here lies the problem.

      Testing the global temperature anomaly output from a GCM against a measured!! global temperature anomaly is pointless since Australia could be a little warmer and Canada a little cooler and the measured temperature remains constant. The solution to this is to compare every grid cell temperature from the model with every real grid cell in reality.

      Impossible and therefore a waste of time, money and intellectual capital.
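
      The Australia/Canada point can be put in numbers with a toy example. The three regions, their equal weighting, and the anomaly values are all hypothetical; the sketch only shows that two fields can share the same global mean while disagreeing region by region.

```python
# Hypothetical regional anomalies in deg C, equal-area regions assumed.
observed = {"Australia": 0.5, "Canada": 0.5, "Elsewhere": 0.5}
modelled = {"Australia": 1.0, "Canada": 0.0, "Elsewhere": 0.5}

def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)

global_obs = mean(observed.values())
global_mod = mean(modelled.values())
# Root mean square of the region-by-region differences.
rmse = mean((modelled[k] - observed[k]) ** 2 for k in observed) ** 0.5

print(f"global mean, observed: {global_obs:.2f}")
print(f"global mean, modelled: {global_mod:.2f}")
print(f"regional RMSE        : {rmse:.2f}")
```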

  19. Steve
    I have long advocated that instead of distilling all temperature measurements into a single number, the measurements should be aggregated and averaged for each and every Köppen climate zone. That would tell us if warming is uniform (which it almost certainly isn’t), and would make it clear what climate zones are most poorly instrumented, and therefore have the greatest uncertainty, and provide us with a more sensitive measure of the regional changes for those zones that have the highest quality measurements. Yet, climatologists insist on using a single number, and don’t acknowledge the very large annual variance associated with a global average.

    • Clyde –> You have hit a nail on the head. Let me also add that we should also begin to include humidity levels. Enthalpy is what is important. One can have two biomes with similar temperatures but massive differences in humidity and consequently heat. Heat is what we should be dealing with, not temperature.
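
      A rough sketch of that temperature versus enthalpy distinction, using standard psychrometric approximations: the Bolton (1980) formula for saturation vapour pressure and the usual moist-air enthalpy expression per kilogram of dry air. The two sites, both at 30 °C and sea-level pressure with 20% and 80% relative humidity, are assumed values for illustration, not data from any station.

```python
import math

def saturation_vapor_pressure_hpa(t_c):
    """Bolton (1980) approximation over liquid water; t_c in deg C."""
    return 6.112 * math.exp(17.67 * t_c / (t_c + 243.5))

def moist_air_enthalpy(t_c, rh, p_hpa=1013.25):
    """Specific enthalpy in kJ per kg of dry air.

    Reference state: dry air and liquid water at 0 deg C.
    rh is relative humidity as a fraction (0 to 1).
    """
    pv = rh * saturation_vapor_pressure_hpa(t_c)
    w = 0.622 * pv / (p_hpa - pv)      # humidity ratio, kg water / kg dry air
    return 1.006 * t_c + w * (2501.0 + 1.86 * t_c)

# Two assumed sites at the same temperature but different humidity.
dry_site = moist_air_enthalpy(30.0, 0.20)
humid_site = moist_air_enthalpy(30.0, 0.80)

print(f"30 C, 20% RH: {dry_site:.1f} kJ/kg dry air")
print(f"30 C, 80% RH: {humid_site:.1f} kJ/kg dry air")
```

      Under these assumptions the humid air carries roughly twice the enthalpy of the dry air at the same thermometer reading.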

  20. For those interested, I found this recent research article by Held, et al. “Structure and Performance of GFDL’s CM4.0 Climate Model” in the Journal of Advances in Modeling Earth Systems, November 2019.

    https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2019MS001829

    See figure 15 in particular – “Root mean square errors (RMSE) in net, net shortwave, and outgoing longwave radiation (in W/m2) at top‐of‐atmosphere (TOA) for the annual mean and the individual seasons…”

    See also figure 4. In a nearby paragraph there is mention of “a uniformly distributed energy‐conservation fix.”

    • David, thanks for spotting that. It sounds very interesting because I would love to get to the bottom of how the energy conservation thing is enacted in practice. However, at the moment I get “This site is currently unavailable. We apologize for the inconvenience while we work to restore access as soon as possible.” I hope I have better luck later.

      RJB

      • “I would love to get to the bottom of how the energy conservation thing is enacted in practice”
        It’s fairly simple. I mentioned it here. You have a whole lot of equations that are meant to conserve energy locally. There are always small errors, which will mostly cancel. But sometimes there is a bias, which leads to a drift in total energy. So you add an equation requiring total energy to be conserved. As good a way as any of enforcing that is to measure the discrepancy and redistribute it as an addition (or removal) of small amounts of energy uniformly. The corrections are much smaller than the local error level, but they counter the bias. Something similar is done with mass, which means the mass of the various components.
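
        The fixer described above can be sketched in a few lines. This is only an illustration under loud assumptions: the cell values, the target total, and the function name `apply_energy_fixer` are all hypothetical, the cells are taken to have equal weight, and no claim is made that any real GCM implements it this way.

```python
def apply_energy_fixer(cell_energy, expected_total):
    """Spread the global energy discrepancy uniformly over all cells.

    cell_energy: per-cell energy values (hypothetical, equal-weight cells).
    expected_total: the total the conservation equation says we should have.
    """
    n = len(cell_energy)
    # Discrepancy between what conservation requires and what the step produced.
    discrepancy = expected_total - sum(cell_energy)
    # The same small correction is added to (or removed from) every cell.
    correction = discrepancy / n
    return [e + correction for e in cell_energy]


# Hypothetical numbers: four cells that have drifted 0.1 units above the target.
cells = [99.7, 100.4, 100.1, 99.9]
expected_total = 400.0
fixed = apply_energy_fixer(cells, expected_total)
```

        In this toy case each cell is adjusted by -0.025, which is small compared with the cell-to-cell differences, while the total is restored to the target.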

        • Nick (Feb 9 10:42am): thanks, without calling upon your name I was hoping you might contribute. The extra detail you give here is useful: the discrepancy in energy is redistributed over all the places/variables which relate to energy. But how does a GCM decide what the correct total energy flux should be? It can’t be a fixed amount, slowly increasing each year with added CO2, because then feedback effects from, say, Arctic ice melting would be overruled and become ineffective.

          Any clarification on that would be welcome.

          Rich.

          • Rich
            “But how does a GCM decide what the correct total energy flux should be?”
            It doesn’t. That is a different issue. The fixer just ensures that the total energy in the system is conserved. That means that if there has been an outflux, the amount remaining is what it should be (ie after deducting what was emitted). As to what that outflux should be, it is determined by the radiative transfer equations.

          • Nick (Feb 9 1:51pm): OK, can we take an example for the flux? Suppose the radiative transfer equations, presumably including albedo and LCF (Longwave Cloud Forcing), say that there is a net influx of 3.142 W/m^2, which may have arisen mostly out of LCF variation, and should correspond to temporary global warming. You are saying, I think, that if the GCM instead adds up to 2.782 W/m^2 then the difference of 0.360 W/m^2 gets ploughed back in to make the RTEs correct. Now, what time interval are we talking about? And more importantly, how will that net 3.142 W/m^2 affect the state of the world on the next time step?

            I am assuming that +/-4 W/m^2 cannot randomly add to the overall energy budget, since if it did the GCMs would indeed wander to +/-15K by the end of this century. Does it just add a small amount of energy to the land and oceans?

            Rich.

          • Rich
            “Now, what time interval are we talking about? And more importantly, how will that net 3.142 W/m^2 affect the state of the world on the next time step?”
            The time interval is probably every timestep, which would be 10-30 minutes. The role of LCF or whatever is not special; it just does a general accounting check for global energy. I would expect the discrepancy is much less than you describe.

            In terms of global effect, well, we are trying to solve conserved equations, so the effect should be to get it right. The better view is to ask what would be the effect of not correcting. In CFD, energy either runs down and everything goes quiet, or the opposite. Actually in my experience it is more often mass conservation that fails, or that is noticed first.

            A worry might be that the added energy is not put in the right place. But the discrepancy correction is a slow process, and the general processes that move energy around (mix) are fast – think weather. So it really doesn’t matter.

          • Nick Feb 10 12:43pm:

            I should have realized that GCM time steps, or “ticks” as in computer parlance, would be relatively small. Suppose for arithmetic convenience we take there to be 40000 ticks in a year, so a tick is 13.14 minutes. Then extrapolating Pat Frank’s +/-4 W/m^2 per year, taken as gospel truth for the moment, down in scale to 1 tick, gives us +/-4/sqrt(40000) = +/-0.02 W/m^2 per tick.

            So, could LCF in a GCM actually randomly walk at +/-0.02 W/m^2 at each tick? If so, after 81 years it would reach +/-36 W/m^2 which I think we’ll agree is a large range.
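
            The arithmetic in this comment can be checked with a short script. It assumes, as the comment does, that independent per-tick errors add in quadrature (square-root-of-time scaling); the constant names are my own.

```python
import math

TICKS_PER_YEAR = 40_000      # chosen in the comment for arithmetic convenience
ANNUAL_UNCERTAINTY = 4.0     # +/- W/m^2 per year, taken from Pat Frank's paper

# Length of one tick in minutes (365-day year).
minutes_per_tick = 365 * 24 * 60 / TICKS_PER_YEAR

# Scale the annual figure down to one tick: divide by sqrt(number of ticks).
per_tick = ANNUAL_UNCERTAINTY / math.sqrt(TICKS_PER_YEAR)

# Scale back up over 81 years of ticks: multiply by sqrt(total ticks).
after_81_years = per_tick * math.sqrt(81 * TICKS_PER_YEAR)

print(f"{minutes_per_tick:.2f} min/tick, +/-{per_tick:.2f} W/m^2 per tick, "
      f"+/-{after_81_years:.0f} W/m^2 after 81 years")
```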

            Or, is there something constraining how far LCF can go? Lindzen would say “yes, the iris hypothesis” and Eschenbach would say “yes, tropical clouds”. (Or are those short-wave cloud effects rather than long-wave?)

            As for whether the +/-4 W/m^2 is gospel truth I’ll study David Dibbell’s link for further information on that.

            Rich.

        • Stokes
          I am reminded of the TV show with Neil deGrasse Tyson where he was trying to illustrate the difference between climate and weather by walking a dog on a long leash on a beach. Actually, he had it backwards because where the dog (weather) could go was controlled by Tyson (climate) and the length of the leash. If the dog had been free to run where it wanted, and Tyson had to chase after the dog, it would have been a better analogy.

          However, adjusting the GCMs to conserve energy is a bit like trying to keep the dog from breaking the leash and making Tyson chase after it.

      • Dr. Booth, I checked just now and the link works. But just in case there is some other reason it is not working for you, here is a link to a pdf of the article.
        https://www.dropbox.com/s/iat07c0369paba1/Held_et_al-2019-Journal_of_Advances_in_Modeling_Earth_Systems.pdf?dl=0

        Best to you.
        DD
        P.S. – It seems intuitive to me to grasp Pat Frank’s analysis and conclusions about uncertainty and reliability, even as model outputs are stable. In this new-and-improved climate model, from figure 15, the RMSE of the annual mean of the outgoing longwave TOA is about 6 W/m^2, over a hundred times what would be required for the model to “see” the reported or projected annual increase in anthropogenic forcing.

  21. Let me offer this comment on error and the central limit theory. Taking samples of a population and using the central limit theory to determine a more and more accurate value for the mean does *NOT*, and let me emphasize the *NOT*, in any way affect the variance and standard deviation of the overall population. The variance and standard deviation of the population will remain the same no matter how many times you sample the population and use the central limit theory to calculate a more accurate mean.

    If that overall population is made up of data points that have their own, individual variance and standard deviation then those variances add directly to determine the overall variance. No amount of calculating a more accurate mean using the central limit theory will change that simple fact.

    If you have a linear relationship of y = x1 + x2 + x3, and each of these contributing factors are independent random variables with individual standard deviations and variances then the variance of y is the sum of the variances of x1, x2, and x3. There is no dividing by the population size or multiplying by the population size or anything like that. The variances simply add. Then you take the square root of the sum of the variances to determine the standard deviation of y.

    The ten ruler examples of Dr. Booth are not combining things with their own variance and standard deviation. Each ruler is a specific length whether you know exactly what it is or not. You can then use standard statistical methods to determine the variance and standard deviation for that population of rulers. But this hypothetical really doesn’t have much to do with adding variances of independent random variables.

    If the standard deviation of each individual random variable is a measure of uncertainty, error, or a combination of both, then when combined those standard deviations will become a Root-Sum-Square of the associated variances.

    No amount of sampling or central limit theory applications can change the variance or standard deviation of each individual random variable or the standard deviation of the combined population. The sampling and central limit theory can only give you a more accurate calculation for the mean but that mean will still be subject to the same uncertainty or error calculated by the sum of the variances of each member of the population.

    This is what Pat Frank tried to show with his calculations and about which so many people seem to keep getting confused. You can get your mean calculation as accurate as you want but it won’t change the uncertainty or error associated with the combined populations. It doesn’t matter if you have multiple inputs of independent random variables with individual standard deviations or if you have an iterative process where each step provides an individual output with a standard deviation that becomes the input to the next iteration. The combination of these will still see a Root-Sum-Square increase in the overall standard deviation.

    Now, let me comment on the usefulness of calculating an arbitrarily precise mean. In the real world this is a waste of time. In the real world the mean simply can’t have any more significant digits than the inputs used to calculate the mean. In the real world there is no reason to have any more significant digits in the calculated mean than the significant digits used to calculate the mean. Take, for example, a carpenter picking 8’x2″x4″ boards from a pile of 1000 to use in building a stud wall for an inside wall of a house. There is simply no use in calculating a mean out to 10 digits when he can, at best, measure only to the nearest tenth of an inch. It’s a useless exercise. In fact, the mean is a useless number to him other than in finding the pile with 8’x2″x4″ boards. He will have to sort through the pile until he finds enough members that will fit his stud wall. If he doesn’t pay attention and gets some that are too short and tries to use them then he will wind up with wavy drywall in the ceiling and perhaps even cracked drywall somewhere down the timeline. If he gets some too long then he will waste wood in cutting it to the proper length.

    If you think about it enough, this applies to the global average annual temperature record as well. You can calculate that mean to any accuracy you want and it won’t matter if you can’t measure any individual temperature average to that precision. And the uncertainty in each individual average used in the global average carries over into the global average by the rule of root-sum-square.

    No amount of statistical finagling will change the fact that models like the GCMs have uncertainty and that the total uncertainty is the root-sum-square of all the factors making up the GCM output.

  22. Tim, I mostly agree with what you say. But one has to be very careful about how the measurements M_i are combined. In my 10 1-ft rulers example (based on an old one of yours) then it is the sum of the M_i which matters, not the mean. In your building of a stud wall, I take it that the boards are being erected parallel to each other. Again, the mean of the board lengths is irrelevant. But approximate sorting of the boards by length would at least allow the carpenter to choose a sample with somewhat similar lengths, which might be practically more important than the actual length.

    But taking a global average temperature does use a mean, M* = sum_1^n M_i/n. And now the uncertainty (that is, standard deviation of the error) of M* is the quotient by sqrt(n) of the uncertainty of each M_i, so if n is large more significant digits can certainly be quoted for it than for the individual measurements.

    Rich.

    • “But taking a global average temperature does use a mean, M* = sum_1^n M_i/n. And now the uncertainty (that is, standard deviation of the error) of M* is the quotient by sqrt(n) of the uncertainty of each M_i, so if n is large more significant digits can certainly be quoted for it than for the individual measurements”

      The mean is made up of individual members whose measurements can only have so many significant digits because of the resolution of the measuring device. Trying to calculate a mean with more significant digits than the individual members of the population is a waste of time. You’ll never be able to confirm if any member of the population is actually of the length you calculate for the mean.

      This is why claiming that year X is 0.01deg or 0.001deg hotter than year Y is unjustified when your temperature data is only good out to the tenth of a degree (and that is a *stretch*). You simply cannot realistically gain significant digits in the mean. That’s a pipe dream of mathematicians and computer programmers. You simply cannot average 10.1deg and 10.4deg and say that the mean is 10.25 deg. You have absolutely no way to know what the mean is past the tenth of a degree. That 10.25deg has to be rounded using whatever rules you want to follow. It will either be 10.2deg or 10.3deg. You cannot artificially gain precision by using an average, not in the real world.

      “And now the uncertainty (that is, standard deviation of the error) of M* is the quotient by sqrt(n) of the uncertainty of each M_i”

      What is your “n” variable? The population size? If so then this is wrong. The variances, i.e. the uncertainty, of each M_i simply add. There is no sqrt(n) quotient involved. The standard deviation becomes the sqrt(u_1 + u_2 + u_3 …. + u_n)

      I think you are stuck in the rut of trying to calculate the variance of a population which is the [Sum(x_i – x_avg)^2]/n. This is true when you assume the value of each member of the population is perfectly accurate and you want to know the variance of the population. This is how you use the central limit theory to calculate the mean of a population more and more accurately, except what you are doing is actually finding the standard deviation around the mean. The more samples you take the smaller that standard deviation becomes. But that doesn’t decrease the standard deviation of the population itself, it only tells you how accurate your calculation of the mean is.

      But if you are combining independent random variables, i.e. each with their own standard deviation and variance, then you simply add the variances of each independent random variable to get the variance of the combination. And the individual temperatures you are combining to get the annual average global temperature are actually individual random variables each with their own standard deviation and variance. The standard deviation and variance of those random variables are what makes up the uncertainty associated with each. You may choose a value for each individual random variable to use in calculating a mean of the combination but that in no way lessens the uncertainty you wind up with.

      Take a pile of 1000 8’x2″x4″ boards. Assume you have a 100% accurate measurement device. Give it to a worker and have him measure each and every board. You can then take those measurements and calculate a mean. You can then use that mean and those measurements to determine the variance and standard deviation associated with that pile. Now you get another pile of 1000 8’x2″x4″ boards from a different supplier. You go through the process of determining the variance and standard deviation of the new pile.

      Now have your forklift operator combine the two piles. How do you calculate the variance and standard deviation for the combined pile? You simply add the variances of the two individual piles. The square root of that gives your standard deviation. var(y) = var(pile1) + var(pile2).

      It’s been 50 years since I had probability and statistics while getting my engineering degree but I’m pretty sure combining the variances of independent random variables hasn’t changed since then.

  23. Tim Feb 10 4:45pm:

    First, I’ll repeat my comment from Feb 10 1:54pm:

    “The JCGM does not say anything about measurements having to be from a normal distribution in order for their mean and standard deviation to be useful. Sections 4.2.2 and 4.2.3 define s^2(q_k) and s^2(q*) as the variances of the observations and of the mean (I am using q* as more convenient here than q with a bar on top). If there are n observations then s^2(q*) = s^2(q_k)/n, and the (standard) uncertainty of the mean is defined to be u(q*) = s(q*), which does decrease (on average) as n grows, in a sqrt(1/n) fashion.”

    Next I’ll make more specific points in response.

    1. The above means that the Central Limit Theorem, which covers convergence of sums of i.i.d. variables to normality, is not the correct thing to quote regarding the standard deviations of sums and means.

    2. I agree with your statement “The more samples you take the smaller that standard deviation becomes. But that doesn’t decrease the standard deviation of the population itself, it only tells you how accurate your calculation of the mean is. ”

    3. “Standard uncertainty” is equivalent to standard deviation. In any specific case the question is what is the variable of interest. If it is a sample value, then the uncertainty is s(q_k) which does not decrease with the number n of samples taken. If it is the mean value, then the uncertainty is s(q*) which does decrease, like 1/sqrt(n), with the number of samples.

    4. In the case of global mean temperatures, they are not i.i.d, because q_k depends on the location and time of the measurement. Nevertheless, it is a not unreasonable assumption that each q_k is some value m_k plus an error term e_k where all the e_k’s ARE i.i.d. In this case the uncertainties rest in the identical distributions of the e_k’s, and s(q*) = s(e*) so the uncertainty of q* again diminishes with sample size.

    5. So yes, if you “are combining independent random variables, i.e. each with their own standard deviation and variance, then you simply add the variances of each independent random variable to get the variance of the combination” then the assertion is true provided that the combination is summation. But if it is averaging, then by dividing that sum by n the variance is then divided by n^2, and that is what gives the reduction in the uncertainty of the mean.
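
    The distinction in point 5 between summing and averaging can be illustrated with a small simulation. It is a sketch only: it assumes i.i.d. zero-mean Gaussian errors (as in point 4), and the values n = 25 and sigma = 0.5 are hypothetical choices of mine, not figures from the thread.

```python
import random
import statistics


def simulate(n, sigma, trials, seed=0):
    """Return (std dev of the sum, std dev of the mean) of n i.i.d. errors."""
    rng = random.Random(seed)
    sums, means = [], []
    for _ in range(trials):
        errors = [rng.gauss(0.0, sigma) for _ in range(n)]
        total = sum(errors)
        sums.append(total)       # combination by summation
        means.append(total / n)  # combination by averaging
    return statistics.pstdev(sums), statistics.pstdev(means)


# Theory under these assumptions: sd of sum = 0.5 * sqrt(25) = 2.5,
# sd of mean = 0.5 / sqrt(25) = 0.1.
sd_sum, sd_mean = simulate(n=25, sigma=0.5, trials=4000)
print(f"sd of sum  ~ {sd_sum:.3f}")
print(f"sd of mean ~ {sd_mean:.3f}")
```

    The spread of each individual error stays at sigma throughout; only the spread of the derived quantity (sum or mean) changes with n.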

    Rich.

    • Rich,

      You seem to recognize what the issue is in your No. 2 statement but then go on to ignore it. A more accurate calculation of the mean does *not* reduce the variance and standard deviation of the population. And it is the variance and standard deviation of the population that determines the uncertainty. It truly is that simple.

      As I tried to point out with the two piles of 2″x4″‘s, when you combine them the overall variance is the sum of their variances. No amount of calculating the mean of the combination more accurately will change that combined variance. And it is that combined variance that determines the uncertainty associated with that more accurate mean.

      It simply doesn’t matter how accurately you calculate the mean of a population. That doesn’t make it the “true value” in any way, shape, or form thus it doesn’t decrease the population uncertainty surrounding that mean as defined by the standard deviation of the population.

      Averaging the means of independent random variables does not reduce the variance and standard deviation of the combination of those independent random variables. It only determines the mean of the means. You can reduce the variance of your calculation of the mean, i.e. you can make it more accurate but you don’t reduce the variance of the population.

      Let me repeat one more time: if the standard deviation of the population is +/- u, then calculating the mean more accurately won’t change that standard deviation of the population to +/- (u/n). And it is +/- u that is the uncertainty. Standard-error-of-the-mean is not the same thing as the uncertainty of the population.

      If you are measuring one thing with one device then you can tabulate the data points and use the central limit theorem and calculate a more accurate measurement of that one thing. But you are not combining independent random variables when you do this. You *are* combining independent random variables when you calculate the mean of numerous temperature means which each have their own standard deviation. It’s why uncertainty propagates as the root-sum-square – because that uncertainty tells you that you have a random variable, not an individual measurement that represents a true value.

      • Again I am in agreement with much of what you say, in particular your “Let me repeat one more time: if the standard deviation of the population is +/- u, then calculating the mean more accurately won’t change that standard deviation of the population to +/- (u/n). And it is +/- u that is the uncertainty. Standard-error-of-the-mean is not the same thing as the uncertainty of the population.”

        But you have ignored my statement 5: it does matter how you “combine” the data points (and please, don’t quote the CLT again as it’s not relevant). Suppose I have uncorrelated variables M_1,…,M_n each with uncertainty +/-u_i (not yet assuming those are all equal). What is the uncertainty of f(M_1,…,M_n)? It is, by Equation (10) in the JCGM 5.1.2,

        sqrt( sum_{i=1}^n (df/dM_i)^2 u_i^2 )

        Now let f(M_1,…,M_n) = (M_1+…+M_n)/n, so df/dM_i = 1/n and you have uncertainty of the mean is

        sqrt( sum u_i^2/n^2 ) = u_1/sqrt(n) if all the u_i’s are equal.
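
        As a numerical check of the formula just quoted, here is a minimal sketch for uncorrelated inputs. The function name and the values n = 100, u_i = 0.5 are illustrative assumptions; the sensitivities are the partial derivatives df/dM_i.

```python
import math


def combined_uncertainty(sensitivities, uncertainties):
    """sqrt( sum_i (df/dM_i)^2 * u_i^2 ) for uncorrelated inputs."""
    return math.sqrt(sum((c * u) ** 2 for c, u in zip(sensitivities, uncertainties)))


n = 100
u = [0.5] * n  # hypothetical equal uncertainties

# f = M_1 + ... + M_n: every df/dM_i = 1, giving u_1 * sqrt(n).
u_sum = combined_uncertainty([1.0] * n, u)

# f = (M_1 + ... + M_n) / n: every df/dM_i = 1/n, giving u_1 / sqrt(n).
u_mean = combined_uncertainty([1.0 / n] * n, u)

print(u_sum, u_mean)
```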

        If the mean of a set of measurements is the value of interest, as with the Global Average Surface Temperature, then its uncertainty does decrease as n increases.

        Yes, we agree that the population uncertainty does not decrease, but often (not so much in the case of planks) we are actually interested in the population mean.

        Rich.

        • Rich, “If the mean of a set of measurements is the value of interest, as with the Global Average Surface Temperature, then its uncertainty does decrease as n increases.”

          Except that field calibration experiments show that the error distribution around each temperature measurement is not normal. iid statistics do not apply.

          Let’s also note that the uncertainty in a measurement mean cannot be less than the resolution limit of the instruments, no matter how many individual measurements go into the mean. Likewise regarding the resolution limit of a physical model.

          I have published (869.8 KB pdf) on the problem of temperature measurement error, and will have much more to say about it in the future.

          Also here.

          The people who compile the GASAT record are as negligent about accounting for error as are climate modelers.

          • “the uncertainty in a measurement mean cannot be less than the resolution limit of the instruments”

            The mathematics says it can, and will be, less than that, if the instrumental errors are not correlated and there are sufficiently many measurements. u_1/sqrt(n), derived from the JCGM, does decrease with n.

            Rich.

          • Rich,

            “The mathematics says it can, and will be, less than that, if the instrumental errors are not correlated and there are sufficiently many measurements. u_1/sqrt(n), derived from the JCGM, does decrease with n.”

            Once again you are trying to justify using the central limit theorem to say that you can increase the resolution of an instrument. You can’t!

            I tried to explain that to you with one of my examples with 1000 8’x2″x4″ boards. If I can only measure the boards to a resolution of 1/8″ then no matter how many measurements I take and average together, they cannot, in the real world at least, give me a resolution finer than 1/8″. I don’t care how many digits you calculate the mean out to, I will never be able to find a board of that resolution because I simply can’t measure with that resolution! If I tell you a board is 7′ 11 7/8″ long how do you know if it is 7′ 11 13/16″ long or 7′ 11 15/16″ long? You can use it to calculate the mean down to the 1/16″, the 1/32″, or the 1/64″ but you don’t decrease the uncertainty in any way because the measurements I gave you don’t support anything past 1/8″. Anything past that will remain forever uncertain.

            It violates both the rules of significant digits as well as the rules for uncertainty.

          • Tim Feb 11 2:24pm

            Tim, I apologize. I had failed to distinguish between two cases.

            The first case is repeated measurements of the same variable (measurand) under apparently identical conditions. Then, as you say, it is true that the mean squared error cannot be driven down below a certain limit related to the resolution, no matter how many samples are taken.

            The second case is measurements of different variables, one each, for example temperatures at different places and times. In this case I believe that the rounding errors can partially cancel out, so while the error of the sum of them increases, the error of the mean of them decreases.
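
            The two cases can be contrasted with a toy simulation. It is a sketch under strong assumptions that are mine, not results from Section F: the instrument is noise-free and simply rounds to a resolution of 0.1, the repeated measurand is 10.137, and the varied measurands are drawn uniformly between 5 and 25.

```python
import random

RESOLUTION = 0.1  # hypothetical instrument resolution


def read_instrument(true_value):
    """Noise-free reading: the true value rounded to the instrument resolution."""
    return round(true_value / RESOLUTION) * RESOLUTION


def mean(values):
    return sum(values) / len(values)


n = 10_000

# Case 1: the same measurand read n times. Every reading is identical,
# so the error of the mean stays at the single-reading rounding error.
same_true = [10.137] * n
same_error = mean([read_instrument(t) for t in same_true]) - mean(same_true)

# Case 2: n different measurands, one reading each. The rounding errors
# have different signs and sizes, so they partially cancel in the mean.
rng = random.Random(1)
varied_true = [rng.uniform(5.0, 25.0) for _ in range(n)]
varied_error = mean([read_instrument(t) for t in varied_true]) - mean(varied_true)

print(f"same measurand:    error of mean = {same_error:+.4f}")
print(f"varied measurands: error of mean = {varied_error:+.4f}")
```

            Whether real temperature errors behave like these idealized rounding errors is exactly the point in dispute; the sketch says nothing about systematic error.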

            I have been working on this with the formulation in my Section F, and I intend to share some results when I have more time.

            Rich.

        • Rich,

          “Now let f(M_1,…,M_n) = (M_1+…+M_n)/n, so df/dM_i = 1/n and you have uncertainty of the mean is”

          Once again you are trying to equate the uncertainty of the mean with the standard deviation of the population.

          M_1 … M_n are not random variables with a probability distribution function. They are uncertainty intervals. You keep trying to formulate them as random variables where you can calculate the most probable outcome by finding the mean with less and less standard error of the mean.

          “If the mean of a set of measurements is the value of interest, as with the Global Average Surface Temperature, then its uncertainty does decrease as n increases.”

          The mean of the set of measurements is *NOT* the only value of interest since it is *not* the most likely value in a probability distribution function. The other value of interest is the uncertainty of that set of measurements. And that uncertainty interval does not go down by 1/sqrt(n).

  24. Re Nick Stokes recent and David Dibbell Feb 8 5:27pm:

    OK, I’ve looked at David’s reference, and here’s a thing. David says “See figure 15 in particular – “Root mean square errors (RMSE) in net, net shortwave, and outgoing longwave radiation (in W/m2) at top‐of‐atmosphere (TOA) for the annual mean and the individual seasons…”. But in the text above that figure it says “Figure 15 shows the RMS biases in the net TOA fluxes…”. Now “biases” is an interesting operative word. Recall that in my Section E I wrote “The real error statistic of interest is E[(M-X)^2] = … Var[M] + b^2”, and b is the bias E[M]-X. So b may be a significant player in the data portrayed by Figure 15.

    If so, suppose an annual RMSE of 4 W/m^2, i.e. MSE of 16 W^2/m^4, consists of 15 for the squared bias and 1 for the variance Var[M]? Then since bias is not a random component leading to a random walk through time, but uncertainty = sqrt(Var[M]) is, the uncertainty which gets propagated is now +/-1 W/m^2 not 4, much lower than in Pat Frank’s paper. We cannot tell from Figure 15 what proportion of the RMSE is funded by the bias, but perhaps some GCM aficionado could root out the underlying data to find out. This seems important to me.
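
    The split of mean squared error into variance plus squared bias can be sketched numerically. The 15/1 split is the hypothetical one from the paragraph above; the four sample values and the "true" value are made up purely to show the identity E[(M-X)^2] = Var[M] + b^2 holding on a small data set.

```python
import math
import statistics

# Hypothetical split from the comment: RMSE 4 W/m^2, of which 15 is squared bias.
rmse = 4.0
mse = rmse ** 2
bias_squared = 15.0
variance = mse - bias_squared
propagated = math.sqrt(variance)  # the random part, +/- 1 W/m^2 in this case

# Check of the identity on made-up data: M values and a true value X.
samples = [3.0, 5.0, 4.0, 6.0]
truth = 1.0
mse_direct = sum((m - truth) ** 2 for m in samples) / len(samples)
bias = statistics.mean(samples) - truth
mse_split = statistics.pvariance(samples) + bias ** 2

print(propagated, mse_direct, mse_split)
```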

    Rich.

    • How do you know what the TOA biases are in the simulation of a future climate, Rich?

      If you don’t know the discrete TOA biases in every step of the simulated future climate — and you don’t — then how do you estimate the reliability of the predicted climate?

      • Pat, to be honest I’m just worrying about the biases and uncertainties in the past (i.e. calibration runs) for now. When I understand that, I’ll address the future. But one possibility would be that the biases only change slowly. Who knows, without the data?

        Rich.

        • The air temperature measurement bias errors change pretty across every day, Rich. Because wind speed and irradiance vary within every day, and between days.

          Take a look at K. G. Hubbard and X. Lin, Realtime data filtering models for air temperature measurements. Geophys. Res. Lett., 2002. 29(10): p. 1425 1-4;
          doi: 10.1029/2001GL013191.

          Monthly means are badly ridden with systematic error, and no one knows the magnitude of the biases for any of the individual temperature measurements that go into a mean.

          H&L 2002 above show the MMTS sensor measurements average about ±0.34 C of non-normally distributed error — and that’s for a well-maintained and calibrated sensor operating under ideal field conditions.

          Typically, the measurement errors are much larger than any random jitter (which typically arises in the electronics and wiring of a modern sensor), the magnitude of which (typically ±0.1-0.2 C) can be determined in the lab.

          With respect to GCMs, we know the uncertainties from the past because calibration runs are available. Those uncertainties that arise from within the models signify errors that are injected into simulations of the future climate. In every single step of a simulation.

          There’s no valid ignoring of them, or assuming them away, or wishing them away.

    • Rich,

      “Then since bias is not a random component leading to a random walk through time, but uncertainty = sqrt(Var[M]) is”

      What makes you think uncertainty creates a random walk? I still don’t think we have a common understanding of what uncertainty is. That is probably my fault. While I often use the terminology of a random variable to demonstrate how to handle combinations of values with an uncertainty interval, that doesn’t mean uncertainty *is* a random variable.

      Random variables are typically defined as having a population whose members can take on different values. A frequency plot of how often those different values occur creates your probability distribution function, i.e. it’s what defines a normal distribution, a poisson distribution, etc. That probability distribution function will have a standard deviation and variance associated with it. A random variable can create a random walk, simply by definition. The random variable will create values around the mean.

      An uncertainty interval is not a probability distribution function. In no way does the uncertainty interval try to define the probability of any specific value occurring in the population. It merely says the true value will probably be somewhere in the interval. You may have a nominal value associated with what you are discussing but that is not a mean which is defined as being the most likely value to be found in a probability distribution function.

      An uncertainty interval around a nominal value cannot create a random walk since it doesn’t define any specific values or the probability of those specific values happening. A narrow uncertainty interval tells you that the nominal value is close to the true value. A wide uncertainty interval tells you that the nominal value is questionable. But neither tells you what the true value is (unless the uncertainty interval is zero).

      A thermometer with an uncertainty of X +/- u where u is small, e.g. 70deg +/- 0.001deg, is pretty accurate and probably gives a good representation of the true value. A thermometer with an uncertainty of X +/- v where v is large, e.g. 72deg +/- 0.5deg, gives a temperature far more questionable. But neither uncertainty interval tells you anything about what the probability of any specific value might be. Therefore neither can generate a random walk. The only way the nominal value can be the true value is if u or v is equal to zero.

      Since an +/- uncertainty interval about a nominal value looks exactly like the +/- standard deviation about a mean, the general rule is to treat them the same. When combining values with independent uncertainties you add them (root-sum-square) just like you add variances of random variables.

      Using the example above of two different thermometers you can certainly average the two nominal values and get 71deg. But the uncertainty becomes sqrt( u^2 + v^2) or +/- 0.500001. The uncertainty will never go down, it will only go up. If you have ten thermometers with an uncertainty of +/- 0.5deg then when combined the uncertainty (root-sum-square) becomes sqrt(10 * .25) = +/- 1.58deg.
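      As a quick check of the arithmetic in this comment, here is a minimal Python sketch of the root-sum-square rule as the commenter applies it. The function name `root_sum_square` is made up for illustration. The sketch only reproduces the quoted figures; whether root-sum-square is the right quantity to attach to an average is the point under dispute in this thread.

```python
import math

def root_sum_square(uncertainties):
    """Combine independent uncertainties in quadrature (root-sum-square)."""
    return math.sqrt(sum(u * u for u in uncertainties))

# Two thermometers with uncertainties +/- 0.001 and +/- 0.5 (the example above).
two = root_sum_square([0.001, 0.5])

# Ten thermometers, each with uncertainty +/- 0.5.
ten = root_sum_square([0.5] * 10)

print(f"two thermometers: +/- {two:.6f}")
print(f"ten thermometers: +/- {ten:.2f}")
```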

      This is why trying to calculate the mean out to an arbitrary number of digits is useless. The uncertainty will overwhelm whatever difference you think you are calculating. Suppose you have 10 thermometers with an uncertainty of 0.001. The uncertainty of the combination becomes the sqrt(10 * 1e-06) = +/- 0.003deg. If your difference from one year to another is less than 0.003deg then you simply don’t know if that difference is real or not. If you are combining 1000 thermometers with an uncertainty of +/- 0.01 then the combined uncertainties become +/- 0.03deg. Any difference from one period to another that is less than 0.03deg is questionable.

      How many thermometers around the world have an uncertainty of +/- 0.001deg? +/- 0.01deg?

      It’s why I say the global annual average temperature is pretty much a joke. You never get an uncertainty interval given for that global annual average temperature but it’s going to be huge. It gets even more humorous when you are trying to compare to a base consisting of records from the late 19th century and early 20th century. It doesn’t matter how accurately you calculate the mean of those nominal values having an uncertainty, you can’t decrease the uncertainty.

      • Tim Feb 11 8:29am: I did mention random walks in my Section H, but should probably have included it in Section B too. From that section:

        (1) W(t) = (1-a)W(t-1) + R1(t) + R2(t) + R3(t) where 0 ≤ a ≤ 1

        I am happy to take a=0 for now. This white box model of how a black box GCM works is iterative, with random errors around the means of R_i(t). The standard deviations of those errors are the (standard) uncertainties. The model can be run with Monte Carlo values for those errors, and the evolution of W(t) has an uncertainty, i.e. standard deviation of its output over many runs, which grows proportionally to sqrt(t), as Pat has been maintaining (though he has just referred to uncertainty rather than to Monte Carlo outputs).

        HTH, Rich.
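        The Monte Carlo experiment described in this comment can be sketched as follows. This is an illustration under stated assumptions, not the author's own code: the errors of R1, R2 and R3 are taken to be independent zero-mean Gaussians with standard deviation 1, W(0) is taken as 0, and the function names, run counts and seed are invented for the example. With a = 0 the spread over runs should track sqrt(3t).

```python
import math
import random
import statistics

def simulate_w(steps, a=0.0, sigmas=(1.0, 1.0, 1.0), rng=random):
    """One run of W(t) = (1-a)W(t-1) + R1(t) + R2(t) + R3(t), errors only.

    Assumption: each R_i(t) contributes a zero-mean Gaussian error with
    standard deviation sigmas[i]; W(0) = 0.
    """
    w = 0.0
    path = []
    for _ in range(steps):
        shocks = sum(rng.gauss(0.0, s) for s in sigmas)
        w = (1.0 - a) * w + shocks
        path.append(w)
    return path

def spread_over_runs(runs, steps, a=0.0, seed=12345):
    """Standard deviation of W(t) across many runs, for t = 1..steps."""
    rng = random.Random(seed)
    paths = [simulate_w(steps, a=a, rng=rng) for _ in range(runs)]
    return [statistics.pstdev([p[t] for p in paths]) for t in range(steps)]

sd = spread_over_runs(runs=2000, steps=100)
for t in (1, 4, 25, 100):
    # Compare the simulated spread with sqrt(3t), the value expected for a = 0.
    print(t, round(sd[t - 1], 2), round(math.sqrt(3 * t), 2))
```

        Passing a value of `a` greater than 0 to `spread_over_runs` shows how the spread behaves when the recursive term damps earlier errors.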

        • Rich,

          You are still conflating random errors with uncertainty. Random errors imply a probability distribution function, uncertainty does not.

          If I tell you that a temperature measurement is 72deg with an uncertainty of +/- 0.5deg, exactly what makes you think there is a probability distribution function associated with the uncertainty? All I am telling you is that I am uncertain what the true value is. That implies no gaussian, poisson, or any other kind of probability distribution function. I don’t know if the nominal value of 72 is the most likely value of a distribution nor do I know if it is the true value. When I combine that measurement with another measurement from a completely independent thermometer that has its own uncertainty range then calculating the mean of the two, to any number of digits you like, *still* doesn’t tell me that the mean that is calculated is the true value either. All you can say is that the true value lies somewhere in the combination of the two uncertainty intervals, i.e. sqrt(u_1 + u2). That uncertainty interval certainly can’t be decreased merely by dividing by the sqrt(2).

          Again, as you keep agreeing, the standard error of the mean is meaningless when applied to the entire population.

          I’m not sure why you think Monte Carlo runs will help anything. If the GCM output is determinative and linear, which they apparently are, then varying the inputs will only give you a sensitivity measurement for the model, it won’t help define an uncertainty interval.

          If the model is *not* determinative and linear, and the output can vary over several runs for the same inputs, then the model has a basic uncertainty built into it that has to be added into any uncertainty calculations.

          • Tim Feb11 2:13pm

            I’ll address your points (prepended by T:) individually.

            T: You are still conflating random errors with uncertainty. Random errors imply a probability distribution function, uncertainty does not.

            Please read my Section E again for my position on uncertainty, derived from the JCGM. It is a measure of dispersion of the measurement M, which is the same as the dispersion of the error M-X (where X is the true value). “Standard” uncertainty takes the standard deviation of M, or M-X, as the value for the uncertainty. This implies that there is indeed a probability distribution function underlying, because otherwise the s.d. can’t be calculated. Moreover, without s.d.’s, there is no mathematics to justify the addition rule of independent uncertainties. We don’t necessarily know the probability distribution, but we can try different examples and see what the implications are. That is what I do in Section F.

            T: If I tell you that a temperature measurement is 72deg with an uncertainty of +/- 0.5deg, exactly what makes you think there is a probability distribution function associated with the uncertainty? All I am telling you is that I am uncertain what the true value is. That implies no gaussian, poisson, or any other kind of probability distribution function. I don’t know if the nominal value of 72 is the most likely value of a distribution nor do I know if it is the true value. When I combine that measurement with another measurement from a completely independent thermometer that has its own uncertainty range then calculating the mean of the two, to any number of digits you like, *still* doesn’t tell me that the mean that is calculated is the true value either. All you can say is that the true value lies somewhere in the combination of the two uncertainty intervals, i.e. sqrt(u_1 + u2). That uncertainty interval certainly can’t be decreased merely by dividing by the sqrt(2).

            See my previous answer. And if you have written +/-0.5 without further clarification, other users of the JCGM will take that to be standard uncertainty, i.e. an s.d. of 0.5. Saying +/-0.5 doesn’t say you are uncertain, but by how much. If you actually thought the error was equally likely to be anywhere in (-0.5,+0.5) then the s.d. is 1/sqrt(12) = 0.289 and you will have been misleading people who thought you meant standard uncertainty.

            T: Again, as you keep agreeing, the standard error of the mean is meaningless when applied to the entire population.

            It’s good we agree on something!

            T: I’m not sure why you think Monte Carlo runs will help anything. If the GCM output is determinative and linear, which they apparently are, then varying the inputs will only give you a sensitivity measurement for the model, it won’t help define an uncertainty interval.
            If the model is *not* determinative and linear, and the output can vary over several runs for the same inputs, then the model has a basic uncertainty built into it that has to be added into any uncertainty calculations.

            I agree with the “determinative”, or “deterministic” as I would say, but can you explain in what respect the GCM output is linear, and is that important? I believe that small perturbations to the GCM initial conditions lead, via chaos theory, to very different outputs, which are best treated statistically, and that a good emulator for the GCM should come somewhere near to matching those statistics. But I’ll admit to +/-1 million neurons’ uncertainty on this issue!

          • Rich, “This implies that there is indeed a probability distribution function underlying, because otherwise the s.d. can’t be calculated. Moreover, without s.d.’s, there is no mathematics to justify the addition rule of independent uncertainties.”

            Empirical standard deviations are calculated regularly in the experimental sciences (and engineering), without too much concern about whether all the statistical ducks are in line, Rich. The reason is because empirical error SDs give a useful estimate of the reliability of a result.

            Mere calculation of an empirical SD says nothing about the error distribution. It certainly does not imply a normal distribution.

            Likewise the propagation of calibration error to yield an uncertainty envelope. The underlying statistical iid assumptions are typically not met, but the empirical approach nevertheless yields a useful estimate of reliability.

            I’ve finally had time for a more detailed look at your analysis, Rich. So far, it lacks coherence. Your emulator is bereft of circumstantial relevance.

          • Rich,

            “This implies that there is indeed a probability distribution function underlying, because otherwise the s.d. can’t be calculated.”

            “Moreover, without s.d.’s, there is no mathematics to justify the addition rule of independent uncertainties.”

            What happens if you consider the uncertainty interval to be a uniform continuous probability function with every point in the uncertainty interval having the same probability? In this case the mean (b-a)/2 has the same probability of being the true value as any other point in the distribution.

            Thus the variance of each becomes (b-a)^2/12. When combining the two the variance becomes 2(b-a)^2/12 or twice the variance of each. The standard deviation becomes the sqrt(2) times the individual variances. If you combine three individual independent temperatures with the same uncertainty interval you get sqrt(3) times the individual intervals as being the combined intervals. And again, your calculated combined mean has no more chance of being the true value than any other point in the interval.

            This is no different than what Pat Frank came up with: root-sum-square of an uncertainty interval when used iteratively over and over.

            In fact, I would offer that the reality is even worse than this. If you try to average two temperatures, 60deg +/- 0.5deg and 72deg +/- 0.5deg, you come up with a far different calculation.

            S_c^2 = { (b-a)[S_1^2 + (X_1-X_c)^2] + (b-a)[S_2^2 + (X_2-X_c)^2] } / [2(b-a)]

            b-a = 1 (+/- 0.5)
            X_c = (60+72)/2 = 66
            X_1 = 60, X_2 = 72
            S_1 = S_2 = 0.5
            S_1^2 = S_2^2 = 0.25

            Factor out (b-a) and you get [ (.25+36) + (.25+36) ] / 2 = 72.5/2 = 36.25 = S_c^2
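            The calculation above is the usual formula for the variance of a pooled (mixture) population, and it can be reproduced with a short Python sketch. The function name `combined_variance` and the equal default weights are assumptions made for illustration; the equal weights correspond to the common factor (b-a) cancelling. The sketch checks the arithmetic only and takes no position on whether this is the right measure of uncertainty for an average.

```python
import math

def combined_variance(means, variances, weights=None):
    """Variance of a pooled population built from groups with given means and variances.

    Each group contributes its own variance plus the squared distance of its
    mean from the combined mean, weighted by its share of the pool.
    """
    if weights is None:
        weights = [1.0] * len(means)  # equal weights, as when (b-a) factors out
    total = sum(weights)
    combined_mean = sum(w * m for w, m in zip(weights, means)) / total
    pooled = sum(
        w * (v + (m - combined_mean) ** 2)
        for w, m, v in zip(weights, means, variances)
    ) / total
    return combined_mean, pooled

# Two readings, 60deg and 72deg, each with S = 0.5 (so S^2 = 0.25).
x_c, s_c_squared = combined_variance([60.0, 72.0], [0.25, 0.25])
print(x_c, s_c_squared, math.sqrt(s_c_squared))
```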

            Consider, I go to Africa and measure the heights of 1000 pygmies. I use a yardstick with a resolution of 1/4″. I note with each measurement whether the subject was slouching, was standing flat footed, or was standing on tip-toe. At the end I have 300 subjects that were slouching, 200 that were on tip-toe, and 500 that were flat footed. I then calculate the mean for all of the recorded heights. Just how sure can you be that the calculated mean is *really* the true mean when the actual heights of 500 of the 1000 subjects are questionable?

            Now, I do the same thing for 1000 Watusis. Then I combine the data from the two populations and calculate a new mean. Does the variance and standard deviation of the combined population increase or decrease? Does the uncertainty of the true value of the mean increase or decrease?

            Does the mean of the combined data actually tell you anything? If I order 2000 pairs of pants sized to the mean of the combined data just how many of the subjects will those pants actually fit? Now make the pygmies your minimum temperatures and the Watusis your maximum temperatures. Does the mean of those actually tell you anything? Or is it about as useless as the pants ordered above?

            “See my previous answer. And if you have written +/-0.5 without further clarification, other users of the JCGM will take that to be standard uncertainty, i.e. an s.d. of 0.5. Saying +/-0.5 doesn’t say you are uncertain, but by how much. If you actually thought the error was equally likely to be anywhere in (-0.5,+0.5) then the s.d. is 1/sqrt(12) = 0.289 and you will have been misleading people who thought you meant standard uncertainty.”

            The standard deviation is not 1/sqrt(12), it is sqrt[ (b-a)^2/12]. See above. The two coincide only for the case where the interval is +/- 0.5, because then b-a = 1 and 1^2 = 1. If the interval was +/- 0.4 you would have a totally different situation. You would have a totally different situation if each measurement device has a different uncertainty interval.
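            Both figures quoted in this exchange follow from the same formula for a uniform distribution on (a, b), as a small Python sketch shows. The helper name `uniform_sd` is an assumption made for illustration.

```python
import math

def uniform_sd(a, b):
    """Standard deviation of a uniform distribution on (a, b): sqrt((b-a)^2 / 12)."""
    return (b - a) / math.sqrt(12)

half = uniform_sd(-0.5, 0.5)    # interval +/- 0.5, so b - a = 1
narrow = uniform_sd(-0.4, 0.4)  # interval +/- 0.4, so b - a = 0.8

print(f"+/- 0.5 interval: s.d. = {half:.3f}")
print(f"+/- 0.4 interval: s.d. = {narrow:.3f}")
```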

            “I agree with the “determinative”, or “deterministic” as I would say, but can you explain in what respect the GCM output is linear, and is that important?”

            The general case of combining the variances of two or more inputs is:

            (S_y)^2 = Sum [ (df/dx_i)^2 * (s_xi)^2 ]

            If f(x) = y = x_1 + x_2 then

            df/dx_1 = 1
            df/dx_2 = 1

            and the variances just add.

            If f(x) = y = (x_1)^2 + (x_2)^2

            then df/dx_1 = 2(x_1) and df/dx_2 = 2(x_2)

            and the combination becomes 4(x_1)^2 (s_x1)^2 + 4(x_2)^2 (s_x2)^2

          • Tim
            I think that you are being inconsistent and careless in your use of terms such as “uncertainty,” “accuracy,” and “significant figures.”

            One can have a thermometer (or more likely thermocouple) that can be read to 0.001 degree F. If what is being measured is the temperature of an ice-water bath and the nominal temperature is 32 degrees, then one can say that it is very precise with 5 significant figures. However, if it reads 33 degrees it is not accurate! Precision and accuracy are not independent. One cannot have high accuracy with low precision, but one can have low accuracy with high precision. To make headway, there has to be agreement on the definition of the terms used. Now, to complicate things, if one has a large number of high-precision ‘thermometers,’ with variable accuracy, I don’t think that the Law of Large numbers will compensate for the variable accuracy. Indeed, a few badly calibrated ‘thermometers’ will skew the distribution of readings and possibly turn a normal distribution into a non-normal one. But, determining that may be next to impossible because a single temperature (GMST) is not what is being measured! Instead, it is tens of thousands of different temperatures (which, incidentally, have a standard deviation of several tens of degrees on an annual basis!)

            Where the improvement in precision, with precision-limited instruments, has a long history is in surveying. There, the same instrument is used measuring the same angle over and over. Thus, the random errors in reading the scale, vernier inscribing errors, and eccentricity in the ring, cancel out by the sq rt of n principle.

            In climatology, one is NOT using the same thermometer over and over, and one is NOT measuring the same temperature. There is an old joke about the NTSC TV standard standing for “Never Twice the Same Color.” Here we have a situation where no two temperatures are ever exactly the same, and no two thermometers are exactly the same with respect to accuracy and the attainable precision. What is being measured, and what is being used to measure it, appear similar. However, one has to take into account that they are really all different. Thus, one has to determine what the uncertainty in both accuracy AND precision are, to be able to say anything intelligent about what an average of all the readings means.

          • Tim Feb12 12:09pm

            “What happens if you consider the uncertainty interval to be a uniform continuous probability function with every point in the uncertainty interval having the same probability? In this case the mean (b-a)/2 has the same probability of being the true value as any other point in the distribution.”

            Correct; and I cover this case in my Section F preceding Equations (21) and (22).

            “Thus the variance of each becomes (b-a)^2/12. When combining the two the variance becomes 2(b-a)^2/12 or twice the variance of each. The standard deviation becomes the sqrt(2) times the individual variances. If you combine three individual independent temperatures with the same uncertainty interval you get sqrt(3) times the individual intervals as being the combined intervals.”

            Correct insofar as the standard deviation is now (b-a)sqrt(3/12) = (b-a)/2. But the combined intervals now span 3(b-a), which is not sqrt(12) times the s.d., which serves to prove false your next statement:

            “And again, your calculated combined mean has no more chance of being the true value than any other point in the interval.”

            Incorrect. The convolution of 3 uniform distributions is not uniform. Here is a relevant paragraph from my Example 1, for the case of 10 uniforms with a=-0.1, b=+0.1:

            “To get the exact uncertainty distribution we would have to do what is called convolving of distributions to find the distribution of the sum_1^10 (X_i-12). It is not a uniform distribution, but looks a little like a normal distribution under the Central Limit Theorem. Its “support” is not of course infinite, but is the interval (-1”,+1”), but it does tail off smoothly at the edges. (In fact, recursion shows that the probability of it being less than (-1+x), for 0<x<0.2, is (5x)^10/10! That ! is a factorial, and with -1+x = -0.8 it gives the small probability of 2.76e-7, a tiny chance of it being in the extreme 1/5 of the interval.)”
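            The tail probability quoted in that paragraph can be checked with a few lines of Python. This is a sketch: the function name and its `half_width` parameter are assumptions made for illustration, and the formula is used only in its stated range, 0 <= x <= 0.2 for ten uniforms on (-0.1, +0.1).

```python
import math

def lower_tail_probability(x, n=10, half_width=0.1):
    """P[sum of n independent Uniform(-h, +h) errors < -n*h + x].

    Valid only for 0 <= x <= 2*h, where the probability is
    (x / (2*h))**n / n!; for h = 0.1 and n = 10 this is (5x)**10 / 10!.
    """
    if not 0.0 <= x <= 2 * half_width:
        raise ValueError("formula only holds for 0 <= x <= 2 * half_width")
    return (x / (2 * half_width)) ** n / math.factorial(n)

# Probability that the sum of the ten errors falls below -0.8.
p = lower_tail_probability(0.2)
print(f"{p:.3g}")
```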

            BTW I have said that I will give more detail on the reduction, or not, of uncertainty when means are used. I have been delayed in this because I managed to confuse myself over whether it was my Equation (16) or Equation (19) that was important, and of course it is (16). I think that's what is known as "full disclosure".

            Rich.

          • Clyde,

            Basically you are repeating exactly what I have been saying over and over.

            “I think that you are being inconsistent and careless in your use of terms such as “uncertainty,” “accuracy,” and “significant figures.””

            That is exactly the point. These three terms go together. No matter the accuracy of your thermometer, it will always have a resolution limit and therefore some uncertainty. And the resolution limit determines how many significant figures you can actually use.

            “I don’t think that the Law of Large numbers will compensate for the variable accuracy.”

            Of course it won’t. The law of large numbers only applies when you make multiple measurements of the same thing using the same device, exactly like your transit.

            It’s why you have to combine uncertainties when you are combining data from various instruments. No amount of calculating the mean with larger and larger quantities of values that have an inbuilt uncertainty can lessen the uncertainty.

            “But, determining that may be next to impossible because a single temperature (GMST) is not what is being measured! Instead, it is tens of thousands of different temperatures (which, incidentally, have a standard deviation of several tens of degrees on an annual basis!)”

            Exactly! It’s why it becomes impossible to say that this year is 0.001deg or 0.01deg hotter than last year. Your overall uncertainty interval is wider than that. So you really don’t know!

            “Thus, one has to determine what the uncertainty in both accuracy AND precision are, to be able to say anything intelligent about what an average of all the readings means.”

            Even the ARGO floats, with thermistors capable of discerning differences in temperature of 0.001deg have calibration curves. The thermistors themselves are not perfectly linear and individual elements can age differently depending on the environment they are subjected to. In addition, the actual temperature readings are dependent on several things such as the rate of flow of water past the thermistor so if the water path encounters any changes (e.g. algae growth, etc) that throws off the calibration. Even the salinity level or pollution level of the water being measured can throw off the calibration.

            If I confuse terms it is in hopes of trying to explain how uncertainty can be understood by standard statistical methods. An uncertainty interval is not a probability function with a standard deviation but it can be treated mathematically in the same manner. Just as you add variances of a population with a normal probability curve as described by a standard deviation you can add uncertainty intervals by considering them to be standard deviations. But you can only take this so far. You can say the uncertainty intervals of two different thermometers are like a uniform continuous probability function in order to show how to combine the uncertainty intervals but they are *not* uniform continuous probability functions normalized to the interval (0,1) so you can’t combine them through convolution to get a triangle function which supposedly more accurately defines a mean value with a higher probability of occurrence.

            This whole issue gets even worse when you consider that most of these statistical methods are based on samples taken from the same population group. When you try to calculate variances and standard deviation for two totally independent population groups, e.g. minimum temp +/- u_min and maximum temp +/- u_max, it gets even more complicated than what we’ve discussed here. I tried to show that with my pygmy and Watusi population combination. You can wind up with a meaningless mean and crazy variances and standard deviations.

            Yet none of the climate studies or climate models seem to take any of this into consideration. As Pat Frank showed the climate models are simply black boxes with a linear transfer function no matter how complicated their makeup of differential equations is. And the models can’t even adequately treat the uncertainty associated with that simple setup, they just ignore it totally!

    • Dr. Booth – two points here. First, about what the “bias” is. From Figure 16 and its caption, it looks like this “bias” is the difference of annual means, gridpoint by gridpoint, subtracting the CERES values averaged over the period 2001-2015 from the CM4.0 values averaged over the period 1980-2014. (If I have misunderstood this, I welcome a correction.) Second, then, I’m not suggesting the RMSE 6 W/m^2 value from Figure 15 corresponds somehow to the +/- 4 W/m^2 value appearing in Pat Frank’s paper. They are quite different.
      DD

  25. The real elephant in the room here is the lack of solid recognition that Booth’s Eq. 1 is a valid analytic representation of an ARIMA process where only the recursive (1-a)-term is a genuine parameter specifying the system response. We don’t have the situation that von Neumann delineated. The three additive R terms simply specify the input that gives rise to the output. And it’s only in the case where a = 0 that we get an output that behaves in accordance with Frank’s presumption of a random walk with ever-expanding variance. But that mathematically unstable case is a physically unrealistic representation of any Hamiltonian system in which finite energy is preserved.

    • 1sky1, “Frank’s presumption of a random walk”

      I make no such presumption.

      The variance expands as an indication of increasing ignorance. Not as a measure of increasing distance between true and predicted.

  26. 1sky1: thank you for those supportive words, but we don’t have a closed Hamiltonian system on Earth, because radiation in and out can vary.

    Rich.

      • 1sky1: Wow, that paper is deep mathematics! I’m afraid the 9-line Conclusions didn’t make me any the wiser. Do you have an “elevator speech” to explain the paper?

        Rich.

        • The point of referencing that paper was not to delve into its arcane treatment of some mathematical properties, but to point out simply that variability of energy inputs and outputs does not exclude Hamiltonian systems. Nor, in the customary thermodynamic sense, does a closed system require non-varying energy levels.

  27. I previously wrote (Feb12 5:07am) an apology, regarding uncertainties of means, about not distinguishing between two cases, the first being repeated measurements of the same variable under apparently identical conditions, and the second being single measurements of many different variables. I can now give some more detail on this.

    The second case is far the easier, as follows. We have n pairs (X_i,M_i) where X_i is the true unknown value of the measurand and M_i is the measurement. M_i is not equal to X_i because of both random variation in the measurement process and because it is quantized digital output, which by appropriate scaling we assume to be an integer. We assume that the error D_i = M_i-X_i has a probability distribution, reflecting our ignorance. Some people prefer to assign a uniform distribution, and that is the easiest case to analyze, so I assume each D_i is uniform in [-e,+e] with e at least a half because of the quantization. Note that Var[D_i] = e^2/3.

    Then given the measurements M_i, each X_i is uniform in [M_i-e,M_i+e]. The difference between the mean of the sample M_i’s and the true values X_i is the mean, D*, of the D_i’s, D* = sum_{i=1}^n D_i/n. Var[D*] = (n e^2/3)/n^2 = e^2/(3n). The standard uncertainty of the mean of the X_i’s is the square root of that, decreasing with sqrt(n) in the denominator.

    So while the uncertainty in the sum of the X’s increases with n, the uncertainty in the mean decreases.
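    The result Var[D*] = e^2/(3n) can be checked by simulation. The sketch below assumes, as this comment does, that the errors D_i are independent and uniform in [-e,+e]; the function names, trial count and seed are invented for the example. It says nothing about whether that independence assumption holds for real instruments, which is what other commenters dispute.

```python
import math
import random
import statistics

def simulated_sd_of_mean_error(n, e, trials=20000, seed=1):
    """Monte Carlo estimate of the s.d. of D* = (D_1 + ... + D_n) / n."""
    rng = random.Random(seed)
    mean_errors = [
        sum(rng.uniform(-e, e) for _ in range(n)) / n for _ in range(trials)
    ]
    return statistics.pstdev(mean_errors)

def theoretical_sd_of_mean_error(n, e):
    """sqrt(Var[D*]) = sqrt(e^2 / (3n)) for independent uniform errors."""
    return e / math.sqrt(3 * n)

simulated = simulated_sd_of_mean_error(n=10, e=0.5)
theoretical = theoretical_sd_of_mean_error(n=10, e=0.5)
print(round(simulated, 4), round(theoretical, 4))
```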

    Returning to the first case, there is only one X. I now constrain e to be at most 1, so there are at most 2 possible values for each M_i, thereby simplifying the problem. We assume an uninformative prior for X, in which the probability that X lies between x and x+d is a tiny number ud.

    Consider X in the unit interval (-1+e, e). If -1+e < X < 1-e then each M_i must be 0. We can write:

    P[M_1=…M_n=0, -1+e<X<1-e] = 2(1-e)u

    But if 1-e < X < e then each M_i can be either 0 or 1. Because the error distribution for M_i-X is uniform, any legal value of M_i is equally likely, so M_i = 0 with probability ½, and

    P[M_1=m_1,…M_n=m_n, 1-e<X<e] = (2e-1)u/2^n

    where each m_i is 0 or 1. In this case the probability that each M_i is identical is 2/2^n, which diminishes rapidly to 0 as n grows. So for large n, we can assert that if each M_i = 0 then -1+e < X < 1-e. The variance of X in the interval (-1+e,1-e) is (1-e)^2/3.

    If on the other hand the M_i’s are a mixture of 0’s and 1’s, we know that 1-e < X < e and the variance of X in that interval is (2e-1)^2/12.

    Overall, the mean variance of X, taking into account the width of the 2 intervals, is

    V = 2(1-e)(1-e)^2/3 + (2e-1)(2e-1)^2/12 = (e-3/4)^2 + 1/48

    Here is a table of the e, V, and the variance e^2/3 arising from a single observation (uncertainties are the square roots of these). At e = ½ all the variances equal 1/12 which is the variance of the output resolution; Section F explains why such an e is implausible as it implies infinite precision which gets discarded. At e = ¾, sqrt(V) is one half of the s.d. of the output resolution, contradicting statements that it is impossible to go below that bound.

    e       e^2/3    V
    0.500   0.0833   0.0833
    0.750   0.1875   0.0208
    0.866   0.2500   0.0343
    1.000   0.3333   0.0833
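    The table can be regenerated from the expression for V with a short Python sketch. The function name is an assumption made for illustration; the two terms are exactly the ones in the formula above, each interval's variance weighted by the interval's width.

```python
import math

def mean_variance_of_x(e):
    """V = 2(1-e)(1-e)^2/3 + (2e-1)(2e-1)^2/12, for 1/2 <= e <= 1."""
    if not 0.5 <= e <= 1.0:
        raise ValueError("e must lie between 1/2 and 1")
    all_zero_case = 2 * (1 - e) * (1 - e) ** 2 / 3   # every M_i equal to 0
    mixed_case = (2 * e - 1) * (2 * e - 1) ** 2 / 12  # mixture of 0's and 1's
    return all_zero_case + mixed_case

rows = [(e, e * e / 3, mean_variance_of_x(e))
        for e in (0.5, 0.75, math.sqrt(3) / 2, 1.0)]
for e, single, v in rows:
    print(f"{e:.3f}  {single:.4f}  {v:.4f}")
```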

    I have also, with even greater difficulty, worked out V_2 for the case n=2, but that may be of marginal interest. I would like to do calculations for normal errors, but that would require computer calculation, and I see this topic as a distraction from the more important question of the evolution of uncertainty.

    Rich.

    • Rich,

      How do you know what X_i actually is since any value within the uncertainty interval can be the true value?

      If the true value is at one extreme or the other then you no longer have: “Then given the MEASUREMENTS M_i, each X_i is uniform in [M_i-e,M_i+e].” (capitalization mine, tim).

      1. The true value could be at the very extreme end of the uncertainty interval so there is no +/- e but only +e or -e.
      2. You are working with a SINGLE measurement with an uncertainty interval, not multiple measurements of the same thing using the same device. There are no “measurements”.

      You simply can’t assume that X_i, i.e. the mean, is the true value.

      You keep falling back into the same old central limit theory assuming you have multiple measurements that can be combined to more accurately calculate the mean. I.e. “M_i is not equal to X_i because of both random variation in the measurement process”. This assumes you have multiple measurements that form a “random variation in the measurement process”.

      There *is* no random variation in the measurement process. There is a SINGLE measurement that has an uncertainty. When you calculate the mean temperature for a day at a measuring station you use a SINGLE measurement for the maximum temperature, a measurement that has an uncertainty interval. You don’t take multiple measurements at that station at the same point in time that can be used to generate an accurate mean using the central limit theory. The same thing applies to the minimum temperature. You have a *SINGLE* measurement, not multiple measurements.

      The plus and minus interval for uncertainty is *not* based on multiple measurements so there is no actual random probability distribution of measurements. The only reason for assuming a uniform distribution is to try and develop a way to add the uncertainty intervals based on known mathematics for random probability distributions. But since there are *not* multiple measurements you can’t take the similarity past figuring out how to handle the combining of the intervals.

      There is nothing wrong with your math, only with the assumptions that an uncertainty interval is a probability distribution of multiple measurements.

      “So while the uncertainty in the sum of the X’s increases with n, the uncertainty in the mean decreases.”
      “the probability that X lies between x and x+d is a tiny number ud.”
      “Because the error distribution for M_i-X is uniform”

      All of these assume a probability distribution for multiple measurements of the same thing. An uncertainty interval associated with a single measurement is *NOT* a probability distribution, not even a uniform one. It is only useful to consider as such in order to figure out a way to combine multiple separate single measurements of different things – root-sum-square.

      You continue to ignore how to combine a minimum temperature measurement of 60deg +/- 0.5deg and 72deg +/- 0.5deg and focus instead on how you can say that 60deg or 72deg is the “true value” using the central limit theory based on the uncertainty interval of each being a probability distribution of multiple measurements.

      The fact is that when you combine two separate measurements, i.e. try to calculate a mean between the two, that mean will have a variance (i.e. uncertainty interval) that is larger than the variance of each component. And the variances add as root-sum-square. Just like Pat pointed out in his analysis.

      If you have two measurements with an uncertainty interval of +/- 0.5deg then when combined you will have an uncertainty interval of sqrt( 0.25 + 0.25) = +/- 0.7deg, a value larger than that of either component. This applies every time you do an iterative step in a GCM. Combine thirty of these to get a monthly average and your uncertainty interval becomes sqrt(30 * 0.7) = +/- 5deg. How in Pete’s name could you possibly say that one year is 0.01deg hotter than the other when your uncertainty interval spans a total of 10deg? And an uncertainty interval of +/- 0.5deg is certainly not unreasonable for measurements taken in the late 19th century and early 20th century. In fact, unless modern measurement devices are regularly calibrated, +/- 0.5deg is not an unreasonable assumption for the uncertainty for them either!

      This also means that when you do something like take daily averages of hundreds of stations and try to combine them that the uncertainty interval grows and grows. At some point the uncertainty overwhelms your ability to say that comparing one set of averages is X amount different than another.

  28. Tim, I’ll deal with your specific points and then say something about when my calculations are or are not applicable, labelled APP.

    T: How do you know what X_i actually is since any value within the uncertainty interval can be the true value?

    Doh! Where did I say that X_i was known?

    T: If the true value is at one extreme or the other then you no longer have: “Then given the MEASUREMENTS M_i, each X_i is uniform in [M_i-e,M_i+e].” (capitalization mine, tim).

    Given that X is unknown (now dropping the i), we treat it as a random variable within that assumed uncertainty interval. We can write a probability density equation P[X ‘=’ x] = 1/(2e). Its actual value, x, could as you say be anywhere in the interval.

    T: 1. The true value could be at the very extreme end of the uncertainty interval so there is no +/- e but only +e or -e.

    I don’t understand your notation.

    T: 2. You are working with a SINGLE measurement with an uncertainty interval, not multiple measurements of the same thing using the same device. There is no “measurements”.

    No, in my second case I have a single measurement M_i approximating each unknown X_i, with the i’s representing different times and places. For my first case, see APP below.

    T: You simply can’t assume that X_i, i.e. the mean, is the true value.

    And I never did…

    T: You keep falling back into the same old central limit theory assuming you have multiple measurements that can be combined to more accurately calculate the mean. I.e. “M_i is not equal to X_i because of both random variation in the measurement process”. This assumes you have multiple measurements that form a “random variation in the measurement process”.

    I’ll have to keep on forgiving you for inappropriate reference to the Central Limit Theorem, which deals with the tendency to normality of summed random variables, not the reduction of variance in a mean.

    T: There *is* no random variation in the measurement process. There is a SINGLE measurement that has an uncertainty. When you calculate the mean temperature for a day at a measuring station you use a SINGLE measurement for the maximum temperature, a measurement that has an uncertainty interval. You don’t take multiple measurements at that station at the same point in time that can be used to generate an accurate mean using the central limit theory. The same thing applies to the minimum temperature. You have a *SINGLE* measurement, not multiple measurements.

    See APP.

    T: The plus and minus interval for uncertainty is *not* based on multiple measurements so there is no actual random probability distribution of measurements. The only reason for assuming a uniform distribution is to try and develop a way to add the uncertainty intervals based on known mathematics for random probability distributions. But since there are *not* multiple measurements you can’t take the similarity past figuring out how to handle the combining of the intervals.

    If e > 1/2 then for some values of X, more than 1 possibility for M exists. See APP.

    T: There is nothing wrong with your math, only with the assumptions that an uncertainty interval is a probability distribution of multiple measurements.

    T(R): “So while the uncertainty in the sum of the X’s increases with n, the uncertainty in the mean decreases.”
    “the probability that X lies between x and x+d is a tiny number ud.”
    “Because the error distribution for M_i-X is uniform”

    T: All of these assume a probability distribution for multiple measurements of the same thing. An uncertainty interval associated with a single measurement is *NOT* a probability distribution, not even a uniform one. It is only useful to consider as such in order to figure out a way to combine multiple separate single measurements of different things – root-sum-square.

    No, in my second case they are distributions for measurements of many different things. The only way to use probability theory properly is to assume that some distribution exists, and then find out its implications. A national standards body with expensive equipment to measure values to within very small uncertainty will be able to measure error distributions for inferior devices.

    T: You continue to ignore how to combine a minimum temperature measurement of 60deg +/- 0.5deg and 72deg +/- 0.5deg and focus instead on how you can say that 60deg or 72deg is the “true value” using the central limit theory based on the uncertainty interval of each being a probability distribution of multiple measurements.

    The first clause is correct – I haven’t looked at it.

    T: The fact is that when you combine two separate measurements, i.e. try to calculate a mean between the two, that mean will have a variance (i.e. uncertainty interval) that is larger than the variance of each component. And the variances add as root-sum-square. Just like Pat pointed out in his analysis.

    This is your main error, which I have pointed out before. Let the uncertainties of X_1 and X_2 be u_1 and u_2 respectively. We can agree that the uncertainty of X_1+X_2 is u = sqrt(u_1^2+u_2^2). But what is the uncertainty of Y = (X_1+X_2)/1000? It is u/1000. The JCGM defines uncertainty to be a measure of dispersion of the values a measurand could reasonably take. Since Y is 1000 times smaller than the X’s, its value and uncertainty are 1000 times smaller. Now replace 1000 by 2. The uncertainty of (X_1+X_2)/2 is u/2. If u_1 = u_2, then u = sqrt(2)*u_1, so u/2 = u_1/sqrt(2) < u_1. QED
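
As a rough numerical check of the scaling argument, here is a minimal Monte Carlo sketch. It assumes two independent errors, each uniform on [-0.5, +0.5], and uses the standard deviation as the measure of dispersion; the function name, trial count and seed are illustrative choices, not anything taken from the discussion.

```python
import random
import statistics


def simulate(u=0.5, trials=200_000, seed=1):
    """Spread of one error, of the sum of two, and of their mean.

    Assumption: both errors are independent and uniform on [-u, +u].
    """
    rng = random.Random(seed)
    single, total, mean = [], [], []
    for _ in range(trials):
        e1 = rng.uniform(-u, u)
        e2 = rng.uniform(-u, u)
        single.append(e1)
        total.append(e1 + e2)          # error of X_1 + X_2
        mean.append((e1 + e2) / 2)     # error of (X_1 + X_2) / 2
    return (
        statistics.pstdev(single),
        statistics.pstdev(total),
        statistics.pstdev(mean),
    )


sd_single, sd_sum, sd_mean = simulate()
print(f"single: {sd_single:.3f}  sum: {sd_sum:.3f}  mean: {sd_mean:.3f}")
```

Under these assumptions the spread of the sum should come out about sqrt(2) times that of a single error, and the spread of the mean about 1/sqrt(2) times it. Correlated errors would change both ratios.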

    T: If you have two measurements with an uncertainty interval of +/- 0.5deg then when combined you will have an uncertainty interval of sqrt( 0.25 + 0.25) = +/- 0.7deg, a value larger than that of either component. This applies every time you do an iterative step in a GCM. Combine thirty of these to get a monthly average and your uncertainty interval becomes sqrt(30 * 0.7) = +/- 5deg. How in Pete’s name could you possibly say that one year is 0.01deg hotter than the other when your uncertainty interval spans a total of 10deg? And an uncertainty interval of +/- 0.5deg is certainly not unreasonable for measurements taken in the late 19th century and early 20th century. In fact, unless modern measurement devices are regularly calibrated, +/- 0.5deg is not an unreasonable assumption for the uncertainty for them either!

    See my reply to the previous paragraph.

    T: This also means that when you do something like take daily averages of hundreds of stations and try to combine them that the uncertainty interval grows and grows. At some point the uncertainty overwhelms your ability to say that comparing one set of averages is X amount different than another.

    No, as before, with n independent measurements, uncertainties of sums increase with n, uncertainties of means decrease. Now for:

    APP: This is about the applicability of my calculations for multiple measurements M_i on a single measurand X. It assumes that independent measurements are possible, and that may not be the case. Let us take a particular value of e, 0.7, to demonstrate. We have digital output as integers and e = 0.7 means that in addition to the systematic rounding by up to +/-0.5 of the true value X, there is a further interval of +/-0.2. So, if X is truly 31.42, the device before rounding can register anywhere between 31.22 and 31.62 equally likely. Let's call that value Y_1. Those values between 31.22 and 31.5 will record output as 31, and those between 31.5 and 31.62 will output 32.

    Now, suppose there is good reason to believe that X has not changed. For example we might be in laboratory conditions where we are tightly constraining it. What if we take a new measurement, say one minute later? It all depends on the nature of the device as to whether the new value Y_2 before rounding, is independent of Y_1. If Y fluctuates rapidly, then independence of Y_1 and Y_2 seems reasonable. For example, remember the old speedometers with analogue needles which would wobble noticeably. On the other hand, Y might be pretty stable over short periods, but affected by lunar tide, or cosmic particles, or Earth's magnetic field, etc. etc., and be more variable over a longer time. In this case Tim is right and a new M_2 is almost certain to agree with the previous M_1.

    So my earlier demonstration, of a modest reduction in uncertainty with multiple observations of the same quantity, does depend on independence and that depends on the physics of the particular instrument.
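
The independent case described in APP can be sketched as a small simulation. It takes X = 31.42 and the extra +/-0.2 interval from the text, and assumes each pre-rounding value Y is drawn independently and uniformly and then rounded to the nearest integer; the function names, sample size and seed are hypothetical.

```python
import math
import random


def reading(x, extra, rng):
    """One digital reading: uniform jitter of +/-extra, then round to nearest integer."""
    y = rng.uniform(x - extra, x + extra)   # pre-rounding value Y
    return math.floor(y + 0.5)              # integer output M


def independent_readings(x=31.42, extra=0.2, n=100_000, seed=2):
    """n readings of the same X, assuming each Y is independent of the others."""
    rng = random.Random(seed)
    return [reading(x, extra, rng) for _ in range(n)]


readings = independent_readings()
frac_31 = readings.count(31) / len(readings)
mean_m = sum(readings) / len(readings)
print(f"fraction reading 31: {frac_31:.3f}, mean reading: {mean_m:.3f}")
```

With independent draws, roughly 70% of readings should be 31 and the rest 32, so repeated readings do carry some information about where X lies between 31 and 32. If Y were stable between readings, every M would repeat the first one and nothing would be gained.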

    Rich.

    • “Doh! Where did I say that X_i was known?”

      When you assume an equal “+/- e” then you have assumed that the mean is the true value and, therefore, that it is known.

      “we treat it as a random variable”

      It is not a “random variable”. Being a random variable assumes there are multiple measurements whose values take the form of a probability distribution.

      Rich: “Then given the MEASUREMENTS M_i, each X_i is uniform in [M_i-e,M_i+e]”
      tim: “The true value could be at the very extreme end of the uncertainty interval so there is no +/- e but only +e or -e.”

      “I don’t understand your notation.”

      What don’t you understand? You are the one that used the terminology. I am just pointing out that X_i, the true value, doesn’t have to be uniform with a uniform negative and positive interval.

      Uncertainty is not error. Pat has said that often enough that it should be burned in everyone’s brain. Error you can resolve with multiple measurements of a standard, you can’t do the same with one measurement that has uncertainty.

      “No, in my second case I have a single measurement M_i approximating each unknown X_i, with the i’s representing different times and places. For my first case, see APP below.”

      How can any single M_i approximate a true value X_i? That’s the whole issue in a nutshell. If I give you a single measurement of 72deg +/- 0.5deg, how do you know that nominal measurement, 72deg, approximates the true value? If that were the case then why does the uncertainty interval even exist?

      If you measure the temperature at different places with different thermometers then how do you add the uncertainty intervals together? You can’t do it with the central limit theory because that only holds for multiple measurements of the same thing, i.e. object, time, and place.

      “If e > 1/2 then for some values of X, more than 1 possibility for M exists”

      Again, you are assuming a probability function, see the word “possibility”.

      “No, in my second case they are distributions for measurements of many different things. The only way to use probability theory properly is to assume that some distribution exists, and then find out its implications. A national standards body with expensive equipment to measure values to within very small uncertainty will be able to measure error distributions for inferior devices.”

      Error is *not* uncertainty. If you are measuring a standard to determine calibration then you are doing a straight comparison to determine an error bias. That has nothing to do with uncertainty. The minute your calibrated instrument leaves the laboratory it will begin to lose calibration from aging, environment differences, etc. It will develop an uncertainty interval that only grows over time.

      ” But what is the uncertainty of Y = (X_1+X_2)/1000?”

      You keep falling back into the same trap, over and over. There is no “1000”. The population size of a single measurement is “1”. Y = whatever the nominal value of the measurement is. There is no X_1 and X_2. Y = X_1 in every case. And Y has an uncertainty interval that you can’t resolve from that one measurement.

      “No, as before, with n independent measurements, uncertainties of sums increase with n, uncertainties of means decrease. Now for:”

      ONLY IF YOU ARE MEASURING THE SAME THING MULTIPLE TIMES! When I tell you that the temperature here, right now, is 40degF that is based on one measurement by one device at one point in time and at one location. Where do you keep coming up with independent measurements? And if the guy six miles down the road says his thermometer reads 39degF, that is one measurement by one device at one point in time and at one location.

      Now tell me how averaging those two independent measurements with individual uncertainty intervals can be combined to give a mean that has a *smaller* uncertainty interval than either. These are not measuring the same thing with the same device in the same environment at the same time. It simply doesn’t matter how accurately you think you can calculate the mean of these two independent measurements, the overall uncertainty will grow, it will not decrease.

      Combining two independent populations is just not as simple as calculating the mean and dividing by the total population. I tried to explain that with the two independent populations of pygmies and Watusis. An example you *still* have not addressed.

      “This is about the applicability of my calculations for multiple measurements M_i on a single measurand X.”

      You simply don’t have multiple measurements on a single measurand. You don’t even have the same measuring device! Take all 20 temperature stations within a ten-mile radius of my location. The maximum temperature is measured at each. Each measurement is a population of one. The measurements are all taken at different times, in different locations, using different instruments. Each single measurement has a different uncertainty interval depending on the instrument model, age, location, etc.

      Now, you can certainly calculate the mean of those measurements. But the total uncertainty will be the root-sum-square of all the individual uncertainties. The total uncertainty will *not* decrease based on the size of the population. Not directly or by the square root.

    • [W]hen you do something like take daily averages of hundreds of stations and try to combine them…the uncertainty interval grows and grows.

      This contention flies in the face of the fact that the variance of time-series of temperature in any homogeneous climate area DECREASES as more COHERENT time-series are averaged together. Each measurement–when considered as a deviation from its own station mean–is a SINGLE REALIZATION, but NOT a POPULATION of one, as erroneously claimed. Sheer ignorance of this demonstrable empirical fact, along with the simplistic presumption that all measurements are stochastically independent, is what underlies the misbegotten random-walk conception of climatic uncertainty argued with Pavlovian persistence here.

      • As usual, 1sky1, you ignore the impact of non-normal systematic measurement error. Is yours a Pavlovian blindness, too?

        Every measurement has a unique deviation from the physically true temperature that is not known to belong to any normally-distributed population.

        Those deviations arise from uncontrolled environmental variables, especially wind speed and solar irradiance. Messy, isn’t it.

        Like Rich, 1sky1, you live in a Platonic fantasyland.

      • sky:

        “This contention flies in the face of the fact that the variance of time-series of temperature in any homogeneous climate area DECREASES as more COHERENT time-series are averaged together.”

        The measurements are not time coherent. What makes you think they are? They are maximum temperatures and maximum temperatures can occur at various times even in an homogeneous climate area. They are minimum temperatures and minimum temperatures can occur at various times even in an homogeneous climate area. Even stations only a mile apart can have different cloud coverage and different wind conditions, both of which can affect their readings.

        “Each measurement–when considered as a deviation from its own station mean”

        How do you get a station mean? That would require multiple measurements of the same thing and no weather data collection station that I know of does that. If a measurement device has an uncertainty interval associated with it no amount of calculating a daily mean from multiple measurements can decrease that uncertainty interval.

        “is a SINGLE REALIZATION, but NOT a POPULATION of one, as erroneously claimed. ”

        Of course it is a population of one. And it has an uncertainty interval.

        “Sheer ignorance of this demonstrable empirical fact”

        The empirical fact is that stations take one measurement at a time, separated in time from each other. Each station measures a different thing, like two investigators of which one measures the height of a pygmy and the other the height of a Watusi. How do you combine each of those measurements into a useful mean? How does combining those two measurements decrease the overall uncertainty associated with the measurements?

        “along with the simplistic presumption that all measurements are stochastically independent, is what underlies the misbegotten random-walk conception of climatic uncertainty argued with Pavlovian persistence here.”

        What makes you think the temperature measurements are made at random? That *is* the definition of stochastic, a random process, specifically that of a random variable.

        They *are* independent. The temperature reading at my weather station is totally independent of the temperature reading at another weather station 5 miles away! They are simply not measuring the same thing and they are not the same measuring device. And they each have their own uncertainty intervals that are independent of each other.

        If you had actually been paying attention, uncertainty does *not* result in a random walk. Uncertainty is not a random variable that provides an equal number of values on each side of a mean. And it is that characteristic that causes a random walk. Sometimes you turn left and sometimes you turn right. An uncertainty interval doesn’t ever tell you which way to turn!

      • Frank clings doggedly to the unfounded notion that “non-normal systematic measurement error” somehow overturns everything that is known analytically about stochastic processes in the ensemble sense and in their individual realizations. Truly systematic error introduces the well-known feature of systematic bias, which can be readily identified and removed. But what we have with sheltered temperature measurements in situ is sporadic (episodic) bias, which itself is a random process. Such bias becomes gaussian “noise” in the case of AGGREGATED station data.

        Gorman, once again, continues to express his blind faith, which flies in the face of demonstrable station data analyses.

      • Frank clings doggedly to the unfounded notion that “non-normal systematic measurement error” somehow overturns everything that is known analytically about stochastic processes, both in the ensemble sense and in the case of individual realizations. Truly systematic measurement error introduces a well-known bias, which can readily be removed. But in the case of sheltered temperature measurements in situ, what we have is sporadic episodes of bias, which themselves produce a random process. That process becomes gaussian “noise” when station data are aggregated over a sufficiently large number. The independence of measurement uncertainty at different stations thus leads to a reduction in total variance of the data in the aggregate case.

        Gorman’s ex ante argumentation is patently unaware of all of this and fails to come to grips with the well-known cross spectral coherence of nearby stations.

        • 1sky1 “Truly systematic measurement error introduces a well-known bias, which can readily be removed.”

          Do you understand the concept and impact of uncontrolled variables, 1sky1?

          1sky1 “But in the case of sheltered temperature measurements in situ, what we have is sporadic episodes of bias, which themselves produce a random process.”

          Undemonstrated anywhere.

          Hubbard and Lin (2002) doi: 10.1029/2001GL013191 combined thousands of single-instrument measurements and found non-normal distributions of error. Under ideal conditions of repair, calibration, and siting.

          “That process becomes gaussian “noise” when station data are aggregated over a sufficiently large number.”

          Hand-waving. You don’t know that, and neither does anyone else.

          “… the well-known cross spectral coherence of nearby stations”

          You’re in for a surprise.

    • Tim Feb 15 1:35pm

      This is getting wearisome. You went up in my estimation last October, Tim, but have rather declined since. The problem is that you seem to be denying some basic mathematics, so this may well be the last time I respond to you. I’m going to mark new comments with ‘RN’ below.

      “Doh! Where did I say that X_i was known?”

      When you assume an equal “+/- e” then you have assumed that the mean is the true value and, therefore, that it is known.

      RN: No, +/-e (uniform) means there is an interval (X-e,X+e) in which M must lie. In addition, M must be an integer under the assumption of digital output appropriately scaled. After M=m is observed, we know that X is in the interval (m-e,m+e). Obviously we don’t know what X is.

      “we treat it as a random variable”

      It is not a “random variable”. Being a random variable assumes there are multiple measurements whose values take the form of a probability distribution.

      RN: False. Suppose I choose to throw a fair die once? Perhaps it has been thrown many times in the past to establish its fairness. Or perhaps never, but comes from a sample which has through trials been shown statistically fair. Before I throw it, it is a random variable with probability 1/6 of each face coming topmost. After I throw it, it is a measurement of that random variable, and is now a fixed value.

      Rich: “Then given the MEASUREMENTS M_i, each X_i is uniform in [M_i-e,M_i+e]”
      tim: “The true value could be at the very extreme end of the uncertainty interval so there is no +/- e but only +e or -e.”

      “I don’t understand your notation.”

      What don’t you understand? You are the one that used the terminology. I am just pointing out that X_i, the true value, doesn’t have to be uniform with a uniform negative and positive interval.

      RN: True, X_i doesn’t have to be uniform, but science proceeds by making assumptions and testing them where possible. For simplicity, I have mostly been assuming uniformity, as indeed have you in many of your comments.

      Uncertainty is not error. Pat has said that often enough that it should be burned in everyone’s brain. Error you can resolve with multiple measurements of a standard, you can’t do the same with one measurement that has uncertainty.

      RN: Yes, uncertainty is not error. It is the distribution, or often just the dispersion of, or often just the standard deviation of the dispersion of, the random variable which represents the error M-X before measurement – see the JCGM.

      “No, in my second case I have a single measurement M_i approximating each unknown X_i, with the i’s representing different times and places. For my first case, see APP below.”

      How can any single M_i approximate a true value X_i? That’s the whole issue in a nutshell. If I give you a single measurement of 72deg +/- 0.5deg, how do you know that nominal measurement, 72deg, approximates the true value? If that were the case then why does the uncertainty interval even exist?

      RN: I know it approximates the true value because you, or someone else, swore blind that the thermometer reads accurately to within a certain error, some of whose parameters have been determined. If your thermometer is really only accurate to within 2 degrees, don’t tell me it’s accurate to within half a degree. And the uncertainty “interval” exists to reflect exactly those statements about its error.

      If you measure the temperature at different places with different thermometers then how do you add the uncertainty intervals together? You can’t do it with the central limit theory because that only holds for multiple measurements of the same thing, i.e. object, time, and place.

      RN: False. (More forgiveness, I’m just back from Sunday church and you are inappropriately using “central limit theory” again. It’s OK, I know what you mean anyway.) No, the theory of the distribution of the sum, or of a mean, of n random variables does not depend on them being multiple measurements of the same thing. Probably best for you to read a good statistics book. The theory does have to take account of correlation between them, and gives a simpler result if there isn’t any, which incidentally is more plausible if they are at a different time or place.
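
The role of correlation can be illustrated with a short calculation. This is a sketch under a simplifying assumption that is not in the discussion itself: every pair of measurement errors shares one common correlation coefficient `rho`. The function name and the example uncertainties are hypothetical.

```python
import math


def mean_uncertainty(sds, rho=0.0):
    """Standard uncertainty of the mean of n measurements.

    sds : standard uncertainties of the individual measurements
          (they may differ: different instruments, times, places)
    rho : assumed common correlation between every pair of errors
    """
    n = len(sds)
    # Variance of the sum: individual variances plus pairwise covariance terms.
    var_sum = sum(s * s for s in sds)
    var_sum += rho * sum(
        sds[i] * sds[j] for i in range(n) for j in range(n) if i != j
    )
    # Dividing the sum by n divides its standard uncertainty by n.
    return math.sqrt(var_sum) / n


print(mean_uncertainty([0.5, 0.5]))            # uncorrelated errors
print(mean_uncertainty([0.5, 0.5], rho=1.0))   # fully correlated errors
print(mean_uncertainty([0.3, 0.5, 0.7]))       # unequal uncertainties, uncorrelated
```

With `rho = 0` the result for two equal uncertainties is 0.5/sqrt(2); with `rho = 1` there is no reduction at all, which is why the independence assumption matters.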

      “If e > 1/2 then for some values of X, more than 1 possibility for M exists”

      Again, you are assuming a probability function, see the word “possibility”.

      RN: Yes, I am, quite justifiably. The theory of uncertainty for a sum of measurements in the JCGM does not proceed mathematically without assumption that a probability function, even if unknown, exists. Oh, I see that’s in my next quoted comment anyway.

      “No, in my second case they are distributions for measurements of many different things. The only way to use probability theory properly is to assume that some distribution exists, and then find out its implications. A national standards body with expensive equipment to measure values to within very small uncertainty will be able to measure error distributions for inferior devices.”

      Error is *not* uncertainty. If you are measuring a standard to determine calibration then you are doing a straight comparison to determine an error bias. That has nothing to do with uncertainty. The minute your calibrated instrument leaves the laboratory it will begin to lose calibration from aging, environment differences, etc. It will develop an uncertainty interval that only grows over time.

      RN: Perhaps it will, but if we are to use the instrument to good effect we need an estimate, or perhaps worst case, of its uncertainty when we use it. Otherwise all bets are off, and we might as well say “Oh, we can’t measure global warming, we just believe in it”. Oh, lots of people do that anyway… And in any case the laboratory should have tested the instrument under varying environmental conditions, and supplied an uncertainty value appropriately. In fact, they might know it has great accuracy between 20 and 25 degC, but have to publish a worse figure in case someone uses it at -40 or +50.

      ” But what is the uncertainty of Y = (X_1+X_2)/1000?”

      You keep falling back into the same trap, over and over. There is no “1000”. The population size of a single measurement is “1”. Y = whatever the nominal value of the measurement is. There is no X_1 and X_2. Y = X_1 in every case. And Y has an uncertainty interval that you can’t resolve from that one measurement.

      RN: Now you are not only denying basic mathematics/statistics, but denying the existence of the number 1000! My X_1 and X_2 are, for example, the LCF values for the years 2011 and 2012, the uncertainties of which Pat Frank combines in the way I described. And the point about the 1000 is that scaling matters. So uncertainty of a mean is not the same as the uncertainty of a sum.

      “No, as before, with n independent measurements, uncertainties of sums increase with n, uncertainties of means decrease. Now for:”

      ONLY IF YOU ARE MEASURING THE SAME THING MULTIPLE TIMES! When I tell you that the temperature here, right now, is 40degF that is based on one measurement by one device at one point in time and at one location. Where do you keep coming up with independent measurements? And if the guy six miles down the road says his thermometer reads 39degF, that is one measurement by one device at one point in time and at one location.

      Now tell me how averaging those two independent measurements with individual uncertainty intervals can be combined to give a mean that has a *smaller* uncertainty interval than either. These are not measuring the same thing with the same device in the same environment at the same time. It simply doesn’t matter how accurately you think you can calculate the mean of these two independent measurements, the overall uncertainty will grow, it will not decrease.

      RN: Again, you use the word “combine”, but the combining function is important. Mean and sum are not the same function, because mean scales down by the sample size. It’s elementary mathematics.
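
To make the difference between the two combining functions concrete, here is a closed-form sketch for n measurements. It assumes independent errors with equal standard uncertainty u; the value u = 0.5 and the list of n values are only illustrative.

```python
import math


def sum_and_mean_uncertainty(u, n):
    """Uncertainty of the sum and of the mean of n independent measurements.

    Assumption: every measurement has the same standard uncertainty u
    and the errors are uncorrelated.
    """
    u_sum = u * math.sqrt(n)   # root-sum-square of n equal terms
    u_mean = u_sum / n         # the mean is the sum scaled down by n
    return u_sum, u_mean


for n in (1, 2, 30, 365):
    u_sum, u_mean = sum_and_mean_uncertainty(0.5, n)
    print(f"n={n:4d}  sum: +/-{u_sum:.3f}  mean: +/-{u_mean:.3f}")
```

The root-sum-square step is the same in both columns; only the final division by n differs, and that division is what makes the uncertainty of the mean fall as n grows under these assumptions.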

      Combining two independent populations is just not as simple as calculating the mean and dividing by the total population. I tried to explain that with the two independent populations of pygmies and Watusis. An example you *still* have not addressed.

      RN: True, better things to do with my time I’m afraid.

      “This is about the applicability of my calculations for multiple measurements M_i on a single measurand X.”

      You simply don’t have multiple measurements on a single measurand. You don’t even have the same measuring device! Take all 20 temperature stations within a ten-mile radius of my location. The maximum temperature is measured at each. Each measurement is a population of one. The measurements are all taken at different times, in different locations, using different instruments. Each single measurement has a different uncertainty interval depending on the instrument model, age, location, etc.

      RN: In the APP section, the type of measurand was not specified. Temperature may not be a good example for reasons you cite. That’s why I suggested an old style wobbly speedometer converted to digital output, but I expect there are better examples.

      Now, you can certainly calculate the mean of those measurements. But the total uncertainty will be the root-sum-square of all the individual uncertainties. The total uncertainty will *not* decrease based on the size of the population. Not directly or by the square root.

      RN: But in the APP section I did not suggest taking the mean of them. Why did you think I did? Again, your comment is wide of the mark.

      Farewell and adieu,
      Rich.

      • Rich, “[uncertainty is] just the standard deviation of the dispersion of, the random variable which represents the error …”

        Not in the real world of measurement science.

        You quote the JCGM where it is elaborating the assumptions that suit your views, Rich. Your use of their authority is purely circular. You choose the part where they assume what you do, then you cite them as authority for your assumption.

        In B.N. Taylor and C.E. Kuyatt, Guidelines for Evaluating and Expressing the Uncertainty of NIST Measurement Results, 1994, National Institute of Standards and Technology: Washington, DC. p. 20.

        Under “D.1.1.6 systematic error [VIM 3.14]

        [The] mean that would result from an infinite number of measurements of the same measurand carried out under repeatability conditions minus the value of the measurand

        NOTES
        1 Systematic error is equal to error minus random error
        (my bold).
        2 Like the value of the measurand, systematic error and its causes cannot be completely known.

        Systematic error is not random error. The sign or magnitude of any instance of systematic error is not known.

        In the Note under 3.2.3 in the JCGM “The uncertainty of a correction applied to a measurement result to compensate for a systematic effect is not the systematic error, often termed bias, in the measurement result due to the effect as it is sometimes called. It is instead a measure of the uncertainty of the result due to incomplete knowledge of the required value of the correction. The error arising from imperfect compensation of a systematic effect cannot be exactly known. The terms “error” and “uncertainty” should be used properly and care taken to distinguish between them.”

        JCGM under 3.3 Uncertainty
        3.3.1 The uncertainty of the result of a measurement reflects the lack of exact knowledge of the value of the measurand (see 2.2). The result of a measurement after correction for recognized systematic effects is still only an estimate of the value of the measurand because of the uncertainty arising from random effects and from imperfect correction of the result for systematic effects.

        NOTE The result of a measurement (after correction) can unknowably be very close to the value of the measurand (and hence have a negligible error) even though it may have a large uncertainty. Thus the uncertainty of the result of a measurement should not be confused with the remaining unknown error.

        Under 3.3.2 “Of course, an unrecognized systematic effect cannot be taken into account in the evaluation of the uncertainty of the result of a measurement but contributes to its error.”

        Under 3.3.4 “Both types of evaluation [i.e., of random error and of systematic error — P] are based on probability distributions (C.2.3), and the uncertainty components resulting from either type are quantified by variances or standard deviations.

        Under 5.1.4 “The combined standard uncertainty u_c(y) [i.e., consisting of both random and systematic components – P] is the positive square root of the combined variance u_c²(y), which is given by … Equation (10) … based on a first-order Taylor series approximation of Y = f (X₁, X₂, …, Xɴ), express what is termed in this Guide the law of propagation of uncertainty.”

        You’re just being tendentious, Rich. Sticking to the incomplete view that allows your preferred conclusion.

        • Pat Feb 16 11:00am

          I am amazed that you think that I am being “tendentious” and ignoring important parts of the JCGM! In the spirit of “what do you mean by ‘mean'” I shall once again supply some mathematics to elucidate, again with X = true (unknowable) value, M or M_i for measurements, D = M-X for error. For now I’ll just consider the case where M is an analogue (continuous) reading, because rounding does complicate matters. My position is that D is a random variable and that its distribution is the most general concept available for uncertainty. But I accept that the JCGM simplifies that into a bias element b = E[D] and an uncertainty element s = sqrt(Var[D]). So now I’ll annotate what you wrote, with ‘P’ prefixes for your paragraphs, and you’ll find little disagreement.

          P: Under “D.1.1.6 systematic error [VIM 3.14]
          [The] mean that would result from an infinite number of measurements of the same measurand carried out under repeatability conditions minus the value of the measurand.”

          Infinite number of measurements would be M_1,…,M_n as n goes to infinity. Mean = sum M_i/n. “Minus the value of the measurand” is subtracting X, giving sum (M_i-X)/n = sum D_i/n. Providing that the M_n are relatively independent (not highly correlated), that sum tends to b under the Weak Law of Large Numbers. So systematic error equals b.
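A small simulation sketch of the limiting argument above. The errors D_i are drawn here as independent Normal(b, s) values purely for illustration; the Weak Law of Large Numbers does not require normality, and the values b = 0.3, s = 1.0 and the function name are assumptions, not taken from the thread.

```python
import random

def mean_error(n, b=0.3, s=1.0, seed=0):
    """Average of n simulated errors D_i = M_i - X, each drawn as Normal(b, s)."""
    rng = random.Random(seed)
    return sum(rng.gauss(b, s) for _ in range(n)) / n

# As n grows, the average error settles towards the bias b.
for n in (10, 1000, 100000):
    print(n, round(mean_error(n), 4))
```

The random part D - b averages away while the bias b remains, which is the sense in which the quoted definition identifies systematic error with b.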

          P: NOTES
          1 Systematic error is equal to error minus random error.
          2 Like the value of the measurand, systematic error and its causes cannot be completely known.

          Note 1 implies that random error is error minus systematic error, which is D-b, with expectation E[D]-b = 0. So the term “random error” is for the departure of D from its mean, and standard uncertainty is its standard deviation, which is also its root mean square since it has mean 0. All seems reasonable – and I now notice that Section 3.2.2 confirms all that.

          P: Systematic error is not random error. The sign or magnitude of any instance of systematic error is not known.

          Indeed, the deduction made above is that the JCGM takes b to be the “systematic error”, and D-b to be the “random error”.

          P: In the Note under 3.2.3 in the JCGM “The uncertainty of a correction applied to a measurement result to compensate for a systematic effect is not the systematic error, often termed bias, in the measurement result due to the effect as it is sometimes called. It is instead a measure of the uncertainty of the result due to incomplete knowledge of the required value of the correction. The error arising from imperfect compensation of a systematic effect cannot be exactly known. The terms “error” and “uncertainty” should be used properly and care taken to distinguish between them.”

          That confirms that “systematic error” is what I was calling bias (the more usual name in statistics), which is b = E[D]. I agree with everything there.

          P: JCGM under 3.3 Uncertainty
          “3.3.1 The uncertainty of the result of a measurement reflects the lack of exact knowledge of the value of the measurand (see 2.2). The result of a measurement after correction for recognized systematic effects is still only an estimate of the value of the measurand because of the uncertainty arising from random effects and from imperfect correction of the result for systematic effects.

          Yes, agree.

          P: NOTE The result of a measurement (after correction) can unknowably be very close to the value of the measurand (and hence have a negligible error) even though it may have a large uncertainty. Thus the uncertainty of the result of a measurement should not be confused with the remaining unknown error.”

          Yes, agree.

          P: Under 3.3.2 “Of course, an unrecognized systematic effect cannot be taken into account in the evaluation of the uncertainty of the result of a measurement but contributes to its error.”

          Yes, assuming “systematic effect” equates to “systematic error”, which is bias, then that affects the bias (b) and the total error (D), but not the “random error” and therefore not the uncertainty.

          P: Under 3.3.4 “Both types of evaluation [i.e., of random error and of systematic error — P] are based on probability distributions (C.2.3), and the uncertainty components resulting from either type are quantified by variances or standard deviations.”

          This is exactly what I have been saying but Tim Gorman appears to have been denying.

          P: Under 5.1.4 “The combined standard uncertainty u_c(y) [i.e., consisting of both random and systematic components – P] is the positive square root of the combined variance u_c²(y), which is given by … Equation (10) … based on a first-order Taylor series approximation of Y = f (X₁, X₂, …, Xɴ), express what is termed in this Guide the law of propagation of uncertainty.”

          Yes, with two caveats. The first is that, as you will know, that section is for uncorrelated errors. By the way, you are actually quoting from 5.1.2 not 5.1.4. The second caveat regards your parenthesis about random and systematic components. The “combined” adjective refers to combining the uncertainties u(x_i) of all the inputs to a function f. Each one of those can be either a Type A or a Type B uncertainty, which corresponds to the way in which the uncertainty value was derived, but does not differentiate between random and systematic components.
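For reference, the law of propagation of uncertainty for uncorrelated inputs (the guide's Equation (10)) can be written as below, followed by two special cases worked out here for illustration: a plain sum and a mean of N inputs.

```latex
\begin{align}
u_c^2(y) &= \sum_{i=1}^{N} \left(\frac{\partial f}{\partial x_i}\right)^{2} u^2(x_i) \\
f = \sum_{i=1}^{N} x_i \;&\Rightarrow\; u_c(y) = \sqrt{\sum_{i=1}^{N} u^2(x_i)} \\
f = \frac{1}{N}\sum_{i=1}^{N} x_i \;&\Rightarrow\; u_c(y) = \frac{1}{N}\sqrt{\sum_{i=1}^{N} u^2(x_i)}
\end{align}
```

The two special cases differ only in the sensitivity coefficients: each partial derivative is 1 for the sum and 1/N for the mean.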

          In fact, a systematic error does not per se contribute to an uncertainty term u(x_i). It is only the attempt to correct for the systematic error, i.e. to reduce the bias, which adds uncertainty. But typically the added uncertainty there is small. It can be shown under reasonable assumptions that using n observations to determine the correction leads to uncertainty being multiplied by (1+1/(2n)).
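One possible route to the factor (1+1/(2n)), offered as a sketch and not necessarily the derivation intended above. It assumes the correction is the mean of n independent calibration errors D_1, …, D_n, each with variance s², and that these are independent of the error D of the measurement being corrected.

```latex
\begin{align}
\hat{b} &= \frac{1}{n}\sum_{i=1}^{n} D_i, \qquad \operatorname{Var}[\hat{b}] = \frac{s^2}{n} \\
\operatorname{Var}[D - \hat{b}] &= s^2 + \frac{s^2}{n} = s^2\left(1 + \frac{1}{n}\right) \\
\sqrt{\operatorname{Var}[D - \hat{b}]} &= s\sqrt{1 + \frac{1}{n}} \approx s\left(1 + \frac{1}{2n}\right)
\end{align}
```

The final approximation is the first-order expansion of the square root, so it is accurate when n is not small.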

          So, Pat, where is it that you think I have been misusing the JCGM just in ways that suit me?

          • Rich — right at the start, “ D = M-X for error. … My position is that D is a random variable and that its distribution is the most general concept available for uncertainty. But I accept that the JCGM simplifies that into a bias element b = E[D] and an uncertainty element s = sqrt(Var[D]).

            D is not a random variable. JCGM describes systematic error as unknowable. In insisting that D is a random variable, you’re immediately assuming your conclusion.

            That’s as tendentious as it is possible to get.

            Look at note 1 from Taylor and Kuyatt: “Systematic error is equal to error minus random error” Systematic error is definitively not randomly distributed.

            Systematic error from uncontrolled variables is unknown in both sign and magnitude. Measurements that contain systematic errors can behave just like good data.

            The systematic error in the prediction from an inadequate theory cannot be known, even in principle, because there are no well-constrained observables for comparison.

            And then you go right ahead and assume from the outset that it’s all just random variables. It’s too much, Rich.

            Science isn’t statistics. Physical methods and incomplete physical theories do not conform to any ideal.

            No amount of mathematics will convert bad data into good. Or sharpen the blur within a resolution limit, for that matter.

          • Rich, “Minus the value of the measurand” is subtracting X, giving sum (M_i-X)/n = sum D_i/n.”

            Not quite.

            “Minus the value of the measurand” is subtracting X from the mean of measurements, giving {sum [(M_i)/n]} – X = systematic error S, because random error D has reduced by 1/[sqrt(n=infinity)]

            Rich, “assuming “systematic effect” equates to “systematic error”, which is bias, then that affects the bias (b) and the total error (D), but not the “random error” and therefore not the uncertainty.”

            Rather, ‘yes the uncertainty.’ Unknown bias contributes uncertainty to a measurement.

            Unknown bias cannot be subtracted away.

            In real world measurements and observations, the size and sign of ’b’ are unknown. Hence the need for instrumental (or model) calibration under the conditions of the experiment.

            Quoted P: Under 3.3.4 “Both types of evaluation [i.e., of random error and of systematic error — P] are based on probability distributions (C.2.3), and the uncertainty components resulting from either type are quantified by variances or standard deviations.”

            Rich, “This is exactly what I have been saying but Tim Gorman appears to have been denying.”

            Rather, it’s what you denied just above, Rich, where you wrote (wrongly) that bias does not contribute to uncertainty.

            The JCGM there specifically indicated that uncertainty arises from systematic error. You denied it. Tim Gorman has repeatedly explained it.

            I believe the problem is that you’re supposing that the use of variances or standard deviations in the JCGM strictly implies the metrics of a normal distribution.

            If you think this, you are under a serious misapprehension. They imply no such thing.

            In practice, the same mathematics and the same terminology are used to evaluate non-normal systematic error, as are used for random error. See JCGM 5.1.2ff.

            Sorry for the 5.1.2 – 5.1.4 mix-up. But the point remains.

            Rich, “Yes, with two caveats. The first is that, as you will know, that section is for uncorrelated errors.”

            Actually, it’s for uncorrelated input quantities — the X_i, not the D_i.

            The X_i would be multiple independent measurements of the same so-called measurand. In analytical work, statistical independence can be provided by making multiple independent samples, and then measuring each once. In a perfect system, all the X_i would be identical. This is never achieved in real labs.

            Hence the need for calibration against known standards, and application of the calibration-derived uncertainty to every single experimental measurement. That uncertainty never diminishes with repeated measurements. In sums of experimental measurements, the final uncertainty is the rms.

            When such measurements serially enter a sequential set of calculations, the uncertainties propagate through as the rss.
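A minimal sketch of root-sum-square (rss) accumulation through a sequence of steps, as described above. The per-step uncertainty of 1.0 and the 100 steps are hypothetical values chosen for illustration, and `rss` is an illustrative helper name.

```python
import math

def rss(uncertainties):
    """Root-sum-square combination of a list of uncertainties."""
    return math.sqrt(sum(u * u for u in uncertainties))

step_u = 1.0   # hypothetical uncertainty entering at each step
n_steps = 100  # hypothetical number of sequential steps

# Running rss after 1, 2, ..., n_steps steps.
running = [rss([step_u] * k) for k in range(1, n_steps + 1)]
total = running[-1]
print(total)
```

With a constant per-step uncertainty the running total grows as the square root of the number of steps, so it never decreases as steps are added.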

            Rich, “Each one of those can be either a Type A or a Type B uncertainty, which corresponds to the way in which the uncertainty value was derived, but does not differentiate between random and systematic components.”

            From the JCGM, Type A evaluations of standard uncertainty components are founded on frequency distributions while Type B evaluations are founded on a priori distributions.

            Only completed Type A evaluations provide information about the shape of the error distribution, potentially meeting the assumptions that fully justify statistical analysis.

            Type B evaluations estimate uncertainty by bringing in external information. The assumptions that justify statistics need not be met.

            Going down to 4.3 Type B evaluation of standard uncertainty, we find among the sources of Type B information: data provided in calibration and other certificates.

            Calibration is exactly what we have been discussing, with respect to evaluation of systematic errors. GCM global cloud fraction simulation error is a systematic error, and satellite observations of cloud fraction are the (imperfect) calibration standard.

            Take a look at JCGM Appendix F.2 Components evaluated by other means: Type B evaluation of standard uncertainty

            One must go all the way down to F.2.6.3, to find mention of the situation we’re discussing here.

            The very last sentence includes, “… when the effects of environmental influence quantities on the sample are significant, the skill and knowledge of the analyst derived from experience and all of the currently available information are required for evaluating the uncertainty.”

            F 2.6.2 also has relevance in the discussion of the effects of unknown sample inhomogeneities (i.e., uncontrolled environmental variables).

            The effects of environmental influence quantities refers exactly to the impact of uncontrolled variables on the measurement, for which the complete uncertainty must account.

            Calibration of an instrument against a well-known standard is the requisite way of assessing the accuracy of an experimental measurement. Calibration against a well constrained observable is the standard way of assessing the accuracy of a prediction from a physical model.

            Calibration error includes both random and systematic components, and the sign and magnitude of the systematic error is not known for any single measured datum M_i, typically because the true value X_i is not known.

            In a prediction of future climate states, the sign and magnitude of the systematic error in each step of a simulation is necessarily unknown. All one has is the prior information about predictive uncertainty determined by the calibration error statistic.

            Regarding the meaning of the ±u(c) derived from systematic error of unknown sign and magnitude arising from uncontrolled variables, the JCGM says this:

            Under 4.3.7, “In other cases, it may be possible to estimate only bounds (upper and lower limits) for Xi, in particular, to state that “the probability that the value of Xi lies within the interval a− to a+ for all practical purposes is equal to one and the probability that Xi lies outside this interval is essentially zero”. If there is no specific knowledge about the possible values of Xi within the interval, one can only assume that it is equally probable for Xi to lie anywhere within it…”

            Paragraph 4.3.7 is exactly what Tim Gorman has been explaining.
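For a quantity known only to lie between bounds a− and a+, with every value inside treated as equally probable, the standard deviation of that rectangular distribution is (a+ − a−)/√12, i.e. a/√3 for half-width a. The sketch below checks this numerically; the bounds of ±0.5 and the function name are illustrative assumptions.

```python
import math
import random

def rectangular_standard_uncertainty(a_minus, a_plus):
    """Standard deviation of a uniform distribution on [a_minus, a_plus]."""
    return (a_plus - a_minus) / math.sqrt(12)

lo, hi = -0.5, 0.5  # hypothetical bounds, half-width a = 0.5
analytic = rectangular_standard_uncertainty(lo, hi)

# Monte Carlo check: sample uniformly and compute the standard deviation.
rng = random.Random(1)
samples = [rng.uniform(lo, hi) for _ in range(200000)]
mean = sum(samples) / len(samples)
sd = math.sqrt(sum((x - mean) ** 2 for x in samples) / len(samples))
print(analytic, sd)
```

This only converts stated bounds into a standard uncertainty; it says nothing about whether the bounds themselves were well chosen.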

            Rich, “So, Pat, where is it that you think I have been misusing the JCGM just in ways that suit me?”

            In your insistence that all error is random, Rich. In your denial of the impact of bias on uncertainty.

            The JCGM does not support your view.

            But your mistaken view is what allows you your preferred conclusions.

            Nor does the actual practice of science support your view. Science is theory and result. Not just theory. Result is observation and measurement. Theory stands or falls on the judgment of result.

            Observations and measurements always have an associated uncertainty that includes limits of resolution as well as external impacts. Physical science is messy that way.

            And so physical scientists have had to develop methods to estimate the reliability of data. Statistics provides a very important method of estimation, and is used even when the structure of the error violates statistical assumptions.

            Climate modeling — and consensus climatology in general — has neglected, even repudiated, the results part of science to their eventual utter downfall.

          • Pat Feb 17 9:56am

            P: Rich — right at the start, “ D = M-X for error. … My position is that D is a random variable and that its distribution is the most general concept available for uncertainty. But I accept that the JCGM simplifies that into a bias element b = E[D] and an uncertainty element s = sqrt(Var[D]).”

            P: D is not a random variable. JCGM describes systematic error as unknowable. In insisting that D is a random variable, you’re immediately assuming your conclusion.

            I’m not sure what “conclusion” I have assumed, but let that pass; I suppose your implication is that if I got that wrong then you can safely ignore anything else I write. But my reading of the JCGM, replete with probability theory as it is, says I didn’t get it wrong; I’ll revisit that in the next paragraph. By all means, Pat, concentrate on physics rather than statistics, but don’t then claim that you know enough about statistics and uncertainty to claim that your emulator is good enough, even using valid equations of propagation of uncertainty, to say anything useful about the propagation of uncertainty in GCMs (not that I hold GCMs in huge regard myself).

            So, is D a random variable? Why does the statistics of error exist, or the JCGM exist, if not to shed light on the discrepancy (error) between a presumed physical value X and a measurement M of it? Why does the JCGM talk of variance and standard deviation of error, if M-X is not fairly represented as a random variable? JCGM 3.3.4: “Both types of evaluation are based on probability distributions (C.2.3), and the uncertainty components resulting from either type are quantified by variances or standard deviations”.

            P: Look at note 1 from Taylor and Kuyatt: “Systematic error is equal to error minus random error” Systematic error is definitively not randomly distributed.

            Correct: systematic error = bias = a parameter of the error probability distribution, is unknown though some inferences from data may occur, and is not a random variable.

            P: Systematic error from uncontrolled variables is unknown in both sign and magnitude. Measurements that contain systematic errors can behave just like good data.

            Correct, but beware the “uncontrolled”.

            P: The systematic error in the prediction from an inadequate theory cannot be known, even in principle, because there are no well-constrained observables for comparison.

            I agree, but again beware the “inadequate”: the GCMs do have well-constrained observables for comparison, namely global temperature data. In any case, the crux of your paper isn’t about systematic error (= bias = mean error), it is about uncertainty (= standard deviation of error).

            P: Science isn’t statistics. Physical methods and incomplete physical theories do not conform to any ideal. No amount of mathematics will convert bad data into good. Or sharpen the blur within a resolution limit, for that matter.

            But where science uses numbers to draw conclusions, mathematics is needed to ensure that those conclusions are derived in a rational, justifiable, way. It is no good using formulae from the JCGM and then saying you don’t believe in any of the mathematics underpinning it.

            I know that science isn’t, or shouldn’t be, about consensus, but I’d be fascinated to know how many readers with science/maths degrees agree with me or with you, and how the type of degree might affect those statistics!

            Rich.

          • Rich, “I’m not sure what “conclusion” I have assumed, but let that pass; …”

            When you assume random variables you assume your conclusion, which is embedded in the supposition that the statistics of normal distributions uniformly apply to those variables and ultimately to physical error.

            Rich, “I suppose your implication is that if I got that wrong then you can safely ignore anything else I write.”

            When have I ignored what you write, Rich? I’ve engaged you at virtually every turn.

            Rich, “don’t then claim that you know enough about statistics and uncertainty to claim that your emulator is good enough, even using valid equations of propagation of uncertainty, to say anything useful about the propagation of uncertainty in GCMs”

            I don’t claim my emulator is good enough, Rich. I demonstrated that it is good enough.

            I showed that it can accurately emulate the air temperature projections of any arbitrary CMIP3 or CMIP5 GCM. Sixty-eight examples, not counting the 19 in Figure 1.

            I don’t propagate the error in GCMs, Rich. Supposing so is to make Nick Stokes’ mistake.

            I propagate the error of GCMs; the error GCMs observably make. It’s the calibration error that exposes the resolution lower limit of GCMs as regards the tropospheric thermal energy flux. A resolution lower limit that is 114 times larger than the perturbation to be resolved.

            GCMs plain cannot resolve the impact, if any, of CO2 emissions. There’s just no doubt that the error analysis is correct.

            Rich, “Why does the JCGM talk of variance and standard deviation of error, if M-X is not fairly represented as a random variable?”

            Because the same statistical formalisms are used to estimate the uncertainty of non-normal error. JCGM says that over, and yet over again, whenever it discusses systematic error.

            Quoting P: P: Systematic error from uncontrolled variables is unknown in both sign and magnitude. Measurements that contain systematic errors can behave just like good data.

            Rich, “Correct, but beware the “uncontrolled”.”

            I acknowledge the uncontrolled. These are external variables that enter into the experiment or observation and modify the result. They are of unknown impact, and can be cryptic in that the experimenter or observer may not know of even their possible existence.

            That definitely describes the case for GCM cloud error.

            Rich, “I agree, but again beware the “inadequate”: the GCMs do have well-constrained observables for comparison, namely global temperature data.”

            Air temperature data are not well-constrained. It’s merely that the systematic measurement error is completely neglected as a standard of practice. This error arises from wind speed effects and solar irradiance, which impact the air temperature inside the sensor housing.

            Workers in the field, at UKMet, UEA Climate Research Unit, NASA GISS and Berkeley BEST completely ignore measurement error. Mention does not appear in their work. They uniformly make your assumption that all measurement error is random — an assumption made without warrant and in the face of calibration experiments that demonstrate its contradiction. Their carelessness makes a mockery of science.

            I’ve published on air temperature measurement error here (900 kB pdf), here (1 MB pdf) and here (abstract), and have finished the analysis for another paper that is going to seriously expose the field for their incompetence.

            The global air temperature record is not known to better than ±0.5 C during the entire 20th century, and the 21st outside of the US CRN. Prior to 1900, the uncertainty becomes very large.

            The entire field lives on false precision — like the rest of consensus climatology.

            GCMs that calibrate on the air temperature record incorporate that ±0.5 C as a systematic error affecting their parameterizations — parameters that have their own physical uncertainty bounds. GCMs all reproduce the global air temperature record despite varying by factors of 2-3 in their climate sensitivity. That alone should give you fair warning that their projections are not reliable.

            Even the TOA balance is not known to better than ±3.9 Wm⁻². GCMs do not have any well-constrained observables on which to rely. But the unconstrainedness is studiedly ignored in the field, leaving folks like you, Rich, subject to false confidence.

            Rich, “In any case, the crux of your paper isn’t about systematic error (= bias = mean error), it is about uncertainty (= standard deviation of error).”

            It’s about the impact of GCM calibration error on predictive reliability. GCM error in simulated cloud fraction is a systematic physical error.

            GCM systematic cloud error is the source of the long-wave cloud forcing error. That error is combined into a per-GCM global annual average uncertainty statistic, which shows that the thermal flux of the simulated atmosphere is wrong. That calibration statistic conditions every single air temperature projection.

            Rich, “It is no good using formulae from the JCGM and then saying you don’t believe in any of the mathematics underpinning it.”

            I’ve never suggested that I don’t believe the mathematics. I’ve pointed out that the mathematics is used with a greater range of error types than where the statistical assumptions generally obtain.

            Your apparent position is that wherever that mathematics appears, those assumptions apply. They don’t. Empirical science does not fit within statistical constraint. Scientists need an estimate of reliability. So, they’ve dragoooed statistics for use in places and ways that probably raise the neck hairs of statisticians.

            Rich, I’d be fascinated to know how many readers with science/maths degrees agree with me or with you, and how the type of degree might affect those statistics!”

            Multiple trained people have agreed with my work. Including, in this thread alone, Tim Gorman, David Dibbell, and Geoff Sherrington.

            Elsewhere physicist Nick, ferd berple, Paul Penrose and JRF in Pensacola (my understanding is they are engineers) in this thread, meteorologist Mark Maguire, and implicitly the authors of the papers listed here, — see especially S.J. Kline.

            And that doesn’t exhaust the number.

            You’ve just set them all aside, Rich.

          • Should be ‘dragooned.’
            and
            Rich, “I’d be fascinated to know how many readers with science/maths degrees agree with me or with you, and how the type of degree might affect those statistics!

            Regrets about the mistakes.

      • Rich,

        “This is getting wearisome. You went up in my estimation last October, Tim, but have rather declined since. The problem is that you seem to be denying some basic mathematics, so this may well be the last time I respond to you. I’m going to mark new comments with ‘RN’ below.”

        When you have to resort to ad hominems you have lost the argument.

        “No, +/-e (uniform) means there is an interval (X-e,X+e) in which M must lie.”

        Why? Why isn’t the interval X+2e or X-2e? The fact that you are using the term “uniform” means you are assuming that the true value of the measurement *is* the mean.

        “False. Suppose I choose to throw a fair die once? Perhaps it has been thrown many times in the past to establish its fairness. Or perhaps never, but comes from a sample which has through trials been shown statistically fair.”

        A dice throw has no uncertainty. This is just proof that you are still confusing uncertainty as being a random variable.

        “True, X_i doesn’t have to be uniform, but science proceeds by making assumptions and testing them where possible. For simplicity, I have mostly been assuming uniformity, as indeed have you in many of your comments.”

        But you are assuming that uncertainty is a random variable. It isn’t. That isn’t simplicity. That is ignoring the actual characteristics of each. And I only assumed an interval of uncertainty resembles the variance of a random variable in order to try and understand how to add the uncertainty intervals of two independent measurements.

        “Yes, uncertainty is not error. It is the distribution, or often just the dispersion of, or often just the standard deviation of the dispersion of, the random variable which represents the error M-X before measurement”

        And now we are back to calling uncertainty a random variable with a probability function. It isn’t.

        “I know it approximates the true value because you, or someone else, swore blind that the thermometer reads accurately to within a certain error, some of whose parameters have been determined. If your thermometer is really only accurate to within 2 degrees, don’t tell me it’s accurate to within half a degree. And the uncertainty “interval” exists to reflect exactly those statements about its error.”

        Error is not uncertainty. You are still confusing those two things as well. Uncertainty has multiple components. For the liquid in glass thermometers used in the 19th century and most of the 20th century the ability to read the thermometer meant knowing how to read a convex vs concave meniscus. Do *you* know how to do that? Don’t cheat and go look it up. Parallax is also a problem. A short person many times read the thermometer differently than a tall person. These are just two contributors to uncertainty.

        “No, the theory of the distribution of the sum, or of a mean, of n random variables does not depend on them being multiple measurements of the same thing.”

        The distribution of the mean is all about taking several sets of random samples from a population and calculating the mean of each sample. You then average the means of the samples to get a more accurate mean for the population. First, the distribution of the sample means tells you nothing about the distribution of the overall population. Even a heavily skewed population will have its distribution of means tend toward a normal distribution. Second, again, with one measurement you have no population from which to draw samples. All you have is a nominal value and an uncertainty interval. The uncertainty interval does not represent a probability function for values in the interval.

        “Perhaps it will, but if we are to use the instrument to good effect we need an estimate, or perhaps worst case, of its uncertainty when we use it.”

        But that uncertainty is not a probability function.

        ““Oh, we can’t measure global warming, we just believe in it”.”

        That is *NOT* what I or most on here are saying. We *are* saying that the uncertainty associated with the global annual temperature average is so large that trying to say that Year X1 is 0.01deg hotter than Year X2 is just a joke. The uncertainty interval just overwhelms the comparison. The proper assertion would be “we don’t know if this year was hotter than last year”.

        The climate alarmists, and you apparently, want to ignore that there is any uncertainty in your results. The climate alarmists, and you apparently, keep trying to say that you can calculate the mean more and more accurately to any required number of significant digits with no uncertainty.

        “So uncertainty of a mean is not the same as the uncertainty of a sum.”

        But, again, the uncertainty of a mean tells you nothing about the population! And since uncertainty is not a probability function then using the distribution of the mean to calculate a more accurate mean is itself a meaningless exercise.

        “RN: Again, you use the word “combine”, but the combining function is important. Mean and sum are not the same function, because mean scales down by the sample size. It’s elementary mathematics.”

        No, the mean does not scale down for uncertainty. Uncertainty is not a probability function. There are no samples from the same population you can use to scale down the mean.

        “RN: True, better things to do with my time I’m afraid.”

        In other words you *know* that you have no answer. I didn’t think you would!

        “RN: In the APP section, the type of measurand was not specified. Temperature may not be a good example for reasons you cite. That’s why I suggested an old style wobbly speedometer converted to digital output, but I expect there are better examples.”

        But the issue at hand *IS* the temperature! Specifically the uncertainty associated with the global annual average temperature!

  29. Just to say, I’m working my way through Rich’s essay and am now down to Section D “Emulator Parameters.”

    Thus far, the head-post analysis has not withstood examination.

  30. David Dibbell Feb 11 2:10 pm

    Dr. Booth – two points here. First, about what the “bias” is. From Figure 16 and its caption, it looks like this “bias” is the difference of annual means, gridpoint by gridpoint, subtracting the CERES values averaged over the period 2001-2015 from the CM4.0 values averaged over the period 1980-2014. (If I have misunderstood this, I welcome a correction.) Second, then, I’m not suggesting the RMSE 6 W/m^2 value from Figure 15 corresponds somehow to the +/- 4 W/m^2 value appearing in Pat Frank’s paper. They are quite different.

    I’ve looked at that paper. Figures 15, 18, 19, 20, 21 all mention CM4.0 for 1980-2014. But Figure 16 doesn’t. I would hope that there they extracted the relevant CM4.0 data to match the CERES years. For the latter you say “2001-2015”, but in fact that dataset runs to March 2017. Possibly they intersected to use 2001-2014 for both CERES and CM4.0. It doesn’t really matter except in respect of the credence to attach to Figure 16.

    You say that you don’t think that their 6W/m^2 corresponds in any way to Pat Frank’s 4W/m^2. Why? The Figure 16 unit is “Annual mean outgoing longwave radiation”, but as an anomaly wrt CERES. Pat Frank’s unit is described in “The CMIP5 models were reported to produce an annual average LWCF RMSE = ± 4 Wm–2 year^–1 model^–1, relative to the observational cloud standard (Lauer and Hamilton, 2013). This calibration error represents the average annual uncertainty within any CMIP5 simulated tropospheric thermal energy flux and is generally representative of all CMIP5 models.” So for a single model, both seem to be in W/m^2/y uncertainty.

    Have I gone wrong somewhere?

    Rich.

    • Oops, I got confused between Figure 15 and 16 there. I was looking at Figure 16(c) there, which I think is a relevant source of 6W/m^2, but you were looking at Figure 15.

      Figure 16(c) is also rather interesting, because it gives a bias, -1.83 W/m^2. This means the standard deviation is sqrt(6.02^2-1.83^2) = 5.7 W/m^2. So a previous thought I had, that LCF error might be mostly bias and not standard deviation ~ uncertainty, looks to be wrong. However, that is all averaged over grid cells, and it is clear from the colours on the figures that there are strong geographical components.
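      As a quick check of that arithmetic, here is a minimal Python sketch. It assumes the usual decomposition RMSE^2 = bias^2 + (standard deviation)^2; the variable names are illustrative, and the two input figures are simply those quoted above from Figure 16(c).

```python
import math

rmse = 6.02   # W/m^2, RMSE quoted above from Figure 16(c)
bias = -1.83  # W/m^2, mean bias quoted above from Figure 16(c)

# RMSE^2 = bias^2 + sd^2, so the standard deviation is what remains
# once the squared bias has been removed from the squared RMSE.
sd = math.sqrt(rmse**2 - bias**2)

print(f"standard deviation ~ {sd:.1f} W/m^2")
```

      The sign of the bias does not matter here, since only its square enters the decomposition.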

      I’d like to come back to the timescale 1/40000 years, which approximates GCM “ticks”, and my calculation that in that timescale LCF error might be expected to be +/-0.02W/m^2. And I’d like to ask Nick Stokes again what effect such an error would have when propagating through ticks. He has previously said that auto-corrections are made on the basis of conservation of energy, but not on the radiative equations, of which LCF is presumably a part. Do GCM outputs after a year resemble a +/-4 W/m^2 difference (I don’t think so)?
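      The +/-0.02 W/m^2 figure can be reproduced as follows. This is only a sketch under an assumption: that the per-tick value comes from running the square-root law backwards, i.e. treating the annual +/-4 W/m^2 as the root-sum-square of 40000 independent, equal per-tick uncertainties. The variable names are illustrative.

```python
import math

annual_u = 4.0          # W/m^2, annual uncertainty under discussion
ticks_per_year = 40000  # approximate number of GCM "ticks" per year

# Assumption: independent, equal per-tick uncertainties combine in quadrature,
# so annual_u = per_tick_u * sqrt(ticks_per_year).
per_tick_u = annual_u / math.sqrt(ticks_per_year)

# Recombining the ticks in quadrature recovers the annual figure.
recombined = per_tick_u * math.sqrt(ticks_per_year)

print(f"per-tick: +/-{per_tick_u:.2f} W/m^2, recombined: +/-{recombined:.1f} W/m^2")
```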

      Please can Nick, or some other GCM expert, comment on this?

      Rich.

      • Rich, “Do GCM outputs after a year resemble a +/-4 W/m^2 difference (I don’t think so)?”

        You still don’t get it. Incredible.

        Maybe that’s why you went into statistics rather than physical science, Rich. The scientific method seems far beyond your grasp.

        Nick Stokes, by the way, will only encourage you in your misguidedness. Doing so is in his interest.

        • Interesting, Pat. In the Tim Gorman school of discourse, you have just lost the argument by making an ad hominem comment. But fortunately I don’t belong to that school.

          Your paper does many fine things, but in my opinion it only scratches the surface of what the GCMs can, or cannot, tell us. I am trying to delve deeper into how error actually propagates within them, not just how it might appear to from those scratches on the surface.

          So far we have usefully discovered from Nick that some automatic error correction happens through application of conservation of energy, but apparently that doesn’t apply to the basic quantities of radiation floating around, and that is where I want to learn more. I have a nascent idea which I may share later.

          Rich.

          • That wasn’t an ad hominem argument, Rich. It was an observation based on many-times repeated experience.

            Repeated experience would show you that I could never be an Olympic runner. It would not be an ad hominem for you to tell me so.

            Error in a prediction is not corrected by adjustment out to falsely reproduce a calibration observable. The underlying physics remains incorrect.

            Nick has evidenced no understanding of how to judge predictive error and uncertainty. Nor has Ken Rice (Mr. ATTP), nor any climate modeler of my experience.

            Rich, you’d never let anyone get away with subtracting an error from a result obtained using defective statistics, and who then claims the method is predictively useful.

            Why on earth would you let climate modeling get away with the same shenanigan?

      • Rich,
        About “ticks”, please see this paper on the GFDL AM4.0/LM4.0 components of CM4.0. This gives the time steps, which are not the same for all aspects of the simulation. Search for “step”.

        https://agupubs.onlinelibrary.wiley.com/doi/full/10.1002/2017MS001209

        At Table 1 and the related text, there is an interesting explanation of the choice of time step for shortwave radiation.

        About the treatment of conservation of energy, see the Supplemental Information, for which a link to a pdf is provided at the end of the paper. I have pasted the first section below.

        ***************************
        S1 Treatment of energy conservation in dynamical core

        The dissipation of kinetic energy in this model, besides the part due to explicit vertical diffusion, occurs implicitly as a consequence of the advection algorithm. As a result, the dissipative heating balancing this loss of kinetic energy cannot easily be computed locally, and is, instead returned to the flow by a spatially uniform tropospheric heating. This dissipative heating associated with the advection in the dynamical core in AM4.0 is ~ 2 Wm−2.

        There is also another energy conservation inconsistency in that the energy conserved by the dynamical core involves a potential energy computed with the virtual temperature, while the model column physics uses temperature without the virtual effect, assuming that the conservation of internal plus potential energy, vertically integrated, reduces to the conservation of vertically integrated enthalpy, cpT. This discrepancy averages to 0.4 Wm−2. We adjust the dissipative heating correction in the dynamical core to account for this discrepancy. As a result, there is good consistency, within 0.1 Wm−2, between energy fluxes at the TOA and at the surface in equilibrium, with the net downward heat surface flux defined as Rsfc − LvE − S − LfPsnow. Here Rsfc is net downward LW + SW radiative flux, E surface evaporation of vapor, S upward sensible heat flux, Psnow surface precipitation flux of frozen water, Lv and Lf are the latent heat of vaporization and fusion respectively. A remaining problem is that these latent heats are assumed to be independent of temperature. Removing the latter inaccuracy in the most appropriate fashion would involve multiple changes to the code and was postponed to another development cycle.
        **************************

        DD

    • Rich,
      Your question to me is: “You say that you don’t think that their 6W/m^2 corresponds in any way to Pat Frank’s 4W/m^2. Why?”
      I actually said, “I’m not suggesting the RMSE 6 W/m^2 value from Figure 15 corresponds somehow to the +/- 4 W/m^2 value appearing in Pat Frank’s paper. They are quite different.”
      But in any case, the answer to “why?” is apparent by considering this excerpt from Pat Frank’s opening paragraph in his paper: “A directly relevant GCM calibration metric is the annual average +/- 12.1% error in global annual average cloud fraction produced within CMIP5 climate models. This error is strongly pair-wise correlated across models, implying a source in deficient theory. The resulting long-wave cloud forcing (LWCF) error introduces an annual average +/- 4 W/m^2 [dd format edit] uncertainty into the simulated tropospheric thermal energy flux.” So it is a global annual average that is being characterized.
      On the other hand, the RMSE 6 W/m^2 for outgoing longwave we’ve been referring to, for GFDL’s CM4.0, from both Figure 15 and Figure 16(c) in the referenced article, characterizes the gridpoint-by-gridpoint bias as defined in the caption to Figure 16. It is not a characterization of differences between model global annual average values and reference global annual average values. But nevertheless it is similarly revealing, as I see it, that the GCMs are simply not capable of resolving outgoing longwave emissions, or other related fluxes and conditions, closely enough to measured values to support projections of a temperature response to greenhouse gas forcing in a stepwise simulation.
      I hope this helps by answering your question. I do see your later reply.
      DD

      • Dibbell Feb 16 4:50pm

        David, thanks for the clarification. I am certainly interested in the standard deviation = uncertainty in the annual average LWCF. However, when Googling I came across Calisto et al (2014) “Cloud radiative forcing intercomparison…” at https://www.ann-geophys.net/32/793/2014/angeo-32-793-2014.pdf and was intrigued by their Table 1. For model HadCM3 for example, the mean LWCF over 10 years was 21.2 W/m^2, but the biggest anomalies in the 10 yearly data points were apparently +0.68 and -0.64 W/m^2. That isn’t anything like +/-4 W/m^2, and even less like that value multiplied by 3 = sqrt(9) for accumulation of error over those 10 years.

        Am I comparing apples to oranges again? I’ll have to look again at the paper Pat Frank cited for that.

        Rich.

        • See – owe to Rich February 19, 2020 at 8:00 am

          Rich, in the paper you linked, take a look at figure 1 lower left panel, and figure 7, the two lower panels for sea and land. From these charts the CLT (cloud amounts) from the CMIP5 models differ from the CERES values notably. It is from such differences that the theory deficiency becomes most apparent, and from which the +/- 4 W/m^2 uncertainty follows, as I understand it in Pat Frank’s paper. The fact that the Table 1 model outputs (again, in the paper you linked) look “better” than that implies compensating errors in the models.
          DD

        • Thanks, David. Actually, I’m not sure the Table 1 values do look better: the CERES LWC[R]F has only 2 (almost 1) models outlying it, so a strong bias is noticeable there. Your comment prompted me to look again.

          So in addition to uncertainty from random errors, there is a sizeable systematic error/bias. From my comments on uncertainty versus bias further below, this means that if this bias was not corrected the models would drift even faster away from reality, linear in time, than the square root in time for uncertainty. Since this doesn’t occur, that bias must be partially cancelled by something else.

          Nick Stokes has said that auto-correction occurs for “conservation of energy”, but did not admit to any auto-correction in the radiative forcings themselves. But perhaps there is some, to explain the results. Someone, somewhere, must have data on this.

          Thanks very much for the alert on this.

          Rich.

          • See – owe to Rich February 20, 2020 at 12:43 pm

            Rich,
            About your point, “Since this doesn’t occur, that bias must be partially cancelled by something else.” In the paper you linked, search the term “compensating error” concerning cloud-related radiative forcing. Apparent stability is evidently achieved by tuning the model parameters. Here is a quote from Lauer and Hamilton 2013, “The problem of compensating biases in simulated cloud properties is not new and has been reported in previous studies.”

            About your interest in how the conservation of energy is addressed, please see my comment and link elsewhere on this page about GFDL AM4.0/LM4.0, which you may not have explored yet. David Dibbell February 18, 2020 at 10:31 am

            DD

          • David, thanks again. There are 7 matches to ‘compens’ in the paper, each of them interesting. Yes, during model calibration there is no doubt an objective function to be minimized (roughly RMS of model-reality I would guess) which will force cancellation of biases. Willis Eschenbach is no doubt appalled at the number of free parameters available to do that, but happy that they are at least based on credible physics. Oh, Willis, sorry if I am putting words in your mouth, I know you don’t like that…

            This reduction of bias will have its own standard uncertainty associated with it, but I don’t think that Pat Frank has approached it that way; perhaps he will revisit his derivation of +/-4 W/m^2.

  31. sky:

    “This contention flies in the face of the fact that the variance of time-series of temperature in any homogeneous climate area DECREASES as more COHERENT time-series are averaged together.”

    The measurements are not time coherent. What makes you think they are? They are maximum temperatures and maximum temperatures can occur at various times even in an homogeneous climate area. They are minimum temperatures and minimum temperatures can occur at various times even in an homogeneous climate area. Even stations only a mile apart can have different cloud coverage and different wind conditions, both of which can affect their readings.

    “Each measurement–when considered as a deviation from its own station mean”

    How do you get a station mean? That would require multiple measurements of the same thing and no weather data collection station that I know of does that. If a measurement device has an uncertainty interval associated with it no amount of calculating a daily mean from multiple measurements can decrease that uncertainty interval.

    “is a SINGLE REALIZATION, but NOT a POPULATION of one, as erroneously claimed. ”

    Of course it is a population of one. And it has an uncertainty interval.

    “Sheer ignorance of this demonstrable empirical fact”

    The empirical fact is that stations take one measurement at a time, separated in time from each other. Each station measures a different thing, like two investigators of which one measures the height of a pygmy and the other the height of a Watusi. How do you combine each of those measurements into a useful mean? How does combining those two measurements decrease the overall uncertainty associated with the measurements?

    “along with the simplistic presumption that all measurements are stochastically independent, is what underlies the misbegotten random-walk conception of climatic uncertainty argued with Pavlovian persistence here.”

    What makes you think the temperature measurements are made at random? That *is* the definition of stochastic, a random process, specifically that of a random variable.

    They *are* independent. The temperature reading at my weather station is totally independent of the temperature reading at another weather station 5 miles away! They are simply not measuring the same thing and they are not the same measuring device. And they each have their own uncertainty intervals that are independent of each other.

    If you had actually been paying attention, uncertainty does *not* result in a random walk. Uncertainty is not a random variable that provides an equal number of values on each side of a mean. And it is that characteristic that causes a random walk. Sometimes you turn left and sometimes you turn right. An uncertainty interval doesn’t ever tell you which way to turn!

    (Rescued from spam bin) SUNMOD

    • Tim, 1sky1’s coherent time series remark probably refers to Hansen and Lebedeff (1987) Global Trends of Measured Surface Air Temperature JGR 92 (D11), 13,345-13,372.

      Figure 3 shows the correlation of air temperature, with r averaging ~0.5 at 1200 km for the northern hemisphere.

      Hansen and Lebedeff didn’t consider measurement error at all.

  32. Pat Feb 17 12:14pm

    I’m starting a new comment thread to reply to just two of the statements from that comment of yours.

    First: “Unknown bias contributes uncertainty to a measurement… The JCGM there specifically indicated that uncertainty arises from systematic error.”

    I believe you are wrong, wrong, wrong, but I think we need an independent expert arbitrator (where can we find one?). The main two parameters of a probability distribution are the mean and standard deviation. Do you agree? The mean is a measure of location, and the standard deviation is a measure of dispersion, or variation, about that mean. In the case of D = M-X I am using b for the mean and s for the standard deviation (which is also the standard deviation of M under the assumption that X is an unknown but fixed value).

    The JCGM 2.2.3 says that ‘uncertainty’ “characterizes the dispersion of the values that could reasonably be attributed to the measurand NOTE 1 The parameter may be, for example, a standard deviation (or a given multiple of it), or the half-width of an interval having a stated level of confidence.” So this is s; there is no room for a component involving the mean b there.

    NOTE 3 there says “It is understood that the result of the measurement is the best estimate of the value of the measurand, and that all components of uncertainty, including those arising from systematic effects, such as components associated with corrections and reference standards, contribute to the dispersion.” The contribution from systematic effects is not, in my insufficiently humble opinion, from the systematic error (= bias = b) itself, but from perfectly valid attempts to reduce the bias via calibration. But this point is quite subtle, and as I say, we probably won’t agree on it so a third party is needed. It makes sense to me mathematically, since the systematic error is a fixed unknown value, with no dispersion/variance, but perhaps it doesn’t make sense to you from a physicist’s point of view. It’s funny, isn’t it, how difficult it is to agree on what “mean” means, or how difficult it is to interpret possibly contradictory passages from various bibles?

    Second: “Under 4.3.7, “In other cases, it may be possible to estimate only bounds (upper and lower limits) for Xi, in particular, to state that “the probability that the value of Xi lies within the interval a− to a+ for all practical purposes is equal to one and the probability that Xi lies outside this interval is essentially zero”. If there is no specific knowledge about the possible values of Xi within the interval, one can only assume that it is equally probable for Xi to lie anywhere within it…”

    It is unfortunate that the JCGM gives any weight to that case, because it is very unlikely and usually suggests that information has been wantonly thrown away, and leads to bizarre effects as in my essay. And wow, I have just noticed the following in 4.3.9: “Such step function discontinuities in a probability distribution are often unphysical. In many cases, it is more realistic to expect that values near the bounds are less likely than those near the midpoint. It is then reasonable to replace the symmetric rectangular distribution with a symmetric trapezoidal distribution having equal sloping sides (an isosceles trapezoid)”. This is amazing, because I introduced a trapezoidal distribution in my essay with no prior knowledge of this 4.3.9, which I’ve only just seen. And I did that because the cliff edge of a uniform leads, as said, to bizarre effects. So please don’t rely on 4.3.7 to be a good corner of the uncertainty world in which to reside. Repeat: “often unphysical”; does that worry a physicist?
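    To make the rectangular versus trapezoidal comparison concrete, here is a minimal Python sketch of the two standard uncertainties. It assumes the formulae as I recall them from the Guide (4.3.7 and 4.3.9): a/sqrt(3) for a rectangular distribution of half-width a, and a*sqrt((1+beta^2)/6) for an isosceles trapezoid whose base has half-width a and whose top has half-width beta*a. The function names are illustrative only.

```python
import math

def u_rectangular(a):
    """Standard uncertainty of a rectangular distribution of half-width a."""
    return a / math.sqrt(3)

def u_trapezoidal(a, beta):
    """Standard uncertainty of an isosceles trapezoid: base half-width a,
    top half-width beta * a, with 0 <= beta <= 1 (assumed parametrisation)."""
    return a * math.sqrt((1 + beta**2) / 6)

a = 0.5  # illustrative half-width, in the units of the measurement
for beta in (1.0, 0.5, 0.0):
    print(f"beta = {beta}: u = {u_trapezoidal(a, beta):.4f}")
print(f"rectangular: u = {u_rectangular(a):.4f}")
```

    With beta = 1 the trapezoid reduces to the rectangle, and with beta = 0 to a triangle, so sloping the sides always gives a smaller standard uncertainty than the rectangle for the same bounds.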

    • Rich, “…but I think we need an independent expert arbitrator”

      Why do we need an arbiter, Rich? The JCGM is right there in front of our eyes.

      Here’s what you insisted is wrong, wrong, wrong: “The JCGM there specifically indicated that uncertainty arises from systematic error.”

      Here’s what the JCGM says in the note under 3.3.3: “In some publications, uncertainty components are categorized as “random” and “systematic” and are associated with errors arising from random effects and known systematic effects, respectively.”

      And, “B.2.22 systematic error: mean that would result from an infinite number of measurements of the same measurand carried out under repeatability conditions minus a true value of the measurand

      NOTE 1 Systematic error is equal to error minus random error.

      NOTE 2 Like true value, systematic error and its causes cannot be completely known.

      NOTE 3 For a measuring instrument, see “bias” (VIM:1993, definition 5.25). [VIM:1993, definition 3.14]

      Guide Comment: The error of the result of a measurement (see B.2.19) may often be considered as arising from a number of random and systematic effects that contribute individual components of error to the error of the result. Also see the Guide Comment to B.2.19 and to B.2.3.”

      Note under B2.23 and B2.24: “Since the systematic error cannot be known perfectly, the compensation cannot be complete.”

      In GCM projections of future climate, the systematic cloud fraction error cannot be corrected at all.

      Under E 3.6 c) “it is unnecessary to classify components as “random” or “systematic” (or in any other manner) when evaluating uncertainty because all components of uncertainty are treated in the same way.

      Benefit c) is highly advantageous because such categorization is frequently a source of confusion; an uncertainty component is not either “random” or “systematic”. Its nature is conditioned by the use made of the corresponding quantity, or more formally, by the context in which the quantity appears in the mathematical model that describes the measurement. Thus, when its corresponding quantity is used in a different context, a “random” component may become a “systematic” component, and vice versa.”

      It’s right there in front of you, Rich. Known systematic errors can be partially compensated. Unknown systematic errors cannot be compensated at all.

      Partial or uncompensated, systematic error contributes to uncertainty in a result.

      When a deficient theory is used in a step-wise series of calculations, the systematic errors are unknown, the biases are not known to be constant, and model calibration error must be propagated into the result.

      No arbitrator needed. Or wanted. No arguments from authority. I’ll think for myself, thanks.

      • Pat, a few hours ago I posted a theorem which apparently disproves your interpretation of the JCGM. But it appears to have got lost in moderation, which will suit you!

        Rich.

        • Rich, “But it appears to have got lost in moderation, which will suit you!”

          Implicit ad hominem, Rich.

          You implied I’m opportunistically unfair. A remarkably unfair view itself considering how thoroughly I’ve engaged your arguments.

          You’ve lost the argument by your own interpretation of Tim Gorman’s standard.

          I’ll reply to the rest later. Consider this for context, though: all of mathematics including statistics are complicated ways of saying a = a.

          • Pat, any such implication was not intended. I know from my end that engaging with arguments costs time and effort. So I imagined that if you had one fewer argument to address, then that would suit you. As for “Tim Gorman’s standard”, in my comment which you link I did say “But fortunately I don’t belong to that school”. In other words, though it is better to avoid ad hominem remarks, and you claimed you avoided it in your own remarks about me, the argument is still the argument and has to rest on its own merits.

            I hope that clears things up a bit, and (ad hominem) I retain a good deal of respect for you. Whilst retaining some healthy disagreement…such as on the “a = a” remark.

            Rich.

    • Pat Feb 18 4:21pm

      “Why do we need an arbiter?” You’re right that we don’t, but for the wrong reason. And the reason is that I have a proof that your interpretation of the JCGM is wrong.

      Theorem: Assume that prior to a measurement, the error which will occur may be considered to be a random variable with mean b and variance s^2. Let v = s^2 and let g(v,b) be a differentiable function which defines the “uncertainty” of the measurement. Assume that the correct formula for the uncertainty of a sum of n independent measurements with respective means b_i and variances v_i is that given by JCGM 5.1.2 with unit differential: g(v,b) = sqrt(sum_i g(v_i,b_i)^2) where v = sum_i v_i, b = sum_i b_i. Then if g(v,b) is consistent under rescaling, i.e. multiplying all measurements by k then multiplies g by k, g(v,b) is independent of b.

      Proof: Let n = 2. (Shorten v_1 to v1 etc.) Then v = v1+v2, b = b1+b2,

      g(v,b)^2 = g(v1,b1)^2 + g(v2,b2)^2

      We now differentiate this with respect to v1, using rules of the form (d/dv1)(h(x)) = (d/dx)(h(x)) dx/dv1 with h(x) replaced by g(x,y)^2, x replaced by v=v1+v2, and y replaced by b=b1+b2. Since d(v1+v2)/dv1 =dv1/dv1=1, and dv2/dv1 = 0, overall we get

      2g(v1+v2,b1+b2) dg(v1+v2,b1+b2)/dv = 2g(v1,b1)dg(v1,b1)/dv

      Because this is true for any values of the arguments, the function g(v,b)dg(v,b)/dv is a constant, say k_v, where the v is a label. Then integrating with respect to v,

      int k_v dv = l(b)+vk_v = int g(v,b)dg(v,b) = g(v,b)^2/2

      Note that because we integrated over v, the constant of integration is independent of v but might depend on b, hence the term l(b).

      In the same way from differentiating with respect to b1, we deduce that

      m(v)+bk_b = g(v,b)^2/2

      The only consistent solution is g(v,b) = sqrt(2vk_v + 2bk_b). We may arbitrarily choose the scaling constant k_v to be ½ (dimensionless). Let the units of the measurement be called the ‘tinu’. Then v has dimension tinu^2, and if k_b=0 then g(v,b) has dimension tinu. But because b has dimension tinu, there is a problem if k_b is nonzero. To preserve dimension k_b must have dimension tinu. Suppose we choose k_b = i/2 where i is 1 tinu. Then

      g(v,b) = sqrt(v+ib)

      Let g’ = g(v,b) for a particular v and b. Now change scale to milli-tinus. g should become 1000g’ in these units. But v is a million times as big whilst b is a thousand times as big in these units, so the right hand side is now sqrt(10^6 v + 10^3 ib) which is not 10^3 g’. Therefore g(v,b) is not invariant under scale.

      QED
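      The rescaling step at the end of the proof can be checked numerically. The sketch below follows the proof's own convention of keeping the constant i numerically equal to 1 in whichever unit is in use; the values of v and b and the function names are illustrative assumptions, not anything from the JCGM.

```python
import math

def g_with_bias(v, b, i=1.0):
    """Candidate uncertainty sqrt(v + i*b), with i fixed at 1 in the current unit."""
    return math.sqrt(v + i * b)

def g_variance_only(v):
    """Uncertainty as the square root of the variance alone."""
    return math.sqrt(v)

v, b = 4.0, 5.0  # illustrative variance (tinu^2) and bias (tinu)
k = 1000.0       # rescale from tinus to milli-tinus

# Under rescaling, the variance picks up k^2 and the bias picks up k.
ratio_with_bias = g_with_bias(k**2 * v, k * b) / g_with_bias(v, b)
ratio_variance_only = g_variance_only(k**2 * v) / g_variance_only(v)

print(f"with bias term:  g scales by {ratio_with_bias:.1f} (should be {k:.0f})")
print(f"variance only:   g scales by {ratio_variance_only:.1f}")
```

      Only the variance-only form scales by exactly k, which is the property the theorem requires.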

      I did not find that theorem in a book, but had the idea for it overnight. The “dimension” problem means that one might like to use sqrt(s^2+b^2) for the uncertainty, which is at least commensurate, but it fails the other mathematical requirements. The conclusion is that the JCGM uncertainty formula is what it says it is, which is the square root of a variance. A variance cannot include any terms from bias, which is the neat four letter word for “systematic error”. Only when a correction term is used to reduce the bias, and this term perforce is based on data which have an associated uncertainty (i.e. variance), does an uncertainty term associated with systematic error come into play.

      So, Pat, are you going to disprove my theorem, prove its assumptions are unjustified, or merely ignore it? More to the point, if you have data on random error and systematic error, how exactly are you going to combine them into an uncertainty value? What is your magic function?

      Rich.

      (Rescued from spam bin) SUNMOD

      • Rich Feb 19 6:25am

        Following up on my theorem, I shall examine some practical consequences. Suppose that some radiative forcing is in error each year by an average of 1 W/m^2, and that the error accumulates from one year to the next. And suppose that that is RMSE (Root Mean Squared Error) rather than standard deviation (numerous papers do quote RMSE). Now RMSE, which I’ll call r, equals sqrt(s^2+b^2), where s is standard deviation (uncertainty) and b is bias (systematic error). What are the consequences of different proportions of s^2 and b^2 going into r^2 = s^2+b^2?

        Suppose b = 0. Then s = 1, and after, say, 81 years, the square root law of propagation of uncertainty means that the uncertainty is +/-9 W/m^2.

        Now suppose that s and b are each 1/sqrt(2) = +/-0.707. Then after 81 years the uncertainty is +/-0.707*9 ~ +/-6.4 W/m^2. But the bias accumulates linearly. b is either 0.707 or -0.707, and the calibration should have told us which; let’s assume positive. Then after 81 years the systematic error will have accumulated to +0.707*81 = +57.3 W/m^2. That is huge!
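        The two cases above can be checked with a short sketch. It assumes, as the comment does, that the random part grows with the square root of the number of years while the bias adds linearly; the function name `accumulate` and the parameter `bias_fraction` (the share of r^2 assigned to b^2) are illustrative and not taken from any paper.

```python
import math

def accumulate(r, bias_fraction, n_years):
    """Split an RMSE r into s and b, then accumulate both over n_years.

    Assumption (illustrative): b^2 = bias_fraction * r^2 and
    s^2 = (1 - bias_fraction) * r^2, so that r^2 = s^2 + b^2.
    Returns (s, b, random uncertainty after n_years, accumulated bias).
    """
    b = r * math.sqrt(bias_fraction)
    s = r * math.sqrt(1.0 - bias_fraction)
    random_part = s * math.sqrt(n_years)  # square root law for independent errors
    bias_part = b * n_years               # a fixed bias adds linearly
    return s, b, random_part, bias_part

if __name__ == "__main__":
    for fraction in (0.0, 0.5):
        s, b, rand_unc, bias_sum = accumulate(1.0, fraction, 81)
        print(f"b^2 share={fraction}: s={s:.3f}, b={b:.3f}, "
              f"uncertainty=+/-{rand_unc:.2f}, accumulated bias={bias_sum:.2f}")
```

        With r = 1 W/m^2 and 81 years, a zero bias share gives the +/-9 W/m^2 case, and an equal split gives roughly +/-6.4 W/m^2 of uncertainty alongside about 57.3 W/m^2 of accumulated bias.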

        Before calibration, in the absence of external information, very little would be known about s and b. The process of calibration allows estimates of s and b to be determined, never perfectly, but often good enough to be useful. The difference between s and b, though, is that s cannot be corrected, in the sense that a single future measurement in a scenario where s and b are still valid, is still subject to the (standard) uncertainty +/-s. But b can be corrected, and if it is anywhere near as large as s, and many measurements are going to be meaningfully summed, then the above example shows that it is vital that this be done, since otherwise error grows way beyond the limits predicted by uncertainty calculations.

        If bias is corrected by subtracting the mean error z over n observations, then the variance of z is s^2/n and so the extra uncertainty induced by the correction is +/-s/sqrt(n). My interpretation of the JCGM where it talks about uncertainties associated with systematic error is exactly this – correction of bias, not bias itself.
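        As a rough illustration of the s/sqrt(n) claim, the sketch below simulates many calibration runs and looks at the spread of the estimated bias z. It assumes independent Gaussian errors with mean b and standard deviation s, which is an assumption made here for convenience and not something stated in the JCGM or in the comment; the function name `correction_spread` and its parameters are hypothetical.

```python
import random
import statistics

def correction_spread(s, b, n, trials=5000, seed=0):
    """Simulate repeated calibrations of n observations each.

    Assumption (illustrative): each error is an independent Gaussian draw
    with mean b (the bias) and standard deviation s.
    Returns (mean of the bias estimates z, standard deviation of z).
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        errors = [rng.gauss(b, s) for _ in range(n)]
        estimates.append(statistics.fmean(errors))  # z: mean error of one calibration
    return statistics.fmean(estimates), statistics.stdev(estimates)

if __name__ == "__main__":
    s, b, n = 1.0, 0.707, 25
    mean_z, sd_z = correction_spread(s, b, n)
    print(f"mean of z = {mean_z:.3f} (true bias {b})")
    print(f"sd of z   = {sd_z:.3f} (s/sqrt(n) = {s / n ** 0.5:.3f})")
```

        Under these assumptions the spread of z should sit close to s/sqrt(n), which is the extra uncertainty the correction introduces; the bias b itself moves the centre of z but not its spread.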

        I am, as usual, open to other rational explanations and formulae, provided that they stand up to mathematical scrutiny. I believe that the JCGM is based on sound mathematics, insofar as mathematics can apply at all to measurement, but that some of its statements could be tightened up to reduce confusion among practitioners.

        Rich.

    • Rich, “since the systematic error is a fixed unknown value”

      Not when it is caused by uncontrolled variables. Variables, Rich, as in changing over time.

      Varying environmental impacts do produce a dispersion of error, which has an empirical standard deviation even though not normally distributed.

      Rich, “because it is very unlikely …”

      No it’s not. How are you able to pronounce on the likelihood of cases?

      Rich, “and usually suggests that information has been wantonly thrown away,”

      And you know that, how? What is your experience carrying out physical experiments?

      In X-ray spectroscopy, the x-ray beam can heat the monochromator crystals locally, causing a slow unknown drift in energy. It’s known to happen, and one usually has to live with it. Samples change in the beam. There are small unknown errors in spectrum calibration and normalization. There is the resolution limit of the spectrometer itself. All of that combines into a rectangular uncertainty in the energy position of an observed absorption feature. That sort of thing is a common part of measurement.

      Trapezoidal uncertainties require unfounded assumptions about the error distribution. Unprofessional wishful thinking.

      I consider them to be dishonest when they are objectively unjustified.

  33. What becomes unmistakably clear here is the chronic inability to grasp the essential difference between the laboratory case of making a chain of independent measurements or estimates of a single FIXED quantity and the in situ case of measuring or modeling the auto-correlated and spatially coherent time-series of any particular geophysical VARIABLE. Only that can explain such totally fantastic claims as:

    The measurements are not time coherent…How do you get a station mean? That would require multiple measurements of the same thing and no weather data collection station that I know of does that…
    The empirical fact is that stations take one measurement at a time, separated in time from each other. Each stations measures a different thing…The temperature reading at my weather station is totally independent of the temperature reading at another weather station 5 miles away…etc. etc.

    In reality, the measurement uncertainties of physically sheltered thermometers, which have been investigated for more than a century by meteorologists, indeed produce sporadic episodes of bias, dependent upon wind speed and insolation. So does the practice of taking the mid-range daily reading (Tmax + Tmin)/2 as the daily “mean”. Nevertheless, these well-known shortcomings do not substantially affect the utility of century-long TIME-SERIES obtained at well-maintained met stations in studying the variations of the “climate signal.”

    The very fact that cross-spectral coherency is typically high (>0.75) over hundreds of kilometers at the important multidecadal frequencies shows that, contrary to Frank’s claim, the signal-to-noise ratio is more than adequate for the intended purpose. His notion that the “uncertainty” compounds with every annual time-step, like a random-walk of independent increments, is simply contradicted in every corner of the globe where quality station data are available. It’s on that basis, not on H&L’s superficial zero-lag correlation analysis, that I comment here. It takes, however, a modicum of knowledge of stochastic processes in the geophysical setting to comprehend that.

    • 1sky1, “The very fact that cross-spectral coherency is typically high (>0.75) over hundreds of kilometers at the important multidecadal frequencies shows that, contrary to Frank’s claim, the signal-to-noise ratio is more than adequate for the intended purpose.”

      As mentioned above, 1sky1, you’re in for a surprise on this claim.

      “His notion that the ‘uncertainty’ compounds with every annual time-step, like a random-walk of independent increments, is simply contradicted in every corner of the globe where quality station data are available.”

      I have made no claims about measurement uncertainty compounding with any time-step. Whoever wrote that is improperly conflating my work on propagation of GCM calibration error with air temperature measurements, thus demonstrating a lack of understanding of both.

      “quality station data” That’s rich. There have been none, from a climatological perspective, until the CRN system came on line.

      “It’s on that basis, not on H&L’s superficial zero-lag correlation analysis, that I comment here. It takes, however, a modicum of knowledge of stochastic processes in the geophysical setting to comprehend that.”

      Use of inflammatory language — superficial — does not make an argument. It makes a polemic.

      H&L computed a GCM annual average long wave cloud forcing calibration error. Global cloud fraction error is inherent in the model and LWCF error is therefore present in every single time step. Its propagation through GCM air temperature projections follows immediately.

      I’ve yet to encounter a climate modeler who knows the first thing about physical error analysis. That diagnosis includes you, 1sky1.

    • “I have made no claims about measurement uncertainty compounding with any time-step.”

      I was clearly addressing Gorman’s claim, which mimics Frank’s academic notion of modeling uncertainty. The only surprise here is the total lack of comprehension what high cross-spectral coherency between time-series implies vis a vis S/N ratio. It’s not polemics, but the incisiveness of that metric compared to the zero-lag “correlation of air temperature, with r averaging ~0.5 at 1200 km for the northern hemisphere,” as reported by H&L, that renders the latter quite superficial. And their paper was about “Measured Surface Air Temperature,” not model results.

      BTW, cross-spectral analysis between output and input also provides an analytically superior way of calibrating or determining the frequency response characteristics of instruments as well as of mathematical models. Alas, that is terra incognita to those who pitch their academic preconceptions about real-world physics without any serious empirical study.

  34. 1sky1, “The only surprise here is the total lack of comprehension what high cross-spectral coherency between time-series implies vis a vis S/N ratio.”

    There’s no lack of comprehension, 1sky1. I know exactly what you mean. I’d agree, too, if I didn’t know better. As mentioned above, you have a surprise in store regarding that claim.

    “It’s not polemics, but the incisiveness of that metric compared to the zero-lag ‘correlation of air temperature, with r averaging ~0.5 at 1200 km for the northern hemisphere,’ as reported by H&L, that renders the latter quite superficial.”

    As it has turned out, the coherence doesn’t make the error analysis superficial. I’ve done further analysis, 1sky1. It’s just as yet unpublished.

    “And their paper was about ‘Measured Surface Air Temperature,’ not model results.”

    Apologies for my mistake. I mistook H&L to mean Lauer & Hamilton, rather than your intended, Hubbard and Lin.

  35. 1sky1, “BTW, cross-spectral analysis between output and input also provides an analytically superior way of calibrating or determining the frequency response characteristics of instruments as well as of mathematical models.”

    BTW, cross comparison among model outputs reveals nothing about accuracy.

    Comparison of models with observations reveals nothing about accuracy either, given model tuning.

  36. While typing a final comment that required much cross-referencing, WUWT spontaneously took me off the page. Upon returning to it, I was dismayed to find the comment box empty. Sadly, this is not the first such experience here. Since I value my time, I’ll simply point to the latest non sequitur employed here to deflect substantive scientific criticism of Frank’s extravagant claims:

    “[C]ross comparison among model outputs reveals nothing about accuracy.”

    Such cross-comparisons were never even raised here in regard to model accuracy, which must necessarily be judged (pre-tuning or not) against the best available observations, using the most incisive analytic methods.
    His claim that “I have a surprise in store regarding that claim” remains an empty piece of rhetoric.

  37. I believe this is the last day on which comments are open here. So I’ll make some valedictory remarks. Since last August I have spent more time than I would have intended in trying to understand the theory behind Pat Frank’s paper, but for the most part it has been rewarding.

    It has been quite a challenge to stay abreast of the various comments on my essay and matters arising, and I thank many commenters for their help and insights. It is perhaps worth mentioning some of the successes and failures which I feel I have had here.

    I have failed to persuade Tim Gorman that the uncertainty of a mean is smaller than the uncertainty of the sum from which it was derived. You can lead a horse to water, but you can’t make it drink.

    I have failed to persuade Pat Frank that his view is wrong about the way the JCGM allows systematic error to enter into the formula for combining uncertainties. He has not yet replied to my comments of Feb 19 6:25am and Feb 20 2:12am, though, so it is possible that he is thinking hard about his position on that.

    I have failed to get any information on whether my parameter ‘a’ in Equation (1) may be greater than zero. The consequences are rather important for climate and model stability.

    I have succeeded in eliciting from Nick Stokes some details on GCM corrections for conservation of energy. I think this is mainly a mathematical computation issue, with modest errors being corrected, and not a big issue.

    I have succeeded in getting help from David Dibbell on how the greater systematic errors in components of the GCMs are effectively bound to get approximately cancelled during model calibration/fitting. Nevertheless it appears that GCMs with significantly different values of sensitivity to CO2 can be made to fit, which is why AR5 was not able to narrow the uncertainty range of sensitivity. Because I have criticized Pat Frank’s treatment of cloud uncertainty and its propagation, some readers have concluded that I do this to support the GCM modellers. Nothing could be further from the truth, and I remain extremely sceptical about them.

    I have succeeded in learning more, and I hope teaching more, about the science of uncertainty, but it has more nuances than I expected and I do not claim to comprehend all the problems which can arise in its practical use.

    To conclude, I continue to believe that emulators of GCMs are a good idea, but that in order to address GCMs’ uncertainty they need to reproduce the error regime of the GCMs as well as the mean, which is no doubt an extra challenge. Just as GCMs need to be compared with each other (especially how does a high sensitivity GCM approximately match a low sensitivity one), emulators need to be compared with each other and with GCMs.

    Best wishes to all, and I am going on a well-timed holiday to the Alps!

    Rich.
