Guest Post by Willis Eschenbach
I have long suspected a theoretical error in the way that some climate scientists estimate the uncertainty in anomaly data. I think that I’ve found clear evidence of the error in the Berkeley Earth Surface Temperature data. I say “I think”, because as always, there certainly may be something I’ve overlooked.
Figure 1 shows their graph of the Berkeley Earth data in question. The underlying data, including error estimates, can be downloaded from here.
Figure 1. Monthly temperature anomaly data graph from Berkeley Earth. It shows their results (black) and other datasets. ORIGINAL CAPTION: Land temperature with 1- and 10-year running averages. The shaded regions are the one- and two-standard deviation uncertainties calculated including both statistical and spatial sampling errors. Prior land results from the other groups are also plotted. The NASA GISS record had a land mask applied; the HadCRU curve is the simple land average, not the hemispheric-weighted one. SOURCE
So let me see if I can explain the error I suspected. I think that the error involved in taking the anomalies is not included in their reported total errors. Here’s how the process of calculating an anomaly works.
First, you take the actual readings, month by month. Then you take the average for each month. Here’s an example, using the temperatures in Anchorage, Alaska from 1950 to 1980.
Figure 2. Anchorage temperatures, along with monthly averages.
To calculate the anomalies, from each monthly data point you subtract that month’s average. These monthly averages, called the “climatology”, are shown in the top row of Figure 2. After the month’s averages are subtracted from the actual data, whatever is left over is the “anomaly”, the difference between the actual data and the monthly average. For example, in January 1951 (top left in Figure 2) the Anchorage temperature is minus 14.9 degrees. The average for the month of January is minus 10.2 degrees. Thus the anomaly for January 1951 is -4.7 degrees—that month is 4.7 degrees colder than the average January.
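To make the arithmetic concrete, here is a minimal sketch of that calculation in Python. The January 1951 figures are the ones quoted above; the other numbers are invented placeholders, not the actual Anchorage data.

```python
import numpy as np

# Worked example from the text: January 1951 at Anchorage.
jan_1951 = -14.9          # monthly mean temperature, deg C
jan_climatology = -10.2   # average of all the Januaries, deg C
print(jan_1951 - jan_climatology)   # -4.7, the January 1951 anomaly

# The same thing for a whole record: rows = years, columns = Jan, Feb, Mar, ...
# These numbers are placeholders, not the real Anchorage data.
temps = np.array([[-14.9, -12.1, -7.3],
                  [ -8.0, -10.5, -5.9],
                  [-11.2,  -9.4, -6.6]])

climatology = temps.mean(axis=0)   # one average per month (the top row of Figure 2)
anomalies = temps - climatology    # subtract each month's average from that month's data
print(np.round(anomalies, 1))
```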
What I have suspected for a while is that the error in the climatology itself is erroneously not taken into account when calculating the total error for a given month’s anomaly. Each of the numbers in the top row of Figure 2, the monthly averages that make up the climatology, has an associated error. That error has to be carried forward when you subtract the monthly averages from the observational data. The final result, the anomaly of minus 4.7 degrees, contains two distinct sources of error.
One is the error associated with that individual January 1951 value of -14.9°C. For example, the person taking the measurements may have consistently misread the thermometer, or the electronics might have drifted during that month.
The other source of error is the error in the monthly averages (the “climatology”) which are being subtracted from each value. Assuming the errors are independent, which of course may not be the case but is usually assumed, these two errors add “in quadrature”. This means that the final error is the square root of the sum of the squares of the errors.
One important corollary of this is that the final error estimate for a given month’s anomaly cannot be smaller than the error in the climatology for that month.
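In symbols, if σ_obs is the error in the month’s observation and σ_clim is the error in that month’s climatology, then the anomaly error is sqrt(σ_obs² + σ_clim²), which can never be less than σ_clim. A minimal sketch of that bookkeeping, with invented error values (not Berkeley Earth’s):

```python
import numpy as np

def anomaly_error(sigma_obs, sigma_clim):
    """Propagate independent errors through anomaly = observation - climatology."""
    return np.sqrt(sigma_obs**2 + sigma_clim**2)

# Invented error estimates (deg C), purely for illustration.
sigma_obs = 0.30    # error in the individual monthly observation
sigma_clim = 0.25   # error in that month's climatology

sigma_anom = anomaly_error(sigma_obs, sigma_clim)
print(round(sigma_anom, 3))        # about 0.39
print(sigma_anom >= sigma_clim)    # True: the anomaly error is never smaller
                                   # than the climatology error
```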
Now let me show you the Berkeley Earth results. To their credit, they have been very transparent and have reported various details. Among the details in the data cited above is their estimate of the total, all-inclusive error for each month. And fortunately, their reported results also include the following information for each month:
Figure 3. Berkeley Earth estimated monthly land temperatures, along with their associated errors.
Since those values are what get subtracted from each of the monthly temperatures to produce the anomalies, the total Berkeley Earth monthly errors can never be smaller than the errors associated with those values.
Here’s the problem. Figure 4 compares those monthly error values shown in Figure 3 to the actual reported total monthly errors for the 2012 monthly anomaly data from the dataset cited above:
Figure 4. Error associated with the monthly average (light and dark blue) compared to the 2012 reported total error. All data from the Berkeley Earth dataset linked above.
The light blue months are months where the reported error associated with the monthly average is larger than the reported 2012 monthly error … I don’t see how that’s possible.
Where I first suspected the error (but have never been able to show it) is in the ocean data. The reported accuracy is far too great given the number of available observations, as I showed here. I suspect that the reason is that they have not carried forward the error in the climatology, although that’s just a guess to try to explain the unbelievable reported errors in the ocean data.
Statistics gurus, what am I missing here? Has the Berkeley Earth analysis method somehow gotten around this roadblock? Am I misunderstanding their numbers? I’m self-taught in all this stuff and I’ve been wrong before, am I off the rails here? Always more to learn.
My best to all,
w.
The entire method of taking daily averages is faulty. The current method actually gives the midrange, the midpoint between the daily high and low, not the true mean or average. If 24 readings are taken, one every hour over a given day, then those 24 should be averaged rather than taking the midpoint of the extremes. Several other factors are not included: time of day, wind, and overcast all affect the temperature’s heating effect. Personally, I would take the entire year’s data set to come up with one average: 8,760 data points per station, given 24 readings per day.
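For what it’s worth, here is a small sketch of the difference the commenter is describing, using invented hourly readings: the conventional (Tmax + Tmin)/2 value versus the true mean of 24 hourly readings.

```python
import numpy as np

# Invented hourly temperatures (deg C) for one day: a long cool night and
# early morning, with a brief warm spike in mid-afternoon.
hourly_temps = np.array([
    2.0, 1.8, 1.6, 1.5, 1.4, 1.3, 1.5, 2.0,       # 00:00 - 07:00
    3.0, 4.5, 6.5, 9.0, 12.0, 14.5, 16.0, 15.0,   # 08:00 - 15:00
    12.0, 9.0, 7.0, 5.5, 4.5, 3.5, 2.8, 2.3])     # 16:00 - 23:00

true_mean = hourly_temps.mean()                           # average of all 24 readings
midrange = (hourly_temps.max() + hourly_temps.min()) / 2  # the conventional (Tmax+Tmin)/2

print(round(true_mean, 2), round(midrange, 2))
# Because this day spends far longer near its low than near its high,
# the midrange overstates the true daily mean.
```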
@Nick Stokes says:
August 18, 2013 at 5:56 am
It’s probably best to just think of a Monte Carlo
You need, however, to remember a basic problem with that … is climate deterministic? I think everyone agrees it is slightly chaotic, and that is probably the biggest source of error you are trying to work out how to handle.
Climate scientists actually need to have a discussion with engineers who are well versed in slightly chaotic signals and signal processing. Brad above actually gave you one possibility with the Unscented Kalman Filter (UKF … unscented because the problem is not linear), and I see a few in climate science have played with trying to adjust Lorenz (1996).
Until you people all realize you can’t solve these sorts of problems with statistics alone, you are going to keep banging your head against the wall.
Willis, you know I respect you and your tenacity, but do you see the problem? There is no way into this by statistics alone; you need to deal with the slight chaos, something statistics can’t do by itself. Willis, spend some time talking to someone at a local university about signal processing of chaotic signals. The other choice is someone involved in deep-space radio comms, flicker noise spectroscopy and the like.
LdB says:
August 18, 2013 at 8:47 am
“…Climate scientists actually need to do a little discussion with engineers who are well versed in slightly chaotic signals and signal processing….”
There is no point in applying any model to a system if the errors in your signal are too large compared to the output you want. Framing the discussion around the average of Arctic and Antarctic, or of northern and southern, rather than around Arctic plus Antarctic, northern plus southern, builds in too much simplification on top of too many measurement errors in both the sources and the results.
— Mats —
LdB, to further your discussion, and maybe not in the direction you intended (correct me if I am wrong), we filter out the very thing we should be studying. The noise. It is screaming the strength of natural variation as loud as it can but half the world is not listening. By studying it, we will be able to see that it is not random white noise. The signals of oceanic and atmospheric teleconnections and oscillations are abundantly clear.
I’m forced to say that there are so many problems in this field, you really need a time machine to get a meaningful answer.
However, I would really like to second AndyL’s comment at August 18, 2013 at 6:25 am. A detailed example would be really helpful here.
This entire historical “mean global temperature” reconstruction comes across as an exercise in GIGO data processing:
1/ Only a small part of the planet was instrumented back in, say, 1800. This means that massive interpolation, which is nothing more than guessing, must be done across large areas of the planet.
No amount of data processing can recreate physical measurements where none were originally made.
Such guesstimates are a type of systematic error/bias.
2/ How well known are the systematic errors/biases in the measuring devices used back in, say, 1800?
Their accuracy, precision, and calibration drift over time? There is no a priori reason to assume that such systematic errors of different instruments are normally distributed and average out. Rather, such systematic errors, assuming that they are independent, are added together in quadrature in standard error analysis and estimation.
http://www.ocean-sci.net/9/683/2013/os-9-683-2013.html
http://meetingorganizer.copernicus.org/EGU2009/EGU2009-5747-3.pdf
The weather station siting and other systematic bias issues investigated and documented by A. Watts et al.
3/ Given that the planet is an open dynamical system far from thermodynamic equilibrium, does measuring an intensive quantity such as temperature to calculate a spatial average have any physical meaning? Should an extensive quantity such as heat content not be used instead? (A small numerical illustration follows below.)
Given the above, the claim that the “mean global temperature” back in 1800 is known to within +/- 0.1C [Hansen et al] not only strains credulity, but tosses it in the blender.
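To illustrate the intensive/extensive point in 3/ above, here is a toy calculation with invented masses, heat capacities, and temperatures: averaging the temperatures of two bodies is not the same as computing the temperature implied by their combined heat content.

```python
# Two bodies with different masses and specific heat capacities (all values invented).
m1, c1, T1 = 2.0, 4186.0, 10.0    # water-like: mass (kg), c (J/kg/K), temperature (deg C)
m2, c2, T2 = 5.0, 1005.0, 30.0    # air-like

# Naive average of the intensive quantity (temperature):
T_avg = (T1 + T2) / 2
print("Simple temperature average:", T_avg)                    # 20.0

# Temperature implied by the combined (extensive) heat content:
T_heat = (m1*c1*T1 + m2*c2*T2) / (m1*c1 + m2*c2)
print("Heat-content-weighted temperature:", round(T_heat, 1))  # about 17.5
```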
Steven Mosher says:
August 17, 2013 at 4:21 pm
“But in another sense the error that you are talking about matters less, because it is the same number over all years. Put another way, often when you are talking about anomalies, you don’t care so much about the climatology (or the base years). You want to know whether a particular year was hotter than other years, or to calculate a trend. The error you are discussing won’t affect trends.”
Precisely.
Also, if people want to go back to absolute temperature with our series, they can. We solve the field for temperature. In our minds we would never take anomalies except to compare with other series. Or we’d take the anomaly over the whole period, not just thirty years, and then adjust it accordingly to line up with other series. So anomalies are really only important when you want to compare to other people (like Roy Spencer or Jim Hansen) who only publish anomalies. Anomalies are also useful if you are calculating trends in monthly data.
<<<<<<<<<<<<<<<<<<<<<<<<<
From my non-statistical mind, I feel like a bait and switch has happened here. Many low-resolution readings will lower the error bars if they are all trying to read the same thing, that is, if today actually had a single temperature. But it does not. The day has many temperatures, and we are trying to create a single one. Since each thermometer is measuring something different (literally), the trick of lowering the error bars does not seem to apply.
Steven Mosher says:
August 17, 2013 at 7:11 pm
I understand all of that, Steven, and thanks for the explanation. The part I don’t understand is: how can the error in your published climatology be GREATER than the error in your published anomaly? That’s the part that you and Nick haven’t explained (or if you have, I sure missed it).
Best regards,
w.
What do the retroactive changes imply about the errors? If the temperature at Teigarhorn in Jan 1900 can change from 0.7°C to -0.5°C, does that imply the errors on any station’s monthly average are at least ±1.2°C?
http://endisnighnot.blogspot.co.uk/2013/08/the-past-is-getting-colder.html
Seems to me that global temperature pretty much follows number of windmills: http://upload.wikimedia.org/wikipedia/commons/1/1a/Wind_generation-with_semilog_plot.png
/sarc
climatebeagle:
At August 18, 2013 at 10:29 am you ask
No, it means the errors are potentially infinite because the value of global temperature can be computed to be whatever you want it to be.
Please see my above post at August 18, 2013 at 7:32 am.
This link jumps to it
http://wattsupwiththat.com/2013/08/17/monthly-averages-anomalies-and-uncertainties/#comment-1393835
Richard
Nick Stokes says:
August 18, 2013 at 4:23 am
As I pointed out, the size of the error ranges from a third to three-quarters of the month-to-month variation.
The point of your whole post seems to be that the error is small. First, we don’t know that overall, and it’s certainly not small in the Anchorage data.
Second, the errors should all be carried forward whether they are big or small. You can’t just ignore the error in the climatology (or any other error) and leave it out.
And finally, the error in the climatology, whether it is big or small, establishes a lower limit for the size of the error in any given monthly anomaly. This is because
Anomaly = Observations – Climatology
The errors in observations and climatology add in quadrature in that calculation. So the final anomaly error cannot be smaller than the climatology error.
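For what it’s worth, a quick Monte Carlo sketch of that lower-bound claim, with invented error sizes (not Berkeley Earth’s): simulate many realizations of an observation and a climatology, each with its own random error, and look at the spread of the resulting anomalies.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

true_obs, true_clim = -14.9, -10.2      # the worked-example values from the post
sigma_obs, sigma_clim = 0.3, 0.25       # invented error sizes, deg C

obs = true_obs + rng.normal(0, sigma_obs, n)
clim = true_clim + rng.normal(0, sigma_clim, n)
anomalies = obs - clim

print(round(anomalies.std(), 3))                  # close to 0.39
print(round(np.hypot(sigma_obs, sigma_clim), 3))  # quadrature sum, also about 0.39
# The simulated anomaly spread matches the quadrature sum and is never
# smaller than sigma_clim alone.
```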
Best regards,
w.
Nick Stokes says:
August 18, 2013 at 4:23 am
I’ve shown in the Anchorage example that the error is up to 75% of the monthly variation in temperature. Your claim that the error is small relative to the variation is simply not true.
w.
I found this interesting abstract, paywalled of course …
Makes sense to me … it would be interesting to see the equivalent for the Berkeley Earth temperature field. Above my pay grade, unfortunately …
w.
Luboš, “The systematic error of the average of similar pieces of data doesn’t decrease with N but it is the same as it is for the individual entries – it can never be greater.”
That’s true only when the systematic error is constant. Systematic error is not necessarily constant when the problem is uncontrolled experimental variables. That is the case in air temperature measurements, for which the effects of insolation, wind speed, and albedo, among others, are all uncontrolled and variable. In these cases, systematic error can increase or decrease with N, but the direction is always unknown. So, one can only get an average of systematic error by carrying out a calibration experiment with a known standard, and recording the observed error. That error is then applied as an uncertainty in the measurements of the experimental unknown.
Gail Combs certainly knows this approach to experimental measurement error, and her discussion on this thread is exactly pertinent. Average systematic error is always an empirical quantity. The true magnitude of systematic error in a given experiment is unknown. Data containing systematic error can look and behave just like good-quality data. The contaminated data may correlate between laboratories, and may pass all manner of statistical tests. The only way to get a handle on it is to track down the sources and eliminate them (if possible). The typical way to deal with systematic error is by calibrating the instruments and carrying a known standard through the experiment, to make sure that any such error isn’t large enough to wreck the experiment.
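A small simulation of the point being made here, with invented numbers: the random part of the error in a mean shrinks like 1/sqrt(N), but a bias shared by all the measurements does not shrink at all.

```python
import numpy as np

rng = np.random.default_rng(1)

true_value = 15.0        # deg C, invented
shared_bias = 0.4        # systematic error common to every measurement, invented
sigma_random = 0.5       # independent random error per measurement, invented

for n in (10, 100, 10_000):
    means = []
    for _ in range(1000):
        readings = true_value + shared_bias + rng.normal(0, sigma_random, n)
        means.append(readings.mean())
    means = np.array(means)
    # Spread of the mean shrinks with N; the offset from the true value does not.
    print(n, round(means.std(), 3), round(means.mean() - true_value, 3))
```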
Willis Eschenbach:
If I’m following your question, you can have larger errors on an absolute scale than on a relative scale. If there is an overall offset error that is constant across measurements, using one (or a subset) of the measurements as a new baseline that you subtract all measurements from reduces this offset error.
There’s a bit of discussion in this long thread on Lucia’s blog that may be helpful.
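A tiny sketch of that offset point, with invented numbers: an offset shared by all measurements drops out when you subtract a baseline built from the same measurements, though an independent error in the baseline itself would not.

```python
import numpy as np

true_temps = np.array([10.0, 12.0, 9.5, 11.0, 13.5])   # invented "true" values, deg C
offset = 0.8                                            # constant instrument offset, invented

measured = true_temps + offset
baseline = measured[:3].mean()          # baseline built from a subset of the same record

anomalies = measured - baseline
true_anomalies = true_temps - true_temps[:3].mean()

print(np.allclose(anomalies, true_anomalies))   # True: the constant offset cancels
```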
Jeff, it’s there in your Reply Part 1, with your admission, finally, that, “I get that Pat hasn’t included weather noise in his final calculations for Table 1,2 and the figures…” That admission took the heart out of your critical position.
@Lubos and the all other wise contributors
“yes, 1/sqrt(N) is the behavior of the statistical error, e.g. a part of the error from individual errors that are independent for each measurement.
“Then there can be a systematic error that is “shared” by the quantities that are being averaged. The systematic error of the average of similar pieces of data doesn’t decrease with N but it is the same as it is for the individual entries – it can never be greater.”
+++++++++
Thanks. I almost entirely agree. There could be systematic errors in the way the averaging is done so I hold open that caveat. I think the problem with temperatures is the instruments have different precisions and accuracies but there are ways to account for that. That does not contradict your point at all, however.
My second worry on this relates to the kinda loose manner in which ‘accuracy’ and ‘precision’ are being used in peer-reviewed works. When one gets a lot more data points within a territory, extending the shooting-range metaphor I used above, it means getting a lot more shooters, each using a different rifle, to take shots at their respective targets. None of that increase makes the shooters more accurate or precise, and again, the precision of each rifle is not improved by having more of them.
My refinement of the perspective is this: imagine you did not see any of the shooting take place. You do not know where the bullseye was for any shooter. All you have is a blank target with a large number of holes shot into it. The task at hand is to work out where the bullseye really was, the actual average temperature. The calculations demonstrated do not decrease the size of the error bars, which is rooted in the precision of the rifles and the accuracy of the shooters.
What is increased is the precision of the location of that point where we can confidently place the center of the error bar. We can have more confidence that it lies at exactly a certain point, to several significant digits, but in no way does this certain knowledge reduce the vertical height of the error bars. For temperatures it is still ±0.5°C for most of the land record.
This point about confidence as opposed to knowledge is not being stated clearly. Increased precision in our knowledge of the position of the center of the error bar is being claimed as an increase in the precision of the calculated answer. They are two different things entirely. One is like the GPS coordinates of a car; the other is like the size of the car.
Here is another example in the form of a question:
If I measure the same mass piece with 1000 different calibrated scales, each having a resolution of 20 g, how precisely and how accurately can I state that single piece’s true mass?
If I measure 1000 different mass pieces, one mass piece weighed once on each scale, with what precision and accuracy can I claim to know the average mass of all the pieces?
These are conceptually different problems. The latter question is the one that applies to temperature measurements at different stations. The final answer cannot be better than the best instrument but might be worse than the worst instrument because of various systematic errors in the processing of data, or experimental errors in acquiring, transcribing or archiving it.
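Here is a minimal simulation of those two questions, with invented values; it treats the 20 g resolution loosely as a random per-reading error, so it is a sketch of the conceptual difference, not a full treatment of resolution or systematic error.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
sigma_scale = 0.010      # per-reading random error, kg (a loose stand-in for 20 g resolution)

# Question 1: the SAME mass piece weighed once on each of 1000 scales.
true_mass = 1.234                                        # kg, invented
q1_readings = true_mass + rng.normal(0, sigma_scale, n)
print("Q1 single-reading error (1 sigma):", sigma_scale)
print("Q1 error of the mean of 1000 readings:", round(abs(q1_readings.mean() - true_mass), 4))

# Question 2: 1000 DIFFERENT mass pieces, each weighed once.
true_masses = rng.uniform(0.5, 2.0, n)                   # invented population of pieces
q2_readings = true_masses + rng.normal(0, sigma_scale, n)
# The average of the readings estimates the population's average mass, but any
# bias shared by the scales (not simulated here) would pass straight through.
print("Q2 estimated average mass:", round(q2_readings.mean(), 4))
print("Q2 true average mass:     ", round(true_masses.mean(), 4))
```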
Regards
Crispin
Pamela Gray says:
August 18, 2013 at 9:16 am
LdB, to further your discussion, and maybe not in the direction you intended (correct me if I am wrong), we filter out the very thing we should be studying. The noise. It is screaming the strength of natural variation as loud as it can but half the world is not listening.
You need to think about how you are filtering it out … look at Nick’s example above; he is using a 30-year average. If there are 100-year, 200-year, 400-year, or 1000-year chaotic signals, they go straight through any filter you construct. There are very likely to be fluctuations on those timescales because of the earth’s size and thermodynamic inertia.
The problem is you can’t keep stretching the averaging period, because you start killing the very climate signal you are looking for. See the problem? The low-frequency chaotic noise is going to slide easily through any filter you construct for climate change, because it has components on the same sort of timespans. They may not be large, but they are probably going to be the largest error after filtering, and there is no easy way to separate them.
That’s why it becomes a digital filtering exercise; if you look at J.D. Annan et al.’s work with the oceans, that is sort of heading in this direction, but it looks like they have trouble with some of the models.
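A small sketch of that filtering point, using an invented series: a slow, century-scale oscillation passes almost untouched through a 30-year moving average, while the year-to-year noise is strongly suppressed.

```python
import numpy as np

years = np.arange(1800, 2014)
slow = 0.3 * np.sin(2 * np.pi * (years - 1800) / 200.0)        # invented 200-year slow swing
noise = np.random.default_rng(3).normal(0, 0.2, years.size)    # invented year-to-year noise
series = slow + noise

# 30-year moving average, the kind of smoothing used for climate baselines.
window = 30
smoothed = np.convolve(series, np.ones(window) / window, mode="valid")
slow_at_window_centres = slow[window // 2 : window // 2 + smoothed.size]

# Correlation near 1: the slow swing survives the filter almost intact,
# even though the short-period noise has been averaged away.
print(round(np.corrcoef(smoothed, slow_at_window_centres)[0, 1], 2))
```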
@Willis: Super-resolution microscopy uses techniques where the total error (in measuring the position of a light emitter) is smaller than the measurement error (the diffraction limit of light).
See: http://en.wikipedia.org/wiki/Super_resolution_microscopy
The PALM & STORM methods use the aggregate measurement and infer true position from assumptions about the point spread function and the probability of having more than one emitter in the field.
I’m not sure how that affects your argument, but these techniques have been validated on subcellular structures in many ways, and they are examples of what you were asking for.
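A one-dimensional toy version of that localisation idea, with invented numbers (not the actual PALM/STORM pipeline): each detected photon from a single emitter is smeared by a point spread function roughly 250 nm wide, yet the centre of many detections localises the emitter far more precisely than that width.

```python
import numpy as np

rng = np.random.default_rng(4)

true_position = 100.0      # nm, invented emitter position along one axis
psf_sigma = 250.0          # nm, width of the point spread function (roughly the diffraction limit)
n_photons = 1000

# Each detected photon lands at a position blurred by the PSF.
detections = true_position + rng.normal(0, psf_sigma, n_photons)

estimate = detections.mean()
localisation_error = psf_sigma / np.sqrt(n_photons)   # about 8 nm, well below the 250 nm blur

print("Estimated position:", round(estimate, 1), "nm")
print("Expected localisation precision:", round(localisation_error, 1), "nm")
```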
Crispin in Waterloo says:
August 18, 2013 at 11:24 am
“…What is increased is the precision of the location of that point where we can confidently place the center of the error bar. We can have more confidence that it lies at exactly a certain point, to several significant digits, but in no way does this certain knowledge reduce the vertical height of the error bars. For temperatures it is still ±0.5°C for most of the land record…”
Imagine the shooters do not have all that different rifles. Instead, all the rifles come from one factory, “CRUT” (Center of Rifles Used for Testing). They are of somewhat different models (CRUT1, CRUT2, …), but of similar construction (like methods for averaging and compensating for errors). All the rifles tend to shoot a little high and to the right. You have no better clue where the bullseye was, as long as there is a systematic error.
— Mats —
Pat Frank says:
August 18, 2013 at 11:05 am
Great work, Pat Frank. Too bad that those whom you intend to help just cannot imagine that empirical matters are relevant to their statistical claims. Please keep up your good work because many benefit from it, as does empirical science generally.
To those who have no respect for empirical matters in the surface temperature record, especially BEST, I have one question. When are you going to accept Anthony’s five-fold classification of measurement stations, work your statistical magic on his numbers, and then address his claims about bias in the measurement records?
On systematic errors.
The largely unstated assumption is that Tmin represents some climatologically significant value, specifically how cool a location gets at night, even though Tmin typically occurs after dawn.
The value of Tmin is dependent on the time it occurs. There are several factors that affect the timing of Tmin, including early morning insolation, which is affected by aerosols and low level clouds, both of which are known to have declined over recent decades.
Therefore changes in Tmin have 2 components. One, the change in night time temperatures, is climatologically significant, arguably our signal. The other, the change in early morning insolation, is not climatologically significant.
As I have previously written, the change in early morning insolation could be as much as half the change in Tmin over the last 60 years, and a major reason for the satellite-surface temperature divergence, as insolation changes only affect surface temperatures.
HaroldW says:
August 18, 2013 at 4:57 am
You are correct. However, the errors in question are usually (and often questionably) assumed to be gaussian, or at least symmetrical. Those errors do affect the anomaly.
w.
There likely is not any kind of hard-sided cycle in climate. The same “cycle”, if you want to use that term, may have a periodicity with highly stretchable onset and offset periods, not to mention a highly malleable duration. In addition, the cycles likely do not cancel out between the warm and cool phases. Any kind of filter at all removes the most important part of the data series, in my opinion. However, the raw data does need some kind of statistical work. That is why I have always wanted to see a three-month moving average along with seasonal combined-month averages and warm/cool changes, much like what is done with the oceanic and atmospheric data sets.
Of even more importance is this fact, well accepted by climate researchers: the anthropogenic portion of the absolute temperature data or anomaly cannot be teased out of a global or regional average. It can be teased out of local data if a control match gold standard is available. And now we are back to calibration, of which there is none.