Guest Post by Willis Eschenbach
I have long suspected a theoretical error in the way that some climate scientists estimate the uncertainty in anomaly data. I think that I’ve found clear evidence of the error in the Berkeley Earth Surface Temperature data. I say “I think”, because as always, there certainly may be something I’ve overlooked.
Figure 1 shows their graph of the Berkeley Earth data in question. The underlying data, including error estimates, can be downloaded from here.
Figure 1. Monthly temperature anomaly data graph from Berkeley Earth. It shows their results (black) and other datasets. ORIGINAL CAPTION: Land temperature with 1- and 10-year running averages. The shaded regions are the one- and two-standard deviation uncertainties calculated including both statistical and spatial sampling errors. Prior land results from the other groups are also plotted. The NASA GISS record had a land mask applied; the HadCRU curve is the simple land average, not the hemispheric-weighted one. SOURCE
So let me see if I can explain the error I suspected. I think that the error involved in taking the anomalies is not included in their reported total errors. Here’s how the process of calculating an anomaly works.
First, you take the actual readings, month by month. Then you take the average for each month. Here’s an example, using the temperatures in Anchorage, Alaska from 1950 to 1980.
Figure 2. Anchorage temperatures, along with monthly averages.
To calculate the anomalies, from each monthly data point you subtract that month’s average. These monthly averages, called the “climatology”, are shown in the top row of Figure 2. After the month’s averages are subtracted from the actual data, whatever is left over is the “anomaly”, the difference between the actual data and the monthly average. For example, in January 1951 (top left in Figure 2) the Anchorage temperature is minus 14.9 degrees. The average for the month of January is minus 10.2 degrees. Thus the anomaly for January 1951 is -4.7 degrees—that month is 4.7 degrees colder than the average January.
What I have suspected for a while is that the error in the climatology itself is erroneously not taken into account when calculating the total error for a given month’s anomaly. Each of the numbers in the top row of Figure 2, the monthly averages that make up the climatology, has an associated error. That error has to be carried forwards when you subtract the monthly averages from the observational data. The final result, the anomaly of minus 4.7 degrees, contains two distinct sources of error.
One is the error associated with that individual January 1951 value, -14.9°C. For example, the person taking the measurements may have consistently misread the thermometer, or the electronics might have drifted during that month.
The other source of error is the error in the monthly averages (the “climatology”) which are being subtracted from each value. Assuming the errors are independent, which of course may not be the case but is usually assumed, these two errors add “in quadrature”. This means that the final error is the square root of the sum of the squares of the errors.
One important corollary of this is that the final error estimate for a given month’s anomaly cannot be smaller than the error in the climatology for that month.
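To make the arithmetic concrete, here is a minimal Python sketch of that propagation. The Anchorage numbers are the ones from Figure 2; the two one-sigma errors are made-up illustrative values, and independence of the two errors is assumed.

```python
import math

# January 1951 reading and the January climatology (Anchorage example above)
reading = -14.9          # degrees C, the individual monthly value
climatology = -10.2      # degrees C, the 1950-1980 January average

# Illustrative one-sigma errors (hypothetical values, for this sketch only)
err_reading = 0.3        # error in the individual monthly value
err_climatology = 0.4    # standard error of the January climatology

# The anomaly is the reading minus the climatology
anomaly = reading - climatology          # -4.7 C

# Assuming independence, the two errors add in quadrature
err_anomaly = math.sqrt(err_reading**2 + err_climatology**2)   # ~0.5 C

print(f"anomaly = {anomaly:+.1f} C, one-sigma error = {err_anomaly:.2f} C")
# Note that err_anomaly can never be smaller than err_climatology,
# which is the corollary stated above.
```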
Now let me show you the Berkeley Earth results. To their credit, they have been very transparent and have reported various details. Among the details in the data cited above is their estimate of the total, all-inclusive error for each month. And fortunately, their reported results also include the following information for each month:
Figure 3. Berkeley Earth estimated monthly land temperatures, along with their associated errors.
Since they are subtracting those values from each of the monthly temperatures to get the anomalies, the total Berkeley Earth monthly errors can never be smaller than those error values.
Here’s the problem. Figure 4 compares those monthly error values shown in Figure 3 to the actual reported total monthly errors for the 2012 monthly anomaly data from the dataset cited above:
Figure 4. Error associated with the monthly average (light and dark blue) compared to the 2012 reported total error. All data from the Berkeley Earth dataset linked above.
The light blue months are months where the reported error associated with the monthly average is larger than the reported 2012 monthly error … I don’t see how that’s possible.
Where I first suspected the error (but have never been able to show it) is in the ocean data. The reported accuracy is far too great given the number of available observations, as I showed here. I suspect that the reason is that they have not carried forwards the error in the climatology, although that’s just a guess to try to explain the unbelievable reported errors in the ocean data.
Statistics gurus, what am I missing here? Has the Berkeley Earth analysis method somehow gotten around this roadblock? Am I misunderstanding their numbers? I’m self-taught in all this stuff and I’ve been wrong before. Am I off the rails here? Always more to learn.
My best to all,
w.
Crispin in Waterloo says:
August 17, 2013 at 9:47 pm
Hmmm… Where to enter the mix.
==============
You just opened a Pandora’s box of bench shooters 🙂
Lubos, is this true only in cases where the measurements are of the same thing?
“So while the error of the sum grows like sqrt(N) if you correctly add (let us assume) comparable errors in quadrature, the (statistical) error of the average goes down like 1/sqrt(N).”
I think there is confusion about the nature of the raw data. I think the above applies when each recording station has made repeated measurements of the same thing and the data sets are averaged. I don’t think it applies to a set of individual data sets, one from each locality. No calculated result made from imprecise data, where each has measured a different thing, can have a greater precision than the raw data.
If I may try to break it down a little differently, there are three high-level sources of error you are discussing: 1) temperature sensor error, in which I think you can include siting errors like UHI, 2) human reading/writing/documenting error, and 3) climatological error, where a specific location may have just had a particularly warm or cold month or year. I don’t see #3 as real error; that is just actual variation in the data. If you magically fixed #1 and #2, #3 would be perfectly measured and reported data, despite it not being average. That’s not error, that’s good data. Why would perfect data that is not average cause any uncertainty?
Sounds like fair game to me to average out #3. However #1 and #2 you can’t average out. That is error, and that would be cheating. The difficulty of course is establishing how much of that result is #1 and #2, and how much is #3.
Luboš, descent by 1/sqrt(N) is true only when the error is random.
John Norris, shield irradiance, ground albedo (including winter snow), and wind speed all impact the measurement accuracy of otherwise well-functioning air temperature sensors. Calibration experiments under ideal site conditions show this. Even a PRT sensor inside a CRS (Stevenson) screen shows ~(+/-)0.5 C average measurement error about an average bias, and that inaccuracy is not randomly distributed.
Willis and all, surely the main problem is not the error of the average. This must be small due to the large sampling. The main problem is that the statistics only catch random errors, and we have no reason whatsoever to think that the errors are random. Many things have changed that are not random but systematic: sea routes, the built and paved environment, the increase and then decrease of land thermometers, cultivation… and on and on. These are not random errors! McKitrick showed a correlation to industrialization, Pielke showed a correlation to land use. Correlations that shouldn’t be there if errors were random!
Doug Proctor says: August 17, 2013 at 5:55 pm
“……in order to use a quadrature to reduce error, I believe you have to be taking multiple readings of the same item using the same equipment in the same way. The Argo floats move, and there are 3500 of them. Each day they take readings at the same depth but of different water and different temperatures (even if off by a bit), and each one is mechanically different – same as all the land stations…..”
Nail hit right on head right here. These are NOT repeated measures of the same thing.
Mosher, I’m still waiting for your list of 39,000 stations for 1880 – or any particular year of your choosing for that matter.
Pat Frank says: August 17, 2013 at 7:37 pm
Measured temperature is what’s being recorded. “Measured” is the critical word here.
In your formalism, Steve, actual measured temperature = T_m_i = C + W_i + e_i, where “i” is the given measurement, and e_i is the error in that measurement. Most of that error is the systematic measurement error of the temperature sensor itself. CRU completely ignores that error, and so does everyone else who has published a global average.
My own suspicion is that people ignore systematic measurement error because it’s large and it cannot be recovered from the surface air record for most of 1880-2013. Centennial systematic error could be estimated by rebuilding early temperature sensors and calibrating them against a high-accuracy standard….
>>>>>>>>>>>>>>>>>>>>>>>>>
Those are my thoughts exactly.
Also, the statistics used are for repeated samples of the same thing, whereas the actual sample size is ONE. Temperature is one measurement at one location on earth during one point in time. It is not repeated measurements of the same location simultaneously with matched instruments. Heck, it is often not even the same actual location, or, in the case of the ‘Official global temperature’, the same number of thermometers.
The ‘Station drop out’ problem
Thermometer Zombie Walk
From what I can see, the Julian date in successive years is being treated as if it were repeat measurements of the same place with the same equipment at the same time of day (they don’t get that repeated either). The assumption being that July 4th in podunk midworst should have the exact same temperature in 2013 as it did in 1931. However, from the work that Anthony has done we know it does not. Aside from the random variables (clouds, rain/fog/snow, cold fronts/warm fronts, droughts or whatever), there are the systematic changes.
The cow pasture becomes a plowed field, then a village, which grows into a city. Someone like Asa Sheldon comes along with his oxen and men, removes nearby Pemberton Hill, and dumps it into the Back Bay salt marsh, changing the microclimate. The thermometer is broken and replaced. The observer retires and is replaced. The Stevenson screen is freshly whitewashed and then weathers; it is replaced by a latex- or oil-painted screen or a vinyl screen. Trees grow up near the weather station blocking the wind, and are then cut down. The gravel parking lot is paved and air conditioning is added to the nearby building…
ALL of these are systematic and do not produce random errors. I know from watching my local rural temperature (now a local airport) that it is ‘Adjusted’ up by as much as 3F the next day after ‘homogenization’ to give the ‘Official Reading’ so there is that sort of ‘error’ too.
When I think of these sources of error and then look at the ‘Precision’ of the reported numbers fed to us as ‘Accurate’ estimates of global temperature all I can do is laugh. You might if you are real lucky get an error of +/-1°C.
I am not a statistician though I have had some training in statistics. I have spent decades trying to reduce error in laboratory measurements of production batches or continuous processes in a QC lab setting.
As surely as you need to draw another breath to survive, statisticians will always make the assumptions upon which their enterprise is theoretically founded.
The data are useful for exploratory purposes, but the whole enterprise of climate statistical inference rests upon assumptions that are fundamentally indefensible in this context, notably that the data are random and i.i.d. The confidence intervals cannot sensibly be interpreted as meaning what pure theory would suggest they should mean.
Our understanding of climate is at best in the exploratory stage. It is only by understanding natural climate at a profoundly deeper level that bad assumptions can be weeded out of fundamentally corrupted attempts at climate stat inference.
Here is another quick analogy. If I am looking at widgets from a molding machine, and I measure 200 samples on the same day from the same cavity, I will have a standard deviation of A. If the machine has 5 cavities and I take a random sample for that machine, the standard deviation B will be greater than A. If there are 5 machines in the factory (all using the same raw material), a random sample from the factory will have a standard deviation of C that is greater than B. If you then include all the machines in the company, in 10 factories scattered across the country, you get a standard deviation D which in general is much greater than C, which is why customers will often designate product from one factory only as ‘Approved’.
Remember, this is with everyone in the company doing their darndest to try and stay within tolerance. In some cases all the product from one cavity will be rejected as out of tolerance.
As was said about Scott’s (August 17, 2013 at 6:18 pm) example, in industry we get feedback; with the temperature data you get potluck.
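For concreteness, here is a small Python simulation of that nested-variation point, with purely hypothetical numbers for the within-cavity spread and the cavity, machine, and factory offsets.

```python
import numpy as np

rng = np.random.default_rng(0)

n_factories, n_machines, n_cavities, n_parts = 10, 5, 5, 200
sd_within, sd_cavity, sd_machine, sd_factory = 1.0, 0.5, 0.8, 1.5  # hypothetical

# Build the population of widget measurements with nested random offsets
factory_off = rng.normal(0, sd_factory, n_factories)
machine_off = rng.normal(0, sd_machine, (n_factories, n_machines))
cavity_off  = rng.normal(0, sd_cavity,  (n_factories, n_machines, n_cavities))
parts = (factory_off[:, None, None, None]
         + machine_off[:, :, None, None]
         + cavity_off[:, :, :, None]
         + rng.normal(0, sd_within, (n_factories, n_machines, n_cavities, n_parts)))

print("A (one cavity):    ", parts[0, 0, 0].std())
print("B (one machine):   ", parts[0, 0].std())
print("C (one factory):   ", parts[0].std())
print("D (whole company): ", parts.std())
# Each pooled standard deviation is typically larger than the one before it,
# because each pooling step adds another layer of between-group spread.
```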
Nick Stokes says:
August 17, 2013 at 2:52 pm
Thanks for those links, Nick, they are interesting reading.
So we agree about that.
It’s not clear what you mean by “rather small”. I also don’t understand the part about 1/30 of the total. Take another look at the data in Figure 2. Each individual anomaly at any time is calculated by taking the observations (with an associated error) and subtracting from them the climatology (again with an associated error). The month-by-month standard error of the mean for the 30-year reference period ranges from 0.15 to 0.72°C, with an average of 0.41°C, without adjusting for autocorrelation. Is that “rather small”? Seems large to me.
These two errors add in quadrature. So I don’t understand the “1/30” part. Whatever the standard error of the mean of the anomaly base period is, that has to get carried forward to the error in each individual anomaly.
If (as you say) I want to know whether January 2012 was hotter than February 2011, I need to know the true error of the anomalies. Otherwise, if the two months come up half a degree different, is that significant or not? I can’t say without knowing the total error of the anomalies. So the answer is, for some questions it’s not important, and for some questions it is.
You say that the error “matters less” because it is the same number over all years. However, remember that the number is not used alone. It is added in quadrature with all of the other errors. So its effect varies over time, as the errors in the monthly data wax and wane. Its effect, in other words, is not constant over all years, far from it.
It also establishes a hard lower limit on the total error for a given monthly anomaly. The monthly anomaly error can’t be less than the error in the corresponding month in the climatology.
It also comes into play when comparing say the changes in the spring, summer, fall, and winter temperatures. We need to know the errors to understand the significance of what we find.
In addition, it is a substantial error. As I showed above, the error in Anchorage is as high as 3/4 of a degree … and once we account for autocorrelation, it goes up to 1.4°C for the one-sigma error for January. This means that the 95% confidence interval for January is ±2.7°C … and the average size of the January monthly anomaly is 3.8°C. That 95% CI of ±2.7° covers more than a five-degree-wide temperature span, not a small error in any sense. The 95% CI for each month ranges from a third to three-quarters of the average size (standard deviation) of the corresponding monthly anomaly.
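For anyone who wants to reproduce that kind of adjustment, here is a minimal Python sketch using the common lag-1 effective-sample-size approximation. It is one standard approach, not necessarily the exact calculation I used, and the January series below is a made-up stand-in for the Anchorage data.

```python
import numpy as np

def sem_autocorr(x):
    """Standard error of the mean, adjusted for lag-1 autocorrelation,
    using the common effective-sample-size approximation
    n_eff = n * (1 - r1) / (1 + r1)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    r1 = np.corrcoef(x[:-1], x[1:])[0, 1]     # lag-1 autocorrelation
    n_eff = n * (1 - r1) / (1 + r1)           # effective number of independent values
    return x.std(ddof=1) / np.sqrt(max(n_eff, 1.0))

# Example: thirty hypothetical January means with year-to-year persistence
rng = np.random.default_rng(1)
jan = -10.2 + np.cumsum(rng.normal(0, 1.0, 30)) * 0.3   # autocorrelated toy series

print(f"naive SEM:    {jan.std(ddof=1) / np.sqrt(30):.2f} C")
print(f"adjusted SEM: {sem_autocorr(jan):.2f} C")
# The adjusted SEM is larger because the thirty Januaries are not independent.
```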
But regardless of its size, the total error of the anomalies has to include the month-by-month error in the climatology. Not 1/30 of that climatology error, but the actual standard error of the mean for e.g. March in the anomaly base period.
You are right that for any given calculation that may or may not make a difference. My point is, it should be calculated correctly, by carrying the error through, and then we can see if it is significant for a particular question.
At least that’s how I see it … through a glass eye, darkly …
Thanks,
w.
I’m with Pat on this one.
The error reduction of 1/sqrt(N) ONLY applies when you are measuring similar objects that are meant to be UNCHANGING.
The systematic error in a changing system INCREASES in the long term by about a factor of sqrt(2).
Now, as the systematic error (or absolute error) of a thermometer graduated in 1 C is 0.5 C, the systematic error of the average of a changing quantity tends towards 0.7 C.
Willis, I’m with you. But I think the problem is even more basic. This is a link to a lengthy comment from Aug. 12, 2012 that identifies three sources of error in the calculation of the anomalies:
We do not measure Tave. Tave is a calculation from two components we do measure: Tmin and Tmax. If Tmin and Tmax are separated by just 10 deg C, then the mean std error of a month’s Tave, from 30 readings of Tmin and Tmax, must be at least (10/2)/sqrt(30), or about 0.9 deg C. This is the error of the monthly anomaly against a known base.
But the base, too, has uncertainty, derived not from 30 years of 30 Tave for each month, but from 30*30 Tmins and Tmaxs. The 30-year average base temperature for each month should have a mean std error of at least (10/2)/sqrt(30*30), or 5/30, or 0.16 deg C.
To get the anomaly, you subtract Tave – Tbase, but Tave has an uncertainty of 0.9 and Tbase has an uncertainty of 0.16. Uncertainties add via root sum of squares, so the uncertainty of the anomaly is 0.914. So the uncertainty of Tbase can be neglected, because the uncertainty in each month’s Tave dominates the result.
BEST compounds this error by slicing long temperature records into shorter ones, ignoring the absolute temperature readings and working only with short slopes. I have repeatedly objected to this from a Fourier analysis, information theory approach, with its loss of low-frequency content. The scalpel may be removing the essential recalibration and keeping the instrument drift as real data.
But let’s just look at the implication of BEST slicing records from an uncertainty analysis point of view. The uncertainty of the slope of a linear regression is (I think) proportional to 1/(square of the time length of the series). Cut a 400-month time series into two 200-month series, and the uncertainty in the 200-month slope will be 4 times the uncertainty of the 400-month series. Sure, you now have two records, and optimistically you might reduce uncertainty by sqrt(2). Slicing the record was a losing game, for the uncertainty in slope is at best increased by 2*sqrt(2), or almost a factor of 3.
Correction:
If Tmin and Tmax are separated by just 10 deg C, then the mean std error of a month’s Tave, from 30 readings of Tmin and Tmax (each), must be at least (10/2)/sqrt(60), or about 0.65 deg C.
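A quick Monte Carlo sketch of the slicing point, under the simplest possible assumptions (a linear trend plus independent noise, made-up numbers): fit a trend to a full 400-month series and to its two 200-month halves, and compare the spread of the slope estimates. The exact factor depends on the noise model, but the direction is easy to check.

```python
import numpy as np

rng = np.random.default_rng(2)
n_full, sigma, true_slope = 400, 0.5, 0.002   # months, noise SD (deg C), deg C per month
t = np.arange(n_full)

slopes_full, slopes_half = [], []
for _ in range(2000):
    y = true_slope * t + rng.normal(0, sigma, n_full)
    slopes_full.append(np.polyfit(t, y, 1)[0])
    # slice into two 200-month records and fit each separately
    s1 = np.polyfit(t[:200], y[:200], 1)[0]
    s2 = np.polyfit(t[200:], y[200:], 1)[0]
    slopes_half.append(0.5 * (s1 + s2))       # average the two sliced slopes

print("slope SD, full record:     ", np.std(slopes_full))
print("slope SD, sliced + averaged:", np.std(slopes_half))
# The sliced-and-averaged slope is noticeably more uncertain than the
# single full-record slope, whatever the exact factor works out to be.
```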
Steven Mosher says:
August 17, 2013 at 4:21 pm
The issue for me is not anomalies versus absolute temperatures. Both have their uses. The issue is the correct calculation of the anomaly errors.
First, let me clarify that by “remove seasonality” you mean subtract out the climatology. If so, that’s exactly what I’m doing, and what I’m talking about.
The part I don’t understand is why you think it doesn’t affect the error of the trend. Suppose we have 48 data points, with values going linearly from zero to one in 48 equal steps. Suppose also that we know that there are no errors of any kind in the data points. The error in that trend is zero.
But if we add a repeating series of twelve different errors to each of those forty-eight values, suddenly we get a calculated trend which is not the true trend, and it has an associated non-zero error.
When we “remove seasonality” by using a base period, as is quite common in climate science, we introduce a repeating series of twelve different errors. And contrary to your and Nick’s claims, that act of removing seasonality does indeed increase the standard error of the trend.
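Here is a minimal Python sketch of that point, with made-up numbers rather than the Berkeley data: forty-eight error-free values rising linearly, plus a fixed repeating set of twelve “climatology errors”, with the trend refitted over many random draws of those twelve errors.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 48
t = np.arange(n)
true = np.linspace(0.0, 1.0, n)                 # error-free data: trend known exactly
true_slope = np.polyfit(t, true, 1)[0]

slope_errors = []
for _ in range(5000):
    clim_err = rng.normal(0, 0.3, 12)           # one fixed error per calendar month
    y = true + np.tile(clim_err, n // 12)       # the same twelve errors repeat each year
    slope_errors.append(np.polyfit(t, y, 1)[0] - true_slope)

print("SD of fitted slope around the true slope:", np.std(slope_errors))
# Without the repeating climatology errors this spread would be exactly zero,
# so subtracting an uncertain climatology does add error to the trend.
```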
Having said that, I do understand what you are saying about just wanting to establish the anomaly field, and for most purposes that’s fine. And I agree with that. What I’m saying is that the error in the anomaly field perforce must include the error in the climatology.
Someone above pointed me to the explanation of the Berkeley Earth method. My understanding, and please correct me if I’m wrong, is that Berkeley Earth establishes the climatology by using a formula that gives the average value for a given month, latitude, and altitude, plus a fitted factor “b_i”.
My interest is in seeing if the three errors (latitude, altitude, and fitted factor errors) are carried forwards to the error estimate for the final product, and in understanding how that is done.
Best regards,
w.
PS—Just wondering if you understand this claim from the methods paper:
Say what? I find that very difficult to believe. If I put up two weather stations in my front yard, I certainly expect much better than an R^2 of 0.75 between the two temperature records. In addition, the stations I’ve seen where two records overlap have had higher than 87% correlation.
Your comments welcome.
Take it for what it is worth, Willis, but I think you are right.
You say that the final error estimate for a given month’s anomaly cannot be smaller than the error in the climatology for that month.
This error seems to be the standard deviation computed from the monthly datasets, assuming a normal distribution.
There are of course other contributing factors like the daily reading error and the monthly averaging of the daily data, but they will only contribute to make the final error larger, not smaller, so I think you are correct.
I used statistics actively as a research scientist back in the 1990s so I know what I am talking about.
So if you have calculated the numbers right (I have not checked that), I think you are right.
There is something that has been bothering me about ‘error’ in statistics, in particular when standard deviations are used on ‘average’ calculations. I don’t mean the average of a dataset, I mean an average of the averages of a number of datasets.
So you have a group of averages that are then averaged out again, for example the Arctic Sea Ice Extent. On one of the charts you see the +/- 2 standard deviations. But what is the SD calculated from? Is it from the individual data points or from the averages of the data points? Or daily average temperatures (h+l)/2?
If I give an analogy of why this is not right it may help understand my question.
If each year the height of each schoolchild on their 3rd birthday is taken at each school in a county and the average is calculated (which would be around 37 inches), this would include those that were, say, 33 inches to 42 inches, which would (probably) be within the normal spread. However, when the averages are sent to a central state center and the standard deviation is calculated on the averages, the distribution will be much, MUCH narrower, and those at the extremes of the individual distribution would be considered abnormally short or tall. This is probably not how height statistics are calculated, but for Arctic sea ice and temperature I cannot see how they would account for the fact that a (daily high + daily low)/2 is itself only an average. Also, do they physically go out and measure every single ice floe in the Arctic?
I think what would be better for temperatures would be to take the daily highs and lows separately and then do the statistics on those; to collate them all together and come up with an average and SD would not be truly representative of the statistics.
If I am wrong in these assumptions, then I have learnt something, but if I am right I hope someone else learns something.
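A small Python sketch of the averaging-of-averages point, with hypothetical height numbers: the spread of the school means is far narrower than the spread of the individual children, so judging one child against the distribution of means exaggerates how unusual they are.

```python
import numpy as np

rng = np.random.default_rng(4)
n_schools, kids_per_school = 200, 50
heights = rng.normal(37.0, 2.0, (n_schools, kids_per_school))   # inches, hypothetical

sd_individuals = heights.std()                # spread of all the children
sd_school_means = heights.mean(axis=1).std()  # spread of the school averages

print(f"SD of individual heights: {sd_individuals:.2f} in")
print(f"SD of school averages:    {sd_school_means:.2f} in")
# The second number is roughly the first divided by sqrt(50): a 42-inch child
# is only a couple of SDs out among individuals, but would look wildly
# 'abnormal' against the much narrower distribution of school means.
```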
In 2011 W. Briggs did a highly critical review of the “Berkeley Earth temperature averaging process.” at: http://wmbriggs.com/blog/?p=4530
Some of Briggs’ points on using smoothed data as input: http://wmbriggs.com/blog/?p=735
and http://wmbriggs.com/blog/?p=195
SIGHhhh….
This seems to be a discussion where the same word is used to mean two entirely different things (So what else is new in the Post Normal World.)
To those of us in industry the word ‘Error’ has a very specific meaning. It is the deviation from the ‘correct’ or ‘true’ value.
Mosher on the other hand is talking about a different definition of error. Kriging is a method of interpolation used when there is not enough data. Within the concept of kriging there is the nugget effect.
That is the ‘Error” Mosher is talking about.
Kriging comes out of gold mining and other geological work where the geologist is trying to make the best guess with insufficient data. The ‘Error” in Kriging is the error of that guess, and really has nothing to do with the error of the data which Pat Frank and I and others are talking about.
It also seems to be ‘controversial’, to say the least. According to J. W. Merks (Vice President, Quality Control Services, with the SGS Organization, a worldwide network of inspection companies that acts as referee between international trading partners), it is a fraud for producing data when you do not have any. See Geostatistics: From Human Error to Scientific Fraud, http://www.geostatscam.com/
Also see the wikipedia talk section on Kriging: http://en.wikipedia.org/wiki/Talk:Kriging#Further_revision_proposal_by_Scheidtm
Hope that helps Willis.
I strongly suggest that people read this link.
http://www.rit.edu/~w-uphysi/uncertainties/Uncertaintiespart2.html
in particular part a)
We are making individual readings of a changing quantity.
The error in the average will equal the average of the error. And if we take, say, Tmax and Tmin on a thermometer graduated in degrees C, then the error each time is +/- 0.5 C … no standard deviation involved.
so the error in Tavg (daily) is also 0.5C because we add, then divide by 2
so the error in Tavg (monthly) is also 0.5C
There is NO contraction or decrease of the error when averaging changing quantities.
So you average a month, then take the difference between that average and the data to show the anomaly. An anomaly of what? The average is one month, not the average temperature of the same month over a climate cycle, roughly 30 years. Meaningless, really. You are assuming that your month of choice is truly average, but none of them are, since every month of a cycle will be slightly different.
This is not a method I would use since it assumes too much and gets the guessing index up.
Dear Crispin and Pat Frank,
yes, 1/sqrt(N) is the behavior of the statistical error, i.e. the part of the error coming from individual errors that are independent for each measurement.
Then there can be a systematic error that is “shared” by the quantities that are being averaged. The systematic error of the average of similar pieces of data doesn’t decrease with N but it is the same as it is for the individual entries – it can never be greater.
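A short Python sketch of that distinction, with toy numbers: the independent per-reading errors shrink like 1/sqrt(N) in the average, while an offset shared by every reading passes through the average unchanged.

```python
import numpy as np

rng = np.random.default_rng(5)
n_trials, N = 10000, 1000
sigma_random, shared_bias = 0.5, 0.3          # per-reading random error, common offset

# Each trial: N readings, each with its own independent random error,
# plus one systematic offset shared by all readings in that trial.
random_part = rng.normal(0, sigma_random, (n_trials, N)).mean(axis=1)
total_error_of_mean = random_part + shared_bias

print("SD of the mean from random errors:", random_part.std())       # ~ sigma/sqrt(N)
print("expected sigma/sqrt(N) value:     ", sigma_random / np.sqrt(N))
print("average total error of the mean:  ", total_error_of_mean.mean())  # ~ shared_bias
# The random part has shrunk by a factor of sqrt(1000); the shared bias has not
# shrunk at all, which is Lubos's point about systematic error.
```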
A. Scott says:
…
I agree with Nick …. I don’t know if he (or Willis) is right or wrong – but I do know denigrating comments absent supporting evidence do zero towards understanding the issue or finding answers.
Science should be about collaborative effort. And anytime you have knowledgeable folks willing to engage in discussion you should take advantage of it.
Indeed. We should be encouraging Nick and helping him critique this piece as much as possible. This is exactly how the scientific process works – you make a hypothesis and then get people to do their damnedest to find a flaw. Someone who disbelieves in your hypothesis is FAR better at that than any number of supporters…
On August 17, 2013 at 10:09 PM, Pat Frank wrote:
Luboš, descent by 1/sqrt(N) is true only when the error is random.
It’s my recollection that the condition for “descent by 1/sqrt(N)” is not that the error be random, but that the error be normally distributed. Given what we know about the propensities for readings to be integers or mid-range decimals (i.e., not xx.1 or xx.9), I find it hard to believe that the error is normally distributed. If the error isn’t normally distributed, “descent by 1/sqrt(N)” doesn’t apply, and the actual error will be larger.