Not Whether, but How to Do The Math

Guest Post by Willis Eschenbach

The Berkeley Earth Surface Temperature (BEST) team is making a new global surface temperature record. Hopefully this will give us a better handle on what's going on with the temperature.

BEST has put out a list of four goals for their mathematical methods (algorithms). I like three of those goals a lot. One I'm not so fond of. Here are their goals:

1)  Make it possible to exploit relatively short (e.g. a few years) or discontinuous station records. Rather than simply excluding all short records, we prefer to design a system that allows short records to be used with a low – but non‐zero – weighting whenever it is practical to do so.

2)  Avoid gridding. All three major research groups currently rely on spatial gridding in their averaging algorithms. As a result, the effective averages may depend on the choice of grid pattern and may be sensitive to effects such as the change in grid cell area with latitude. Our algorithms seek to eliminate explicit gridding entirely.

3)  Place empirical homogenization on an equal footing with other averaging. We distinguish empirical homogenization from evidence‐based homogenization. Evidence‐based adjustments to records occur when secondary data and/or metadata is used to identify problems with a record and propose adjustments. By contrast, empirical homogenization is the process of comparing a record to its neighbors to detect undocumented discontinuities and other changes. This empirical process performs a kind of averaging as local outliers are replaced with the basic behavior of the local group. Rather than regarding empirical homogenization as a separate preprocessing step, we plan to incorporate empirical homogenization as a process that occurs simultaneously with the other averaging steps.

4)  Provide uncertainty estimates for the full time series through all steps in the process.

Using short series, avoiding gridding, and providing uncertainty estimates are all great goals. But the whole question of "empirical homogenization" is fraught with hidden problems and traps for the unwary.

The first of these is that nature is essentially not homogeneous. It is pied and dappled, patched and plotted. It generally doesn’t move smoothly from one state to another, it moves abruptly. It tends to favor Zipf distributions, which are about as non-normal (i.e. non-Gaussian) as a distribution can get.

So I object to the way that the problem is conceptualized. The problem is not that the data requires "homogenization"; that's a procedure for milk. The problem is that there are undocumented discontinuities or incorrect data entries. But homogenizing the data is not the answer to that.

This is particularly true since (if I understand what they're saying) they have already told us how they plan to deal with discontinuities. The plan, which I've been pushing for some time now, is to simply break the series apart at the discontinuities and treat it as two separate series. And that's a good plan. They say:

Data split: Each unique record was broken up into fragments having no gaps longer than 1 year. Each fragment was then treated as a separate record for filtering and merging. Note however that the number of stations is based on the number of unique locations, and not the number of record fragments.

So why would they deal with “empirical discontinuities” by adjusting them, and deal with other discontinuities in a totally different manner?
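For what it's worth, that "data split" step is easy to sketch. Here's a toy version in Python of what I take their description to mean (my own illustration with made-up data, not their code; the one-year threshold is theirs, everything else here is mine):

```python
import numpy as np

def split_at_gaps(times, temps, max_gap_years=1.0):
    """Break one station record into fragments wherever the gap between
    consecutive observations exceeds max_gap_years."""
    times = np.asarray(times, dtype=float)   # decimal years, ascending
    temps = np.asarray(temps, dtype=float)
    if times.size == 0:
        return []
    # indices where the jump to the next observation is too long
    gaps = np.where(np.diff(times) > max_gap_years)[0]
    pieces = np.split(np.arange(times.size), gaps + 1)
    return [(times[idx], temps[idx]) for idx in pieces]

# toy record: monthly data with a three-year hole in the middle
t = np.concatenate([np.arange(1950, 1960, 1 / 12), np.arange(1963, 1970, 1 / 12)])
x = 15 + 10 * np.sin(2 * np.pi * t) + np.random.randn(t.size)
fragments = split_at_gaps(t, x)
print(len(fragments), "fragments of lengths", [len(f[0]) for f in fragments])
```

Each fragment then gets treated as its own record downstream, just as their description says, while the station itself still counts only once in the station tally.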

Next, I object to the plan that they will “incorporate empirical homogenization as a process that occurs simultaneously with the other averaging steps.” This will make it very difficult to back it out of the calculations to see what effect it has had. It will also hugely complicate the question of the estimation of error. For any step-wise process, it is crucial to separate the steps so the effect of each single step can be understood and evaluated.

Finally, let’s consider the nature of the “homogenization” process they propose. They describe it as a process whereby:

… local outliers are replaced with the basic behavior of the local group

There’s a number of problems with that.

First, temperatures generally follow a Zipf distribution (a distribution with a large excess of extreme values). As a result, what would definitely be "extreme outliers" in a Gaussian distribution are just another day in the life in a Zipf distribution. A very unusual and uncommon temperature in a Gaussian distribution may be a fairly common and mundane temperature in a Zipf distribution. If you pull those so-called outliers out of the dataset, or replace them with a local average, you no longer have temperature data – you have Gaussian data. So you have to be real, real careful before you declare an outlier. I would certainly look at the distributions before and after "homogenization", to see if the Zipf nature of the distribution has disappeared … and if so, I'd reconsider my algorithm.
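Here's the kind of before-and-after check I have in mind, as a rough sketch in Python. The fat-tailed Student-t numbers are only a stand-in for a heavy-tailed temperature record, and the "homogenization" step is my own crude caricature (replace anything beyond the empirical 99.9% band with a local average), not anything BEST has published:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# synthetic daily anomalies with fat tails (Student-t, df=3), standing in
# for a heavy-tailed temperature record; not real station data
anoms = rng.standard_t(df=3, size=20_000)

def tail_report(x, label):
    """Print two simple shape statistics for the tails of x."""
    print(f"{label:12s} excess kurtosis = {stats.kurtosis(x):7.1f}   "
          f"largest |anomaly| = {np.abs(x).max():5.1f}")

tail_report(anoms, "raw")

# crude "homogenization": anything beyond the empirical 99.9% band gets
# replaced by the mean of its in-band neighbours in time
lo, hi = np.quantile(anoms, [0.0005, 0.9995])
homog = anoms.copy()
for i in np.where((anoms < lo) | (anoms > hi))[0]:
    nbrs = anoms[max(0, i - 15): i + 16]
    homog[i] = nbrs[(nbrs >= lo) & (nbrs <= hi)].mean()

tail_report(homog, "homogenized")
# a Gaussian sample this size has excess kurtosis near 0 and a largest
# |value| a bit over 4; note how far the "homogenized" record has moved
# in that direction from replacing only 0.1% of the points
```

If the tail statistics collapse toward Gaussian values after the procedure, you haven't fixed bad entries, you've rewritten the distribution.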

Second, while there is a generally high correlation between temperature datasets out to 1200 km or so, that's all that it is. A correlation. It is not a law. For any given station, there will often be nearby datasets that have very little correlation. In addition, for each of the highly correlated pairs, there will be a number of individual years where the two datasets diverge substantially. So despite high correlation, we cannot just assume that any record that disagrees with the "local group" is incorrect, as the BEST folks seem to be proposing.
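A toy illustration of that point (Python, synthetic numbers, not real station data): two "stations" that share most of their year-to-year signal correlate strongly overall, yet individual years can still disagree by a substantial margin.

```python
import numpy as np

rng = np.random.default_rng(1)
years = np.arange(1950, 2011)

# a shared regional signal plus independent local noise at each station
regional = rng.normal(0.0, 1.0, years.size)
station_a = regional + rng.normal(0.0, 0.4, years.size)
station_b = regional + rng.normal(0.0, 0.4, years.size)

r = np.corrcoef(station_a, station_b)[0, 1]
gap = np.abs(station_a - station_b)
print(f"overall correlation     : {r:.2f}")
print(f"largest single-year gap : {gap.max():.2f} deg")
print(f"years differing > 1 deg : {(gap > 1.0).sum()} of {years.size}")
# a high correlation tells you the records usually move together;
# it does not tell you which one is "wrong" in the years they don't
```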

Third, since nature itself is almost “anti-homogeneous”, full of abrupt changes and frequent odd occurrences and outliers, why would we want to “homogenize” a dataset at all? If we find data we know to be bad, throw it out. Don’t just replace it with some imaginary number that you think is somehow more homogeneous.

Fourth, although the temperature data is highly correlated over long distances, the same is not true of the trend. See my post on Alaskan trends regarding this question. Since the trends are not correlated, adjustment based on neighbors may well introduce a spurious trend. If the "basic behavior of the local group" is trending upwards, and the data being homogenized is trending horizontally, both may indeed be correct, and homogenization will destroy that …
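To make that concrete, here's a toy sketch in Python. The "neighbor adjustment" is my own simple-minded stand-in (pull the station halfway toward the local group mean) rather than anything BEST proposes, but it shows how a perfectly good flat record can pick up a trend it never had:

```python
import numpy as np

rng = np.random.default_rng(2)
years = np.arange(1950, 2011)
t = years - years.mean()

weather = rng.normal(0.0, 0.8, years.size)       # shared year-to-year weather

# five neighbours warming at 2 deg/century, one flat target station
neighbours = [0.02 * t + weather + rng.normal(0.0, 0.3, t.size) for _ in range(5)]
target = weather + rng.normal(0.0, 0.3, t.size)  # no trend at all
group_mean = np.mean(neighbours, axis=0)

def trend(y):
    """Least-squares trend in degrees per century."""
    return np.polyfit(years, y, 1)[0] * 100

# a crude neighbour adjustment: nudge the target halfway toward the group
adjusted = 0.5 * target + 0.5 * group_mean

print(f"corr(target, group mean)      : {np.corrcoef(target, group_mean)[0, 1]:.2f}")
print(f"trend of group mean           : {trend(group_mean):+.2f} deg/century")
print(f"trend of target, raw          : {trend(target):+.2f} deg/century")
print(f"trend of target, adjusted     : {trend(adjusted):+.2f} deg/century")
print(f"trend added by the adjustment : {trend(adjusted) - trend(target):+.2f} deg/century")
# the target and the group correlate well (they share the weather), yet
# the adjustment hands the flat station roughly half the group's trend
```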

Those are some of the problems with "homogenization" that I see. I'd start by naming it something else. It does not describe what we wish to do to the data. Nature is not homogeneous, and neither should our dataset be.

Then I’d use the local group, solely to locate unusual “outliers” or shifts in variance or average temperature.

But there’s no way I’d replace the putative “outliers” or shifts with the behavior of the “local group”. Why should I? If all you are doing is bringing the data in line with the average of the local group, why not just throw it out entirely and use the local average? What’s the advantage?

Instead, if I found such an actual anomaly or incorrect data point, I'd just throw out the bad data point, break the original temperature record in two at that point, and consider it as two different records. Why average it with anything at all? That's introducing extraneous information into a pristine dataset; what's the point of that?
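Here's a minimal sketch in Python of that flag-and-split approach, with a made-up record and a made-up bad value; the point is simply that nothing gets infilled and nothing foreign gets averaged in:

```python
import numpy as np

def drop_and_split(times, temps, bad_idx):
    """Remove flagged points and break the record at each one, rather
    than replacing them with some neighbourhood average."""
    times = np.asarray(times)
    temps = np.asarray(temps)
    fragments, start = [], 0
    for b in sorted(set(bad_idx)) + [len(times)]:
        if b > start:
            fragments.append((times[start:b], temps[start:b]))
        start = b + 1
    return fragments

# toy record with one obvious transcription error at index 40
t = np.arange(1950, 1960, 1 / 12)
x = 15 + 10 * np.sin(2 * np.pi * t)
x[40] = 99.9
pieces = drop_and_split(t, x, bad_idx=[40])
print([len(p[0]) for p in pieces])   # two fragments, neither containing the bad value
```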

Lastly, a couple of issues with their quality control procedures. They say:

Local outlier filter: We tested for and flagged values that exceeded a locally determined empirical 99.9% threshold for normal climate variation in each record.

and

Regional filter: For each record, the 21 nearest neighbors having at least 5 years of record were located. These were used to estimate a normal pattern of seasonal climate variation. After adjusting for changes in latitude and altitude, each record was compared to its local normal pattern and 99.9% outliers were flagged.

Again, I’d be real, real cautious about these procedures. Since the value in both cases is “locally determined”, there will certainly not be a whole lot of data for analysis. Determination of the 99.9% exceedance level, based solely on a small dataset of Zipf-distributed data, will have huge error margins. Overall, what they propose seems like a procedure guaranteed to convert a Zipf dataset into a Gaussian dataset, and at that point all bets are off …

In addition, once the “normal pattern of seasonal climate variation” is established, how is one to determine what is a 99.9% outlier? The exact details of how this is done make a big difference. I’m not sure I see a clear and clean way to do it, particularly when the seasonal data has been “adjusted for changes in latitude and altitude”. That implies that they are not using anomalies but absolute values, and that always makes things stickier. But they don’t say how they plan to do it …

In closing, I bring all of this up, not to oppose the BEST crew or make them wrong or pick on errors, but to assist them in making their work bulletproof. I am overjoyed that they are doing what they are doing. I bring this up to make their product better by crowd-sourcing ideas and objections to how they plan to analyze the data.

Accordingly, I will ask the assistance of the moderators in politely removing any posts talking about whether BEST will or won’t come up with anything good, or of their motives, or whether the eventual product will be useful, or the preliminary results, or anything extraneous. Just paste in “Snipped – OT” to mark them, if you’d be so kind.

This thread is about how to do the temperature analysis properly, not whether to do it, or the doer’s motives, or whether it is worth doing. Those are all good questions, but not for this thread. Please take all of that to a general thread regarding BEST. This thread is about the mathematical analysis and transformation of the data, and nothing else.

w.

Pamela Gray
March 23, 2011 6:11 am

We have a number of "Oregon's Freezer" areas here, known for huge ranges between the nighttime low and the daytime high. This pattern is not unusual at all, occurring rather frequently in long-term records. These outliers are simply part of the climate we experience. Marginal desert plains high in altitude relative to the surrounding countryside exhibit typical desert characteristics but in a much more random "outlier" fashion. Other similar landscapes in counties and states in the US experience similar odd lows and highs within a 24 hr period. To treat such occurrences as outliers seems to me an effort to negate climate patterns entirely natural for the area. Why do that?
I hope these folks are meteorologists or at least have a good in-the-field meteorologist on their staff so they have some understanding of weather pattern variation based on topographical climate sub-zones within the major climate zones long known among weather men and women. If they rely on computer machination to “take care of outliers”, we are back to square one with yet another piece of junk science.

MikeP
March 23, 2011 6:12 am

I would prefer that all outliers be flagged rather than removed. Coding is essentially as easy this way, it just takes a little more data storage. Given the small size of the dataset, this is not a burden IMO.
Flagged outliers (preferably with the reason they are considered outliers) would make it very easy and straightforward for anybody to investigate the nature of the outliers and express opinions (conclusions) about whether the exclusions are optimal or if algorithms should be modified. Processing errors begin to stand out. It would also make it far more straightforward for somebody to modify later processing, such as replacing homogenization with an algorithm for creating separate data sets, such as AW suggests. Early processing can be accepted, modified, or rejected.
When young, I helped develop processing for a satellite data set. I (and others) fought to keep all data even in the final data product. Data considered "bad" were simply marked as bad with a bit-coded reason for it. This approach led to quick recognition and correction of a coding error where data marked as bad wasn't. It also allowed researchers to investigate these points (one person's outliers are another's focus of investigation).
The counter arguments given at that time were “That it was a burden for data users to have to check the flags” and “It might lead to some researchers naively including bad data in their analyses leading to bad results” and “We’re the experts, people ought to trust our judgment of what’s bad and what’s not”. While there’s some validity to these arguments, I think the benefits outweigh any extra costs.

alan neil ditchfield
March 23, 2011 6:18 am

Willis Eschenbach says:
2) Avoid gridding. All three major research groups currently rely on spatial gridding in their averaging algorithms. As a result, the effective averages may depend on the choice of grid pattern and may be sensitive to effects such as the change in grid cell area with latitude. Our algorithms seek to eliminate explicit gridding entirely.
The problem of change in grid area with latitude can be avoided with a different geographical reference system. See: http://www.neubert.net II – Platonic Spheres – Octahedron's tessellation.
A. N. Ditchfield

Dave
March 23, 2011 6:32 am

Joel Heinrich>
I think you’ve put your finger on an important point: homogenisation is not something reasonable to do for temperatures, but it is reasonable for temperature _measurements_.

Eric Anderson
March 23, 2011 6:38 am

Thanks, Willis, great post.
crosspatch is absolutely correct: we can drive 30 km and get a whole different climate, much less a different temperature. No doubt this is true in other parts of the world as well. The idea that 1200 km represents some kind of reasonable number seems purely arbitrary, as well as way beyond what we in fact know to be the case in the real world.
Homogenization, by definition, will destroy information. By definition, it introduces an a priori, preconceived artifact into the record of what we think the temperature *should* look like. You cannot get away from that and there is absolutely no reason to go down this path, so why do it? Does it make obtaining the average temperature easier? Sure. Because we can sit behind a desk and just do calculations and statistics, rather than getting out in the field and actually looking at the site in question to see if there is a perfectly legitimate reason it is an “outlier”.
I’m hopeful that BEST will be an improvement over existing approaches, but this whole concept of homogenization is one of the first things that has to go.

Steve from rockwood
March 23, 2011 6:44 am

What’s really missing is a sensitivity analysis of the points you bring up, not whether they are valid or not.
For example, grossed data is not a problem if used properly, so once you have your data set, perform a sensitivity analysis on grid size in over- and under-sampled areas. What does it do to the final answer? You can determine the maximum grid size allowable and go with that.
Same for outliers. What happens to the output when the outliers are increasingly removed from the input?
Methods that show a high sensitivity to slight changes should be more carefully investigated. I suspect if this were done people would find the earth’s temperature cannot be determined to within 0.1 deg.

Steve from rockwood
March 23, 2011 6:45 am

Sorry, should have been gridded data not grossed data.

March 23, 2011 6:46 am

The Surface Stations project shows that most of the data being collected in the US currently is bad data. This demonstrates either a lack of seriousness regarding the data collection process, or a desire to assure continued collection of a “loosey goosey” data set which can be subjected to massive adjustments. Neither of the above explanations for the current situation leaves me with a “warm, fuzzy feeling”.

Gnomish
March 23, 2011 6:52 am

did ya hear the one about the statistician who was found dead in his kitchen?
his feet were in the freezer and his head was in the oven. on average he felt fine.
he was an average person, with one breast and one testicle – and when rendered in a blender, was quite homogenized.

March 23, 2011 6:56 am

The problem BEST face is that they are trying to do the analysis using only rules – no human judgement. It's the use of human judgement in the current data which causes people to doubt its accuracy.
However, given they are doing this purely algorithmically, it is easy for them to do it in multiple ways. So they could implement an algorithm which does it the way Willis likes, and see how different the results turn out.
What should be guarded against is people running the whole thing multiple times with multiple rules, and then selecting the appropriate rule post-facto, because it appeals to their prejudices.
So I think they should accept submissions from people who have a particular rule in mind, and develop algorithms for these rules before they have any results. Then run their software multiple times, once with each rule.
My bet, for what it's worth, is that the statistical effect of a large amount of data will overwhelm any errors in the data, so that whether you use raw data, homogenized data or human adjusted data won't make much difference. But we won't know until it's been done.

March 23, 2011 7:06 am

Willis:
Would not the inclusion of short records require working with absolute temperatures rather than anomalies?
While I am very much in favour of and optimistic about the BEST effort in terms of creating a series which has more transparency than the existing series, the new series should come with many of the caveats and cautions alluded to in the above comments.

Olen
March 23, 2011 7:07 am

Is homogenization another word for massage?

kim
March 23, 2011 7:07 am

Empiric homogenization assumes a common and unchanging density of intersite relationships. It's turbulent, it is, through time and space.
============

March 23, 2011 7:11 am

Today is the World Meteorological Day.
Any suitable comments ?
REPLY: Meteorology is not Climatology – Anthony

Robin Edwards
March 23, 2011 7:15 am

I can only add my endorsement to those who advocate that “outliers” should be treated with the greatest caution. The term presupposes a sure knowledge of the underlying true distribution, and so a circular argument ensues. The only real excuse for adjusting “outliers” is that a gross clerical error has been introduced. You can perhaps check this by reference to the original data source, but it is a time-consuming process. When I was an industrial chemist the most common error was a reversal of a pair of adjoining digits. Often easy to spot for leading digits, but elusive thereafter. I’ve often argued that smoothing procedures are to be avoided, but I can understand their attraction for the supposed purpose of clarifying a complex situation, so that people such as politicians and journalists don’t have to use their brains. This does not form a pretext for unbridled smoothing at the scientific level, though.

A G Foster
March 23, 2011 7:16 am

In producing a global temperature average we are fundamentally working with small geographical areas and combining them into larger ones. We cannot escape grids. Since most of the globe is ocean, immune to the microclimates of mountains and plateaus, and not easily carpeted with weather stations, larger starting grids and more averaging are necessary. More fundamental questions are: should high elevation surface temps be given equal weight with their thin air, and of course, should SST be weighted equally with land?

Roberto
March 23, 2011 7:18 am

What a great question, and what a wonderful project. Data can appear non-representative for so many reasons. A change in wind direction, a sudden storm, an architectural change, moving cloud shadows, different microclimates. Some of the short distances mentioned above should also mention a difference in elevation of up to a mile. So I would suggest that all raw data be kept. If any of it is dropped or adjusted for purposes of this project, then it should be categorized and marked with the reason and the adjustment, so that it is not permanently gone from later discussions. We should be able to bring different sets back into the fold, to try different treatments of the data.

March 23, 2011 7:21 am

May I ask a naive question: what is temperature? My microwave oven was cold until I turned it on to heat my food up. Does a temperature measurement reveal its origin?

March 23, 2011 7:24 am

“Accordingly, I will ask the assistance of the moderators in politely removing…..”
A thousand one-liners about Berkeley shot to heck. 🙁
At any rate I'm glad they're not gridding; "spatial" considerations are way different than "nearby" considerations.
Willis, perhaps you know, I certainly don’t, are they spatially weighting at all? It seems to me that they must. Perhaps it is the vernacular used, but I don’t think an average of temp data is particularly all that useful. Do a dozen thermometer temp readings from LA county weigh more than the 2 interior Antarctic temp readings? (pretending that those two are reasonably reliable)
Like you, I don't think the simultaneous steps are justified. When doing any kind of heavy mathematical data analysis, I always break it down into a clear sequential process. It's much easier to see where I went wrong the first time when doing it this way. And how does one accomplish simultaneous steps to begin with? There is a hierarchy of math processing. In other words, regardless of how one formulates an algorithm, we don't add and divide at the same time.

Vince Causey
March 23, 2011 7:29 am

One of the things they could do is keep a record of all outliers that have been excluded. That way it would be possible to compare what effect the outliers would have had on the overall result.

eadler
March 23, 2011 7:42 am

Before making my comment, I would like to apologize for accusing Mr Eschenbach of fraud in one of my posts. I should not be so fast to accuse someone of dishonesty because I have a strong disagreement with their position. I should recognize that people can have honest differences of opinion and can make honest mistakes.
I am curious about the statement that Temperature is a Zipf distribution. I have never seen this referred to before. Looking up Zipf distribution in Wikipedia, I find that it refers to a distribution where the 2nd most commonly occurring value appears 1/2 as often as the first, the third most common appears 1/3 as often as the first, etc. A generalized version is ~1/n^s, where n is the rank and s is a number greater than 1.
http://en.wikipedia.org/wiki/Zipf%27s_law
Does anyone have a reference which documents this dependence of temperature distribution on rank? Since temperature is a continuous variable rather than a discrete variable, how is rank defined?

REPLY: Mr. Eadler is now restored to posting status – Anthony

eadler
March 23, 2011 7:45 am

[“Snipped – OT” – see article body]

Andrew Krause
March 23, 2011 7:50 am

I think there should be an XML schema definition created for an ideal temp record. This will allow universally agreed definitions, make explicit the assumptions about missing data and make database storage and transfer of the records much easier.

The Man
March 23, 2011 7:52 am

I find all of this very confusing. Presented with the hypothesis that catastrophic warming will destroy life as we know it, we are called upon to spend billions or trillions of dollars to avoid this horrible fate. If this is such a big deal, why do we have a tiny group of researchers doing this on shoestring budget? Do we really believe that manufacturing a derived set of numbers from this data will increase our confidence in what is, after all, a set of empirical measurements which could, in theory have any statistical and numerical properties you care to conceive. And, furthermore, performing this distortion of empirical data automatically.
Spend a lot less than a billion dollars and hire enough graduate students to check every data point from every station and throw out the garbage. The process will take a few years and have to be managed, but at the end you’ll have raw data that you can put some trust in. The warmists are talking about the end of the world and then going out to the garage with a can of WD-40 and a hammer; I’m sorry but even BEST is bush league in this context. If you going to do the job, do it right.