Guest Post by Willis Eschenbach
The Berkeley Earth Surface Temperature (BEST) team is making a new global climate temperature record. Hopefully this will give us a better handle on what’s going on with the temperature.
BEST has put out a list of the four goals for their mathematical methods (algorithms). I like three of those goals a lot. One I’m not so fond of. Here are their goals:
1) Make it possible to exploit relatively short (e.g. a few years) or discontinuous station records. Rather than simply excluding all short records, we prefer to design a system that allows short records to be used with a low – but non‐zero – weighting whenever it is practical to do so.
2) Avoid gridding. All three major research groups currently rely on spatial gridding in their averaging algorithms. As a result, the effective averages may be dependent on the choice of grid pattern and may be sensitive to effects such as the change in grid cell area with latitude. Our algorithms seek to eliminate explicit gridding entirely.
3) Place empirical homogenization on an equal footing with other averaging. We distinguish empirical homogenization from evidence‐based homogenization. Evidence‐based adjustments to records occur when secondary data and/or metadata is used to identify problems with a record and propose adjustments. By contrast, empirical homogenization is the process of comparing a record to its neighbors to detect undocumented discontinuities and other changes. This empirical process performs a kind of averaging as local outliers are replaced with the basic behavior of the local group. Rather than regarding empirical homogenization as a separate preprocessing step, we plan to incorporate empirical homogenization as a process that occurs simultaneously with the other averaging steps.
4) Provide uncertainty estimates for the full time series through all steps in the process.
Using short series, avoiding gridding, and uncertainty estimates are all great goals. But the whole question of “empirical homogenization” is fraught with hidden problems and traps for the unwary.
The first of these is that nature is essentially not homogeneous. It is pied and dappled, patched and plotted. It generally doesn’t move smoothly from one state to another, it moves abruptly. It tends to favor Zipf distributions, which are about as non-normal (i.e. non-Gaussian) as a distribution can get.
So I object to the way that the problem is conceptualized. The problem is not that the data requires “homogenization”, that’s a procedure for milk. The problem is that there are undocumented discontinuities or incorrect data entries. But homogenizing the data is not the answer to that.
This is particularly true since (if I understand what they’re saying) they have already told us how they plan to deal with discontinuities. The plan, which I’ve been pushing for some time now, is to simply break the series apart at the discontinuities and treat it as two separate series. And that’s a good plan. They say:
Data split: Each unique record was broken up into fragments having no gaps longer than 1 year. Each fragment was then treated as a separate record for filtering and merging. Note however that the number of stations is based on the number of unique locations, and not the number of record fragments.
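To make the quoted "data split" step concrete, here is a minimal sketch of one way it could work. The function name `split_at_gaps`, the (decimal year, temperature) tuple layout, and the sample readings are all assumptions for illustration; BEST has not published this code.

```python
from typing import List, Tuple

Reading = Tuple[float, float]  # (time in decimal years, temperature in deg C)

def split_at_gaps(record: List[Reading], max_gap_years: float = 1.0) -> List[List[Reading]]:
    """Split a time-sorted record into fragments wherever two consecutive
    readings are more than max_gap_years apart."""
    fragments: List[List[Reading]] = []
    current: List[Reading] = []
    for reading in record:
        # Start a new fragment when the gap since the previous reading is too long.
        if current and reading[0] - current[-1][0] > max_gap_years:
            fragments.append(current)
            current = []
        current.append(reading)
    if current:
        fragments.append(current)
    return fragments

# Made-up record with a 2.5 year hole after 1951.0.
record = [(1950.0, 10.1), (1950.5, 10.4), (1951.0, 9.9),
          (1953.5, 10.8), (1954.0, 10.6)]
fragments = split_at_gaps(record)
print([len(f) for f in fragments])  # [3, 2]
```

No temperature value is changed anywhere in that process. The record is only cut, and each fragment keeps its original readings.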
So why would they deal with “empirical discontinuities” by adjusting them, and deal with other discontinuities in a totally different manner?
Next, I object to the plan that they will “incorporate empirical homogenization as a process that occurs simultaneously with the other averaging steps.” This will make it very difficult to back it out of the calculations to see what effect it has had. It will also hugely complicate the question of the estimation of error. For any step-wise process, it is crucial to separate the steps so the effect of each single step can be understood and evaluated.
Finally, let’s consider the nature of the “homogenization” process they propose. They describe it as a process whereby:
… local outliers are replaced with the basic behavior of the local group
There are a number of problems with that.
First, temperatures generally follow a Zipf distribution (a distribution with a large excess of extreme values). As a result, what would definitely be “extreme outliers” in a Gaussian distribution are just another day in the life in a Zipf distribution. A very unusual and uncommon temperature in a Gaussian distribution may be a fairly common and mundane temperature in a Zipf distribution. If you pull those so-called outliers out of the dataset, or replace them with a local average, you no longer have temperature data – you have Gaussian data. So you have to be real, real careful before you declare an outlier. I would certainly look at the distributions before and after “homogenization”, to see if the Zipf nature of the distribution has disappeared … and if so, I’d reconsider my algorithm.
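One rough way to run that before-and-after check is to compare a tail-heaviness statistic on the two versions of the data. The sketch below uses excess kurtosis, which is my choice of statistic and not anything BEST has described. The anomaly values are invented, and replacing outliers with the sample median is only a crude stand-in for "the basic behavior of the local group".

```python
from statistics import fmean, median, pstdev

def excess_kurtosis(xs):
    """Population excess kurtosis: about 0 for Gaussian data, larger for heavy tails."""
    m = fmean(xs)
    m2 = fmean([(x - m) ** 2 for x in xs])
    m4 = fmean([(x - m) ** 4 for x in xs])
    return m4 / m2 ** 2 - 3.0

def replace_outliers(xs, k=2.0):
    """Replace values more than k standard deviations from the mean with the
    sample median (hypothetical stand-in for a neighbor-based replacement)."""
    m, s, mid = fmean(xs), pstdev(xs), median(xs)
    return [mid if abs(x - m) > k * s else x for x in xs]

# Invented anomalies (deg C) with two large excursions.
anomalies = [-0.5, -0.3, -0.2, -0.1, 0.0, 0.0, 0.1, 0.2, 0.3, 0.4, 3.5, -4.0]
before = excess_kurtosis(anomalies)
after = excess_kurtosis(replace_outliers(anomalies))
print(f"excess kurtosis before: {before:.2f}, after: {after:.2f}")
```

A sample of twelve values proves nothing about real temperature records. The point is only that the check is cheap to run, and a large drop in the tail statistic is a warning sign that the procedure is reshaping the distribution.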
Second, while there is a generally high correlation between temperature datasets out to 1200 km or so, that’s all that it is. A correlation. It is not a law. For any given station, there will often be nearby datasets that have very little correlation. In addition, for each of the highly correlated pairs, there will be a number of individual years where the variation in the two datasets is quite large. So despite high correlation, we cannot just assume that any record that disagrees with the “local group” is incorrect, as the BEST folks seem to be proposing.
Third, since nature itself is almost “anti-homogeneous”, full of abrupt changes and frequent odd occurrences and outliers, why would we want to “homogenize” a dataset at all? If we find data we know to be bad, throw it out. Don’t just replace it with some imaginary number that you think is somehow more homogeneous.
Fourth, although the temperature data is highly correlated out to a long distance, the same is not true of the trend. See my post on Alaskan trends regarding this question. Since the trends are not correlated, adjustment based on neighbors may well introduce a spurious trend. If the “basic behavior of the local group” is trending upwards, and the data being homogenized is trending horizontally, both may indeed be correct, and homogenization will destroy that …
Those are some of the problems with “homogenization” that I see. I’d start by naming it something else. It does not describe what we wish to do to the data. Nature is not homogeneous, and neither should our dataset be homogeneous.
Then I’d use the local group, solely to locate unusual “outliers” or shifts in variance or average temperature.
But there’s no way I’d replace the putative “outliers” or shifts with the behavior of the “local group”. Why should I? If all you are doing is bringing the data in line with the average of the local group, why not just throw it out entirely and use the local average? What’s the advantage?
Instead, if I found such an actual anomaly or incorrect data point, I’d just throw out the bad data point, and break the original temperature record in two at that point, and consider it as two different records. Why average it with anything at all? That’s introducing extraneous information into a pristine dataset, what’s the point of that?
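As a sketch of that "throw it out and split" approach: the flagged indices are assumed to come from some earlier screening step, and the function name and sample values are hypothetical.

```python
from typing import List, Set

def split_at_bad_points(temps: List[float], bad: Set[int]) -> List[List[float]]:
    """Drop the readings at the flagged indices and start a new record
    after each one. No surviving value is altered."""
    records: List[List[float]] = []
    current: List[float] = []
    for i, t in enumerate(temps):
        if i in bad:
            # Discard the bad reading and close off the record so far.
            if current:
                records.append(current)
            current = []
        else:
            current.append(t)
    if current:
        records.append(current)
    return records

# Made-up series: index 3 is flagged as bad, and the level shifts after it.
temps = [10.1, 10.3, 9.8, 25.0, 12.0, 12.2, 11.9]
pieces = split_at_bad_points(temps, bad={3})
print(pieces)  # [[10.1, 10.3, 9.8], [12.0, 12.2, 11.9]]
```

Nothing from the neighbors enters the output. The neighbors are used, at most, to decide which indices go into `bad`.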
Lastly, a couple of issues with their quality control procedures. They say:
Local outlier filter: We tested for and flagged values that exceeded a locally determined empirical 99.9% threshold for normal climate variation in each record.
and
Regional filter: For each record, the 21 nearest neighbors having at least 5 years of record were located. These were used to estimate a normal pattern of seasonal climate variation. After adjusting for changes in latitude and altitude, each record was compared to its local normal pattern and 99.9% outliers were flagged.
Again, I’d be real, real cautious about these procedures. Since the value in both cases is “locally determined”, there will certainly not be a whole lot of data for analysis. Determination of the 99.9% exceedance level, based solely on a small dataset of Zipf-distributed data, will have huge error margins. Overall, what they propose seems like a procedure guaranteed to convert a Zipf dataset into a Gaussian dataset, and at that point all bets are off …
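To get a feel for how shaky a locally determined 99.9% threshold can be, here is a small simulation sketch. Everything in it is an assumption for illustration: 360 values per record (say 30 years of monthly data), a Pareto distribution with shape 3 as a generic heavy-tailed stand-in (it is not fitted to any temperature data), and a nearest-rank quantile estimate.

```python
import math
import random

def empirical_quantile(xs, q):
    """Nearest-rank empirical quantile of a sample."""
    ordered = sorted(xs)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def threshold_spread(draw, n=360, trials=200, q=0.999, seed=0):
    """Estimate the q-quantile from `trials` independent samples of size n and
    return the 5th and 95th percentiles of those estimates."""
    rng = random.Random(seed)
    estimates = [empirical_quantile([draw(rng) for _ in range(n)], q)
                 for _ in range(trials)]
    return empirical_quantile(estimates, 0.05), empirical_quantile(estimates, 0.95)

heavy_lo, heavy_hi = threshold_spread(lambda rng: rng.paretovariate(3.0))
gauss_lo, gauss_hi = threshold_spread(lambda rng: rng.gauss(0.0, 1.0))
print(f"heavy-tailed 99.9% estimates: {heavy_lo:.2f} to {heavy_hi:.2f}")
print(f"Gaussian 99.9% estimates:     {gauss_lo:.2f} to {gauss_hi:.2f}")
```

Note that with only 360 values, the nearest-rank 99.9% estimate is simply the largest value in the sample, so the "threshold" is set by a single reading. The spread between the two printed numbers for the heavy-tailed case is the error margin I am worried about.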
In addition, once the “normal pattern of seasonal climate variation” is established, how is one to determine what is a 99.9% outlier? The exact details of how this is done make a big difference. I’m not sure I see a clear and clean way to do it, particularly when the seasonal data has been “adjusted for changes in latitude and altitude”. That implies that they are not using anomalies but absolute values, and that always makes things stickier. But they don’t say how they plan to do it …
In closing, I bring all of this up, not to oppose the BEST crew or make them wrong or pick on errors, but to assist them in making their work bulletproof. I am overjoyed that they are doing what they are doing. I bring this up to make their product better by crowd-sourcing ideas and objections to how they plan to analyze the data.
Accordingly, I will ask the assistance of the moderators in politely removing any posts talking about whether BEST will or won’t come up with anything good, or of their motives, or whether the eventual product will be useful, or the preliminary results, or anything extraneous. Just paste in “Snipped – OT” to mark them, if you’d be so kind.
This thread is about how to do the temperature analysis properly, not whether to do it, or the doer’s motives, or whether it is worth doing. Those are all good questions, but not for this thread. Please take all of that to a general thread regarding BEST. This thread is about the mathematical analysis and transformation of the data, and nothing else.
w.

Agree with John Kehr regarding hourly temp readings. These should be included where available as a genuine way of reducing the impact of erroneous outliers. Sure, not every station can provide these, but it won’t be difficult to provide separate results for all stations, for max/min-only stations and for hourly-only stations, to see the difference it makes (if any).
I’m in agreement with Crosspatch, even though I have no knowledge of the areas he is discussing. The geographical area I am most familiar with, the Rodney district in New Zealand, has a temperature profile that looks absolutely bizarre to those unfamiliar with it. It is coastal, has an enormous but shallow harbour with the longest harbour shoreline in the world, lies on a very narrow and often rugged tongue of land and has an unusual number of vastly differing microclimates within a very small geographical area. Attempting to draw any conclusions from temps from adjacent but differing microclimates would seem to be a way to destroy any meaning in the records. As an example, the entire area is in the ‘subtropical’ climate, yet in some microclimates, morning frosts sufficiently severe to freeze exposed water pipes are commonplace, yet other microclimates ‘just over the hill’ are entirely frost-free.
Perhaps I am missing something due to my own ignorance, but I tend to agree with Willis in regard to homogenising of data and in removing outliers. If those outliers are an accurate record of temperature as it happened, removing them is rendering the data incorrect. While I see the BEST initiative as an invaluable exercise, I would be even happier if the same effort was made to ensure high quality data is taken, free from contamination.
I guess I am still a suspicious country boy at heart, as I tend to distrust extreme cleverness with mathematics and statistics employed to ‘get a result’ taken from situations and equipment that are influenced by factors other than those we are attempting to measure. I do not understand the rush to get the BEST thing done, as the world and its climate will be here for a while yet doing its thing, whatever that is.
“If we find data we know to be bad, throw it out. Don’t just replace it with some imaginary number that you think is somehow more homogeneous.”
…and split the series. Allow short series (much shorter than a year) to enable data to be tossed without creating a hole in the series that needs to be filled by imaginary data.
Then the algorithm can be run and rerun with varying criteria on discarding data.
@Willis : “Instead, if I found such an actual anomaly or incorrect data point, I’d just throw out the bad data point, and break the original temperature record in two at that point, and consider it as two different records. Why average it with anything at all? That’s introducing extraneous information into a pristine dataset, what’s the point of that?”
Ahhh yup. Man, that’ll teach me to post mid article…
Actually, if you want to get a real feel for what has happened with real global temps since, let’s say, 1880, look at unadjusted raw RURAL data only, I mean from stations that are STILL rural. I don’t think I have seen ONE anywhere showing any significant warming, has anyone here? In any case it does not matter what BEST comes up with in the end; it is the trend for, say, the next 10 years that will count (yes, even with urban stations), because since 2002 it is already FLAT, i.e. no extra warming as predicted.
As one other poster, above, noted, averages can be deceiving. There is another poster on here who has an excellent analysis of Canadian temperatures. His graphs and analysis show that maximum daily highs (as recorded by land based instruments) have either not been increasing or have been decreasing over a number of decades, while the daily minimums have been increasing. By averaging the temperatures, it appears that Canada is experiencing a warming trend. But this sort of warming trend (where only the daily minimums increase) is actually very good for many reasons. With no increase in daily maximums, is there any harm being felt?
So what I am asking for is if BEST could show both the trend in daily maximums and daily minimums rather than the change in the average temperatures.
I agree with your perspective Willis. They need to keep the real distribution in the data. One can explain and eliminate an outlier, or leave it alone, but don’t change it to something else, include it, and pretend you have the same representation of the data. I am not confident about the choice of confidence limits, those are contingent on the type of data distribution, and people usually default to Gaussian which may not represent nature, as you point out.
Great analysis, Willis. I am with you all the way. And I go farther than you. Climate is local. That is apparent to lovers of the outdoors. For example, the coastline of Florida is considerably cooler than inland areas and the wind is always blowing on the coastline. But the coastline is a permanent feature. Surely it qualifies as climate not weather. Yet how many times will the coastline show up as an outlier in an inland cell? Seems to me that the cell based approach is seriously flawed. Why use cells? If using cells, why make them larger than one mile by one mile? This is the age of computers, after all. Data management has benefited so much from computers that a finer mesh of data really seems required at this time.
I’m kinda new to this stuff. Yes, the homogenization thing jumped out at me also. Some questions:
1. Is homogenization a technique for getting the final data point? Or is it merely a screen to minimize the number of data points that need to be manually examined for plausibility?
2. When homogenization takes place, are they working with actual temperatures, or “anomaly (first difference?) temperatures”?
3. Has any of the input data been OCRed from old records? If so, there really should be a step somewhere to try to detect misreads, e.g. 3s read as 8s, 5s as 6s (the two are virtually indistinguishable in at least one font; I forget which). Some OCR errors are pretty much undetectable even by humans, but errors in the leading digit can be pretty blatant.
3a. Has handwriting recognition technology been used on any of the input data? It has probably improved some since I worked with it tangentially two decades ago, but it was pretty iffy back then (Don’t confuse real time recognition where stroke order information is available with trying to identify the result without stroke order info. The latter is much harder and would be what is needed here).
3b. Is there any provision for flagging records with a high detected error rate as being doubtful?
3c. I know that editing the data seems to be traditional in “climate science”, but if that is really the case, might not this be a good time to break with tradition? Is it feasible to pass through all the data and flag the values that screening says are doubtful so that the end analyst can choose to use them, tweak them, or reject them?
4. The output of this effort is what? A single global temperature? A cleaned up temperature set from which “global temperatures” can be computed?
I don’t think the temperature sets should be looking at max and min data; they should be looking at diurnal variance and be plotted against local cloud cover. Everyone knows anecdotally that if you have a clear day and cloudy night way less heat is lost than with a clear night. What AGW should be proving if true is that the diurnal variation should be lessening as night time heat loss decreases. I have a big problem with the black body theory: part of the world is absorbing heat whilst part of the world is losing heat all the time, and each part of the world is different in terms of land/water/biomass coverage. The kind of simplistic modelling proposed is still way, way off where we need to be; until we get fractal based systems we’re a long way off.
There is another problem in homogenising with “neighbouring” stations, like Straßburg (France) and Karlsruhe (Germany), which are some 70 km apart. The French compute the mean temperature like the US as (Tmax + Tmin)/2, while in Germany it was (T0730 + T1430 + 2*T2130)/4 and now is the mean of 24 hourly measurements (with T0730 being the temperature at 7:30 CET). There is a difference of up to 1K between the French and the German mean daily temperatures. Any homogenisation would introduce a bias.
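A quick sketch shows how the three definitions of "daily mean" can disagree on the very same day. The diurnal curve `temp_at` is entirely made up (a deliberately asymmetric shape chosen for illustration), and the function names are mine; only the three formulas come from the comment above.

```python
import math

def max_min_mean(hourly):
    """(Tmax + Tmin)/2, here taken from the hourly values."""
    return (max(hourly) + min(hourly)) / 2

def fixed_hour_mean(t0730, t1430, t2130):
    """(T0730 + T1430 + 2*T2130)/4, the older German formula quoted above."""
    return (t0730 + t1430 + 2 * t2130) / 4

def hourly_mean(hourly):
    """Plain mean of the 24 hourly readings."""
    return sum(hourly) / len(hourly)

def temp_at(hour):
    """Hypothetical asymmetric diurnal curve in deg C (not real data)."""
    phase = 2 * math.pi * (hour - 9) / 24
    return 10 + 5 * math.sin(phase) + math.cos(2 * phase)

hourly = [temp_at(h) for h in range(24)]
a = max_min_mean(hourly)
b = fixed_hour_mean(temp_at(7.5), temp_at(14.5), temp_at(21.5))
c = hourly_mean(hourly)
print(f"max/min: {a:.2f}  fixed hours: {b:.2f}  24-hour mean: {c:.2f}")
```

For this invented curve the max/min value sits about 1 degree below the 24-hour mean, and the fixed-hour value lands somewhere else again. Which method runs warm or cold depends on the shape of the day, so the offset between two stations using different methods is not a constant you can simply subtract.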
Agree wholeheartedly with the article. In trying to arrive at my own “scientifically justifiable” approach to analyzing global temperature, I saw “homogenization” (for want of a better word) and “data creation” to be the “Achilles heel” of the methodologies already being used.
Honestly, with no attempt to discredit BEST so early in the game, I think “homogenization” represents a “shallowness” of thought or understanding regarding the actual physics, measurement techniques, geographical, and statistical effects bearing on “global temperature change estimation”. What is needed is “deep” thought on this topic.
I had high hopes that BEST would provide a universally “acceptable” product. Hopefully, BEST will reconsider this aspect.
Assuming they will not have the resources/time to dive into each station and its individual history, how about chunking the surface into geographical regions based on the best available inventory of micro-climate and “medium-climate” types? Then, generally treat those regions as independent islands.
Also, I think that they should do runs where they:
a) leave out micro/medium climate areas that don’t have station coverage
b) include those uncovered areas based on some stated, rational method
c) land only
d) ocean only
e) lastly, global
I would really like to see a straight “binning” analysis which defines some metric of warming, *at a particular site*, and then allows one to flexibly bin the data. This way we could ask questions like “what percentage of sites at 30-65 deg latitude have warmed?”, and “what percentage of rural stations have warmed vs. urban stations”.
I’d also like to see the definition of “warming” be flexible. For example one could select traditional temp anomaly, or just average temp for the year, or compare monthly average year over year, or use absolute max temp for the year, absolute min temp, average of monthly max temp, average of monthly min temp – etc…
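The binning idea in this comment can be sketched in a few lines. The site records, field names (`lat`, `rural`, `years`, `temps`) and the default "warming" metric (least-squares slope greater than zero) are all hypothetical; the metric is passed in as a function so it can be swapped for any of the definitions listed above.

```python
def trend_per_year(years, temps):
    """Ordinary least-squares slope of temps against years (deg C per year)."""
    n = len(years)
    ym, tm = sum(years) / n, sum(temps) / n
    num = sum((y - ym) * (t - tm) for y, t in zip(years, temps))
    den = sum((y - ym) ** 2 for y in years)
    return num / den

def percent_warmed(sites, in_bin, metric=trend_per_year):
    """Percentage of sites in the bin whose warming metric is positive.
    Returns None when the bin is empty."""
    selected = [s for s in sites if in_bin(s)]
    if not selected:
        return None
    warmed = sum(1 for s in selected if metric(s["years"], s["temps"]) > 0)
    return 100.0 * warmed / len(selected)

# Four invented sites, three years each.
years = [2000, 2001, 2002]
sites = [
    {"lat": 45, "rural": True,  "years": years, "temps": [10.0, 10.1, 10.2]},
    {"lat": 50, "rural": True,  "years": years, "temps": [9.0, 8.9, 8.8]},
    {"lat": 35, "rural": False, "years": years, "temps": [15.0, 15.2, 15.4]},
    {"lat": 10, "rural": False, "years": years, "temps": [25.0, 25.1, 25.3]},
]
midlat = percent_warmed(sites, lambda s: 30 <= s["lat"] <= 65)
rural = percent_warmed(sites, lambda s: s["rural"])
urban = percent_warmed(sites, lambda s: not s["rural"])
print(f"30-65 deg: {midlat:.1f}%  rural: {rural:.1f}%  urban: {urban:.1f}%")
```

Each site is judged only against its own history here, so no gridding, infilling or neighbor adjustment is needed to answer the question.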
Willis – I agree with you completely.
In particular, I am very much against homogenization.
I know that you are aware of the Australian Bureau of Meteorology “High Quality” datasets, which produce, via very sophisticated statistics, a very sharply rising Australian temperature map.
This has been built on the top of raw data that has very little trend, except for some UHI here and there.
Please forward your criticisms to BEST if you haven’t yet done so, as by and large, they seem to be trying to produce an honest series.
If BEST can get the method right then eventually the BOM can be induced to follow their lead.
BOM have for too long followed the siren call of the IPCC.
Time to change ships before it’s too late.
Thanks for a useful post, Willis – I admit to (customary) laziness in not checking through BEST’s campaign plan, and share the general unease with “normalisation”.
I think Stephen Richards comes closest to my immediate reaction. If they are proposing to release a “massaged” dataset, then they should also release (a) all the raw data exactly as received, with just an indication of where cuts into data packets have been made, and (b) full information on what changes have been made where, and why.
In short, all the original data and details which the Hockey Team prefer to keep under the carpet, so that all the information is freely available for inspection and discussion by anyone. Not just plots, either – the original figures are the foundation of the entire structure built on them, so should be there “in person” in great chunks of CSV or whatever for complete access.
Like that uoguelph.ca link above, which ends with:
DATA and CODE:
* The programs use (whatever software, algorithms etc.)
* The zipped archive is (link) here.
– like it oughta.
After reading Anthony, Willis, and most of the comments, it is very hard to see why there is so much excitement about the scientific nature of this project. If valid, unadjusted raw data in various regions has not been identified (and hasn’t some of it been destroyed?), how can any analysis of that data be anything but worthless? And homogenization, come on. “Global climate temperature” is not possible, but overall warming and cooling in the various regions does seem doable. Why sign on to this project?
The homogenization also would seem to obscure the noise inherent in the data. If 10 nearby stations all have small DC offsets, say up to 0.5C, you just can’t arbitrarily remove those, find the standard deviation (assuming a Gaussian distribution of errors, which just can’t be right), then quote some absurdly small uncertainty in the temperatures. And you can’t then average a thousand such areas and quote an even smaller uncertainty.
There has to be some, ahem, “robust” accounting for the fact that you must have significant uncertainty in finding “local” average temps, and that doesn’t really become smaller like sqrt(number of stations), since the errors are a combination of many sources, some non-random and non-Gaussian. The homogenization obscures this.
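A back-of-envelope sketch of why the sqrt(N) shrinkage fails: suppose, as a simplifying assumption, that each station's error is the sum of an independent random part and a part shared by every station in the area. The variance of the average is then sigma_random^2/N + sigma_shared^2, and only the first term shrinks. The 0.5 figure echoes the offsets mentioned above; the 0.2 shared error is invented.

```python
import math

def mean_uncertainty(n_stations, sigma_random, sigma_shared):
    """Standard deviation of the average of n station readings when each has an
    independent error (sigma_random) plus an error common to all stations
    (sigma_shared). Both sigmas are in deg C."""
    return math.sqrt(sigma_random ** 2 / n_stations + sigma_shared ** 2)

for n in (1, 10, 100, 1000):
    naive = 0.5 / math.sqrt(n)                # pretends every error averages out
    actual = mean_uncertainty(n, 0.5, 0.2)    # with an assumed shared 0.2 C error
    print(f"N={n:5d}  naive: {naive:.3f}  with shared error: {actual:.3f}")
```

However many stations are averaged, the result never drops below the shared 0.2, while the naive figure keeps heading toward zero. Real station errors are messier than this two-part split, which only makes the naive number less believable.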
Jeff
I worked for many years in acoustics and phonetics research, building electronic gadgets and writing software to perform experiments.
I know one thing: In any honest scientific field, you don’t “homogenize” anything. When one subject shows responses that indicate he can’t do the task, or misunderstands the instructions, you don’t try to “pull” his responses toward other similar subjects. You toss him out. If you have to toss so many subjects that your total data set ends up unusable, you toss out the whole experiment and try something different.
I question how good any homogenization technique will be in detecting errors that creep in over decades, such as micro-site contamination from growing plants or urban encroachment.
If the nearby city is big enough, encroaching urbanization could easily affect almost all of the sensors in a given region. In such a case, would homogenization adjust the one or two sensors that aren’t being encroached to better match those that are?
Willis,
I live near the water on Georgian Bay, Ontario, Canada.
We have a 50′ hill all around this area of shore line.
The other day it was 4 degrees C colder at the bottom of the hill than at the top when I was driving home.
Ask me if I trust temperatures on a global scale…NOT!
Is it reasonable to do the homogenization as the last step and then show the with/without delta so that we might appreciate what its effect is? What if the effect is insignificant? If it is significant, then the “outliers” really must be identified and qualified, mustn’t they?
Data are. They are readings taken from instruments. Good data are data taken from properly selected, properly calibrated, properly sited, properly installed, properly maintained instruments. All other data are bad data, for whatever reasons. Missing data are simply missing.
Bad data cannot be converted into good “data” by gridding, infilling, adjusting, homogenizing, pasteurizing, folding, bending, spindling or mutilating the bad data. Missing data cannot be magically, mystically “appeared” any more than good data should be “disappeared”.
Data which have been adjusted are no longer data. I don’t know what they are, only what they are not. In this context, however, they are typically referred to as a “global temperature record”.