Problems With The Scalpel Method

Guest Post by Willis Eschenbach

In an insightful post at WUWT, Bob Dedekind talked about a problem with temperature adjustments. He pointed out that stations are maintained, by doing things like periodically cutting back encroaching trees or painting the Stevenson screen. He noted that if we try to “homogenize” these stations, we get an erroneous result. This led me to a consideration of the “scalpel method” used by the Berkeley Earth folks to correct discontinuities in the temperature record.

The underlying problem is that most temperature records have discontinuities. There are station moves, instrument changes, routine maintenance, and the like. As a result, the raw data may not reflect the actual temperatures.

There are a variety of ways to deal with that, which are grouped under the rubric of “homogenization”. A temperature dataset is said to be “homogenized” when all effects other than temperature effects have been removed from the data.

The method that I’ve recommended in the past is called the “scalpel method”. To see how it works, suppose there is a station move. The scalpel method cuts the data at the time of the move, and simply considers it as two station records, one at the original location, and one at the new location. What’s not to like? Well, here’s what I posted over at that thread. The Berkeley Earth dataset is homogenized by the scalpel method, and both Zeke Hausfather and Steven Mosher have assisted the Berkeley folks in their work. Both of them had commented on Bob’s post, so I asked them the following.

Mosh and/or Zeke, Stephen Rasey above and Bob Dedekind in the head post raise several points that I hadn’t considered. Let me summarize them; they can correct me if I’m wrong.

• In any kind of sawtooth-shaped wave of a temperature record subject to periodic or episodic maintenance or change, e.g. painting a Stevenson screen, the most accurate measurements are those immediately following the change. Following that, there is a gradual drift in the temperature until the following maintenance.

• Since the Berkeley Earth “scalpel” method would slice these into separate records at the time of the discontinuities caused by the maintenance, it throws away the trend correction information obtained at the time when the episodic maintenance removes the instrumental drift from the record.

• As a result, the scalpel method “bakes in” the gradual drift that occurs in between the corrections.

Now this makes perfect sense to me. You can see what would happen with a thought experiment. If we have a bunch of trendless sawtooth waves of varying frequencies, and we chop them at their respective discontinuities, average their first differences, and cumulatively sum the averages, we will get a strong positive trend despite the fact that there is absolutely no trend in the sawtooth waves themselves.
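That chopping procedure is easy to sketch in a few lines of code. This is a toy simulation with arbitrary periods and record length, not Berkeley Earth’s actual pipeline:

```python
import numpy as np

# Toy version of the thought experiment: trendless sawtooth "stations"
# with different maintenance intervals, cut at each discontinuity.
periods = (37, 53, 61, 89, 113)          # months between maintenance events
n = 1200                                  # record length in months
t = np.arange(n)

# Each record drifts up by 1 over its period, then resets to zero: no trend.
records = [(t % p) / p for p in periods]

# Scalpel: cutting at each reset discards the downward correction jump,
# leaving only the within-segment (drifting) first differences.
diffs = []
for rec in records:
    d = np.diff(rec)
    d[d < 0] = np.nan                     # the cut removes the reset step
    diffs.append(d)

# Average the surviving first differences across stations and cumulate.
mean_diff = np.nanmean(np.vstack(diffs), axis=0)
trend = np.cumsum(mean_diff)

print(round(trend[-1], 1))                # strongly positive, yet no input has a trend
```

The cumulative sum climbs steadily even though every input record is bounded between 0 and 1.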

So I’d like to know if and how the “scalpel” method avoids this problem … because I sure can’t think of a way to avoid it.

In your reply, please consider that I have long thought and written that the scalpel method was the best of a bad lot of methods; all methods have problems, but I thought the scalpel method avoided most of them … so don’t thump me on the head, I’m only the messenger here.


Unfortunately, it seems that they’d stopped reading the post by that point, as I got no answer. So I’m here to ask it again …

My best to both Zeke and Mosh, whom I have no intention of putting on the spot. It’s just that, as a long-time advocate of the scalpel method myself, I’d like to know the answer before I continue to support it.

Regards to all,




What happens if you keep the scalpel corrections as signed increments to the error bands?

Any data handling method that can produce a positive temperature trend is highly sought after amongst CAGW supporters in these days of no obvious warming. TOBS, homogenisation, UHI, relocation and loss of sites can all be pressed into the service of The Cause in some way or another. There is no consideration of Scientific Method here, it is now all just politics.

Global cooling

Link to Dedekind’s post does not work.
Do you assume that the drift is always positive: a tree creating a shadow, new asphalt, new buildings, fading paint, …
Is it possible to find a signal that is smaller than the measurement errors? This is the big question of the temperature records.


The same problem as exposed for the GISS pair-wise correction: the method cannot catch gradual drift followed by step-wise correction. Thus it introduces bias into signals that have no bias.
However, since the bias tends to introduce warming, and warming was expected, the error went undetected by the programmers.
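A minimal sketch of why the failure mode arises. This uses a toy difference-of-means breakpoint test, not the actual GISS or BEST algorithm: the detector fires on the corrective step, while the slow drift before it looks perfectly continuous.

```python
import numpy as np

# Gradual drift followed by a step-wise correction back to truth.
n = 200
signal = np.concatenate([0.01 * np.arange(100), np.zeros(100)])

def step_score(x, i, w=20):
    """Difference of means in the windows just after vs. just before i."""
    return abs(x[i:i + w].mean() - x[i - w:i].mean())

scores = [step_score(signal, i) for i in range(20, n - 20)]
best = 20 + int(np.argmax(scores))

# The detector flags the maintenance correction (near i = 100) as the "break";
# the 100 months of slow drift before it never trigger it.
print(best)
```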


no conspiracy is required. programmers never look for errors when the data gives the expected result. that is how testing is done most of the time. you only look for problems when you don’t get the answer you expect. so, if you expect warming, you only look for bugs when the results don’t show warming. as a result, bugs that cause warming are not likely to be found – at least not by the folks writing the code.


“the most accurate measurements are those immediately following the change. Following that, there is a gradual drift in the temperature until the following maintenance.”
Could we create a subset containing only the first measurement from each change, if these are the most accurate?
Much fewer measurements, but illuminating *if* the trend differs from using all of the measurements.
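One way to sketch that suggestion on synthetic data. The maintenance dates are assumed known, and the record length, drift rate, and noise level are all made up:

```python
import numpy as np

rng = np.random.default_rng(3)

# Trendless synthetic record: drift between known maintenance months, plus noise.
n, period, drift = 240, 24, 0.01
t = np.arange(n)
raw = drift * (t % period) + rng.normal(0, 0.02, n)

resets = np.arange(0, n, period)              # month of each maintenance event
subset = raw[resets]                          # first post-maintenance readings only
subset_trend = np.polyfit(resets, subset, 1)[0]

# Scalpel-style reconstruction: zero out the correction jumps, cumulate the rest.
d = np.diff(raw)
d[np.isin(t[1:], resets)] = 0.0
scalpel = np.concatenate(([0.0], np.cumsum(d)))
scalpel_trend = np.polyfit(t, scalpel, 1)[0]

print(subset_trend, scalpel_trend)            # subset ~ 0; scalpel strongly positive
```

On this toy record the first-readings subset recovers the true (zero) trend, while the spliced reconstruction inherits the full drift rate.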

Geoff Sherrington

steverichards1984 says: June 28, 2014 at 11:39 pm
subset containing the first measurement only….
Yes, please. I’ve often advocated a look at a large temperature subset composed of the first 5 years of operation only, of a new or relocated station. I lack the means but I promote the idea.
More philosophically, it is interesting how the principle of ‘adjustment’ has grown so much in climate work. It’s rather alien to most other fields that I know. I wonder why climate people put themselves in this position of allowing subjectivity to override what is probably good original data in so many cases.


That is a well-known and long-standing problem.
I don’t think there is much enthusiasm for any improvement among those invested in AGW.
Halving the temperature trends over land may give a better guess than BEST. That matches better with McKitrick’s paper, with Watts’ draft, and with the lower-troposphere satellite trends.

Dr Burns

“…a temperature record subject to periodic or episodic maintenance or change…”. Have any tests been done to determine the magnitude of changes such as repainting compared to daily dirt and dust build-up? I have a white car which progressively becomes quite grey until a good rain storm. I would imagine such dirt build-up could have a significant effect on a Stevenson screen between rain storms.


Splitting a record at a breakpoint has the same effect as correcting the breakpoint. If the breakpoint was caused by station maintenance or other phenomena that RESTORES earlier observing conditions after a period of gradually increasing bias, correcting the breakpoint or splitting the record will preserve the biased trend and eliminate a needed correction. If a breakpoint is caused by a change in TOB, the breakpoint needs to be corrected or the record needs to be split to eliminate the discontinuity. If a breakpoint is caused by a station move, we can’t be sure whether we should correct it or leave it alone. If the station was moved because of a gradually increasing [urban?] bias and the station was moved to an observing location similar to the station’s early location, correcting the breakpoint will preserve the period of increasing urban bias. If the station wasn’t moved because the observing site was degrading, then correction is probably warranted.
Without definitive metadata, one can’t be sure which course is best. However, only one large shift per station which cools the past can be attributed to a change in TOB, along with any pairs of offsetting large shifts. All other undocumented corrections should probably be included in the uncertainty of the observed trend (corrected for documented biases). For example, global warming in the 20th century amounted to 0.6 degC (observed change after correcting for documented artifacts) to 0.8 degC (after correcting all apparent artifacts).

richard verney

It is clear beyond doubt (see for example the recent articles on Steve Goddard’s claim regarding missing data and infilling, and the poor siting issues that the surface stations survey highlighted) that the land-based thermometer record is not fit for purpose. Indeed, it never could be, since it has always been strained well beyond its original design purpose. The margins of error far exceed the very small signal that we are seeking to tease out of it.
If climate scientists were ‘honest’ they would long ago have given up on the land-based thermometer record and accepted that the margins of error are so large that it is useless for the purposes to which they are trying to put it. An honest assessment of that record leads one to conclude that we do not know whether it is warmer today than it was in the 1880s or in the 1930s; but as far as the US is concerned, it was probably warmer in the 1930s than it is today.
The only reliable instrument temperature record is the satellite record, and that also has a few issues, and most notably the data length is presently way too short to be able to have confidence in what it reveals.
That said, there is no first-order correlation between the atmospheric level of CO2 and temperature. The proper interpretation of the satellite record is that there is no linear temperature trend, merely a one-off step change in temperature in and around the Super El Nino of 1998.
Since no one suggests that the Super El Nino was caused by the then-present level of CO2 in the atmosphere, and since there is no known or understood mechanism whereby CO2 could cause such an El Nino, the take-home conclusion from the satellite data record is that climate sensitivity to CO2 is so small (at current levels, i.e., circa 360 ppm and above) that it cannot be measured using our best and most advanced and sophisticated measuring devices. The signal, if any, of CO2 cannot be separated from the noise of natural variability.
I have always observed that talking about climate sensitivity is futile, at any rate until such time as absolutely everything is known and understood about natural variation, what are its constituent forcings and what are the lower and upper bounds of each and every constituent forcing that goes to make up natural variation.
Since the only reliable observational evidence suggests that sensitivity to CO2 is so small, it is time to completely re-evaluate some of the cornerstones upon which the AGW hypothesis is built. The hypothesis is at odds with the only reliable observational evidence (albeit that data set is too short to give complete confidence), and that suggests that something fundamental is wrong with the conjecture.

richard verney

Per Willis
“…As a result, the raw data may not reflect the actual temperatures….”
Wrong; the raw data is the actual temperature at the location where the raw data is measured.
What you mean is that some factors may be at work which mean that the actual temperature measured (i.e., the raw data) should not be regarded as representative of temperatures, because it has been distorted (upwards or downwards) by some extrinsic factor (in which I include changes in the condition of the screen, instrumentation, and TOBs, as well as more external factors such as changes in vegetation, nearby building, etc.).

Willis Eschenbach

Global cooling says:
June 28, 2014 at 11:03 pm

Link to Dedekind’s post does not work.

Thanks, fixed.

Do you assume that the drift is always positive: a tree creating a shadow, new asphalt, new buildings, fading paint, …

While in theory the jumps should be equally positive or negative, most human activities tend to raise the local temperature. In particular the growth of the cities has led to UHI. As a result, when many of the weather stations moved to nearby airports after WWII, there would be a sharp cooling of the record.
In addition, if you just leave a met station alone, the aging of the paint and the growth of surrounding vegetation cutting out the wind both tend to warm the results.
However, there are changes that cool the station, so yes, the jumps will go both ways. But that doesn’t fix the problem. The scalpel method is removing the very information we need to keep from going wrong.

Is it possible to find signal that is smaller than the measurement errors. This is the big question of the temperature records.

Mmmm … in theory, sure. Signal engineers do it every day. But in the temperature records? Who knows.

richard verney

I don’t understand this constant fiddling with data.
Consider: you perform an experiment and you get some results (i.e., the actual results of the experiment, which is the raw data that you have found). You then interpret these results and set out your findings and conclusions (which will by necessity discuss the reliability and margins of error of the actual results of the experiment). But you never substitute the actual results of the experiment with your own interpreted results, and claim that your own interpreted results are the actual results of the experiment conducted.
When someone seeks to replicate the experiment, they are seeking to replicate whether the same raw data is achieved. When you seek to review an earlier performed experiment, two distinct issues arise;
1. Does the replicated experiment produce the same raw data?
2. Does the interpretation which the previous experimenter gave to the findings withstand scientific scrutiny, or is there a different (possibly better) interpretation of the raw data?
These should never be confused.
The raw data should always remain collated and archived so that those coming after can consider what they think the raw data shows. Given advances in technology and understanding, later generations may well have a very different take on what the raw data is telling us. Unfortunately it appears that much of the original unadjusted raw data, on a global basis, is no longer available.
If we cannot detect the effects of UHI in the land-based thermometer record, given that UHI is a huge signal, that urbanisation has crept over the years, and that station drop-outs have emphasised urban stations over truly rural stations, there is no prospect of seeing the far weaker signal of CO2.

richard verney

Willis Eschenbach says:
June 29, 2014 at 1:03 am
Common sense would suggest that the majority of recent adjustments should be to cool recent measurements.
Given the effects of UHI and urban sprawl, the switch to airports, etc. (with more aircraft traffic and more powerful jet engines compared to props) these past 40 or so years, the expectation would be that adjustments to the ‘record’ from 2014 back to say 1970 should lower those measurements (since the raw data would be artificially too high due to warming pollution by UHI etc).
Yet despite this being the common-sense position, it appears that the reverse is happening. Why?

On Nov 1, 2011, Steve McIntyre writes:

One obvious diagnostic that BEST did not provide – presumably because of their undue haste in the answer – is a histogram of the thousands of steps eliminated in the slicing. If they provide this, I predict that the histogram will show the bias that I postulate. (This is pretty much guaranteed since the BEST trend is greater than CRU trend on GHCN data.)

That was a while ago. Has BEST published such a histogram of breakpoint trend offsets?
On the majority of BEST stations and their breakpoints I investigate, I am appalled at the shortness of the segments produced by the scalpel. We’ve just seen Luling, TX. I’ve written about Stapleton Airport in Denver, CO, where the BEST breakpoints do not match airport expansion events, yet BEST misses the opening and closing of the airport!!!
People, I’ll accept a breakpoint in a station if it is a move with a significant elevation change. No airport in the world could have a move breakpoint based on elevation change. I’ll grant you that moving a temperature station at LAX from the east end of the runway to the west end might warrant a breakpoint. But a climate change within the bounds of an airport is the exception, not the rule. Let us see from BEST how many AIRPORT stations have c(0,1,2,3,4,…) breakpoints in their history. I bet 90% of them make no sense. If there is a station move WITHIN an airport, and it is for microsite conditions, it does not deserve a break. If it is for maintenance, it does not deserve a break. If it is to move it away from an expanding terminal, it does not deserve a break. If it is moved next to the ocean, OK, give it a breakpoint. How often does that happen? According to BEST it happens all the time.

Peter Azlac

Richard Verney says:
“If we cannot detect the effects of UHI in the land based thermometer record, given that UHI is a huge signal and we know that over the years urbanization has crept and that station drops outs have emphasized urban stations over truly rural stations, there is no prospect of seeing the far weaker signal of CO2.”
Perhaps the answer is to follow the advice of John Daly and use only those ‘rural’ stations with a long record in areas where the response to CO2 (if discernible) will be at its highest and competition with water vapour at its lowest – in the Arctic and Antarctic regions, where the low temperatures give the highest IR output in the region where CO2 has its highest absorption bands. Given that it has been stated that we only need 50 or so records to give an accurate GTA, he offers 60-plus sites that meet the criteria of being minimally affected by UHI effects, most of which do not show any long-term warming trend, with many showing cooling.

I agree completely!

Berényi Péter

Temporal UHI (Urban Heat Island) effect is another continuous drift, which is why BEST was unable to identify it. It also goes in the positive direction because, although the population explosion was already over two decades ago (global population below age 15 is no longer increasing), population density, along with economic activity, keeps increasing in most places, due to increasing life expectancy and ongoing economic development. Which is a good thing, after all.
It is far from being negligible, UHI bias alone can be as much as half the trend or more.

I think a reasonable request of BEST is to produce a graph:
X: Number of Years
Y: Number of station segments whose length >= X
By lines: For Years = (1902, 2012, by=5). That would be about 21 curves.
That would let us easily see how many of the 40,000 stations BEST claims have a segment length of say 40 years for 2002. Or >= 20 years for 1992. I think most observers would be shocked at how few such segments remain in the record.
We made this request to Zeke toward the end of WUWT Video Zeke Hausfather Explains…, Dec. 18, 2013.

Thanks, Zeke. Not only is the number of stations important, but the Length of the usable records is important. I am not the only one who would like to see a distribution of station lengths between breakpoints at points in time or across the whole dataset.

Richard H. and I made some very rough estimates from the raw files. BEST has this data at hand. They could post it tomorrow, if they wanted to.
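For what it’s worth, the requested census is cheap to compute once segment lengths are in hand. Here is a simplified single-curve version on made-up lengths; the exponential distribution and its scale are pure placeholders, since the real distribution would have to come from BEST’s files:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical post-scalpel segment lengths in years -- stand-ins only;
# the real numbers would come from BEST's published station files.
segment_lengths = rng.exponential(scale=8.0, size=40_000)

# Survival census: for each X, how many segments are at least X years long?
xs = np.arange(0, 61, 5)
survivors = np.array([(segment_lengths >= x).sum() for x in xs])

for x, s in zip(xs, survivors):
    print(f">= {x:2d} yr: {s:6d}")
```

The by-year family of curves Rasey asks for is the same computation restricted to segments covering each sampled year.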


Richard Verney
I always appreciate your comments. As I commented on the other thread
‘It is absurd that a global policy is being decided by our governments on the basis that they think we know to a considerable degree of accuracy the global temperature of land and ocean over the last 150 years.
Sometimes those producing important data really need to use the words ‘ very approximately’ and ‘roughly’ and ‘there are numerous caveats’ or even ‘we don’t really know.’
Tony Brown -Climate Reason

In my 2:21 post I mentioned that Richard H and I investigated some data on station length from the raw files BEST listed in an index. This is a plot of stations of greater length than X (for the entire life of the dataset). There are three curves based upon the percent of missing data.
The total number of sites was 35,000,
but only 10,000 of them had less than 10% missing data
and only 3000 of them had <10% missing data and greater than 38 years long.

Stephen Richards

I cannot remember the number of times I have written on blogs that ” IT IS TOTALLY SCIENTIFICALLY UNACCEPTABLE TO ALTER PAST DATA” unless you have good, scientific and mathematical analysis to allow you to do so without any doubt or favour.


The problem is an experimental error. Trying to fix an experimental area after the experiment using a statistical/processing methodology seems superficial. Fixing the flaw in the experimental design and rerunning the experiment is the way to go. It would seem better to remove stations that have moved or have had significant changes in land use. This may not be an option in most parts of the world but the US record may act as a benchmark for how one should assess global records.

son of mulder

Stephen Richards says:
June 29, 2014 at 2:50 am
Absolutely right. If there are perceived issues with historical data, then do analysis on the raw data to show things, e.g. whether measured rural temperature rises more slowly than airport or city data (if it does).
Do they even still have the historical raw data? As I understand it, lots of stations are no longer included in the analysis; there are a myriad of ways of deselecting stations to create any sort of trend you may wish. At least with raw data you are measuring what we experience, and it is what we experience that determines whether “things are getting worse or better”. Signs of any “Thermogeddon” would appear in raw data with more certainty than in treated data.

richard verney

tonyb says:
June 29, 2014 at 2:21 am
Whilst I do not like proxies, I consider that the historical and archaeological records provide some of the best proxy evidence available. At least we are dealing with fact, even if it is relative and not absolutely quantitative in nature. I consider that climate scientists could do with studying history. Somehow the idea that there was some form of climate stasis prior to the industrial revolution has taken root, simply because people did not live through prior times and have no insight into the history of those times.
I applaud your reconstruction of CET, since not only does it provide us with more data, it is a valuable insight into the history of those times and what people thought of their history. Whilst I am not a fan of the globalisation of climate (I consider that, in the terms we are talking about, it is inherently regional in nature), there is no particular reason why CET should not be a good trend marker for the Northern Hemisphere, particularly the near Northern European Continent. So your reconstruction is an extremely valuable tool.
A point to ponder on. I consider that the next 10 to 12 years could be compelling. IF there is no super El Nino event, and IF temperatures were to cool at about 0.1 to 0.12degC per decade (and as you know from CET there has been a significant fall in temperatures since 2000, especially the ‘winter’ period), then we will be in a remarkable position.
I know that this contains some IFs, but if it were to come to pass, we would find ourselves (according to the satellite data) at about the same temperature as 1979. There would, in this scenario, be no warming during the entirety of the satellite record, and during this period about 80% of all manmade CO2 emissions would have taken place (ie., the emissions from 1979 to say 2025).
I do not think that many people appreciate that this would be the resulting scenario of what would be quite a modest drop in temperature, and the knock-on consequence is huge, since it would make it very difficult to argue that manmade CO2 emissions are significant (if 80% of all such emissions have resulted in no measured increase at all).

Addendum to my 2:45 am post:
The total number of sites was 35,000
but only 10,000 of them had less than 10% missing data
and only 3000 of them had <10% missing data and greater than 38 years long.

To be clear, these numbers came from a superficial look at the raw data files Zeke and Mosher provide links to in an index page. These values are prior to the use of the BEST scalpel. BEST has ready and fast access to post-scalpel segment lengths. Constructing our suggested census plots by sampled year should be easy, and no burden compared to other processing they do.
Another interesting chart that Richard H and I briefly explored is a census map of a 1×1 degree grid based upon post-scalpel segment length. For instance, color-code the 1×1 deg cells by the number of segments that are longer than 20 years and cover the year 2005. I hypothesize it would be a pretty blank map. 2×2 deg cells? (That is about 100×125-mile cells.)


ferdberple says:
June 28, 2014 at 11:26 pm
Us component/PCB/system test and reliability engineers tried to break things without exceeding the specification. Once a problem was identified, one of the Not Invented Here / it’s-not-our-hardware responses was “that condition will never happen in the field”. Test response: “How do you know?” Design answer: “Because I/we designed it.” Test engineer: “You designed in this fault. How do you KNOW?” Designer: “Yes, but…”
We’re now in the “yes, but” scenario.
Willis’ question is unanswerable, as the error is designed in, and will happen in the field.


“the most accurate measurements are those immediately following the change. Following that, there is a gradual drift in the temperature until the following maintenance.”
The drift is always positive; that’s why it doesn’t work. If the drift were random, it would work.

Bloke down the pub

Didn’t Anthony have an experiment running to test Stevenson screens under different maintenance regimes? Is there anything from those tests that would be of use in this discussion?

Bill Illis

The breakpoints should be equally distributed between positive and negative adjustments of roughly the same magnitude.
And the positive and negative breakpoints should be equally distributed through time. In 1910, there should be 200 negative breakpoints and 200 positive adjustments of roughly the same magnitude. And this should be roughly consistent from the start of the record to the end of the record.
This is how bias can be detected. Simple histograms. And then we can determine whether the excess of negative adjustments in 1944, for example, is valid. What is the main reason for it?
There should not be a trend through time in the breakpoints unless it can be proven that they should vary through time.
BEST hasn’t shown this, that I am aware of. The description of the simple breakpoint/scalpel method suggests that the breakpoints should be random through time. Are they? Why haven’t they shown this and explained it?
We know the NCDC has shown their adjustments have a systematic trend through time, with a maximum negative adjustment of -0.35C to -0.55C in the 1930s to 1940s. Why? Does the TOBs adjustment vary through time? Why? Shouldn’t the other changes vary randomly through time? Why not?
Should people be allowed to change the data without fully explaining it?
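The sign-balance diagnostic described above is trivial to compute once per-breakpoint offsets are published. A sketch on made-up adjustments; the counts, dates, and magnitudes here are invented, not BEST’s:

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up breakpoint adjustments; in an unbiased record the signs should
# balance in every era, and the per-decade net should hover near zero.
year = rng.integers(1900, 2014, size=5000)
adj = rng.normal(loc=0.0, scale=0.3, size=5000)

# Simple bias diagnostic: sign balance and net adjustment per decade.
for decade in range(1900, 2010, 10):
    m = (year >= decade) & (year < decade + 10)
    pos, neg = (adj[m] > 0).sum(), (adj[m] < 0).sum()
    print(f"{decade}s: {pos:3d} up, {neg:3d} down, net {adj[m].sum():+.2f} C")
```

A decade whose net departs far from zero, or whose up/down counts are badly lopsided, is exactly the kind of thing this histogram would flag.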

Dub Dublin

FWIW, I designed and built a precision Web/Internet enabled temperature monitor system about a decade ago (selling them to customers like HP Research and the Navy Metrology Lab), and although we never built a version for outdoor environmental monitoring and data acquisition, I did do research on things like radiation shields/Stevenson screens and the like, thinking we might sell to that market.
One of the more interesting things I discovered was that some white paints (Krylon flat white spray paint being one of them) can actually have a far higher emissivity in the infrared spectrum than most other colors, including some *black* paints. (I don’t have the link handy, but I remember one source of this was a NASA document comparing the IR emissivity of a fairly large number of coatings. Krylon’s aerosol flat white was among the paints with the very highest IR emissivity, which, of course, also means it’s near the top in IR absorption.)
Moral: Visible ain’t IR, and your eyes can’t tell by looking whether that paint absorbs or reflects heat. The fact that the really high emissivity was for flat white paint calls into question whether and how weathering/aging might dramatically increase the thermal absorption of white paints or even molded plastic radiation shields over time, and hints that glossiness is at least as important as color. Every paint I’ve ever encountered tends to get flat and/or chalky over time as it ages and oxidizes. As a result, repainting a shield could either raise or lower the temperature inside! If anyone were *really* interested in actual climate science, this would be a topic of research, but the Global Warming narrative is better served by ignoring it, so don’t hold your breath. One more reason why climate science and good temperature measurements are harder than they appear.
(BTW, most temp sensors and instruments, even quite a few of the expensive ones, give pretty crappy readings that are frequently subject to offset errors of a degree or more (C or F, take your pick). Thermocouples are especially problematic, as almost no one actually gets all the intermediate junction stuff and cold junction compensation right. Some systems I’ve seen correlate to the temp of the galvanized pole they’re mounted on better than they do to ambient air temp. (Further, I’m amazed at how many so-called high-precision temperature measurement systems ship TCs with the ends twisted together rather than properly welded.) I prefer platinum RTDs for accurate temp measurements, but doing that right requires precision excitation power supplies and bridge completion resistors that are stable across your entire temp range over time. These things are expensive and no one wants to pay for them. Bottom line: accurately and precisely measuring temperature electronically is much harder than it appears, and it’s often done very poorly. I strongly suspect that hundred-year-old temperature records hand-recorded from mercury thermometers were dramatically more accurate and consistent than what we’re getting today.)


Pick out a set of 100 sites that give a quasi satisfactory spread of locations in contiguous USA.
Sites that can be maintained well and not on top of mountains.
Sites that have TOBS perfect
Exclude Hawaii and Alaska.
Accept that it is a proxy and not perfect.
Use 50 of the sites for the accepted temperature record.
Use 50 adjacent sites as potential back ups if sites go down.
Put up a caveat that this temp is reconstructed using real raw data, with * marking the number of substitute sites used.
Allow Nick Stokes to use the anomalies from this method to do his sums, as he cannot understand that anomalies are just the variation from the absolute at each site.
When problems arise, infill the missing days from the average for that day over all past years.
Put a caveat up that the record has been amended this way for that year.
Wash, Rinse, Repeat.

Don K

Willis — I think you’re basically right. There’s a fair amount of subtlety to all this, and I wouldn’t bet much money on you (us) being correct, but I don’t see any obvious flaws in your discussion.
One minor point made by others, probably not all the biases increase over time. For example, vegetation growth (corrected by occasional trimming) would probably be a negative bias. But I suspect most of the biases are positive. (Does the sum of all biases that change over time = UHI?)
It occurs to me that if this were a physical system we were designing where we had a great deal of control over the system and the measurement procedures, we’d probably try to design all the time variant biases out. If we couldn’t do that, we’d likely have a calibration procedure we ran from time to time so we could estimate the biases and correct them out. What we (probably) wouldn’t do was ignore calibration and try to correct the biases out based on the measurements with no supporting data on the biases. But that seems to be what we’re forced to do with climate data. I’m not smart enough to know for sure that can’t be done, but I wouldn’t have the slightest idea how to go about doing it.
Anyway, good luck with this. I’m sure you’ll let us know how it works out.


Oh I see, so they check each station to see if it was a physical change…or weather
…and if it’s weather, they don’t splice it

“A temperature dataset is said to be “homogenized” when all effects other than temperature effects have been removed from the data.”
Interesting. The thermometer only reads the temperature. So, how does one determine, site-by-site, which “other than temperature” effects should be removed?
I’m in agreement with those who say that only a site that doesn’t have any other than temperature effects should be used.
I suspect we could look at any improperly sited station and spend days discussing the possible adjustment(s) necessary to make the data for that station reasonably correct.
I’m making a huge assumption: that the folks here ultimately would agree on the adjustments for that station. I’m of the opinion that we could not.
So, if we can’t get one right…


SandyInLimousin says:
June 29, 2014 at 3:47 am
I routinely get systems designers telling me that the odds of a particular event are billions to one against, so we can safely ignore it in the design. Then I show them, with a simple data-mining exercise, that we routinely see many of these billion-to-one events. Most people are very poor at estimating events when it is in their favor to estimate poorly.


BEST must have the data showing how much the offset was on each slice. These should average out to zero if the slicing is bias-free. It is my understanding that BEST has not made this information available, despite being asked. Why not?


Correcting slice bias is not difficult. Add up all the offsets from slicing; some will be positive and some negative. Whatever the residual, add the inverse into the final result to remove any trend created by slicing. But without knowing the offsets due to slicing, there is no way to know whether it introduced a trend or not.
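The offset bookkeeping described here can be illustrated with a toy series (all numbers invented): a flat "climate" whose instrument drifts warm and is reset to true by maintenance. Cutting at the resets and realigning the segments bakes the drift in as a trend; adding the sum of the slice offsets back in recovers the raw endpoint:

```python
# Flat true climate; instrument drifts +0.01 per step, reset to zero every 100 steps.
drift, period, n = 0.01, 100, 1000
raw = [(i % period) * drift for i in range(n)]  # sawtooth, no long-term trend

spliced, offsets, shift = [], [], 0.0
for i, v in enumerate(raw):
    if i > 0 and v < raw[i - 1]:        # a maintenance reset: the record steps down
        offsets.append(v - raw[i - 1])  # offset introduced at this cut
        shift -= offsets[-1]            # realign the new segment with the old one
    spliced.append(v + shift)

residual = sum(offsets)                  # net of all slice offsets (here all negative)
spurious_rise = spliced[-1] - spliced[0] # ~9.9 degrees of trend created purely by splicing
corrected_end = spliced[-1] + residual   # adding the residual back recovers the raw endpoint
```

The raw sawtooth ends where it began, the realigned splice rises by the full accumulated drift, and the inverse of the summed offsets removes exactly that artifact.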


“The underlying problem is that most temperature records have discontinuities. There are station moves, and changing instruments, and routine maintenance, and the like. As a result, the raw data may not reflect the actual temperatures.”

Another underlying problem is the assumption that a discontinuity in the record is a problem in the record.

Willis, maybe Zeke and Mosher were no longer interested in the discussion, but I did answer your question.

That is why you should not only correct jumps known from metadata, but also perform statistical homogenization to remove the unknown jumps and gradual inhomogeneities. Other fields of science (finance and biology, for instance) often use absolute homogenization methods, with which you can only remove jumps. In climatology, relative homogenization methods are used, which also remove trends if the local trend at one station does not fit the trends in the region. Evan Jones may be able to tell you more; he is seen here as a more reliable source, and is not moderated.
P.S. To all the people who are shocked that the raw data are changed before computing a trend: that is called data processing. Not much science or engineering is done without it.

Relative homogenization methods compute the difference between the station you are interested in and its neighbours. If the mean of this difference is not constant, something is happening at one station that does not happen at the others. Such changes are removed as well as possible in homogenization, so that trends can be computed more reliably (the raw data are not changed, and can also be downloaded from NOAA). In relative homogenization you compare the mean value of the difference before and after a potential date of change; if this difference is statistically significant, a break is detected, or in the case of BEST the scalpel is set and the series is split. It does not matter whether the means differ due to a gradual change or due to a jump. What matters is that the difference in the means before and after is large enough.
This part of the BEST method is no problem. With Zeke, BEST, NOAA and some European groups we are working on a validation study aimed especially at gradual inhomogeneities. This complements the existing studies where a combination of break inhomogeneities and gradual ones were used for validation.
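The neighbour-difference idea described above can be sketched in a few lines. The data are synthetic, and the bare difference of means stands in for the actual test statistics used in practice; the point is simply that the shared regional signal cancels in the difference series, leaving the inhomogeneity exposed:

```python
import math

# Shared regional signal, plus an 0.8 degree inhomogeneity in station A from index 60 on.
regional = [math.sin(i / 10) for i in range(120)]
a = [r + (0.8 if i >= 60 else 0.0) for i, r in enumerate(regional)]
b = list(regional)  # a well-behaved neighbour

diff = [x - y for x, y in zip(a, b)]  # the climate cancels; only the break remains
before = sum(diff[:60]) / 60
after = sum(diff[60:]) / 60
jump = after - before                 # ~0.8: large enough, so cut or correct here
```

Whether station A jumped 0.8 degrees at index 60 or drifted up to it gradually, the before/after means of the difference series still separate, which is the point made above.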


ps: I’ve used the term “offset” for the difference between the end points on each side of the slice; there may be a better term. Willis et al. contend that the majority of these offsets will be in one direction, leading to bias over time. Unless BEST corrects for the residual (the net sum of the positive and negative offsets), the slicing must introduce an artificial trend in the result. BEST should publish the offset data so it can be evaluated, to see whether the slicing created an artificial trend.


BEST: by year, what is the net total of the difference between the endpoints for all the slices?


ferdberple clarifies: “The method cannot catch gradual drift followed by step-wise correction. Thus it introduces bias into signals that have no bias.”
And there’s a perfect mechanism for a warm bias in the simple fading of the initially bright white paint of temperature stations. Paint is what ruined the Hubble telescope mirror, for lack of double-checking: they relied on a single test instrument that had a bit of paint chipped off it, so they figured the massive mirror to match the paint chip. Later, a missed metric-unit conversion crashed a Mars lander. So here is Willis asking ahead of time, prior to the launch of an American carbon tax, “Hey, have you guys checked for this little error issue I ran into?”
Dirt-simple errors missed in big technical projects often lead to disaster, even for rocket scientists with an extreme interest in not being wrong, unlike the case for climatologists.


pps: One would also need to consider the gridding when apportioning the residuals from slicing. Even if they added up to zero in absolute terms, this could still introduce bias when gridded.


Willis: “Is it possible to find signal that is smaller than the measurement errors? This is the big question of the temperature records.
Mmmm … in theory, sure. Signal engineers do it every day. But in the temperature records? Who knows.”
This statement needs a little clarification. Yes algorithms are in use that easily detect signals an order of magnitude or two below noise level. However (and this is a big however!), these algorithms are searching for signals with known characteristics. Typically these will be sine waves with specific modulation characteristics. The less that is known about a signal, the better the signal to noise ratio must be and the longer the detection period must be for detection.
The point is that searching for a long term trend in our USHCN records is not analogous to decoding data transmissions from a distant Voyager satellite. From a signal analysis perspective, reliably detecting that trend would require a significantly positive signal to noise ratio. (trend greater than noise over the observation period)
An example of the problem of finding a trend signal in noise was presented to us by a popular hockey stick graph developed from tree ring analysis. The analysis algorithm emphasized records that matched a modern thermometer record. Since the thermometer record had a positive trend, the algorithm dug into the noisy tree ring data and pulled out a positive trend result. Of course, the algorithm was also able to find false positive trends in pure random noise most of the time too. It found what it was designed to find.
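The point about known signal characteristics can be demonstrated with a toy matched filter (all parameters invented): correlating against a known template recovers a sine whose amplitude is well below the noise level, but the trick depends entirely on knowing the template in advance, which is exactly what we lack for an unknown temperature trend:

```python
import math, random

random.seed(0)
N, freq, amp, sigma = 2000, 0.05, 0.5, 2.0  # signal amplitude well below noise level
template = [math.sin(2 * math.pi * freq * i) for i in range(N)]

noisy_signal = [amp * t + random.gauss(0, sigma) for t in template]
pure_noise = [random.gauss(0, sigma) for _ in range(N)]

def detect(x, template):
    """Matched-filter statistic: correlate the series against the known template."""
    return sum(xi * ti for xi, ti in zip(x, template)) / len(x)

hit = detect(noisy_signal, template)  # ~ amp/2, far above the noise floor
miss = detect(pure_noise, template)   # ~ 0
```

Here the per-sample signal-to-noise ratio is about -15 dB, yet the detection statistic separates cleanly, because the correlation averages the noise down over 2000 samples while the known sine adds coherently.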

Crispin in Waterloo but really in Beijing

Willis, I like the scalpel because it creates a break where there is a shift in data bias. That is a good way to start.
Suppose the change was trimming a tree, painting the enclosure, or cutting down a nearby hedge. The influence of this class of change is gradual: we can assume it grew to be a problem gradually, so the effect is approximately linear.
The bias (up or down) is the difference between the last and first data of the two sets but there would have to be some brave corrections because the week following the change is not the same as the one before it.
Suppose the new data were consistently 1 degree cooler than the old. Doesn’t work. We have to take a whole year. But a year can be a degree colder than the last. Big problem.
If there really was a 1 degree bias, we have to assume the early part of a data set is ‘right’. The step change has to be applied to the old data to ‘fix’ it assuming the change is linear.
Step change = s
Number of observations in the segment = n
N = position in the series (N = 1 … n)
DN = data value at position N
DNc = corrected value at position N
Linear correction:
DNc = DN + (N − 1)*s/n
so D1c = D1, D2c = D2 + s/n, D3c = D3 + 2s/n, and so on.
That works for steps up or down.
The data set length can be anything. All it needs is the value and sign of the step.
Other types of episodic change, like paving a nearby sidewalk, need offsets, not corrections for drift, because that is a different class of change.

George Turner

A couple of points.
If we assume that the discontinuities represent corrections back to the initial conditions, then the climate signal is best measured immediately after the discontinuities, by connecting those points and ignoring what’s in between. This is based on the idea that the discontinuity is due to a major human intervention that recalibrated the station, and that afterwards bias creeps in. Going all the way with the saw-tooth analogy: you have a region with absolutely no trend, on which sit a bunch of saws, teeth up, the teeth having wildly varying sizes (different periods between recalibrations) but the same thickness between the back of the saw blade and the inner notch of a tooth. The best measurement of the underlying surface is the bottoms of the notches, not the teeth themselves.
Also, you could assume that the bias signal accumulating in all the stations averages half the maximum tooth height, subtract that from all the readings, and simply average all the stations. Since the adjustments are asynchronous, this would probably give you a pretty good overall picture of the climate. This implies that the raw data are probably better than the adjusted data for seeing the major trends, since removing similar snippets from a periodic waveform will create a major false trend. It also agrees with the simple logic that thousands of station monitors were probably not all horribly misreading their thermometers, and that this misreading grows vastly worse in the past. Their readings were within X degrees now, and were within X degrees in the past.
I’m also confident that it’s a whole lot easier to warm a station than to cool it, if we disallow large trees growing over a site. I think Anthony’s surface station project could confirm this by simply counting how many stations should be rejected because they should be reading too hot instead of too cold.

Quoting Willis’s questions from the article:
Since the Berkeley Earth “scalpel” method would slice these into separate records at the time of the discontinuities caused by the maintenance, it throws away the trend correction information obtained at the time when the episodic maintenance removes the instrumental drift from the record.
As a result, the scalpel method “bakes in” the gradual drift that occurs in between the corrections.
So I’d like to know if and how the “scalpel” method avoids this problem … because I sure can’t think of a way to avoid it.
So I’m here to ask it again …

Willis, I will offer my solution for your stated problem but you will have to determine if it is applicable and if it can be implemented or not.
First of all, a discontinuity “flag” character would have to be chosen that would be appended to the first temperature reading recorded after maintenance was performed at the surface station. Said flag would denote a “maintenance break-point” in the daily temperature data. For talking purposes I will choose the alpha character “m”, with the capital “M” and the small “m” each having their own significance.
Next, a new “maintenance” program would have to be written that would scan the temperature data file for each surface station, looking for any capital “M” discontinuity flags. If one is found, it would calculate the “trend” difference between the “M”-flagged temperature value and the one previous to it, and add that full trend value to said previous temperature. Then, with a sequentially decreasing trend value, it would add the newly calculated corrections to all earlier temperature data until it detects a small “m” discontinuity flag or a default date. Finally, the program would change the capital “M” flag to a small “m”, signifying that trend corrections have been applied to all the temperature data between the small “m” flags.
Of course, the above requires use of the raw and/or trend-corrected data each time said new maintenance program is executed.
If that was not what you were “asking for” ….. then my bad, …. I assumed wrong.
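A rough sketch of the flag scheme described above (the record layout, flag characters, and correction taper are all illustrative assumptions, not the commenter's actual format). The M-to-m rewrite is what makes the program safe to rerun, since an already-corrected segment is never corrected twice:

```python
# Each record is [temp, flag]: "M" = first reading after maintenance, not yet processed;
# "m" = a break whose preceding segment has already been trend-corrected.
def run_maintenance_pass(records):
    start = 0                                # first record of the current segment
    for i, (temp, flag) in enumerate(records):
        if flag == "m":
            start = i                        # earlier data already corrected
        elif flag == "M":
            step = temp - records[i - 1][0]  # jump introduced by the maintenance
            n = i - start
            for j in range(start, i):        # taper the correction back through the segment
                records[j][0] += (j - start + 1) * step / n
            records[i][1] = "m"              # mark done so a rerun skips this break
    return records

station = [[20.0, ""], [20.3, ""], [20.6, ""], [19.7, "M"], [19.8, ""]]
run_maintenance_pass(station)
# the drifted segment is pulled down to meet the post-maintenance reading (all ~19.7)
```

Running the pass a second time is a no-op, because the only break flag left is the lowercase “m”.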


Richard Verney,
Good comments all; I second all your questions. Have you seen this today?
Does anyone know how basic quality control is done to detect such data changes?