Guest Post by Willis Eschenbach
In an insightful post at WUWT, Bob Dedekind talked about a problem with temperature adjustments. He pointed out that stations are maintained, by doing things like periodically cutting back encroaching trees or repainting the Stevenson Screen. He noted that if we try to “homogenize” these stations, we get an erroneous result. This led me to a consideration of the “scalpel method” used by the Berkeley Earth folks to correct discontinuities in the temperature record.
The underlying problem is that most temperature records have discontinuities. There are station moves, instrument changes, routine maintenance, and the like. As a result, the raw data may not reflect the actual temperatures.
There are a variety of ways to deal with that, which are grouped under the rubric of “homogenization”. A temperature dataset is said to be “homogenized” when all effects other than temperature effects have been removed from the data.
The method that I’ve recommended in the past is called the “scalpel method”. To see how it works, suppose there is a station move. The scalpel method cuts the data at the time of the move, and simply considers it as two station records, one at the original location, and one at the new location. What’s not to like? Well, here’s what I posted over at that thread. The Berkeley Earth dataset is homogenized by the scalpel method, and both Zeke Hausfather and Steven Mosher have assisted the Berkeley folks in their work. Both of them had commented on Bob’s post, so I asked them the following.
Mosh and/or Zeke, Stephen Rasey above and Bob Dedekind in the head post raise several points that I hadn’t considered. Let me summarize them, they can correct me if I’m wrong.
• In any kind of sawtooth-shaped temperature record subject to periodic or episodic maintenance or change, e.g. painting a Stevenson screen, the most accurate measurements are those immediately following the change. Following that, there is a gradual drift in the temperature until the following maintenance.
• Since the Berkeley Earth “scalpel” method would slice these into separate records at the time of the discontinuities caused by the maintenance, it throws away the trend correction information obtained at the time when the episodic maintenance removes the instrumental drift from the record.
• As a result, the scalpel method “bakes in” the gradual drift that occurs in between the corrections.
Now this makes perfect sense to me. You can see what would happen with a thought experiment. If we have a bunch of trendless sawtooth waves of varying frequencies, and we chop them at their respective discontinuities, average their first differences, and cumulatively sum the averages, we will get a strong positive trend despite the fact that there is absolutely no trend in the sawtooth waves themselves.
So I’d like to know if and how the “scalpel” method avoids this problem … because I sure can’t think of a way to avoid it.
In your reply, please consider that I have long thought and written that the scalpel method was the best of a bad lot of methods, all methods have problems but I thought the scalpel method avoided most of them … so don’t thump me on the head, I’m only the messenger here.
w.
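To put a number on the sawtooth thought experiment above, here’s a minimal sketch in Python (my own toy example, not the Berkeley Earth code). We build a bunch of trendless sawtooth “stations” with different maintenance intervals, slice them at the resets, average the within-segment first differences, and sum:

```python
import numpy as np

rng = np.random.default_rng(42)

def sawtooth(n, period):
    """Trendless sawtooth: steady upward drift, abrupt reset every 'period' steps."""
    t = np.arange(n)
    return (t % period) / period          # rises from 0 toward 1, then drops back to 0

n = 1200                                  # say, 100 years of monthly data
periods = rng.integers(24, 120, size=50)  # 50 stations, varying maintenance intervals
series = [sawtooth(n, p) for p in periods]

# The "scalpel": cut each series at its resets. Keeping only within-segment
# first differences throws the big negative resets away.
diffs = np.full((len(series), n - 1), np.nan)
for i, s in enumerate(series):
    d = np.diff(s)
    d[d < 0] = np.nan                     # differences across a reset are discarded
    diffs[i] = d

avg_diff = np.nanmean(diffs, axis=0)      # average first differences across stations
reconstructed = np.cumsum(avg_diff)       # cumulative sum = combined "regional" record

print("trend of raw sawtooth average:", np.polyfit(np.arange(n), np.mean(series, axis=0), 1)[0])
print("trend of sliced reconstruction:", np.polyfit(np.arange(n - 1), reconstructed, 1)[0])
```

The raw sawtooth average has essentially no trend; the sliced-and-summed reconstruction trends steadily upward, because every negative reset is cut away while every bit of upward drift is kept.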
Unfortunately, it seems that they’d stopped reading the post by that point, as I got no answer. So I’m here to ask it again …
My best to both Zeke and Mosh, who I have no intention of putting on the spot. It’s just that as a long time advocate of the scalpel method myself, I’d like to know the answer before I continue to support it.
Regards to all,
w.
tonyb says:
June 29, 2014 at 2:21 am
Thanks.
Whilst I do not like proxies, I consider that the historical and archaeological records provide some of the best proxy evidence available. At least we are dealing with fact, even if it is relative and not absolutely quantitative in nature. I consider that climate scientists could do with studying history. Somehow the idea that there was some form of climate stasis prior to the industrial revolution has taken root, simply because people did not live through prior times and have no insight into the history of those times.
I applaud your reconstruction of CET since not only does it provide us with more data, it is a valuable insight into the history of those times and what people thought of their history. Whilst I am not a fan of the globalisation of climate (I consider that, in the terms we are talking about, it is inherently regional in nature), there is no particular reason why CET should not be a good trend marker for the Northern Hemisphere, particularly the near Northern European Continent. So your reconstruction is an extremely valuable tool.
A point to ponder on. I consider that the next 10 to 12 years could be compelling. IF there is no super El Nino event, and IF temperatures were to cool at about 0.1 to 0.12 deg C per decade (and as you know from CET there has been a significant fall in temperatures since 2000, especially the ‘winter’ period), then we will be in a remarkable position.
I know that this contains some IFs, but if it were to come to pass, we would find ourselves (according to the satellite data) at about the same temperature as 1979. There would, in this scenario, be no warming during the entirety of the satellite record, and during this period about 80% of all manmade CO2 emissions would have taken place (i.e., the emissions from 1979 to, say, 2025).
I do not think that many people appreciate that this would be the result of quite a modest drop in temperature, and the knock-on consequence is huge, since it would make it very difficult for someone to argue that manmade CO2 emissions are significant (if 80% of all such emissions have resulted in no measured increase at all!).
Addendum to my 2:45 am post:
The total number of sites was 35,000
but only 10,000 of them had less than 10% missing data
and only 3,000 of them had <10% missing data and records longer than 38 years.
To be clear, these numbers came from a superficial look at the raw data files Zeke and Mosher provide links to in an index page. These values are prior to the use of the BEST scalpel. BEST has ready and fast access to post-scalpel segment lengths. Constructing our suggested census plots by sampled year should be easy, and no burden compared to the other processing they do.
Another interesting chart that Richard H and I briefly explored is a census map of a 1×1 degree grid based upon post-scalpel segment length. For instance, color code the 1×1 deg cells by the number of segments that are longer than 20 years and cover the year 2005. I hypothesize it would be a pretty blank map. 2×2 deg cells? (That is about 100×125 mile cells.)
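For what it’s worth, here is roughly how such a census could be tabulated, assuming one had a table of post-scalpel segments with location and start/end years (the handful of rows below are invented purely for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical table of post-scalpel segments: latitude, longitude, start and
# end year. The five rows below are invented purely for illustration.
segments = pd.DataFrame({
    "lat":   [34.2, 34.7, 51.1, 51.9, -12.3],
    "lon":   [-118.4, -118.9, -0.3, -0.8, 130.9],
    "start": [1979, 1999, 1950, 1991, 2001],
    "end":   [2010, 2014, 2005, 2014, 2014],
})

long_enough = segments["end"] - segments["start"] >= 20
covers_2005 = (segments["start"] <= 2005) & (segments["end"] >= 2005)
keep = segments[long_enough & covers_2005].copy()

# Bin the qualifying segments into 1x1 degree cells and count per cell;
# mapping these counts gives the census map described above.
keep["cell"] = list(zip(np.floor(keep["lat"]), np.floor(keep["lon"])))
print(keep.groupby("cell").size())
```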
ferdberple says:
June 28, 2014 at 11:26 pm
Us component/PCB/system test and reliability engineers tried to break things without exceeding the specification. Once a problem was identified, one of the Not Invented Here / “it’s not our hardware” responses was “that condition will never happen in the field.” Test response: “How do you know?” Design answer: “Because I/we designed it.” Test engineer: “You designed in this fault. How do you KNOW?” Designer: “Yes, but…”
We’re now in the yes but scenario.
Willis’ question is unanswerable as the error is designed in, and will happen in the field.
“the most accurate measurements are those immediately following the change. Following that, there is a gradual drift in the temperature until the following maintenance.”
The drift is always positive; that’s why it doesn’t work. If the drift were random, it would work.
Didn’t Anthony have an experiment running to test Stevenson screens under different maintenance regimes? Is there anything from those tests that would be of use in this discussion?
The breakpoints should be equally distributed between positive and negative adjustments of roughly the same magnitude.
And the positive and negative breakpoints should be equally distributed through time. In 1910, there should be 200 negative breakpoints and 200 positive adjustments of roughly the same magnitude. And this should be roughly consistent from the start of the record to the end of the record.
This is how bias can be detected. Simple histograms. And then we can determine whether the excess of negative adjustments in 1944, for example, is valid. What is the main reason for it?
There should not be a trend through time in the breakpoints unless it can be proven that they should vary through time.
BEST hasn’t shown this that I am aware of. The description of the simple breakpoint/scapel method suggests that the breakpoints should be random through time. Are they? Why haven’t they shown this and explained it?
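Here is a sketch of the kind of check I mean, assuming a table of (year, adjustment) pairs were published; the values below are made up purely for illustration:

```python
import numpy as np
from collections import defaultdict

# Hypothetical input: one (year, adjustment in degC) pair per detected breakpoint.
# No such table has been published, so these values are invented for illustration.
breakpoints = [(1912, +0.3), (1913, -0.4), (1944, -0.5), (1944, -0.2),
               (1951, +0.1), (1968, -0.3), (1975, +0.2), (1990, -0.1)]

by_decade = defaultdict(list)
for year, adj in breakpoints:
    by_decade[10 * (year // 10)].append(adj)

print("decade  n_pos  n_neg  net adj (degC)")
for decade in sorted(by_decade):
    adjs = np.array(by_decade[decade])
    n_pos, n_neg = int(np.sum(adjs > 0)), int(np.sum(adjs < 0))
    print(f"{decade}s  {n_pos:5d}  {n_neg:5d}  {adjs.sum():+13.2f}")
# An unbiased breakpoint population should show roughly equal positive and
# negative counts, and a net adjustment near zero, in every decade.
```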
We know the NCDC has shown their adjustments have a systematic trend through time, with a maximum negative adjustment of -0.35C to -0.55C in the 1930s to 1940s. Why? Does the TOBs adjustment vary through time? Why? Shouldn’t the other changes vary randomly through time? Why not?
Should people be allowed to change the data without fully explaining it?
FWIW, I designed and built a precision Web/Internet enabled temperature monitor system about a decade ago (selling them to customers like HP Research and the Navy Metrology Lab), and although we never built a version for outdoor environmental monitoring and data acquisition, I did do research on things like radiation shields/Stevenson screens and the like, thinking we might sell to that market.
One of the more interesting things I discovered was that some white paints (Krylon flat white spray paint being one of them) can actually have a far higher emissivity in the infrared spectrum than most other colors, including some *black* paints. (I don’t have the link handy, but I remember one source of this was a NASA document comparing the IR emissivity of a fairly large number of coatings. Krylon’s aerosol flat white was among the paints with the very highest IR emissivity, which, of course, also means it’s near the top in IR absorption.)
Moral: Visible ain’t IR, and your eyes can’t tell by looking whether that paint absorbs or reflects heat. The fact that the really high emissivity was for flat white paint does call into question whether and how weathering/aging might dramatically increase the thermal absorption of white paints, or even molded plastic radiation shields, over time, and it hints that glossiness is at least as important as color. Every paint I’ve ever encountered tends to get flat and/or chalky over time as it ages and oxidizes. As a result, repainting a shield could either raise or lower the temperature inside! If anyone were *really* interested in actual climate science, this would be a topic of research, but the Global Warming narrative is better served by ignoring it, so don’t hold your breath. One more reason why climate science and good temperature measurements are harder than they appear.
(BTW, most temp sensors and instruments, even quite a few of the expensive ones, give pretty crappy readings that are frequently subject to offset errors of a degree or more (C or F, take your pick…). Thermocouples are especially problematic, as almost no one actually gets all the intermediate junction stuff and cold junction compensation right. Some systems I’ve seen correlate to the temp of the galvanized pole they’re mounted on better than they do to ambient air temp. (Further, I’m amazed at how many so-called high precision temperature measurement systems ship TCs with the ends twisted together rather than properly welded.) I prefer platinum RTDs for accurate temp measurements, but doing that right requires precision excitation power supplies and bridge completion resistors that are stable across your entire temp range over time. These things are expensive and no one wants to pay for them. Bottom line: accurately and precisely measuring temperature electronically is much harder than it appears, and it’s often done very poorly. I strongly suspect that hundred-year-old temperature records, hand-recorded from mercury thermometers, were dramatically more accurate and consistent than what we’re getting today.)
Pick out a set of 100 sites that give a quasi satisfactory spread of locations in contiguous USA.
Sites that can be maintained well and not on top of mountains.
Sites that have perfect TOBS.
Exclude Hawaii and Alaska.
Accept that it is a proxy and not perfect.
Use 50 of the sites for the accepted temperature record.
Use 50 adjacent sites as potential back ups if sites go down.
Put up a caveat that this temp is reconstructed using real raw data, with an asterisk noting how many substitute sites were used.
Allow Nick Stokes to use the anomalies from this method to do his sums, as he cannot understand that anomalies are just the variation from the absolute at each site.
When problems arise, infill the missing days from the average for that day over all past years (a quick sketch of this infilling step follows the list).
Put a caveat up that the record has been amended this way for that year.
Wash, Rinse, Repeat.
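As noted above, a minimal sketch of the day-of-year infilling step (my own illustration; the site, dates and values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical daily series for one site; ten days are then "lost".
dates = pd.date_range("1990-01-01", "1999-12-31", freq="D")
temps = pd.Series(15 + 10 * np.sin(2 * np.pi * dates.dayofyear / 365.25), index=dates)
temps.iloc[500:510] = np.nan

# Fill each missing day with the average for that calendar day over all years;
# per the caveat above, such values should be flagged in the record.
day_of_year_mean = temps.groupby(temps.index.dayofyear).transform("mean")
filled = temps.fillna(day_of_year_mean)

print(filled.iloc[498:512])
```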
Willis — I think you’re basically right. There’s a fair amount of subtlety to all this and I wouldn’t bet much money on you (us) being correct, but I don’t see any obvious flaws in your discussion.
One minor point made by others: probably not all the biases increase over time. For example, vegetation growth (corrected by occasional trimming) would probably be a negative bias. But I suspect most of the biases are positive. (Does the sum of all biases that change over time = UHI?)
It occurs to me that if this were a physical system we were designing where we had a great deal of control over the system and the measurement procedures, we’d probably try to design all the time variant biases out. If we couldn’t do that, we’d likely have a calibration procedure we ran from time to time so we could estimate the biases and correct them out. What we (probably) wouldn’t do was ignore calibration and try to correct the biases out based on the measurements with no supporting data on the biases. But that seems to be what we’re forced to do with climate data. I’m not smart enough to know for sure that can’t be done, but I wouldn’t have the slightest idea how to go about doing it.
Anyway, good luck with this. I’m sure you’ll let us know how it works out.
Oh I see, so they check each station to see if it was a physical change…or weather
…and if it’s weather, they don’t splice it
“A temperature dataset is said to be “homogenized” when all effects other that temperature effects have been removed from the data.”
Interesting. The thermometer only reads the temperature. So, how does one determine, site-by-site, which “other than temperature” effects should be removed?
I’m in agreement with those who say that only a site that doesn’t have any other than temperature effects should be used.
I suspect we could look at any improperly sited station and spend days discussing the possible adjustment(s) necessary to make the data for that station reasonably correct.
I’m making a huge assumption: that the folks here ultimately would agree on the adjustments for that station. I’m of the opinion that we could not.
So, if we can’t get one right…
SandyInLimousin says:
June 29, 2014 at 3:47 am
==============
I routinely get systems designers telling me that the odds of a particular event are billions to one against, so we can safely ignore it in the design. Then I show them with a simple data mining exercise that we routinely have many of these billion-to-one events. Most people are very poor at estimating events when it is in their favor to estimate poorly.
BEST must have the data showing how much the offset was on each slice. This should average out to zero if the slicing is bias free. It is my understanding that BEST has not made this information available, despite being asked. Why not?
Correcting slice bias is not difficult. Add up all the offsets from slicing; some will be positive and some negative. Whatever the residual, add the inverse into the final result to remove any trend created by slicing. But without knowing the offsets due to slicing, there is no way to know whether it introduced a trend or not.
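For illustration, here is roughly what that bookkeeping looks like on a made-up station (not BEST’s actual code); “segments” stands in for the post-scalpel pieces of one record:

```python
import numpy as np

rng = np.random.default_rng(0)
# Five hypothetical post-scalpel segments of one station, in chronological order:
# each drifts warm by ~1.2 degC before the next maintenance resets it.
segments = [20 + 0.01 * np.arange(120) + rng.normal(0, 0.2, 120) for _ in range(5)]

# The "offset" at each cut: the jump from the last value of one segment
# to the first value of the next.
offsets = np.array([nxt[0] - prev[-1] for prev, nxt in zip(segments[:-1], segments[1:])])
residual = offsets.sum()

print("offsets at each cut:", np.round(offsets, 2))
print("residual (net offset discarded by slicing): %+.2f degC" % residual)
# If the residual is consistently one-signed across many stations, the slicing
# has thrown away a one-sided correction and left a spurious trend behind.
```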
Another underlying problem is the assumption that a discontinuity in the record is a problem in the record.
Willis, maybe Zeke and Mosher were no longer interested in the discussion, but I did answer your question.
Relative homogenization methods compute the difference between the station you are interested in and its neighbours. If the mean of this difference is not constant, there is something happening at one station that does not happen at the other. Such changes are removed as well as possible in homogenization, to be able to compute trends with more reliability (and the raw data is not changed and can also be downloaded from NOAA). In relative homogenization you compare the mean value of the difference before and after a potential date of change; if this difference is statistically significant, a break is detected, or in the case of BEST the scalpel is set and the series are split. It does not matter whether the means are different due to a gradual change or due to a jump. What matters is that the difference in the means before and after is large enough.
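As a bare-bones illustration of such a relative test (a toy example, not the actual BEST or NOAA code), one can compare a station to the mean of its neighbours and test the difference series around a candidate date:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 240                                     # 20 years of monthly anomalies
neighbours = rng.normal(0, 0.3, (5, n))     # five hypothetical neighbouring stations
station = neighbours.mean(axis=0) + rng.normal(0, 0.1, n)
station[150:] += 0.5                        # an inhomogeneity: a 0.5 degC shift

diff = station - neighbours.mean(axis=0)    # the difference series
candidate = 150                             # candidate date of change
t_stat, p_val = stats.ttest_ind(diff[:candidate], diff[candidate:])
print(f"t = {t_stat:.1f}, p = {p_val:.2g}")
# A significant difference in the means before and after triggers a break
# (or, for BEST, a scalpel cut), whether the shift arrived as a jump or
# built up gradually before the candidate date.
```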
This part of the BEST method is no problem. With Zeke, BEST, NOAA and some European groups, we are working on a validation study aimed especially at gradual inhomogeneities. This complements the existing studies, where a combination of break inhomogeneities and gradual ones was used for validation.
PS: I’ve used the term offset for the difference between the end points on each side of the slice. There may be a better term. Willis et al. contend that the majority of these offsets will be in one direction, leading to bias over time. Unless BEST corrects for the residual – the net sum of the positive and negative biases – the slice must introduce an artificial trend in the result. BEST should publish the offset data so it can be evaluated, to see if the slicing created an artificial trend.
BEST: by year, what is the net total of the difference between the endpoints for all the slices?
ferdberple clarifies: “The method cannot catch gradual drift followed by step-wise correction. Thus it introduces bias into signals that have no bias.”
And there’s a perfect mechanism for a warm bias in the simple fading of the initially bright white paint of temperature stations. Paint was what ruined the Hubble telescope mirror, for lack of double checking: they used only a single flatness detector that had a bit of paint chipped off it, so they altered the massive mirror to match the paint chip. Later, a metric system conversion that was left out crashed a Mars lander. So here is Willis asking ahead of time, prior to the launch of an American carbon tax, “Hey, have you guys checked for this little error issue I ran into?”
Dirt simple errors being missed in big technical projects often lead to disaster even for rocket scientists with an extreme interest in not being wrong, unlike the case for climatologists.
PPS: One would also need to consider the gridding when apportioning the residuals from slicing. Even if they added up to zero in absolute terms, this could still introduce bias when gridded.
Willis: “Is it possible to find a signal that is smaller than the measurement errors? This is the big question of the temperature records.
Mmmm … in theory, sure. Signal engineers do it every day. But in the temperature records? Who knows.”
This statement needs a little clarification. Yes, algorithms are in use that easily detect signals an order of magnitude or two below the noise level. However (and this is a big however!), these algorithms are searching for signals with known characteristics. Typically these will be sine waves with specific modulation characteristics. The less that is known about a signal, the better the signal-to-noise ratio must be and the longer the detection period must be for detection.
The point is that searching for a long term trend in our USHCN records is not analogous to decoding data transmissions from a distant Voyager spacecraft. From a signal analysis perspective, reliably detecting that trend would require a significantly positive signal-to-noise ratio (trend greater than the noise over the observation period).
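To illustrate the difference, here is a toy matched-filter example (my own sketch, not a climate algorithm): a known sine wave well below the noise floor is easy to recover, precisely because its shape, frequency and phase are known in advance. An unknown slow trend gets no such help.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
t = np.arange(n)
template = np.sin(2 * np.pi * 0.0123 * t)   # known frequency and phase
signal = 0.05 * template                    # amplitude 1/20th of the noise sigma
received = signal + rng.normal(0, 1.0, n)   # buried deep in the noise

# Matched filter: correlate the received data against the known template.
amplitude = np.dot(received, template) / np.dot(template, template)
print(f"recovered amplitude: {amplitude:.3f} (true value 0.05, noise sigma 1.0)")
```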
An example of the problem of finding a trend signal in noise was presented to us by a popular hockey stick graph developed from tree ring analysis. The analysis algorithm emphasized records that matched a modern thermometer record. Since the thermometer record had a positive trend, the algorithm dug into the noisy tree ring data and pulled out a positive trend result. Of course, the algorithm was also able to find false positive trends in pure random noise most of the time too. It found what it was designed to find.
Willis, I like the scalpel because it creates a break where there is a shift in data bias. That is a good way to start.
Suppose the change was trimming a tree, painting the enclosure or cutting down a nearby hedge. The influence of this class of change is gradual – we can assume it grew to be a problem gradually so the effect is approximately linear.
The bias (up or down) is the difference between the last and first data of the two sets, but there would have to be some brave corrections, because the week following the change is not the same as the one before it.
Suppose the new data were consistently 1 degree cooler than the old. Doesn’t work. We have to take a whole year. But a year can be a degree colder than the last. Big problem.
If there really was a 1 degree bias, we have to assume the early part of a data set is ‘right’. The step change has to be applied to the old data to ‘fix’ it, assuming the change is linear.
Step change = s
Number of observations = n
N = position in the series
D1 = data value 1
D1c = corrected data value 1
Linear correction:
D1c = D1 + (N1 – 1) * s/n
D2c = D2 + (N2 – 1) * s/n
D3c = D3 + (N3 – 1) * s/n
etc.
That works for steps up or down.
The data set length can be anything. All it needs is the value and sign of the step.
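Here is a quick sketch of that linear correction in code (my reading of the recipe above, not an established adjustment routine):

```python
import numpy as np

def linear_dedrift(old_segment, step):
    """Spread the observed step s back over the n observations of the earlier
    segment, so the ramp grows from ~0 at the start to ~step at the break."""
    n = len(old_segment)
    positions = np.arange(1, n + 1)            # N = 1 .. n
    return old_segment + (positions - 1) * step / n

old = 20 + 0.01 * np.arange(120)               # drifts 1.2 degC warm before maintenance
corrected = linear_dedrift(old, step=-1.2)     # step observed at the maintenance break
print(corrected[0], corrected[-1])             # starts unchanged, ends ~1.2 degC cooler
```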
Other types of episodic change, like paving a nearby sidewalk, need offsets, not corrections for drift, because that is a different class of change.
A couple of points.
If we assume that the discontinuities represent corrections back to the initial conditions, then the climate signal is best measured immediately after the discontinuities, by connecting those points and ignoring what’s in between. This is based on the idea that the discontinuity is due to a major human intervention that recalibrated the station, and that afterwards bias creeps in. Going all the way with the saw-tooth analogy, you have a region with absolutely no trend, on which sit a bunch of saws, teeth up, the teeth having wildly varying sizes (different periods between recalibrations) but the same thickness between the back of the saw blade and the inner notch of a tooth. The best measurement of the underlying surface is the bottom of the notches, not the teeth themselves.
Also, you could assume that the bias signal accumulating in all the stations is half the maximum tooth height, then subtract that from all the readings, and simply average all the stations. Since the adjustments are asynchronous, this would probably give you a pretty good overall picture of the climate. This implies that the raw data is probably better than the adjusted data for seeing the major trends, since removing similar snippets from a periodic waveform will create a major false trend. It also agrees with the simple logic that thousands of station monitors were probably not all horribly misreading their thermometers, and that this misreading grows vastly worse in the past. Their readings are within X degrees now, and were within X degrees in the past.
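Here is a toy sketch of the “bottom of the notches” idea (my own reading of it, not a published method): keep only the readings immediately after each reset and interpolate between them.

```python
import numpy as np

def notch_bottoms(series, reset_threshold=-0.5):
    """Return indices just after each large downward jump (the recalibrations)."""
    drops = np.where(np.diff(series) < reset_threshold)[0] + 1
    return np.concatenate(([0], drops))        # include the start of the record

t = np.arange(600)
sawtooth = (t % 90) / 90.0                      # trendless station: drift plus resets
idx = notch_bottoms(sawtooth)
estimate = np.interp(t, idx, sawtooth[idx])     # the underlying (flat) surface
print("trend of notch-bottom estimate:", np.polyfit(t, estimate, 1)[0])
```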
I’m also confident that it’s a whole lot easier to warm a station than to cool it, if we disallow large trees growing over a site. I think Anthony’s surface station project could confirm this by simply counting how many stations should be rejected because they should be reading too hot instead of too cold.
Quoting Willis “quotes” from the article:
“Since the Berkeley Earth “scalpel” method would slice these into separate records at the time of the discontinuities caused by the maintenance, it throws away the trend correction information obtained at the time when the episodic maintenance removes the instrumental drift from the record.”
“As a result, the scalpel method “bakes in” the gradual drift that occurs in between the corrections.”
“So I’d like to know if and how the “scalpel” method avoids this problem … because I sure can’t think of a way to avoid it.
So I’m here to ask it again …”
———–
Willis, I will offer my solution for your stated problem but you will have to determine if it is applicable and if it can be implemented or not.
First of all, a discontinuity “flag” character would have to be chosen/selected that would be appended to the “first” temperature reading that was recorded after said maintenance was performed at/on the Surface Station. Said “flag” would thus denote a “maintenance break-point” in the daily temperature data. …. And for “talking” purposes I will choose the alpha character “m” with both the capital “M” and small “m” having significance.
Next, a new “maintenance” program would have to be written that would “scan” the temperature data file for each Surface Station looking for any capital “M” discontinuity “flags” and, if found, it would calculate the “trend” variance between said “M” flagged temperature value and the one previous to it. Via that actual “trend” value, said maintenance program would algebraically “add” it to said previous temperature, and then, via an algorithmic sequential “decrease” of said “trend” value, would add the newly calculated “trend” values to all sequentially previous temperature data until it detects a small “m” discontinuity “flag” or a “default” date. The program would then change the originally noted discontinuity “flag” from a capital “M” to a small “m”, thus signifying that “trend” corrections had been applied to all temperature data inclusive between the small “m” discontinuity “flags”.
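For “talking” purposes, here is a rough, simplified rendering of that scheme in code. It is my own interpretation of the above description, with hypothetical flag handling, so treat it as a sketch only:

```python
# A simplified flag-and-back-correct sketch (hypothetical names and flag handling).
def apply_maintenance_corrections(temps, flags):
    """temps: list of readings; flags: 'M' = uncorrected maintenance break,
    'm' = already-corrected break, '' = ordinary reading."""
    temps, flags = list(temps), list(flags)
    for i, flag in enumerate(flags):
        if flag != "M" or i == 0:
            continue
        jump = temps[i] - temps[i - 1]       # the "trend" variance at the break
        # back-correct to the previous small-'m' flag (or the start of the record)
        start = 0
        for j in range(i - 1, -1, -1):
            if flags[j] == "m":
                start = j
                break
        span = i - start
        # full correction just before the break, shrinking step by step going back
        for k, j in enumerate(range(i - 1, start - 1, -1)):
            temps[j] += jump * (span - k) / span
        flags[i] = "m"                       # mark this break as corrected
    return temps, flags

readings = [20.0, 20.4, 20.8, 21.2, 20.1, 20.3]   # warm drift, then maintenance drop
marks    = ["",   "",   "",   "",   "M",  ""]
corrected, marks = apply_maintenance_corrections(readings, marks)
print([round(v, 2) for v in corrected], marks)
```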
Of course, the above requires use of the “raw” and/or “trend” corrected data each time said new “maintenance” program is executed.
If that was not what you were “asking for” ….. then my bad, …. I assumed wrong.
Richard Verney,
Good comments all – I second all your questions. Have you seen this today?
http://notalotofpeopleknowthat.wordpress.com/2014/06/29/more-news-on-ushcn-temperature-adjustments/#comment-26002
Does anyone know how basic quality control is done to detect such data changes?