Problems With The Scalpel Method

Guest Post by Willis Eschenbach

In an insightful post at WUWT, Bob Dedekind talked about a problem with temperature adjustments. He pointed out that stations are maintained, by doing things like periodically cutting back encroaching trees or repainting the Stevenson screen. He noted that if we try to “homogenize” these stations, we get an erroneous result. This led me to think about the “scalpel method” used by the Berkeley Earth folks to correct discontinuities in the temperature record.

The underlying problem is that most temperature records have discontinuities. There are station moves, instrument changes, routine maintenance, and the like. As a result, the raw data may not reflect the actual temperatures.

There are a variety of ways to deal with that, which are grouped under the rubric of “homogenization”. A temperature dataset is said to be “homogenized” when all effects other than temperature effects have been removed from the data.

The method that I’ve recommended in the past is called the “scalpel method”. To see how it works, suppose there is a station move. The scalpel method cuts the data at the time of the move, and simply considers it as two station records, one at the original location, and one at the new location. What’s not to like? Well, here’s what I posted over at that thread. The Berkeley Earth dataset is homogenized by the scalpel method, and both Zeke Hausfather and Steven Mosher have assisted the Berkeley folks in their work. Both of them had commented on Bob’s post, so I asked them the following.

Mosh and/or Zeke, Stephen Rasey above and Bob Dedekind in the head post raise several points that I hadn’t considered. Let me summarize them; they can correct me if I’m wrong.

• In any sawtooth-shaped temperature record subject to periodic or episodic maintenance or change, e.g. painting a Stevenson screen, the most accurate measurements are those immediately following the change. After that, there is a gradual drift in the temperature until the next maintenance.

• Since the Berkeley Earth “scalpel” method would slice these into separate records at the time of the discontinuities caused by the maintenance, it throws away the trend correction information obtained at the time when the episodic maintenance removes the instrumental drift from the record.

• As a result, the scalpel method “bakes in” the gradual drift that occurs in between the corrections.

Now this makes perfect sense to me. You can see what would happen with a thought experiment. If we have a bunch of trendless sawtooth waves of varying frequencies, and we chop them at their respective discontinuities, average their first differences, and cumulatively sum the averages, we will get a strong positive trend despite the fact that there is absolutely no trend in the sawtooth waves themselves.
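For illustration, here is a minimal R sketch of that thought experiment. It is my own toy example, not Berkeley Earth’s code; the number of stations, the drift rate, and the rule for discarding the cut differences are all assumptions chosen just to make the effect visible.

# Toy example: trendless sawtooth "stations" that drift upward and reset
# to zero at each maintenance event, then get "scalpeled" at the resets.
set.seed(42)
n_time    <- 240                        # length of each record (months)
n_station <- 50                         # number of synthetic stations

make_sawtooth <- function(period, drift = 0.05) {
  t <- 0:(n_time - 1)
  drift * (t %% period)                 # gradual rise, sudden reset every 'period'
}

periods <- sample(12:60, n_station, replace = TRUE)
waves   <- sapply(periods, make_sawtooth)    # n_time x n_station matrix, no overall trend

# The "scalpel": take first differences, but discard the differences that span
# a reset (the sudden drops), just as cutting the record there would do.
diffs <- apply(waves, 2, diff)
diffs[diffs < 0] <- NA

mean_diff     <- rowMeans(diffs, na.rm = TRUE)   # average first difference across stations
reconstructed <- cumsum(c(0, mean_diff))         # cumulative sum = composite record

# Slope of the composite record
coef(lm(reconstructed ~ seq_along(reconstructed)))[2]

Every retained difference equals the assumed drift and every corrective drop is thrown away, so the composite climbs at roughly the drift rate even though no individual sawtooth has any trend over its full record.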

So I’d like to know if and how the “scalpel” method avoids this problem … because I sure can’t think of a way to avoid it.

In your reply, please consider that I have long thought and written that the scalpel method was the best of a bad lot of methods. All methods have problems, but I thought the scalpel method avoided most of them … so don’t thump me on the head, I’m only the messenger here.

w.

Unfortunately, it seems that they’d stopped reading the post by that point, as I got no answer. So I’m here to ask it again …

My best to both Zeke and Mosh, who I have no intention of putting on the spot. It’s just that as a long time advocate of the scalpel method myself, I’d like to know the answer before I continue to support it.

Regards to all,

w.

July 5, 2014 2:58 pm

RE: my July 2, 12:06 pm
Speaking of MSE/RMSE (mean standard error) …
We never measure any daily Tave.
We measure 30 daily Tmin and 30 daily Tmax.
So let’s assume that our
30 Tmins are c(4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6), and
30 Tmaxes are c(14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16)
The monthly Tavg from the 30 Tmins and 30 Tmaxes = 10.000 deg. C
but the StDev = 5.099, count = 60
Trmse = 5.099/sqrt(60) = 0.658 deg C.
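For anyone who wants to check those numbers, here is a short R snippet that reproduces them. Note that R’s sd() uses the n-1 divisor and gives 5.14; the 5.099 quoted above is the population standard deviation with divisor n.

tmin  <- rep(c(4, 6), 15)               # the 30 Tmins above
tmax  <- rep(c(14, 16), 15)             # the 30 Tmaxes above
temps <- c(tmin, tmax)

mean(temps)                             # 10.000 deg C
sd_pop <- sqrt(mean((temps - mean(temps))^2))
sd_pop                                  # 5.099, population standard deviation
sd_pop / sqrt(length(temps))            # 0.658 deg C, the Trmse of the monthly mean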

What is the Trmse of that month’s ANOMALY?
Answer #1. The anomaly is just the average shifted by the average of the base period. It is just a bulk shift. For the purposes of uncertainty analysis, we are subtracting a constant and we are not adding uncertainty. It makes no difference to the slope of the trend if we subtract 9, 10, 11 or pi().
Answer #1 is correct if and only if the adjustment is the same for all months.
For instance, if we were to look at the trend in May temperature anomalies over the period 1980 to 2010, and we subtracted from each the mean of the May temperatures over those 30 years, Tm(May, 1980-2010), we would not have to be concerned with the mean standard error of that estimate, Trmse(May, 1980-2010).
But when we are combining anomalies for May, June, July, …. April, then the Trmse for each month becomes important.
Answer #2. The anomaly is a bulk shift by an uncertain quantity: (Tm(May, 1980-2010), Trmse(May, 1980-2010))
If we use the example of 30 Tmins and 30 Tmaxes for the month of May above, and we assume it repeats constantly for 30 years in the same month of May, then
Mean: Tm(May, 1980-2010) = Tm(May) = 10.000 deg. C.
RMSE (Standard Error of mean)
Trmse(May, 2010) = 0.658 deg C
Trmse(May, 1980-2010) = 0.120 deg C
So an error bar of about 0.120 deg C is added to the data when we take the anomaly. Each month gets this error bar, just from taking an anomaly from 30 years of constant temperatures where the daily difference between high and low is between 8 and 12 deg C and averages 10 deg C.
So, when you take the Anomaly for the Month
TA(May,2010) = T(May,2010) – Tm(May,1980-2010), where we treat these temperatures as (mean, standard deviation) pairs
Then
TA(May,2010) = (10, 0.658) – (10 , 0.120) = (0, 0.669)
(these standard deviations add in quadrature, like the Pythagorean theorem)
So while the mean anomaly is zero (as desired), its mean standard error has increased to 0.669 deg C.
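In R, that quadrature sum looks like this (a sketch using the numbers above):

trmse_month <- 0.658                    # standard error of one month's mean
trmse_base  <- trmse_month / sqrt(30)   # ~0.120, standard error of the 30-year base mean
sqrt(trmse_month^2 + trmse_base^2)      # ~0.669 deg C, the Trmse of the anomaly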
Moral of the story: the Trmse of the 30-year average is not insignificant, easily a couple of tenths of a degree. It matters when you attempt to compare one month against another month to remove an unknown but estimated seasonal signal. The overall error bar of an individual month’s anomaly, TA(Month, Year), is probably mostly composed of the uncertainty from the single month’s Trmse(Month, Year) derived from that month’s measured mins and maxes.

July 6, 2014 11:47 am

Addendum to my July 5, 2:58 pm
Answer #1 is correct if and only if the adjustment is the same for all months.
The bulk shift of a station-month temperature record to a Temperature Anomaly record is the addition of a constant with no uncertainty IF and ONLY IF the shift is the same for all months AND ALL STATIONS that it will be compared against.
The head post is about “Problems with the Scalpel Method” of BEST.
BEST uses its scalpel by comparing the Temperature Anomaly record of a station with a kriged regional field derived from the Temperature Anomaly records of “nearby” stations. This comparison means that the Trmse, the mean standard error of the mean temperature, must be included in the uncertainty analysis when comparing stations within the same month.
I have shown above that the individual station error bars of the mean temperature of any month for any station, derived from the only observed measurements, the daily Tmin and Tmax, are greater than 0.6 deg C if the daily min-to-max range is only 10 deg C. I have also shown that with a 0.6 deg uncertainty per month, a 30-year average for the month will also have an uncertainty of at least 0.1 deg C. Every station in the regional homogenization grid experiences these same uncertainties.
I submit that with these real uncertainties in the means, as well as in the average maxes and average mins derived from the raw recorded station data, it is impossible for BEST or anyone else to determine an empirical breakpoint for any station based upon its fit to a regional trend.
The regional trend, which we have no reason to believe is a monotonic surface, has significant fuzzy thickness from these error bars. It is a surface where each control point carries a Trmse of over 0.6 deg C, plus a 0.1 deg C uncertainty in its anomaly adjustment. The derived kriged surface, with its significant uncertainty thickness, is then applied to a subject station, which also carries a 0.6 deg C temperature uncertainty and a 0.1 deg C uncertainty in its anomaly. Under such circumstances, it is unlikely that any station will exceed the errors present to earn an empirical breakpoint, much less an average of over five breakpoints per station.
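As a rough illustration of that last point, here is a toy simulation. It is not BEST’s actual breakpoint test; it simply assumes the roughly 0.66 deg C per-month standard error derived above and asks whether a 0.1 deg C step stands out against it.

set.seed(1)
n      <- 120                                   # ten years of monthly anomalies
signal <- c(rep(0, n / 2), rep(0.1, n / 2))     # a 0.1 deg C break at mid-record
noise  <- rnorm(n, sd = 0.66)                   # per-month scatter of the monthly means
series <- signal + noise

# A simple two-sample test across the candidate break: with this much scatter,
# a 0.1 deg C shift is usually not statistically detectable.
t.test(series[1:(n / 2)], series[(n / 2 + 1):n])$p.value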
There are many reasons I reject the BEST scalpel and its reliance on trends to tease out a climate signal. From day one I had objections based upon information theory and the loss of low-frequency (climate) signal caused by the scalpel and the emphasis on temperature trends. The results of BEST’s work on individual stations don’t make sense: it can find 20 station adjustments at Luling, TX, and 8 station adjustments at DENVER STAPLETON AIRPORT (but misses the opening and closing of the airport), yet it misses the fire-bombing of Tokyo in March 1945.
Even the greatest supporters of BEST must acknowledge that the BEST scalpel requires precision in the data to justify 0.1 degree breakpoint shifts, a precision that does not exist when the raw data are daily mins and maxes.

July 7, 2014 10:18 pm

Understanding adjustments to temperature data
by Zeke Hausfather
http://judithcurry.com/2014/07/07/understanding-adjustments-to-temperature-data/
543 Comments in less than 18 hours.
Why Adjust Temperatures?
What are the Adjustments?
Quality Control
Time of Observation (TOBs) Adjustments
Pairwise Homogenization Algorithm (PHA) Adjustments
Infilling
Changing the Past?

This will be the first post in a three-part series examining adjustments in temperature data, with a specific focus on the U.S. land temperatures. This post will provide an overview of the adjustments done and their relative effect on temperatures. The second post will examine Time of Observation adjustments in more detail, using hourly data from the pristine U.S. Climate Reference Network (USCRN) to empirically demonstrate the potential bias introduced by different observation times. The final post will examine automated pairwise homogenization approaches in more detail, looking at how breakpoints are detected and how algorithms can be tested to ensure that they are equally effective at removing both cooling and warming biases.

July 7, 2014 10:38 pm

More on Zeke’s post at Curry’s on July 7:
Of the 547 comments so far, 98 are from Steven Mosher, with sentences so short, terse, and abbreviated that his word processor must use the BEST scalpel as a plug-in.
Paul Matthews has a short comment that sums up a good deal of the thread.

Paul Matthews | July 7, 2014 at 9:51 am | Reply
Congratulations, you’ve written a long post, managing to avoid mentioning all the main issues of current interest.
“Having worked with many of the scientists in question”
In that case, you are in no position to evaluate their work objectively.
“start out from a position of assuming good faith”
I did that. Two and a half years ago I wrote to the NCDC people about the erroneous adjustments in Iceland (the Iceland Met Office confirmed there was no validity to the adjustments) and the apparently missing data that was in fact available. I was told they would look into it and to “stay tuned for further updates” but heard nothing. The erroneous adjustments (a consistent cooling in the 1960s is deleted) and bogus missing data are still there.
So I’m afraid good faith has been lost and it’s going to be very hard to regain it.

July 7, 2014 10:50 pm

From Zeke’s Curry paper, at the end of the Pairwise Homogenization Algorithm (PHA) Adjustments section.

With any automated homogenization approach, it is critically important that the algorithm be tested with synthetic data with various types of biases introduced (step changes, trend inhomogenities, sawtooth patterns, etc.), to ensure that the algorithm will identically deal with biases in both directions and not create any new systemic biases when correcting inhomogenities in the record. This was done initially in Williams et al 2012 and Venema et al 2012. There are ongoing efforts to create a standardized set of tests that various groups around the world can submit homogenization algorithms to be evaluated by, as discussed in our recently submitted paper. This process, and other detailed discussion of automated homogenization, will be discussed in more detail in part three of this series of posts.

The Williams link is a PDF; the Venema link is to an abstract. Neither makes any reference to “sawtooth”.

July 7, 2014 11:01 pm

Had to repost this gem of an observation from Patrick B, with Mosher’s typical “read the literature” (where) retort.
Patrick B | July 7, 2014 at 9:48 am | Reply

How could you have written this article without once mentioning error analysis?
Data, real original data, has some margin of error associated with it. Every adjustment to that data adds to that margin of error. Without proper error analysis and reporting that margin of error with the adjusted data, it is all useless. What the hell do they teach hard science majors these days?
Steven Mosher | July 7, 2014 at 11:24 am | Reply
the error analysis for TOBS for example is fully documented in the underlying papers referenced here.

It’s true. Patrick B’s comment is the first time “error” appears in the comment stream. It is not in the head post.
