Guest Post by Willis Eschenbach
In an insightful post at WUWT, Bob Dedekind talked about a problem with temperature adjustments. He pointed out that stations are maintained, by doing things like periodically cutting back encroaching trees or repainting the Stevenson screen, and he noted that if we try to “homogenize” these stations, we get an erroneous result. This led me to a consideration of the “scalpel method” used by the Berkeley Earth folks to correct discontinuities in the temperature record.
The underlying problem is that most temperature records have discontinuities. There are station moves, changes of instruments, routine maintenance, and the like. As a result, the raw data may not reflect the actual temperatures.
There are a variety of ways to deal with that, which are grouped under the rubric of “homogenization”. A temperature dataset is said to be “homogenized” when all effects other than temperature effects have been removed from the data.
The method that I’ve recommended in the past is called the “scalpel method”. To see how it works, suppose there is a station move. The scalpel method cuts the data at the time of the move, and simply considers it as two station records, one at the original location, and one at the new location. What’s not to like? Well, here’s what I posted over at that thread. The Berkeley Earth dataset is homogenized by the scalpel method, and both Zeke Hausfather and Steven Mosher have assisted the Berkeley folks in their work. Both of them had commented on Bob’s post, so I asked them the following.
Mosh and/or Zeke, Stephen Rasey above and Bob Dedekind in the head post raise several points that I hadn’t considered. Let me summarize them, they can correct me if I’m wrong.
• In any kind of sawtooth-shaped temperature record subject to periodic or episodic maintenance or change, e.g. painting a Stevenson screen, the most accurate measurements are those immediately following the change. After that, there is a gradual drift in the temperature until the next maintenance.
• Since the Berkeley Earth “scalpel” method would slice these into separate records at the time of the discontinuities caused by the maintenance, it throws away the trend correction information obtained at the time when the episodic maintenance removes the instrumental drift from the record.
• As a result, the scalpel method “bakes in” the gradual drift that occurs in between the corrections.
Now this makes perfect sense to me. You can see what would happen with a thought experiment. If we have a bunch of trendless sawtooth waves of varying frequencies, and we chop them at their respective discontinuities, average their first differences, and cumulatively sum the averages, we will get a strong positive trend despite the fact that there is absolutely no trend in the sawtooth waves themselves.
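Here is a minimal sketch of that thought experiment, with made-up maintenance intervals and drift rates (this is not the Berkeley Earth code, just the slicing logic described above):

```python
import numpy as np

n_months = 600                          # 50 years of monthly data
periods = [48, 60, 72, 96, 120]         # made-up maintenance intervals, one per station

def sawtooth(n, period, rise=0.5):
    """Trendless sawtooth: drifts up by `rise` degC over each period, then resets at maintenance."""
    t = np.arange(n)
    return rise * (t % period) / period

stations = np.array([sawtooth(n_months, p) for p in periods])

# First differences of each station record
diffs = np.diff(stations, axis=1)

# "Scalpel": cut at every discontinuity, i.e. the big negative step at each reset
# never enters the record; only the gradual within-segment drift survives
diffs_sliced = np.where(diffs < 0, np.nan, diffs)

# Average the surviving first differences across stations, then cumulatively sum
combined = np.cumsum(np.nanmean(diffs_sliced, axis=0))

months = np.arange(n_months)
raw_trend = np.polyfit(months, stations.mean(axis=0), 1)[0] * 120      # degC per decade
sliced_trend = np.polyfit(months[1:], combined, 1)[0] * 120            # degC per decade
print(f"raw sawtooth average:              {raw_trend:+.2f} degC/decade")    # close to zero
print(f"sliced, first-differenced average: {sliced_trend:+.2f} degC/decade") # strong spurious warming
```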
So I’d like to know if and how the “scalpel” method avoids this problem … because I sure can’t think of a way to avoid it.
In your reply, please consider that I have long thought and written that the scalpel method was the best of a bad lot of methods; all methods have problems, but I thought the scalpel method avoided most of them … so don’t thump me on the head, I’m only the messenger here.
w.
Unfortunately, it seems that they’d stopped reading the post by that point, as I got no answer. So I’m here to ask it again …
My best to both Zeke and Mosh, who I have no intention of putting on the spot. It’s just that as a long-time advocate of the scalpel method myself, I’d like to know the answer before I continue to support it.
Regards to all,
w.
@Bill Illis at 5:23 am
This is how bias can be detected: simple histograms. And then we can determine whether the excess of negative adjustments in 1944 is valid, for example.
In addition to simple histograms, I think we need to see scatter plots of breakpoint offset (Y) versus the length of the segment prior to the breakpoint (X), for different 5-year semi-decades.
I can envision the simple histogram showing no bias in breakpoints, but only because the offsets after short segments counteract offsets of opposite sign after longer segments. A scatter plot of offset vs. prior segment length could show an interesting trend, and how it changes for different periods of the total record.
Better yet, why don’t we just have access to a table of:
StationID, BreakPoint date, Segment Length Prior to break, Trend value before Break, Trend value after break.
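For what it’s worth, given access to a table like that, the scatter plot would only take a few lines. Everything here is hypothetical: the file name and column names are assumptions, not an actual Berkeley Earth product.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical breakpoint table with the columns suggested above:
# station_id, break_date, prior_segment_years, offset_degC, trend_before, trend_after
breaks = pd.read_csv("breakpoints.csv", parse_dates=["break_date"])

# Bin the breakpoints into 5-year periods and plot offset vs. prior segment length
breaks["semi_decade"] = (breaks["break_date"].dt.year // 5) * 5
fig, ax = plt.subplots()
for period, grp in breaks.groupby("semi_decade"):
    ax.scatter(grp["prior_segment_years"], grp["offset_degC"], s=10, label=str(period))
ax.set_xlabel("Length of segment prior to breakpoint (years)")
ax.set_ylabel("Breakpoint offset (°C)")
ax.legend(title="5-year period", fontsize="small")
plt.show()
```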
Stephen Richards says:
June 29, 2014 at 2:50 am
Thanks, Stephen. And I imagine that just as many times as you have written that, someone has responded along the lines that almost all data contains a variety of errors.
For example, there are a number of temperature stations that have occasionally erroneously reported their data in Fahrenheit rather than Celsius. According to you, we should never alter the data, we should just use the incorrect data “as is”. So we should use the Fahrenheit figures, rather than daring to ALTER PAST DATA by converting them to the correct Celsius figures …
Do you see how crazy you sound with your absolute dicta? Science is rarely that black and white.
Here’s another example. Say that we have a change in the time of observation. Suppose that for years we’ve been taking afternoon temperatures at 3 PM at all of our temperature stations, and then we start taking them at 2 PM.
Of course the 2 PM figures are warmer than the 3 PM figures, so when you consider the raw data, it looks like we have massive global warming. Now if someone goes around saying “I have raw unaltered historical data which proves that there is global warming”, what will you say in response?
Me, I’ll say “No, there’s no global warming. You just haven’t accounted for the change in observation times”.
But in that situation, if you say that we must accept the observational data exactly as it was recorded because “IT IS TOTALLY SCIENTIFICALLY UNACCEPTABLE TO ALTER PAST DATA”, then you’ve just put your full weight behind a highly misleading (although totally accurate and unaltered) temperature dataset.
When you realize that there are errors like that in historical data, there are two basic choices: throw out the data, or correct it for the known bias, in this case the bias caused by the change in the time of observation.
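Here is a minimal sketch of the second choice, on synthetic data, assuming (as in the example above) that the 2 PM reading runs about half a degree warmer than the 3 PM reading; the numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n_years = 40
switch = 20                  # year the observation time changes from 3 PM to 2 PM
tob_offset = 0.5             # assumed warm bias of the 2 PM reading, degC

true_temp = 15.0 + rng.normal(0, 0.2, n_years)     # no underlying warming at all
observed = true_temp.copy()
observed[switch:] += tob_offset                    # later years sampled at the warmer hour

years = np.arange(n_years)
correction = np.where(years >= switch, tob_offset, 0.0)
raw_trend = np.polyfit(years, observed, 1)[0] * 10
adj_trend = np.polyfit(years, observed - correction, 1)[0] * 10

print(f"raw 'unaltered' data:  {raw_trend:+.2f} degC/decade")   # spurious warming from the TOB change
print(f"after TOB correction:  {adj_trend:+.2f} degC/decade")   # back to roughly no trend
```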
And while you can make valid arguments for one or the other, correcting the known errors in a dataset is a valid scientific choice, one made by reputable scientists in a host of fields.
So I fear that you’ll be a voice crying in the wilderness forever if you think we should keep Fahrenheit readings in place of Celsius readings or that we should not correct for the known bias caused by the change in times of observation …
Best regards,
w.
Bill Illis says:
June 29, 2014 at 5:23 am
Thanks, Bill, but why? Suppose all of the temperature datasets were perfect except for a change in time of observation in 1968. Almost all of the adjustments will be in the same direction, not randomly distributed plus and minus. As a result, we can’t test with a histogram as you propose …
w.
I never said it was practical….
Two parallel measurement stations are required. A change at one of them could be classified as restorative (such as painting the outside), in which case the new measurement is taken as more accurate. The second station is needed to control for a coincidental real discontinuity. Then the restored unit can be used to correct for the degradation trend (there could actually be multiple degradation trends). Replacing an aged thermometer with a new one could be classified similarly. However, changing measurement technology might not fall into a “restorative” category, in which case the second station can be used to help make adjustments to the modified station’s output if the reported numbers are intended to create a continuous record. In any case, it is imperative that the original raw data be preserved. A second station could also help prevent data loss when new tech fails. Recording metadata is vital if you ever hope to make meaningful corrections.
This was a great post, Willis.
Victor Venema says:
June 29, 2014 at 7:09 am
Thanks for your reply, Victor, but that doesn’t solve the conundrum. Let me present it again:
Your method, using “statistical homogenization to remove the unknown jumps and gradual inhomogeneities”, will not fix the bogus trend created out of thin air by the scalpel method.
My question was, is it even possible to fix that spurious trend created by the scalpel method, and if so, how are the Berkeley Earth folks doing it?
Much appreciated,
w.
ferdberple says:
June 29, 2014 at 7:14 am
ferd, thanks for your thoughts. A small correction. To me, the problem is not just that the majority of jumps will be in one direction, leading to an overall trend.
The problem is that even if by chance the jumps are randomly distributed and there is no change in the overall trend, it plays havoc with the individual station trends.
As an explanatory example, suppose we have the following two equations:
2 + 2 = 5
2 + 2 = 3
Both of them are obviously wrong … so average the two equations (just like we average our station trends) and we get
2 + 2 = 4
… I’m sure you can see the problem. Getting the correct overall result does NOT mean that the underlying “corrected” data is now valid …
All the best,
w.
Relative homogenization methods compute the difference between the station you are interested in and its neighbours. If the mean of this difference is not constant, there is something happening at one station that does not happen at the other. Thus such changes are removed as well as possible in homogenization, to be able to compute trends with more reliability (and the raw data is not changed and can also be downloaded from NOAA).
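As a minimal sketch of such a difference series (toy numbers, not any particular homogenization package), consider a candidate station with a slow warm drift that is reset by maintenance every ten years, compared against a clean neighbour; once the shared regional signal cancels in the difference, the drift is plain to see.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 360                                            # 30 years of monthly data
regional = rng.normal(0, 0.5, n).cumsum() * 0.05   # shared regional signal

neighbour = regional + rng.normal(0, 0.1, n)
candidate = regional + rng.normal(0, 0.1, n)
candidate += 0.004 * (np.arange(n) % 120)          # slow warm drift, reset by maintenance every 10 years

diff = candidate - neighbour                       # shared regional signal cancels; the sawtooth remains

# Crude non-constancy check: the mean of the difference drifts within each maintenance cycle
phase = np.arange(n) % 120
print(f"mean difference just after maintenance:  {diff[phase < 24].mean():+.2f} degC")
print(f"mean difference just before maintenance: {diff[phase >= 96].mean():+.2f} degC")
```

A real relative homogenization method is of course more sophisticated than comparing two windows, but the principle is the same: the candidate-minus-neighbour series exposes what the candidate record alone hides.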
Hullo, Doc. V.
Problem is that the “something happening” appears to be Good Siting. And the result of that is that most of the 20% of well-sited stations are identified as outliers and are adjusted to conform with the readings of the 80% of poorly sited stations.
And since microsite bias is continual, without a breakpoint, the problem is not identified by BEST.
The result is that the song of my precious Class 1/2 stations is silenced. Silenced as if it had never been sung. My beautiful song. My “true signal”. Gone with the Wind. Blown away.
And unless a deconstruction of the adjusted data is applied, there is not the slightest trace that their song was ever sung in the first place.
The problem is that even if by chance the jumps are randomly distributed and there is no change in the overall trend, it plays havoc with the individual station trends
==============
agreed. my thoughts were specific to calculating the overall trend. as soon as you start infilling, the errors in the individual stations will blow my approach out of the water. a similar argument could also be applied to anomalies. since the individual stations have false trends, their anomalies will also be wrong, further aggravating the errors.
since this problem very much resembles pair-wise correction, the same problems are likely to persist. the underlying issue is that the method is sensitive to the rate of degradation of the signal, which leads to bias when degradation occurs slowly.
Willis
You say
So we should use the Fahrenheit figures, rather than daring to ALTER PAST DATA by converting them to the correct Celsius figures
I don’t think anyone would suggest that, especially as most of the world uses Celsius and the USA is the exception; even the UK uses Celsius, with Fahrenheit as a bracketed value for older readers (known colloquially as Old Money). For global values, one or the other has to be converted. Adding a fudge factor to the changed value would be unacceptable, as would not noting why the change was made. Changing historical datasets in the way Paul Homewood (and at one time only Steven Goddard) describes is not acceptable; that is adding a fudge factor to data that someone else is going to use, and it will confirm your results and not your theory.
I’m quite happy for the original unedited data to be all that a government agency publishes; corrections and fudge factors can be added by any researcher. The data keepers should be just that, not the arbiters of what the data recorder meant when he wrote the figures on the piece of paper, or what the automated station meant when it sent the electrons down the wire.
And unless a deconstruction of the adjusted data is applied, there is not the slightest trace that their song was ever sung in the first place.
=========
agreed. I didn’t consider that infilling and anomalies would mask the false trends in individual stations, making post-slice correction effectively impossible. the sum of the absolute values of the offsets is a measure of the maximum error in BEST, with the sum of the offsets a measure of the minimum error. so, publication of the detailed offsets due to slicing would appear to be the next step in validation of the BEST methodology. residuals are not sufficient.
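(A toy illustration of the difference between those two sums, with made-up numbers: breakpoint offsets of +0.3, −0.2 and +0.4 °C sum to a net +0.5 °C, while their absolute values sum to 0.9 °C.)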
the really interesting thing about the slice method is that when it was proposed it seemed like a good idea. it is only now, much later in the day, that folks are realizing that it is sensitive to a certain class of errors.
it seems likely that pair-wise correction was the same. the researchers were trying to correct a specific problem, and the method worked for the cases studied. it was only much later, when the effects started to diverge from reality, that there was any indication there were problems.
like a computer system that randomly changes 0s to 1s and 1s to 0s. If it happens quickly enough you can find the error. but if the problem works slowly enough, over time your system will die and there is nothing you can do to prevent it. it is almost impossible to detect slow-moving errors.
Hi Willis,
Sorry for not getting back to you earlier; just landed back in SF after a flight from NYC.
The performance of homogenization methods in the presence of sawtooth inhomogeneities is certainly something that could be tested better using synthetic data. However, as Victor mentioned, relative homogenization methods look at the time-evolution of differences from surrounding stations. If the gradual part of the sawtooth were being ignored, the station in question would diverge further and further away from its neighbors over time and trigger a breakpoint.
There are a number of examples of apparent sawtooth patterns relative to surrounding stations in the Berkeley data that seem to be correctly adjusted; I haven’t found an example of poor adjustment that creates a biased trend relative to surrounding stations, but I’d encourage folks to look for them.
Here are a few examples of sawtooth and gradual trend inhomogeneities that seem to be correctly adjusted:
http://berkeleyearth.lbl.gov/stations/169993
http://berkeleyearth.lbl.gov/stations/30748
http://berkeleyearth.lbl.gov/stations/156164
http://berkeleyearth.lbl.gov/stations/161705
http://berkeleyearth.lbl.gov/stations/33493
http://berkeleyearth.lbl.gov/stations/34034
It’s also worth mentioning that Berkeley has a second type of homogenization that would catch spuriously inflated trends, at least if they were isolated. The kriging process downweights stations with divergent trends vis-à-vis surrounding stations when creating the regional temperature field, after all stations have been homogenized.
Willis: “Your method, using “statistical homogenization to remove the unknown jumps and gradual inhomogeneities”, will not fix the bogus trend created out of thin air by the scalpel method.”
I do not have a method, but have only validated the methods of others up to now. The normal way of homogenization is not the scalpel method. In the normal way, the neighbours are also used to compute the corrections. This makes the long-term trend of the station with the gradual inhomogeneity similar to the one of the neighbours. I do not expect standard methods to have more problems with gradual inhomogeneities than with jump inhomogeneities.
Willis: “My question was, is it even possible to fix that spurious trend created by the scalpel method, and if so, how are the Berkeley Earth folks doing it?”
If I understand their article right, BEST reduces the weight of data with gradual inhomogeneities. I would personally prefer to remove it, but they prefer to be able to say that they used all the data and did not remove anything. If the weight is small enough, that would be similar to removing the data. That is the part of the algorithm I would study, not the scalpel mentioned in the title of this blog post.
Hello prof Evan Jones, the quality of station placement is a different problem from the one mentioned in this post. I do not want to redo our previous discussion, which would be off topic here. Do you have any new arguments since our last long, civil and interesting discussion?
Also, Willis, Berkeley really doesn’t optimize for getting the corrected underlying data as accurate as possible; rather, it focuses more on generating an accurate regional-level field. It will remove urban heating, in Las Vegas for example, even though that’s a “real” local temperature effect. It also produces temperature fields that may be a bit too smooth, though that’s difficult to test given the absence of good ground-truth high-resolution spatially complete data.
How many knobs are there?
Show us the effect on the result of turning each knob from zero to ten.
The BEST slice/dice knob is obviously already turned to eleven by Mophead Mosher:
http://youtu.be/4xgx4k83zzc
Does the clear overzealousness of the chopping affect the trend? YES OR NO? We don’t know. So we don’t trust your black box. Where is the online version that lets us play with the settings? This algorithm matches the other series out there only too well early on, but then becomes a climate-model-matching outlier in the last decade. Why? Where is the discussion of this in the peer-reviewed literature? I’ve compared it here to HadCRUT3, the Climategate University version put out before Phil Jones joined a Saudi Arabian university, which he used as his affiliation in his up-adjusted HadCRUT4 version:
http://woodfortrees.org/plot/best/mean:30/plot/hadcrut3vgl/mean:30
In a field with a bladeless hockey stick making it into the top journal Nature, you really do have to show your work instead of just releasing software few know how to run, since no, we no longer trust you. You would think you would jump at the chance to convince us further. Maybe start with finding that blade in the Marcott input data, or if you can’t find it, work on getting that paper retracted and all of the “scientists” involved fired, if not arrested. That is how it is done in normal non-activist science such as medical research:
“A former Iowa State University scientist who admitted faking lab results used to obtain millions of dollars in grant money for AIDS research has been charged with four felony counts of making false statements, an indictment filed in federal court shows.”
I’m even amazed Berkeley didn’t sue you guys for using their name, like Harvard sued a company called Veritas. Legally, it’s very easy and in fact guaranteed that the public will associate the BEST plot with Berkeley University, though at least you didn’t also swipe their logo. Your results help the Obama administration feel justified in stereotyping us skeptics as Moon-landing-denying members of the Flat Earth Society. So we are asking you for clarification, loudly. Are the other climate-model-falsifying temperature products wrong or are you wrong? Are climate models and thus climate alarm now falsified or not?
Zeke Hausfather says:
June 29, 2014 at 12:01 pm
Thanks, Zeke, no worries about the timing. I’m well aware people have time constraints.
I fear I’m not following that one. It shows the record for Savannah, GA, with three station moves and no less than eight “empirical breaks”. These are identified by some computer algorithm whose exact details are unimportant to this discussion.
What is important is your claim that identifying these eight! “empirical breaks” and using the scalpel on them means they are “correctly adjusted” … what is the evidence for that?
Next, you say that
Unfortunately, you’ve fallen into the common trap of assuming that GOOD CORRELATION OF DATASETS MEANS GOOD CORRELATION OF TRENDS. I’ve demonstrated this in the past using both pseudodata and actual data. Here’s the pseudodata:
[Figure: pseudodata series, highly correlated with one another but with widely differing trends]
Note that the trends vary from the floor to the ceiling … why is this important? Because in all cases the correlation between all individual pairs of pseudodata is above 90%.
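A minimal sketch of that kind of pseudodata (arbitrary numbers, not the figures above): five series sharing a seasonal cycle, with built-in trends running from minus one to plus three degrees per century, yet every pairwise correlation still above 0.9.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1200                                       # 100 years of monthly data
t = np.arange(n)
seasonal = 10 * np.sin(2 * np.pi * t / 12)     # shared seasonal cycle, +/- 10 degC

trends = np.array([-1.0, 0.0, 1.0, 2.0, 3.0])  # built-in trends, degC per century
stations = np.array([seasonal + rng.normal(0, 1.0, n) + (tr / 100) * (t / 12)
                     for tr in trends])

corr = np.corrcoef(stations)
print(f"minimum pairwise correlation: {corr[np.triu_indices(5, 1)].min():.3f}")
for tr, s in zip(trends, stations):
    fitted = np.polyfit(t / 12, s, 1)[0] * 100
    print(f"built-in trend {tr:+.1f} degC/century, fitted trend {fitted:+.1f} degC/century")
```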
Now, because they are so highly correlated, your whiz-bang algorithm would “adjust” them so the trends are all quite similar …
Nor is this just a theoretical problem. Here are the trends from a group of stations within 500 miles of Anchorage, all of which have a correlation over 0.5 with Anchorage. Despite that, their trends vary by a factor of three.
So I fear that the fact that you can’t find any “biased trends relative to surrounding stations” is not evidence that you’ve done it right, as you claim. Quite the opposite: as the Alaska example shows, if you adjust the data so that none of the trends are “biased relative to surrounding stations”, that’s evidence you’ve done it wrong.
w.
Zeke:
OK, here’s one: Auckland in New Zealand.
BEST shows 0.99±0.25°C/century from 1910 for Auckland. The correct value (manually adjusting for known UHI and shelter) is closer to 0.5±0.3°C/century.
I can’t see how one can get a regionally correct value without first obtaining accurate underlying data, when the regional values are based on the underlying data. For example, I have no doubt that the incorrect Auckland series was used by BEST to adjust other NZ sites, thereby introducing an error.
Apologies, I didn’t close the link properly. Also a typo, the correct Auckland trend is 0.5±0.3°C/century.
[Fixed. -w.]
This example that our host pulled up yesterday would be worth looking at:
Luling TX is also the one Paul Homewood picked out for other reasons.
Apparently this is a good site with a stable MMTS since 1995, yet it seems several discontinuities are picked up in relation to its regional average.
Does that indicate that there is a notable bias in the regional mean?
“Bob Dedekind: I can’t see how one can get a regionally correct value without first obtaining accurate underlying data”
Yes, this is what I’m questioning above. It seems like the method will just drag everything to the lowest common denominator.
There is an implicit assumption that regional average is somehow more accurate than any individual station. In a network with 80% sub-standard stations, I really don’t see that as justified.
This is what our host referred to as warm soup.
Homogenisation really means just that, putting it through the blender. This just ensures uniformly poor quality.
This makes the long-term trend of the station with the gradual inhomogeneity similar to the one of the neighbours.
And that, in a nutshell, doc, is my beef.
Greg Goodman says: June 29, 2014 at 2:35 pm
“There is an implicit assumption that regional average is somehow more accurate than any individual station.”
No, there isn’t. Regional averages are only used if the station data is doubtful or missing. Luling was a classic case where the algorithm correctly doubted.
Infilling with interpolated values before integrating is a neutral choice. It does not improve or degrade. Consider trapezoidal integration. If you add linearly interpolated points, it makes no difference at all. The scheme assumes all unknown points are linear interpolates.
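A quick check of that claim, as a minimal sketch with arbitrary numbers: split each interval at a linearly interpolated midpoint and the trapezoidal integral comes out the same.

```python
import numpy as np

def trap(x, y):
    """Plain trapezoidal integration."""
    return np.sum((y[1:] + y[:-1]) / 2 * np.diff(x))

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([15.2, 15.9, 14.8, 16.1, 15.5])   # arbitrary station values

x_fine = np.arange(0.0, 4.01, 0.5)             # add midpoints ...
y_fine = np.interp(x_fine, x, y)               # ... infilled by linear interpolation

# The two integrals agree (up to floating-point rounding): linear infill is neutral here
print(trap(x, y), trap(x_fine, y_fine))
```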
If it’s neutral, why do it? In USHCN, it’s just so you can keep a consistent set of climatologies in the mix, so their comings and goings don’t produce something spurious.
Willis,
Neighbor (or regional climatology) difference series don’t use correlations for anything. Rather, they use the difference in temperature over time between the station in question and its neighbors. I realize that correlation often provides very little information about the trend, which is why it’s not a great indicator of potential bias.
The Savannah example shows some sawtooth patterns in the neighbor difference series, but they are homogenized in such a way that both the gradual trend and the sharp correction are removed.
Bob Dedekind,
That’s not a station record, that’s a regional record. What specific stations in the Auckland area show sawtooth-type patterns being incorrectly adjusted to inflate the warming trend? Here is a list of Auckland-area stations: http://berkeleyearth.lbl.gov/station-list/location/36.17S-175.03E
Hello prof Evan Jones, the quality of station placement is a different problem from the one mentioned in this post. I do not want to redo our previous discussion, which would be off topic here. Do you have any new arguments since our last long, civil and interesting discussion?
Hey, Doc. V. BTW, I am no professor, sorry to say. Yes, siting is not the issue. Suturing zigzags into slopes is. I would agree that homogenization would not destroy such data, but suturing might well do. (That still doesn’t do much to reduce my — dare I say hatred? — of homogenization. But the H-monster is not the culprit here, I must concede.)
Willis has hit the nail on the head, and the zigzag paint issue of the CRS units is a prime example of how such a fallacy might manifest.
I’ll repeat to the others, Dr. Venema has treated me with great courtesy and professionalism. In our discussions since 2012 of the surface stations paper, he has begged to disagree, but has always argued to the point. There is information that each of us stands to gain: he is interested in whether the paper is for real, while I, on the other hand, am interested in what form the criticism will take, especially after having dealt with the TOBS, moves, and MMTS-conversion issues.
I think we both got what we came for.
As for him, I do not think he will be too quick to adduce the point that, after all, adjusted data for both well and poorly sited stations are the same. And as for me, he has made me think more deeply beyond the stats to the mechanism in play. And I won’t be saying that TOBS doesn’t really matter that much.
Great to see this issue talked about more – I was first aware when GISS published their telling diagrams in 2001 –
How many times does a truth have to be told? – UHI warming has been cemented into global temperature series by adjusting for steps outward from cities –
http://www.warwickhughes.com/blog/?p=2678