Guest Post by Willis Eschenbach
In an insightful post at WUWT, Bob Dedekind talked about a problem with temperature adjustments. He pointed out that stations are maintained, by doing things like periodically cutting back encroaching trees or repainting the Stevenson Screen. He noted that if we try to “homogenize” these stations, we get an erroneous result. This led me to a consideration of the “scalpel method” used by the Berkeley Earth folks to correct discontinuities in the temperature record.
The underlying problem is that most temperature records have discontinuities. There are station moves, instrument changes, routine maintenance, and the like. As a result, the raw data may not reflect the actual temperatures.
There are a variety of ways to deal with that, which are grouped under the rubric of “homogenization”. A temperature dataset is said to be “homogenized” when all effects other than temperature effects have been removed from the data.
The method that I’ve recommended in the past is called the “scalpel method”. To see how it works, suppose there is a station move. The scalpel method cuts the data at the time of the move, and simply considers it as two station records, one at the original location, and one at the new location. What’s not to like? Well, here’s what I posted over at that thread. The Berkeley Earth dataset is homogenized by the scalpel method, and both Zeke Hausfather and Steven Mosher have assisted the Berkeley folks in their work. Both of them had commented on Bob’s post, so I asked them the following.
Mosh and/or Zeke, Stephen Rasey above and Bob Dedekind in the head post raise several points that I hadn’t considered. Let me summarize them, they can correct me if I’m wrong.
• In any kind of sawtooth-shaped temperature record subject to periodic or episodic maintenance or change, e.g. painting a Stevenson screen, the most accurate measurements are those immediately following the change. Following that, there is a gradual drift in the temperature until the next maintenance.
• Since the Berkeley Earth “scalpel” method would slice these into separate records at the time of the discontinuities caused by the maintenance, it throws away the trend correction information obtained at the time when the episodic maintenance removes the instrumental drift from the record.
• As a result, the scalpel method “bakes in” the gradual drift that occurs in between the corrections.
Now this makes perfect sense to me. You can see what would happen with a thought experiment. If we have a bunch of trendless sawtooth waves of varying frequencies, and we chop them at their respective discontinuities, average their first differences, and cumulatively sum the averages, we will get a strong positive trend despite the fact that there is absolutely no trend in the sawtooth waves themselves.
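Here is a minimal numerical sketch of that thought experiment (illustrative Python only, with idealized sawtooths and the breakpoints assumed to be known exactly; it is emphatically not the Berkeley Earth code):

```python
# Minimal sketch of the thought experiment: each "station" is a trendless sawtooth that
# drifts slowly upward and resets abruptly at maintenance time. Cutting at the resets and
# stitching the pieces back together via first differences discards the downward resets,
# leaving a spurious warming trend.
import numpy as np

rng = np.random.default_rng(42)
n_years, n_stations, drift = 100, 20, 0.05            # 0.05 deg/yr of instrument drift

def sawtooth(n, period):
    """Trendless record: drifts up by `drift` per year, resets to zero every `period` years."""
    return drift * (np.arange(n) % period)

periods = rng.integers(5, 25, size=n_stations)         # each station has its own maintenance cycle
stations = np.array([sawtooth(n_years, p) for p in periods])

# "Scalpel": take first differences, then drop the negative jumps (the known breakpoints)
diffs = np.diff(stations, axis=1)
diffs[diffs < 0] = np.nan

recon = np.concatenate([[0.0], np.nancumsum(np.nanmean(diffs, axis=0))])

years = np.arange(n_years)
print("trend of the raw station mean : %+.4f deg/yr" % np.polyfit(years, stations.mean(axis=0), 1)[0])
print("trend after scalpel + stitch  : %+.4f deg/yr" % np.polyfit(years, recon, 1)[0])
```

Every input series is trendless, yet the scalpelled-and-stitched reconstruction warms at roughly the drift rate, which is exactly the worry raised in the bullets above.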
So I’d like to know if and how the “scalpel” method avoids this problem … because I sure can’t think of a way to avoid it.
In your reply, please consider that I have long thought and written that the scalpel method was the best of a bad lot of methods, all methods have problems but I thought the scalpel method avoided most of them … so don’t thump me on the head, I’m only the messenger here.
w.
Unfortunately, it seems that they’d stopped reading the post by that point, as I got no answer. So I’m here to ask it again …
My best to both Zeke and Mosh, who I have no intention of putting on the spot. It’s just that, as a long-time advocate of the scalpel method myself, I’d like to know the answer before I continue to support it.
Regards to all,
w.
@Victor Venema
Especially something that would cause artificial trends due to micrositing in the 1990s, but not since the US climate reference network was installed in 2004. Puzzling to me.
I can answer this definitively. It is because there was a strong warming trend in the 1990s, but a flat trend since 2001.
Bad microsite does not create an artificial trend. It merely exaggerates a real, already-existing trend (warming or cooling). But from 2004 there is little if any trend. So there will be no trend to exaggerate from bad siting after 2004.
In the 1990s, on the other hand, there was a strong warming trend (CO2 forcing + positive PDO effect). So that is where the heat sink effect is dominant.
The microsite effect works both ways: From 1998 – 2008, there was cooling in the US (thanks to the 1998 El Nino start-point and the 2008 La Nina endpoint). And, yes, the poorly sited stations show significantly more cooling than the well sited stations.
In short, bad siting exaggerates a trend, either warming or cooling. But if there is a flat trend, there will be no exaggeration.
Stephen Richards says: June 29, 2014 at 2:50 am
“IT IS TOTALLY SCIENTIFICALLY UNACCEPTABLE TO ALTER PAST DATA”
I agree with Willis, but the silly thing is, the TOBS adjustment isn’t even doing that. The past data is actually the reading of the position of the markers, at whatever time. It isn’t a daily max; that requires an act of interpretation, indeed an assumption. The observer records, say, a max marker at 80°F at 5 pm on Tuesday. Is that a Tuesday max or a Monday max? The data doesn’t say.
In the past, it would probably have been assumed Tuesday. But quite often it would have been Monday, and it makes a difference, because of double counting warm afternoons. We now have the ability to quantify that assumption, with hourly data.
Surely past assumptions aren’t sacrosanct.
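To make the double-counting mechanism concrete, here is a rough Python illustration using synthetic hourly data (the diurnal cycle, noise level and observation hour are all made up; this is not the actual TOBS adjustment, just the effect Nick describes):

```python
# Synthetic hourly temperatures: a diurnal cycle peaking mid-afternoon plus
# day-to-day "weather" that is constant within each calendar day.
import numpy as np

rng = np.random.default_rng(0)
n_days = 3650
hours = np.arange(n_days * 24)

daily_anom = np.repeat(rng.normal(0.0, 4.0, n_days), 24)          # warm and cool spells
diurnal = 10.0 * np.sin(2 * np.pi * ((hours % 24) - 9) / 24)      # peak around 3 pm
temps = 15.0 + diurnal + daily_anom

def mean_daily_max(obs_hour):
    """Mean of 'daily' maxima when the max marker is reset at obs_hour each day."""
    windowed = temps[obs_hour:]
    n_full = windowed.size // 24
    return windowed[: n_full * 24].reshape(n_full, 24).max(axis=1).mean()

print("midnight reset:", round(mean_daily_max(0), 2))    # true calendar-day maxima
print("5 pm reset    :", round(mean_daily_max(17), 2))   # warm afternoons counted twice -> runs high
```

A hot late afternoon just after the reset also sets the max for the following “day”, so the 5 pm series runs warm relative to true calendar-day maxima; change the observation time and the record shifts even though the climate did not.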
Nick writes “But quite often it would have been Monday, and it makes a difference, because of double counting warm afternoons.”
That’s fine if you truly know when the readings were taken in the past, but if it’s an assumption based on “policy” rather than actual metadata, then you’re on shaky ground.
Reading all this analysis of how to back out climate trends from past temperature data, I’m reminded of a well-known saying about candidates trying to win elections:
If you’re explaining, you’re losing.
It seems unfair that this saying could be relevant, because science is supposed to be all about dispassionate contemplation of what is and is not known, and what can and cannot be measured. But as the whole climate-science fiasco makes clear, as a group activity there is also a political element — getting your work accepted by others as valid and trustworthy — and that goes double when lots of money is involved. So if you find yourself having to “explain” over and over what you’ve done and “explain” over and over why it makes sense, it may be time to try another approach.
Willis and Bob: my poor old brain is too tired to get into the minutiae of How adjustments are to be calculated.
But my accounting/BI background leads me to something I’ve stated here more than once.
Given the existence of wonderful data-handling and recording software, with the ability to slice and dice petabytes of data, there should be a concerted attempt to open-source a transactional record of temperatures.
So that the original (however obtained, and that would be a dimension in the data: MMTS, Hg and eye-o-meter, etc), would always be TransactionType = original. Never altered. That temp, by that method, at that lat/long/alt.
But, layered over that in separate records for the lat/long/alt, would be the adjustments.
– F to C – the ‘oops’ factor
– The first harmonisation: by what process, resulting in what adjustment, at this lat/long/alt
– harmonisation 2 – and so on.
Then the data query engine, to render up the temp for a datetime at lat/long/alt, simply sums the temp values it finds.
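By way of illustration, here is a toy Python sketch of what such an append-only transactional store might look like (the schema, station name, adjustment types and values are all invented; a real system would live in a proper database):

```python
# A toy, append-only "ledger" of temperature transactions. The original reading is never
# altered; each adjustment is a separate, signed row, and a query simply sums whichever
# layers it is asked for.
from dataclasses import dataclass

@dataclass(frozen=True)          # frozen: a row can never be edited in place
class TempTransaction:
    station: str                 # stands in for the lat/long/alt key
    when: str
    kind: str                    # "original", "tobs_adjustment", "homogenisation_1", ...
    value_c: float               # the original value, or a signed adjustment, in deg C
    who: str
    why: str

ledger = [
    TempTransaction("hypothetical_stn", "1936-07-14 17:00", "original",         35.0, "observer",  "max marker reading"),
    TempTransaction("hypothetical_stn", "1936-07-14 17:00", "tobs_adjustment",  -0.3, "analyst_A", "afternoon reset double-count"),
    TempTransaction("hypothetical_stn", "1936-07-14 17:00", "homogenisation_1", +0.1, "analyst_B", "station move breakpoint"),
]

def value_of(station, when, layers=("original", "tobs_adjustment", "homogenisation_1")):
    """Render the temperature by summing the original plus the requested adjustment layers."""
    return sum(t.value_c for t in ledger
               if t.station == station and t.when == when and t.kind in layers)

print(value_of("hypothetical_stn", "1936-07-14 17:00"))                         # fully adjusted: 34.8
print(value_of("hypothetical_stn", "1936-07-14 17:00", layers=("original",)))   # raw, untouched: 35.0
```

Asking for different layers lets you compare adjustment schemes side by side, while the original rows are never touched.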
This is called the Audit Trail in accounting, and woe betide the personages who delete old parts of it, alter existing transactions or otherwise finagle existing records. However, introducing new transactions, even to correct years-old mistakes, is always acceptable, as long as one states, who did this, why, when etc: Transactional context.
If temp records were stored this way, we would not be chasing our tails trying to figure why history changed, why certain calculations seem to be applied Here and not There, and so on.
We’d simply sum through the Types of adjustments and compare them.
And all the time, those Originals, in all their human-error-prone glory, would be there for the researchers.
So, just how hard would That approach be?
One suggestion: cosy up to a Big Data vendor and suggest a pro-bono effort to put one together. It might just be easier than we all think.
evanmjones says: “Bad microsite does not create an artificial trend. It merely exaggerates a real, already-existing trend (warming or cooling). But from 2004 there is little if any trend. So there will be no trend to exaggerate from bad siting after 2004.”
I would call that a description of what your data shows, but not yet an explanation of what happened locally to the measurement.
Bill Illis writes “The breakpoints should be equally distributed between positive and negative adjustments of roughly the same magnitude.”
Why assume that? IMO there are many more ways to “unnaturally” warm something giving a warming bias than there are ways to cool something.
In my experience there is no statistical solution to this conundrum. There are only fudge factors.
=============
The problem is similar to name and address correction in mailing lists. You have a number of similar entries that may or may not be for the same person. How do you tell which entry is the “most correct”, so you can use that one while discarding the others?
Do you average all the entries for the same person? No, because that will create problems: the low-quality entries will swamp out the good ones, leaving you with poor quality.
Rather, what you need to do is rate the quality of the entries, not based on how similar they are to their neighbors, but rather on how good the source was for the entries. Then you throw out the entries that are from a poor source.
This would appear to be the crux of the solution for temperature. You cannot determine how good a reading is by comparing it with its neighbors, because you don’t know the quality of the neighbors; the comparison is nonsensical. You need to rate the quality of the station based on information about the station itself, and then either use the data if it is high quality, or throw it out if it is low quality.
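A toy Python version of that idea (the ratings and trends are invented): the keep-or-drop decision comes from metadata about the station itself, not from agreement with its neighbors:

```python
# Toy illustration: the quality gate uses a station-level rating (e.g. a siting class),
# not similarity to nearby stations.
stations = [
    {"id": "hypothetical_A", "siting_class": 1, "trend_per_decade": 0.10},
    {"id": "hypothetical_B", "siting_class": 4, "trend_per_decade": 0.25},  # poorly sited
    {"id": "hypothetical_C", "siting_class": 2, "trend_per_decade": 0.12},
]

usable = [s for s in stations if s["siting_class"] <= 2]   # keep only well-rated sources
print([s["id"] for s in usable])                           # the class-4 record is dropped, not "corrected"
```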
I don’t know, but it seems possible that a set of stations within any given region are going to be maintained by the same people.
It also seems likely that the stations are maintained on a regular schedule – perhaps each station is visited every year for a simple checkup and spider-web removal, repainted every 5 years, and given a major service every 10 years when local vegetation is cut down and removed.
If this is the case, then it is quite possible that when the major service takes place at any one station, the nearby stations are not due for this treatment for another year or two. In which case, the drop in the sawtooth will be assumed to be an artifact, since the other local stations didn’t see it. And of course next year, when another nearby station shows a drop due to a service, the previous station will not show it, nor will the others yet to be serviced, and so again this is seen to be an artifact and removed, either by the averaging method or by scalpelling. Either way, the temperature record has been “adjusted” to make it further from the truth.
richardverney: Per Willis
“…As a result, the raw data may not reflect the actual temperatures….”
//////////////////////////
Wrong; the raw data is the actual temperature at the location where the raw data is measured.
What you mean is that there are some factors at work which mean that the actual temperature measured (i.e., the raw data) should not be regarded as representative of temperatures, because it has been distorted (upwards or downwards) by some extrinsic factor (in which I include changes in the condition of the screen, instrumentation and TOBs, as well as more external factors such as changes in vegetation, nearby building, etc.).
I think richardverney puts the matter badly here, and Willis Eschenbach is closer to the truth. In measurement science it is common to distinguish between “accuracy” of measurement and “precision” of measurement, where accuracy refers exactly to the question of how close to the true value you may consider the measured value to be. When these two aspects of measurement are addressed by statisticians, they are called “bias” (the complement of “accuracy”: the expected disparity between the expected value of the observations and the true value) and “variance” (the complement of “precision”: the variation in repeated measures of the same quantity). Mean Squared Error is then the sum of the squared bias and the population variance of the estimators.
“Temperature” of a region is proportional to the mean kinetic energy of the molecules in that region; the measured temperature always differs from the true temperature by some amount.
Notice that richardverney uses “should not be regarded as representative of” whereas Willis Eschenbach uses “may not reflect actual”. I do not perceive a meaningful difference imparted by richardverney’s “correction”, but he confuses the issue by presenting his rewording as a correction.
Remember always: “accuracy” and “precision”, and their complementary concepts in statistical analysis, “bias” and “variance”. A measuring system with high bias will repeatedly and reliably get the wrong answer, whatever its precision, and the mean of a large number of independent observations will get the wrong answer even more reliably, because averaging reduces the variance but not the bias.
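A quick numerical check of that last point, with a made-up instrument bias and noise level (Python):

```python
# A biased but noisy "thermometer" (numbers invented): averaging many readings shrinks
# the variance but leaves the bias untouched, so the mean is still reliably wrong.
import numpy as np

rng = np.random.default_rng(1)
true_temp, bias, sigma = 20.0, 0.8, 1.5            # hypothetical +0.8 C bias, 1.5 C noise

readings = true_temp + bias + rng.normal(0.0, sigma, size=10_000)

mse = np.mean((readings - true_temp) ** 2)
print("error of a typical reading (RMS):", round(np.sqrt(mse), 2))                 # ~ sqrt(bias^2 + sigma^2)
print("error of the mean of 10,000     :", round(readings.mean() - true_temp, 2))  # ~ +0.8: the bias remains
print("MSE:", round(mse, 2), " vs  bias^2 + variance:", round(bias**2 + sigma**2, 2))
```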
Willis’ question can be restated (I hope, please forgive me if I misunderstand) as: “Does the scalpel process bias the estimates in such a way that the estimated trend is reliably too large?” The possibility that it might do so has to be addressed. That does not say it hasn’t been addressed. I have not (yet) read all the relevant literature on the temperature record.
I don’t understand, do all these corrections occur just because of discrepancies in the meta data? Why can’t these thermometer/weather stations be empirically and/or routinely recalibrated to an independent, objective method traced to some universal standard after maintenance/repairs are made, just like any other scientific instrument? Instead of finessing the data: WHEN IN DOUBT, THROW IT OUT!
So we know each break point is “most” correct vs. the data surrounding it. The question then is whether the reading increase from maintenance to maintenance is linear or asymptotic toward a flat line. I would expect the latter, but determining the specific behavior of temperature bias vs. time would probably require a multi-year experiment using multiple temperature stations with varying degrees of routine maintenance.
What if we then applied a linear trend from discontinuity point to discontinuity point, and calculated the slope from the start to the end of the interval? We cannot do the slope using a best fit, as that will not be correct; it must be the slope using only the start and end points. Now we find the slope from the start point of this period to the start point of the next period. We’re pretty sure the start of the next period is correct and (most likely) the same as the current period, so we adjust all points based on the difference of the slopes. This way no start points get shifted down and long-term trends are kept correct. The discontinuities are also gone, without going all willy-nilly on the data.
Am I the “lone voice” pointing out the absurdity of “average temperature”? It’s the RADIATION ENERGY BALANCE (notice I did not use the term “heat”, as it is too easily confounded with “temperature”) which matters. Thus the “HEAT CONTENT” or Enthalpy of the cubic volume of AIR is what really counts. ROUGHLY it can be measured with humidity and temperature. (A psychrometric chart is helpful here.)
NO ONE, never, EVER talks about this. Yet, it would be the PRIMA FACIA way of assessing the RESULT of the energy balance of the atmosphere. AM I LOOPY? What’s wrong with this? I think the “KING HAS NO NEW CLOTHES” with regard to EVERYONE …i.e., skeptics and Warmistas …because NO ONE recognizes the need to STUDY ENERGY BALANCE and NET ENERGY CONTENT of the atmosphere.
Reply to Wayne Findley ==> You are 1000% (sic) correct — what is missing in all of this temperature transmogrification is the existence of proper Audit Trails that carefully and in detail show exactly what has been done to the originally recorded number, when, by whom, and why — in every case that the number is touched by anyone. ANYTHING else is bordering on illegality in the financial world == fiddled books. So it should be with all scientific data — down to the smallest and least consequential experiment.
In fact, in real science, one must produce his original lab log — any post hoc changes in it can totally invalidate the work and findings — and inability to produce the lab log on demand has the same result.
I am not sure at all that the whole temperature record hasn’t been fiddle-faddled beyond any usefulness.
PeteJ says: June 30, 2014 at 9:44 am
“Instead of finessing the data: WHEN IN DOUBT, THROW IT OUT!”
I’ve written a post here which tries to illustrate the fallacy of that. When you are calculating the average for a period of time, or a region of space, that data point was part of the balance of representation of the sample. If you throw it out, you are effectively replacing it with a different estimate. You can’t avoid that. And your implied estimate could be a very bad one.
It’s not good advice.
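A small worked example of the point, with three invented station values:

```python
# Invented numbers: dropping a reading from a regional average is equivalent to
# keeping it but silently replacing it with the average of whatever is left.
temps = {"stn_A": 14.0, "stn_B": 15.0, "stn_C": 22.0}        # suppose stn_C looks "bad"

full_mean = sum(temps.values()) / len(temps)                 # 17.0
kept = {k: v for k, v in temps.items() if k != "stn_C"}
kept_mean = sum(kept.values()) / len(kept)                   # 14.5

implied = {**kept, "stn_C": kept_mean}                       # what throwing it out implicitly assumes
print(full_mean, kept_mean, sum(implied.values()) / len(implied))   # 17.0 14.5 14.5
```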
REPLY: Nick Stokes, defender of the indefensible, is arguing to preserve bad data. On one hand he (and others) argue that station dropout doesn’t matter, on the other he argues that we can’t throw out bad stations or bad data because it won’t give a good result.
Priceless.
This is exactly what is wrong with climate science and the people that practice it.
– Anthony
Wayne Findley says:
June 30, 2014 at 2:28 am
Thanks, Wayne. The Berkeley Earth folks have already done an excellent job of both preserving the original data and showing the adjustments that have been made. While their adjustment method (the “scalpel”) may have shortcomings and their overall view tends towards alarmism, they have been very transparent and professional in their data handling. Any of their individual station data pages shows the raw data, the adjustments, and the adjusted data.
In addition, Steve Mosher has put together an excellent package in the computer language “R” for accessing the data and using their methods, available from the normal CRAN repository. Discussion and details are available on Mosh’s blog.
So they have been completely up-front about their data and code, and have my congratulations on that part of the effort. Their documentation and the availability of both data and code puts them leagues in front of the other global temperature datasets such as GISS.
Regards,
w.
Max Hugoson says:
June 30, 2014 at 10:10 am
Given the number of capital letters in your diatribe, I’d leave out the question of your loopiness; folks might be encouraged to answer …
In any case, we’ve gone through this before, Max. People are well aware that temperature is not a complete measure of the enthalpy in the air. However, from my own investigations into the question, I have found that the inclusion of the latent heat (in order to calculate enthalpy) makes very little difference in the results.
So first off, yes, people do talk about this. Me, I’ve concluded that it’s not a significant factor.
So if you think it is a big factor, here’s what you should do. Get a good clean temperature and humidity record from one of the CRN (climate reference network) sites. Then calculate the temperature on the one hand, and the full enthalpy including water vapor on the other hand, and compare the two. I did it with some Canadian stations at some point, no idea where that data is now, but I found little difference.
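For anyone who wants to try it, here is a back-of-envelope Python sketch of the comparison, using standard psychrometric approximations (the temperature, humidity and pressure values are placeholders, not CRN data):

```python
# Approximate moist-air enthalpy per kg of dry air from temperature, relative humidity
# and pressure, for comparing a temperature-only record with an enthalpy record.
import numpy as np

def enthalpy_kj_per_kg(T_c, rh_pct, p_hpa=1013.25):
    """h = cp_dry*T + w*(L_v + cp_vapor*T), with w from the Magnus saturation formula."""
    es = 6.112 * np.exp(17.67 * T_c / (T_c + 243.5))     # saturation vapor pressure, hPa
    e = rh_pct / 100.0 * es                              # actual vapor pressure, hPa
    w = 0.622 * e / (p_hpa - e)                          # mixing ratio, kg water per kg dry air
    return 1.006 * T_c + w * (2501.0 + 1.86 * T_c)       # kJ per kg dry air

# Stand-in hourly values; substitute a real CRN temperature/humidity series here.
T = np.array([18.0, 22.0, 25.0, 21.0])       # deg C
RH = np.array([70.0, 55.0, 45.0, 60.0])      # percent

print("temperature:", np.round(T, 1))
print("enthalpy   :", np.round(enthalpy_kj_per_kg(T, RH), 1))
```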
Report back here with the results, and we’ll have another data point for the discussion. I don’t think it’s a big issue, but I’m always willing to learn.
Thanks,
w.
PS—I don’t want to be a spelling Nazi, but I hate to see a man make a mistake more than once, so I apologize in advance for this correction … it’s “prima facie”, and “facia” is a term I use as a builder, it’s a wooden piece covering the ends of the rafters.
Anthony replies:
June 29, 2014 at 4:00 pm
“REPLY: You don’t need a PhD to be able to do research and publish papers, I’ve done three now, working on #4, and as many people like to point out, including yourself, I don’t have a PhD and according to many, I am too stupid to be in the same ranks with you. Yet, I do research and publish anyway.
But, I know in the eyes of many in your position that career path makes me some sort of lowbrow victim of phrenology.”
——————–
Right you are, Anthony. A big majority of those having been awarded an MA/MS or PhD Degree, …. including their “brainwashed” underlings and admirers, …. all possess a “Rank before Frank” mentality.
They have been nurtured to “bow down” to any “Rank” that is greater than their own …. and to ignore, discredit or defame any and all “Franks” regardless of what they might want to contribute to a conversation.
The per se “purchasing” of a PhD Degree from a reputable college or university is akin to …. someone “purchasing” a BIG toolbox chock-full of all kinds of “specialized” tools from a local Sears, Lowe’s or Home Depot.
Thus, both parties have “proof of ownership” (Diploma-Degree vs. Sales Receipt) of their big box of “tools” ……… but said “proof of ownership” is neither proof nor factual evidence that said parties are actually capable of using the “tools” contained in their “toolbox”.
And when one of the aforesaid pulls “Rank before Frank” on you, …. you should immediately know what their debilitating “deficiency” problem is.
@Evan Jones at 9:51 pm
In short, bad siting exaggerates a trend, either warming or cooling. But if there is a flat trend, there will be no exaggeration.
I cannot accept that statement as truth.
There might be an element of truth in it IF AND ONLY IF the bad micrositing issues remain constant.
However, bad micrositing is prima facie evidence that care is not being taken with respect to the quality and consistency of the recording conditions. If you install an incinerator 5 feet from the Stevenson Screen, it is a bad micrositing issue — even if you don’t use it. But if you change the number of times you use it a month, the time of day you use it, or the quantity you incinerate, then a change in the observed trend will be partially a function of the changes in the incineration schedule. Micrositing issues can create and reverse a trend. UHI can turn a cooling into a warming.
This issue of the variability of microsite, UHI, and instrument drift is what invalidates BEST’s segmenting by scalpel and decoupling from absolute temperature. Using the slopes of the segments is valid if and only if the contamination of the station record is constant over the time span of the segment. Clearly, in the case of UHI, that constancy is false. Cutting the temperature record into shorter segments doesn’t change the contribution of UHI to the record.
There is a theoretical possibility that some discrete microsite events (like a parking lot paving or nearby building constructed) can be eliminated by the scalpel, but it is a fool’s errand. There are many micrositing changes (instrument drift, weathering, sensor aging, plant growth) that are gradual, and for those the instantaneous change at maintenance is the necessary recalibration information that should not be lost to the scalpel.
There is a theoretical possibility that some discrete microsite events (like a parking lot paving or nearby building constructed) can be eliminated by the scalpel,….
In the case of airports, there are commonly construction projects as terminals and tarmacs expand and runways are lengthened or added. If you don’t move the temperature sensor, you are changing the microsite conditions, and that arguably deserves a breakpoint. But if you move the sensor away from the construction to restore the micrositing classification back to Class 1, should you count it as a station move and institute a breakpoint?
I would argue, NO.
Breakpoints are not going to change the gradual build-up of UHI and activity at the airport. Moving the station has restored and recalibrated the temperature sensor, making it less dependent on nearby sources of contamination. While you can argue that an (unnecessary) breakpoint can be inserted here without introducing bias, I argue that it is a bias against long-term records, especially from well-maintained sites, which appear to be a rare commodity.
There is only one way to account for discontinuities: overlap. It is relatively easy for a station change: keep the first station open for a year (to account for annual cycles in precipitation, insolation and wind direction; longer would be better, but I am not that unrealistic) and use the overlap to homogenise.
I am struggling with maintenance. The only thing I can think of to resolve maintenance of the station or its environs, and so remove the saw-tooth patterns, is far harder and horribly expensive: 100% overlap. A second station should be placed immediately adjacent that receives the same treatment more than twice as frequently and out of sync. The data from the second station are used only to correct for artificial trends in the primary station, and will make that correction mid-cycle so as to show those trends.
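For the simpler station-change case, here is a minimal Python sketch of the overlap splice (the offset and the single year of monthly overlap are invented):

```python
# Invented numbers: one year of monthly overlap between the old and new sites is used
# to estimate the offset, which is then removed before splicing the records together.
import numpy as np

rng = np.random.default_rng(3)
months = np.arange(12)
true_offset = -0.6                                   # the new site happens to read 0.6 C cooler

old_overlap = 15.0 + 8.0 * np.sin(2 * np.pi * months / 12) + rng.normal(0.0, 0.3, 12)
new_overlap = old_overlap + true_offset + rng.normal(0.0, 0.3, 12)

est_offset = np.mean(new_overlap - old_overlap)      # estimated from the shared year
print("estimated offset:", round(est_offset, 2))     # close to -0.6

new_later = np.array([14.1, 14.3, 14.0])             # later readings from the new site
print("spliced onto the old scale:", np.round(new_later - est_offset, 2))
```

The offset estimated from the shared year is removed from the new record before splicing, so the combined series carries no artificial step at the changeover.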
There might be an element of truth in it IF AND ONLY IF the bad micrositing issues remain constant.
Yes. And that is why, for the purposes of our study, I removed stations that moved, and also stations that did not move but whose ratings were changed by encroaching heat sink. (We will retain a station with a localized move, but only if the rating is not changed.)
To be clear, I refer only to such stations as we retained.
However, bad micrositing is prima facie evidence that care is not being taken with respect to the quality and consistency of the recording conditions.
That I don’t think I agree with. Some of the oldest stations, with the finest station records and the most devoted staff, are out of compliance for siting (e.g., Blue Hill, MA, and Mohonk Lake, NY). And some of the nicest, most isolated Class 1\2s are battered up old CRS screens that look like they came out losers in a bar fight.
Remember, the regional directors who place the stations are not the actual curators who do the day to day and report the data. There is often a surprising disconnect, here. The curators — for the most part — love and care for their stations and are proud of them. I consider the curators to be victims of those who placed the stations badly.
No one loves his station more or keeps better records than “Old Griz” Huth up at Mohonk. But some yahoo placed his station in a tangled mess of vegetation within 6 meters of a structure (with an exposed, “working” chimney). Damn shame.
Micrositing issues can create and reverse a trend.
Yes, if the microsite condition itself changes. That could produce a step change in either direction, wreaking havoc with the trend. If, OTOH, it’s constant, it will tend to exaggerate either cooling trend or warming trend, but should not reverse either. Note also, however, that heat sink and waste heat are two different factors and do not have the same effect. Constant waste heat can actually reduce a trend by swamping the signal. But heat sink works via a different mechanism.
I can’t help it that I had to get a PhD to be allowed to do research. That is the way the system works.
Remember, Anthony, we are not talking America, but Western Europe, here. It’s a different professional ethic. Very territorial. (And a Class Thing.) These things are looser on this side of the Atlantic. Fewer demands to “see our papers”, as it were. More latitude for the self-made man.
Victor Venema says:
June 30, 2014 at 2:41 am
. . .
I would call that a description of what your data shows, but not yet an explanation of what happened locally to the measurement.
It is consistent with the original hypothesis: if the temperature trend is flat, bad microsite cannot affect that trend. Once a genuine trend occurs, that trend (warming or cooling) will be exaggerated.
There was a warming trend during the 1990s. That was exaggerated.
There was a cooling trend from 1998 – 2008. And that was exaggerated.
From 2002 (when CRN first began to be deployed) to date, however, there is a flat trend: Nothing “there” to exaggerate.
The trend match between CRN and USHCN (after 2001) is confirmation, not falsification.