Guest Post by Willis Eschenbach
In an insightful post at WUWT, Bob Dedekind talked about a problem with temperature adjustments. He pointed out that stations are maintained from time to time, by doing things like cutting back encroaching trees or repainting the Stevenson Screen. He noted that if we try to “homogenize” these stations, we get an erroneous result. This led me to reconsider the “scalpel method” used by the Berkeley Earth folks to correct discontinuities in the temperature record.
The underlying problem is that most temperature records have discontinuities. There are station moves, instrument changes, routine maintenance, and the like. As a result, the raw data may not reflect the actual temperatures.
There are a variety of ways to deal with that, which are grouped under the rubric of “homogenization”. A temperature dataset is said to be “homogenized” when all effects other than temperature effects have been removed from the data.
The method that I’ve recommended in the past is called the “scalpel method”. To see how it works, suppose there is a station move. The scalpel method cuts the data at the time of the move, and simply considers it as two station records, one at the original location, and one at the new location. What’s not to like? Well, here’s what I posted over at that thread. The Berkeley Earth dataset is homogenized by the scalpel method, and both Zeke Hausfather and Steven Mosher have assisted the Berkeley folks in their work. Both of them had commented on Bob’s post, so I asked them the following.
Mosh and/or Zeke, Stephen Rasey above and Bob Dedekind in the head post raise several points that I hadn’t considered. Let me summarize them, they can correct me if I’m wrong.
• In any sawtooth-shaped temperature record subject to periodic or episodic maintenance or change, e.g. painting a Stevenson screen, the most accurate measurements are those immediately following the change. After that, there is a gradual drift in the temperature until the next maintenance.
• Since the Berkeley Earth “scalpel” method would slice these into separate records at the time of the discontinuities caused by the maintenance, it throws away the trend correction information obtained at the time when the episodic maintenance removes the instrumental drift from the record.
• As a result, the scalpel method “bakes in” the gradual drift that occurs in between the corrections.
Now this makes perfect sense to me. You can see what would happen with a thought experiment. If we have a bunch of trendless sawtooth waves of varying frequencies, and we chop them at their respective discontinuities, average their first differences, and cumulatively sum the averages, we will get a strong positive trend despite the fact that there is absolutely no trend in the sawtooth waves themselves.
So I’d like to know if and how the “scalpel” method avoids this problem … because I sure can’t think of a way to avoid it.
In your reply, please consider that I have long thought and written that the scalpel method was the best of a bad lot of methods, all methods have problems but I thought the scalpel method avoided most of them … so don’t thump me on the head, I’m only the messenger here.
w.
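For anyone who wants to try that sawtooth thought experiment at home, here is a minimal sketch in Python. The drift rates and maintenance intervals are invented, and this is my own toy construction, not Berkeley Earth’s code; it simply builds a set of trendless sawtooth records, cuts each one at its resets, averages the surviving first differences, and accumulates them:

```python
import numpy as np

rng = np.random.default_rng(1)
n_months, n_stations = 600, 50        # 50 years of monthly data, 50 stations

diffs = np.empty((n_stations, n_months - 1))
for i in range(n_stations):
    period = rng.integers(24, 121)                    # maintenance every 2 to 10 years
    saw = 0.02 * (np.arange(n_months) % period)       # trendless sawtooth: slow drift, abrupt reset
    d = np.diff(saw)
    d[d < 0] = np.nan                                 # the "scalpel": cut the record at each reset
    diffs[i] = d

composite = np.nancumsum(np.nanmean(diffs, axis=0))   # average the first differences, then sum them
print(f"Apparent trend after {n_months // 12} years: {composite[-1]:.1f} units "
      f"(the true trend of every station is zero)")
```

Every station in that toy is dead flat over the long run, yet the spliced-and-summed composite climbs steadily, because the only first differences that survive the cuts are the upward ones.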
Unfortunately, it seems that they’d stopped reading the post by that point, as I got no answer. So I’m here to ask it again …
My best to both Zeke and Mosh, whom I have no intention of putting on the spot. It’s just that, as a long-time advocate of the scalpel method myself, I’d like to know the answer before I continue to support it.
Regards to all,
w.
In the case of airports, there are commonly construction projects as terminals and tarmacs expand and runways are lengthened and added.
Usually (though not always) those changes occur more than 100 m from a station. That moves us from the microsite to the mesosite level of consideration.
Airports are an interesting case. I once thought the well-sited ASOS units definitely ran hot-to-trend. But then Google Earth got better focused, and a number of those warm-running Class 2s turned out to be Class 3 or even 4. We also lost a large slice when I purged the moved stations. (Yes, I am both a “station dropper”, and “data adjuster”, good lord have pity on me. Just don’t ask me to homogenize. Some sins transcend the venal.)
The current data now shows Airport Class 1/2s running cooler than any other subset, but it is a small and statistically volatile subset from which no definitive conclusions may be drawn. So I can no longer conclude with confidence that airports are an inherently bad mesosite. (“When the facts change, I change my mind,” shades of Keynes.)
@Evan Jones at 5:57 pm
(Yes, I am both a “station dropper”, and “data adjuster”, good lord have pity on me. Just don’t ask me to homogenize. Some sins transcend the venal.)
LOL.
Well, I’m a petroleum geophysicist, among other things. When it comes to dropping stations, nulling data for multiples, adjusting time series (in the time domain!) for normal moveout and near-surface static corrections, and converting to depth, seismic data processors have no peer. Fortunately, we have fold to rely upon, which is a form of homogenization: making each source-receiver pair at the same depth point look like the others and very similar to its neighbors.
Guilty! Geophysicists commit data sin — discretely, and in bulk.
But we leave the field tapes alone and we document the processing steps.
@Evan Jones at 5:57 pm
I think I agree with you about airport microsite issues. Above I tried to make the case that BEST breakpoints at airports seem unlikely to be real.
I know DENVER STAPLETON AIRPORT. Its 5 moves and 5 additional breakpoints just don’t make sense for any airport, much less that one. Yet the opening of the airport in 1929, its expansion in 1944, and its closing in 1995 are not breakpoints in the record.
All I was saying is that a weather station at an airport, even if it moves away from airport construction, does not deserve breakpoints if it maintains its distance from the terminals and tarmacs. Not all airports can say that: at LAX or SEATAC, yes, siting location could change the temperature somewhat. But for most airports, within the limits of UHI, one Class 1 spot at an airport ought to be indistinguishable from another Class 1 spot.
BEST is just breakpoint happy. It allows regional grids from Class 4&5 to dictate breaks and adjustments at Class 1 stations. Long, unbroken records have the most value of any.
BEST is just breakpoint happy. It allows regional grids from Class 4&5 to dictate breaks and adjustments at Class 1 stations.
Does it, by god? (I think I may take matters into my own hands and see what is happening to my Class 1/2s, if I can figure out the BEST interface.)
Nick writes “You can’t avoid that. And your implied estimate could be a very bad one.”
Sometimes, if you know you have bad data, the actual answer is not knowable. No matter what “adjustments” you might make.
if you homogenize milk and manure, the end product will still taste like shzt.
you cannot eliminate the manure by comparing one pail of milk with another. what if the neighboring pail of milk is also contaminated? the only way to eliminate the manure is to compare each pail against a known standard.
thus Anthony’s approach of eliminating poorly sited stations is correct, while the various nearest neighbor comparison methods are flat out wrong.
because temperature data is numeric, there is a false belief that data quality can be improved via numerical methods. if your method can improve the quality of data, it should work with non-numerical data as well.
however, once you approach the problem in this fashion you will realize that you can only improve data quality if you have an independent measure of which rows in the dataset are poor quality and which rows are high quality. which means you need to score the quality of the source. once that is established, data quality is enhanced by removing or treating the rows from the low-quality source.
simply comparing a row with its neighbor does not establish data quality, because the quality of the neighbor is unknown.
if your neighbors all tell you the same tale, does that make it true? no, because you don’t know the quality of their source. only after you find out whether they had a high-quality source can you judge if the story is accurate or not.
Attempting to calculate a true and accurate “adjustment” to compensate for the several different randomly generated variables that directly or indirectly affect the numerical values in a set of recorded temperatures is an act of futility.
To the half-dozen posts above: Yes, if you are going to make an attempt at a “true signal”, one must, perforce, confine oneself to the subset of stations capable of providing such. That is what we do for our paper.
BEST does not appear to concede that it just sometimes gets colder or warmer in any given neck of the woods. Instead they kill every breakpoint. But natural factors can also produce breakpoints. I say that one must kill a breakpoint only if there is a specific reason to do so. (And I do not adjust such stations. I drop them.)
I prefer to drop a station only if it has moved or its TOBS is “reversed”, even if there is no breakpoint. USHCN oversamples, so after the dust clears we still have 400 stations whose conditions are reported by NCDC to be essentially unchanged, 80 of which are Class 1/2; those 80 still provide adequate distribution and produce what I call the “true signal”.
evanmjones:
You conclude your post at July 1, 2014 at 9:00 am saying
Where was it that I read about this “true signal” before?
Oh, yes! I remember! It was this.
Clearly, when confronted with your “true signal” I am part of “the darkness”.
Richard
Nick Stokes: I’ve written a post here which tries to illustrate the fallacy of that. When you are calculating the average for a period of time, or a region of space, that data point was part of the balance of representation of the sample. If you throw it out, you are effectively replacing it with a different estimate. You can’t avoid that. And your implied estimate could be a very bad one.
It’s not good advice.
REPLY: Nick Stokes, defender of the indefensible, is arguing to preserve bad data. On one hand he (and others) argue that station dropout doesn’t matter, on the other he argues that we can’t throw out bad stations or bad data because it won’t give a good result.
If you knew for sure what data were “bad” and what data were “good”, your reply might make sense, but note Nick Stokes’ point that throwing out a “bad” data point is equivalent to imputing a particular value to it (“imputing” is the word most statisticians prefer to “estimating” for missing data), and that might not be the best imputation possible. In almost all cases, including the temperature data sets, all the data are “imperfect to some degree”, and the classification into “bad” vs “good” is an arbitrary simplification. With lots of imperfect but few “bad” data points, using the extant data to impute a value to the “bad” data probably is better than the particular imputation method of dropping the “bad” data.
Nick Stokes’s defense is reasonable, and the whole topic of methods of imputation is much addressed in the statistical literature. How good a particular method is in a particular case often can’t be determined with great confidence from the extant data, but dropping “identified BAD” data is almost always among the least defensible alternatives.
Another venue where this issue arises is in measuring small concentrations (of drugs, metabolites, toxic pollutants, etc) where a large number of values are positive but “below the limit of detection”. Throwing them out can be worse than using them. Two statisticians who have addressed this problem for particular cases are Diane Lambert, PhD (then of AT&T Bell Labs, more recently at Google) and Emery Brown, MD, PhD (then at Mass Gen or Macleans; now at Harvard Medical School.) If anyone is interested, I can get the full references, but the general topic is “missing values and data imputation”.
This is a terrific thread. I would like to thank Nick Stokes for hanging around and presenting a spirited defense of a defensible approach to missing weather values, and Willis for initiating the thread.
My thanks to Nick Stokes and Matt Marler for bringing up an interesting question—when is throwing out bad data worse than keeping it? Nick says:
Suppose our method of analysis, as in Nick’s example, is an average of the data, to find e.g. an average temperature over some period. Nick says that throwing out a given piece of data in this situation is equivalent to replacing it with the average of the remaining data. While this is demonstrably true, there is one way in which they are not equivalent, and it is an important way.
This is that throwing out the bad data increases the uncertainty of the result by reducing N, while replacing the bad data with the average of the remaining data decreases the uncertainty of the result. As a result, Nick’s claim that the two are equivalent is simply not true.
w.
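To put rough numbers on that distinction, here is a small Python illustration (entirely made-up values, chosen only to show the direction of the effect): dropping flagged points shrinks N and widens the standard error, while infilling them with the mean of the remaining data keeps N and reports a narrower standard error, even though no new information has entered the calculation.

```python
import numpy as np

rng = np.random.default_rng(7)
temps = rng.normal(15.0, 5.0, size=100)     # 100 station readings (deg C)
n_bad = 30                                  # suppose 30 of them get flagged as bad

kept = temps[n_bad:]                                    # option 1: drop them (N = 70)
filled = np.concatenate((np.full(n_bad, kept.mean()),   # option 2: infill each bad value with the
                         kept))                         #           mean of the remaining data (N = 100)

def naive_se(x):
    # Standard error of the mean, blind to the fact that some values may be imputed
    return x.std(ddof=1) / np.sqrt(len(x))

print(f"Drop the flagged points:   mean {kept.mean():.2f}, naive SE {naive_se(kept):.3f}")
print(f"Infill the flagged points: mean {filled.mean():.2f}, naive SE {naive_se(filled):.3f}")
```

The two means agree exactly; only the claimed precision differs, which is the false confidence at issue. Honest accounting subtracts the imputed values from N.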
Willis Eschenbach: This is that throwing out the bad data increases the uncertainty of the result by reducing N, while replacing the bad data with the average of the remaining data decreases the uncertainty of the result. As a result, Nick’s claim that the two are equivalent is simply not true.
That is a pertinent point. When data are imputed, the number of imputed values has to be subtracted from N in order to avoid the misleading appearance of greater precision. Nick Stokes will speak for himself, but I bet that he knows that.
Now back to “when is imputing better than simply dropping?” Consider for now the estimate of the mean high temperature in the US on July 1. Doing the work, you find that the temperature for Lubbock TX is missing. Simply dropping it is equivalent to replacing it with US mean high temp from all of the other data. A different method of imputation is to replace it with the mean high temp of a region around Lubbock; then use that in calculating the US mean. Which of these imputations yields an estimate closer to the real US mean, given that neither imputed value is exact? Almost for sure, the imputation based on local stations is better than the overall mean. If there is enough other reliable information, the estimate calculated as the Bayesian posterior mean of a well-chosen locale (or based on values highly correlated in general, as with Kriging) can’t be beaten. But you probably can’t know for sure: the proofs depend on assumptions, and the assumptions are almost never exact representations of what you are working with.
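That ranking can be checked with a toy simulation (invented numbers, not real station data): give the missing “Lubbock” station a regional offset that its neighbours share, then compare the squared error of the national mean when the gap is filled with the national mean (i.e. the station is simply dropped) against filling it with the local mean.

```python
import numpy as np

rng = np.random.default_rng(3)
n_trials, n_stations, n_neighbours = 20_000, 20, 8
sq_err_drop, sq_err_local = [], []

for _ in range(n_trials):
    offsets = rng.normal(0.0, 3.0, n_stations)            # each station's true offset from the US mean
    truth = 30.0 + offsets                                 # true daily highs, deg C
    obs = truth + rng.normal(0.0, 1.0, n_stations)         # observed highs, with measurement noise
    true_us_mean = truth.mean()

    # Station 0 ("Lubbock") is missing; its neighbours share its regional offset.
    neighbours = truth[0] + rng.normal(0.0, 1.0, n_neighbours)

    drop_est = obs[1:].mean()                                  # dropping it = imputing the mean of the rest
    local_est = np.append(obs[1:], neighbours.mean()).mean()   # imputing the local (neighbourhood) mean

    sq_err_drop.append((drop_est - true_us_mean) ** 2)
    sq_err_local.append((local_est - true_us_mean) ** 2)

print(f"MSE of the national mean, station dropped:          {np.mean(sq_err_drop):.4f}")
print(f"MSE of the national mean, infilled with local mean: {np.mean(sq_err_local):.4f}")
```

In this toy setup the local-mean infill wins, because the neighbours carry real information about the missing station that the rest of the country does not; how much it wins by depends on exactly the between-location and measurement variances listed above.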
This is that throwing out the bad data increases the uncertainty of the result by reducing N, while replacing the bad data with the average of the remaining data decreases the uncertainty of the result. As a result, Nick’s claim that the two are equivalent is simply not true.
How well I know.
And that’s yet another crime of homogenization — it gives an entirely false impression of precision. See my error bar. See how nice and small it is. Cocktails all ’round. Well of course it is! You have smoothed away all your outliers, haven’t you? You have reeducated them to conform with the majority of Good Citizens.
The problem arises when the Good Citizens actually turn out to be Bad Citizens.
Meanwhile, the true signal has vanished. What remains is meaningless pap.
Thanks, Matthew. I find myself uneasy about the logic regarding infilling “dead” stations. Suppose we’re calculating the average temperature of the US. As you point out, mathematically, infilling is the same as replacing the value for the station with some flavor of local average, and leaving it out is the same as infilling it with the national average. Your claim is that using the local average is better than leaving it out.
I see a couple of issues with this.
Let’s suppose we have an area where there are very few stations. So … we decide to use virtual stations. We pick some points, figure out what the local average is for those points, and we include them in the calculation … does this seem like a defensible procedure?
Because that procedure is exactly equivalent to infilling a dead station.
The problem is exacerbated by the common procedure of gridcell averaging. If we average all of the new virtual stations plus all of the real stations in a certain gridcell, the gridcell average will NOT be the same as it would be without the virtual stations. This is because the “local” averaging is often based on stations within a certain radius, and not stations within the gridcell.
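Here is a deliberately crude one-dimensional illustration of that radius-versus-gridcell mismatch (my own toy numbers, not any real gridding scheme):

```python
import numpy as np

# Toy 1-D layout: station positions in km; the gridcell covers 0-100 km.
positions = np.array([10.0, 60.0, 120.0])   # two real stations in the cell, one just outside it
temps     = np.array([14.0, 15.0, 22.0])    # the outside station is much warmer

in_cell = positions < 100.0
plain_mean = temps[in_cell].mean()          # gridcell average from the real stations: 14.5

# "Virtual" station at 90 km, infilled from all stations within a 50 km radius.
# That radius reaches the warm 120 km station, which is NOT in the gridcell.
virt_pos, radius = 90.0, 50.0
near = np.abs(positions - virt_pos) <= radius
virt_temp = temps[near].mean()              # (15.0 + 22.0) / 2 = 18.5

mean_with_virtual = np.append(temps[in_cell], virt_temp).mean()

print(f"Gridcell mean, real stations only:       {plain_mean:.2f}")
print(f"Gridcell mean, with the virtual station: {mean_with_virtual:.2f}")
```

The radius-based infill has quietly imported warmth from outside the cell into the cell average.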
Now, I agree that if we’re using a calculation of a smoothly varying “temperature field” rather than gridcell averaging, the inclusion of any number of “virtual stations” whose values are given by the local temperature field will not change that field.
However, I don’t think that actually solves the problem …
Puzzling …
w.
Willis Eschenbach: Your claim is that using the local average is better than leaving it out.
It’s a ranking: the mse of the overall estimated mean is smaller when the local average is used in place of the overall average: (1) depending on how different the true local average is from the national average (between-location variance); (2) depending on the precisions of the individual temperature recordings (at location measurement variance); (3) depending on how well the overall distribution (from place to place) can be approximated by a functional form (Gaussian etc, with estimated parameters.)
The Bayesian estimation procedure does not actually “solve” a problem, in any intuitive sense of “solve”; it uses all distribution information to reduce the mse of the estimate. It’s explained in Samaniego’s book “A comparison of frequentist and Bayesian methods of estimation”, and most other introductions to Bayesian estimation.
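For readers who want the mechanics, the simplest textbook case is the normal-normal model, where the posterior mean is a precision-weighted blend of the local reading and the between-location mean. A bare-bones sketch (the standard conjugate formula, nothing specific to any temperature dataset):

```python
def posterior_mean(obs, obs_var, prior_mean, prior_var):
    """Normal-normal posterior mean: a precision-weighted average of the local
    reading and the between-location (e.g. national) mean."""
    w = (1.0 / obs_var) / (1.0 / obs_var + 1.0 / prior_var)
    return w * obs + (1.0 - w) * prior_mean

# A noisy local reading of 20.0 (measurement variance 4) is shrunk toward a
# national mean of 15.0 (between-location variance 9), by an amount the two
# variances determine.
print(posterior_mean(20.0, 4.0, 15.0, 9.0))   # about 18.5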
Matthew R Marler says:
July 1, 2014 at 9:58 pm
Yes, I understand all of that. I’m just trying to understand the further implications of those things.
For example … IF we can get a lower MSE by infilling data, then we could run the MSE down to zero by using the “virtual stations” approach I outlined above …
Comments?
w.
Willis writes “Doing the work, you find that the temperature for Lubbock TX is missing”
Now suppose Lubbock TX was accidentally replaced with another station’s dataset. This data can’t be right, as the region average is 15C and this dataset has an average of 20C. What to do?
If you try to use the bad data then you will artificially increase the region’s temperature.
So in this case it’s truly bad data and must be discarded. But how do you know when to do that? I mean, another station whose data averaged 17C would be even harder to pick.
For example … IF we can get a lower MSE by infilling data, then we could run the MSE down to zero by using the “virtual stations” approach I outlined above
Let us not forget, whenever you change, alter, or replace a data point, you must attach an uncertainty band to each estimate. Compare each station’s measurement with the kriged trend computed without that station. At minimum, an infill must add in at least that mean error.
At least that much error, because you should consider the mean error for the cases where you might actually want to infill; who would bother infilling a station that reads close to the kriged trend? In addition, I would want an estimate of how much the kriged trend would change for a random omission of 20% of the control points.
So, if people are honest about the errors and uncertainty added to the dataset as infills are performed, then it is not possible to drive the mse to arbitrarily low values; rather, the mse will soon increase the more you tamper with the data.
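One way to put a floor under that added uncertainty is leave-one-out cross-validation of whatever interpolator is doing the infilling: predict each station from all the others and look at the misses. A minimal sketch, using inverse-distance weighting as a crude stand-in for kriging and invented station data:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 40
pos = rng.uniform(0.0, 1000.0, size=(n, 2))                  # station coordinates, km
temps = 15.0 + 0.004 * pos[:, 0] + rng.normal(0.0, 0.5, n)   # gentle east-west gradient plus noise

def idw(target, pts, vals, power=2.0):
    # Inverse-distance-weighted interpolation: a crude stand-in for a kriged field
    d = np.linalg.norm(pts - target, axis=1)
    w = 1.0 / np.maximum(d, 1e-9) ** power
    return np.sum(w * vals) / np.sum(w)

# Leave-one-out: predict each station from all the others and record the miss.
errors = np.array([
    temps[i] - idw(pos[i], np.delete(pos, i, axis=0), np.delete(temps, i))
    for i in range(n)
])

print(f"Leave-one-out RMS interpolation error: {np.sqrt(np.mean(errors ** 2)):.2f} deg")
```

Any value infilled by that interpolator should carry at least that much added uncertainty, and arguably more, since infilling is mostly wanted exactly where the network is thin; the 20%-omission test mentioned above would probe that second question.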
Willis Eschenbach: For example … IF we can get a lower MSE by infilling data, then we could run the MSE down to zero by using the “virtual stations” approach I outlined above …
Comments?
You can not reduce the mse to 0. I do not understand why you think that you can.
Matthew R Marler says:
July 2, 2014 at 8:37 am
Thanks, Matthew, I don’t think you can. I’m just following out your line of thought, viz:
What I said is that IF we can get a lower MSE by infilling as you claim, then we could infill everywhere by using virtual stations and get an arbitrarily small MSE …
I wasn’t making a claim … I was using the technique called “reductio ad absurdum” on your claim.
Comments?
w.
evanmjones says:
July 1, 2014 at 9:00 am
“To the half-dozen posts above: Yes, if you are going to make an attempt at a “true signal”, one must, perforce, confine oneself to the subset of stations capable of providing such. That is what we do for our paper.”
——————
In my opinion, every Surface Station out there generates its own “true signal”. A signal that is “true” only for that specific Surface Station itself.
But the big question is, ….. is each individual “true signal” also an accurate and/or correct signal that has not degraded (increased/decreased) “over time” due to physical changes within its local environment?
Said “true signal” that is generated by every Surface Station is also subject to daily (24 hour) increases and decreases that are directly related to other randomly occurring environmental factors such as, to wit:
1. length of daytime/nighttime.
2. the amount of direct solar irradiance each day.
3. the “daily” angle of incidence of solar irradiance to the surface & objects residing on surface.
4. the “seasonal” angle of incidence of solar irradiance to the surface & residing objects.
5. near surface air movement (winds & thermals).
6. the direction of flow of near surface air movement in respect to Surface Station location.
7. the temperature of the “inflowing” near surface air mass
8. the amount of H2O vapor (humidity) in the near surface air.
9. the amount of H2O vapor (clouds, fogs, mists) in the near surface air.
10. the time of day, amount of and temperature of the precipitation (H2O) that alights on surface.
11. the amount of thermal (heat) energy that is retained by and slowly emitted from and/or conducted to the near surface atmosphere relative to the mass density of the object that absorbed said thermal energy. Eg: Heat Islands, large volumes of water, etc.
Given the above, how can one possibly “filter out” an “over time” degraded signal from one (1) or many of said daily “true signals” …… when no two (2) daily “true signals” are the result of exactly the same environmental factors? None, …. zero, zilch, nada.
Is Climate Science a new game of Atmospheric Horseshoes ….. where “close” counts …. and the closest “distance” is determined by the highest Degreed player that is doing the “measuring”?
I would like to inquire in a little more detail: what control is used for the kriging of the regional field?
What do you use for the very first kriging?
Since kriging a regional field is necessary to determine outliers and breakpoints, it follows that the first kriged field has no adjustments in its control. There may be breakpoints for gaps in the records.
Someway and somehow you identify a station that needs an empirical breakpoint because it diverges from the regional trend by a key threshold. You insert the breakpoint. One semi-long record becomes two semi-short records.
THEN WHAT? Does that altered station go back into the pool of kriging control points?
What are your options?
A) Remove the station from the pool available for kriging? There soon would be no stations left to give a regional trend.
B) Replace the original station with the one with an extra breakpoint? (After all, breakpoints only “improve the data,” right? /sarc.) Before long the kriged field is totally dominated by stations that have been subjected to prior kriging. We now have a perpetual motion machine, endlessly kriging regional trends to test for new breakpoints at stations already ground into pulp.
Willis Eschenbach: What I said is that IF we can get a lower MSE by infilling as you claim, then we could infill everywhere by using virtual stations and get an arbitrarily small MSE …
The minimum achievable mse is obtained by a Bayesian method, as presented in the book by Samaniego that I cited. How you get from “IF we can get a lower MSE by infilling as you claim” (when my claim referenced Bayesian methods) to “… arbitrarily small MSE” is a mystery to me.