Guest Post by Willis Eschenbach
In an insightful post at WUWT, Bob Dedekind talked about a problem with temperature adjustments. He pointed out that stations are maintained by doing things like periodically cutting back encroaching trees or repainting the Stevenson screen. He noted that if we try to “homogenize” these stations, we get an erroneous result. This led me to consider the “scalpel method” used by the Berkeley Earth folks to correct discontinuities in the temperature record.
The underlying problem is that most temperature records have discontinuities. There are station moves, instrument changes, routine maintenance, and the like. As a result, the raw data may not reflect the actual temperatures.
There are a variety of ways to deal with that, which are grouped under the rubric of “homogenization”. A temperature dataset is said to be “homogenized” when all effects other than temperature effects have been removed from the data.
The method that I’ve recommended in the past is called the “scalpel method”. To see how it works, suppose there is a station move. The scalpel method cuts the data at the time of the move, and simply considers it as two station records, one at the original location, and one at the new location. What’s not to like? Well, here’s what I posted over at that thread. The Berkeley Earth dataset is homogenized by the scalpel method, and both Zeke Hausfather and Steven Mosher have assisted the Berkeley folks in their work. Both of them had commented on Bob’s post, so I asked them the following.
Mosh and/or Zeke, Stephen Rasey above and Bob Dedekind in the head post raise several points that I hadn’t considered. Let me summarize them, they can correct me if I’m wrong.
• In any kind of sawtooth-shaped temperature record subject to periodic or episodic maintenance or change, e.g. painting a Stevenson screen, the most accurate measurements are those immediately following the change. After that, there is a gradual drift in the temperature until the next maintenance.
• Since the Berkeley Earth “scalpel” method would slice these into separate records at the time of the discontinuities caused by the maintenance, it throws away the trend correction information obtained at the time when the episodic maintenance removes the instrumental drift from the record.
• As a result, the scalpel method “bakes in” the gradual drift that occurs in between the corrections.
Now this makes perfect sense to me. You can see what would happen with a thought experiment. If we have a bunch of trendless sawtooth waves of varying frequencies, and we chop them at their respective discontinuities, average their first differences, and cumulatively sum the averages, we will get a strong positive trend despite the fact that there is absolutely no trend in the sawtooth waves themselves.
So I’d like to know if and how the “scalpel” method avoids this problem … because I sure can’t think of a way to avoid it.
In your reply, please consider that I have long thought and written that the scalpel method was the best of a bad lot of methods, all methods have problems but I thought the scalpel method avoided most of them … so don’t thump me on the head, I’m only the messenger here.
w.
Unfortunately, it seems that they’d stopped reading the post by that point, as I got no answer. So I’m here to ask it again …
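To make the sawtooth concern concrete, here is a minimal R sketch of that thought experiment. The drift rate, maintenance periods, and crude threshold cut are all made up for illustration; this is not the Berkeley Earth code, just a toy version of the “cut, difference, average, re-integrate” idea.

```r
# Toy version of the sawtooth thought experiment: trendless station records
# that drift slowly upward and are reset by maintenance, cut at the resets,
# differenced within segments, averaged, and re-integrated.
set.seed(42)
n_months  <- 600                       # 50 years of monthly data
n_station <- 20

make_sawtooth <- function(n, period) {
  drift <- 0.01                        # deg/month of spurious drift between resets
  phase <- sample(period, 1)
  t <- (0:(n - 1) + phase) %% period   # months since last maintenance
  t * drift + rnorm(n, sd = 0.05)      # no long-term trend in the level
}

periods  <- sample(60:120, n_station, replace = TRUE)  # maintenance every 5-10 yr
stations <- sapply(periods, function(p) make_sawtooth(n_months, p))

# "Scalpel" stand-in: difference each record and drop the differences that
# span a reset (the big negative jumps), as if each segment were a new station.
diffs <- apply(stations, 2, diff)
diffs[diffs < -0.3] <- NA

rebuilt <- cumsum(rowMeans(diffs, na.rm = TRUE))       # average and re-integrate

month <- seq_along(rebuilt)
coef(lm(rebuilt ~ month))[2] * 120                         # ~1.2 deg/decade, spurious
coef(lm(rowMeans(stations) ~ seq_len(n_months)))[2] * 120  # close to zero
```

Chopping at the resets keeps the slow drift and throws away the corrective drops, so the rebuilt composite trends upward even though every input record is trendless.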
My best to both Zeke and Mosh, who I have no intention of putting on the spot. It’s just that, as a long-time advocate of the scalpel method myself, I’d like to know the answer before I continue to support it.
Regards to all,
w.
What defines the radius of control for the regional kriging around any given station under study for breakpoints and segment trend testing?
How far might the most distant control point be?
What is the minimum number of stations that the kriging must include?
Is there a maximum number?
Is there any weighting or inclusion in the control pool based upon the length of the segment(s)?
I think there is a distinct tendency for breakpoints to appear more frequently when there are more stations in the neighborhood in which to generate a regional kriged field of trends. This is at once logical and a red flag.
It is logical in that it is difficult to determine an empirical breakpoint in a station record if it is the only one for one hundred miles. Early airport stations, such as DENVER STAPLETON AIRPORT, fit this bill. But later in a station’s history, there are many neighbors that can quibble with its trend and suggest breakpoints, deserved or not.
It is a red flag in that the density of breakpoints, if deserved by station microsite issues, TOBS, and instrument changes, is unlikely to be a function of the number of other stations within a day’s drive. Yet that appears to be what is happening.
The empirical breakpoints of a station are at least a partial function of the station’s neighbors and appear not to be intrinsic to the station’s own behavior.
Willis Eschenbach: Your claim is that using the local average is better than leaving it out.
Matthew R Marler: It’s a ranking: the mse of the overall estimated mean is smaller when the local average is used in place of the overall average, (1) depending on how different the true local average is from the national average (between-location variance); (2) depending on the precisions of the individual temperature recordings (at-location measurement variance); (3) depending on how well the overall distribution (from place to place) can be approximated by a functional form (Gaussian, etc., with estimated parameters).
The Bayesian estimation procedure does not actually “solve” a problem, in any intuitive sense of “solve”; it uses all distribution information to reduce the mse of the estimate. It’s explained in Samaniego’s book “A comparison of frequentist and Bayesian methods of estimation”, and most other introductions to Bayesian estimation.
This began with a comparison of using the overall mean (equivalent except for degrees of freedom to omitting the observation) to the local mean. Unless all of the local means are the same, you get a lower mean square error in estimating the local means and the overall mean if you impute the local mean for the missing value.
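A toy R simulation of that ranking (my own construction, not from Samaniego’s book; the means, spread, and sample sizes are invented): two locations with different true means, one missing value at location B, imputed with either B’s local mean or the grand mean.

```r
# When the true local means differ, imputing the local mean for a missing
# value gives a lower mean squared error for the local mean than imputing
# the grand (overall) mean, which drags the estimate toward the other site.
set.seed(1)
reps   <- 10000
true_A <- 10; true_B <- 20; sigma <- 2
n      <- 10                            # observations per location; one of B's is missing

err_local <- err_grand <- numeric(reps)
for (i in seq_len(reps)) {
  A     <- rnorm(n, true_A, sigma)
  B_obs <- rnorm(n - 1, true_B, sigma)  # B with its first value missing

  fill_local <- mean(B_obs)             # impute B's own (local) mean
  fill_grand <- mean(c(A, B_obs))       # impute the overall mean

  err_local[i] <- (mean(c(B_obs, fill_local)) - true_B)^2
  err_grand[i] <- (mean(c(B_obs, fill_grand)) - true_B)^2
}

mean(err_local)   # ~0.44, about sigma^2 / (n - 1)
mean(err_grand)   # ~0.68 here; the gap widens as the true local means separate
```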
EDIT of 10:24am:
It is a red flag in that the density of breakpoints, if deserved by station microsite issues, TOBS, and instrument changes, should be independent of the number of other stations within a day’s drive. Yet it appears that the frequency of breakpoints increases with the number of neighboring stations.
Stephen Rasey: The empirical breakpoints of a station are at least a partial function of the station’s neighbors and appear not to be intrinsic to the station’s own behavior.
It’s not an either/or. It compares the size of the jump at the possible break point to the standard deviation of the neighbor measurements.
Stephen Rasey: EDIT of 10:24am:
It is a red flag in that the density of breakpoints, if deserved by station microsite issues, TOBS, and instrument changes, should be independent of the number of other stations within a day’s drive. Yet it appears that the frequency of breakpoints increases with the number of neighboring stations.
That is a really interesting comment. Is it possible that there are more thermometers in the regions that have the greatest change?
Matthew R Marler at 9:58 pm
Samaniego’s book “A comparison of frequentist and Bayesian methods of estimation”, and most other introductions to Bayesian estimation.
I have not read that book. Nor am I likely to in the time I have left on this planet.
What is the essence to the methodology?
What are the key assumptions?
Bayesian analysis has as its Achilles’ heel the issue of “prior” estimates of distributions. The IPCC accepted a 0-18 uniform prior for the climate sensitivity in an egregious case of Thumb on the Scales, in a desperate and transparent effort to keep a climate CO2 sensitivity of 4.5 deg C per doubling as a viable high estimate.
I am a Bayesian. That doesn’t mean I accept the work of all Bayesians in all situations.
@Matthew R Marler at 10:34 am
Is it possible that there are more thermometers in the regions that have the greatest change?
“greatest change?”
Change in what?
I think the existence of a breakpoint at a station has less to do with what happens at the station and more to do with the number and nearness of neighboring thermometers in the kriged field of trends.
I further suspect that as more stations are broken into smaller and smaller segments and returned to the pool of kriging control points (a guess on my part, see above), the kriged field of trends takes on a “fabricated” shape with unrealistically small uncertainty, which is what perpetuates the process’s tendency to keep breaking segments into ones shorter than ten years.
@Matthew R Marler 10:28 am
This began with a comparison of using the overall mean (equivalent except for degrees of freedom to omitting the observation) to the local mean. Unless all of the local means are the same, you get a lower mean square error in estimating the local means and the overall mean if you impute the local mean for the missing value.
Impute the local, regional kriged mean estimate — AND ITS ESTIMATED ERROR — for the missing value.
Everything you do to a dataset adds error and uncertainty.
Data Stream: 12, 12, 12, 12, 12, 21, 12, 12, 12, 12,
Wow, that 21 sure is an outlier. You know, it is probably a transposition.
PROBABLY. There is a non-zero probability that it is correct. (I know, because it’s my data and it is real.)
Changing the data stream to
Data Stream: 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
is not as correct as
Data Stream: 12, 12, 12, 12, 12, (12, 12, 21), 12, 12, 12, 12,
where (12, 12, 21) denotes a low, most likely, and high estimate on some non-symmetric distribution.
The latter preserves the addition of uncertainty to the overall mean result.
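A minimal R sketch of that idea (the low/most-likely/high weights below are invented purely for illustration): carry the suspect value as a small distribution instead of silently “correcting” it, and the extra uncertainty shows up in the mean.

```r
# Treat the questionable sixth value as (low, most likely, high) with
# asymmetric weights rather than overwriting it with 12.
raw       <- c(12, 12, 12, 12, 12, 21, 12, 12, 12, 12)
corrected <- c(12, 12, 12, 12, 12, 12, 12, 12, 12, 12)

vals <- c(12, 12, 21)       # most likely a transposed 12, some chance 21 is real
prob <- c(0.45, 0.45, 0.10)

draws <- replicate(10000, {
  x    <- raw
  x[6] <- sample(vals, 1, prob = prob)
  mean(x)
})

mean(corrected)   # 12.0  -- the "corrected" mean, uncertainty quietly discarded
mean(raw)         # 12.9  -- takes the outlier at face value
mean(draws)       # ~12.09, in between
sd(draws)         # ~0.27, the uncertainty the silent correction would hide
```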
Speaking of mse (mean standard error),
Here is a simple question.
Given a 30 day month’s daily Tave measurements
c(9,11,9,11,9,11,9,11,9,11,9,11,9,11,9,11,9,11,9,11,9,11,9,11,9,11,9,11,9,11)
The month’s Tmean is obviously 10.00,
but what is the month’s Tmse?
StDev = 1.000, Count = 30,
mse = StDev/sqrt(count) = 0.182.
Wrong.
It is a trick question.
We never measure any 9 or 11.
We never measure any daily Tave.
We measure 30 daily Tmin and 30 daily Tmax.
So let’s assume that our
30 Tmins are c(4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6)
and
30 Tmax are c(14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16)
The monthly Tavg from the 30 Tmins and 30 Tmaxs = 10.000
but the StDev = 5.099, count = 60
Tmse = 5.099/sqrt(60) = 0.658 deg C.
That’s a big error bar when we are dealing with trends of 0.20 deg / decade.
It is a huge error bar if we break long temperature records into sub-decade long segments. For this reason, if no other, BEST’s scalpel is making a bad situation worse.
Even if the temperature sensor is pristine, Class 1, and error free, the fact that the original data is daily mins and maxes means there is about a half-degree uncertainty in the monthly mean, which will pass through to the monthly anomaly.
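For reference, the two calculations above in R (same synthetic month; “Tmse” here is the standard error of the monthly mean, and R’s sd() uses n − 1, hence the slight differences from 0.182 and 0.658):

```r
tave <- rep(c(9, 11), 15)               # 30 daily Tave values
mean(tave)                              # 10.00
sd(tave) / sqrt(length(tave))           # ~0.19

tmin <- rep(c(4, 6), 15)                # what is actually measured:
tmax <- rep(c(14, 16), 15)              # 30 daily Tmin and 30 daily Tmax
obs  <- c(tmin, tmax)
mean(obs)                               # 10.00
sd(obs) / sqrt(length(obs))             # ~0.66 -- the much larger error bar
```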
Where in all this work about mse are we keeping track of the original Tmse in the monthly anomalies? Keep in mind, every infilled data point must carry not only the Tmse that should be in the station data, but also the Tmse present in the control points of the kriging field. As you infill the data, you are adding noise you must not ignore.
South African mining engineer & geostatistician Danie G. Krige, father of kriging, died last year.
Stephen Rasey, lots of good comments.
Yes, estimate the imputed value and the error of its estimate. With multiple imputation, you draw samples from the data (where they exist) and from the distributions of the imputed values when estimating the precision of the overall estimate.
The key feature of Samaniego’s book is its evaluation of when the Bayes estimate is likely to be better than the mle and fiducial distribution in practice: basically, iff the prior distribution is accurate enough, what he calls passing the “threshold”. Thus, he supports your idea that the prior is the Achilles heel. Lots of us agree anyway, but it is good to have a scholarly work on the subject.
@Matthew R Marler at 10:34 am
Is it possible that there are more thermometers in the regions that have the greatest change?
“greatest change?”
Change in what?
The greatest change in temperature. You said that there are more breakpoints in the regions that have the most thermometers, and I was wondering if those are the regions with the largest number of industrial developments and such in that period of time. If so, the larger number of breakpoints would be an expected outcome.
Stephen Rasey says:
July 2, 2014 at 12:06 pm
Excellent points, Stephen.
I have long held that the reason that the standard error of the ocean heat content (OHC) values is incorrectly claimed to be so small is that the error is not carried forward in the step where they remove the “climatology”. Instead, the “anomalies” with the monthly average values removed are treated as observational points with no associated error … when in fact they have an error equal to the standard error of the mean of the monthly measurements.
w.
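A minimal sketch of that point in R, with invented numbers: an anomaly is an observation minus an estimated monthly climatology, so its uncertainty includes the standard error of that climatological mean, and the neglected term matters most where the climatology rests on few measurements.

```r
# Error of an anomaly = sqrt(observation error^2 + SE of the climatology^2);
# removing the monthly mean does not remove that mean's own uncertainty.
sigma_obs <- 0.5                        # assumed per-observation error (deg C)

for (n_clim in c(100, 5)) {             # measurements behind the monthly climatology
  se_clim    <- sigma_obs / sqrt(n_clim)
  se_anomaly <- sqrt(sigma_obs^2 + se_clim^2)
  cat(n_clim, "obs in climatology: anomaly SE =", round(se_anomaly, 3), "deg C\n")
}
# 100 obs -> 0.502; 5 obs -> 0.548 (vs. 0.5 if the climatology error is ignored)
```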
But the big question is, ….. is each individual “true signal” also an accurate and/or correct signal that has not degraded (increased/decreased) “over time” due to physical changes within its local environment?
To that end, we drop any station that has moved to or from an unknown or differently rated location. We also drop an unmoved station where encroaching microsite has changed the rating.
We also drop a station if TOBS is “reversed” because that produces a large, spurious step change.
Stephen Rasey and Willis Eschenbach: Speaking of mse (mean standard error),
mse stands for “mean squared error”. It is the sum of the square of the bias and the variance of the estimator. The bias is the difference between the mean of the estimator and the true value. Siting a thermometer next to an air conditioner heat exchanger, for example, adds a bias no matter how precise the thermometer is. Siting and drift are bias issues, related to accuracy.
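A quick numerical illustration of that decomposition (the 0.5 deg bias and 0.1 deg noise are made up): a very precise thermometer sited next to a heat exchanger still carries the full siting bias in its mse.

```r
set.seed(7)
true_T <- 20
reads  <- rnorm(1e5, mean = true_T + 0.5, sd = 0.1)  # bias 0.5, precision 0.1

bias <- mean(reads) - true_T
mean((reads - true_T)^2)   # mse, ~0.26
bias^2 + var(reads)        # bias^2 + variance, ~0.26 -- the two agree
```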
Matthew R Marler says:
July 2, 2014 at 5:29 pm (Edit)
Ahhh … I was under the impression you were talking of the standard error of the mean (sem, not mse).
I don’t see how you could ever hope to calculate the bias of the temperature dataset … so I’m not clear now about your previous claim of the effect of infilling on the mse.
w.
Willis Eschenbach: I don’t see how you could ever hope to calculate the bias of the temperature dataset … so I’m not clear now about your previous claim of the effect of infilling on the mse.
With a particular data set, you can’t tell for sure that you have the estimate with the least obtained bias, unless you have some other estimate of the parameter of interest. I think I wrote that earlier, but perhaps not on this thread. That the Bayes estimators can’t be beaten over many samples is a theorem; one of the points of Samaniego’s book is that you can’t expect (!) the result unless the prior is sufficiently accurate.
However, there is no reason to think that the overall mean is a better estimate for infilling than the Bayesian estimate of the local mean, which was where we started.
mse is mean standard error in my example. I was explicit in the formula, name and abbreviation.
I’ve never heard of mean squared error.
The greatest change in temperature. You said that there are more breakpoints in the regions that have the most thermometers, and I was wondering if those are the regions with the largest number of industrial developments and such in that period of time. If so, the larger number of breakpoints would be an expected outcome.
Why would it be the expected outcome? I do not disagree. But the reason there should be more breakpoints in places with a large number of thermometers is key to grokking the problem.
Breakpoints should reflect a problem with THAT station. Either it is an obvious data gap or a known (metadata) change in recording methodology and/or location (with a presumed significant change in local climate), OR it is some UNKNOWN change in the station, a change only highlighted by its difference from surrounding stations.
This last point is valid if and only if the regional trend is trustworthy enough to identify unknown problems at a station. The trustworthiness of the regional trend has yet to be proven to me. The only peer-reviewed paper I’ve seen (from AGU 2012) was on a synthetic case where breakpoints were not an element tested. But I see an enormous problem with the kriging trend as a function of time as different control points have different segments come and go.
It is easy for me to see that an increase in breakpoints is a natural artifact of an increasing number of comparison stations. It signifies a problem with regional instability and too-sensitive decision criteria for wielding the scalpel.
Ok, I was wrong and learned something.
mse is mean squared error.
mean standard error, i.e. the standard error of the mean (= StDev of sample / SQRT(number of samples)),
and might be abbreviated as sey or sem; RMSE (root mean squared error) is a different quantity.
http://epitools.ausvet.com.au/content.php?page=UserGuide15
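For anyone following along, the two quantities side by side in R (invented data; the 0.3 deg offset stands in for a bias relative to a known true value):

```r
set.seed(3)
true_val <- 10
x <- rnorm(30, mean = true_val + 0.3, sd = 1)  # noisy, slightly biased sample

sd(x) / sqrt(length(x))        # standard error of the mean (sem), ~0.2
sqrt(mean((x - true_val)^2))   # RMSE relative to the true value, ~1 (includes the bias)
```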
evanmjones says:
July 2, 2014 at 4:06 pm
“To that end, we drop any station that has moved to or from an unknown or differently rated location. We also drop an unmoved station where encroaching microsite has changed the rating.”
—————-
If you keep eliminating stations …. then you have to keep extrapolating temperature over a greater and greater area, which introduces even greater uncertainty in the temperature record, does it not?
Of course I don’t think it makes any difference, one way or the other, how you do it.
@Willis Eschenbach at 7/2 3:00 pm
I think the claimed error bars on OHC for the years 1955 to 2005 have a much simpler and basic explanation.
Suppose we want to study the changes in heights and weights for the adult population of the United States. Since 2004 we have been measuring every 1000th adult that walked into every doctor’s office around the country. Before long you’d have a good tight estimate on the mean with a narrow mean standard error. Of course there are caveats. We are not measuring children, just adults. A little deeper in the fine print, we see that we measure people who “walked” into doctors’ offices. If they are in a wheelchair, or on crutches, they don’t get measured. Deeper down, we realize that people who cannot get to doctors’ offices don’t get measured either.
But It is still a good dataset! (If you remember the caveats).
Is there any way we can find some data from before this well designed study started to find any trend in heights of adults back to 1950? “Wow, look what I found! Here is some data from a well designed study, high quality control, that gives me heights and weights of adults. They’ve been doing it for years. It’s just what we need!”
Oh, who conducted the study?
“The U.S. Army medical corps.”
Where did they collect the data?
“Fort Benning and Fort Jackson Basic Combat Training centers”
So…. you have a good handle on mean and standard deviation on adults…. who are fit and young enough to join (or be drafted) into the Army — at boot camp. That’s what you are going to splice onto your new doctor visit dataset?
Levitus 2009 did something very similar.
Today we have a great ARGO data collection effort. It only measures the top 2000 m of ocean. It only reports from the open ocean, so forget the Arctic regions. Continental shelves are not deep enough for the buoy hibernation. Restricted seas, like the Sea of Japan, may be over- or under-sampled. But it is a good dataset.
Levitus looked for a well-collected dataset at times earlier than 2003, when ARGO was started. He found such a dataset, originally collected under contract with the Office of Naval Research, which was keenly interested in ocean temperature profiles as an element in the hiding and detection of submarines (ours and theirs). Very well collected data. The problem is they focused measurements where submarines patrol: NW Pacific, NE Pacific, NW Atlantic, NE Atlantic, Barents Sea. Within 1500 miles of the coast. (See the Abraham maps in the links above.) It is sparse sampling, but high quality — where it was done. But just like the apocryphal boot camp study above, they have nice tight narrow standard deviations on an artificially restricted dataset they pass off as representative of the data they collect today.
It is one thing for Levitus 2009 to have been published. It is quite another that its conclusions are repeated today with a straight face.
If you keep eliminating stations …. then you have to keep extrapolating temperature over a greater and greater area, which introduces even greater uncertainty in the temperature record, does it not?
Yes, but there is no getting away from it. And, fortunately, there is oversampling, and even the included Class 1/2s alone have a pretty good distribution. Those are the ones they should use. The others must either be heavily adjusted for microsite, or dropped (dropped, if possible). Besides, what’s the use of extrapolating less if the slack is being taken up by fatally flawed data in the first place?
Stephen Rasey: Why would it be the expected outcome? I do not disagree. But the reason there should be more breakpoints in places with a large number of thermometers is key to grokking the problem.
On that we agree.
evanmjones says:
July 3, 2014 at 10:06 am
“Besides, what’s the use of extrapolating less if the slack is being taken up by fatally flawed data in the first place?”
————————
My thoughts exactly. It is all “fatally flawed data” with no hope of ever getting it corrected or straightened out.
But there sure are a lot of people that are obsessed with trying to do that very thing.
@Willis Eschenbach at 3:00 pm
I have long held that the reason that the standard error of the ocean heat content (OHC) values is incorrectly claimed to be so small is that the error is not carried forward in the step where they remove the “climatology”. Instead, the “anomalies” with the monthly average values removed are treated as observational points with no associated error … when in fact they have an error equal to the standard error of the mean of the monthly measurements.
Willis, I agree with you completely. I believe with almost ALL work with anomalies, a poor job is being done with the accounting of uncertainty in the statistics.
In my rather lengthy reply of July 3, 9:29 am, I did not mean to disagree with or discount your observation. I chose instead to pick on a separate, easier-to-identify, and more basic error in the OHC (Ocean Heat Content) dataset from measurements made prior to 2003 — unrealistic measurement uncertainty coming from a highly biased and sparse sampling (in time and space) of temperatures below 300 m.
Levitus 2009 reports a 1.8 x 10^23 joule increase from 1955 to 2008. Keeping in mind that it takes 27.5 ZJ (= 2.75 x 10^22 joules) to raise the 0-2000 m interval of the ocean by 0.01 deg C, that means Levitus’s total temperature change is less than 0.08 deg C. As in your Decimals of Precision post, if 100,000 measurements per year from ARGO might justify a 0.005 deg C accuracy, 1,000 measurements per year can do no better than 0.05 deg C, and 10 measurements/year can do no better than 0.5 deg C.
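The 1/sqrt(N) scaling behind those figures, spelled out in R (the implied per-measurement spread is just back-calculated from the 0.005 deg C per 100,000-samples figure):

```r
sigma_single <- 0.005 * sqrt(100000)    # implied per-measurement spread, ~1.6 deg C
for (n in c(100000, 1000, 10)) {
  cat(sprintf("%.0f measurements/yr -> %.3g deg C\n", n, sigma_single / sqrt(n)))
}
# 100000 -> 0.005, 1000 -> 0.05, 10 -> 0.5
```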
On top of all that sample bias, I too believe that they are disposing of the sample uncertainty when they are taking their anomalies. But we have to get deeper into the weeds to prove it. For example: how are they binning the sub-300-meter readings in time and space? What do they do with cells that have zero, one, or two readings? What happens if they have 5 measurements in a vol-time grid cell for May 1965, but no more until July 1969? What happens if a sub-1000 m vol-grid cell is never sampled in January until 2004, with the first ARGO float to visit the cell?