Problems With The Scalpel Method

Guest Post by Willis Eschenbach

In an insightful post at WUWT, Bob Dedekind talked about a problem with temperature adjustments. He pointed out that the stations are maintained, by doing things like periodically cutting back the trees that are encroaching, or by painting the Stevenson Screen. He noted that if we try to “homogenize” these stations, we get an erroneous result. This led me to a consideration about the “scalpel method” used by the Berkeley Earth folks to correct discontinuities in the temperature record.

The underlying problem is that most temperature records have discontinuities. There are station moves, and changing instruments, and routine maintenance, and the like. As a result, the raw data may not reflect the actual temperatures.

There are a variety of ways to deal with that, which are grouped under the rubric of “homogenization”. A temperature dataset is said to be “homogenized” when all effects other than temperature effects have been removed from the data.

The method that I’ve recommended in the past is called the “scalpel method”. To see how it works, suppose there is a station move. The scalpel method cuts the data at the time of the move, and simply considers it as two station records, one at the original location, and one at the new location. What’s not to like? Well, here’s what I posted over at that thread. The Berkeley Earth dataset is homogenized by the scalpel method, and both Zeke Hausfather and Steven Mosher have assisted the Berkeley folks in their work. Both of them had commented on Bob’s post, so I asked them the following.

Mosh and/or Zeke, Stephen Rasey above and Bob Dedekind in the head post raise several points that I hadn’t considered. Let me summarize them, they can correct me if I’m wrong.

• In any kind of sawtooth-shaped wave of a temperature record subject to periodic or episodic maintenance or change, e.g. painting a Stevenson screen, the most accurate measurements are those immediately following the change. Following that, there is a gradual drift in the temperature until the following maintenance.

• Since the Berkeley Earth “scalpel” method would slice these into separate records at the time of the discontinuities caused by the maintenance, it throws away the trend correction information obtained at the time when the episodic maintenance removes the instrumental drift from the record.

• As a result, the scalpel method “bakes in” the gradual drift that occurs in between the corrections.

Now this makes perfect sense to me. You can see what would happen with a thought experiment. If we have a bunch of trendless sawtooth waves of varying frequencies, and we chop them at their respective discontinuities, average their first differences, and cumulatively sum the averages, we will get a strong positive trend despite the fact that there is absolutely no trend in the sawtooth waves themselves.

So I’d like to know if and how the “scalpel” method avoids this problem … because I sure can’t think of a way to avoid it.

In your reply, please consider that I have long thought and written that the scalpel method was the best of a bad lot of methods, all methods have problems but I thought the scalpel method avoided most of them … so don’t thump me on the head, I’m only the messenger here.

w.
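
The thought experiment in the comment above can be tried out with a minimal sketch in Python. It follows the experiment only as worded there (chop at the discontinuities, average the first differences, cumulatively sum the averages) and is not the actual Berkeley Earth algorithm. The function names, the four maintenance periods, and the drift rate of 0.01 degrees per month are all made up for illustration.

```python
import numpy as np


def sawtooth(n_months, period, drift_per_month):
    """Trendless sawtooth: the bias ramps up linearly, then maintenance resets it to zero."""
    return drift_per_month * (np.arange(n_months) % period)


def fitted_slope(series):
    """Least-squares linear trend of a series, per time step."""
    return np.polyfit(np.arange(len(series)), series, 1)[0]


def chop_and_rebuild(records):
    """Average the first differences with the reset steps cut out, then cumulatively sum."""
    diffs = np.diff(np.asarray(records, dtype=float), axis=1)
    # In this noise-free toy, a negative step is always a maintenance reset,
    # so that is where the cut falls and the step is dropped from the average.
    keep = diffs >= 0
    counts = keep.sum(axis=0)
    sums = np.where(keep, diffs, 0.0).sum(axis=0)
    # Guard against a step where every record resets at once (nothing to average).
    mean_diff = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    return np.concatenate([[0.0], np.cumsum(mean_diff)])


# Hypothetical inputs: four stations, 100 years of monthly data,
# each with a different maintenance cycle (in months).
periods = [36, 60, 84, 120]
records = [sawtooth(1200, p, 0.01) for p in periods]

raw_slope = fitted_slope(np.mean(records, axis=0))
rebuilt_slope = fitted_slope(chop_and_rebuild(records))
print(f"raw average trend:   {raw_slope:.5f} per month")
print(f"chopped and rebuilt: {rebuilt_slope:.5f} per month")
```

In this toy setup the rebuilt series climbs at essentially the full drift rate, while the plain average of the same records has almost no trend. With real data the cuts would have to come from station metadata or a breakpoint detector rather than from the sign of the step, so this only illustrates the concern and does not settle it.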

Unfortunately, it seems that they’d stopped reading the post by that point, as I got no answer. So I’m here to ask it again …

My best to both Zeke and Mosh, who I have no intention of putting on the spot. It’s just that as a long time advocate of the scalpel method myself, I’d like to know the answer before I continue to support it.

Regards to all,

w.

181 thoughts on “Problems With The Scalpel Method”

  1. Any data handling method that can produce a positive temperature trend is highly sought after amongst CAGW supporters in these days of no obvious warming. TOBS, homogenisation, UHI, relocation and loss of sites can all be pressed into the service of The Cause in some way or another. There is no consideration of Scientific Method here, it is now all just politics.

  2. Link to Dedekind’s post does not work.

    Do you assume that the drift is always positive: a tree creating a shadow, new asphalt, new buildings, fading paint, …

    Is it possible to find a signal that is smaller than the measurement errors? This is the big question of the temperature records.

  3. same problem as exposed for GISS pair-wise correction. The method cannot catch gradual drift followed by step-wise correction. Thus it introduces bias into signals that have no bias.

    however, since the bias tends to introduce warming and warming was expected, the error went undetected by the programmers.

  4. no conspiracy is required. programmers never look for errors when the data gives the expected result. that is how testing is done most of the time. you only look for problems when you don’t get the answer you expect. so, if you expect warming, you only look for bugs when the results don’t show warming. as a result, bugs that cause warming are not likely to be found – at least not by the folks writing the code.

  5. “the most accurate measurements are those immediately following the change. Following that, there is a gradual drift in the temperature until the following maintenance.”

    Could we create a subset containing the first measurement only from each change, if these are the most accurate?

    Much fewer measurements, but illuminating, *if*, the trend differs from using all of the measurements?

  6. steverichards1984 says: June 28, 2014 at 11:39 pm
    subset containing the first measurement only….
    Yes, please. I’ve often advocated a look at a large temperature subset composed of the first 5 years of operation only, of a new or relocated station. I lack the means but I promote the idea.

    More philosophically, it is interesting how the principle of ‘adjustment’ has grown so much in climate work. It’s rather alien to most other fields that I know. I wonder why climate people put themselves in this position of allowing subjectivity to override what is probably good original data in so many cases.

  7. That is a well and long known problem

    http://climateaudit.org/2011/10/31/best-menne-slices/#comment-307953

    http://wattsupwiththat.com/2014/01/29/important-study-on-temperature-adjustments-homogenization-can-lead-to-a-significant-overestimate-of-rising-trends-of-surface-air-temperature/

    I don’t think there is much enthusiasm for any improvement among those invested in AGW.

    Halving temperature trends over land may give a better guess than BEST. That matches better with McKitrick’s paper, with Watts’ draft and with lower troposphere satellite trends.

  8. “…a temperature record subject to periodic or episodic maintenance or change…”. Have any tests been done to determine the magnitude of changes such as repainting compared to daily dirt and dust build-up? I have a white car which progressively becomes quite grey until a good rain storm. I would imagine such dirt build-up could have a significant effect on a Stevenson screen between rain storms.

  9. Splitting a record at a breakpoint has the same effect as correcting the breakpoint. If the breakpoint was caused by station maintenance or other phenomena that RESTORES earlier observing conditions after a period of gradually increasing bias, correcting the breakpoint or splitting the record will preserve the biased trend and eliminate a needed correction. If a breakpoint is caused by a change in TOB, the breakpoint needs to be corrected or the record needs to be split to eliminate the discontinuity. If a breakpoint is caused by a station move, we can’t be sure whether we should correct it or leave it alone. If the station was moved because of a gradually increasing [urban?] bias and the station was moved to an observing location similar to the station’s early location, correcting the breakpoint will preserve the period of increasing urban bias. If the station wasn’t moved because the observing site wasn’t degrading, then correction is probably warranted.

    Without definitive meta-data, one can’t be sure which course is best. However, only one large shift per station which cools the past can be attributed to a change in TOB, along with any pairs of offsetting large shifts. All other corrections that are undocumented probably should be included in the uncertainty of the observed trend (corrected for documented biases). For example, global warming in the 20th century amounted to 0.6 (observed change after correcting for documented artifacts) – 0.8 degC (after correcting all apparent artifacts).

  10. It is clear beyond doubt (see for example the recent articles on Steve Goddard’s claim regarding missing data and infilling, and the poor siting issues that the surface station survey highlighted) that the land based thermometer record is not fit for purpose. Indeed, it never could be, since it has always been strained well beyond its original and design purpose. The margins of error far exceed the very small signal that we are seeking to tease out of it.

    If Climate Scientists were ‘honest’ they would, long ago, have given up on the land based thermometer record and accepted that the margins of error are so large that it is useless for the purposes to which they are trying to put it. An honest assessment of that record leads one to conclude that we do not know whether it is today warmer than it was in the 1880s or in the 1930s, but as far as the US is concerned, it was probably warmer in the 1930s than it is today.

    The only reliable instrument temperature record is the satellite record, and that also has a few issues, and most notably the data length is presently way too short to be able to have confidence in what it reveals.

    That said, there is no first order correlation between the atmospheric level of CO2 and temperature. The proper interpretation of the satellite record is that there is no linear temperature trend, and merely a one off step change in temperature in and around the Super El Nino of 1998.

    Since no one suggests that the Super El Nino was caused by the then present level of CO2 in the atmosphere, and since there is no known or understood mechanism whereby CO2 could cause such an El Nino, the take home conclusion from the satellite data record is that climate sensitivity to CO2 is so small (at current levels, ie., circa 360ppm and above) that it cannot be measured using our best and most advanced and sophisticated measuring devices. The signal, if any, to CO2 cannot be separated from the noise of natural variability.

    I have always observed that talking about climate sensitivity is futile, at any rate until such time as absolutely everything is known and understood about natural variation, what are its constituent forcings and what are the lower and upper bounds of each and every constituent forcing that goes to make up natural variation.

    Since the only reliable observational evidence suggests that sensitivity to CO2 is so small, it is time to completely re-evaluate some of the corner stones upon which the AGW hypothesis is built. It is at odds with the only reliable observational evidence (albeit that data set is too short to give complete confidence), and that suggests that something fundamental is wrong with the conjecture.

  11. Per Willis

    “…As a result, the raw data may not reflect the actual temperatures….”
    //////////////////////////

    Wrong; the raw data is the actual temperature at the location where the raw data is measured.

    What you mean is whether there are some factors at work which have meant that the actual temperature measured (ie., the raw data) should not be regarded as representative of temperatures because it has been distorted (upwards or downwards) due to some extrinsic factor (in which I include changes in the condition of the screen, instrumentation, TOBs as well as more external factors such as changes in vegetation, nearby buildings etc).

  12. Global cooling says:
    June 28, 2014 at 11:03 pm

    Link to Dedekind’s post does not work.

    Thanks, fixed.

    Do you assume that the drift is always positive: a tree creating a shadow, new asphalt, new buildings, fading paint, …

    While in theory the jumps should be equally positive or negative, most human activities tend to raise the local temperature. In particular the growth of the cities has led to UHI. As a result, when many of the weather stations moved to nearby airports after WWII, there would be a sharp cooling of the record.

    In addition, if you just leave a met station alone, the aging of the paint and the growth of surrounding vegetation cutting out the wind both tend to warm the results.

    However, there are changes that cool the station, so yes, the jumps will go both ways. But that doesn’t fix the problem. The scalpel method is removing the very information we need to keep from going wrong.

    Is it possible to find a signal that is smaller than the measurement errors? This is the big question of the temperature records.

    Mmmm … in theory, sure. Signal engineers do it every day. But in the temperature records? Who knows.

    w.

  13. I don’t understand this constant fiddling with data.

    Consider; you perform an experiment, you get some results (ie., the actual results of the experiment which is the raw data that you have found). You then interpret these results, and set out your findings and conclusions (which will by necessity discuss the reliability and margins of error of the actual results of the experiment). But you never substitute the actual results of the experiments with your own interpreted results, and claim that your own interpreted results are the actual results of the experiment conducted.

    When someone seeks to replicate the experiment, they are seeking to replicate whether the same raw data is achieved. When you seek to review an earlier performed experiment, two distinct issues arise;
    1. Does the replicated experiment produce the same raw data?
    2. Does the interpretation which the previous experimenter gave to the findings withstand scientific scrutiny, or is there a different (possibly better) interpretation of the raw data?

    These should never be confused.

    The raw data should always remain collated and archived so that others coming after can consider what they consider the raw data shows. Given advances in technology and understanding, later generations may well have a very different take on what the raw data is telling. Unfortunately it appears that much of the original unadjusted raw data, on a global basis, is no longer available.

    If we cannot detect the effects of UHI in the land based thermometer record, given that UHI is a huge signal and we know that over the years urbanisation has crept and that station dropouts have emphasised urban stations over truly rural stations, there is no prospect of seeing the far weaker signal of CO2.

  14. Willis Eschenbach says:
    June 29, 2014 at 1:03 am
    //////////////////

    Commonsense would suggest that the majority of recent adjustments should be to cool recent measurements.

    Given the effects of UHI and urban sprawl, the switch to airports etc. (with more aircraft traffic and more powerful jet engines compared to props etc.) these past 40 or so years, adjustments to the ‘record’ for 2014 back through to, say, 1970 should lower these measurements (since the raw data would be artificially too high due to warming pollution by UHI etc.).

    Yet despite this being the commonsense position, it appears that the reverse is happening. Why???

  15. On Nov 1, 2011, Steve McIntyre wrote:

    One obvious diagnostic that BEST did not provide – presumably because of their undue haste in the answer – is a histogram of the thousands of steps eliminated in the slicing. If they provide this, I predict that the histogram will show the bias that I postulate. (This is pretty much guaranteed since the BEST trend is greater than CRU trend on GHCN data.)

    That was a while ago. Has BEST published such a histogram of breakpoint trend offsets?

    On the majority of BEST stations and their breakpoints that I investigate, I am appalled at the shortness of the segments produced by the scalpel. We’ve just seen Luling, TX. I’ve written about Stapleton Airport, Denver, CO, where the BEST breakpoints do not match airport expansion events, yet BEST misses the opening and closing of the airport!!!

    People, I’ll accept a breakpoint in a station if it is a move with a significant elevation change. No Airport in the world could have a move breakpoint based on elevation change. I’ll grant you that moving a temperature station at LAX from east end of the runway to the west end of the runway might warrant a breakpoint. But a climate change within the bounds of an airport is the exception, not the rule. Let us see from BEST how many AIRPORT stations have c(0,1,2,3,4,…) breakpoints in their history. I bet 90% of them make no sense. If there is a station move WITHIN an airport, and it is for microsite conditions, it does not deserve a break. If it is for maintenance, it does not deserve a break. If it is to move it away from an expanding terminal, it does not deserve a break. If it is moved next to the ocean, Ok, give it a breakpoint. How often does that happen? According to BEST it happens all the time.

  16. Richard Verney says:
    “If we cannot detect the effects of UHI in the land based thermometer record, given that UHI is a huge signal and we know that over the years urbanization has crept and that station drops outs have emphasized urban stations over truly rural stations, there is no prospect of seeing the far weaker signal of CO2.”

    Perhaps the answer is to follow the advice of John Daly and use only those ‘rural’ stations with a long record in areas where the response to CO2 (if discernible) will be at its highest and competition with water vapour at its lowest: in the Arctic and Antarctic regions where the low temperatures give the highest IR output in the region where CO2 has its highest absorption bands. Given that it has been stated that we only need 50 or so records to give an accurate GTA, he offers 60 plus sites that meet the criteria of being minimally affected by UHI effects, most of which do not show any long term warming trend with many showing cooling.

    http://www.john-daly.com/ges/surftmp/surftemp.htm

  17. Temporal UHI (Urban Heat Island) effect is another continuous drift, this is why BEST was unable to identify it. It also goes into the positive direction, because, although population explosion was already over 2 decades ago (global population below age 15 is not increasing any more), population density, along with economic activity keeps increasing in most places, due to increasing life expectancy and ongoing economic development. Which is a good thing, after all.

    It is far from being negligible, UHI bias alone can be as much as half the trend or more.

  18. I think a reasonable request of BEST is to produce a graph:
    X: Number of Years
    Y: Number of station segments whose length >= X
    By lines: For Years = (1902, 2012, by = 5). That would be about 21 curves.

    That would let us easily see how many of the 40,000 stations BEST claims have a segment length of say 40 years for 2002. Or >= 20 years for 1992. I think most observers would be shocked at how few such segments remain in the record.

    We made this request to Zeke toward the end of WUWT Video Zeke Hausfather Explains…, Dec. 18, 2013.

    Thanks, Zeke. Not only is the number of stations important, but the Length of the usable records is important. I am not the only one who would like to see a distribution of station lengths between breakpoints at points in time or across the whole dataset.

    Richard H. and I made some very rough estimates from the raw files. BEST has this data at hand. They could post it tomorrow, if they wanted to.
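
A rough sketch of the census requested in this comment, in Python. It assumes one reading of the request: for a chosen year, take every segment that covers that year and count how many are at least X years long, for each X. The `(start_year, end_year)` segment layout and the function name are made up for illustration and are not a BEST file format; the example segments are invented.

```python
def segment_census(segments, year, max_length):
    """For X = 0..max_length, count segments that cover `year` and span >= X years.

    `segments` is a list of (start_year, end_year) pairs, one per
    post-scalpel station segment (a hypothetical layout).
    """
    lengths = [end - start for start, end in segments if start <= year <= end]
    return [sum(1 for n in lengths if n >= x) for x in range(max_length + 1)]


# Invented example: four segments, three of which cover 2002.
example_segments = [(1950, 2010), (1990, 2005), (2000, 2003), (1900, 1950)]
curve_2002 = segment_census(example_segments, 2002, 60)
print(curve_2002[20])  # how many segments covering 2002 are at least 20 years long
```

Calling this once per sampled year would give one curve per year for the requested plot.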

  19. Richard Verney

    I always appreciate your comments. As I commented on the other thread

    ‘It is absurd that a global policy is being decided by our governments on the basis that they think we know to a considerable degree of accuracy the global temperature of land and ocean over the last 150 years.

    Sometimes those producing important data really need to use the words ‘ very approximately’ and ‘roughly’ and ‘there are numerous caveats’ or even ‘we don’t really know.’

    Tonyb
    Tony Brown -Climate Reason

  20. In my 2:21 post I mentioned that Richard H and I investigated some data on station length from the raw files BEST listed in an index. This is a plot of stations of greater length than X (for the entire life of the dataset). There are three curves based upon the percent of missing data.
    The total number of sites was 35,000,
    but only 10,000 of them had less than 10% missing data
    and only 3000 of them had <10% missing data and greater than 38 years long.

  21. I cannot remember the number of times I have written on blogs that ” IT IS TOTALLY SCIENTIFICALLY UNACCEPTABLE TO ALTER PAST DATA” unless you have good, scientific and mathematical analysis to allow you to do so without any doubt or favour.

    STOP IT !!

  22. The problem is an experimental error. Trying to fix an experimental error after the experiment using a statistical/processing methodology seems superficial. Fixing the flaw in the experimental design and rerunning the experiment is the way to go. It would seem better to remove stations that have moved or have had significant changes in land use. This may not be an option in most parts of the world but the US record may act as a benchmark for how one should assess global records.

  23. “Stephen Richards says:
    June 29, 2014 at 2:50 am

    ” IT IS TOTALLY SCIENTIFICALLY UNACCEPTABLE TO ALTER PAST DATA” ”

    Absolutely right. If there are perceived issues with historical data then do analysis on the raw data to show things eg like measured rural temperature rises more slowly, than airport or city data etc. (if it does).

    Do they even still have the historical raw data? As I understand it, lots of stations are no longer included in analysis, there are a myriad ways of deselecting stations to create any sort of trend you may wish. At least with raw data you are measuring what we experience and it is what we experience that determines whether “things are getting worse or better”. Signs of any “Thermogeddon” would appear in raw data with more certainty than in treated data.

  24. tonyb says:
    June 29, 2014 at 2:21 am
    ///////////
    Thanks.

    Whilst I do not like proxies, I consider that historical and archaeological record provide some of the best proxy evidence available. At least we are dealing with fact, even if it is relative and not absolutely quantitative in nature. I consider that climate scientists could do with studying history. Somehow the idea that there was some form of climate stasis prior to the industrial revolution has taken root simply because people did not live through prior times, and have no insight into the history of those times.

    I applaud your reconstruction of CET since not only does it provide us with more data, it is a valuable insight into the history of those times, and what people thought of their history. Whilst I am not a fan of the globalisation of climate, instead I consider that in the terms that we are talking about it is inherently regional in nature, there is no particular reason why CET should not be a good trend marker for the Northern Hemisphere, particularly the near Northern European Continent. So your reconstruction is an extremely valuable tool.

    A point to ponder on. I consider that the next 10 to 12 years could be compelling. IF there is no super El Nino event, and IF temperatures were to cool at about 0.1 to 0.12degC per decade (and as you know from CET there has been a significant fall in temperatures since 2000, especially the ‘winter’ period), then we will be in a remarkable position.

    I know that this contains some IFs, but if it were to come to pass, we would find ourselves (according to the satellite data) at about the same temperature as 1979. There would, in this scenario, be no warming during the entirety of the satellite record, and during this period about 80% of all manmade CO2 emissions would have taken place (ie., the emissions from 1979 to say 2025).

    I do not think that many people appreciate that that would be the resulting scenario, of what would be quite a modest drop in temperature, and the knock on consequence is huge, since this would make it very difficult for someone to argue that manmade CO2 emissions are significant (if 80% of all such emissions has resulted in no measured increase at all!).

  25. Addendum to my 2:45 am post:
    The total number of sites was 35,000
    but only 10,000 of them had less than 10% missing data
    and only 3000 of them had <10% missing data and greater than 38 years long.

    To be clear, these numbers came from a superficial look at the raw data files Zeke and Mosher provide links to in a index page. These values are prior to the use of the BEST scalpel. BEST has ready and fast access to post-scalpel segment length. Constructing our suggested census plots by sampled year should be easy and no burden compared to other processing they do.

    Another interesting chart that Richard H and I briefly explored is a census map of 1×1 degree grid based upon post-segment length. For instance color code the 1×1 deg cells by the number of segments that exist that are longer than 20 years and cover the year 2005. I hypothesize it would be a pretty blank map. 2×2 deg cells? (that is about 100×125 mile cells).

  26. ferdberple says:
    June 28, 2014 at 11:26 pm

    Us component/pcb/system test and reliability engineers tried to break things without exceeding the specification. Once a problem was identified one of the Not Invented Here/It’s not our hardware responses was that condition will never happen in the field; Test response “how do you know?” Design Answer “because I/we designed it”. Test engineer “You designed in this fault. how do you KNOW?” Designer “yes but….”

    We’re now in the yes but scenario.

    Willis’ question is unanswerable as the error is designed in, and will happen in the field.

  27. “the most accurate measurements are those immediately following the change. Following that, there is a gradual drift in the temperature until the following maintenance.”

    The drift is always positive, that’s why it doesn’t work. If the drift were random it would work.

  28. Didn’t Anthony have an experiment running to test Stevenson screens under different maintenance regimes? Is there anything from those tests that would be of use in this discussion?

  29. The breakpoints should be equally distributed between positive and negative adjustments of roughly the same magnitude.

    And the positive and negative breakpoints should be equally distributed through time. In 1910, there should be 200 negative breakpoints and 200 positive adjustments of roughly the same magnitude. And this should be roughly consistent from the start of the record to the end of the record.

    This is how bias can be detected. Simple histograms. And then we can determine whether the excess of negative adjustments in 1944 is valid for example. What is the main reason for this.

    There should not be a trend through time in the breakpoints unless it can be proven that they should vary through time.

    BEST hasn’t shown this that I am aware of. The description of the simple breakpoint/scalpel method suggests that the breakpoints should be random through time. Are they? Why haven’t they shown this and explained it?

    We know the NCDC has shown their adjustments have a systematic trend through time with a maximum negative adjustment of -0.35C to -0.55C in the 1930s to 1940s. Why? Does the TOBs adjustment vary throughout time? Why? Shouldn’t the other changes vary randomly through time? Why not?

    Should people be allowed to change the data without fully explaining it?
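
The histogram check described in this comment could be sketched roughly as follows, in Python. This is an illustration only. It assumes the breakpoints are available as `(year, step)` pairs, which is a made-up layout and not a BEST or NCDC format, and both function names are hypothetical. It tallies negative and positive steps per decade and runs an exact two-sided sign test on a pair of counts, which asks only whether the split is consistent with a 50/50 coin and says nothing about why it might not be.

```python
from collections import defaultdict
from math import comb


def sign_counts_by_decade(breakpoints):
    """Map decade -> [number of negative steps, number of positive steps]."""
    counts = defaultdict(lambda: [0, 0])
    for year, step in breakpoints:
        if step == 0:
            continue  # a zero step carries no sign information
        index = 1 if step > 0 else 0
        counts[year // 10 * 10][index] += 1
    return dict(counts)


def sign_test_p(n_neg, n_pos):
    """Exact two-sided sign test: probability of a split at least this lopsided
    if positive and negative steps were equally likely."""
    n = n_neg + n_pos
    k = min(n_neg, n_pos)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Invented example data, for illustration only.
example = [(1941, -0.3), (1944, -0.5), (1947, 0.2), (1952, 0.1)]
for decade, (n_neg, n_pos) in sorted(sign_counts_by_decade(example).items()):
    print(decade, n_neg, n_pos, round(sign_test_p(n_neg, n_pos), 3))
```

A magnitude histogram per decade would be the natural next step, since the sign test ignores the size of each step.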

  30. FWIW, I designed and built a precision Web/Internet enabled temperature monitor system about a decade ago (selling them to customers like HP Research and the Navy Metrology Lab), and although we never built a version for outdoor environmental monitoring and data acquisition, I did do research on things like radiation shields/Stevenson screens and the like, thinking we might sell to that market.

    One of the more interesting things I discovered was that some white paints (Krylon flat white spray paint being one of them) can actually have a far higher emissivity in the infrared spectrum than most other colors, including some *black* paints. (I don’t have the link handy, but I remember one source of this was a NASA document comparing the IR emissivity of a fairly large number of coatings. Krylon’s aerosol flat white was among the paints with the very highest IR emissivity which, of course, also means it’s near the top in IR absorption.)

    Moral: Visible ain’t IR, and your eyes can’t tell by looking whether that paint absorbs or reflects heat. The fact that the really high emissivity was for flat white paint does call into question whether and how weathering/aging might dramatically increase the thermal absorption of white paints or even molded plastic radiation shields over time and hints that glossiness is at least as important as color. Every paint I’ve ever encountered tends to get flat and/or chalky over time as it ages and oxidizes. As a result, repainting a shield could either raise or lower the temperature inside! If anyone were *really* interested in actual climate science, this would be a topic of research, but the Global Warming narrative is better served by ignoring it, so don’t hold your breath. One more reason why climate science and good temperature measurements are harder than they appear.

    (BTW, most temp sensors and instruments, even quite a few of the expensive ones, give pretty crappy readings that are frequently subject to offset errors of a degree or more (C or F, take your pick…). Thermocouples are especially problematic, as almost no one actually gets all the intermediate junction stuff and cold junction compensation right. Some systems I’ve seen correlate to the temp of the galvanized pole they’re mounted on better than they do to ambient air temp. (Further, I’m amazed at how many so-called high precision temperature measurement systems ship TCs with the ends twisted together rather than properly welded.) I prefer platinum RTDs for accurate temp measurements, but doing that right requires precision excitation power supplies and bridge completion resistors that are stable across your entire temp range over time. These things are expensive and no one wants to pay for them. Bottom line: accurately and precisely measuring temperature electronically is much harder than it appears, and it’s often done very poorly. I strongly suspect that hundred year-old temperature records hand recorded from mercury thermometers were dramatically more accurate and consistent than what we’re getting today.)

  31. Pick out a set of 100 sites that give a quasi satisfactory spread of locations in contiguous USA.
    Sites that can be maintained well and not on top of mountains.
    Sites that have TOBS perfect
    Exclude Hawaii and Alaska.
    Accept that it is a proxy and not perfect.
    Use 50 of the sites for the accepted temperature record.
    Use 50 adjacent sites as potential back ups if sites go down.
    Put a caveat up this temp is reconstructed using real raw data with * so many substitute sites.
    Allow Nick Stokes to use the anomalies from this method to do his sums as he cannot understand anomalies are just the variation from the absolute at each site.
    When problems arise infill the missing days from the average for that day over all past years.
    Put a caveat up that the record has been amended this way for that year.
    Wash, Rinse, Repeat.

  32. Willis — I think you’re basically right. There’s a fair amount of subtlety to all this and I wouldn’t bet much money on you (we) being correct, but I don’t see any obvious flaws in your discussion.

    One minor point made by others, probably not all the biases increase over time. For example, vegetation growth (corrected by occasional trimming) would probably be a negative bias. But I suspect most of the biases are positive. (Does the sum of all biases that change over time = UHI?)

    It occurs to me that if this were a physical system we were designing where we had a great deal of control over the system and the measurement procedures, we’d probably try to design all the time variant biases out. If we couldn’t do that, we’d likely have a calibration procedure we ran from time to time so we could estimate the biases and correct them out. What we (probably) wouldn’t do was ignore calibration and try to correct the biases out based on the measurements with no supporting data on the biases. But that seems to be what we’re forced to do with climate data. I’m not smart enough to know for sure that can’t be done, but I wouldn’t have the slightest idea how to go about doing it.

    Anyway, good luck with this. I’m sure you’ll let us know how it works out.

  33. Oh I see, so they check each station to see if it was a physical change…or weather
    …and if it’s weather, they don’t splice it

  34. “A temperature dataset is said to be “homogenized” when all effects other that temperature effects have been removed from the data.”

    Interesting. The thermometer only reads the temperature. So, how does one determine, site-by-site, which “other than temperature” effects should be removed?

    I’m in agreement with those who say that only a site that doesn’t have any other than temperature effects should be used.

    I suspect we could look at any improperly sited station and spend days discussing the possible adjustment(s) necessary to make the data for that station reasonably correct.

    I’m making a huge assumption: that the folks here ultimately would agree on the adjustments for that station. I’m of the opinion that we could not.

    So, If we can’t get one right…

  35. SandyInLimousin says:
    June 29, 2014 at 3:47 am
    ==============
    i routinely get systems designers telling me that the odds of a particular event are billions to one against, so we can safely ignore it in the design. then I show them with a simple data mining exercise that we routinely have many of these billion to one events. most people are very poor at estimating events when it is in their favor to estimate poorly.

  36. BEST must have the data, showing how much the offset was on each slice. This should average out to zero if the slice is bias free. It is my understanding that BEST has not made this information available, despite being asked. why not?

  37. Correcting slice bias is not difficult. Add up all the offsets from slicing. some will be positive and some negative. whatever the residual, add the inverse into the final result to remove any trend created by slicing. but without knowing the offsets due to slicing, there is no way to know if it introduced a trend or not.
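    The bookkeeping described in this comment can be sketched in a few lines. This is only an illustration under stated assumptions: the offsets below are made-up numbers, not Berkeley Earth data, and "offset" is taken to mean the value just after a slice minus the value just before it. The function name is hypothetical.

```python
def net_slice_offset(offsets):
    """Sum the signed offsets (after minus before) across all slices."""
    return sum(offsets)

# Hypothetical offsets in deg C at five breakpoints (illustrative values only).
example_offsets = [-0.4, -0.3, 0.1, -0.5, 0.2]

net = net_slice_offset(example_offsets)  # residual left after positives and negatives cancel
correction = -net                        # the "inverse" to add back into the final result
```

    If the slicing were bias free, the net value would sit close to zero. A large residual of one sign would suggest that the slices removed real information, which is why the comment asks for the per-slice offsets to be published.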

  38. The underlying problem is that most temperature records have discontinuities. There are station moves, and changing instruments, and routine maintainence, and the like. As a result, the raw data may not reflect the actual temperatures.

    Another underlying problem is the assumption that a discontinuity in the record is a problem in the record.

  39. Willis, maybe Zeke and Mosher were no longer interested in the discussion, but I did answer your question.

    That is why you should not only correct jumps known in metadata, but also perform statistical homogenization to remove the unknown jumps and gradual inhomogeneities. Other fields of science often use absolute homogenization methods (finance and biology), with which you can only remove jumps. In climatology relative homogenization methods are used that also remove trends if the local trend in one station does not fit to the trends in the region. Evan Jones may be able to tell you more and is seen here as a more reliable source and not moderated.

    P.S. To all the people that are shocked that the raw data is changed before computing a trend: that is called data processing. Not much science and engineering is done without.

    Relative homogenization methods, compute the difference between the station you are interested in and its neighbours. If the mean of this difference is not constant, there is something happening at one station, that does not happen at the other. Thus such changes are removed as well as possible in homogenization to be able to compute trends with more reliability (and the raw data is not changed and can also be downloaded from NOAA). In relative homogenization you compare the mean value of the difference before and after a potential date of change, if this difference is statistically significant a break is detected or in case of BEST the scalpel is set and the series are split. It does not matter whether the means are different due to a gradual change or due to a jump. What matters is that the difference in the means before and after is large enough.

    This part of the BEST method is no problem. With Zeke, BEST, NOAA and some European groups we are working on a validation study aimed especially at gradual inhomogeneities. This complements the existing studies where a combination of break inhomogeneities and gradual ones were used for validation.
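    The break test described above (compare the mean of the station-minus-neighbours difference before and after a candidate date) might look roughly like the sketch below. It is not the BEST or NOAA code. The Welch t statistic is one common choice that I am assuming here, the data are synthetic, and any threshold for "statistically significant" would be a further assumption.

```python
import math
import statistics

def difference_series(station, neighbours):
    """Station minus the mean of its neighbours at each time step."""
    return [s - statistics.mean(vals) for s, vals in zip(station, zip(*neighbours))]

def break_statistic(diff, k):
    """Welch t statistic for the mean of diff after index k versus before it."""
    before, after = diff[:k], diff[k:]
    se = math.sqrt(statistics.variance(before) / len(before)
                   + statistics.variance(after) / len(after))
    return (statistics.mean(after) - statistics.mean(before)) / se

# Synthetic example: a shared regional signal, two neighbours, and one
# station that picks up a 1.0 degree step at time index 10.
regional = [10 + 0.1 * math.sin(i) for i in range(20)]
neighbours = [[r + 0.05 * math.cos(i) for i, r in enumerate(regional)],
              [r - 0.05 * math.cos(i) for i, r in enumerate(regional)]]
noise = [0.02 * (-1) ** i for i in range(20)]
stepped = [r + e + (1.0 if i >= 10 else 0.0)
           for i, (r, e) in enumerate(zip(regional, noise))]
steady = [r + e for r, e in zip(regional, noise)]

t_stepped = break_statistic(difference_series(stepped, neighbours), 10)
t_steady = break_statistic(difference_series(steady, neighbours), 10)
```

    The stepped station yields a large statistic and the steady one a value near zero. As the comment says, the test only sees that the two means differ; it cannot tell a jump from a gradual drift.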

  40. ps: i’ve used the term offset as the difference between the end points each side of the slice. there may be a better term. Willis et al contend that the majority of these offsets will be in one direction, leading to bias over time. unless BEST corrects for the residual – the net sum of the positive and negative biases – the slice must introduce an artificial trend in the result. BEST should publish the offset data so it can be evaluated, to see if the slicing created an artificial trend.

  41. BEST by year, what is the net total of the difference between the endpoints for all the slices?

  42. ferdberple clarifies: “The method cannot catch gradual drift followed by step-wise correction. Thus it introduces bias into signals that have no bias.”

    And there’s a perfect mechanism for a warm bias in the simple fading of initially bright white paint of temperature stations. Paint was what ruined the Hubble telescope mirror for lack of double checking. They only used a single flatness detector that has a bit of paint chipped off it so they altered the massive mirror to match the paint chip. Later a Metric system conversion left out crashed a Mars lander. So here is Willis asking ahead of time, prior to launch of an American carbon tax, “hey, have you guys checked for this little error issue I ran into?”

    Dirt simple errors being missed in big technical projects often lead to disaster even for rocket scientists with an extreme interest in not being wrong, unlike the case for climatologists.

  43. pps: one would also need to consider the gridding when apportioning the residuals from slicing. even if they added up to zero in absolute terms, this could still introduce bias when gridded.

  44. Willis: “Is it possible to find signal that is smaller than the measurement errors. This is the big question of the temperature records.
    Mmmm … in theory, sure. Signal engineers do it every day. But in the temperature records? Who knows.”

    This statement needs a little clarification. Yes algorithms are in use that easily detect signals an order of magnitude or two below noise level. However (and this is a big however!), these algorithms are searching for signals with known characteristics. Typically these will be sine waves with specific modulation characteristics. The less that is known about a signal, the better the signal to noise ratio must be and the longer the detection period must be for detection.

    The point is that searching for a long term trend in our USHCN records is not analogous to decoding data transmissions from a distant Voyager satellite. From a signal analysis perspective, reliably detecting that trend would require a significantly positive signal to noise ratio. (trend greater than noise over the observation period)

    An example of the problem of finding a trend signal in noise was presented to us by a popular hockey stick graph developed from tree ring analysis. The analysis algorithm emphasized records that matched a modern thermometer record. Since the thermometer record had a positive trend, the algorithm dug into the noisy tree ring data and pulled out a positive trend result. Of course, the algorithm was also able to find false positive trends in pure random noise most of the time too. It found what it was designed to find.

  45. Willis I like the scalpel because it creates a break where there is a shift in data bias. That is a good way to start.

    Suppose the change was trimming a tree, painting the enclosure or cutting down a nearby hedge. The influence of this class of change is gradual – we can assume it grew to be a problem gradually so the effect is approximately linear.

    The bias (up or down) is the difference between the last and first data of the two sets but there would have to be some brave corrections because the week following the change is not the same as the one before it.

    Suppose the new data were consistently 1 degree cooler than the old. Doesn’t work. We have to take a whole year. But a year can be a degree colder than the last. Big problem.

    If there really was a 1 degree bias, we have to assume the early part of a data set is ‘right’. The step change has to be applied to the old data to ‘fix’ it assuming the change is linear.

    Step change = s
    Number of observations = n
    N is the position in the series
    D1 = data value 1
    D1c = Corrected value 1

    Linear Correction:
    D1c = D1+(N1-1)*s/n
    D2c = D2+(N2-1)*s/n
    D3c = D3+(N3-1)*s/n
    Etc

    That works for steps up or down.

    The data set length can be anything. All it needs is the value and sign of the step.

    Other types of episodic change like paving nearby sidewalk need offsets, not corrections for drift because that is a different class of change.
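    The linear correction in this comment translates directly into code. The sketch below follows the formula exactly as written, Dc = D + (N-1)*s/n, with N counted from 1. The function name and the sample values are mine, for illustration only.

```python
def linear_drift_correction(data, step):
    """Apply Dc = D + (N - 1) * s / n to every value in one segment.

    data: readings in the segment that ends at the step change
    step: signed size of the step s (new level minus old level)
    """
    n = len(data)
    return [d + (pos - 1) * step / n for pos, d in enumerate(data, start=1)]

# Four readings that should have been flat, followed by a -1.0 degree step.
corrected = linear_drift_correction([10.0, 10.0, 10.0, 10.0], -1.0)
```

    The first value is left untouched, matching the assumption that the early part of the segment is "right". Note that with this formula the last value receives (n-1)/n of the step rather than the whole step.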

  46. A couple of points.

    If we assume that the discontinuities represent corrections back to the initial conditions, then the climate signal is best measured immediately after the discontinuities by connecting those as points and ignoring what’s in between. This is based on the idea that the discontinuity is due to a major human intervention that recalibrated the station, and that afterwards bias creeps in. Going all the way with the saw-tooth analogy, you have a region with absolutely no trend, on which sit a bunch of saws, teeth up, and the teeth having wildly varying sizes (different periods between recalibrations), but the same thickness between the back of the saw blade and the inner notch of a tooth. The best measurement of the underlying surface is the bottom of the notches, not the teeth themselves.

    Also, you could assume that the bias signal accumulating in all the stations is half the maximum tooth height, then subtract that from all the readings, and simply average all the stations. Since the adjustments are asynchronous, this would probably give you a pretty good overall picture of the climate. This implies that the raw data is probably better than the adjusted data for seeing the major trends, since removing similar snippets from a periodic waveform will create a major false trend. It also agrees with the simple logic that thousands of station monitors were probably not all horribly misreading their thermometers, and that this misreading grows vastly worse in the past. Their readings were within X degrees now, and were within X degrees in the past.

    I’m also confident that it’s a whole lot easier to warm a station than to cool it, if we disallow large trees growing over a site. I think Anthony’s surface station project could confirm this by simply counting how many stations should be rejected because they should be reading too hot instead of too cold.

  47. Quoting Willis “quotes” from the article:

    Since the Berkeley Earth “scalpel” method would slice these into separate records at the time of the discontinuities caused by the maintenance, it throws away the trend correction information obtained at the time when the episodic maintenance removes the instrumental drift from the record.

    As a result, the scalpel method “bakes in” the gradual drift that occurs in between the corrections.

    So I’d like to know if and how the “scalpel” method avoids this problem … because I sure can’t think of a way to avoid it.

    So I’m here to ask it again …
    ———–

    Willis, I will offer my solution for your stated problem but you will have to determine if it is applicable and if it can be implemented or not.

    First of all, a discontinuity “flag” character would have to be chosen/selected that would be appended to the “first” temperature reading that was recorded after said maintenance was performed at/on the Surface Station. Said “flag” would thus denote a “maintenance break-point” in the daily temperature data. …. And for “talking” purposes I will choose the alpha character “m” with both the capital “M” and small “m” having significance.

    Next, a new “maintenance” program would have to be written that would “scan” the temperature data file for each Surface Station looking for any capital “M” discontinuity “flags” and if found, it would calculate the “trend” variance between said “M” flagged temperature value and the one previous to it. And via that actual “trend” value said maintenance program would algebraically “add” it to said previous temperature …. and then via an algorithm sequential “decrease” in/of said “trend” value would add said newly calculated “trend” values to all sequentially previous temperature data until it detects a small “m” discontinuity “flag” or a “default” date. The program would then change the originally noted discontinuity “flag” from a capital “M” to a small ”m” thus signifying that “trend” corrections had been applied to all temperature data inclusive between the small “m” discontinuity “flags”.

    Of course, the above requires use of the “raw” and/or “trend” corrected data each time said new “maintenance” program is executed.

    If that was not what you were “asking for” ….. then my bad, …. I assumed wrong.

  48. @Bill Illis at 5:23 am
    This is how bias can be detected. Simple histograms. And then we can determine whether the excess of negative adjustments in 1944 is valid for example

    In addition to simple histograms, I think we need to see scatter plots of
    Y: BreakPoint Offsets vs. X: Length of segment prior to the breakpoint.
    For different 5-year semi-decades.

    I can envision it possible for the simple histogram to show no bias in breakpoints, but only because the offsets after short segments counteract offsets of opposite sign after longer segments. A scatter plot of Offset vs prior segment length could show an interesting trend and how it changes for different periods of the total record.

    Better yet, why don’t we just have access to a table of:
    StationID, BreakPoint date, Segment Length Prior to break, Trend value before Break, Trend value after break.

  49. Stephen Richards says:
    June 29, 2014 at 2:50 am

    I cannot remember the number of times I have written on blogs that ” IT IS TOTALLY SCIENTIFICALLY UNACCEPTABLE TO ALTER PAST DATA” unless you have good, scientific and mathematical analysis to allow you to do so without any doubt or favour.

    STOP IT !!

    Thanks, Stephen. And I imagine that just as many times as you have written that, someone has responded along the lines that almost all data contains a variety of errors.

    For example, there are a number of temperature stations that have occasionally erroneously reported their data in Fahrenheit rather than Celsius. According to you, we should never alter the data, we should just use the incorrect data “as is”. So we should use the Fahrenheit figures, rather than daring to ALTER PAST DATA by converting them to the correct Celsius figures …

    Do you see how crazy you sound with your absolute dicta? Science is rarely that black and white.

    Here’s another example. Say that we have a change in the time of observation. Suppose that for years we’ve been taking afternoon temperatures at 3 PM at all of our temperature stations, and then we start taking them at 2 PM.

    Of course the 2PM figures are warmer than the 3PM figures, so when you consider the raw data, it looks like we have massive global warming. Now if someone goes around saying “I have raw unaltered historical data which proves that there is global warming”, what will you say in response?

    Me, I’ll say “No, there’s no global warming. You just haven’t accounted for the change in observation times”.

    But in that situation, if you say that we must accept the observational data exactly as it was recorded because ”IT IS TOTALLY SCIENTIFICALLY UNACCEPTABLE TO ALTER PAST DATA”, then you’ve just put your full weight behind a highly misleading (although totally accurate and unaltered) temperature dataset.

    When you realize that there are errors in historical data, there are two basic choices—throw out the data, or correct it for the bias caused by the change in the time of observation.

    And while you can make valid arguments for one or the other, correcting the known errors in a dataset is a valid scientific choice, one made by reputable scientists in a host of fields.

    So I fear that you’ll be a voice crying in the wilderness forever if you think we should keep Fahrenheit readings in place of Celsius readings or that we should not correct for the known bias caused by the change in times of observation …

    Best regards,

    w.

  50. Bill Illis says:
    June 29, 2014 at 5:23 am

    The breakpoints should be equally distributed between positive and negative adjustments of roughly the same magnitude.

    Thanks, Bill, but why? Suppose all of the temperature datasets were perfect except for a change in time of observation in 1968. Almost all of the adjustments will be in the same direction, not randomly distributed plus and minus. As a result, we can’t test with a histogram as you propose …

    w.

  51. I never said it was practical….
    Two parallel measurement stations required. A change at one of them could be classified as restorative (such as painting the outside), in which case the new measurement is taken as more accurate. The second station is needed to control for a coincidental real discontinuity. Then the restored unit can be used to correct for the degradation trend (there could be multiple degradation trends, actually). Replacing an aged thermometer with a new one could be classified similarly. However, changing measurement technology might not fall into a “restorative” category, in which case, the second station can be used to help make adjustments to the modified station’s output if the reported numbers are intended to create a continuous record. However, it is imperative the original raw data be preserved. A second station could also help prevent data loss when new tech fails. Recording metadata is vital if you ever hope to make meaningful corrections.

    This was a great post, Willis.

  52. Victor Venema says:
    June 29, 2014 at 7:09 am

    Willis, maybe Zeke and Mosher were no longer interested in the discussion, but I did answer your question.

    That is why you should not only correct jumps known in metadata, but also perform statistical homogenization to remove the unknown jumps and gradual inhomogeneities. Other fields of science often use absolute homogenization methods (finance and biology), with which you can only remove jumps. In climatology relative homogenization methods are used that also remove trends if the local trend in one station does not fit to the trends in the region. Evan Jones may be able to tell you more and is seen here as a more reliable source and not moderated.

    P.S. To all the people that are shocked that the raw data is changed before computing a trend: that is called data processing. Not much science and engineering is done without.

    Thanks for your reply, Victor, but that doesn’t solve the conundrum. Let me present it again:

    If we have a bunch of trendless sawtooth waves of varying frequencies, and we chop them at their respective discontinuities, average their first differences, and cumulatively sum the averages, we will get a strong positive trend despite the fact that there is absolutely no trend in the sawtooth waves themselves.

    Your method, using “statistical homogenization to remove the unknown jumps and gradual inhomogeneities”, will not fix the bogus trend created out of thin air by the scalpel method.

    My question was, is it even possible to fix that spurious trend created by the scalpel method, and if so, how are the Berkeley Earth folks doing it?

    Much appreciated,

    w.
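    The sawtooth thought experiment in the reply above is easy to try numerically. The sketch below is a toy under my own assumptions (three stations, arbitrary periods of 5, 7 and 11 steps, unit amplitude, function names invented here). It is not the Berkeley Earth algorithm, which also uses neighbouring stations and weighting.

```python
def sawtooth(length, period, amplitude=1.0):
    """Trendless sawtooth: drifts up from 0, then resets every `period` steps."""
    return [amplitude * (t % period) / period for t in range(length)]

def scalpel_reconstruction(periods, length, amplitude=1.0):
    """Cut each wave at its resets, average first differences, cumulatively sum."""
    waves = [sawtooth(length, p, amplitude) for p in periods]
    recon = [0.0]
    for t in range(1, length):
        # The scalpel drops the difference that spans a reset (t % p == 0).
        diffs = [w[t] - w[t - 1] for w, p in zip(waves, periods) if t % p != 0]
        mean_diff = sum(diffs) / len(diffs) if diffs else 0.0
        recon.append(recon[-1] + mean_diff)
    return waves, recon

waves, recon = scalpel_reconstruction(periods=[5, 7, 11], length=120)
```

    Every input wave stays between 0 and the amplitude, yet the reconstruction only ever goes up, because the one downward step in each cycle is exactly what the cut discards. Whether the neighbour comparison and kriging steps in the real method undo this is the open question posed in the thread.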

  53. ferdberple says:
    June 29, 2014 at 7:14 am

    ps: i’ve used the term offset as the difference between the end points each side of the slice. there may be a better term. Willis et al contend that the majority of these offsets will be in one direction, leading to bias over time. unless BEST corrects for the residual – the net sum of the positive and negative biases – the slice must introduce an artificial trend in the result. BEST should publish the offset data so it can be evaluated, to see if the slicing created an artificial trend.

    ferd, thanks for your thoughts. A small correction. To me, the problem is not just that the majority of jumps will be in one direction, leading to an overall trend.

    The problem is that even if by chance the jumps are randomly distributed and there is no change in the overall trend, it plays havoc with the individual station trends.

    As an explanatory example, suppose we have the following equations

    2 + 2 = 5
    2 + 2 = 3

    Both of them are obviously wrong … so average the two equations (just like we average our station trends) and we get

    2 + 2 = 4

    … I’m sure you can see the problem. Getting the correct overall result does NOT mean that the underlying “corrected” data is now valid …

    All the best,

    w.

  54. Relative homogenization methods, compute the difference between the station you are interested in and its neighbours. If the mean of this difference is not constant, there is something happening at one station, that does not happen at the other. Thus such changes are removed as well as possible in homogenization to be able to compute trends with more reliability (and the raw data is not changed and can also be downloaded from NOAA).

    Hullo, Doc. V.

    Problem is that the “something happening” appears to be Good Siting. And the result of that is that most of the 20% of well sited stations are identified as outliers and are adjusted to conform with the readings of the 80% of poorly sited stations.

    And since microsite bias is continual, without breakpoint, the problem is not identified by BEST.

    The result is that the song of my precious Class 1\2 stations is silenced. Silenced as if it had never been sung. My beautiful song. My “true signal”. Gone with the Wind. Blown away.

    And unless a deconstruction of the adjusted data is applied, there is not the slightest trace that their song was ever sung in the first place.

  55. The problem is that even if by chance the jumps are randomly distributed and there is no change in the overall trend, it plays havoc with the individual station trends
    ==============
    agreed. my thoughts were specific to calculating the overall trend. as soon as you start infilling, the errors in the individual stations will blow my approach out of the water. a similar argument could also be applied to anomalies. since the individual stations have false trends, their anomalies will also be wrong, further aggravating the errors.

    since this problem very much resembles pair-wise correction, the same problems are likely to persist. the underlying issue is that the method is sensitive to the rate of degradation of the signal, which leads to bias when degradation occurs slowly.

  56. Willis
    You say
    So we should use the Fahrenheit figures, rather than daring to ALTER PAST DATA by converting them to the correct Celsius figures

    I don’t think anyone would suggest that, especially as most of the world uses Celsius and the USA is an exception, even the UK uses Celsius with Fahrenheit as a bracketed value for older readers (known colloquially as Old Money). For global values one or other has to be converted. Adding a fudge factor to the changed value would be unacceptable as would not noting why the change was made. Changing historical datasets in the way Paul Homewood and at one time only Steven Goddard describe is not acceptable, that is adding a fudge factor to data someone else is going to use and will confirm your results and not your theory.

    I’m quite happy for the original unedited data to be all that a government agency publishes, corrections and fudge factors can be added by any researcher. The Data keepers should be just that, not the arbiters of what the data recorder meant when he wrote the figures on the piece of paper, or the automated station meant when it sent the electrons down the wire.

  57. And unless a deconstruction of the adjusted data is applied, there is not the slightest trace that their song was ever sung in the first place.
    =========
    agreed. I didn’t consider that infilling and anomalies would mask the false trends in individual stations, making post slice correction effectively impossible. the sum of the absolute value of the offsets is a measure of the maximum error in BEST, with the sum of the offsets a measure of the minimum error. so, publication of the detailed offsets due to slicing would appear to be the next step in validation of BEST methodology. residuals are not sufficient.

  58. the really interesting thing about the slice method is that when it was proposed it seemed like a good idea. it is only now, much later in the day, that folks are realizing that it is sensitive to a certain class of errors.

    it seems likely that pair-wise correction was the same. the researchers were trying to correct a specific problem, and the method worked for the cases studied. it was only much later, when the effects started to diverge from reality, that there was any indication there were problems.

    like a computer system that randomly changes 0`s to 1`s and 1`s to 0`s. If it happens quickly enough you can find the error. but if the problem works slowly enough, over time your system will die and there is nothing you can do to prevent it. it is almost impossible to detect slow moving errors.

  59. Hi Willis,

    Sorry for not getting back to you earlier; just landed back in SF after a flight from NYC.

    The performance of homogenization methods in the presence of saw-tooth inhomogeneities is certainly something that could be tested better using synthetic data. However, as Victor mentioned, relative homogenization methods look at the time-evolution of differences from surrounding stations. If the gradual part of the sawtooth was being ignored, the station in question would diverge further and further away from its neighbors over time and trigger a breakpoint.

    There are a number of examples of apparent sawtooth patterns relative to surrounding stations in the Berkeley data that seem to be correctly adjusted; I haven’t found an example of poor adjustment that creates a biased trend relative to surrounding stations, but I’d encourage folks to look for them.

    Here are a few examples of sawtooth and gradual trend inhomogeneities that seem to be correctly adjusted:

    http://berkeleyearth.lbl.gov/stations/169993

    http://berkeleyearth.lbl.gov/stations/30748

    http://berkeleyearth.lbl.gov/stations/156164

    http://berkeleyearth.lbl.gov/stations/161705

    http://berkeleyearth.lbl.gov/stations/33493

    http://berkeleyearth.lbl.gov/stations/34034

    It’s also worth mentioning that Berkeley has a second type of homogenization that would catch spuriously inflated trends, at least if they were isolated. The kriging process downweights stations with divergent trends vis-a-vis surrounding stations when creating the regional temperature field, after all stations have been homogenized.

  60. Willis: “Your method, using “statistical homogenization to remove the unknown jumps and gradual inhomogeneities”, will not fix the bogus trend created out of thin air by the scalpel method.”

    I do not have a method, but only validated the methods of others up to now. The normal way of homogenization is not the scalpel method. In the normal way, the neighbours are also used to compute the corrections. This makes the long-term trend of the station with the gradual inhomogeneity similar to the one of the neighbours. I do not expect standard methods to have more problems with gradual inhomogeneities than with jump inhomogeneities.

    Willis: “My question was, is it even possible to fix that spurious trend created by the scalpel method, and if so, how are the Berkeley Earth folks doing it?”

    If I understand their article right, BEST reduces the weight of data with gradual inhomogeneities. I would personally prefer to remove it, but they prefer to be able to say that they used all the data and did not remove anything. If the weight is small enough, that would be similar to removing the data. That is the part of the algorithm I would study, not the scalpel mentioned in the title of this blog post.

    Hello pruf Evan Jones, the quality of station placement is another problem as the one mentioned in this post. I do not want to redo our previous discussion, which would be off topic here. Do you have any new arguments since our last long, civil and interesting discussion?

  61. Also, Willis, Berkeley really doesn’t optimize for getting the corrected underlying data as accurate as possible; rather, it focuses more on generating an accurate regional-level field. It will remove urban heating, in Las Vegas for example, even though that’s a “real” local temperature effect. It also produces temperature fields that may be a bit too smooth, though it’s difficult to test given the absence of good ground-truth high-resolution spatially complete data.

  62. How many knobs are there?

    Show us the effect on the result of turning each knob from zero to ten.

    The BEST slice/dice knob is obviously already turned to eleven by Mophead Mosher:

    Does the clear overzealousness of the chopping affect the trend? YES OR NO? We don’t know. So we don’t trust your black box. Where is the online version that lets us play with the settings? This algorithm matches the other series out there only too well early on but then becomes a climate model matching outlier in the last decade. Why? Where is the discussion of this in the peer reviewed literature? I’ve compared it here to HadCRUT3, the Climategate University version put out before Phil Jones joined a Saudi Arabian university that he used as his affiliation in his HadCRUT4 up-adjusted version:

    http://woodfortrees.org/plot/best/mean:30/plot/hadcrut3vgl/mean:30

    In a field with a bladeless hockey stick making it into top journal Nature you really do have to show your work instead of just releasing software few know how to run, since no, we no longer trust you. You would think you would jump at the chance to convince us further. Maybe start with finding that blade in the Marcott input data, or if you can’t find it, work on getting that paper retracted and all of the “scientists” involved fired if not arrested. That is how it is done in normal non-activist science such as medical research:

    “A former Iowa State University scientist who admitted faking lab results used to obtain millions of dollars in grant money for AIDS research has been charged with four felony counts of making false statements, an indictment filed in federal court shows.”

    I’m even amazed Berkeley didn’t sue you guys for using their name like Harvard sued a company called Veritas. Legally, it’s very easy and in fact guaranteed that the public will associate the BEST plot with Berkeley University though at least you didn’t also swipe their logo. Your results help the Obama administration feel justified in stereotyping us skeptics as Moon landing denying members of the Flat Earth Society. So we are asking you for clarification, loudly. Are the other climate model falsifying temperature products wrong or are you wrong? Are climate models and thus climate alarm now falsified or not?

  63. Zeke Hausfather says:
    June 29, 2014 at 12:01 pm

    Hi Willis,

    Sorry for not getting back to you earlier; just landed back in SF after a flight from NYC.

    Thanks, Zeke, no worries about the timing. I’m well aware people have time constraints.

    … There are a number of examples of apparent sawtooth patterns relative to surrounding stations in the Berkeley data that seem to be correctly adjusted; I haven’t found an example of poor adjustment that creates a biased trend relative to surrounding stations, but I’d encourage folks to look for them.

    Here are a few examples of sawtooth and gradual trend inhomogeneities that seem to be correctly adjusted:

    http://berkeleyearth.lbl.gov/stations/169993

    I fear I’m not following that one. It shows the record for Savannah, GA, with three station moves and no less than eight “empirical breaks”. These are identified by some computer algorithm whose exact details are unimportant to this discussion.

    What is important is your claim that identifying these eight! “empirical breaks” and using the scalpel on them means they are “correctly adjusted” … what is the evidence for that?

    Next, you say that

    I haven’t found an example of poor adjustment that creates a biased trend relative to surrounding stations

    Unfortunately, you’ve fallen into the common trap of assuming that GOOD CORRELATION OF DATASETS MEANS GOOD CORRELATION OF TRENDS. I’ve demonstrated this in the past using both pseudodata and actual data. Here’s the pseudodata:

    Note that the trends vary from the floor to the ceiling … why is this important? Because in all cases the correlation between all individual pairs of pseudodata is above 90%.

    Now, because they are so highly correlated, your whiz-bang algorithm would “adjust” them so the trends are all quite similar …
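    A minimal sketch of the correlation-versus-trend point, using made-up pseudodata rather than the series from the original figure. The seasonal amplitude, the record length and the five slopes are all assumptions chosen for illustration. Every series shares one seasonal cycle, so each pair correlates well above 0.9, yet the fitted trends run from cooling to warming.

```python
import numpy as np

# Hypothetical pseudodata: 20 years of monthly values sharing one seasonal cycle.
years = np.arange(240) / 12.0
seasonal = 10.0 * np.cos(2.0 * np.pi * years)

# Assumed trends in deg C per year, chosen only for illustration.
true_slopes = np.array([-0.2, -0.1, 0.0, 0.1, 0.2])
series = seasonal + true_slopes[:, None] * (years - years.mean())

# Pairwise correlations between all five series.
corr = np.corrcoef(series)
min_corr = corr.min()

# Ordinary least-squares trend of each series (slope is the first coefficient).
fitted_slopes = np.array([np.polyfit(years, row, 1)[0] for row in series])

print(f"lowest pairwise correlation: {min_corr:.3f}")
print("fitted trends (deg C/decade):", np.round(10 * fitted_slopes, 2))
```

    High correlation here comes almost entirely from the shared seasonal cycle, which says nothing about whether the long-term slopes agree.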

    Nor is this just a theoretical problem. Here are the trends from a group of stations within 500 miles of Anchorage, all of which have a correlation over 0.5 with Anchorage. Despite that, their trends vary by a factor of three.

    So I fear that the fact that you can’t find any “biased trends relative to surrounding stations” is not evidence that you’ve done it right as you claim. Quite the opposite: as the Alaska example shows, if you adjust those so none of the trends are “biased relative to surrounding stations”, that’s evidence you’ve done it wrong.

    w.

  64. Zeke:

    “I haven’t found an example of poor adjustment that creates a biased trend relative to surrounding stations, but I’d encourage folks to look for them.”

    OK, here’s one: Auckland in New Zealand.
    BEST shows 0.99±0.25°C/century from 1910 for Auckland. The correct value (manually adjusting for known UHI and shelter) is closer to 0.5±0.3°C/century.

    “Berkeley really doesn’t optimize for getting the corrected underlying data as accurate as possible; rather, it focuses more on generating an accurate regional-level field.”

    I can’t see how one can get a regionally correct value without first obtaining accurate underlying data, when the regional values are based on the underlying data. For example, I have no doubt that the incorrect Auckland series was used by BEST to adjust other NZ sites, thereby introducing an error.

  65. Apologies, I didn’t close the link properly. Also a typo, the correct Auckland trend is 0.5±0.3°C/century.

    [Fixed. -w.]

  66. This example that our host pulled up yesterday would be worth looking at:

    Luling TX is also the one Paul Holmwood picked out for other reasons.

    Apparently this is a good site with stable MMTS since 1995, yet it seems several discontinuities are picked up in relation to its regional average.

    Does that indicate that there is a notable bias in the regional mean?

  67. ” Bob Dedekind: I can’t see how one can get a regionally correct value without first obtaining accurate underlying data,”

    Yes, this is what I’m questioning above. It seems like the method will just drag everything to the lowest common denominator.

    There is an implicit assumption that regional average is somehow more accurate than any individual station. In a network with 80% sub-standard stations, I really don’t see that as justified.

    This is what our host referred to as warm soup.

    Homogenisation really means just that, putting it through the blender. This just ensures uniformly poor quality.

  68. This makes the long-term trend of the station with the gradual inhomogeneity similar to the one of the neighbours.

    And that, in a nutshell, doc, is my beef.

  69. Greg Goodman says: June 29, 2014 at 2:35 pm
    “There is an implicit assumption that regional average is somehow more accurate than any individual station.”

    No, there isn’t. Regional averages are only used if the station data is doubtful or missing. Luling was a classic case where the algorithm correctly doubted.

    Infilling with interpolated values before integrating is a neutral choice. It does not improve or degrade. Consider trapezoidal integration. If you add linearly interpolated points, it makes no difference at all. The scheme assumes all unknown points are linear interpolates.

    If it’s neutral, why do it? In USHCN, it’s just so you can keep a consistent set of climatologies in the mix, so their comings and goings don’t produce something spurious.
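    The trapezoidal claim can be checked with a small sketch. The readings below are invented, and this only illustrates the one-dimensional case described above, not how USHCN or BEST actually average stations.

```python
import numpy as np


def trapezoid(x, y):
    """Plain trapezoidal rule, written out so it does not depend on a NumPy version."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(np.diff(x) * (y[:-1] + y[1:]) / 2.0))


# Hypothetical readings with a gap between x=2 and x=5.
x_known = np.array([0.0, 1.0, 2.0, 5.0, 6.0])
y_known = np.array([10.0, 12.0, 11.0, 14.0, 13.0])

# Infill the gap with linear interpolates, then integrate again.
x_full = np.arange(0.0, 7.0)
y_full = np.interp(x_full, x_known, y_known)

area_sparse = trapezoid(x_known, y_known)
area_infilled = trapezoid(x_full, y_full)
print(area_sparse, area_infilled)
```

    Both integrals agree because the trapezoid rule already treats the gap as a straight line between its two end points.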

  70. Willis,

    Neighbor (or regional climatology) difference series don’t use correlations for anything. Rather, they use the difference in temperature over time between the station in question and its neighbors. I realize that correlation often provides very little information about the trend, which is why it’s not a great indicator of potential bias.

    The Savannah example shows some sawtooth patterns in the neighbor difference series, but they are homogenized in such a way that both the gradual trend and the sharp correction are removed.
    .
    Bob Dedekind,

    That’s not a station record, that’s a regional record. What specific stations in the Auckland area show sawtooth-type patterns being incorrectly adjusted to inflate the warming trend? Here is a list of Auckland-area stations: http://berkeleyearth.lbl.gov/station-list/location/36.17S-175.03E

  71. Hello pruf Evan Jones, the quality of station placement is another problem as the one mentioned in this post. I do not want to redo our previous discussion, which would be off topic here. Do you have any new arguments since our last long, civil and interesting discussion?

    Hey, Doc. V. BTW, I am no professor, sorry to say. Yes, siting is not the issue. Suturing zigzags into slopes is. I would agree that homogenization would not destroy such data, but suturing might well do. (That still doesn’t do much to reduce my — dare I say hatred? — of homogenization. But the H-monster is not the culprit here, I must concede.)

    Willis has hit the nail on the head, and the zigzag paint issue of the CRS units is a prime example of how such a fallacy might manifest.

    I’ll repeat to the others, Dr. Venema has treated me with great courtesy and professionalism. In our discussions since 2012 of the surface stations paper, he has begged to disagree, but has always argued to the point. There is information that we both are at an advantage to acquire: He is interested in whether the paper is for real. I, on the other hand, am interested in what form the criticism will take, especially after having dealt with the TOBS, moves, and MMTS-conversion issues.

    I think we both got what we came for.

    As for him, I do not think he will be too quick to adduce the point that, after all, adjusted data for both well and poorly sited stations are the same. And as for me, he has made me think more deeply beyond the stats to the mechanism in play. And I won’t be saying that TOBS doesn’t really matter that much.

  72. evanmjones says: “And he has made me think more deeply beyond the stats to the mechanism in play.”

    That is good. That seems to be the main thing that would make the paper a lot more convincing. I have been thinking about this since our conversation, but I am unable to think of a mechanism that could explain the statistical results you found. Especially something that would cause artificial trends due to micrositing in the 1990s, but not since the US climate reference network was installed in 2004. Puzzling to me.

    Could you simply call me Victor Venema? I can’t help it that I had to get a PhD to be allowed to do research. That is the way the system works.

    REPLY: You don’t need a PhD to be able to do research and publish papers, I’ve done three now, working on #4, and as many people like to point out, including yourself, I don’t have a PhD and according to many, I am too stupid to be in the same ranks with you. Yet, I do research and publish anyway. If the school of science didn’t have a foreign language requirement, and I didn’t have horrible unsolvable hearing problem, and the Dean of the School of Science wasn’t a prick at the time, and the ADA had been in place, that might have been different. TV/radio where I only had to speak was my salvation, I had a one-time chance and I took it. But, I know in the eyes of many in your position that career path makes me some sort of lowbrow victim of phrenology.

    The explanation to the problem you pose is based in the physics of heat sinks, but you’ll just have to wait for the paper. Though, ahead of time to help readers understand, I may post an article and/or experiment to show how the issue manifests itself.

    Bear in mind I don’t wish to start a dialog with you at the moment, mainly because you called your view of my religion into question without actually knowing what my view is, and I find that shameful and just as bad as the things you accuse me of. I’m only pointing out that PhD holders are not exclusive to research. No need to reply.

    Also, while I can’t prevent it, even though I hold copyright on my own words, I ask that you not turn my comment into another taunt at your blog. It would be a good gesture if in fact you believe what you write about what I should be doing. – Anthony

  73. @Zeke Hausfather at 12:01 pm
    It’s also worth mentioning that Berkeley has a second type of homogenization that would catch spuriously inflated trends, at least if they were isolated. The kriging process downweights stations with divergent trends vis-à-vis surrounding stations when creating the regional temperature field, after all stations have been homogenized.

    Posit:
    Well sited, Class 1 and 2 stations are the minority.
    There are studies that suggest that Class 1/2 stations have lower trends than others.

    The BEST “second type of homogenization” would either:
    A) catch spuriously deflated trends and downweight them too, in which case the homogenization will tend to downweight well-sited stations, a hypothesis consistent with findings in the Watts et al 2012 draft paper; or
    B) treat inflated trends differently than deflated trends.

    Either A or B appears to be problematic.
    The problem is that we must be upweighting Class 1 and Class 2 stations compared to Class 3, 4, 5. There is a case to be made that Class 3, 4, 5 stations should be downweighted to disappear.

    Speaking of downweighting…. Do you (upweight,downweight) stations based upon the length of segments? Longer segments deserving greater weight, of course.

  74. I’m sorry Will E but your claim that a change from F to C is a bullsh*t strawman and you know it … … 100 meters = x feet … reporting in feet or meters does not change the measurement … its called a correction, not an edit …

  75. Zeke:

    “What specific stations in the Auckland area show sawtooth-type patterns being incorrectly adjusted to inflate the warming trend? Here is a list of Auckland-area stations: http://berkeleyearth.lbl.gov/station-list/location/36.17S-175.03E ”

    It’s difficult to work out what’s happening with your data, since it doesn’t make much sense.
    For example, if you look at Albert Park in Auckland, the data runs to the present (I presume – the X-axis isn’t graduated particularly well) yet the station closed in 1989.
    Where did you get your raw data from? If Albert Park has had another station spliced to its end, which station is that? How was it spliced?

  76. Zeke:

    “You can see alternative names on the right side of the station page: http://berkeleyearth.lbl.gov/stations/157062 ”

    I’m sorry but I don’t get that. The list is:
    ALBERT PARK
    AUCKLAND
    AUCKLAND AERODROME
    AUCKLAND AIRP
    AUCKLAND AIRPORT
    AUCKLAND CITY
    AUCKLAND, ALBERT PAR
    As far as I can tell from this, there are two sites, Albert Park and Auckland Airport, but it certainly isn’t clear, because the chart has three red diamonds (Station moves) shown. How do you know when the station move happened? Do you look at metadata? Is the “station move” the same as a splice point?
    The elevation is given as 27m. Albert Park is 49m, Auckland Airport is 5m or less. Perhaps it’s the average?

    “Stations tend to get merged if they have overlapping identical temperature measurements under different names.”

    It is extremely unlikely that Albert Park and Auckland Airport had identical temperatures, simply because there is a well-documented 0.66°C difference between the two. Unless you’re talking about correlations, or anomalies.

  77. Well, you have to hide the decline somehow, Willis.

    That’s what climate science is all about, isn’t it?

    What a SNAFU!

    It will be years before the scientific profession recovers from the efforts of these shysters.

  78. Zeke:

    “There is a sawtooth-type signal in the difference series for Wellington – Kelburn around 1970-1980. You can see how both the gradual trend bias and the abrupt reversion to the mean are caught: http://berkeleyearth.lbl.gov/stations/18625 ”

    Well, sort of. I’m battling to see the gradual trend reduction there, but that may be because only the breakpoint graph is shown. Is there a gradual change graph somewhere as well?
    The problem at Kelburn is the shelter in the surrounding Botanical Gardens, which grew over the decades. The shelter clearances affected only trees close to the site, not the wider area. Hessell identified this in 1980.
    The last close shelter clearance was 1969, apparently, so I’m not sure what caused the 1970-1980 excursion. A building was put up close by in 1968, and the maximum temperature thermometer was replaced in 1969.
    If the BEST process reduces the trends, then this is a step in the right direction. I see no such adjustment in the NCDC approach.
    However, reducing trends is tricky. Do you check against raw data from other stations regionally, or adjusted data?

  79. Bob Dedekind,

    I believe that the difference series in question are calculated prior to adjustments by comparing each station to the raw station records of surrounding stations.

    Also, NCDC’s method should be able to pick out similar sawtooth patterns; see the M4 model in Menne and Williams 2009: ftp://ftp.ncdc.noaa.gov/pub/data/ushcn/papers/menne-williams2009.pdf

    As I mentioned earlier, this could all be tested better using synthetic data, something that is planned as part of the new International Surface Temperature Initiative: http://www.geosci-instrum-method-data-syst-discuss.net/4/235/2014/gid-4-235-2014.html

  80. Zeke:

    “Also, NCDC’s method should be able to pick out similar sawtooth patterns; see the M4 model in Menne and Williams 2009: ftp://ftp.ncdc.noaa.gov/pub/data/ushcn/papers/menne-williams2009.pdf ”

    You’re right, it should be able to pick out these patterns, but doesn’t.
    I have looked carefully through all the NCDC New Zealand stations adjustments. Not one shows any gradual trend reduction adjustment at all. If you can find one please point it out.

    Auckland is an excellent test case. It’s a long-running site (since 1853) that contains well-documented gradual UHI/shelter problems (quantified by both NIWA and the NZCSC). It also has a splice to Mangere with a 0.6°C difference – in other words a perfect Hansen-type situation.

    If an algorithm gets Auckland right, it will most likely work everywhere, at least for saw-tooth analysis. But getting Auckland wrong proves the algorithm needs work.

  81. @Zeke Hausfather at 12:01 pm
    Here are a few examples of sawtooth and gradual trend inhomogeneities that seem to be correctly adjusted:
    Like Willis, I ask: what is the evidence that ANY of the breaks is a correct adjustment, much less ALL of them?

    Notes:
    #Moves, #Other Breaks, (3 Longest Segments since 1960 inclusive), Difference from Regional

    http://berkeleyearth.lbl.gov/stations/169993

    SAVANNAH/MUNICIPAL, GA.
    2 moves, 8 other breaks, (18, 17, 10) year, -0.5 deg C

    http://berkeleyearth.lbl.gov/stations/30748

    JONESBORO 2 NE (?Arkansas?)
    6 moves (all since 1974), 14 Others, (16, 11, 8) years, -2.0 deg C

    http://berkeleyearth.lbl.gov/stations/156164

    TOKYO
    2 moves (1 in 2006), 5 Others, (40, 15, 8), +1.9 deg C

    http://berkeleyearth.lbl.gov/stations/161705

    LAS VEGAS MCCARRAN INTL AP (1936-current)
    2 moves (1996, 2008), 7 Others, ( 34, 13, 6 ) years, +2.5 deg C over regional

    http://berkeleyearth.lbl.gov/stations/33493

    FOLSOM DAM, (near San Fran, CA) (1893 to 1993)
    ?1 move 1957, 11 Others, (18, 15, NA) years, -1.0 deg C

    http://berkeleyearth.lbl.gov/stations/34034

    COLFAX (near Sacramento, CA) 1891-current
    7 moves (6 since 1972), 6 Other breaks, (18, 16, 5) years, -0.1 deg C
    This one bears a revisit.
    It is a flat regional trend difference with a few years of -1.0.
    The Raw Anomaly looks dead flat. BEST says it is 0.43 Deg / Century
    After break points applied it is 0.71 deg / Century, Regional is 0.79 deg / Century.

  82. Looking carefully at the BEST chart for Auckland, I’d guess that it merged Albert Park with Auckland Aero in 1962 (when Aero opened) and then joined Aero to Aero AWS in 2010. All well and good.

    But what happened around 1930? Perhaps Riverhead Forest (opened 1928) was spliced in between Albert Park and Aero, but it isn’t on the list.

    A mystery.

  83. More worrying is the lack in BEST of the six stations specifically identified by Hessell (1980) as good rural New Zealand sites “not known to be significantly affected in any of these ways [sheltering/urbanisation/screen changes]”.
    These sites are:
    -Te Aroha
    -Appleby
    -Waihopai
    -Lake Coleridge
    -Fairlie
    -Ophir
    Why were these good sites excluded, when poor sites like Albert Park were included?
    Should we be worried that Te Aroha’s trend is 0.23°C/century, Appleby’s is 0.52°C/century and Fairlie’s is 0.45°C/century (I haven’t calculated the others yet)?
    All these are somewhat less than the average.

  84. A partial solution would be to run two identical stations very close to each other and offset the maintenance by, say, two years. The two sets of data can then be plotted together and any offset between them can be attributed to the maintenance.
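    A rough sketch of how such a paired setup might behave, under strong simplifying assumptions: a flat true temperature, a purely linear drift that resets to zero at each maintenance, and invented values for the drift rate and maintenance interval. The helper `drift_bias` is hypothetical.

```python
import numpy as np


def drift_bias(months, period, offset, rate):
    """Bias that grows linearly since the last maintenance, then resets to zero."""
    return rate * ((months - offset) % period)


months = np.arange(240)                   # 20 years of monthly readings
true_temp = np.full(months.shape, 15.0)   # assumed flat true climate, deg C
period, rate = 48, 0.01                   # assumed: maintenance every 4 years, 0.01 deg C/month drift

station_a = true_temp + drift_bias(months, period, 0, rate)
station_b = true_temp + drift_bias(months, period, 24, rate)  # maintained 2 years later

# The true temperature cancels, so only the maintenance-related bias remains.
difference = station_a - station_b

# Months at which the difference series jumps, i.e. when either station is maintained.
jumps = np.flatnonzero(np.abs(np.diff(difference)) > 0.1) + 1
print(jumps)
```

    In this toy case the difference series is a square wave that flips at every maintenance visit, and the size of each flip equals the drift accumulated over one full maintenance cycle. Real drift is unlikely to be this regular.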

  85. In my experience there is no statistical solution to this conundrum. There are only fudge factors.

    Problems with combining different datasets which have actual sampling methodology differences cannot be resolved unless you go back and re-sample and re-submit for analysis. If the difference exists in the processing of the sample only, then you need to re-submit the sample, unless there is degradation of the sample over time. (Every case is different). And since weather/climate data is a case which is time-dependent, unless you have a time machine, in my humble opinion you can’t fix this problem, because you can’t re-sample nor re-submit for re-analysis.

    What is important is that the data is archived and it is stated clearly what the sampling and methodology was. What you can’t do is throw out the original raw data, or combine different datasets without noting the inherent limitations. If you could, the laws of physics would have to be changed. Sorry can’t be fixed.

  86. Problems with the scalpel method?

    Well, there is the number one problem. It’s still using surface station data.

    This data is not fit for the purpose BEST are trying to claim for it. No amount of extra time in the blender is going to unscramble the egg.

    Repeated attempts to use surface station data to “identify” climate changes in fractions of a degree speaks to motive.

  87. Just start first with a histogram of the breakpoint impacts over time. The next step is to figure out why.

    We are just arguing about nebulous suppositions, but no one is starting at the first point: what the data actually shows.

  88. Zeke Hausfather says:
    June 29, 2014 at 2:57 pm

    Willis,

    Neighbor (or regional climatology) difference series don’t use correlations for anything. Rather, they use the difference in temperature over time between the station in question and its neighbors. I realize that correlation often provides very little information about the trend, which is why it’s not a great indicator of potential bias.

    The Savannah example shows some sawtooth patterns in the neighbor difference series, but they are homogenized in such a way that both the gradual trend and the sharp correction are removed.

    You still haven’t grasped the nettle. Nearby trends are NOT correlated in the same manner as nearby station data. You seem to think that the goal is to adjust every station so its trend is not much different from that of its neighbors … but that is an UNnatural condition, not the natural condition.

    However, as you mention somewhere, we have little intact and high quality data, so it’s hard to tell.

    w.

  89. Sorry Zeke, but that’s a crock.

    “Here are a few examples of sawtooth and gradual trend inhomogeneities that seem to be correctly adjusted: http://berkeleyearth.lbl.gov/stations/169993 ”

    I just looked at the first one in the list and the breakpoint algorithm found lots of issues in modern times and not a single issue from 1870 through 1930, when we were using primitive measurements which I’d fully expect to vary wildly with the measurement devices themselves. We were riding around on horses for crying out loud.

  90. What happens to the overall trend if you halve the threshold of your automatic knife?

    Is it highly sensitive to that particular knob? Or is it robust to less or more frantic chopping?

    Why does it chop regularly instead of rarely?

    Are there multiple arcane parameters involved besides a simple threshold value?

    Is there a sudden last decade reason why your system busts out into the climate model stratosphere?

    What on Berkeley Earth are you *really* doing?

    How many knobs are there and what are their ranges of adjustment?

    Is this just another pretty alarmist merry go round?

    But if we’re all going to die why won’t you tell us why?

  91. Victor Venema writes “If the mean of this difference is not constant, there is something happening at one station, that does not happen at the other.”

    Makes the assumption that one station is better than another. Reality is almost certainly that there are some issues at all of the stations over the years and TOBs is certainly one that comes to mind.

    Also this makes the assumption that there can be no legitimate regional trends over the years which also seems wrong and altered large scale irrigation comes to mind for that.

  92. More on my 6:21 pm reply to Zeke Hausfather at 12:01 pm

    Let’s revisit that TOKYO case

    http://berkeleyearth.lbl.gov/stations/156164

    2 moves (1 in 2006), 5 Others, (40, 15, 8), +1.9 deg C

    As BEST stations go, this one is in fewer pieces, only eight, from 1876 thru 2013.
    The Raw Temperature record shows a +4.0 deg rise; BEST says 2.59 deg C/century.
    The Difference From Regional shows about a +1.9 deg rise.
    So the Regional profile that the scalpel takes its orders from must show about 2.1 deg C of warming (4.0 - 1.9), and the table says 0.93 ± 0.10 deg C / century.

    Here is the deal. That raw Temp Rise for Tokyo has a large UHI component. We know from studies of Cherry Tree Festival records (http://wattsupwiththat.com/2014/04/26/picking-cherry-blossoms/#comment-1622330) that the cities’ urban centers have warmed significantly enough to accelerate the cherry blossoms as much as a week ahead of the countryside.

    Ok, so BEST measures a spurious increase in trend against the region and adjusts Tokyo down to the regional trend. Oh! Happy Days, we’ve eliminated the UHI from the record. Rejoice! … Except we know from the cherry blossom records that all cities are experiencing acceleration in blooming. The regional record has a significant UHI component that BEST has just baked into the official adjusted “clean climate” record. And it will keep baking it in to every other city until the UHI is fully homogenized with all the stations.

    While we are on the subject of the TOKYO station record and its relatively few breakpoints… It doesn’t have a breakpoint I expected. March 1945 should have generated one heckofa breakpoint and probable station move. BEST doesn’t show one. BEST can tease out of the data 20 station moves and breakpoints for Luling, TX. But BEST somehow feels no break point is warranted on a day a quarter million people die in a city-wide firestorm.

    I’m not a supporter of the BEST process. Never was. Never will be — I’ve seen enough.

  93. Willis Eschenbach says:

    Do you see how crazy you sound with your absolute dicta?

    Stephen did not offer an absolute dicta. He gave a well qualified conditional. You ignored the conditional, and removed it from what he said to make your non-sequitur. He said:

    “I cannot remember the number of times I have written on blogs that ” IT IS TOTALLY SCIENTIFICALLY UNACCEPTABLE TO ALTER PAST DATA” unless you have good, scientific and mathematical analysis to allow you to do so without any doubt or favour.”

    The part of that beginning with the word “unless” invalidates your counterexamples:

    Say that we have a change in the time of observation. Suppose that for years we’ve been taking afternoon temperatures at 3 PM at all of our temperature stations, and then we start taking them at 2 PM.

    A documented change in TOBS that is explicitly correctable is a “…good, scientific and mathematical analysis to allow you to do so without any doubt or favour.” Ditto the silliness you put up about F vs C unit conversion.

    The circumstance to which Stephen refers is one where the “error” is not known but assumed to exist, or is known to exist but is not explicitly correctable, i.e. where there is no “…good, scientific and mathematical analysis to allow you to [make a correction] without any doubt or favour.”

    But you removed that part from your quote of what Stephen had actually said:

    But in that situation, if you say that we must accept the observational data exactly as it was recorded because ”IT IS TOTALLY SCIENTIFICALLY UNACCEPTABLE TO ALTER PAST DATA”, then you’ve just put your full weight behind a highly misleading (although totally accurate and unaltered) temperature dataset.

    turning his reasonable conditional into your “absolute dicta” strawman.

  94. @ Victor Venema

    Especially something that would cause artificial trends due to micrositing in the 1990s, but not since the US climate reference network was installed in 2004. Puzzling to me.

    I can answer this definitively. It is because there was a strong warming trend in the 1990s, but a flat trend since 2001.

    Bad microsite does not create an artificial trend. It merely exaggerates a real, already-existing trend (warming or cooling). But from 2004 there is little if any trend. So there will be no trend to exaggerate from bad siting after 2004.

    In the 1990s, on the other hand, there was a strong warming trend (CO2 forcing + positive PDO effect). So that is where the heat sink effect is dominant.

    The microsite effect works both ways: From 1998 – 2008, there was cooling in the US (thanks to the 1998 El Nino start-point and the 2008 La Nina endpoint). And, yes, the poorly sited stations show significantly more cooling than the well sited stations.

    In short, bad siting exaggerates trend, either warming or cooling. But if there is a flat trend, there will be no exaggeration.

  95. Stephen Richards says: June 29, 2014 at 2:50 am
    “IT IS TOTALLY SCIENTIFICALLY UNACCEPTABLE TO ALTER PAST DATA”

    I agree with Willis, but the silly thing is, TOBS adjustment isn’t even doing that. The past data is actually the reading of the position of the markers, at whatever time. It isn’t a daily max. That requires an act of interpretation, indeed an assumption. The observer records say max marker at 80F at 5pm Tuesday. Is that a Tuesday or a Monday max? The data doesn’t say.

    In the past, it would probably have been assumed Tuesday. But quite often it would have been Monday, and it makes a difference, because of double counting warm afternoons. We now have the ability to quantify that assumption, with hourly data.
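    A toy illustration of the double-counting effect, using synthetic hourly temperatures. The diurnal shape, the 17:00 reset time and the single 10 deg C hot day are assumptions for the sketch, and this is not the TOBS adjustment procedure itself.

```python
import numpy as np

hours = np.arange(72)  # three days of hourly data
diurnal = 15.0 + 5.0 * np.cos(2 * np.pi * (hours % 24 - 15) / 24)  # peaks at 15:00
temps = diurnal + np.repeat([0.0, 10.0, 0.0], 24)  # assumed: the middle day is 10 deg C hotter

# Midnight-to-midnight maxima: what a true calendar-day maximum would be.
calendar_max = temps.reshape(3, 24).max(axis=1)

# Max marker read and reset at 17:00. Each reading spans reset to reset,
# and the reset leaves the marker sitting at the 17:00 temperature.
reset = 17
windows = [(0, reset), (reset, 24 + reset), (24 + reset, 48 + reset)]
observer_max = np.array([temps[a:b + 1].max() for a, b in windows])

print("calendar-day maxima:", np.round(calendar_max, 2))
print("17:00-observer maxima:", np.round(observer_max, 2))
```

    In this setup the hot afternoon shows up in two consecutive observer readings, because the marker is reset while it is still warm, so the observer's mean maximum comes out higher than the calendar-day mean.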

    Surely past assumptions aren’t sacrosanct.

  96. Nick writes “But quite often it would have been Monday, and it makes a difference, because of double counting warm afternoons.”

    That’s fine if you truly know when the readings were taken in the past, but if it’s an assumption based on “policy” rather than actual metadata then you’re on shaky ground.

  97. Reading all this analysis of how to back out climate trends from past temperature data, I’m reminded of a well-known saying about candidates trying to win elections:

    If you’re explaining, you’re losing.

    It seems unfair that this saying could be relevant, because science is supposed to be all about dispassionate contemplation of what is and is not known, and what can and cannot be measured. But as the whole climate-science fiasco makes clear, as a group activity there is also a political element — getting your work accepted by others as valid and trustworthy — and that goes double when lots of money is involved. So if you find yourself having to “explain” over and over what you’ve done and “explain” over and over why it makes sense, it may be time to try another approach.

  98. Willis and Bob: my poor old brain is too tired to get into the minutiae of How adjustments are to be calculated.

    But my accounting/BI background leads me to something I’ve stated here more than once.

    Given the existence of wonderful data-handling and recording software, with the ability to slice and dice petabytes of data, there should be a concerted attempt to open-source a transactional record of temperatures.

    So that the original (however obtained, and that would be a dimension in the data: MMTS, Hg and eye-o-meter, etc), would always be TransactionType = original. Never altered. That temp, by that method, at that lat/long/alt.

    But, layered over that in separate records for the lat/long/alt, would be the adjustments.

    – F to C – the ‘oops’ factor
    – The first harmonisation: by what process, resulting in what adjustment, at this lat/long/alt
    – harmonisation 2 – and so on.

    Then the data query engine, to render up the temp for a datetime at lat/long/alt, simply sums the temp values it finds.

    This is called the Audit Trail in accounting, and woe betide the personages who delete old parts of it, alter existing transactions or otherwise finagle existing records. However, introducing new transactions, even to correct years-old mistakes, is always acceptable, as long as one states, who did this, why, when etc: Transactional context.

    If temp records were stored this way, we would not be chasing our tails trying to figure why history changed, why certain calculations seem to be applied Here and not There, and so on.

    We’d simply sum through the Types of adjustments and compare them.

    And all the time, those Originals, in all their human-error-prone glory, would be there for the researchers.

    So, just how hard would That approach be?

    One suggestion: cosy up to a Big Data vendor and suggest a pro-bono effort to put one together. It might just be easier than we all think.
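    The append-only idea in this comment can be sketched in a few lines. This is a minimal illustration, not anyone's existing system: the `Transaction` and `Ledger` names, the field names, the station id and the adjustment values are all hypothetical. The original reading is stored once, every later correction is a separate delta row with its own who/why, and a query simply sums the rows it finds.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Transaction:
    """One immutable ledger row; every field name here is hypothetical."""
    station: str           # stands in for the lat/long/alt key
    observed_at: datetime  # when the reading was taken
    tx_type: str           # "original", "F_to_C", "harmonisation_1", ...
    value: float           # the reading for "original", a delta otherwise
    who: str               # transactional context: who posted it
    why: str               # ... and why


class Ledger:
    """Store that exposes no way to edit or delete a row, only to append."""

    def __init__(self):
        self._rows = []

    def append(self, tx):
        # An original reading may be recorded only once per station and time.
        if tx.tx_type == "original" and any(
            r.tx_type == "original"
            and r.station == tx.station
            and r.observed_at == tx.observed_at
            for r in self._rows
        ):
            raise ValueError("original already recorded; post an adjustment")
        self._rows.append(tx)

    def temperature(self, station, observed_at, types=None):
        """Sum the original and its adjustments, optionally only some types."""
        return sum(
            r.value
            for r in self._rows
            if r.station == station
            and r.observed_at == observed_at
            and (types is None or r.tx_type in types)
        )


ledger = Ledger()
when = datetime(2014, 6, 30, 9, 0)
ledger.append(Transaction("HYPOTHETICAL-001", when, "original", 68.0,
                          "observer", "as written on the form"))
ledger.append(Transaction("HYPOTHETICAL-001", when, "F_to_C", -48.0,
                          "auditor", "form was filled in Fahrenheit"))
ledger.append(Transaction("HYPOTHETICAL-001", when, "harmonisation_1", -0.3,
                          "analyst", "illustrative adjustment only"))

print(ledger.temperature("HYPOTHETICAL-001", when))                      # all layers
print(ledger.temperature("HYPOTHETICAL-001", when, types={"original"}))  # raw only
```

    Passing a set of types to `temperature` is what would let a reader compare one layer of adjustment against another without ever touching the original row.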

  99. evanmjones says: “Bad microsite does not create an artificial trend. It merely exaggerates a real, already-existing trend (warming or cooling). But from 2004 there is little if any trend. So there will be no trend to exaggerate from bad siting after 2004.”

    I would call that a description of what your data shows, but not yet an explanation of what happened locally to the measurement.

  100. Bill Illis writes “The breakpoints should be equally distributed between positive and negative adjustments of roughly the same magnitude.”

    Why assume that? IMO there are many more ways to “unnaturally” warm something giving a warming bias than there are ways to cool something.

  101. In my experience there is no statistical solution to this conundrum. There are only fudge factors
    =============
    the problem is similar to name and address correction in mailing lists. you have a number of similar entries that may or may not be for the same person. how do you tell which entry is the “most correct”, so you can use that while discarding the others.

    Do you average all the entries for the same person? no, because that will create problems. the low quality entries will swamp out the good ones, leaving you with poor quality.

    Rather, what you need to do is rate the quality of the entries, not based on how similar they are to their neighbors, but rather on how good the source was for the entries. Then you throw out the entries that are from a poor source.

    This would appear to be the crux of the solution for temperature. you cannot determine how good a reading is by comparing it with its neighbors, because you don’t know the quality of the neighbors. the comparison is nonsensical. you need to rate the quality of the station based on information about the station itself, and then either use the data if it is high quality, or throw it out if it is low quality.

  102. I don’t know, but it seems possible that a set of stations within any given region are going to be maintained by the same people.

    It also seems likely that the stations are maintained using a regular schedule – perhaps each station is visited every year for a simple checkup/spider web removal, repainted every 5 years, and a major service every 10 years when local vegetation is cut down and removed.

    If this is the case, then it is quite possible that when the major service takes place for any one station, the nearby stations are not due for this treatment for another year or two. In that case, the drop in the sawtooth will be assumed to be an artifact, since the other local stations didn’t see it. And of course next year, when another nearby station shows a drop due to a service, the previous station will not show it, nor will the others yet to be serviced, and so again this is seen to be an artifact and removed either by the averaging method, or by scalpelling. Either way, the temperature record has been “adjusted” to make it further from the truth.
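    The sawtooth mechanism can be illustrated with a toy series. This is a sketch under stated assumptions, not the Berkeley Earth algorithm: the real temperature is flat, the instrument drifts warm by a hypothetical 0.05 degrees per year, and a service every 10 years resets the drift to zero. The function names are made up for the example. It compares the trend of the uncut record with the average trend of the segments left after cutting at every service.

```python
from statistics import mean


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


def sawtooth_record(n_years=40, service_every=10, drift_per_year=0.05,
                    true_trend=0.0):
    """Annual series: real trend plus a drift that resets at each service."""
    years = list(range(n_years))
    temps = [true_trend * y + drift_per_year * (y % service_every)
             for y in years]
    return years, temps


def scalpel_slope(years, temps, service_every):
    """Cut at every service and average the within-segment slopes."""
    slopes = [
        ols_slope(years[s:s + service_every], temps[s:s + service_every])
        for s in range(0, len(years), service_every)
    ]
    return mean(slopes)


years, temps = sawtooth_record()
raw = ols_slope(years, temps)           # trend of the uncut record
cut = scalpel_slope(years, temps, 10)   # trend after cutting at each reset

print(f"uncut record trend : {raw:.4f} deg/yr")
print(f"mean segment trend : {cut:.4f} deg/yr")
```

    In this toy case the segments carry the full drift rate as if it were climate, while the uncut record stays much closer to the flat truth because each reset pulls it back down.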

  103. richardverney: Per Willis

    “…As a result, the raw data may not reflect the actual temperatures….”
    //////////////////////////

    Wrong; the raw data is the actual temperature at the location where the raw data is measured.

    What you mean is whether there are some factors at work which have meant that the actual temperature measured (i.e., the raw data) should not be regarded as representative of temperatures because it has been distorted (upwards or downwards) due to some extrinsic factor (in which I include changes in the condition of the screen, instrumentation, TOBs as well as more external factors such as changes in vegetation, nearby buildings etc).

    I think richardverney puts the matter badly here, and Willis Eschenbach is closer to the truth. In measurement science it is common to distinguish between “accuracy” of measurement and “precision” of measurement, where accuracy refers exactly to the question of how close to the true value you may consider the measured value to be. When these two aspects of measurement are addressed by statisticians, they are called “bias” (the complement of “accuracy”: the disparity between the expected value of the observations and the true value) and “variance” (the complement of “precision”: the variation in repeated measures of the same quantity). Mean Squared Error is then the sum of the squared bias and the population variance of the estimators.

    “Temperature” of a region is proportional to the mean kinetic energy of the molecules in that region; the measured temperature is always slightly different from the true temperature by some small amount.

    Notice that richardverney uses “should not be regarded as representative of” whereas Willis Eschenbach uses “may not reflect actual”. I do not perceive a meaningful difference imparted by richardverney’s “correction”, but he confuses the issue by presenting his rewording as a correction.

    Remember always: “accuracy” and “precision”; and their complementary concepts in statistical analysis, “bias” and “variance”. A measuring system with high bias and very high precision will repeatedly and reliably get the wrong answer, and the mean of a large number of independent observations will even more reliably get the wrong answer.
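    The decomposition mentioned above, mean squared error equals squared bias plus variance, can be checked numerically. The sketch below uses two made-up instruments (the offsets and spreads are assumptions chosen only for illustration): one reads 1.0 degree high with very little scatter, the other is unbiased but noisy.

```python
import random


def summarize(readings, true_value):
    """Return (bias, population variance, mean squared error) of readings."""
    n = len(readings)
    avg = sum(readings) / n
    bias = avg - true_value
    variance = sum((r - avg) ** 2 for r in readings) / n
    mse = sum((r - true_value) ** 2 for r in readings) / n
    return bias, variance, mse


rng = random.Random(0)          # fixed seed so the run is repeatable
true_value = 20.0

# Hypothetical instrument A: reads 1.0 too warm, scatter of 0.1.
biased_precise = [true_value + 1.0 + rng.gauss(0.0, 0.1)
                  for _ in range(10_000)]
# Hypothetical instrument B: no offset, scatter of 1.0.
unbiased_noisy = [true_value + rng.gauss(0.0, 1.0)
                  for _ in range(10_000)]

for name, readings in [("biased, precise", biased_precise),
                       ("unbiased, noisy", unbiased_noisy)]:
    bias, variance, mse = summarize(readings, true_value)
    print(f"{name}: bias={bias:.3f} var={variance:.3f} "
          f"mse={mse:.3f} bias^2+var={bias ** 2 + variance:.3f}")
```

    Averaging more readings from instrument A shrinks the scatter of the mean but leaves the offset untouched, which is the point made about reliably getting the wrong answer.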

    Willis’ question can be restated (I hope, please forgive me if I misunderstand) as: “Does the scalpel process bias the estimates in such a way that the estimated trend is reliably too large?” The possibility that it might do so has to be addressed. That does not say it hasn’t been addressed. I have not (yet) read all the relevant literature on the temperature record.

  104. I don’t understand, do all these corrections occur just because of discrepancies in the metadata? Why can’t these thermometer/weather stations be empirically and/or routinely recalibrated to an independent, objective method traced to some universal standard after maintenance/repairs are made, just like any other scientific instrument? Instead of finessing the data: WHEN IN DOUBT, THROW IT OUT!

  105. So we know each break-point is “most” correct vs the data surrounding. The question then is whether the reading increase from maintenance to maintenance is linear or asymptotic toward a flat line. I would expect the latter, but determining the specific behavior of temperature bias vs time would probably require a multi-year experiment using multiple temperature stations with varying degrees of routine maintenance.

    What if then we applied a linear trend from discontinuity point to discontinuity point, and calculated the slope from start to end of the interval? We cannot do the slope using best fit as that will not be correct; it must be the slope using only the start and end points. Now we find the slope from the start point of the period to the start point of the next period. We’re pretty sure the start of the next period is correct and (most likely) the same as the current period, so we adjust all points based on the difference of the slopes. This way no start points get shifted down and long term trends are kept correct. The discontinuities are also gone without going all willy nilly on the data.
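    One possible reading of this proposal is sketched below. It is an interpretation, not a tested method: it assumes the drift inside each segment is linear (the commenter himself suspects it may be asymptotic), and the function name and the synthetic numbers are invented for the example. For each segment it takes the endpoint slope, takes the slope from this segment's start to the next segment's start, and removes the difference as drift.

```python
def remove_linear_drift(times, temps, starts):
    """Remove within-segment drift using only segment start and end points.

    `starts` lists the index of the first reading after each maintenance
    visit, beginning with 0. The final segment has no later start point to
    compare against, so it is returned unchanged.
    """
    adjusted = list(temps)
    for a, b in zip(starts[:-1], starts[1:]):
        last = b - 1                      # final reading before the next visit
        if last == a:
            continue                      # one-point segment: no slope to take
        seg_slope = (temps[last] - temps[a]) / (times[last] - times[a])
        ref_slope = (temps[b] - temps[a]) / (times[b] - times[a])
        excess = seg_slope - ref_slope    # slope attributed to drift
        for i in range(a, b):
            adjusted[i] = temps[i] - excess * (times[i] - times[a])
    return adjusted


# Synthetic check: 0.01 deg/yr real trend plus 0.05 deg/yr drift,
# with the drift reset to zero every 10 years.
times = list(range(40))
temps = [0.01 * t + 0.05 * (t % 10) for t in times]
fixed = remove_linear_drift(times, temps, starts=[0, 10, 20, 30])

print([round(v, 3) for v in fixed[:12]])
```

    Segment start points are never moved, which matches the stated aim of not shifting the most trustworthy readings.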

  106. Am I the “lone voice” pointing out the absurdity of “average temperature”? It’s the RADIATION ENERGY BALANCE (notice I did not use the term “heat” as it is too easily confounded with “temperature”) which matters. Thus the “HEAT CONTENT” or Enthalpy of the cubic volume of AIR is what really counts. ROUGHLY it can be measured with humidity and temperature. (Psychrometric chart is helpful here.)

    NO ONE, never, EVER talks about this. Yet, it would be the PRIMA FACIA way of assessing the RESULT of the energy balance of the atmosphere. AM I LOOPY? What’s wrong with this? I think the “KING HAS NO NEW CLOTHES” with regard to EVERYONE …i.e., skeptics and Warmistas …because NO ONE recognizes the need to STUDY ENERGY BALANCE and NET ENERGY CONTENT of the atmosphere.

  107. Reply to Wayne Findley ==> You are 1000% (sic) correct — what is missing in all of this temperature transmogrification is the existence of proper Audit Trails that carefully and in detail show exactly what has been done to the originally recorded number, when, by whom, and why — in every case that the number is touched by anyone. ANYTHING else is bordering on illegality in the financial world == fiddled books. So it should be with all scientific data — down to the smallest and least consequential experiment.

    In fact, in real science, one must produce his original lab log; any post hoc changes in it can totally invalidate the work and findings, and inability to produce the lab log on demand has the same result.

    I am not sure at all that the whole temperature record hasn’t been fiddle-faddled beyond any usefulness.

  108. PeteJ says: June 30, 2014 at 9:44 am
    “Instead of finessing the data: WHEN IN DOUBT, THROW IT OUT!”

    I’ve written a post here which tries to illustrate the fallacy of that. When you are calculating the average for a period of time, or a region of space, that data point was part of the balance of representation of the sample. If you throw it out, you are effectively replacing it with a different estimate. You can’t avoid that. And your implied estimate could be a very bad one.

    It’s not good advice.

    REPLY: Nick Stokes, defender of the indefensible, is arguing to preserve bad data. On one hand he (and others) argue that station dropout doesn’t matter, on the other he argues that we can’t throw out bad stations or bad data because it won’t give a good result.

    Priceless.

    This is exactly what is wrong with climate science and the people that practice it.

    – Anthony

  109. Wayne Findley says:
    June 30, 2014 at 2:28 am

    Willis and Bob: my poor old brain is too tired to get into the minutiae of How adjustments are to be calculated.

    But my accounting/BI background leads me to something I’ve stated here more than once.

    Given the existence of wonderful data-handling and recording software, with the ability to slice and dice petabytes of data, there should be a concerted attempt to open-source a transactional record of temperatures.

    Thanks, Wayne. The Berkeley Earth folks have already done an excellent job of both preserving the original data and showing the adjustments that have been made. While their adjustment (the “scalpel”) may have shortcomings and their overall view tends towards alarmism, they have been very transparent and professional in their data handling. Any of their individual station data pages shows the raw data, the adjustments, and the adjusted data.

    In addition, Steve Mosher has put together an excellent package in the computer language “R” for accessing the data and using their methods, available from the normal CRAN repository. Discussion and details are available on Mosh’s blog.

    So they have been completely up-front about their data and code, and have my congratulations on that part of the effort. Their documentation and the availability of both data and code puts them leagues in front of the other global temperature datasets such as GISS.

    Regards,

    w.

  110. Max Hugoson says:
    June 30, 2014 at 10:10 am

    Am I the “lone voice” pointing out the absurdity of “average temperature”? It’s the RADIATION ENERGY BALANCE (notice I did not use the term “heat” as it is too easily confounded with “temperature”) which matters. Thus the “HEAT CONTENT” or Enthalpy of the cubic volume of AIR is what really counts. ROUGHLY it can be measured with humidity and temperature. (Psychrometric chart is helpful here.)

    NO ONE, never, EVER talks about this. Yet, it would be the PRIMA FACIA way of assessing the RESULT of the energy balance of the atmosphere. AM I LOOPY?

    Given the number of capital letters in your diatribe, I’d leave out the question of your loopiness, folks might be encouraged to answer …

    In any case, we’ve gone through this before, Max. People are well aware that temperature is not a complete measure of the enthalpy in the air. However, from my own investigations into the question, I have found that the inclusion of the latent heat (in order to calculate enthalpy) makes very little difference in the results.

    So first off, yes, people do talk about this. Me, I’ve concluded that it’s not a significant factor.

    So if you think it is a big factor, here’s what you should do. Get a good clean temperature and humidity record from one of the CRN (climate reference network) sites. Then calculate the temperature on the one hand, and the full enthalpy including water vapor on the other hand, and compare the two. I did it with some Canadian stations at some point, no idea where that data is now, but I found little difference.
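    For anyone who wants to try that comparison, here is a rough sketch of the enthalpy side of the calculation. It uses standard textbook approximations (a Magnus-type formula for saturation vapour pressure and the usual psychrometric expression for the enthalpy of moist air per kilogram of dry air); the function names and the default sea-level pressure are assumptions, and it says nothing about what the comparison will show for any particular station.

```python
import math


def saturation_vapor_pressure_hpa(t_c):
    """Magnus-type approximation over liquid water, in hPa."""
    return 6.112 * math.exp(17.62 * t_c / (243.12 + t_c))


def humidity_ratio(t_c, rh_percent, pressure_hpa=1013.25):
    """Mass of water vapour per mass of dry air (kg/kg)."""
    e = rh_percent / 100.0 * saturation_vapor_pressure_hpa(t_c)
    return 0.622 * e / (pressure_hpa - e)


def moist_enthalpy_kj_per_kg(t_c, rh_percent, pressure_hpa=1013.25):
    """Enthalpy of moist air per kg of dry air, relative to 0 deg C."""
    w = humidity_ratio(t_c, rh_percent, pressure_hpa)
    sensible = 1.006 * t_c                 # dry-air part
    latent = w * (2501.0 + 1.86 * t_c)     # water-vapour part
    return sensible + latent


# Same temperature, different humidity (illustrative inputs only).
for rh in (20, 50, 80):
    h = moist_enthalpy_kj_per_kg(20.0, rh)
    print(f"20.0 C at {rh}% RH -> {h:.1f} kJ/kg dry air")
```

    To make the comparison described above, one would run both the temperature series and the resulting enthalpy series through the same averaging or trend calculation and see how much the two results differ.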

    Report back here with the results, and we’ll have another data point for the discussion. I don’t think it’s a big issue, but I’m always willing to learn.

    Thanks,

    w.

    PS—I don’t want to be a spelling Nazi, but I hate to see a man make a mistake more than once, so I apologize in advance for this correction … it’s “prima facie”, and “facia” is a term I use as a builder, it’s a wooden piece covering the ends of the rafters.

  111. Anthony replies @:
    June 29, 2014 at 4:00 pm

    REPLY: You don’t need a PhD to be able to do research and publish papers, I’ve done three now, working on #4, and as many people like to point out, including yourself, I don’t have a PhD and according to many, I am too stupid to be in the same ranks with you. Yet, I do research and publish anyway.

    But, I know in the eyes of many in your position that career path makes me some sort of lowbrow victim of phrenology.
    ——————–

    Right you are, Anthony. A big majority of those having been awarded an MA/MS or PhD Degree, …. including their “brainwashed” underlings and admirers, …. all possess a “Rank before Frank” mentality.

    They have been nurtured to “bow down” to any “Rank” that is greater than their own …. and to ignore, discredit or defame any and all “Franks” regardless of what they might want to contribute to a conversation.

    The per se “purchasing” of a PhD Degree from a reputable college or university is akin to …. some one “purchasing” a BIG toolbox chock-full of all kinds of “specialized” tools from a local Sears, Lowe’s or Home Depot.

    Thus, both parties have “proof of ownership” (Diploma-Degree vs. Sales Receipt) of their big box of “tools” ……… but said “proof of ownership” is neither proof nor factual evidence that said parties are actually capable of using the “tools” contained in their “toolbox”.

    And when one of the aforesaid pulls “Rank before Frank” on you, …. you should immediately know what their debilitating “deficiency” problem is.

  112. @evanmjones at 9:51 pm
    In short, bad siting exaggerates trend, either warming or cooling. but if there is a flat trend, there will be no exaggeration.

    I cannot accept that statement as truth.

    There might be an element of truth in it IF AND ONLY IF the bad micrositing issues remain constant.

    However, bad micrositing is prima facie evidence that care is not being taken with respect to the quality and consistency of the recording conditions. If you install an incinerator 5 feet from the Stevenson Screen, it is a bad micrositing issue, even if you don’t use it. But if you change the number of times you use it a month, the time of day you use it, or the quantity you incinerate, then a change in trend observed will be partially a function of the changes in the incineration schedule. Micrositing issues can create and reverse a trend. UHI can turn a cooling into a warming.

    This issue of variability of microsite, UHI, instrument drift is what invalidates BEST segmenting by scalpel, decoupling from absolute temperature. Using the slopes of the segments is valid if and only if contamination of the station record is constant over the time span of the segment. Clearly, in regard to UHI, constancy is false. Cutting the temperature record into shorter segments doesn’t change the contribution of UHI to the record.

    There is a theoretical possibility that some discrete microsite events (like a parking lot paving or nearby building constructed) can be eliminated by the scalpel, but it is a fool’s errand. There are many gradual micrositing changes (instrument drift, weathering, sensor aging, plant growth), and the instantaneous change is necessary recalibration information that should not be lost to the scalpel.

  113. There is a theoretical possibility that some discrete microsite events (like a parking lot paving or nearby building constructed) can be eliminated by the scalpel,….

    In the case of airports, there are commonly construction projects as terminals and tarmacs expand, runways lengthened and added. If you don’t move the temperature sensor, you would be changing the microsite conditions and it arguably deserves a breakpoint. But if you move the sensor away from the construction to restore the micrositing classification back to Class 1, should you count it as a station move and institute a breakpoint?

    I would argue, NO.
    Breakpoints are not going to change the gradual build up in UHI and activity at the airport. Moving the station has restored and recalibrated the temperature sensor to make it less dependent on nearby sources of contamination. While you can argue that an (unnecessary) breakpoint can be inserted here without introducing bias, I argue that it is a bias against long term records, especially from well maintained sites, which appear to be a rare commodity.

  114. There is only one way to account for discontinuities: overlap. For a station change this is relatively easy: keep the first station open for a year (to account for annual cycles in precipitation, insolation and wind direction; longer would be better, but I am not that unrealistic) and use the overlap to homogenise.

    I am struggling with maintenance. The only thing I can think of to resolve maintenance of the station or its environs, removing saw-tooth patterns is far harder and horribly expensive: 100% overlap. A second station should be placed immediately adjacent that receives the same treatment more than twice as frequently and out of sync. The data from the second station are used only to correct for artificial trends in the primary station, and will make that correction mid-cycle so as to show those trends.

  115. There might be an element of truth in it IF AND ONLY IF the bad micrositing issues remain constant.

    Yes. And that is why, for the purposes of our study, I removed stations that moved, and also stations that did not move but whose ratings were changed by encroaching heat sink. (We will retain a station with a localized move, but only if the rating is not changed.)

    To be clear, I refer only to such stations as we retained.

    However, bad micrositing is prima facie evidence that care is not being taken with respect to the quality and consistency of the recording conditions.

    That I don’t think I agree with. Some of the oldest stations, with the finest station records and the most devoted staff, are out of compliance for siting (e.g., Blue Hill, MA, and Mohonk Lake, NY). And some of the nicest, most isolated Class 1\2s are battered up old CRS screens that look like they came out losers in a bar fight.

    Remember, the regional directors who place the stations are not the actual curators who do the day to day and report the data. There is often a surprising disconnect, here. The curators — for the most part — love and care for their stations and are proud of them. I consider the curators to be victims of those who placed the stations badly.

    No one loves his station more or keeps better records than “Old Griz” Huth up at Mohonk. But some yahoo placed his station in a tangled mess of vegetation within 6 meters of a structure (with an exposed, “working” chimney). Damn shame.

    Micrositing issues can create and reverse a trend.

    Yes, if the microsite condition itself changes. That could produce a step change in either direction, wreaking havoc with the trend. If, OTOH, it’s constant, it will tend to exaggerate either cooling trend or warming trend, but should not reverse either. Note also, however, that heat sink and waste heat are two different factors and do not have the same effect. Constant waste heat can actually reduce a trend by swamping the signal. But heat sink works via a different mechanism.

  116. I can’t help it that I had to get a PhD to be allowed to do research. That is the way the system works.

    Remember, Anthony, we are not talking America, but Western Europe, here. It’s a different professional ethic. Very territorial. (And a Class Thing.) These things are looser on this side of the Atlantic. Fewer demands to “see our papers”, as it were. More latitude for the self-made man.

  117. Victor Venema says:
    June 30, 2014 at 2:41 am
    . . .
    I would call that a description of what your data shows, but not yet an explanation of what happened locally to the measurement.

    It is consistent with the original hypothesis: If the temperature is a flat trend, bad microsite cannot affect that trend. Once a genuine trend occurs, that trend (warming or cooling) will be exaggerated.

    There was a warming trend during the 1990s. That was exaggerated.
    There was a cooling trend from 1998 – 2008. And that was exaggerated.
    From 2002 (when CRN first began to be deployed) to date, however, there is a flat trend: Nothing “there” to exaggerate.

    The trend match between CRN and USHCN (after 2001) is confirmation, not falsification.

  118. In the case of airports, there are commonly construction projects as terminals and tarmacs expand, runways lengthened and added.

    Usually (though not always) those changes occur further than 100m. from a station. That transitions us from the microsite to the mesosite level of consideration.

    Airports are an interesting case. I once thought the well sited ASOS units definitely ran hot-to-trend. But then Google Earth got better focused and a number of those warm-running Class 2s turned out to be Class 3 or even 4. We also lost a large slice when I purged the moved stations. (Yes, I am both a “station dropper”, and “data adjuster”, good lord have pity on me. Just don’t ask me to homogenize. Some sins transcend the venial.)

    The current data now shows Airport Class 1\2s running cooler than any other subset, but it is a small and statistically volatile subset from which no definitive conclusions may be drawn. So I can no longer conclude with confidence that airports are an inherently bad mesosite. (“When the facts change, I change my mind,” shades of Keynes.)

  119. @evanmjones at 5:57 pm
    (Yes, I am both a “station dropper”, and “data adjuster”, good lord have pity on me. Just don’t ask me to homogenize. Some sins transcend the venial.)

    LOL.
    Well, I’m a petroleum geophysicist, among other things. When it comes to dropping stations, nulling data for multiples, and adjusting timeseries (in the time domain!) for normal moveout, near surface static corrections, and converting to depth, seismic data processors have no peer. Fortunately, we have fold to rely upon, which is a form of homogenization, to make each source-receiver pair look like others at the same depth point and very similar to neighbors.

    Guilty! Geophysicists commit data sin — discretely, and in bulk.
    But we leave the field tapes alone and we document the processing steps.

  120. @evanmjones at 5:57 pm

    I think I agree with you about airport microsite issues. Above I tried to make the case that BEST breakpoints at airports seem unlikely to be real.

    I know DENVER STAPLETON AIRPORT. Its 5 moves and 5 additional breakpoints just don’t make sense for any airport, much less that one. But the opening of the airport in 1929, its expansion in 1944, and its closing in 1995 are not breakpoints in the record.

    All I was saying is that a weather station at an airport, even if it moves away from airport construction, does not deserve breakpoints if it maintains its distance from the terminal and tarmacs. Not all airports can say that: LAX, SEATAC, yes siting location could change the temperature, some. For most airports, within the limits of UHI, one Class 1 spot at an airport ought to be indistinguishable from another Class 1 spot.

    BEST is just breakpoint happy. It allows regional grids from Class 4&5 to dictate breaks and adjustments at Class 1 stations. Long, unbroken records have the most value of any.

  121. BEST is just breakpoint happy. It allows regional grids from Class 4&5 to dictate breaks and adjustments at Class 1 stations.

    Does it, by god? (I think I may take matters into my own hands and see what is happening to my Class 1\2s, if I can figure out the BEST interface.)

  122. Nick writes “You can’t avoid that. And your implied estimate could be a very bad one.”

    Sometimes, if you know you have bad data, the actual answer is not knowable. No matter what “adjustments” you might make.

  123. if you homogenize milk and manure, the end product will still taste like shzt.

    you cannot eliminate the manure by comparing one pail of milk with another. what if the neighboring pail of milk is also contaminated? the only way to eliminate the manure is to compare each pail against a known standard.

    thus Anthony’s approach of eliminating poorly sited stations is correct, while the various nearest neighbor comparison methods are flat out wrong.

  124. because temperature data is numeric, there is a false belief that data quality can be improved via numerical methods. if your method can improve the quality of data, it should work with non numerical data as well.

    however, once you approach the problem in this fashion you will realize that you can only improve data quality if you have an independent measure of those rows in the dataset that are poor quality and those rows that are high quality. which means you need to score the quality of the source. once that is established, data quality is enhanced by removing or treating the rows from the low quality source.

    simply comparing a row with its neighbor does not establish data quality, because the quality of the neighbor is unknown.

  125. if your neighbors all tell you the same tale, does that make it true? no, because you don’t know the quality of their source. only after you find out if they had a high quality source can you judge if the story is accurate or not.

  126. Attempts at calculating a true and accurate “adjustment” to compensate for the several different randomly generated variables that directly or indirectly affect the numerical values within a number set of recorded temperatures ………. is an act of futility.

  127. To the half-dozen posts above: Yes, if you are going to make an attempt at a “true signal”, one must, perforce, confine oneself to the subset of stations capable of providing such. That is what we do for our paper.

    BEST does not appear to concede that it just sometimes gets colder or warmer in any given neck of the woods. Instead they kill every breakpoint. But natural factors can also produce breakpoints. I say that one must kill a breakpoint only if there is a specific reason to do so. (And I do not adjust such stations. I drop them.)

    I prefer to drop a station only if it has moved or its TOBS is “reversed”. Even if there is no breakpoint. USHCN oversamples, so after the dust clears, we still have 400 stations whose conditions are reported by NCDC to be essentially unchanged, 80 of which are Class 1\2, the latter of which still provide adequate distribution and produce what I call the “true signal”.

  128. evanmjones:

    You conclude your post at July 1, 2014 at 9:00 am saying

    USHCN oversamples, so after the dust clears, we still have 400 stations whose conditions are reported by NCDC to be essentially unchanged, 80 of which are Class 1\2, the latter of which still provide adequate distribution and produce what I call the “true signal”.

    Where was it that I read about this “true signal” before?
    Oh, yes! I remember! It was this.

    In the beginning was the Word, and the Word was with God, and the Word was God.
    The same was in the beginning with God.
    All things were made by him; and without him was not any thing made that was made.
    In him was life; and the life was the light of men.
    And the light shineth in darkness; and the darkness comprehended it not.

    Clearly, when confronted with your “true signal” I am part of “the darkness”.

    Richard

  129. Nick Stokes: I’ve written a post here which tries to illustrate the fallacy of that. When you are calculating the average for a period of time, or a region of space, that data point was part of the balance of representation of the sample. If you throw it out, you are effectively replacing it with a different estimate. You can’t avoid that. And your implied estimate could be a very bad one.

    It’s not good advice.

    REPLY: Nick Stokes, defender of the indefensible, is arguing to preserve bad data. On one hand he (and others) argue that station dropout doesn’t matter, on the other he argues that we can’t throw out bad stations or bad data because it won’t give a good result.

    If you knew for sure what data were “bad” and what data were “good”, your reply might make sense, but note Nick Stokes’ point that throwing out a “bad” data point is equivalent to imputing (the word most statisticians prefer to “estimating” missing data) a particular value to it, and that might not be the best imputation possible. In almost all cases, including the temperature data sets, all the data are “imperfect to some degree”, and the classification into “bad” vs “good” is an arbitrary simplification. With lots of imperfect but few “bad” data points, using the extant data to impute a value to the “bad” data probably is better than the particular imputation method of dropping the “bad” data.

    Nick Stokes’s defense is reasonable, and the whole topic of methods of imputation is much addressed in the statistical literature. How good a particular method is in a particular case often can’t be determined with great confidence from the extant data, but dropping “identified BAD” data is almost always among the least defensible alternatives.

    Another venue where this issue arises is in measuring small concentrations (of drugs, metabolites, toxic pollutants, etc) where a large number of values are positive but “below the limit of detection”. Throwing them out can be worse than using them. Two statisticians who have addressed this problem for particular cases are Diane Lambert, PhD (then of AT&T Bell Labs, more recently at Google) and Emery Brown, MD, PhD (then at Mass Gen or McLean; now at Harvard Medical School.) If anyone is interested, I can get the full references, but the general topic is “missing values and data imputation”.

    This is a terrific thread. I would like to thank Nick Stokes for hanging around and presenting a spirited defense of a defensible approach to missing weather values, and Willis for initiating the thread.

  130. My thanks to Nick Stokes and Matt Marler for bringing up an interesting question: when is throwing out bad data worse than keeping it? Nick says:

    When you are calculating the average for a period of time, or a region of space, that data point was part of the balance of representation of the sample. If you throw it out, you are effectively replacing it with a different estimate. You can’t avoid that. And your implied estimate could be a very bad one.

    Suppose our method of analysis, as in Nick’s example, is an average of the data, to find e.g. an average temperature over some period. Nick says that throwing out a given piece of data in this situation is equivalent to replacing it with the average of the remaining data. While this is demonstrably true, there is one way in which they are not equivalent, and it is an important way.

    This is that throwing out the bad data increases the uncertainty of the result by reducing N, while replacing the bad data with the average of the remaining data decreases the uncertainty of the result. As a result, Nick’s claim that the two are equivalent is simply not true.

    w.
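
The point about N can be checked with a few lines of Python. This is only a minimal sketch with made-up readings (the numbers and the `sem` helper are assumptions for illustration, not anything from the temperature datasets): dropping the suspect value and replacing it with the mean of the rest give the same average, but the naive standard error of the mean differs because N differs.

```python
from math import sqrt
from statistics import mean, stdev

def sem(values):
    """Naive standard error of the mean: sample SD divided by sqrt(N)."""
    return stdev(values) / sqrt(len(values))

# Hypothetical readings; the last one is the value judged "bad".
readings = [14.0, 15.0, 16.0, 15.5, 14.5, 25.0]

dropped = readings[:-1]                   # throw the bad value out
infilled = dropped + [mean(dropped)]      # replace it with the mean of the rest

print(mean(dropped), mean(infilled))      # the two means agree
print(sem(dropped), sem(infilled))        # the infilled set looks more precise
```

The smaller error bar on the infilled set comes only from counting the imputed value as if it were a measurement.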

  131. Willis Eschenbach: This is that throwing out the bad data increases the uncertainty of the result by reducing N, while replacing the bad data with the average of the remaining data decreases the uncertainty of the result. As a result, Nick’s claim that the two are equivalent is simply not true.

    That is a pertinent point. When data are imputed, the number of imputed values has to be subtracted from N in order to avoid the misleading appearance of greater precision. Nick Stokes will speak for himself, but I bet that he knows that.

    Now back to “when is imputing better than simply dropping?” Consider for now the estimate of the mean high temperature in the US on July 1. Doing the work, you find that the temperature for Lubbock TX is missing. Simply dropping it is equivalent to replacing it with US mean high temp from all of the other data. A different method of imputation is to replace it with the mean high temp of a region around Lubbock; then use that in calculating the US mean. Which of these imputations yields an estimate closer to the real US mean, given that neither imputed value is exact? Almost for sure, the imputation based on local stations is better than the overall mean. If there is enough other reliable information, the estimate calculated as the Bayesian posterior mean of a well-chosen locale (or based on values highly correlated in general, as with Kriging) can’t be beaten. But you probably can’t know for sure: the proofs depend on assumptions, and the assumptions are almost never exact representations of what you are working with.
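
The Lubbock comparison can be sketched as follows. All temperatures and the choice of "neighbouring" stations are invented for illustration, and the numbers are deliberately chosen so that the local region is warmer than the rest of the sample; when local means do not differ from the overall mean, the two imputations give the same answer.

```python
from statistics import mean

# Hypothetical "true" July 1 highs (deg C); values are illustrative only.
true_highs = {
    "Lubbock": 35.5, "Amarillo": 34.0, "Midland": 36.0,   # warm local region
    "Seattle": 22.0, "Boston": 26.0, "Duluth": 23.0,      # cooler stations elsewhere
}
true_mean = mean(true_highs.values())

# The Lubbock reading is missing from what we actually observe.
observed = {k: v for k, v in true_highs.items() if k != "Lubbock"}

# Option 1: drop Lubbock. This equals imputing the mean of everything else.
dropped_mean = mean(observed.values())
overall_imputed_mean = mean(list(observed.values()) + [dropped_mean])

# Option 2: impute the mean of nearby stations (the neighbour list is an assumption).
local_value = mean([observed["Amarillo"], observed["Midland"]])
local_imputed_mean = mean(list(observed.values()) + [local_value])

print("error when dropping:       ", abs(dropped_mean - true_mean))
print("error with local imputation:", abs(local_imputed_mean - true_mean))
```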

  132. This is that throwing out the bad data increases the uncertainty of the result by reducing N, while replacing the bad data with the average of the remaining data decreases the uncertainty of the result. As a result, Nick’s claim that the two are equivalent is simply not true.

    How well I know.

    And that’s yet another crime of homogenization — it gives an entirely false impression of precision. See my error bar. See how nice and small it is. Cocktails all ’round. Well of course it is! You have smoothed away all your outliers, haven’t you? You have reeducated them to conform with the majority of Good Citizens.

    The problem arises when the Good Citizens actually turn out to be Bad Citizens.

    Meanwhile, the true signal has vanished. What remains is meaningless pap.

  133. Thanks, Matthew. I find myself uneasy about the logic regarding infilling “dead” stations. Suppose we’re calculating the average temperature of the US. As you point out, mathematically, infilling is the same as replacing the value for the station with some flavor of local average, and leaving it out is the same as infilling it with the national average. Your claim is that using the local average is better than leaving it out.

    I see a couple of issues with this.

    Let’s suppose we have an area where there are very few stations. So … we decide to use virtual stations. We pick some points, figure out what the local average is for those points, and we include them in the calculation … does this seem like a defensible procedure?

    Because that procedure is exactly equivalent to infilling a dead station.

    The problem is exacerbated by the common procedure of gridcell averaging. If we average all of the new virtual stations plus all of the real stations in a certain gridcell, the gridcell average will NOT be the same as it would be without the virtual stations. This is because the “local” averaging is often based on stations within a certain radius, and not stations within the gridcell.

    Now, I agree that if we’re using a calculation of a smoothly varying “temperature field” rather than gridcell averaging, the inclusion of any number of “virtual stations” whose values are given by the local temperature field will not change that field.

    However, I don’t think that actually solves the problem …

    Puzzling …

    w.
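
A toy case shows why the gridcell average can shift. The readings and the geometry below are assumptions made up for the sketch: two real stations lie inside the gridcell, and a third lies outside the cell but inside the infilling radius.

```python
from statistics import mean

# Hypothetical readings (deg C).
in_cell = [10.0, 12.0]              # real stations inside the gridcell
outside_but_in_radius = [20.0]      # real station outside the cell, inside the radius

# A virtual station placed in the cell takes the radius-based local average.
virtual_radius = mean(in_cell + outside_but_in_radius)

cell_mean_real_only = mean(in_cell)
cell_mean_with_virtual = mean(in_cell + [virtual_radius])

# If the virtual station used only in-cell stations, nothing would change.
virtual_in_cell = mean(in_cell)
cell_mean_with_in_cell_virtual = mean(in_cell + [virtual_in_cell])

print(cell_mean_real_only, cell_mean_with_virtual, cell_mean_with_in_cell_virtual)
```

In this sketch the shift comes entirely from the mismatch between the averaging radius and the cell boundary.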

  134. Willis Eschenbach: Your claim is that using the local average is better than leaving it out.

    It’s a ranking: the mse of the overall estimated mean is smaller when the local average is used in place of the overall average: (1) depending on how different the true local average is from the national average (between-location variance); (2) depending on the precisions of the individual temperature recordings (at location measurement variance); (3) depending on how well the overall distribution (from place to place) can be approximated by a functional form (Gaussian etc, with estimated parameters.)

    The Bayesian estimation procedure does not actually “solve” a problem, in any intuitive sense of “solve”; it uses all distribution information to reduce the mse of the estimate. It’s explained in Samaniego’s book “A comparison of frequentist and Bayesian methods of estimation”, and most other introductions to Bayesian estimation.

  135. Matthew R Marler says:
    July 1, 2014 at 9:58 pm

    Willis Eschenbach:

    Your claim is that using the local average is better than leaving it out.

    It’s a ranking: the mse of the overall estimated mean is smaller when the local average is used in place of the overall average: (1) depending on how different the true local average is from the national average (between-location variance); (2) depending on the precisions of the individual temperature recordings (at location measurement variance); (3) depending on how well the overall distribution (from place to place) can be approximated by a functional form (Gaussian etc, with estimated parameters.)

    Yes, I understand all of that. I’m just trying to understand the further implications of those things.

    For example … IF we can get a lower MSE by infilling data, then we could run the MSE down to zero by using the “virtual stations” approach I outlined above …

    Comments?

    w.

  136. Willis writes “Doing the work, you find that the temperature for Lubbock TX is missing”

    Now suppose Lubbock TX was accidentally replaced with another station’s dataset. This data can’t be right, as the region average is 15C and this dataset has an average of 20C. What to do?

    If you try to use the bad data then you will artificially increase the region’s temperature.

    So in this case it’s truly bad data and must be discarded. But how do you know when to do that? I mean, another station whose data averaged 17C would be even harder to pick out.

  137. For example … IF we can get a lower MSE by infilling data, then we could run the MSE down to zero by using the “virtual stations” approach I outlined above

    Let us not forget: whenever you change, alter, or replace a data point, you must add an uncertainty band to each estimate. For each station, take the difference between its measurement and any kriged trend that doesn’t use that station. At minimum, an infill must add in that mean error.

    At least that much error. For you should consider the mean error of the cases where you might want to infill, for who would want to infill a station that reads close to the kriged trend? In addition, I would want an estimate of how much the kriged trend would change for a random omission of 20% of the control points.

    So, if people are honest about the errors and uncertainty added to the dataset as infills are performed, then it is not possible to drive the mse to arbitrarily low values; instead, the mse will soon increase the more you tamper with the data.

  138. Willis Eschenbach: For example … IF we can get a lower MSE by infilling data, then we could run the MSE down to zero by using the “virtual stations” approach I outlined above …

    Comments?

    You can not reduce the mse to 0. I do not understand why you think that you can.

  139. Matthew R Marler says:
    July 2, 2014 at 8:37 am

    Willis Eschenbach:

    For example … IF we can get a lower MSE by infilling data, then we could run the MSE down to zero by using the “virtual stations” approach I outlined above …

    Comments?

    You can not reduce the mse to 0. I do not understand why you think that you can.

    Thanks, Matthew, I don’t think you can. I’m just following out your line of thought, viz:

    It’s a ranking: the mse of the overall estimated mean is smaller when the local average is used in place of the overall average …

    What I said is that IF we can get a lower MSE by infilling as you claim, then we could infill everywhere by using virtual stations and get an arbitrarily small MSE …

    I wasn’t making a claim … I was using the technique called “reductio ad absurdum” on your claim.

    Comments?

    w.

  140. evanmjones says:
    July 1, 2014 at 9:00 am

    To the half-dozen posts above: Yes, if you are going to make an attempt at a “true signal”, one must, perforce, confine oneself to the subset of stations capable of providing such. That is what we do for our paper.
    ——————

    In my opinion, every Surface Station out there generates its own “true signal”. A signal that is “true” only for that specific Surface Station itself.

    But the big question is, ….. is each individual “true signal” also an accurate and/or correct signal that has not degraded (increased/decreased) “over time” due to physical changes within its local environment?

    Said “true signal” that is generated by every Surface Station is also subject to daily (24 hour) increases and decreases that are directly related to other randomly occurring environmental factors such as, to wit:

    1. length of daytime/nighttime.
    2. the amount of direct solar irradiance each day.
    3. the “daily” angle of incidence of solar irradiance to the surface & objects residing on surface.
    4. the “seasonal” angle of incidence of solar irradiance to the surface & residing objects.
    5. near surface air movement (winds & thermals).
    6. the direction of flow of near surface air movement in respect to Surface Station location.
    7. the temperature of the “inflowing” near surface air mass
    8. the amount of H2O vapor (humidity) in the near surface air.
    9. the amount of H2O vapor (clouds, fogs, mists) in the near surface air.
    10. the time of day, amount of and temperature of the precipitation (H2O) that alights on surface.
    11. the amount of thermal (heat) energy that is retained by and slowly emitted from and/or conducted to the near surface atmosphere relative to the mass density of the object that absorbed said thermal energy. Eg: Heat Islands, large volumes of water, etc.

    Given the above, how can one possibly “filter out” an “over time” degraded signal from one (1) or many of said daily “true signals” …… when no two (2) daily “true signals” are the result of exactly the same environmental factors? None, …. zero, zilch, nada.

    Is Climate Science a new game of Atmospheric Horseshoes ….. where “close” counts …. and the closest “distance” is determined by the highest Degreed player that is doing the “measuring”?

  141. I would like to inquire in a little more detail: what control is used for the kriging of the regional field?

    What do you use for the very first kriging?
    Since kriging a regional field is necessary to determine outliers and breakpoints, it follows that the first kriged field has no adjustments for its control. There may be breakpoints for gaps in the records.

    Some way, somehow, you identify a station that needs an empirical breakpoint because it diverges from the regional trend by a key threshold. You insert the breakpoint. One semi-long record becomes two semi-short records.

    THEN WHAT? Does that altered station go back into the pool of kriging control points?
    What are your options?
    A) Remove the station from the pool available for kriging? There soon would be no stations left to give a regional trend.
    B) Replace the original station with the one with an extra breakpoint. (After all, breakpoints only “improve the data,” right? /sarc.) Before long the kriged field is totally dominated by stations that have been subjected to prior kriging. We now have a perpetual motion machine endlessly kriging regional trends to test for new breakpoints at stations already ground into pulp.

  142. Willis Eschenbach: What I said is that IF we can get a lower MSE by infilling as you claim, then we could infill everywhere by using virtual stations and get an arbitrarily small MSE …

    The minimum achievable mse is obtained by a Bayesian method, as presented in the book by Samaniego, that I cited. How you get from “IF we can get a lower MSE by infilling as you claim” (when my claim referenced Bayesian methods), to “… arbitrarily small MSE” is a mystery to me.

  143. What defines the radius of control for the regional kriging around any given station under study for breakpoints and segment trend testing?
    How far might the most distant control point be?
    What is the minimum number of stations that a kriging must include?
    Is there a maximum number?
    Is there any weighting or inclusion in the control pool based upon the length of the segment(s)?

    I think there is a distinct tendency for breakpoints to appear more frequently when there are more stations in the neighborhood from which to generate a regional kriged field of trends. This is at once logical and a red flag.

    It is logical in that it is difficult to determine an empirical breakpoint in a station record if it is the only one for one hundred miles. Early airport stations, such as DENVER STAPLETON AIRPORT, fit this bill. But later in a station’s history, there are many neighbors that can quibble with its trend and suggest breakpoints, deserved or not.

    It is a red flag in that the density of breakpoints, if deserved by station microsite issues, TOBS, and instrument changes, is unlikely to be a function of the number of other stations within a day’s drive. Yet that appears to be what is happening.

    The empirical breakpoints of a station are at least a partial function of the station’s neighbors and appear not to be intrinsic to the station’s own behavior.

  144. Willis Eschenbach: Your claim is that using the local average is better than leaving it out.

    Matthew R Marler : It’s a ranking: the mse of the overall estimated mean is smaller when the local average is used in place of the overall average: (1) depending on how different the true local average is from the national average (between-location variance); (2) depending on the precisions of the individual temperature recordings (at location measurement variance); (3) depending on how well the overall distribution (from place to place) can be approximated by a functional form (Gaussian etc, with estimated parameters.)

    The Bayesian estimation procedure does not actually “solve” a problem, in any intuitive sense of “solve”; it uses all distribution information to reduce the mse of the estimate. It’s explained in Samaniego’s book “A comparison of frequentist and Bayesian methods of estimation”, and most other introductions to Bayesian estimation.

    This began with a comparison of using the overall mean (equivalent except for degrees of freedom to omitting the observation) to the local mean. Unless all of the local means are the same, you get a lower mean square error in estimating the local means and the overall mean if you impute the local mean for the missing value.

  145. EDIT of 10:24am:
    It is a red flag in that the density of breakpoints, if deserved by station microsite issues, TOBS, and instrument changes, should be independent of the number of other stations within a day’s drive. Yet it appears that the frequency of breakpoints increases with the number of neighboring stations.

  146. Stephen Rasey: The empirical breakpoints of a station are at least a partial function of the station’s neighbors and appear not to be intrinsic to the station’s own behavior.

    It’s not an either/or. It compares the size of the jump at the possible break point to the standard deviation of the neighbor measurements.

  147. Stephen Rasey: EDIT of 10:24am:
    It is a red flag in that the density of breakpoints, if deserved by station microsite issues, TOBS, and instrument changes, should be independent of the number of other stations within a day’s drive. Yet it appears that the frequency of breakpoints increases with the number of neighboring stations.

    That is a really interesting comment. Is it possible that there are more thermometers in the regions that have the greatest change?

  148. Matthew R Marler at 9:58 pm
    Samaniego’s book “A comparison of frequentist and Bayesian methods of estimation”, and most other introductions to Bayesian estimation.

    I have not read that book. Nor am I likely to in the time I have left on this planet.
    What is the essence to the methodology?
    What are the key assumptions?

    Bayesian analysis has as its Achilles Heel the issue of “Prior” estimates of distributions. IPCC accepted a 0-18 uniform prior for the climate sensitivity in an egregious case of Thumb on the Scales in a desperate and transparent effort to keep a climate CO2 sensitivity 4.5 deg C per doubling as a viable high estimate.

    I am a Bayesian. That doesn’t mean I accept the work of all Bayesians in all situations.

  149. @Matthew R Marler at 10:34 am
    Is it possible that there are more thermometers in the regions that have the greatest change?
    “greatest change?”
    Change in what?

    I think the existence of a breakpoint at a station has less to do with what happens at the station and more to do with the number and nearness of neighboring thermometers in the kriged field of trends.

    I further suspect that as more stations are broken into smaller and smaller segments and returned to the pool of kriging control points (a guess on my part, see above), the kriged field of trends takes on a “fabricated” shape with unrealistically small uncertainty. That is what perpetuates the process’s desire to keep breaking segments into ones shorter than ten years.

  150. @Matthew R Marler 10:28 am
    This began with a comparison of using the overall mean (equivalent except for degrees of freedom to omitting the observation) to the local mean. Unless all of the local means are the same, you get a lower mean square error in estimating the local means and the overall mean if you impute the local mean for the missing value.

    Impute the local, regional kriged mean estimate, AND ITS ESTIMATED ERROR, for the missing value.

    Everything you do to a dataset adds error and uncertainty.

    Data Stream: 12, 12, 12, 12, 12, 21, 12, 12, 12, 12,
    Wow, that 21 sure is an outlier. You know, it is probably a transposition.
    PROBABLY. There is a non-zero probability that it is correct. (I know, because it’s my data and it is real.)
    Changing the data stream to
    Data Stream: 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
    is not as correct as
    Data Stream: 12, 12, 12, 12, 12, (12, 12, 21), 12, 12, 12, 12,
    where (12, 12, 21) denotes a low, most likely, high estimate on some non-symmetric distribution.
    The latter preserves the addition of uncertainty to the overall mean result.

  151. Speaking of mse (mean standard error),
    Here is a simple question.
    Given a 30 day month’s daily Tave measurements
    c(9,11,9,11,9,11,9,11,9,11,9,11,9,11,9,11,9,11,9,11,9,11,9,11,9,11,9,11,9,11)

    The month’s Tmean is obviously 10.00,
    but what is the month’s Tmse?
    StDev = 1.000, Count = 30,
    mse = StDev/sqrt(count) = 0.183.

    Wrong.
    It is a trick question.
    We never measure any 9 or 11.
    We never measure any daily Tave.

    We measure 30 daily Tmin and 30 daily Tmax.
    So let’s assume that our
    30 Tmins are c(4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6)
    and
    30 Tmax are c(14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16)
    The monthly Tavg from the 30 Tmins and 30 Tmaxs = 10.000
    but the StDev = 5.099, count = 60
    Tmse = 5.099/sqrt(60) = 0.658 deg C.

    That’s a big error bar when we are dealing with trends of 0.20 deg / decade.
    It is a huge error bar if we break long temperature records into sub-decade long segments. For this reason, if no other, BEST’s scalpel is making a bad situation worse.

    Even if the temperature sensor is pristine, class 1, and error free, the fact that the original data are daily mins and maxes means there is about a half degree of uncertainty in the monthly mean, which will pass through to the monthly anomaly.

    Where in all this work about mse are we keeping track of the original Tmse in the monthly anomalies? Keep in mind, every infilled data point must contain not only the Tmse that should be in the station data, but also the Tmse present in the control points of the kriging field. As you infill the data, you are adding noise you must not ignore.
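
The arithmetic in this comment can be reproduced with a short Python sketch. It assumes the Tmax series alternates 14 and 16 throughout (which is what the quoted StDev of 5.099 and the daily averages of 9 and 11 imply) and uses the population standard deviation, which is what gives StDev = 1.000 for the daily averages. The `se` helper is a name introduced here for illustration.

```python
from math import sqrt
from statistics import mean, pstdev

def se(values):
    """Standard deviation (population form) divided by sqrt(count)."""
    return pstdev(values) / sqrt(len(values))

tave = [9, 11] * 15     # 30 daily averages alternating 9 and 11
tmin = [4, 6] * 15      # 30 daily minimums
tmax = [14, 16] * 15    # 30 daily maximums (assumed to alternate 14 and 16)
both = tmin + tmax      # the 60 values actually measured

print("from daily Tave:", mean(tave), round(pstdev(tave), 3), round(se(tave), 3))
print("from Tmin/Tmax: ", mean(both), round(pstdev(both), 3), round(se(both), 3))
```

Both routes give the same monthly mean of 10, but the spread of the 60 underlying measurements is about five times larger than the spread of the 30 daily averages.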

  152. South African mining engineer & geostatistician Danie G. Krige, father of kriging, died last year.

  153. Stephen Rasey, lots of good comments.

    Yes, estimate the imputed value and the error of its estimate. With multiple imputation, you draw samples from the data (where they exist) and from the distributions of the imputed values, when estimating the precision of the overall estimate.

    The key feature of Samaniego’s book is its evaluation of when the Bayes estimate is likely to be better than the mle and fiducial distribution in practice: basically, iff the prior distribution is accurate enough, what he calls passing the “threshold”. Thus, he supports your idea that the prior is the Achilles heel. Lots of us agree anyway, but it is good to have a scholarly work on the subject.

    @Matthew R Marler at 10:34 am
    Is it possible that there are more thermometers in the regions that have the greatest change?
    “greatest change?”
    Change in what?

    The greatest change in temperature. You said that there are more breakpoints in the regions that have the most thermometers, and I was wondering if those are the regions with the largest number of industrial developments and such in that period of time. If so, the larger number of breakpoints would be an expected outcome.

  154. Stephen Rasey says:
    July 2, 2014 at 12:06 pm

    Speaking of mse (mean standard error),
    Here is a simple question.

    Excellent points, Stephen.

    I have long held that the reason that the standard error of the ocean heat content (OHC) values is incorrectly claimed to be so small is that the error is not carried forward in the step where they remove the “climatology”. Instead, the “anomalies” with the monthly average values removed are treated as observational points with no associated error … when in fact they have an error equal to the standard error of the mean of the monthly measurements.

    w.

  155. But the big question is, ….. is each individual “true signal” also an accurate and/or correct signal that has not degraded (increased/decreased) “over time” due to physical changes within its local environment?

    To that end, we drop any station that has moved to or from an unknown or differently rated location. We also drop an unmoved station where encroaching microsite has changed the rating.

    We also drop a station if TOBS is “reversed” because that produces a large, spurious step change.

  156. Stephen Rasey and Willis Eschenbach: Speaking of mse (mean standard error),

    mse stands for “mean squared error”. It is the sum of the square of the bias and the variance of the estimator. The bias is the difference between the mean of the estimator and the true value. Siting a thermometer next to an air conditioner heat exchanger, for example, adds a bias no matter how precise the thermometer is. Siting and drift are bias issues, related to accuracy.
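
The decomposition stated above, mse = bias squared plus variance, can be verified numerically. The true value and the estimates below are invented for illustration (think of a thermometer that reads about 2 degrees warm because of its siting).

```python
from statistics import mean, pvariance

true_value = 15.0
# Hypothetical repeated estimates from a badly sited but precise instrument.
estimates = [16.5, 17.0, 17.5, 16.8, 17.2]

mse = mean([(e - true_value) ** 2 for e in estimates])   # mean squared error
bias = mean(estimates) - true_value                      # systematic offset
variance = pvariance(estimates)                          # spread about their own mean

print(mse, bias ** 2 + variance)   # the two quantities agree
```

Here the bias term dominates: the readings are tightly clustered, yet the mse is large because they cluster around the wrong value.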

  157. Matthew R Marler says:
    July 2, 2014 at 5:29 pm (Edit)

    Stephen Rasey and Willis Eschenbach: Speaking of mse (mean standard error),

    mse stands for “mean squared error”.

    Ahhh … I was under the impression you were talking of the standard error of the mean (sem, not mse).

    I don’t see how you could ever hope to calculate the bias of the temperature dataset … so I’m not clear now about your previous claim of the effect of infilling on the mse.

    w.

  158. Willis Eschenbach: I don’t see how you could ever hope to calculate the bias of the temperature dataset … so I’m not clear now about your previous claim of the effect of infilling on the mse.

    With a particular data set, you can’t tell for sure that you have the estimate with the least obtained bias, unless you have some other estimate of the parameter of interest. I think I wrote that earlier, but perhaps not on this thread. That the Bayes estimators can’t be beaten over many samples is a theorem; one of the points of Samaniego’s book is that you can’t expect (!) the result unless the prior is sufficiently accurate.

    However, there is no reason to think that the overall mean is a better estimate for infilling than the Bayesian estimate of the local mean, which was where we started.

  159. The greatest change in temperature. You said that there are more breakpoints in the regions that have the most thermometers, and I was wondering if those are the regions with the largest number of industrial developments and such in that period of time. If so, the larger number of breakpoints would be an expected outcome.

    Why would it be the expected outcome? I do not disagree. But the reason there should be more breakpoints in places with a large number of thermometers is key to grokking the problem.

    Breakpoints should reflect a problem with THAT station. Either it is an obvious data gap or a change in recording methodology and/or location that is known from the metadata (with a presumed significant change in local climate), OR it is some UNKNOWN change in the station, a change only highlighted by its difference from surrounding stations.

    This last point is valid if and only if the regional trend is trustworthy enough to identify unknown problems at a station. Trustworthiness of the regional trend has yet to be proven to me. The only peer reviewed paper I’ve seen (from AGU 2012) was on a synthetic case where breakpoints were not an element tested. But I see an enormous problem with the kriging trend as a function of time, as different control points have different segments come and go.

    It is easy for me to see that an increase in breakpoints is a natural artifact of an increase in the number of comparing stations. It signifies a problem with regional instability and a too-sensitive decision criterion for wielding the scalpel.

  160. evanmjones says:
    July 2, 2014 at 4:06 pm

    To that end, we drop any station that has moved to or from an unknown or differently rated location. We also drop an unmoved station where encroaching microsite has changed the rating.
    —————-

    If you keep eliminating stations …. then you have to keep extrapolating temperature over a greater and greater area, which introduces an even greater uncertainty in the temperature record, does it not?

    Of course I don’t think it makes any difference, one way or the other, how you do it.

  161. @Willis Eschenbach at 7/2 3:00 pm

    I have long held that the reason that the standard error of the ocean heat content (OHC) values is incorrectly claimed to be so small is that the error is not carried forward in the step where they remove the “climatology”. Instead, the “anomalies” with the monthly average values removed are treated as observational points with no associated error … when in fact they have an error equal to the standard error of the mean of the monthly measurements.

    I think the claimed error bars on OHC for the years 1955 to 2005 have a much simpler and more basic explanation.

    Furthermore, the real uncertainty is almost always greater than or equal to the uncertainty of the sampled data. For how do you know the uncertainty in the data you did not sample? I can plunge a thermometer 300 m at 50 deg N 20 deg W on July 1, 1972. And do it again on Sept 1, 1973. Those two readings do not remotely define the uncertainty in temperatures for the entire North Atlantic for the decade of the 1970s. But to read Levitus-2009, the uncertainty in the Ocean Heat Content prior to 2003 is based precisely on such poor spatial and temporal sampling of ocean temperatures, with unrealistically narrow uncertainty bands. Prior to 1996, ocean temperature profiles were primarily done for antisubmarine warfare research, and thus concentrated around submarine patrol areas. See Figure 1, Ocean Temperature data coverage: maps b=1960, c=1985, from Abraham, J. P., et al. (2013) (pdf) (from Rasey comment in “Standard Deviation, the overlooked….” WUWT June 15, 2014)

    Suppose we want to study the changes in heights and weights for the adult population of the United States. Since 2004 we have been measuring every 1000th adult that walked into every doctor’s office around the country. Before long you’d have a good tight estimate on the mean with a narrow mean standard error. Of course there are caveats. We are not measuring children, just adults. A little deeper in the fine print, we see that we measure people who “walked” into doctors’ offices. If they are in a wheelchair, or on crutches, they don’t get measured. Deeper down, we realize that people who cannot get to doctors’ offices don’t get measured either.

    But it is still a good dataset! (If you remember the caveats).
    Is there any way we can find some data from before this well designed study started to find any trend in heights of adults back to 1950? “Wow, look what I found! Here is some data from a well designed study, high quality control, that gives me heights and weights of adults. They’ve been doing it for years. It’s just what we need!”
    Oh, who conducted the study?
    “The U.S. Army medical corps.”
    Where did they collect the data?
    “Fort Benning and Fort Jackson Basic Combat Training centers”
    So…. you have a good handle on mean and standard deviation on adults…. who are fit and young enough to join (or be drafted) into the Army — at boot camp. That’s what you are going to splice onto your new doctor visit dataset?

    Levitus 2009 did something very similar.
    Today we have a great ARGO data collection effort. It only measures the top 2000 m of ocean. It only reports from open ocean, so forget the arctic regions. Continental shelves are not deep enough for the buoy hibernation. Restricted seas, like the Sea of Japan, may be over or under sampled. But it is a good dataset.

    Levitus looked for a well collected dataset at times earlier than 2003 when ARGO was started. He found such a dataset that was originally collected under contract with the Office of Naval Research, who were keenly interested in ocean temperature profiles as an element for the hiding and detection of submarines (ours and theirs). Very well collected data. The problem is they focused measurements where submarines patrol: NW Pacific, NE Pacific, NW Atlantic, NE Atlantic, Barents Sea. Within 1500 miles of the coast. (See the Abraham maps in the links above). It is sparse sampling, but high quality where it was done. But just like the apocryphal boot camp study above, they have nice tight narrow standard deviations on an artificially restricted dataset they pass off as representative of the data they collect today.

    It is one thing for Levitus 2009 to have been published. It is quite another that its conclusions are repeated today with a straight face.

  162. If you keep eliminating stations …. then you have to keep extrapolating temperature over a greater and greater area, which introduces an even greater uncertainty in the temperature record, does it not?

    Yes, but there is no getting away from it. And, fortunately, there is oversampling, and even the included Class 1\2s alone have a pretty good distribution. Those are the ones they should use. The others must either be heavily adjusted for microsite, or dropped (dropped, if possible). Besides, what’s the use of extrapolating less if the slack is being taken up by fatally flawed data in the first place?

  163. Stephen Rasey: Why would it be the expected outcome? I do not disagree. But the reason there should be more breakpoints in places with a large number of thermometers is key to grokking the problem.

    On that we agree.

  164. evanmjones says:
    July 3, 2014 at 10:06 am

    Besides, what’s the use of extrapolating less if the slack is being taken up by fatally flawed data in the first place?
    ————————

    My thoughts exactly. It is all “fatally flawed data” with no hope of ever getting it corrected or straightened out.

    But there sure are a lot of people that are obsessed with trying to do that very thing.

  165. @Willis Eschenbach at 3:00 pm
    I have long held that the reason that the standard error of the ocean heat content (OHC) values is incorrectly claimed to be so small is that the error is not carried forward in the step where they remove the “climatology”. Instead, the “anomalies” with the monthly average values removed are treated as observational points with no associated error … when in fact they have an error equal to the standard error of the mean of the monthly measurements.

    Willis, I agree with you completely. I believe with almost ALL work with anomalies, a poor job is being done with the accounting of uncertainty in the statistics.

    In my rather lengthy reply of July 3, 9:29 am I did not mean to disagree or discount your observation. I chose instead to pick upon a separate, easier to identify, and more basic error in the OHC (Ocean Heat Content) dataset from measurements made prior to 2003 — unrealistic measurement uncertainty coming from a highly biased and sparse sampling (in time and space) of Temperatures below 300 m.

    Levitus-2009 reports a 1.8 x 10^22 Joule increase from 1955 to 2008. Keeping in mind that it takes 27.5 ZJ (= 2.75 x 10^22 Joules) to raise the 0-2000 m interval of the ocean 0.01 deg C, that means Levitus’s total temperature change is less than 0.08 deg C. As in your Decimals of Precision post, if 100,000 measurements per year from ARGO might justify a 0.005 deg C accuracy, 1000 measurements per year can do no better than 0.05 deg C and 10 measurements/year can do no better than 0.5 deg C.
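
    The sample-size figures in the paragraph above follow a 1/sqrt(N) rule. A minimal sketch, assuming independent measurements and taking the 100,000-per-year and 0.005 deg C figures from the comment as the reference point (the function name is hypothetical):

```python
import math

# Reference point taken from the comment: ARGO-scale sampling.
REF_N = 100_000    # measurements per year
REF_ERR = 0.005    # deg C accuracy credited to that sample size

def scaled_error(n, ref_n=REF_N, ref_err=REF_ERR):
    """Error for n measurements, assuming it scales as 1/sqrt(n)."""
    return ref_err * math.sqrt(ref_n / n)

for n in (100_000, 1_000, 10):
    print(f"{n:>7} measurements/year -> about {scaled_error(n):.3f} deg C")
```

    A hundredfold drop in sample count costs a factor of ten in accuracy, which is where the 0.05 and 0.5 deg C figures come from. This is a best case, since it ignores the sampling bias discussed above.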

    On top of all that sample bias, I too believe that they are disposing of the sample uncertainty when they are taking their anomalies. But we have to get deeper in the weeds to prove it. Such as how are they binning in time and space the sub 300 meter readings? What do they do with cells that have zero, 1, or two readings? What happens if they have 5 measurements in a vol-time grid cell for May 1965, but no more until July 1969? What happens if a sub 1000m vol-grid cell is never sampled in January until 2004 with the first ARGO to visit the cell?

  166. RE: my July 2, 12:06 pm
    Speaking of rmse (mean standard error),…..
    We never measure any daily Tave.

    We measure 30 daily Tmin and 30 daily Tmax.
    So let’s assume that our
    30 Tmins are c(4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6,4,6), and
    30 Tmax are c(14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16,14,16)
    The monthly Tavg from the 30 Tmins and 30 Tmaxs = 10.000 deg. C
    but the StDev = 5.099, count = 60
    Trmse = 5.099/sqrt(60) = 0.658 deg C.
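
    The figures above can be checked with a short sketch. It assumes the Tmax series alternates between 14 and 16 (the reading that reproduces the quoted mean of 10.000 and StDev of 5.099) and that StDev is the population standard deviation of all 60 readings:

```python
import math
import statistics

tmin = [4, 6] * 15      # 30 daily minimums, deg C
tmax = [14, 16] * 15    # 30 daily maximums, deg C (assumed alternating 14/16)

readings = tmin + tmax
mean = statistics.fmean(readings)
stdev = statistics.pstdev(readings)       # population standard deviation
se = stdev / math.sqrt(len(readings))     # standard error of the monthly mean

print(f"Tavg  = {mean:.3f} deg C")
print(f"StDev = {stdev:.3f}, count = {len(readings)}")
print(f"Trmse = {se:.3f} deg C")
```

    Using the sample standard deviation (dividing by n-1) instead would give a slightly larger StDev, so the choice of the population form is an inference from the quoted 5.099.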

    What is the Trmse of that month’s ANOMALY?

    Answer #1. The anomaly is just the average shifted by the average of the base period. It is just a bulk shift. For the purposes of uncertainty analysis, we are subtracting a constant and we are not adding uncertainty. It makes no difference to the slope of the trend if we subtract 9, 10, 11 or pi().

    Answer #1 is correct if and only if the adjustment is the same for all months.

    For instance, if we were to look at the trend in May temperature anomalies over the period 1980 to 2010, and we subtracted from all of them the mean of (May temperatures over 30 years) or Tm(May, 1980-2010), we would not have to be concerned with the mean standard error of that estimate (Trmse(May, 1980-2010)).

    But when we are combining anomalies for May, June, July, …. April, then the Trmse for each month becomes important.

    Answer #2. The Anomaly is a bulk shift by an uncertain quantity: (Tm(May, 1980-2010), Trmse(May, 1980-2010))

    If we use the example of 30 Tmins and 30 Tmaxes for the month of May above, and we assume it repeats constantly for 30 years in the same month of May, then
    Mean: Tm(May, 1980-2010) = Tm(May) = 10.000 deg. C.
    RMSE (Standard Error of mean)
    Trmse(May, 2010) = 0.658 deg C
    Trmse(May, 1980-2010) = 0.120 deg C
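
    The 0.120 figure is the single-month standard error divided by sqrt(30). A minimal sketch, assuming the 30 Mays are independent and each carries the same 0.658 deg C standard error:

```python
import math

se_single_may = 0.658   # deg C, standard error of one month's mean (from the example)
n_years = 30            # number of Mays averaged into the baseline (assumed independent)

# Standard error of the 30-year baseline mean.
se_baseline = se_single_may / math.sqrt(n_years)
print(f"Trmse(May, 1980-2010) = {se_baseline:.3f} deg C")
```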

    So there is an error bar proportional to 0.120 deg C added to the data when we take the anomaly. Each month gets this error bar, just from taking an anomaly from 30 years of constant temperatures where the daily difference between high and low is between 8 and 12 deg C and averages 10 deg C.

    So, when you take the Anomaly for the Month
    TA(May,2010) = T(May,2010) – T(May,1980-2010) and we treat these temperatures as (mean,std Deviation)
    Then
    TA(May,2010) = (10, 0.658) – (10 , 0.120) = (0, 0.669)
    (these standard deviations add like the Pythagorean Theorem)
    So while the mean Anomaly is zero (as desired), its mean standard error is increased to 0.669 deg. C.
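
    The subtraction of two (mean, standard error) pairs can be sketched as below. The helper name is hypothetical, and the quadrature rule assumes the two errors are independent:

```python
import math

def subtract_with_uncertainty(a, b):
    """Subtract two (mean, standard error) pairs, assuming independent errors."""
    (mean_a, se_a), (mean_b, se_b) = a, b
    # Means subtract; standard errors combine in quadrature.
    return mean_a - mean_b, math.hypot(se_a, se_b)

may_2010 = (10.0, 0.658)       # T(May, 2010)
may_baseline = (10.0, 0.120)   # T(May, 1980-2010)

anomaly, anomaly_se = subtract_with_uncertainty(may_2010, may_baseline)
print(f"TA(May,2010) = ({anomaly:.3f}, {anomaly_se:.3f})")
```

    Because the terms add in quadrature, the larger single-month error dominates the result, which is the point made in the moral that follows.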

    Moral of the story: the Trmse of the 30 year average is not insignificant, easily a couple tenths of a degree. It matters when you attempt to compare one month against another month to remove an unknown but estimated seasonal signal. The overall error bar of an individual month’s anomaly TA(Month,year) is probably mostly composed of the uncertainty from the single month’s Trmse(Month, Year) derived from the month’s measured Mins and Maxes.

  167. Addendum to my July 5, 2:58 pm
    Answer #1 is correct if and only if the adjustment is the same for all months.

    The bulk shift of a station-month temperature record to a Temperature Anomaly record is the addition of a constant with no uncertainty IF and ONLY IF the shift is the same for all months AND ALL STATIONS that it will be compared against.

    The head post is about “Problems with the Scalpel Method” of BEST.

    BEST uses its scalpel by comparing the Temperature Anomaly record of a station with a kriged regional field derived from the Temperature Anomaly records of “nearby” stations. This comparison means that the Trmse, the mean standard error of mean Temperature, must be included in the uncertainty analysis when comparing between stations of the same month.

    I have shown above that the individual station error bars of the mean Temperature of any month for any station, derived from the only observed measurements, the daily Tmin and Tmax, are greater than 0.6 deg C if the daily min and max range is only 10 deg. C. I have also shown that with a 0.6 deg uncertainty per month, a 30 year average for the month will also have an uncertainty of at least 0.1 deg C. Every station in the regional homogenization grid experiences these same uncertainties.

    I submit that with these real uncertainties in means, as well as in average maxes and average mins that derive from the raw recorded station data, that it is impossible for BEST or anyone else to determine an empirical breakpoint of any station based upon fit to a regional trend.

    The regional trend, which we have no reason to believe is a monotonic surface, has significant fuzzy thickness from error bars. It is a surface where each control point has a Trmse of over 0.6 deg C and an anomaly adjustment uncertainty of 0.1 deg C. The derived kriged surface, with its significant uncertainty thickness, is then applied to a subject station, which also possesses a 0.6 deg C temperature uncertainty and a 0.1 deg C uncertainty in its anomaly. Under such circumstances, it is unlikely that any station will exceed the errors present to earn an empirical breakpoint, much less an average of over 5 breakpoints per station.

    There are many reasons I reject the BEST scalpel and trend reliance to tease out a climate signal. From day one I had objections based upon information theory and the loss of low frequency (Climate) signal caused by the scalpel and emphasis on temperature trends. The results of BEST’s work on individual stations don’t make sense: It can find 20 station adjustments at Luling, TX, 8 station adjustments at DENVER STAPLETON AIRPORT (but misses the opening and closing of the airport), yet it misses the fire-bombing of Tokyo in March 1945.

    Even the greatest supporters of BEST must acknowledge the BEST Scalpel requires precision in the data to justify 0.1 degree breakpoint shifts — a precision that does not exist when the raw data are daily mins and maxes.

  168. Understanding adjustments to temperature data
    by Zeke Hausfather

    http://judithcurry.com/2014/07/07/understanding-adjustments-to-temperature-data/

    543 Comments in less than 18 hours.

    Why Adjust Temperatures?
    What are the Adjustments?
    Quality Control
    Time of Observation (TOBs) Adjustments
    Pairwise Homogenization Algorithm (PHA) Adjustments
    Infilling
    Changing the Past?

    This will be the first post in a three-part series examining adjustments in temperature data, with a specific focus on the U.S. land temperatures. This post will provide an overview of the adjustments done and their relative effect on temperatures. The second post will examine Time of Observation adjustments in more detail, using hourly data from the pristine U.S. Climate Reference Network (USCRN) to empirically demonstrate the potential bias introduced by different observation times. The final post will examine automated pairwise homogenization approaches in more detail, looking at how breakpoints are detected and how algorithms can be tested to ensure that they are equally effective at removing both cooling and warming biases.

  169. More on Zeke’s post at Curry on July 7,
    of the 547 comments so far, 98 of them are Steven Mosher with sentences so short, terse, and abbreviated of meaning, his word processor must use the BEST scalpel as a plug-in.

    Paul Matthews has a short comment that sums up a good deal of the thread.

    Paul Matthews | July 7, 2014 at 9:51 am | Reply
    Congratulations, you’ve written a long post, managing to avoid mentioning all the main issues of current interest.

    “Having worked with many of the scientists in question”
    In that case, you are in no position to evaluate their work objectively.

    “start out from a position of assuming good faith”
    I did that. Two and a half years ago I wrote to the NCDC people about the erroneous adjustments in Iceland (the Iceland Met Office confirmed there was no validity to the adjustments) and the apparently missing data that was in fact available. I was told they would look into it and to “stay tuned for further updates” but heard nothing. The erroneous adjustments (a consistent cooling in the 1960s is deleted) and bogus missing data are still there.
    So I’m afraid good faith has been lost and it’s going to be very hard to regain it.

  170. From Zeke’s Curry paper, at the end of the Pairwise Homogenization Algorithm (PHA) Adjustments section.

    With any automated homogenization approach, it is critically important that the algorithm be tested with synthetic data with various types of biases introduced (step changes, trend inhomogeneities, sawtooth patterns, etc.), to ensure that the algorithm will identically deal with biases in both directions and not create any new systemic biases when correcting inhomogeneities in the record. This was done initially in Williams et al 2012 and Venema et al 2012. There are ongoing efforts to create a standardized set of tests that various groups around the world can submit homogenization algorithms to be evaluated by, as discussed in our recently submitted paper. This process, and other detailed discussion of automated homogenization, will be discussed in more detail in part three of this series of posts.

    The Williams link is a pdf. The Venema link is to an abstract. Neither makes any reference to “sawtooth”.

  171. Had to repost this gem of an observation from Patrick B, with Mosher’s typical “read the literature” (where) retort.

    Patrick B | July 7, 2014 at 9:48 am | Reply

    How could you have written this article without once mentioning error analysis?

    Data, real original data, has some margin of error associated with it. Every adjustment to that data adds to that margin of error. Without proper error analysis and reporting that margin of error with the adjusted data, it is all useless. What the hell do they teach hard science majors these days?

    Steven Mosher | July 7, 2014 at 11:24 am | Reply
    the error analysis for TOBS for example is fully documented in the underlying papers referenced here.

    It’s true. Patrick B’s comment is the first time “error” appears in the time stream. It is not in the head post.
