BEST practices step uncertainty levels in their climate data

Note the step change. At about 1960, the uncertainty levels plummet, meaning BEST is claiming we became more than twice as certain of our temperature estimates practically overnight.

Brandon Shollenberger writes in with this little gem:

I thought you might be interested in a couple posts I wrote discussing some odd problems with the BEST temperature record.  You can find them here:

http://www.hi-izuru.org/wp_blog/2015/01/how-best-overestimates-its-certainty-part-2/

http://www.hi-izuru.org/wp_blog/2015/01/how-best-overestimates-its-certainty-part-1/

But I’ll give an overview.  BEST calculated its uncertainty levels by removing 1/8th of its data and rerunning its averaging calculations (then examining the variance in the results).  I’ve highlighted two problems I haven’t seen people discuss before.  If you’re familiar with the Marcott et al reconstruction’s inappropriate confidence intervals, some of this may sound familiar.
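That resampling scheme is a "delete-a-group" jackknife. The sketch below is my own toy illustration in Python, not BEST's code: the station data are invented and a bare mean stands in for BEST's actual averaging method.

```python
import numpy as np

rng = np.random.default_rng(0)
stations = rng.normal(0.0, 0.5, size=(80, 120))   # 80 fake stations x 120 months

# Split the stations into 8 groups; rerun the (here: trivial) averaging
# with each group of stations removed in turn
groups = np.array_split(rng.permutation(stations.shape[0]), 8)
series = []
for g in groups:
    keep = np.setdiff1d(np.arange(stations.shape[0]), g)
    series.append(stations[keep].mean(axis=0))
series = np.array(series)                          # 8 alternative averages

# The spread among the 8 series is read as the statistical uncertainty
uncertainty = series.std(axis=0, ddof=1)
print(uncertainty.shape)                           # one value per month
```

The whole point of Brandon's critique is what this sketch leaves out: the homogenization step happens once, before the resampling, so the spread only measures variance among subsets of already-homogenized data.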

First, BEST only reruns its averaging calculations to determine its uncertainty.  It does not rerun its breakpoint calculations.  As you may know, BEST breaks data from temperature stations into segments when it finds what it believes to be a “breakpoint.”  The primary way it looks for these breakpoints is by comparing stations to other stations located nearby.  If a station seems too different from its neighbors, it will be broken into segments which can then be realigned.  This is a form of homogenization, a process whereby stations in the dataset are made to be more similar to one another.
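To make the neighbor-comparison idea concrete, here is a toy breakpoint detector in Python. The threshold, the statistic, and the synthetic data are all my own inventions for illustration, not BEST's actual algorithm.

```python
import numpy as np

def find_breakpoint(station, neighbor_mean, threshold=1.0):
    """Flag the month where the station-minus-neighbors difference
    shifts by more than `threshold` (a crude changepoint check)."""
    diff = station - neighbor_mean
    best_k, best_shift = None, 0.0
    for k in range(12, len(diff) - 12):        # require a year on each side
        shift = abs(diff[k:].mean() - diff[:k].mean())
        if shift > max(best_shift, threshold):
            best_k, best_shift = k, shift
    return best_k

rng = np.random.default_rng(1)
neighbor = rng.normal(0.0, 0.2, 240)           # invented neighbor average
station = neighbor + rng.normal(0.0, 0.2, 240)
station[120:] += 1.5                           # artificial station move at month 120
print(find_breakpoint(station, neighbor))      # flags a month near 120
```

Once such a point is flagged, the two segments are treated as separate records that can be realigned, which is what pulls the station back toward its neighbors.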

This process is not repeated when BEST does its uncertainty calculations.  The full data set is homogenized, and subsets of that homogenized data set are compared to determine how much variance there is.  This is inappropriate.  The amount of variance BEST finds within a homogenized data set does not tell us how much variance there is in BEST’s data.  It only tells us how much variance there is once BEST is finished homogenizing the data.

Second, to determine how much variance there is in its (homogenized) data set, BEST reruns its calculations with 1/8th the data removed, eight times.  This produces eight different series.  When comparing these different series, BEST realigns them so they all share the same baseline.  The baseline period BEST uses for its alignment is 1960-2010.

This is a problem.  By aligning the eight series on the 1960-2010 period, BEST artificially deflates the variance between those series in the 1960-2010 period (and artificially inflates the variance elsewhere).  That makes it appear there is more certainty in the recent portion of the BEST record than there actually is.  The result is there is an artificial step change in BEST uncertainty levels at ~1960.  This is the same problem demonstrated for the Marcott et al temperature record (see here).
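The effect is easy to reproduce with synthetic data. The Python below is my own demonstration (using far more than eight copies so the effect stands clear of sampling noise): re-baseline noisy copies of a flat series on one window, and their spread shrinks inside that window while inflating outside it.

```python
import numpy as np

rng = np.random.default_rng(2)
# Many noisy copies of a flat "true" series
copies = rng.normal(0.0, 1.0, size=(2000, 160))

# Re-baseline every copy on the last 20 "years" (analogous to 1960-2010)
baseline = slice(140, 160)
aligned = copies - copies[:, baseline].mean(axis=1, keepdims=True)

spread_in = aligned[:, baseline].std(axis=0).mean()
spread_out = aligned[:, :140].std(axis=0).mean()
print(spread_in < spread_out)   # True: less spread inside the baseline window
```

Since every copy was drawn from the same distribution, the true uncertainty is identical everywhere; the step in spread at the baseline boundary is purely an artifact of the alignment.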

All told, BEST’s uncertainty levels are a complete mess.  They are impossible to interpret in any meaningful way, and they certainly cannot be used to try to determine which years may or may not have been the hottest.

151 thoughts on “BEST practices step uncertainty levels in their climate data”

  1. I just got an alert about this post going up. I’m glad it has because I think these issues deserve attention, but I’m a little annoyed at myself because of the timing of my posts. When I published a little eBook giving the first part of my overview of the hockey stick debate:
    http://www.amazon.com/Hockey-Stick-Climate-Wars-Introduction-ebook/dp/B00RE7K3W2/
    last month, I had planned to finish the second part this month. Even when I wrote these two posts, I thought I could still manage to do it if I focused entirely on getting it done. That’s not going to happen if I want to discuss what I wrote about BEST.
    Oh well. If this can spark some discussion of the BEST methodology, it’ll be worth it. I’ll just have to delay the publication a bit. And yes, I did mostly just write this comment to advertise that eBook. I know it’s crass, but I think a number of people here might genuinely be interested in it.

    • Since the subject of the post is error estimation, I will repost my concluding paragraphs from the link:

      Returning to BEST, all those fragments of temperature records are equivalent to the band-pass seismic data. Finding the long term temperature signal is equivalent to inverting the seismic trace, but the error in the data must also accumulate as you go back in time. Since the temperature record fragments are missing the lowest frequencies, where is the low frequency control in the BEST process? In the seismic world, we have the velocity studies to control the low-frequency result.
      What does BEST use to constrain the accumulating error? What does BEST use to provide valid low-frequency content from the data? What is the check that the BEST result is not just a regurgitation of modelers’ preconceptions and contamination from the suture glue? Show me the BEST process that preserves real low frequency climate data from the original temperature records. Only then can I even begin to give Berkeley Earth results any credence.

      Also refer to the June 28 – July 7, 2014 thread:
      Problems with the Scalpel Method – Willis Eschenbach

      • Dedekind’s prior post, to which Willis was trying to get BEST to respond, was about ‘scalpeling’. Technically it is called Menne stitching in homogenization algorithms. And the inherent warming bias Dedekind explained to WUWT has been confirmed for actual stations. See Zhang et al., “Effect of data homogenization…” in Theor. Appl. Climatol. 115: 365-373 (2014). The results stem from the anchoring on the most recent data.

  2. It seems to me the issue here is random error versus systematic error. The former would be errors due to such things as imprecision in reading instruments and the effects of limited sampling. The latter would be things like instruments being out of calibration or measurements not being made according to protocol.
    It is common in scientific papers for uncertainty estimates (when they are provided at all) to only include random errors, for the simple reason that systematic errors are very hard to estimate, because what really matters are the systematic errors you don’t know about or can’t control. The proper thing to do in such cases is to say that the error estimates are for random errors only.
    It sounds like BEST is making an estimate of random errors. I think the very real issues Anthony raises pertain to possible systematic errors. So the question then is whether BEST has properly identified their error estimates as being for random errors only.
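The random-versus-systematic distinction above can be made concrete with a small simulation (my own illustration, not anything from BEST): averaging many readings shrinks random error roughly as 1/sqrt(N), while a shared calibration bias survives averaging untouched.

```python
import numpy as np

rng = np.random.default_rng(3)
true_temp = 15.0
bias = 0.3                      # a shared systematic (calibration) error

for n in (10, 1000, 100000):
    readings = true_temp + bias + rng.normal(0.0, 1.0, n)
    print(n, round(readings.mean() - true_temp, 3))
# The residual error converges to the bias (0.3), not to zero
```

No amount of extra data removes the bias term; it can only be found by calibration against an independent reference, which is why systematic errors are so hard to estimate.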

    • Mike M., sadly, BEST doesn’t even handle random errors properly. Resampling a data set is a common way to try to estimate the uncertainty in that data set, but BEST rebaselines its results every time it resamples them. The problem with this is there is uncertainty in those baselines, and BEST doesn’t account for it (even though it has a variable to store them).
      The issues are more clear if you read the posts at my site, but basically, if you want to see how much variance there is between two series, you have to be careful how you compare them. If you set both series to have the same mean for only one segment (in this case, the 1960-2010 segment), you will force the series to agree better in that segment and worse anywhere else. That causes your variance, and thus your uncertainty, to come out wrong.
      And of course, if you’re going to test your methodology on a subset of your data, you have to test the entire methodology. BEST doesn’t. BEST doesn’t rerun its breakpoint calculations. That means BEST does nothing to try to establish how much uncertainty there is in one of the most important steps of its process.

      • Brandon
        Thank you for delving into this data. I would encourage all to take a look at BEST and see what they think about the various graphs etc. There are many issues and questions that pop out just from a cursory review. Keep up the good work.

      • Brandon Shollenberger,
        It is not clear to me that the use of a common reference period is inappropriate with respect to computing uncertainties in an anomaly series. I can see arguments on both sides. The test would be whether the uncertainty time series depends on the choice of reference period. Has that been tried?
        The issue with the homogenization surely seems to be a question of systematic, not random, error. So it is not obvious that it should be included in a test of random errors, although it is obvious that it is important to derive a test of the homogenization. That is, after all the issue: Does the homogenization introduce a bias into the trend?
        It appears as if BEST has largely adopted the procedures used by others, rather than examining those procedures de novo. Having an “honest broker” using the same procedures would be valuable if the issue were fraud in handling the data. But that is not the case. The real issue is whether confirmation bias has led to the acceptance of procedures that would not have been accepted otherwise.

      • Mike M, the math involved in showing the choice of baseline period causes the step change evident in the graph is fairly simple. I gave a simple demonstration of it in the first post I wrote on this. It shows how, if a different baseline period were used, a different set of uncertainties would have been generated.
        But really, there’s no reason uncertainties should plummet just because they are during your baseline period.

      • Breakpoints are not actually very important.
        The empirical breakpoint approach was pretty rigorously tested to both increases in breakpoints and decreases in breakpoints. Not much effect.
        The reason for this is that globally adjustments dont amount to much as many of us have shown long before berkeley ever did its project.

      • Steven Mosher, it’s easy to say something “doesn’t matter,” but saying it doesn’t make it so. This is especially true when your argument is in the form:

        The empirical breakpoint approach was pretty rigorously tested to both increases in breakpoints and decreases in breakpoints. Not much effect.
        The reason for this is that globally adjustments dont amount to much

        There is no reason people should only care whether or not breakpoints have effects “globally.” Even if breakpoints didn’t change the global trend at all, they could still be hugely important.
        Heck, if we take Zeke’s response to me as true, a significant portion of why the step change in uncertainty I highlighted exists (or at least, exists in that particular location) may well be because of two breakpoints.

      • Steven Mosher does a good job of showing what is wrong with BEST’s handling of things:

        Testing makes it so.
        Looking at the difference between adjusted and unadjusted makes it so.

        BEST doesn’t publish results of such tests. BEST doesn’t justify its methodological decisions. All it does is say, “It doesn’t matter, trust us.” And if you don’t just trust them, they say, “Test it yourself” even though they know such testing can take at least weeks of runtime if one knows exactly what to do. People needing to familiarize themselves with the problem could need months of runtime.
        The reality is BEST ought to discuss its methodological decisions so it can justify them and explain any caveats that may be involved in them. It doesn’t. It claims to be completely open and transparent, but it intentionally fails to disclose known issues with its work. And if other people try to discuss those issues, it tells them to bugger off.
        Incidentally, a lot of people may not have a computer they can set aside for months solely to test BEST’s methodology to figure out the results BEST chooses not to release. I know I don’t.

    • I am not following the idea that since systematic errors are very hard to estimate and what really matters are the systematic errors you don’t know about or can’t control they should not be considered.
      Limited sampling seems to be a systemic error. Reporting the random error of guessing or deriving data due to lack of data is just hiding the systemic error of not having enough reliable data. It’s saying since we do not know how much sampling is required to get a reliable number, we’ll just assume that margin of error no matter how large or small will favor warming and cooling equally and so will ignore it.
      I get that I cannot plan driving my car on premise my speedometer at some unknown point in time no longer recording speed accurately (hopefully I would figure it out after 1 or 2 speeding tickets). However I can predict how many cars will be driven improperly since we have an extensive accident history and annual body count due to driving not being made according to protocol so to speak. So I believe we can indeed add the systemic error called human error to the plot.

    • Mike M, all the groups working on global surface air temperature — UEA/UKMet, BEST, GISS — all assume that sensor measurement error is random and averages away.
      The assumption is promiscuous and entirely unjustifiable. Nevertheless, the Central Limit Theorem is assumed to apply throughout, and they clutch to it in a death grip.
      They ignore systematic measurement error entirely, and until recently never even mentioned it in their papers. Available calibration experiments show that temperature sensor systematic error is large and persistent. Solar loading has the greatest impact.
      I’ve published on this problem here (870 KB pdf) and here (1 MB pdf). From systematic sensor measurement error alone, the uncertainty in the 20th century global surface air temperature record is about (+/-)0.5 C.
      If they ever admitted to the systematic error, obviously present, they’d end up with nothing to report. The prime evidence base of AGW would vanish. One can understand the reluctance, but it’s incompetent science regardless.
      I’ve corresponded with Phil Brohan and John Kennedy at UK Met about the papers (Phil contacted me). They can’t refute the work, but have chosen to ignore it. Apparently likewise, everyone else in the field, too.

      • Having read your first linked paper I think it is very important.
        Your conclusion isn’t that the world hasn’t warmed (believable) but rather that the measurement uncertainties are underestimated and that the magnitude of the warming is unknowable. That is very relevant to the core of this website.
        “The temperature sensor at each station will exhibit a unique and independent noise variance” – of course!
        So you can’t just average them together to get about zero-ish. That makes sense.
        In my opinion, this deserves a full post.

      • M Courtney, thanks for the positive feedback. After that paper was published, the response amazed me. The idea of systematic error seemed beyond the grasp of so many. There may be a follow-up. Most of the analysis is done. But I’m trying to publish now on the reliability of climate models. The reviewer responses to that submission have been, if anything, more amazing. Modelers know nothing of physical error analysis. That would be worth a post in its own right.

      • Mike,
        You make a really good point here.
        Systemic errors are those which occur across a large proportion of data points. For GMST the most obvious one is the post-data collection processing which is run on every item of data.
        Because a huge number of weather stations are used, you’d expect the raw data to have a random distribution of instrument calibration and measurement protocol. (Unlike when using one instrument in a lab experiment.)
        However, the more you standardise globally, the more possibilities you have for systemic errors arising. My understanding is that there was a big standardisation effort around about 1997, which would increase the possibility of systemic error due to weather-station design, instrumental hardware decay, etc.
        I don’t believe statistics are useable on GMST.

  3. BEST has other problems also, and they are not subtle.
    They use regional expectations to QC outliers. To see what that does, just look at BEST 166900. Rejected 28 months of extreme cold to turn no trend into a warming trend. 166900 is Amundsen Scott at the south pole, arguably the best and certainly the most expensive station on the planet. Nearest station is McMurdo, 1300 km away and 2700 meters lower on the coast. Altogether different climate.
    Concerning their website representation of the ingested data, look at BEST 157455. BEST reports two station moves in the metadata (the red diamonds), one about 1972 and one about 2007. The GHCN metadata file has nothing. Easy to check: http://www.ncdc.noaa.gov/data-access/land-based-station-data/ Go to the left side menu, click metadata, enter Puerto Casades.
    Found a back door to the summarized ingested BEST data at berkeleyearth.lbl.gov/auto/Stations/TAVG/text/157455-TAVG-Data.txt. The two station moves came from ingested WMO metadata. Flagged by 1 rather than 0 in column 5. The problem is, these were in 2007 and 2013. BEST shows a 1972 station move that is not in the metadata, and not one that is.
    Altogether not confidence inspiring. Yet still better than GHCN or GISS or BOM ACORN. You can spot check BEST against other stations that have plainly been improperly fiddled. There are quite a few examples in the essay When Data Isn’t in the ebook Blowing Smoke.

    • “The two station moves came from ingested WMO metadata. Flagged by 1 rather than 0 in column 5. The problem is, these were in 2007 and 2013.”
      Isn’t it the 1 in col 6? There is one in Feb 1971, one in Sep 2005.

      • You are right about column 6, I just checked. But in my .txt it still appears that the breaks header is the fifth column, not the sixth. The headers did not align well with the actual numerical columns, and there was also a wrap-around problem on every line when saved as a .pdf due to page width incompatibility. My apologies if I misinterpreted your output. I stand corrected on my point 2, but not point 1.

      • It’s a good post, saw it earlier. And makes things even more confusing. Shub found three different station location coordinates in the BEST ingestion once he had the backdoor open, yet whether column 5 or column 6, there are only two move flags. He has Google Earthed all three locations, and all are plausible. Probably just shows how unclean all this stuff is.
        Btw, I think Nick is possibly right about column 6 and I misread because of the print wrap around in the saved .pdf.
        I just went and looked at the archived BEST output again. 1971 is certainly possible rather than 1972. But apparently not 2005. The flagged move is too far beyond the decade midpoint, still looks about 2007. Could be mid 2006, but not earlier. Plotting bug? Dunno.
        I will repeat again what was said upthread. When you check BEST versus GISS, GHCN, BOM ACORN, it still appears more reliable and less provably biased. Reykjavik, Iceland; Valentia, Ireland; De Bilt, Netherlands; Sulina, Romania; Rutherglen, Australia, …

    • Do you think it would matter if the station was moved from the east side of Amundsen Scott station to the west side? The glacier is presumably moving some direction anyway as well as a small amount higher every year.
      Amundsen Scott is staffed with 200 scientists in the summer and 50 in the winter. I’m assuming they know what they are doing and no one should be mucking around with the already quality controlled weather data from this station, taken by people risking their lives. It is immoral really.
      The data is here and BEST should just leave it alone. I noted this problem to Zeke over 2 years ago and nothing has been fixed. BEST has taken Amundsen Scott’s zero trend since 1957 and turned it into 1.0C of warming. That means BEST’s algorithms are biased to find breaks that occur on the downside and far, far fewer on the warming side. Hence, all of their temperatures are biased upwards. They can prove this statement wrong very easily but they will not supply the basic info to prove this wrong.
      http://www.antarctica.ac.uk/met/READER/surface/Amundsen_Scott.All.temperature.html

    • Steven was very generous of his time in explaining to me how BEST dropped those cold outliers at Amundsen. The problem for me is that there is little doubt that they were perfectly valid observations of real temperature conditions. Just as in the US southeast a passing summer thunderstorm can create a cool anomaly. Or a mistral, etc. There just seem to be far more regularly occurring events which can create a valid cool outlier than a hot one. And eliminating them just may bias the results.
      I suggested that BEST create a switch which would allow a user to toggle the outliers in or out of the mix to see how it affects results.

    • Don’t get cocky Steve. I’ve spot-checked BEST break-points for stations where I have information that isn’t in GHCN. The breakpoint algorithm regularly misses quite large station moves. However I can’t prove that the break-points that are inserted are actually wrong, since there is always the possibility of undocumented changes in e.g. vegetation.

  4. The reduction in uncertainty from the early fifties to early sixties could well be real. For instance, the IGY (1957-58) was during that period, resulting in a big increase in global monitoring.

    • Mike M., the number of stations which exist is relevant, but it is tied to spatial uncertainty. The uncertainty I dealt with in these two posts is statistical uncertainty. I’ll quote the explanation I gave in a comment at my site. My post quoted part of the BEST code:

      sp_unc = sp.(['unc_' types{m}]);
      st_unc = st.(['unc_' types{m}]);
      unc = sqrt( st_unc.^2 + sp_unc.^2 );
      Issues with how much of the globe is covered by temperature stations would go in the sp_unc variable. The issues I’m discussing would go in the st_unc variable.
      That said, I get that the improved coverage for the 1960-2010 period might make someone think that is the best choice of baseline period. The problem is, as soon as you choose a segment of your record to be your baseline period for comparing series (such as the eight series created in the BEST jackknife calculations), you reduce the variance of that period and inflate the variance outside that period. That distorts your uncertainty levels. Even worse, it distorts them in a way which fits your expectations (e.g. decreasing the uncertainty in the 1960-2010 period while increasing the uncertainty before it) so you’re less likely to notice it.
      The bias this mistake causes may fit BEST’s assumptions, but you can’t just take any answer that fits your assumptions as proof of those assumptions. If BEST hadn’t made a boneheaded mistake here, the uncertainty of modern times would be higher.
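For what it's worth, the quoted snippet simply adds the two uncertainty sources in quadrature, which implicitly assumes they are independent. A Python equivalent, with made-up numbers:

```python
import numpy as np

st_unc = np.array([0.08, 0.07, 0.05])   # statistical uncertainty (made-up values)
sp_unc = np.array([0.15, 0.10, 0.04])   # spatial/coverage uncertainty (made-up)

# Total uncertainty: the two sources added in quadrature
unc = np.sqrt(st_unc**2 + sp_unc**2)
print(unc.round(3))
```

Which is why the distinction matters: improved station coverage legitimately shrinks the sp_unc term, but the baseline-alignment artifact distorts the st_unc term, and both feed the published total.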

    • The reduction is due to a reduction of spatial uncertainty.
      This is due to the addition of stations.

  5. But…But.. BEST, IIRC, is supposed to be a robust, complete scientifically and statistically valid analysis of ‘all’ available data, etc, etc. (as per the initial fanfare and blurb) and hence produces the most accurate representation of climate data possible, right?
    Now I am totally Gutted……my beliefs shattered and in rags..
    (/sarc – just in case anyone doesn’t detect the heartfelt sarcasm)
    On a more serious note – I do hope this does not actually reflect any internal ‘purpose’ or ‘agenda’ within or by the BEST team – I have no problem with them making genuine errors (other than they should have been spotted in peer review, etc) and correcting them, as that would be ‘normal’ progressive science/advancement. But given the fanfare of the proclamations of AGW being ‘real’, etc, by those concerned, must we now seriously question their results too? (a la NASA/GISS/CRU, etc)

    • Kev-in-Uk, that’s the only reason I’ve spent any real time examining BEST. BEST was supposed to address all the issues skeptics had with the previous temperature records. It was supposed to use the best methodologies available.
      But from the moment I started looking at it, I found obvious problems with it. The more time I spent looking into it, the worse I realized it was. There are a ton of issues this post doesn’t even begin to touch on. For instance, BEST makes a big deal about being completely transparent, but did you know there are at least seven different versions of its final results, none of which were archived? I get that BEST wants to update its results from time to time, but why wouldn’t it keep a record of its old ones for people to look at? Shouldn’t people be able to compare current results to old ones?
      I don’t think BEST lives up to its hype, at all, but if you want to see it really screw up, you should take a look at this.

    • BEST, IIRC, is supposed to be a robust, complete scientifically and statistically valid analysis of ‘all’ available data, etc, etc. (as per the initial fanfare and blurb) and hence produces the most accurate representation of climate data possible, right?

      Well to the folks at BEST, IIRC it is.
      Nothing is clear or clean-cut in the temperature business; there are a lot of problems and challenges with no easy answers, and it basically comes down to smart people making assumptions and then using those assumptions to take their best guess. The best and most dependable financial analysts all concluded all was hunky-dory right up to when the financial markets went belly-up across the globe.
      We blue-sky new plane designs using this approach, but luckily we go much farther before we actually build planes. If climate alarmists built planes, the carnage from crashed planes would ground the airline industry for decades.
      Climate alarmists have to come clean on the limitations of the temperature record before building cathedrals using it as a foundation.

  6. Glad to see that people are waking up to the fact that BEST’s methodology is very far from being the best. In fact, in terms of imposing unverified assumptions upon the data base, it’s the worst.

  7. Good catch! But am I the only one that wonders why we don’t just use good stations that haven’t moved and forget all the stats gymnastics? The degree of absurdity in the calculation of these larger averages defies anything rational or meaningful.

    • Correct. There is too much quality control being applied to weather station data that was never intended to be merged into a global climate assessment. See the “surfacestations.org” site that shows thermometers located next to buildings, and in the exhaust stream of air conditioners. Complete failure of sampling methodology. There is no good scientifically maintained global surface temperature network. The US now has the climate reference network, USCRN, but that has only been in place about 10 years, and is USA only.
      There are a few scattered weather stations with good data, some at universities. Those show zero warming.
      http://hidethedecline.eu/pages/ruti/europe/western-europe-rural-temperature-trend.php
      There are a few others, some in the US and one or two in Britain. Generally, stations producing good data for a century show the 1930s as the warmest decade, with a small decline since 1940, then a slow increase from the late 1970s until 2002. Current temps are about equal to the 1930s, not higher.
      There are a few scientifically maintained weather stations in Antarctica since 1958. Those all show zero warming since 1958.

      • 1934 was the warmest year only in the USA average. Using exclusively stations that pass scientific scrutiny (instead of bogus “homogenization”) the global area-average peaked somewhat earlier, but saw a DEEP decline in the 1960’s and 1970’s. It has since recovered and, in 1998 and 2010, somewhat surpassed the earlier highs. The century-long trend is quite insignificant, nevertheless.

    • Dave in Canmore writes: “am I the only one that wonders why we don’t just use good stations that haven’t moved and forget all the stats gymnastics?”
      Sounds like a good idea to me. But it would not stop the critics. “They are ignoring 98% of the data!” Just look at the recent flap over ocean pH.

      • Yep. First the skeptics cried about the great thermometer drop out. They demanded that all the data be used. So BEST was formed with one goal.
        Use all the data.

      • Skeptics often have lodged the justified complaint that quasi-century-long time series at many stations were quite arbitrarily truncated in the latter decades by GHCN, thus forcing the use of short time series from other stations to bring regional averages up to date. This, of course, introduces the uncontrolled variable of exact measurement location into the averaging process. Contrary to what Mosher claims, the cry for “all the data” was aimed to avoid such station shuffling, rather than a blind insistence that mere scraps of record from every available station should be used.

    • It is a good idea, and there are quite a few even though coverage is sparse. Long records like at the Valentia observatory in Ireland, Sulina in Romania, or Hachijyo in Japan, where there is no UHI. Shorter records like Tiksi, Siberia, starting in 1936. For anomalies, perhaps sufficient.
      Another problem is the oceans. All early data was trade route biased. Argo is newish.
      IMO a sufficient answer is satellites starting 1979, UAH and RSS. After all, even diddled records show no warming until the mid to late 1970’s. That was the global cooling scare decade; see essay Fire and Ice. Holdren’s thing then. The satellite record is now long enough to be useful, at least for purposes like calibrating climate models and their performance.

      • Do some history. Look for skeptics complaining about
        NCDC dropping stations.
        They accused people of sample bias.
        Read McKitrick.
        Hell, we wrote about the great thermometer drop out here.
        So goal number one of BEST from its inception was to answer the skeptics’ complaint about folks not using all the data.

    • From the OED. Genius:

      A demon or spiritual being in general. Now chiefly in pl. genii (the sing. being usually replaced by genie), as a rendering of Arab. jinn, the collective name of a class of spirits (some good, some evil) supposed to interfere powerfully in human affairs.

  8. This is a form of homogenization, a process whereby stations in the dataset are made to be more similar to one another.

    What?!?
    Making stations like other ones in a region completely defeats the purpose of having that station. Why even have that station? Shut it down and save some money. Even worse, it destroys the notion of measuring anomalies, since you are not measuring the differences in a specific station over time but instead comparing that specific station to other stations. That does not make sense.
    Instead just admit temperature data is a messy business, with no easy answers or conclusions, and keep working it.
    It’s funny, in deflate-gate there are PHDs arguing whether or not the Patriots had significantly less fumbles than other teams since 2006 when the NFL allowed teams to provide their own balls. The statistical arguments, margins of error and accusations of cherry picking is identical to the arguments around temperature data-sets. In football, we are talking about a very closed defined system, with meticulous and precise collections of team and individual statistics and experts cannot agree on how to determine if the Patriots fumble trend has reduced (global warming) and if so due to cheating (man-made) against the rest of the league (natural variation). If we cannot get statistical experts to agree how to pull conclusions from a relatively simple system like football, who in their right mind believes we can be conclusive about global temperatures.

    • “Making stations like other ones in a region completely defeats the purpose of having that station. Why even have that station, shut it down and save some money. ”
      The process does not make stations like other stations.
      1. All we have is the raw data.
      2. There is no independent check on any historical station.
      3. All the stations are used, and we estimate a surface from that data which minimizes the error.
      In other words: given all this raw data, what is the best estimate (the one that minimizes the error) we can make for this region?
      This is essentially the method suggested at skeptical sites.
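      That “estimate a surface that minimizes the error” idea can be illustrated with a toy least-squares decomposition: model each reading as one shared regional signal plus a per-station offset, and fit both jointly. Everything below (station counts, noise levels, the alternating solver) is an invented sketch, not BEST’s actual kriging-based code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_stations, n_months = 5, 120

# Synthetic "region": a slow shared warming trend, plus a fixed offset
# per station (siting, altitude), plus monthly noise.
true_signal = 0.01 * np.arange(n_months)
offsets = rng.normal(0, 3, n_stations)
readings = true_signal + offsets[:, None] + rng.normal(0, 0.5, (n_stations, n_months))

# Alternating least squares: jointly fit one regional signal and one
# offset per station so the squared error is minimised. No station is
# forced to "look like" its neighbours; the offsets absorb the differences.
est_offsets = np.zeros(n_stations)
for _ in range(20):
    est_signal = (readings - est_offsets[:, None]).mean(axis=0)
    est_offsets = (readings - est_signal).mean(axis=1)

# Compare after removing the arbitrary constant shift shared by both fits.
resid = (est_signal - est_signal.mean()) - (true_signal - true_signal.mean())
print(abs(resid).max())
```

      Note the recovered signal matches the true trend only up to a constant shift, which is one reason real analyses work in anomalies rather than absolute temperatures.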

  9. I asked BEST people if they had a professional statistician on the team. Yes, a professor of statistics. But he does not appear as an author anywhere in their papers.

    • Both David Brillinger and Charlotte Wickham are professional statisticians. David was not a coauthor (though he was an advisor and helped develop the approach to uncertainty calculations). Charlotte was a coauthor.

      • Hollywood directors use the name Alan Smithee when they don’t want to take any credit for a movie.

    • Charlotte Wickham is. Factoid: she is the sister of Hadley Wickham, creator of ggplot2.
      Having statisticians on the team does not guarantee against errors in inference and reasoning.

  10. Does BEST or anyone else make their raw absolute temperatures available in a simple-to-use format?
    Something like: lat, long, date, time, temp?
    I’d like to try a run using sampling theory to see if we can’t do something that none of the major temp series appear to have tried: replace all of the averaging and adjustments with randomness, and only calculate the averages as the last step in the whole process. No anomalies, no adjustments, no breakpoints, no grids, no homogenization.

      • Are they really raw? I.e., are they simply the original measurements, nothing added, nothing deleted, nothing changed?
        If so, is there anyone here with enough time and computer power to do a nice straightforward analysis using simple rules:
        1. A station can be split at a known move or station change, and not for any other reason.
        2. Stations with insufficient data to be excluded. [Criteria to be determined in advance.]
        3. All data for all selected stations is used unaltered, except for obvious typos/errors. [Note: the non-obvious errors can be expected to be reasonably few, to not introduce bias, and to be unidentifiable anyway.]
        4. The algorithm for global averaging and for uncertainty to be determined in advance. Wherever there are possible alternatives, simplest wins. [Every complexity introduces its own uncertainty, and simplicity is important for others to be able to reproduce results.]
        5. Everything to be documented.

      • Are they really raw? I.e., are they simply the original measurements, nothing added, nothing deleted, nothing changed?
        Yes. The vast majority of data is daily data from GHCN-D and GCOS.
        Daily data is not adjusted.
        Before I ever went to work for Berkeley I did my own global series using GHCN-D daily data.
        No adjustments WHATSOEVER. Guess what: you get an answer within a few percentage points.
        If so, is there anyone here with enough time and computer power to do a nice straightforward analysis using simple rules:
        1. A station can be split at a known move or station change, and not for any other reason.
        Instrument changes are also used, time of observation is also used, big gaps in data are also used.
        2. Stations with insufficient data to be excluded. [Criteria to be determined in advance.]
        Unnecessary. Small amounts of data mean the series has small weight.
        3. All data for all selected stations is used unaltered, except for obvious typos/errors. [Note: the non-obvious errors can be expected to be reasonably few, to not introduce bias, and to be unidentifiable anyway.]
        There are over 10 known QC problems. All listed.
        4. The algorithm for global averaging and for uncertainty to be determined in advance. Wherever there are possible alternatives, simplest wins. [Every complexity introduces its own uncertainty, and simplicity is important for others to be able to reproduce results.]
        5. Everything to be documented.
        It’s been done six ways since Sunday.
        Let’s go back to 2010:
        http://wattsupwiththat.com/2010/07/13/calculating-global-temperature/
        “Bloggers and researchers who have developed reconstructions so far this year include:
        Roy Spencer
        Jeff Id
        Steven Mosher
        Zeke Hausfather
        Tamino
        Chad
        Nick Stokes
        Residual Analysis
        And, just recently, the Muir Russell report
        Here is the bottom line.
        pick ANY data source you like. GHCN-D, GHCN-M, GCOS, CRU, whatever
        pick any method you like: CAM, RSM, Least squares, Kriging.
        apply adjustments or DONT apply adjustments
        Use only rural or use all stations.
        Guess what?
        Your answers will not differ in any way that has any impact on the theory of global warming.
        There was an LIA.
        It is getting warmer
        The question is not whether it has warmed .7C or .8C or .9C
        The question is
        A) how much of that warming is due to man
        B) what future warming can we expect.

      • The question is how much of the warming since 1950 is natural and how much is due to fossil fuel use. I’ve seen estimates from modellers that it is as low as 0.18°C, so “whether it has warmed .7C or .8C or .9C” is important.

      • Mosh says
        “There was an LIA. It is getting warmer
        The question is not whether it has warmed .7C or .8C or .9C
        The question is;
        A) how much of that warming is due to man
        B) what future warming can we expect.”
        —– —— —
        Despite the errors and uncertainties it is undoubtedly (and unsurprisingly) getting warmer since the LIA.
        It has probably warmed by somewhere around Mosh’s estimates.
        He asks two good questions in A) and B).
        There is a C) however, which is much more interesting.
        C) Is this modern warming unusual, or merely part of a cyclical trend of rising and declining temperatures that can be traced throughout the Holocene?
        tonyb

      • dbstealey
        I guess we need to add on a bit for warming in that area since the core dates of 2004 (although the two warmest consecutive decades in Greenland remain the 1930s and 1940s, according to Phil Jones).
        In posing my question C) I wanted to add historical context. If today is genuinely the ‘warmest ever’, that is significant. If it isn’t, it puts today’s values into context. I can’t see any indication that the modern era is any warmer than past eras such as the MWP or the Roman era.
        Climatologists seem to be fixated on parsing those instrumental values, which have a very short history. Climate did not begin in 1980, or even 1880. Incidentally, as you know, the general warming has been going on for much longer than the GISS start date.
        So I guess there is a question D) What has caused it to warm for the last 300 years?
        tonyb

      • I find it weird Steven Mosher says:

        There was an LIA.
        It is getting warmer
        The question is not whether it has warmed .7C or .8C or .9C
        The question is
        A) how much of that warming is due to man
        B) what future warming can we expect.

        I think most people would say it’s difficult to answer A or B without having a decent idea of the answer to the question Mosher dismisses. It’s difficult to see how one can say “how much of that warming is due to man” without knowing how much warming there was.
        Similarly, if we can’t tell how much warming there has been, exactly how can we decide how much warming there will be? We use our observations of changes in our world to estimate what changes there will be in the future. If we don’t know what changes there have been, we won’t know what changes there will be.
        I know some people would just wave that all away, saying we don’t need to worry about small changes like that, but the difference between .7C and .9C is .2C. That could easily be a decade or two worth of warming. I think most policy makers would like to know if a problem they face could be mistimed by as much as two decades.
        We’re often told there is a consensus that humans have caused 50+% of the observed warming. If we don’t know how much warming there’s been, how will we rate somebody who estimates humans have caused .4C of warming?

      • Steven Mosher
        January 29, 2015 at 8:13 pm

        If so, is there anyone here with enough time and computer power to do a nice straightforward analysis using simple rules:
        1. A station can be split at a known move or station change, and not for any other reason.
        Instrument changes are also used, time of observation is also used, big gaps in data are also used.
        2. Stations with insufficient data to be excluded. [Criteria to be determined in advance.]
        Unnecessary. Small amounts of data mean the series has small weight.
        3. All data for all selected stations is used unaltered, except for obvious typos/errors. [Note: the non-obvious errors can be expected to be reasonably few, to not introduce bias, and to be unidentifiable anyway.]
        There are over 10 known QC problems. All listed.
        4. The algorithm for global averaging and for uncertainty to be determined in advance. Wherever there are possible alternatives, simplest wins. [Every complexity introduces its own uncertainty, and simplicity is important for others to be able to reproduce results.]
        5. Everything to be documented.

        This is basically what I’ve done, and because I’m not doing the same thing as what everyone else has done, I get different results.
        These are the headlines:
        The big swings in average surface temp are from large swings in min temp at different regional locations at different times around the world.
        There is no loss in night-time cooling in surface data collected since the 1950s. Daily rising temp is, and has been, well matched by the following night’s cooling.
        Normal swings in surface temp (even for periods as short as an hour: 2°F/hour cooling under clear skies at sub-freezing temps) far exceed any possible effect from CO2.
        Land use changes make a larger impact than CO2.
        Changes in the amount of clouds make a larger impact than CO2.
        There does appear to be a change in the rate annual temps change during the year, but this rate also appears to be changing direction, potentially back towards the same rate as we had in the recent past.
        So, has anyone else looked at the rate of change in surface data for daily and annual temp cycles? Isn’t there a lot of useful information in this data that everyone else throws away?

      • If so, is there anyone here with enough time and computer power to do a nice straightforward analysis using simple rules:

        There is a MUCH simpler method to calculate average temperature.
        1. Start with the very basic observation that your stations change over time.
        2. Any adjustment to make your stations appear “unchanged” is thus a source of error.
        3. Therefore, it is useless to try to build a temperature record based on fixed stations, as you can never be sure how much error you introduced.
        4. Instead, assume that your station readings are simply random samples in time and space.
        5. Apply sampling theory to pick random samples that accurately recreate the spatial and temporal distribution of the Earth’s surface over a year.
        6. These samples should fit a normal distribution – check this assumption.
        7. Calculate the average temperature, standard deviation, and standard error for the year.
        This result should be at least as accurate as any gridding method, and it has huge computational advantages. Anyone with a modern PC and a good-sized drive should be able to tackle this. All that is required is a small bit of custom programming to build and analyze the samples. I’ll probably use SQL, as the problem lends itself readily to analysis on a database, but many different tools should be able to do the job.
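        The sampling steps above can be sketched in a few lines on synthetic data. The latitude bands, area weights, and station values below are all invented for illustration; the point is only the shape of the calculation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Rough area fractions and synthetic station readings per latitude band.
# Both the bands and the numbers are invented for this illustration.
bands = {"tropics": 0.40, "midlat": 0.45, "polar": 0.15}
readings = {
    "tropics": rng.normal(26, 2, 4000),
    "midlat": rng.normal(11, 6, 5000),
    "polar": rng.normal(-15, 8, 800),
}

# Step 5: a stratified random sample whose composition matches the
# spatial distribution (area fraction) of each band.
n = 2000
sample = np.concatenate([
    rng.choice(readings[band], size=round(n * frac))
    for band, frac in bands.items()
])

# Step 7: average, standard deviation, and standard error, computed last.
mean = sample.mean()
stderr = sample.std(ddof=1) / np.sqrt(len(sample))
print(f"global mean ≈ {mean:.1f} ± {stderr:.2f} (1 s.e.)")
```

        Note the pooled standard error here is conservative for a stratified sample, since it includes the between-band variance as well as the within-band scatter.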

        • ferdberple
          A long blockquote copy, but worth repeating here.

          There is a MUCH simpler method to calculate average temperature.
          1. Start with the very basic observation that your stations change over time.
          2. Any adjustment to make your stations appear “unchanged” is thus a source of error.
          3. Therefore, it is useless to try to build a temperature record based on fixed stations, as you can never be sure how much error you introduced.
          4. Instead, assume that your station readings are simply random samples in time and space.
          5. Apply sampling theory to pick random samples that accurately recreate the spatial and temporal distribution of the Earth’s surface over a year.
          6. These samples should fit a normal distribution – check this assumption.
          7. Calculate the average temperature, standard deviation, and standard error for the year.
          This result should be at least as accurate as any gridding method, and it has huge computational advantages. Anyone with a modern PC and a good-sized drive should be able to tackle this. All that is required is a small bit of custom programming to build and analyze the samples.

          But, you’re wrong. 8<)
          No programming is needed to implement your idea.
          Just run the same program as-is that is already processing the floating thermometers (constantly moving, irregular-time-of-day reporting) in the ARGO buoys.

      • I didn’t dismiss the question.
        You don’t understand sensitivity analysis.
        Think more. Comment less.

      • tonyb asks:
        D) What has caused it to warm for the last 300 years?

        Well, that is the central question here, isn’t it? The answer is, we don’t know. Just like we don’t know the cause of the LIA.
        But looking at the chart I posted above, we see that current temperatures are very normal. Nothing either unprecedented or unusual is occurring. So I refer you to the climate Null Hypothesis and Mr. Billy Ockham for a reasonable conclusion…

      • “climatereason January 30, 2015 at 12:40 am
        I guess we need to add on a bit for warming in that area since the core dates of 2004. (although the two warmest consecutive decades in Greenland remain the 1930’s and 1940’s according to Phil Jones.”

        Not since 2004, but 1855. That Allen data stops at 95 BP, and in BP dating, “Present” = 1950. The Mann Hockey Stick label is false.

      • Just run the same program as-is that is already processing the floating thermometers (constantly moving, irregular-time-of-day reporting) in the ARGO buoys.

        the ARGO buoys don’t satisfy point 5:
        5. Apply sampling theory to pick random samples that accurately recreate the spatial and temporal distribution of the Earth’s surface over a year.
        The advantage of sampling is that even if the data isn’t normally distributed, the sample means will be – so long as ARGO ocean temps within the year satisfy the conditions of the central limit theorem. This allows you to apply lots of very well-known statistical methods to your results.
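        That central-limit-theorem point is easy to demonstrate on synthetic data. The skewed “temperature” distribution below is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# A deliberately skewed synthetic "ocean temperature" distribution;
# individual readings are nowhere near normally distributed.
temps = rng.exponential(scale=5.0, size=100_000) + 2.0

# Means of many random samples of 200 readings each.
sample_means = np.array([
    rng.choice(temps, size=200).mean() for _ in range(5000)
])

def skew(x):
    """Standardised third moment: roughly 0 for a symmetric distribution."""
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

# The raw data is strongly skewed; the sample means come out nearly
# symmetric, which is the central limit theorem at work.
print(skew(temps), skew(sample_means))
```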

  11. This analysis is incorrect. The baseline period chosen has only a minor impact on uncertainty (we’ve tried it with various different ones). Rather, the drop in uncertainty around 1960 is almost entirely due to a reduction in spatial uncertainty. Prior to 1960 there is no data at all in one of the world’s continents, Antarctica, which significantly increases the uncertainty in the global reconstruction.
    See this discussion of the uncertainties present in Berkeley, GISS, and Hadley methods using synthetic data: http://static.berkeleyearth.org/memos/robert-rohde-memo.pdf
    Also Figure 8 (and the associated discussion) in the Berkeley methods paper: http://static.berkeleyearth.org/papers/Methods-GIGS-1-103.pdf

    • Zeke, it is inappropriate to simply claim an “analysis is incorrect” when two separate points were made and you only take issue with one. The point I am personally more troubled by is the fact that, according to BEST’s own words and code, BEST does not rerun its breakpoint calculations as part of its jackknife algorithm. This means BEST does not account for any uncertainty in its homogenization process. Nothing you say touches upon that issue. As such, the analysis as a whole cannot be incorrect.
      That said, let’s consider what you say:

      This analysis is incorrect. The baseline period chosen has only a minor impact on uncertainty (we’ve tried it with various different ones).

      This is a claim which merits more than a passing, “We’ve examined this and it doesn’t matter.” If the impact exists at all, it is something BEST ought to have discussed at some point. And if the impact truly is minor, there ought to be some demonstration of such. Regardless, the crux of the issue is you say:

      Rather, the drop in uncertainty around 1960 is almost entirely due to a reduction spatial uncertainty. Prior to 1960 there is no data at all in one of the world’s continents, Antarctica

      This is clearly not true as BEST’s website shows there is data for Antarctica prior to 1960. In fact, there doesn’t seem to be a particularly significant change in the amount of data in Antarctica around 1960.

      • Brandon,
        I should have said 1955, not 1960, as that is when the dip in the graph you highlighted occurs.
        I’m less familiar with the interactions between the breakpoint calculations and uncertainty bounds. I know we have experimented with a number of different parameters for breakpoint detection and looked at the results, but I’ll have to check with Robert to see how that factors into statistical uncertainty.
        Regardless, the issue you point out regarding the decline in total uncertainty in the original post relates to Antarctic coverage.

        • Zeke, even if we change your stated value to 1955, your claim is not true. Berkeley Earth requires segments (after breakpoints are calculated) cover at least ten years to be used. There are exactly 54 stations in that list with 120 or more months worth of data which are labeled as “inside” the Antarctica region. Six begin prior to 1950, six more begin prior to 1961, and only four more began prior to 1970. That clearly refutes your claim there was no data prior to 1960, and it also shows there was no particularly large increase in data coverage at the time of the breakpoint.
          But there are issues with how BEST’s website defines the Antarctica region so we should consider more stations in the list. Allowing for stations as far as 500km outside the region, we find 61 more stations with 120 months or more data. Adding these in increases the previous numbers to 10 before 1950, 22 more before 1961 and 10 more before 1970 though some of these clearly aren’t in Antarctica.
          Here is a breakdown of the 32 stations prior to 1961:
          BASE ORCADAS / SOUTH ORKN – No problems, available as early as 1903.
          ISLAS ORCADA – Partial duplicate of the above station.
          South Orkney – Usable from 1917-1934.
          Bellingshausen AWS – Usable from 1947 on.
          Faraday – Usable as early as 1944.
          Deception Island South Shetan – Usable from 1947-1967.
          Esperanza + Hope Bay – Usable from 1952.
          Base Esperanza – Partial duplicate of the above, extending further into the future.
          Adelaide Island – Unusable prior to 1962.
          Signey Island South Orkney – Available as early as 1947, but rendered unusable prior to 1956 due to an “empirical breakpoint” with a small magnitude. How an “empirical breakpoint” can be calculated before there is enough data to estimate the area’s temperature is a fascinating question.
          Dumont D’urville – Usable as early as 1956.
          Destacamento Naval Deception – No usable data.
          Est. Naval Almirante Brown – No usable data prior to 1973.
          Admirality Bay – Available as early as 1951, but rendered unusable due to an “empirical breakpoint” of incredibly small magnitude in 1957. Again, how this breakpoint was calculated is beyond me.
          Dest. Naval Melchior – Usable from 1951-1961.
          Mawson Base – Usable from 1954 on.
          Belgrano – Usable from 1955-1979.
          General Belgrano – Duplicate of the above.
          Belgrano I – Duplicate of the two above, with a .1 degree shift.
          Base Belgrano II – Unusable prior to 1980.
          McMurdo Sound NAF – Unusable prior to 1961 due to a station move.
          Mirny – Usable from 1956 on.
          Amundsen Scott – Usable from 1957 on.
          Vostok – Usable from 1957 on.
          Byrd Station – Usable from 1957-1971 (though it uses outdated data).
          Scott Base – Usable from 1957 on.
          S.A.N.A.E. Station – Not usable prior to 1962.
          Syowa – Not usable prior to 1966.
          Adare Hallett – Not usable at all.
          Davis – Not usable before 1969.
          I think I missed two somewhere, but that list still shows eight stations with usable data prior to 1955. The number would be 10 if not for the inexplicable “empirical breakpoints” added to two of them. There are only seven stations added between 1955 and 1960. Are you saying seven stations can cause the uncertainty of the BEST record to plummet so dramatically? (I’m excluding the obvious duplicates, because obvious duplicates shouldn’t be used.)
          I can accept I may have been wrong about this issue, but I think assuming a methodological issue is more generous than assuming you rest so much on a mere seven stations.

    • If you have no data for Antarctica, simply use the climate science (TM) formula:
      antarctica temp = (south america temp + new zealand temp) / 2
      After all, Antarctica is halfway between South America and New Zealand, so its temp should be the average of these two stations.
      According to climate science (TM) homogenization and interpolation.

  12. Any statistical analysis based on crappy, ill-controlled observations needs to be taken with a grain of salt. That said, the results may indeed show a warming trend. So what? The results of said analysis cannot speak to cause and effect.
    That last point is the main contention I have with the mad rush to blame what I breathe out. What causes warming trends? What causes cooling trends? What causes wetter decades? Drier decades? What causes erratic swings? The too-quick jump to the AGW cause will bite their ass big time. Too bad I will likely be pushing up daisies when the current cadre of conclusion-jumping researchers get their comeuppance.

      • Epiphron, the spirit of Prudence, Shrewdness, and Thoughtfulness.
        Not so shrewd. Controlled observations are what science is all about. BEST was a crazed attempt to manufacture a warming signal from surface stations provably suffering micro- and macro-site degeneration, i.e. uncontrolled observations. The data was clearly unfit for purpose. That they tried to unscramble the egg with more time in the blender speaks to motive. It speaks loudly.
        And the hope and expectation you tacked on the end? Well, don’t get your hopes up. I’ve looked into the crystal ball and I can tell you what the future holds –

        December 2015
        At the Paris climate negotiations, hampered by heavy snowfalls, the parties come to a historic agreement. To hold next year’s meeting in Barbados.
        Unfortunately the delegates at Paris 2015 don’t get time to debate the IPCC’s new and improved “buck each way” position, however they agree it should be on the Barbados 2016 agenda. The next location debate already goes into extra time, with the Barbados compromise only being reached “at the 11th hour”.
        And Barbados is a compromise. The Chinese stall negotiation by asking for the moon, literally. The Chinese argue that lunar orbit is the perfect place for 2016 delegates to observe the utter insignificance of human effects on climate. While other delegates generally agree that China is an impoverished developing nation that should be allowed to emit CO2 forever, the location is rejected. Other nations argue that the lunar location discriminates against other impoverished developing nations that, unlike China, do not have the space launch capability to reach lunar orbit.
        Australia’s suggestion of a low cost international teleconference is also dismissed when the issue of adverse impacts on the struggling airline and pre-mix pina colada industries is raised.
        It is agreed that Barbados still involves extensive first class airline travel. Also that delegates will still be able to at least view the moon. From the beach at night. While holding a pina colada in a pineapple. Finally it is the acknowledgement that, unlike December in Paris, the only umbrellas required will be purely decorative that gets the Barbados vote over the line at 3.00am.

        Epiphron, you and yours should have been far more prudent. The collapse of the GoreBull Warbling hoax is going to destroy the professional Left from one side of the planet to the other. Adding radiative gases to the atmosphere in no way reduces our radiatively cooled atmosphere’s ability to cool our solar heated oceans. Your tears are as nectar.

    • Actually, all the code was done before I got there.
      I was asked to join because I was critical of their UHI approach.
      Go figure.
      But your conspiracy theory is noted.
      Did we land on the moon?

      • ”but your conspiracy theory is noted. did we land on the moon?”
        Good lord. Trotting out that old wheeze!
        Lewandowsky’s inane attempt to pathologise dissent toward your ridiculous hoax has been thoroughly discredited.
        Some of the most prominent sceptics today were involved in the Apollo program. Several walked on the moon. In pre AGW hoax days I met the geologist from 17, Harrison Schmitt. We talked gas pressurised joint design. Smart man. Now a dedicated sceptic to your inane “adding radiative gases to the atmosphere reduces the atmosphere’s radiative cooling ability” hoax.
        Did you just turn your “Snivelling Stupidity” dial all the way to 11 Steven?! Men landed on the moon. You can still bounce a laser off the reflectors they left there. JAXA has photographs of landing stages from lunar orbit. And a great number of the engineers, flight controllers and astronauts who made it happen are sceptical regarding your CO2 causes warming BS. How much more epic can your fail possibly be??
        Your only fame -”First sleeper at WUWT to snap”
        And I remember when you snapped. 2010. The M2010 discussion paper. You and your cronies did something most foul. You attacked a legitimate paper in meteorology because it threatened your climastrology BS. M2010 introduced the concept of horizontal flows being affected by diabatic processes (radiative cooling). You and yours just couldn’t have that. It might raise the question of radiative subsidence in tropospheric convective circulation. You are on record as one of the “Knights of Consensus” who rode out to trash that paper. The record is permanent. Your shame is forever. You did not just scrape the bottom of the barrel, you clawed through the rotting timbers at its base and got yourself elbows-deep in the feculent ooze below.
        Forever, Steven. Forever.

  13. Zeke has posted the graph above that illustrates Brandon’s misdiagnosis of the change in uncertainty.
    I would expect some sort of editorial correction or update to the head post.
    After all, we are practicing blog peer review here.
    Here are some other comments from Dr. Rohde.
    1) First off, the step-wise shift in uncertainty has nothing to do with normalization or data processing issues. The large increase in uncertainty prior to about 1950 is a simple consequence of the complete absence of weather stations in Antarctica prior to the 1950s. The ability to estimate the global land average dramatically improved once we finally started putting instruments in Antarctica to start placing some constraints on the final 10% of the Earth’s land area. If one looks in our publication where the spatial and statistical parts of the uncertainty calculation are reported separately, the step change is entirely in the spatial part (i.e. a result of reduced coverage) and isn’t related to the statistical uncertainties which have no such step at that time.
    So Brandon is wrong. Here is what you see if you read our paper:
    As you can see, and as Zeke and Robert explained, the jump in uncertainty happens in the spatial uncertainty. This is pretty clear in the paper.
    http://i61.tinypic.com/mm3ar9.png
    Continuing:
    2) In the statistical calculation, the choice of a 1960-2010 baseline was done in part for a similar reason, the incomplete coverage prior to the 1950s starts to conflate coverage uncertainties with statistical uncertainties, which would result in double counting if a longer baseline was chosen. The comments are correct though that the use of a baseline (any baseline) may artificially reduce the size of the variance over the baseline period and increase the variance elsewhere. In our estimation, this effect represents about a +/- 10% perturbation to the apparent statistical uncertainties on the global land average. Again, this is completely separate from the large step-increase in uncertainty associated with the absence of Antarctic data.
    3) With regards to homogenization, the comments are only partially correct. The step that estimates the timing of breakpoints is presently run only once using the full data set. However, estimating the size of an apparent biasing event is a more general part of our averaging code and gets done separately for each statistical sample. Hence the effect of uncertainties in the magnitude, but not the timing, of homogeneity adjustments is included in the overall statistical uncertainties. Conceptually, it would be desirable to rerun the breakpoint timing detection code on the subsamples as well, to capture the uncertainty in breakpoint timing. However, the effect of doing that is generally very small. Uncertainties in the magnitude of breakpoint adjustments generally contribute much more to the overall uncertainty than the typical errors in breakpoint timing. The breakpoint detection code is also quite computationally intensive because of the large number of station comparisons involved. After performing some tests on this issue, it was decided not to rerun the breakpoint detection code on each subsample due to the very small magnitude of the effect vs. large computational cost.
    I hope this helps
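    A toy version of the delete-a-group jackknife described above (remove 1/8th of the stations, re-average, examine the spread) is easy to sketch in Python. Everything below is illustrative: synthetic stations, an assumed noise level and group count, not BEST's actual code:

```python
import numpy as np

rng = np.random.default_rng(1)
n_st, n_t = 80, 50
truth = np.linspace(0.0, 1.0, n_t)                  # hypothetical warming signal
stations = truth + rng.normal(0.0, 0.3, (n_st, n_t))

# Split the stations into 8 random groups; drop one group at a time
# and re-run the (here: trivial) averaging step on the remaining 7/8
groups = np.array_split(rng.permutation(n_st), 8)
estimates = np.array([
    stations[np.setdiff1d(np.arange(n_st), g)].mean(axis=0)
    for g in groups
])

# Scale up the spread of the 8 re-averages: each still uses 7/8 of the data,
# so the delete-a-group jackknife multiplies by sqrt(g - 1) for g = 8 groups
jk_unc = estimates.std(axis=0, ddof=0) * np.sqrt(8 - 1)
print(jk_unc.mean())   # close to 0.3 / sqrt(80), the true uncertainty of the mean
```

    The sqrt(7) factor is the standard delete-a-group scaling for 8 groups. Note the sketch only re-runs the averaging, which is exactly the complaint at issue: any variance removed by an unrepeated breakpoint step never shows up in this spread.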

    • Steven Mosher’s comment repeats the same false argument as Zeke made above, though Zeke said 1960 (then 1955) while this comment says 1950:

      1) First off, the step-wise shift in uncertainty has nothing to do with normalization or data processing issues. The large increase in uncertainty prior to about 1950 is a simple consequence of the complete absence of weather stations in Antarctica prior to the 1950s.

      But as I’ve shown above, the claim there is a “complete absence of weather stations in Antarctica prior to the 1950s” is flat-out wrong. I can accept I may have misdiagnosed the cause of this step change, but we’ve now had Zeke, Mosher and Rohde all make false claims in explaining it.

      The comments are correct though that the use of a baseline (any baseline) may artificially reduce the size of the variance over the baseline period and increase the variance elsewhere. In our estimation, this effect represents about a +/- 10% perturbation to the apparent statistical uncertainties on the global land average. Again, this is completely separate from the large step-increase in uncertainty associated with the absence of Antarctic data.

      It is good to know we can all agree I was right about this being a real problem. As far as I can tell, BEST has never discussed this before, so that is progress.
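      The effect conceded here is easy to reproduce: subtracting each sample's own mean over a common baseline window forces the samples to agree inside that window, so the apparent spread shrinks there and grows elsewhere. A toy sketch (random-walk errors standing in for real structure; none of this is BEST's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n_real, n_t = 200, 100
base = slice(60, 100)            # analog of the 1960-2010 baseline window

# Each realization: a random-walk error plus a per-realization offset
walks = np.cumsum(rng.normal(0.0, 0.1, (n_real, n_t)), axis=1)
series = walks + rng.normal(0.0, 0.5, (n_real, 1))

# Align each realization to its own baseline-window mean
anoms = series - series[:, base].mean(axis=1, keepdims=True)

spread = anoms.std(axis=0)       # cross-realization spread at each time step
print(spread[base].mean(), spread[:60].mean())  # small inside, large outside
```

      With random-walk errors the alignment is dramatic: the cross-sample spread inside the baseline window comes out well below the spread outside it.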
      I’ll apologize for my misdiagnosis if it is confirmed I did make one. It may be true the addition of only ten or so stations in Antarctica causes the enormous shift in BEST’s uncertainty. I think that’s a worthwhile point to highlight though. I think most people can agree it is troubling so little data can have so large an effect. If I happened to think there was only one troubling point causing this, when in reality there were two, I think that could be chalked up to just being charitable.

      With regards to homogenization, the comments are only partially correct. The step that estimates the timing of breakpoints is presently run only once using the full data set. However, estimating the size of an apparent biasing event is a more general part of our averaging code and gets done separately for each statistical sample.

      I’ll admit this comment confuses me. I’ve been told, repeatedly, the scalpel method means the split-up segments are used separately so there is no need to calculate “the size of an apparent biasing event.” If that is true, Rohde’s reasoning for why I am supposedly wrong can’t be true. In fact, his remark:

      Hence the effect of uncertainties in the magnitude, but not the timing, of homogeneity adjustments is included in the overall statistical uncertainties.

      shouldn’t be true at all, as there shouldn’t be any adjustments made for breakpoints.

      • I’ll reply here. While a bit of Antarctic data may be available prior to 1955, as far as I can tell none is actually used by Berkeley. See http://berkeleyearth.lbl.gov/regions/antarctica
        I’m not sure why it’s not used, but the fact that Berkeley doesn’t have an Antarctic record till 1955 is the reason for the change in uncertainty.
        Regarding 1950 vs. 1955 vs. 1960, it’s just an issue of different people eyeballing the charts.

      • Zeke, it is weird you provide a link and tell me to look at it when I myself provided that link to you. Regardless, if you want to say:

        While a bit of Antarctic data may be available prior to 1955, as far as I can tell none is actually used by Berkeley.

        Then you’re suggesting there’s an even bigger issue than I’ve suggested. I listed what data is presented by BEST. If BEST is listing data as having been used, but simply not using it like you suggest… I don’t know what to say.

        I’m not sure why its not used, but the fact that Berkeley doesn’t have an Antarctica record till 1955 is the reason for the change in uncertainty.

        One could presume there is a minimum amount of data necessary for a region in order to perform estimates for that region. If so, BEST may be using the data I described but finding it insufficient to draw conclusions. That’s not as bad as simply disregarding data.
        But it still raises a serious problem. If your uncertainty can change by so much based upon a handful of temperature stations, it stands to reason your overall results could change by meaningful amounts based upon small amounts of data as well.

        Regarding 1950 vs. 1955 vs. 1960, its just an issue of different people eyeballing the charts.

        I find it peculiar people would say data doesn’t exist based on eyeballing charts rather than doing the simple thing of actually looking at the data. I’d understand it if you guys had said things like, “I don’t think there’s data before X,” but you didn’t. You stated it as fact even though it would take someone less than two minutes to prove you wrong.
        It’s hard to take pronouncements from people/groups seriously when they get such easily verified facts wrong. If you’re wrong about something so simple, can we really trust your pronouncements on more complicated matters?

      • Steven Mosher, you say:

        Brandon you should update your accusations.
        And update the head posts

        This shows your lazy approach to discussions. I have only written one head post in which I made an incorrect claim, and I updated it eight hours before your comment. Not only did I update that post, I wrote a new post explaining the corrections.
        I would add an update to this post, but I have no control over that. This post was taken from an e-mail I wrote. I sent a follow-up e-mail explaining what is correct on this issue. It’s up to our host if he wants to add an update.
        But since you brought it up, you should update the remarks you provide from Robert Rohde. You, Rohde and Zeke have all told us there is no data for Antarctica before 1960/1955/1950 (depending on the comment) even though that’s clearly not true. When it was shown that’s not true, you… did nothing. It’s a bit weird to tell people to admit mistakes while refusing to admit your own.
        [Brandon:
        What is the specific change you need to have us make in the “Head text” of this thread? It has to be edited differently than the individual comments. List the change clearly and we will make that change, but it does have to be done separately from “comments” or replies. .mod]

      • Sorry for the slow response moderator. I didn’t see it since it was added in-line. I don’t know that I’d want anything about this post changed. I’d just add a short update to the end. Modeled after the update I added to my post on the baseline issue, I’d go with something like:

        Edit: Brandon says he got his argument wrong as part of a trick. You can find his explanation here, but the short version is the baseline issue highlighted in this post is real but is not the cause of the step change the post shows. Brandon says he intentionally misdiagnosed the cause of the step change to provoke BEST into acknowledging the problems he describes are real.
        Brandon apologizes to anyone who is bothered by this but wants to stress the fact it worked. BEST has now, for the first time ever, acknowledged the existence of the problems he described.

        And of course, if WUWT would like to distance themselves from my actions, it is welcome to add a statement of its own. I don’t think such is necessary since my deception was just a matter of playing dumb, but I can’t complain if people are bothered by me misleading them.

    • Steven Mosher: However, the effect of doing that is generally very small. Uncertainties in the magnitude of breakpoint adjustments generally contribute much more to the overall uncertainty than the typical errors in breakpoint timing.
      I am sure that is true, but it would still be worthwhile I think if the results of the tests were posted.
      Thank you, and Zeke for your informative posts in this thread.

  14. To give you some sense of the computational load the full uncertainty calculation takes days.
    DAYS to run.
    So, re running breakpoint might change your uncertainty from +-.05 to +-.055 and that
    would take you several extra days to compute.
    of course we tested that and frankly it’s not worth the time.
    Whether the uncertainty is +-.05C after a few days of computing or +-.06C after a couple weeks
    isn’t really scientifically interesting. Might make for a cool blog post though.
    Now of course, attributing the change in uncertainty to the jackknife methodology could have been avoided by reading the materials, or by actually running the code. That is why it is provided: so that folks can run it, change things and test their theories about why things work the way they work

    • Would it not take less time to research the actual stations themselves and find out their actual siting histories and the changes to their actual environments? Everything else honestly seems kind of backwards.

      • One would have presumed that the starting point for the compilation of this data series would have been a thorough audit (including an on-site physical review) of each station used in the collection of data that forms the series.

    • Mosh Dude –
      Do you go out of your way to be grammatically and otherwise incoherent? Spicoli might be an intellectual on your side of the pizza. But damn I just don’t have the munchies.
      Your BEST defense invokes,
      “Relax, all right? My old man is a television repairman, he’s got this ultimate set of tools. I can fix it.”

    • Steven Mosher:

      To give you some sense of the computational load the full uncertainty calculation takes days.
      DAYS to run.

      Get a faster computer or rewrite in a compiled language. If you (or anybody else) are getting paid to develop the code, expect no sympathy for interpreted code running slowly.

    • Steven Mosher: To give you some sense of the computational load the full uncertainty calculation takes days.
      DAYS to run.
      So, re running breakpoint might change your uncertainty from +-.05 to +-.055 and that
      would take you several extra days to compute.
      of course we tested that and frankly it’s not worth the time.

      With respect, I think that you are wrong. It would be worthwhile to use the additional computer time to generate the tables and graphs that support your claim (which I think is likely true), that re-estimating the breakpoints adds negligible uncertainty to the output estimates. As you can see from the reading, Brandon Shollenberger and others simply do not believe you.
      The burden of proof for a claim rests with the claimant; in this case, the burden rests with the BEST team to support your claim.
      Yes, I have done many days worth of extra computations like this, in response to criticisms/questions like these from Brandon Shollenberger. You have to bite the bullet and do the computations and report the results.

    • Please note how most of these Antarctic ice core data were supplied through personal communications to the authors. Climate “science” at work. No wonder others can’t find anything.

  15. Brandon Shollenberger
    Thank you for your fine analysis and the informative discussion it has engendered.
    There is an underlying issue which I think to be very important, and Zeke Hausfather explicitly states it when he writes

    Berkeley Earth is a compilation of surface temperature records via thermometers, not inferred temperatures from ice cores or other proxies.

    Yes, there cannot be an “average temperature” of anything because temperature is an intensive property and, therefore, cannot provide a valid average.
    Every determination of ‘average global temperature’ or ‘mean global temperature’ or etc. from GISS, HadCRU, UAH, etc. is merely “a compilation of surface temperature records” and is a function of the chosen compilation method.
    Hence,
    1.
    Each of the global temperature time series has no known physical meaning.
    2.
    Each of the global temperature time series is a function of a unique compilation method.
    3.
    Each of the global temperature time series frequently alters its compilation method.
    So, whatever ‘global temperature’ is, it is not a scientific indication of any stated physical parameter.
    I assume you have seen Appendix B of this which considers these matters. It concludes

    MGT time series are often used to address the question,
    “Is the average temperature of the Earth’s surface increasing or decreasing, and at what rate?”
    If MGT is considered to be a physical parameter that is measured then these data sets cannot give a valid answer to this question, because they contain errors of unknown magnitude that are generated by the imperfect compensation models.

    and

    To treat the MGT as an indicative statistic has serious implications. The different teams each provide a data set termed mean global temperature, MGT. But if the teams are each monitoring different climate effects then each should provide a unique title for their data set that is indicative of what is being monitored. Also, each team should state explicitly what its data set of MGT purports to be monitoring. The data sets of MGT cannot address the question “Is the average temperature of the Earth’s surface increasing or decreasing, and at what rate?” until the climate effects they are monitoring are explicitly stated and understood. Finally, the application of any of these data sets in attribution studies needs to be revised in the light of knowledge of what each data set is monitoring.

    Richard

    • “Yes, there cannot be an “average temperature” of anything because temperature is an intensive property and, therefore, cannot provide a valid average.”
      /////////////////////
      Quite so.
      The land based thermometer record provides no meaningful insight into anything, and should have been ditched a long time ago.
      Given the small area of land and the low thermal capacity of the atmosphere, the only data relevant to global warming is OHC.
      The problem is that (presently) there is no worthwhile data on OHC. Nothing pre-ARGO is robust, and ARGO is of insufficient duration, lacks spatial coverage, and no attempt has been made to assess what, if any, bias is inherent in the system (caused by the free-floating nature of the buoys, which ride currents that possess a distinct temperature profile differing from adjacent waters, and by the lack of spatial coverage itself).

      • Would like to point out that temperature does not always equate to heat. A quick quiz on temperatures – knowing that solar intensity is never greater the further north you go above the Tropic of Cancer, which of the following states – Alabama, Florida, Minnesota, Montana, North Dakota and South Dakota – have the two highest extreme high temperatures on record, and which have the two lowest?
        Did you get Alabama & Florida and North & South Dakota? For a bonus, did you get that Alabama & Florida were the lowest and the Dakotas were the highest?
        The difference is in humidity. Dry air gets to a higher temperature than wet air with the same energy.
        Alabama 112 F Sept. 5, 1925 Centerville
        Florida 109 F June 29, 1931 Monticello
        Minnesota 115 F July 29, 1917 Beardsley
        Montana 117 F July 5, 1937 Medicine Lake
        North Dakota 121 F July 6, 1936 Steele
        South Dakota 120 F July 5, 1936 Gann Valley
        http://ggweather.com/climate/extremes_us.htm
        R.V., your point is made.
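        The humidity point can be made quantitative with moist-air enthalpy, roughly h = cp·T + Lv·q (sensible plus latent heat per kilogram of dry air). The two air parcels below are hypothetical round numbers, not the record values listed above:

```python
# Moist-air specific enthalpy (J per kg of dry air): sensible + latent heat
CP = 1005.0       # specific heat of dry air, J/(kg K)
LV = 2.5e6        # latent heat of vaporization of water, J/kg

def enthalpy(t_celsius, mixing_ratio):
    """mixing_ratio: kg of water vapour per kg of dry air."""
    return CP * t_celsius + LV * mixing_ratio

hot_dry  = enthalpy(45.0, 0.005)   # desert-like air: hotter, but dry
warm_wet = enthalpy(32.0, 0.020)   # humid subtropical air: cooler, but wet
print(hot_dry, warm_wet)           # the cooler, humid parcel carries more energy
```

        The cooler, humid parcel carries roughly 40% more energy, which is why temperature alone does not equate to heat.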

  16. I read the Booker article and took a look at the reference here. The article looks at three stations in Paraguay: Puerto Casados, Mariscal Estigarribia and San Juan Bautista Misiones.
    Mariscal Estigarribia and San Juan Bautista Misiones appear to be airports. The weather station of Mariscal appears to be located at -22.045186, -60.627443 and that of San Juan may be at -26.636024, -57.103306 (in Google Maps). I found a list of weather stations for Paraguay here. I found another list of weather stations for Paraguay (code: PY) as part of a new dataset at the Met Office here. The new dataset is described in HadISD: A quality controlled global synoptic report database for selected variables at long-term stations from 1973-2011 (Dunn et al. 2012). Actually the database now extends to 2013 and individual station files can be found here. The code for Mariscal is 860680 and that for San Juan is 862600.
    I used 7-Zip to decompress the downloaded files (ext. .gz). The resulting files are netCDF files which can be read with various utilities. I used Panoply for Windows 7. After decompressing, it runs directly from the unzipped folder (click on Panoply.exe) under the Java Runtime Environment. If you don’t have it, search for Java SE Runtime Environment (not the browser plugin) and install it first. I used Java SE 7u76.
    Although the Met Office website does not mention it, the netCDF files contain both the raw temperatures and change points calculated (if I understood it correctly) according to Menne and Williams 2009. Dunn et al 2012 does not even mention change points, so finding the documentation was difficult. I believe Pairwise homogeneity assessment of HadISD (Dunn et al. 2014) describes what they did.
    Interestingly, they have apparently NOT done any adjustments to the data. Both Mariscal and San Juan appear to be airports, so there is quasi-hourly data in the Met Office files. The files have been “quality controlled,” but apparently not change point adjusted, if I understood it correctly.
    With Panoply, it is possible to quickly plot graphs of the raw temps and the change points (with and without interpolation). You can export the temp data from Panoply as “labeled text” and then import it into a spreadsheet. The time is in hours since midnight on 1 Jan 1973, so extracting the time and date is a bit of a bother. I used the @slope function in Quattro Pro X5 to calculate the slope of a simple linear regression on the data without adjustments. I did it with and without simple linear interpolation for missing data. Mariscal had a lot of gaps, but San Juan had few. Hopefully, I didn’t goof up too badly.
    The slope for San Juan was:
    -0.000008465716398 (no interpolation) and
    -0.000009930963906 (interpolating).
    The slope for Mariscal was:
    -0.000008738121812 (no interpolation) and
    -0.000015694874005 (interpolating).
    Compare that with the GISS slopes shown by Paul Homewood.
    Plots for San Juan:
    Change Points Image:
    http://i60.tinypic.com/vncpp4.png
    Temps image:
    http://i62.tinypic.com/2dshh1x.png
    Plots for Mariscal:
    Change Points Image:
    http://i58.tinypic.com/iny05l.png
    Temps Image:
    http://i60.tinypic.com/2ni8vpf.png
    I am having a hard time understanding the validity of the change points algorithm. Both San Juan and Mariscal appear to show negligible climate change over about 40 years, beginning with cooling in the 1970s, even though they are airports. Being airports, there is quasi-hourly data and some reliance on good maintenance and calibration of the instruments, since the data was needed for aviation. Panoply and this new data set may permit a visual comparison of the change point algorithm with raw data. It would be interesting to see what “adjustments” the change point algorithm would do, but unfortunately that does not yet seem to be available.
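    For anyone wanting to replicate the slope comparison without Panoply or Quattro Pro, the same with-and-without-interpolation regression takes a few lines of Python. The series below is synthetic (a seasonal cycle plus noise, with roughly 10% of hours knocked out), not the HadISD station files:

```python
import numpy as np

rng = np.random.default_rng(2)
hours = np.arange(3 * 8766, dtype=float)       # three years of hourly time steps
temps = (20.0 + 5.0 * np.cos(2 * np.pi * hours / 8766)
         + rng.normal(0.0, 2.0, hours.size))
temps[rng.random(hours.size) < 0.10] = np.nan  # knock out ~10% as gaps

# Slope 1: regress only on the hours that actually have data
ok = ~np.isnan(temps)
slope_skip = np.polyfit(hours[ok], temps[ok], 1)[0]

# Slope 2: linearly interpolate across the gaps first, then regress
filled = np.interp(hours, hours[ok], temps[ok])
slope_interp = np.polyfit(hours, filled, 1)[0]

print(slope_skip, slope_interp)   # both near zero: the series has no trend
```

    On a trendless series both slopes come out near zero in units of °C per hour, of the same negligible order as the slopes reported above.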

  17. Quite typical of any climate debate. Zeke (and then Mosher) gave a reason for the step change (and prior to that just arrogantly, without any real argument, stated plainly that Brandon is wrong) that Brandon quite convincingly refutes (no data prior to 1960. Or was it 1955? Oh wait, maybe 1950? Or maybe it just wasn’t used by BEST?) and then they just disappear. In a real and honest debate where the aim would be to get better knowledge, mistakes would be acknowledged (as, conditionally (“if it turns out it was a mistake”), Brandon did) and taken into account for moving forward…

    • Sven, it is remarkable what responses I got from them. Three different BEST members have publicly stated data doesn’t exist when it takes no more than two minutes to find that data. I was hoping we’d be able to move past that as this topic has resulted in progress. BEST has now acknowledged:

      The comments are correct though that the use of a baseline (any baseline) may artificially reduce the size of the variance over the baseline period and increase the variance elsewhere. In our estimation, this effect represents about a +/- 10% perturbation to the apparent statistical uncertainties on the global land average.

      Something it has never done before even though, according to BEST, they’ve known about this issue all along. They may have known about this issue all this time, but I can’t find any indication they’ve ever told anyone about it. I think that’s incredible, but I also think it’s good that now they have. Now we know more about how to interpret the things BEST publishes. For instance, BEST published a report discussing whether or not 2014 is the hottest year on record which presented their results with uncertainty levels they say are accurate to the thousandth of a degree. We now know that’s not true, and BEST has known that’s not true all along.
      You have to wonder how many other issues there are with BEST’s methodology they know about but simply don’t disclose. I can think of at least two more. And yes, I know I could try to figure out what effects all the issues I know about have. The problem is I’d have to buy a new computer and let it run code for weeks to do so. I have no intention of spending $400+ just because the people at BEST have decided not to be open with and transparent about their work like they claim to be.

      In a real and honest debate where the aim would be to get better knowledge, mistakes would be aknowledged (as, conditionally (“if it turns out it was a mistake”), Brandon did) and taken into account for moving forward…

      Yes, but in a real and honest debate, I wouldn’t have to play dumb to trick BEST members into publicly admitting problems they’ve known about for years.

    • Typical yes. Shollenberger was wrong. But cannot admit that. Then he just repeats being wrong.
      No reason for Zeke and Mosher to continue from there.

      • rooter
        As always, you display failure to understand what you are talking about.
        Shollenberger is right and you cite nothing to support your assertion that he “was wrong”.
        Richard

  18. Congratulations are in order to anyone who fully understands BEST’s methods. Their methods paper is not exactly transparent.
    It does seem relevant however to stress BEST is not a simple interpolation. As far as I can make out, it is a nonlinear (the kriging part) regression fit of a simple ‘climate’ model. The latter consists of a stationary ‘climate’ term and a global temperature term. Local ‘weather’ is then added to this fitted model by adding the appropriate part of the function being minimized to get the regression fit. If this is right, the (squared) ‘weather field’ is actually being minimized in the BEST fit?
    Whatever, the point is that there is no reason to think it will correspond closely to observed temperatures at any particular location. Studying strange goings on at particular locations (often cited in comments) is probably not that relevant to whether BEST works or not. Maybe, BEST needs to explain more clearly its somewhat abstract version of temperature. Although I can see that might cause problems too.
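    basicstats’ reading, a stationary ‘climate’ term plus a global term with local ‘weather’ as the minimized residual, can be illustrated with a plain two-way-means decomposition. This is a sketch of that model structure only; BEST’s actual fit involves kriging and much more:

```python
import numpy as np

rng = np.random.default_rng(3)
n_st, n_t = 100, 120
global_term = 0.01 * np.arange(n_t)               # shared "global temperature" term
climate = rng.normal(10.0, 5.0, (n_st, 1))        # stationary per-station climate
obs = climate + global_term + rng.normal(0.0, 1.0, (n_st, n_t))  # + local weather

# Two-way means recover the terms (identifiable only up to a constant)
g_hat = obs.mean(axis=0) - obs.mean()             # centred global term
c_hat = obs.mean(axis=1)                          # climate term + overall offset
weather = obs - c_hat[:, None] - g_hat[None, :]   # residual = local "weather"
```

    The last line is the point: whatever is left over at any single station is, by construction, ‘weather’ the fitted surface never tries to match closely.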

    • basicstats, you say:

      Whatever, the point is that there is no reason to think it will correspond closely to observed temperatures at any particular location.

      This is an interesting issue because it makes BEST’s actions incredibly strange. It’s easy to find BEST representatives explaining their work won’t get local details right, yet at the same time, they want to present their temperature results on an incredibly fine scale. Right now they publish results at a 1° x 1° scale. They’ve talked about plans to do it on a 1/4° x 1/4° scale. If BEST doesn’t want people to interpret their results on a local level, why do they encourage people to look at their results on a local level?
      You can pull up results on the BEST website for individual cities. Why would BEST do that if they want people to understand BEST shouldn’t be used on local levels? It’s baffling. One side of their mouth tells people to look at small local details while the other side tells people they don’t have resolution finer than (at least) hundreds of kilometers.

      Studying strange goings on at particular locations (often cited in comments) is probably not that relevant to whether BEST works or not. Maybe, BEST needs to explain more clearly its somewhat abstract version of temperature. Although I can see that might cause problems too.

      I’m not sure how this came up since I haven’t seen anyone doing this, but I’d agree if we were talking only about small, local areas. We’re not though. We’re not even just talking about entire states. We’re talking about BEST changing cooling trends in areas (at least) half the size of Australia into warming trends. That’s 1/3rd the size of Europe.
      I don’t think you can publish your results on a 1° x 1° scale then expect people to know the resolution of your results is “continental scale.”

    • basicstats, Brandon, you both have it exactly right. What records they produce for local stations literally have no meaning. They are entirely synthetic. I’ve said this before, and not in the context of BEST, but it applies to them thoroughly. Regional climate must be inferred from weather measurements from a conglomerate of stations in the area, and not the other way around.

  19. Why are we interested in global temperatures? Wouldn’t it be more practical to discuss local temps for each state (in the US) and discuss why those temps are moving up and down? For instance where I live in Florida, the climate here is different from both the Keys and the Panhandle. No averaging or homogenization could tell us anything here about the three zones.
    Just my 2 cents or maybe it’s a shiny penny.
    Thanks in advance to anyone who responds.

    • This is a point that seems to completely escape the likes of Steve Mosher. He stated on Climate Etc., and I quote:
      “you tell me the altitude and latitude of a location and I will tell you the temperature. And I’ll be damn close as 93% of the variance is explained by these two factors.”
      When I asked him for the temperature at 51.6N elevation 90-100ft within 1 degree the only response was “The error for a given month is around 1.6C”
      Now there was no mention of it being monthly or average data, he just said Temperature.
      So I have repeated the request for January.
      You see I have already shown that BEST cannot handle Island Temperatures, so I was intrigued by his claim.
      Yesterday I could see a difference in temperature of 2 degrees C, from +2 to +4, between one side of the UK and the other at that Lat/Elevation.
      But if I then look across the Atlantic at St Anthony it was at -8 degrees C and on the other side of Canada it was anywhere from +4 to +10.
      So at the same Lat/Elevation we have a variation of 18 degrees C.
      If you look at what controls the climate in those areas you will understand why.

      • He meant to say; “Give me the name of the place and I’ll look up the stats.”
        Vancouver and Saguenay same latitude and elevation.
        January mean temperature Vancouver +4 Saguenay -14
        The day you looked reflected the monthly average exactly!
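        Both sides can be partly right here: latitude and elevation may explain most of the planet-wide variance while still failing badly wherever maritime-versus-continental contrasts of the size quoted above dominate. A toy regression with hypothetical numbers shows how a large continentality term caps what a latitude-plus-elevation model can explain:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
lat = rng.uniform(30.0, 60.0, n)            # degrees north
alt = rng.uniform(0.0, 2000.0, n)           # metres
maritime = rng.integers(0, 2, n)            # 1 = coastal, 0 = continental

# Hypothetical January means: lapse rates plus a large continentality term
temp = 30.0 - 0.7 * lat - 0.0065 * alt + 9.0 * maritime + rng.normal(0.0, 1.0, n)

# Fit temperature on latitude and elevation only
X = np.column_stack([np.ones(n), lat, alt])
beta, *_ = np.linalg.lstsq(X, temp, rcond=None)
r2 = 1.0 - (temp - X @ beta).var() / temp.var()
print(r2)   # well short of 0.93 once continentality matters
```

        With a 9 °C coastal/continental offset in the mix, the latitude-plus-elevation fit explains only around 70% of the variance in this toy, nowhere near 93%.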

    • Climate is local. From the OED: Condition (of a region or country) in relation to prevailing atmospheric phenomena, as temperature, dryness or humidity, wind, clearness or dullness of sky, etc., esp. as these affect human, animal, or vegetable life. My crops respond to local variables, not global.
      Current temperature inside my greenhouse at 4:30 pm (summertime) in Southern Tasmania 15.7°C. Temperature @ Castle Forbes Bay 11°C. Expected temperature for this time of year ~23°C. Estimated “Global Warming” minus 12°C.
      Global Warming Fatigue = Lassitude or weariness resulting from repeated claims it’s the “hottest year evah”…

  20. @basicstats 2:34 am
    Trying to understand what they write about poorly is tough.
    But the real challenge is to understand what they don’t write about.
    The head post by Brandon rightfully criticizes BEST for uncertainty analysis on only one element of their processing. Uncertainty on the scalpel is not done. But the scalpel is BEST’s blunder.
    Denver Stapleton Airport has 10 breakpoints, some only 4 years apart. Luling, TX has 20. As mentioned by Rud above, the single most expensive weather station, at Amundsen-Scott at the South Pole, has been shamefully broken without justification. Yet their TOKYO is missing a breakpoint it must have if there is any validity to their methods:

    While we are on the subject of the TOKYO station record and its relatively few breakpoints… It doesn’t have a breakpoint I expected. March 1945 should have generated one heckofa breakpoint and a probable station move. BEST doesn’t show one. BEST can tease 20 station moves and breakpoints out of the data for Luling, TX. But BEST somehow feels no breakpoint is warranted on a day when 100,000 people died in a city-wide firestorm.

  21. Nobody believes the BEST temperature series anymore.
    It started out as saying they were going to give us a true representation of temperature, but then Robert Rohde’s algorithm was designed to take out ALL the cooling breakpoints (even including the 90% false positive ones) and leave in all the warming breakpoints (including the 90% that were true positives), and what we’ve been left with is a raw temperature series adjusted up by +1.5C. Even worse than the NCDC.
