I noted, with a chuckle, this statement today over at the California Academy of Sciences “Climate Change Blog”:
I think that we all need to be careful of not falling into the unqualified and inexpert morass characterized by vessels like Anthony Watts. – Peter D. Roopnarine
Seeing that compliment, and since we are having so much fun this week reviewing papers online and watching street lamps melt due to posited global warming, this seemed like a good time to bring this up. I’ve been sitting on this little gem for a year now, and it is finally time to point it out since nobody seems to have caught it.
I expected that after the peer review BEST had been through (and failed), this would have been fixed. Nope. I thought that after the media blitzes it would have been fixed. Nope. I thought that after they submitted it to The Third Santa Fe Conference on Global and Regional Climate Change, somebody would point it out and fix it. Nope. I thought that after I pointed it out in the Watts et al. 2012 draft paper, surely one of the BEST co-authors would fix it. Still nope.
The assumption error I spotted last year still exists in the May 20th edition of the BEST paper Earth Atmospheric Land Surface Temperature and Station Quality in the Contiguous United States by Richard A. Muller, Jonathan Wurtele, Robert Rohde, Robert Jacobsen, Saul Perlmutter, Arthur Rosenfeld, Judith Curry, Donald Groom, Charlotte Wickham: 2012, Berkeley Earth Surface Temperature Project (online here PDF).
From line 32 of the abstract:
A histogram study of the temperature trends in groupings of stations in the NOAA categories shows no statistically significant disparity between stations ranked “OK” (CRN 1, 2, 3) and stations ranked as “Poor”(CRN 4, 5).
From the analysis:
FIG. 4. Temperature estimates for the contiguous United States, based on the classification of station quality of Fall et al. (2011) of the USHCN temperature stations, using the Berkeley Earth temperature reconstruction method described in Rohde et al. (2011). The stations ranked CRN 1, 2 or 3 are plotted in red and the poor stations (ranked 4 or 5) are plotted in blue.
Did you catch it? It is the simplest of assumption errors possible, yet it is obvious, and renders the paper fatally flawed in my opinion. Answer below.
Note the NOAA CRN station classification system, derived from Leroy 1999 and described in the Climate Reference Network (CRN) Site Information Handbook, 2002, which is online here (PDF).
This CRN classification system was used in the Fall et al 2011 paper and the Menne et al 2010 paper as the basis for these studies. Section 2.2.1 of the NOAA CRN handbook says this:
2.2.1 Classification for Temperature/Humidity
- Class 1 – Flat and horizontal ground surrounded by a clear surface with a slope below 1/3 (<19º). Grass/low vegetation ground cover <10 centimeters high. Sensors located at least 100 meters from artificial heating or reflecting surfaces, such as buildings, concrete surfaces, and parking lots. Far from large bodies of water, except if it is representative of the area, and then located at least 100 meters away. No shading when the sun elevation >3 degrees.
- Class 2 – Same as Class 1 with the following differences. Surrounding Vegetation <25 centimeters. Artificial heating sources within 30m. No shading for a sun elevation >5º.
- Class 3 (error 1ºC) – Same as Class 2, except no artificial heating sources within 10 meters.
- Class 4 (error ≥ 2ºC) – Artificial heating sources <10 meters.
- Class 5 (error ≥ 5ºC) – Temperature sensor located next to/above an artificial heating source, such as a building, roof top, parking lot, or concrete surface.
Note that Class 1 and 2 stations have no errors associated with them, but Classes 3, 4, and 5 do.
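For readers who want to tinker with these classes themselves, here is a minimal Python sketch encoding the error bounds from the handbook excerpt above (the dictionary and function names are mine, invented purely for illustration):

```python
# Measurement error associated with each CRN class, per section 2.2.1 of the
# NOAA CRN Site Information Handbook (2002). Classes 1 and 2 carry no stated
# error; Classes 3-5 do. Values for Classes 4 and 5 are lower bounds (">=").
CRN_CLASS_ERROR_C = {
    1: None,  # no error stated
    2: None,  # no error stated
    3: 1.0,   # error 1 deg C
    4: 2.0,   # error >= 2 deg C
    5: 5.0,   # error >= 5 deg C
}

def has_documented_error(crn_class):
    """Return True if the handbook assigns a measurement error to this class."""
    return CRN_CLASS_ERROR_C[crn_class] is not None

print([c for c in CRN_CLASS_ERROR_C if has_documented_error(c)])  # [3, 4, 5]
```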
From actual peer reviewed science: Menne, M. J., C. N. Williams Jr., and M. A. Palecki, 2010: On the reliability of the U.S. surface temperature record, J. Geophys. Res., 115, D11108, doi:10.1029/2009JD013094 Online here PDF
It says in Menne et al 2010, section 2, “Methods”:
…to evaluate the potential impact of exposure on station siting, we formed two subsets from the five possible USCRN exposure types assigned to the USHCN stations by surfacestations.org, and reclassified the sites into the broader categories of “good” (USCRN ratings of 1 or 2) or “poor” exposure (USCRN ratings of 3, 4 or 5).
In Fall et al, 2011, the paper of which I am a co-author, we say:
The best and poorest sites consist of 80 stations classified as either CRN 1 or CRN 2 and 61 as CRN 5 (8% and 6% of all surveyed stations, respectively).
and
Figure 2. Distribution of good exposure (Climate Reference Network (CRN) rating = 1 and 2) and bad exposure (CRN = 5) sites. The ratings are based on classifications by Watts [2009] using the CRN site selection rating shown in Table 1. The stations are displayed with respect to the nine climate regions defined by NCDC.
Clearly, per Leroy 1999 and the 2002 NOAA CRN Handbook, both Menne et al 2010 and Fall et al 2011 treat Class 1 and 2 stations as well sited, aka “good” sites, and Classes 3, 4, and 5 as poorly sited or “poor”.
In Watts et al 2012, we say on line 289:
The distribution of the best and poorest sites is displayed in Figure 1. Because Leroy (2010) considers both Class 1 and Class 2 sites to be acceptably representative for temperature measurement, with no associated measurement bias, these were combined into the single “compliant” group with all others, Classes 3, 4, and 5, as the “non-compliant” group.
Let’s compare again to Muller et al 2012, but first, let’s establish the date of the document for certain, from the document properties dialog:
From line 32 of the abstract:
A histogram study of the temperature trends in groupings of stations in the NOAA categories shows no statistically significant disparity between stations ranked “OK” (CRN 1, 2, 3) and stations ranked as “Poor”(CRN 4, 5).
From the analysis:
FIG. 4. Temperature estimates for the contiguous United States, based on the classification of station quality of Fall et al. (2011) of the USHCN temperature stations, using the Berkeley Earth temperature reconstruction method described in Rohde et al. (2011). The stations ranked CRN 1, 2 or 3 are plotted in red and the poor stations (ranked 4 or 5) are plotted in blue.
Note the color key of the graph.
On line 108 they say this, apparently just making up their own site quality grouping and ignoring the siting class acceptability established in the previous peer reviewed literature.
We find that using what we term as OK stations (rankings 1, 2 and 3) does not yield a statistically meaningful difference in trend from using the poor stations (rankings 4 and 5).
They binned it wrong. BEST mixed an unacceptable station class, Class 3, which carries a 1°C error (per Leroy 1999, CRN Handbook 2002, Menne et al 2010, Fall et al 2011, and of course Watts et al 2012), in with the acceptable classes of stations, Classes 1 and 2, calling the combined Class 1+2+3 group “OK”.
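To make the difference concrete, here is a minimal sketch of the two binning conventions. The station IDs and ratings are made up for illustration; only the grouping rules come from the papers quoted above:

```python
# Hypothetical station ratings (station ID -> CRN class); invented for illustration.
stations = {"A": 1, "B": 2, "C": 3, "D": 3, "E": 4, "F": 5}

def bin_stations(ratings, acceptable):
    """Split stations into 'acceptable' and 'unacceptable' groups by CRN class."""
    ok = {sid for sid, c in ratings.items() if c in acceptable}
    bad = set(ratings) - ok
    return ok, bad

# Prior literature (Menne et al. 2010, Fall et al. 2011, Watts et al. 2012):
good, poor = bin_stations(stations, acceptable={1, 2})
# -> good = {"A", "B"}, poor = {"C", "D", "E", "F"}

# BEST (Muller et al. 2012) grouping, which moves Class 3 into the "OK" bin:
ok_best, poor_best = bin_stations(stations, acceptable={1, 2, 3})
# -> ok_best = {"A", "B", "C", "D"}, poor_best = {"E", "F"}
```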
They mention their reasoning starting on line 163:
The Berkeley Earth methodology for temperature reconstruction method is used to study the combined groups OK (1+2+3) and poor (4+5). It might be argued that group 3 should not have been used in the OK group; this was not done, for example, in the analysis of Fall et al. (2011). However, we note from the histogram analysis shown in Figure 2 that group 3 actually has the lowest rate of temperature rise of any of the 5 groups. When added to the “poor” group to make the group that consists of categories 3+4+5, it lowers the estimated rate of temperature rise, and thus it would result in an even lower level of potential station quality heat bias.
Maybe, but when Leroy 1999, the CRN Handbook 2002, Leroy 2010, the WMO’s endorsement of Leroy 2010 as a standard, Fall et al 2011, and now Watts et al 2012 all say that Classes 1 and 2 are acceptable, and Classes 3, 4, and 5 are not, can you really just make up your own ideas of what is and is not acceptable station siting? Maybe they were trying to be kind to me, I don’t know, but the correct way of binning is to use Classes 1 and 2 as acceptable, and Classes 3, 4, and 5 as unacceptable. The results should always be based on that, especially when siting standards have been established and endorsed by the World Meteorological Organization. To make up your own definition of acceptable station groups is capricious and arbitrary.
Of course, none of this really matters much, because the data that BEST had (the same data from Fall et al 2011) was binned improperly anyway, due to the surface area of the heat sinks and sources not being considered, which, when combined with the binning assumption, rendered the Muller/BEST paper pointless.
I wonder if Dr. Judith Curry will ask for her name to be taken off of this paper too?
This science lesson, from an “unqualified and inexpert morass”, is brought to you by the number 3.
Yes, I did spot it, in your paper, line 293. It jumped right out at me: oh, that’s the detail that clinches why Muller’s results are c**p, so I made a note of it.
Pamela Gray says:
For those of you who think the paper has not made a statistical mistake by remixing bins: this has nothing to do with whether or not the bins are mixed this way and that and then track each other’s average anomaly, as BEST seems to want to say. It has to do with error bars being significantly different from each other. Bins 1 and 2 have small error bars. Bins 3, 4, and 5 have larger error bars. Comparing the average anomalies without error bars is deceptive. BEST wants to mix good data with crappy data (without telling us) and say there is nothing wrong, move along.
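As an illustration of the kind of check Pamela is describing, here is a minimal sketch comparing two trend estimates with their standard errors. The numbers are invented for illustration (they are not BEST’s), and the simple test assumes the two estimates are independent:

```python
import math

# Hypothetical decadal trends and their standard errors (deg C / decade);
# the values are made up to illustrate the point, they are not from BEST.
trend_good, se_good = 0.20, 0.03   # well-sited stations: tight error bar
trend_poor, se_poor = 0.25, 0.10   # poorly sited stations: wide error bar

# Difference of the two estimates and its combined standard error.
diff = trend_poor - trend_good
se_diff = math.sqrt(se_good**2 + se_poor**2)
z = diff / se_diff

print(f"difference = {diff:.2f} +/- {se_diff:.2f} C/decade, z = {z:.2f}")
# A |z| well below ~2 means the difference is not statistically significant,
# but that conclusion is only meaningful if the error bars are shown at all.
```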
This appears to be common in climate “science”. I’m unaware of anywhere else it would be considered acceptable to knowingly mix bad data with good. (Or to not attempt a revision if this had happened unknowingly.)
This post is clearly mistaken and dramatically misses the point, and the VERY FIRST comment points this out (as a later comment by “Sou” does as well in a bit more detail). Still, there are dozens of other people in the comments saying things like “wow, amazing, excellent catch”. If it is not an incident of collective madness, it must be a flashmob.
tallbloke says:
August 3, 2012 at 6:43 am
Congrats on the 2012 pre-print release Anthony.
Muller says:
“group 3 actually has the lowest rate of temperature rise of any of the 5 groups.”
Is this still true under the Leroy 2011 re-analysis?
[REPLY: No. But then Leroy (2010) is assigning stations to different groups than if Leroy (1999) is applied. Read the paper again. -REP]
Thanks Robert. The relevant section I’ve found is at lines 204-212. I’ll keep reading.
It may have been flagged up already but I spotted a typo at line 387:
‘May’ airports should be ‘Many airports’
As I have said before, Anthony, over at CA and here: if real_climate_scientists consider that their climate science is ‘Apples’, and one uses these very same ‘Apples’ in an independent paper to compare ‘Apples with Apples’, it turns out that apparently real_climate_scientists don’t like eating them-there ‘Apples’.
Unfortunately, I could not get an answer either from you or Steve when I pointed this out.
amoeba says:
August 4, 2012 at 4:42 am
This post is clearly mistaken and dramatically misses the point, and the VERY FIRST comment points this out (as a later comment by “Sou” does as well in a bit more detail). Still, there are dozens of other people in the comments saying things like “wow, amazing, excellent catch”. If it is not an incident of collective madness, it must be a flashmob.
>>>>>>>>>>>>>>>>>>>>>
My understanding is that the trends reported by BEST are post “adjustments”. The whole point of Anthony et al is that the adjustments applied to cat 1&2 stations were larger than the ones applied to 3, 4, & 5 stations. If my understanding is correct, then the points raised by those two commenters are moot.
Were there no stations that deserved a change in classification over the time period?
Why plot anomalies? Why not use type 1 as the trend baseline and then plot the 2 through 5 trends against that?
Oh yeah… forgot to add – AND NO “ADJUSTMENTS”!! Just raw temperature data, thank you.
polistra says:
August 3, 2012 at 8:49 am
Vessel? What a strange epithet!
Not a strange epithet — an archaic religious one. Parsing the phrase in context, the morass is the Swamp of Skepticism containing all us evil non-believers and Anthony is an ambulatory container full of that heresy. Google “vessel of iniquity” and watch the Genesis references pop up.
When your opponents start using religious imagery to disparage you, you know you’re dealing with cultists.
Paul says: August 3, 2012 at 12:43 pm
“I’ve always found it curious how the more vocal of the CAGW alarmists can’t seem to conceive that there isn’t someone, of a higher authority behind the curtains pulling levers and issuing marching orders to his thralls. I suppose if you can’t think critically for yourself, and let some authority issue you your talking points, you assume everyone is like that.”
Indeed.
If you look at the underlying beliefs of most of them, they are subjectivists not capable of rational thought, and they believe in economic control (perhaps the defining theme of Marxism). Since they are such true believers in their own theories they cannot accept that anyone else has a reasoned alternative view. Apparently they cannot say “you are wrong because…”. And in general they like to throw words around.
Beware too that they will try to blow smoke, as I pointed out to Stephen McIntyre when David Karoly was accusing him of something recently, after McIntyre had corrected him on something.
I’ve seen it so many times, from dangerous drivers, to people being noisy in apartment stairwells late at night, to gardeners who claim I talked nasty when I firmly told them that parking in a fire lane was a bad idea, to jerks who throw their cigarette butts on the ground where there is a risk of grass fires instead of using a secure receptacle – they all try to put the monkey on your back.
Anthony, you’re going to have even more of such directed at you this month, due to your new technical paper.
Yes, I was also wondering — are these raw or adjusted temps? Because it sure looks like they must be adjusted based on the overall trend, and if they’re adjusted then the whole thing is completely useless.
I’m lost as to how one measures an anomaly with regard to temperature:
noun, plural a•nom•a•lies.
1. a deviation from the common rule, type, arrangement, or form.
2. someone or something anomalous: With his quiet nature, he was an anomaly in his exuberant family.
3. an odd, peculiar, or strange condition, situation, quality, etc.
4. an incongruity or inconsistency.
The whole idea of looking at two temperature measures per day and averaging them across the globe seems a ridiculous method of proving anything.
In my engineering mind, I would look at single, long-record, representative stations around the globe. If anything was amiss it would show up there. Granted, it would only be known at that point in space. Adding all stations together (in fact they only add a subset of the stations), trying to guess what temperature variations have happened to an area over time, and then averaging the result to a single figure to represent the earth, seems to me a ludicrous exercise in confusing oneself, of no meaningful value.
Greg Cavanagh,
In the present context an anomaly is simply a deviation from the average. Zero baseline charts are used for anomalies.
When using a zero baseline chart, it is possible to show accelerating temperatures. But that is an artifact of the chart; it is not real. A long term trend chart is the proper type of chart to use when looking at whether temperatures are accelerating.
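For anyone unsure what “deviation from the average” means in practice, here is a minimal sketch of computing anomalies for a single station against an arbitrary baseline period. The temperatures, years, and baseline choice are made up for illustration:

```python
# Hypothetical annual mean temperatures (deg C) for one station, 2000-2009.
years = list(range(2000, 2010))
temps = [14.1, 14.3, 13.9, 14.4, 14.2, 14.6, 14.5, 14.3, 14.7, 14.8]

# Pick a baseline period and average it (here 2000-2004, chosen arbitrarily).
baseline = sum(temps[:5]) / 5

# The anomaly is simply each year's deviation from that baseline average,
# which is why anomaly charts are drawn around a zero line.
anomalies = [round(t - baseline, 2) for t in temps]
print(dict(zip(years, anomalies)))
```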
Tall Dave appears to prefer uncalibrated measurements. The semantic point about what NOAA does is that the “adjustments” are really inter-calibrations. Ask John Christy about the problems you can get into without inter-calibrations when you have different instruments or measurement devices that drift.
Greg Cavanagh says:
August 4, 2012 at 1:22 pm
That’s my line of thinking too – there is certainly no logical derivation of a ‘global’ temperature anomaly from the available information. Maybe, with a few million identical stations set at precise, equal, closely spaced points and heights, and measuring continuously, etc., we might get a decent idea – but it would still only be a snapshot at that given height/level.
The whole global temp anomaly thing is a scare tactic IMO and is certainly not scientifically valid as a ‘measurement’. At best it could be an indicator – but when they chop and change the data all the time, what is it actually indicating?
Shouldn’t class 2 be artificial heating sources between 30-100m? and Class 3 be between 10-30m?
Class 1 – Sensors located at least 100 meters from artificial heating
Class 2 – Artificial heating sources within 30m
Class 3 (error 1ºC) – no artificial heating sources within 10 meters.
Class 4 (error ≥ 2ºC) – Artificial heating sources <10 meters.
Class 5 (error ≥ 5ºC) – Temperature sensor located next to/above an artificial heating source
The whole idea of changing the grouping of sites based on what the results are is clearly flawed science, whichever way it swings the “findings”. You cannot use the results of a study to justify regrouping your inputs.
I’m surprised that a “future genius” like Muller would be trying to publish a paper using that kind of method.
Perhaps this is one of the reasons his papers got rejected.
This science lesson, from an “unqualified and inexpert morass”, is brought to you by the number 3.
———
So they appear to have changed the binning scheme to potentially exaggerate the discrepancy between good and poor stations, and thereby provide support for Anthony’s claim that station quality is important.
But they failed.
So now Anthony is complaining that their trick intended to support him was wrong.
Personally I think it doesn’t matter whether they used the same binning scheme as previous papers. This is not cast in stone, and as long as it’s documented it’s fine.
LazyTeenager;
This is not cast in stone, and as long as it’s documented it’s fine.
>>>>>>>>>>>>>>>>>>>>>>
Yes! It doesn’t matter how wrong you do things, itz OK as long as you document them. Integrity of data and process don’t matter as long as you document everything. A professor once told me that when you don’t know what you are doing, do it in excruciating detail. I guess Lazy subscribes to the same philosophy.
Replying to *davidmhoffer* (August 4, 2012 at 8:17 am):
—–
amoeba says: …
My understanding is that the trends reported by BEST are post “adjustments”. The whole point of Anthony et al is that the adjustments applied to cat 1&2 stations were larger than the ones applied to 3, 4, & 5 stations. If my understanding is correct, then the points raised by those two commenters are moot.
—–
I beg to disagree. Anthony wrote a long post and not a single time did he mention “adjustments”. The whole post (and quite a detailed and lengthy one) is ONLY about groupings. The wrong grouping (1+2+3 instead of 1+2) is presented as a, quote, little gem, unquote, that everybody else failed to notice. And this critique is absurd, as correctly pointed out by previous commentators.
Pamela Gray says
The average of crappy data is meaningless.
———-
This is a bogus over-generalization. The exact way in which data is crappy is very important.
I can look at graphs of data/signal which look just like noise. Given the right processing techniques I can pull out a signal that is as clear as day.
Everyone here uses technology on a daily basis that depends on being able to process crappy data and derive from it good data. This includes all of the electronic devices you use.
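For what it’s worth, here is a minimal sketch of the general idea LazyTeenager is gesturing at: averaging many independent noisy measurements of the same signal shrinks the noise by roughly the square root of the number of measurements. The signal and noise levels are made up for illustration, and note the caveat that a systematic bias (such as a siting bias) does not average away the way independent noise does:

```python
import math
import random

random.seed(0)

# A made-up underlying "signal": a slow linear drift over 100 time steps.
signal = [0.01 * t for t in range(100)]

def noisy_copy(sig, noise_sd=1.0):
    """One measurement of the signal buried in Gaussian noise."""
    return [s + random.gauss(0.0, noise_sd) for s in sig]

# Averaging N independent noisy copies shrinks the noise by roughly sqrt(N).
N = 400
copies = [noisy_copy(signal) for _ in range(N)]
averaged = [sum(vals) / N for vals in zip(*copies)]

rms_err = math.sqrt(sum((a - s) ** 2 for a, s in zip(averaged, signal)) / len(signal))
print(f"RMS error of the average of {N} noisy copies: {rms_err:.3f} (a single copy is ~1.0)")
```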
davidmhoffer on August 5, 2012 at 1:17 am
LazyTeenager;
This is not cast in stone and as long as its documented it’s fine.
>>>>>>>>>>>>>>>>>>>>>>
Yes! It doesn’t matter how wrong you do things, itz OK as long as you document them. Integrity of data and process don’t matter as long as you document everything.
———-
Well, the boundaries of the binning process are somewhat arbitrary, as is the classification scheme. So it’s NOT wrong.
If they document it and someone else wants to pick nits then the critic can do it some other way and prove that their new way is better.
But the point is that if the binning convention is changed to make the results more comparable to past papers, it will still not be wrong, just different. The final outcome of this will be even less difference between the temperature trends of good stations and poor stations.
amoeba;
I beg to disagree. Anthony wrote a long post and not a single time did he mention “adjustments”. The whole post (and quite a detailed and lengthy one) is ONLY about groupings.
>>>>>>>>>>>>>>>>>>
It was BEST who claimed the results showed the trends they did, and BEST calculated those trends from adjusted data.
I am sorry, davidmhoffer, but you are confused. I would recommend you go and read this paper, then maybe you will see it yourself. What you are saying is, first, beside the point (as I tried to explain before), and, second, plain wrong: BEST is not even using adjusted data! They use “scalpel” and outlier deweighting, but they do not use adjusted data as it was usually done before them. But again: in the context of this post it is irrelevant! Anthony’s post is not about adjusting. It is about binning. Period. I will not argue anymore; I can only recommend that you actually read the BEST papers.
PS. LazyTeenager is completely right, by the way.