An uncorrected assumption in BEST's station quality paper

I noted with a chuckle today this statement over at the California Academy of Sciences “Climate Change Blog”:

I think that we all need to be careful of not falling into the unqualified and inexpert morass characterized by vessels like Anthony Watts.  – Peter D. Roopnarine

Seeing that compliment, and since we are having so much fun this week reviewing papers online and watching street lamps melt due to posited global warming, this seemed like a good time to bring this up. I’ve been sitting on this little gem for a year now, and it is finally time to point it out since nobody seems to have caught it.

I expected that after the peer review BEST has been through (and failed), this would have been fixed. Nope. I thought after the media blitzes it would have been fixed. Nope. I thought that after they submitted it to The Third Santa Fe Conference on Global and Regional Climate Change, somebody would point it out and fix it. Nope. I thought after I pointed it out in the Watts et al 2012 draft paper, surely one of the BEST co-authors would fix it. Still nope.

The assumption error I spotted last year still exists in the May 20th edition of the BEST paper Earth Atmospheric Land Surface Temperature and Station Quality in the Contiguous United States by Richard A. Muller, Jonathan Wurtele, Robert Rohde, Robert Jacobsen, Saul Perlmutter, Arthur Rosenfeld, Judith Curry, Donald Groom, and Charlotte Wickham, 2012, Berkeley Earth Surface Temperature Project (online here, PDF).

From line 32 of the abstract:

A histogram study of the temperature trends in groupings of stations in the NOAA categories shows no statistically significant disparity between stations ranked “OK” (CRN 1, 2, 3) and stations ranked as “Poor”(CRN 4, 5).

From the analysis:

FIG. 4. Temperature estimates for the contiguous United States, based on the classification of station quality of Fall et al. (2011) of the USHCN temperature stations, using the Berkeley Earth temperature reconstruction method described in Rohde et al. (2011). The stations ranked CRN 1, 2 or 3 are plotted in red and the poor stations (ranked 4 or 5) are plotted in blue.

Did you catch it? It is the simplest possible assumption error, yet it is obvious, and in my opinion it renders the paper fatally flawed. Answer below.

Note the NOAA CRN station classification system, derived from Leroy 1999 and described in the Climate Reference Network (CRN) Site Information Handbook, 2002, which is online here (PDF).

This CRN classification system was used in the Fall et al 2011 paper and the Menne et al 2010 paper as the basis for these studies. Section 2.2.1 of the NOAA CRN handbook says this:

2.2.1 Classification for Temperature/Humidity

  • Class 1 – Flat and horizontal ground surrounded by a clear surface with a slope below 1/3 (<19°). Grass/low vegetation ground cover <10 centimeters high. Sensors located at least 100 meters from artificial heating or reflecting surfaces, such as buildings, concrete surfaces, and parking lots. Far from large bodies of water, except if it is representative of the area, and then located at least 100 meters away. No shading when the sun elevation >3 degrees.
  • Class 2 – Same as Class 1 with the following differences. Surrounding Vegetation <25 centimeters. Artificial heating sources within 30m. No shading for a sun elevation >5°.
  • Class 3 (error 1°C) – Same as Class 2, except no artificial heating sources within 10 meters.
  • Class 4 (error ≥ 2°C) – Artificial heating sources <10 meters.
  • Class 5 (error ≥ 5°C) – Temperature sensor located next to/above an artificial heating source, such a building, roof top, parking lot, or concrete surface.

Note that Class 1 and 2 stations have no errors associated with them, but Classes 3, 4, and 5 do.
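
For readers who want to tinker, here is a minimal Python sketch of that classification as a lookup table. The error bounds come from the handbook excerpt above; the variable and function names are simply mine for illustration:

```python
# Illustrative lookup of the CRN siting classes and the error bounds given in
# section 2.2.1 of the 2002 NOAA CRN Site Information Handbook (quoted above).
# Classes 1 and 2 carry no stated siting error; Classes 3-5 do.
CRN_ERROR_BOUND_C = {
    1: 0.0,  # no stated error
    2: 0.0,  # no stated error
    3: 1.0,  # "error 1 degC"
    4: 2.0,  # "error >= 2 degC" (lower bound)
    5: 5.0,  # "error >= 5 degC" (lower bound)
}

def is_acceptably_sited(crn_class: int) -> bool:
    """Acceptable per Leroy 1999 / the CRN handbook: only Classes 1 and 2."""
    return crn_class in (1, 2)
```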

From actual peer-reviewed science: Menne, M. J., C. N. Williams Jr., and M. A. Palecki, 2010: On the reliability of the U.S. surface temperature record, J. Geophys. Res., 115, D11108, doi:10.1029/2009JD013094 (online here, PDF).

It says in Menne et al 2010, section 2, “Methods”:

…to evaluate the potential impact of exposure on station siting, we formed two subsets from the five possible USCRN exposure types assigned to the USHCN stations by surfacestations.org, and reclassified the sites into the broader categories of “good” (USCRN ratings of 1 or 2) or “poor” exposure (USCRN ratings of 3, 4 or 5).

In Fall et al, 2011, the paper of which I am a co-author, we say:

The best and poorest sites consist of 80 stations classified as either CRN 1 or CRN 2 and 61 as CRN 5 (8% and 6% of all surveyed stations, respectively).

and

Figure 2. Distribution of good exposure (Climate Reference Network (CRN) rating = 1 and 2) and bad exposure (CRN = 5) sites. The ratings are based on classifications by Watts [2009] using the CRN site selection rating shown in Table 1. The stations are displayed with respect to the nine climate regions defined by NCDC.

Clearly, per Leroy 1999 and the 2002 NOAA CRN Handbook, both Menne et al 2010 and Fall et al 2011 treat Class 1 and 2 stations as well sited, aka “good” sites, and Classes 3, 4, and 5 as poorly sited, or “poor”.

In Watts et al 2012, we say on line 289:

The distribution of the best and poorest sites is displayed in Figure 1. Because Leroy (2010) considers both Class 1 and Class 2 sites to be acceptably representative for temperature measurement, with no associated measurement bias, these were combined into the single “compliant” group, with all others, Classes 3, 4, and 5, as the “non-compliant” group.

Let’s compare again to Muller et al 2012, but first, let’s establish the date of the document for certain, from the document properties dialog:

From line 32 of the abstract:

A histogram study of the temperature trends in groupings of stations in the NOAA categories shows no statistically significant disparity between stations ranked “OK” (CRN 1, 2, 3) and stations ranked as “Poor”(CRN 4, 5).

From the analysis:

FIG. 4. Temperature estimates for the contiguous United States, based on the classification of station quality of Fall et al. (2011) of the USHCN temperature stations, using the Berkeley Earth temperature reconstruction method described in Rohde et al. (2011). The stations ranked CRN 1, 2 or 3 are plotted in red and the poor stations (ranked 4 or 5) are plotted in blue.

Note the color key of the graph.

On line 108 they say this, apparently just making up their own site quality grouping and ignoring the siting class acceptability established in the previous peer-reviewed literature.

We find that using what we term as OK stations (rankings 1, 2 and 3) does not yield a statistically meaningful difference in trend from using the poor stations (rankings 4 and 5).

They binned it wrong. BEST mixed an unacceptable station class set, Class 3, with a 1°C error (per Leroy 1999, CRN Handbook 2002, Menne et al 2010, Fall et al 2011, and of course Watts et al 2012) into the acceptable classes of stations, Classes 1&2, calling the Class 123 group “OK”.
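
To make the difference concrete, here is a small, purely illustrative sketch contrasting the binning used in Menne et al 2010 and Fall et al 2011 with the binning BEST used. Only the class-to-group mappings come from the papers; the station list and helper names are hypothetical:

```python
# Purely illustrative comparison of the two binning schemes.
stations = [("A", 1), ("B", 2), ("C", 3), ("D", 4), ("E", 5)]  # (id, CRN class)

def bin_menne_fall(crn_class: int) -> str:
    # Menne et al 2010 / Fall et al 2011: Classes 1-2 "good", Classes 3-5 "poor"
    return "good" if crn_class in (1, 2) else "poor"

def bin_best(crn_class: int) -> str:
    # Muller et al 2012 (BEST): Classes 1-3 "OK", Classes 4-5 "poor"
    return "OK" if crn_class in (1, 2, 3) else "poor"

for station_id, crn in stations:
    print(station_id, crn, bin_menne_fall(crn), bin_best(crn))
# A Class 3 station lands in "poor" under the established grouping,
# but in "OK" under BEST's grouping -- the assumption at issue here.
```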

They mention their reasoning starting on line 163:

The Berkeley Earth methodology for temperature reconstruction method is used to study the combined groups OK (1+2+3) and poor (4+5). It might be argued that group 3 should not have been used in the OK group; this was not done, for example, in the analysis of Fall et al. (2011). However, we note from the histogram analysis shown in Figure 2 that group 3 actually has the lowest rate of temperature rise of any of the 5 groups. When added in to the “poor” group to make the group that consists of categories 3+4+5, it lowers the estimated rate of temperature rise, and thus it would result in an even lower level of potential station quality heat bias.

Maybe, but when Leroy 1999, the CRN Handbook 2002, Leroy 2010, the WMO’s endorsement of Leroy 2010 as a standard, Fall et al 2011, and now Watts et al 2012 all say that Classes 1 and 2 are acceptable and Classes 3, 4, and 5 are not, can you really just make up your own ideas of what is and is not acceptable station siting? Maybe they were trying to be kind to me, I don’t know, but the correct way of binning is to use Classes 1 and 2 as acceptable, and Classes 3, 4, and 5 as unacceptable. The results should always be based on that, especially when siting standards have been established and endorsed by the World Meteorological Organization. To make up your own definition of acceptable station groups is capricious and arbitrary.

Of course, none of this really matters much, because the data BEST had (the same data used in Fall et al 2011) was binned improperly anyway: the surface area of the heat sinks and sources was not considered. That, combined with the binning assumption, renders the Muller/BEST paper pointless.

I wonder if Dr. Judith Curry will ask for her name to be taken off of this paper too.

This science lesson, from an “unqualified and inexpert morass”, is brought to you by the number 3.

132 Comments
Just an engineer
August 3, 2012 8:58 am

Perhaps it is time to rename it to “Berkeley Urban Surface Temperature”?

Pamela Gray
August 3, 2012 9:11 am

Error bars tell you whether or not the average is meaningful. If you were to run every combination of “data” there is in a large error barred data set, you would get a number of different averages that only minimally resembled each other, and some not at all. The combined average of a large error bar data set is not representative of the raw data. If you were to run every combination of “data” there is in a small error barred data set, you would get averages that resemble one another. The combined average is more representative of the raw data. A spurious result is that sometimes, the average of a tight data set matches the average of a crappy set. But that is a false positive. That the averages match in the BEST paper is a false positive. Period.
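
A minimal sketch of that point, using made-up numbers rather than any real station data: means of repeated subsamples from a high-scatter set wander far more than means from a tight set, so one pair of matching averages proves little.

```python
import numpy as np

rng = np.random.default_rng(0)
center = 0.3                                   # arbitrary "true" value
tight = rng.normal(center, 0.05, size=500)     # small error bars
noisy = rng.normal(center, 1.00, size=500)     # large error bars

def subsample_mean_spread(data, n_draws=1000, k=50):
    """Standard deviation of the means of repeated random subsamples of size k."""
    means = [rng.choice(data, size=k, replace=False).mean() for _ in range(n_draws)]
    return np.std(means)

print("tight set:", subsample_mean_spread(tight))  # small spread of averages
print("noisy set:", subsample_mean_spread(noisy))  # roughly 20x larger spread
```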

RockyRoad
August 3, 2012 9:18 am

…an “unqualified and inexpert morass” forces the comeuppance of a variety of so-called “expert climate scientists”.
Fun to watch.

tckev
August 3, 2012 9:26 am

“Oh what a tangled web we weave, When first we practise to deceive!”
Sir Walter Scott.

Stephen Richards
August 3, 2012 9:29 am

They binned it wrong. BEST mixed an unacceptable station class set, Class 3, with a 1°C error (per Leroy 1999, CRN Handbook 2002, Menne et al 2010, Fall et al 2011, and of course Watts et al 2012) into the acceptable classes of stations, Classes 1&2, calling the Class 123 group “OK”.
It’s subprime climate change. Put the rubbish with the A class and call it AAA.

Steve C
August 3, 2012 9:41 am

You have sharp eyes, Mr. Watts, for a “vessel”. (D’you s’pose they subconsciously meant “vassal”? – It’s a very strange locution.)
Too often, climate data mashing brings to mind the old truth:
1 barrel sewage + 1 teaspoon wine = 1 barrel sewage
1 barrel wine + 1 teaspoon sewage = 1 barrel sewage

August 3, 2012 9:46 am

BEST data shows western USA temperatures have fallen off a cliff.
http://sunshinehours.wordpress.com/2012/08/01/best-usa-tmax-fell-off-a-cliff-on-west-coast/
http://sunshinehours.wordpress.com/2012/08/02/best-usa-5-years-averages-fall-off-a-cliff-continued/
Maybe Mosher or Zeke can explain why CO2 AGW ignores large parts of the USA.

August 3, 2012 9:51 am

“I think that we all need to be careful of not falling into the unqualified and inexpert morass characterized by vessels like Anthony Watts.”
I think Freeman Dyson already has demonstrated that a Ph.D. is not necessary in order to do research in physics.

Edward Bancroft
August 3, 2012 9:59 am

Isn’t there something else misleading about that graph? Namely that it is showing the temperature differences from the norm for both plots, which are bound to be the same, as even a bad UHI site record which consistently reads high will faithfully follow the temperature changes.
Further, the UHI affected sites historically start out unaffected and slowly build up the error, into modern times. I suspect that this is also being masked in that graph, because it is a slowly moving effect compared to the smaller periods used for the temperature anomaly plots. Showing the absolute temperature plots might give us a different slant.
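
A minimal illustration of the first point, with made-up numbers only: a record with a constant warm bias has exactly the same anomaly series as a clean record, while a bias that builds up over time shows as extra trend.

```python
import numpy as np

years = np.arange(1980, 2011)
clean = 12.0 + 0.02 * (years - 1980)           # hypothetical "true" record, degC
constant_bias = clean + 2.0                    # poorly sited, reads 2 degC high
growing_bias = clean + 0.03 * (years - 1980)   # bias that builds up over time

def anomalies(series):
    """Departures from the series' own long-term mean."""
    return series - series.mean()

print(np.allclose(anomalies(clean), anomalies(constant_bias)))  # True: identical
print(np.allclose(anomalies(clean), anomalies(growing_bias)))   # False: extra trend
```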

August 3, 2012 10:02 am

How to Make a Good Curry
“….. but I thought it was not unreasonable for them to want to post a joint statement since we disagree.”
That sounds to me like she’s not willing to lose her status in the political-climate movement.
“They still seem to want me on the team in spite of public disagreements. And I like having an inside track on what is going on with the project.” Good enough…
But, what exactly is keeping her “inside track” here? I think she is afraid to come out and state the obvious, because she knows that the political-climate movement could throw her under the bus in no time.
A better idea; when your old friends turn out to be unreliable, get some new ones.

August 3, 2012 10:09 am

Judith Curry says:
“…and I like having an inside track on what is going on with the project.”
Says it all. They’ve got her under control.

Shevva
August 3, 2012 10:09 am

What a bunch of Muppets.
What would you call a group of Climate Scientists though, a Chorus of Climate Scientists?

Neil Jordan
August 3, 2012 10:12 am

The following article relates to “Texting and language skills”
http://languagelog.ldc.upenn.edu/nll/?p=4099
but the statistical manipulations described are germane to the BEST binning situation. The quote suitable for framing is:
“There’s a special place in purgatory reserved for scientists who make bold claims based on tiny effects of uncertain origin; and an extra-long sentence is imposed on those who also keep their data secret, publishing only hard-to-interpret summaries of statistical modeling. The flames that purify their scientific souls will rise from the lake of lava that eternally consumes the journalists who further exaggerate their dubious claims. Those fires, alas, await Drew P. Cingel and S. Shyam Sundar, the authors of “Texting, techspeak, and tweens: The relationship between text messaging and English grammar skills”, New Media & Society 5/11/2012. . .”
There follows a list of offending journalists, culminating with:
“And, of course, in a specially-hot lava puddle all his own, the guy who wrote the press release from Penn State: Matt Swayne, “No LOL matter: Tween texting may lead to poor grammar skills”, 7/26/2012.”

michael hart
August 3, 2012 10:16 am

The “histogram” shown is not a histogram, that’s my first problem with it.
Where I come from, a histogram displays frequencies.
I must be getting old.

Frenchie77
August 3, 2012 10:26 am

Just wondering, but doesn’t a histogram normally have on the vertical axis a simple instance/occurrence count?
This ‘histogram’ isn’t really a histogram, it is a standard 2-variable plot of anomalies versus year.
Or did someone re-define a histogram differently? I know, I know – These things sometimes happen in universities and don’t necessarily flow out to the real world where numbers have meaning and effect.

Mac the Knife
August 3, 2012 10:34 am

Edward Bancroft says:
August 3, 2012 at 9:59 am
Astute observations, Edward.
MtK

August 3, 2012 10:35 am

If people can bin arbitrarily, they will sooner or later find a binning system that gives them the result they’re looking for, cooling or warming. But even if everybody can agree on a binning system and where to draw the line between OK and poor in that system, it may still be a pretty arbitrary line. The physics don’t care whether there is an agreement or not.
The category 1-5 system described doesn’t seem to me a very precise way to estimate error. For instance, it seems to disqualify entire mountainous areas (often remote from human activity) apart from the summits and possibly isolated spots that really are atypical of the area. Also, it doesn’t help if there’s a km to the nearest road or building if it makes the observer skip readings.
Since we’re talking about temperatures requiring accuracy to the tenth of a degree to tell one trend from another, it seems pretty irrelevant to me whether to have two bins 1-2 and 3-4-5 or 1-2-3 and 4-5. First I would prefer all stations to be automatic, to minimise human error. Then, it seems that both OK and poor (with respect to significant temperature errors) stations could end up in category 3 and possibly 4 as well, so to eliminate that uncertainty, I would eliminate those categories and compare only 1-2 and 5, all automatic. If that leaves too few stations to give a “robust” result, then the result may simply be “we don’t know”.

gopal panicker
August 3, 2012 10:46 am

a gathering of climate ‘scientists’…a fever…an idiocy…a corruption…a herd…a troop…etc

Resourceguy
August 3, 2012 10:55 am

Data quality is in fact the very last thing that participants in the grand paper mill of researchers seeking promotion want to talk about. That is true across many disciplines. Follow the money and egos but don’t look back.

DonS
August 3, 2012 11:16 am

Judith Curry and the BEST team have me off to refresh my memory on the causes and effects of the Stockholm Syndrome.

Mac the Knife
August 3, 2012 11:27 am

Shevva says:
August 3, 2012 at 10:09 am
“What would you call a group of Climate Scientists….
Shevva,
Let’s call them a ‘Blind Man’s Bluff’: Each climatologist is blindfolded with ‘find the warming’ grant money. As they all grope around blindly, searching for anything to grab onto to justify their next grant and chance to play, they ‘peer review’ each other by calling out “You’re getting warmer, warmer…. Ohhh, You’re on Fire!”
Mtk

yamaka
August 3, 2012 11:38 am

Surely a group of Climate Scientists is a “Consensus” 😉

August 3, 2012 11:43 am

A real histogram by CRN rating would be good.
A histogram for each CRN rating in each NOAA region would be good, too.
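
For what it’s worth, a sketch of what such a plot would involve; the per-station trends below are made up, not taken from the BEST data.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Hypothetical decadal trends (degC/decade) for 40 stations in each CRN class
trends_by_class = {c: rng.normal(0.2, 0.1, size=40) for c in range(1, 6)}

fig, ax = plt.subplots()
bins = np.arange(-0.2, 0.65, 0.05)
for crn, trends in trends_by_class.items():
    # A true histogram: number of stations falling in each trend bin
    ax.hist(trends, bins=bins, alpha=0.4, label=f"CRN {crn}")
ax.set_xlabel("Trend (degC/decade)")
ax.set_ylabel("Number of stations")
ax.legend()
plt.show()
```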

jorgekafkazar
August 3, 2012 11:44 am

Shevva says: “What would you call a group of Climate Scientists though, a Chorus of Climate Scientists?”
Without intending to demean all slimate…oops, that was truly a Freudian slip. Let me start over: not all climate scientists would be included in this group, but based on majority behaviors, I lean towards a pander of climate scientists.

CoonAZ
August 3, 2012 11:52 am

Something else that could create confusion in the BEST report:
In Figure 4 RED = OK, BLUE = poor.
In Figure 1 RED = poor, BLUE = OK.