An uncorrected assumption in BEST's station quality paper

I noted with a chuckle today this statement over at the California Academy of Sciences’ “Climate Change Blog”:

I think that we all need to be careful of not falling into the unqualified and inexpert morass characterized by vessels like Anthony Watts.  – Peter D. Roopnarine

Seeing that compliment, and since we are having so much fun this week reviewing papers online and watching street lamps melt due to posited global warming, this seemed like a good time to bring this up. I’ve been sitting on this little gem for a year now, and it is finally time to point it out since nobody seems to have caught it.

I expected that after the peer review BEST went through (and failed), this would have been fixed. Nope. I thought after the media blitzes it would have been fixed. Nope. I thought that after they submitted it to The Third Santa Fe Conference on Global and Regional Climate Change somebody would point it out and fix it. Nope. I thought after I pointed it out in the Watts et al 2012 draft paper, surely one of the BEST co-authors would fix it. Still nope.

The assumption error I spotted last year still exists in the May 20th edition of the BEST paper Earth Atmospheric Land Surface Temperature and Station Quality in the Contiguous United States by Richard A. Muller, Jonathan Wurtele, Robert Rohde, Robert Jacobsen, Saul Perlmutter, Arthur Rosenfeld, Judith Curry, Donald Groom, Charlotte Wickham, 2012, Berkeley Earth Surface Temperature Project (online here, PDF).

From line 32 of the abstract:

A histogram study of the temperature trends in groupings of stations in the NOAA categories shows no statistically significant disparity between stations ranked “OK” (CRN 1, 2, 3) and stations ranked as “Poor” (CRN 4, 5).

From the analysis:

FIG. 4. Temperature estimates for the contiguous United States, based on the classification of station quality of Fall et al. (2011) of the USHCN temperature stations, using the Berkeley Earth temperature reconstruction method described in Rohde et al. (2011). The stations ranked CRN 1, 2 or 3 are plotted in red and the poor stations (ranked 4 or 5) are plotted in blue.

Did you catch it? It is the simplest of assumption errors possible, yet it is obvious, and renders the paper fatally flawed in my opinion.  Answer below. 

Note the NOAA CRN station classification system, derived from Leroy 1999, described in the Climate Reference Network (CRN) Site Information Handbook, 2002, which is online here (PDF).

This CRN classification system was used in the Fall et al 2011 paper and the Menne et al 2010 paper as the basis for these studies. Section 2.2.1 of the NOAA CRN handbook says this:

2.2.1 Classification for Temperature/Humidity

  • Class 1 – Flat and horizontal ground surrounded by a clear surface with a slope below 1/3 (<19º). Grass/low vegetation ground cover <10 centimeters high. Sensors located at least 100 meters from artificial heating or reflecting surfaces, such as buildings, concrete surfaces, and parking lots. Far from large bodies of water, except if it is representative of the area, and then located at least 100 meters away. No shading when the sun elevation >3 degrees.
  • Class 2 – Same as Class 1 with the following differences. Surrounding Vegetation <25 centimeters. Artificial heating sources within 30m. No shading for a sun elevation >5º.
  • Class 3 (error 1ºC) – Same as Class 2, except no artificial heating sources within 10 meters.
  • Class 4 (error ≥ 2ºC) – Artificial heating sources <10 meters.
  • Class 5 (error ≥ 5ºC) – Temperature sensor located next to/above an artificial heating source, such as a building, roof top, parking lot, or concrete surface.

Note that Class 1 and 2 stations have no errors associated with them, but Classes 3, 4, and 5 do.
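For readers who prefer to see this as code, here is a minimal Python sketch (my own illustration, not code from the handbook or from any of the papers) of the expected measurement error the handbook attaches to each class. Only Classes 3, 4, and 5 carry a stated error:

```python
# Illustrative only: expected temperature error per CRN siting class,
# as listed in section 2.2.1 of the 2002 NOAA CRN handbook.
# Classes 1 and 2 have no stated error; for Classes 4 and 5 the values are lower bounds.
CRN_CLASS_ERROR_C = {
    1: 0.0,  # no error stated
    2: 0.0,  # no error stated
    3: 1.0,  # "error 1 degC"
    4: 2.0,  # "error >= 2 degC"
    5: 5.0,  # "error >= 5 degC"
}

def has_siting_error(crn_class):
    """True if the handbook attaches a measurement error to this class."""
    return CRN_CLASS_ERROR_C[crn_class] > 0.0

print([c for c in sorted(CRN_CLASS_ERROR_C) if has_siting_error(c)])  # [3, 4, 5]
```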

From actual peer reviewed science: Menne, M. J., C. N. Williams Jr., and M. A. Palecki, 2010: On the reliability of the U.S. surface temperature record, J. Geophys. Res., 115, D11108, doi:10.1029/2009JD013094 (online here, PDF).

It says in Menne et al 2010, section 2 “Methods”:

…to evaluate the potential impact of exposure on station siting, we formed two subsets from the five possible USCRN exposure types assigned to the USHCN stations by surfacestations.org, and reclassified the sites into the broader categories of “good” (USCRN ratings of 1 or 2) or “poor” exposure (USCRN ratings of 3, 4 or 5).

In Fall et al, 2011, the paper of which I am a co-author, we say:

The best and poorest sites consist of 80 stations classified as either CRN 1 or CRN 2 and 61 as CRN 5 (8% and 6% of all surveyed stations, respectively).

and

Figure 2. Distribution of good exposure (Climate Reference Network (CRN) rating = 1 and 2) and bad exposure (CRN = 5) sites. The ratings are based on classifications by Watts [2009] using the CRN site selection rating shown in Table 1. The stations are displayed with respect to the nine climate regions defined by NCDC.

Clearly, per Leroy 1999 and the 2002 NOAA CRN Handbook, both Menne et al 2010 and Fall et al 2011 treat Class 1 and 2 stations as well sited, aka “good” sites, and Classes 3, 4, and 5 as poorly sited, or “poor”.
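In code form, that grouping convention looks something like this minimal Python sketch (again my own illustration; the station IDs are hypothetical, not real USHCN identifiers):

```python
from collections import defaultdict

# Acceptable siting per Leroy 1999 / the 2002 CRN handbook:
# Classes 1 and 2 are "good"/"compliant", Classes 3-5 are "poor"/"non-compliant".
GOOD_CLASSES = {1, 2}

def bin_by_siting(stations):
    """Group (station_id, crn_class) pairs into 'good' and 'poor' bins."""
    groups = defaultdict(list)
    for station_id, crn_class in stations:
        key = "good" if crn_class in GOOD_CLASSES else "poor"
        groups[key].append(station_id)
    return dict(groups)

example = [("STN_A", 1), ("STN_B", 3), ("STN_C", 5), ("STN_D", 2)]
print(bin_by_siting(example))
# {'good': ['STN_A', 'STN_D'], 'poor': ['STN_B', 'STN_C']}
```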

In Watts et al 2012, we say on line 289:

The distribution of the best and poorest sites is displayed in Figure 1. Because Leroy (2010) considers both Class 1 and Class 2 sites to be acceptably representative for temperature measurement, with no associated measurement bias, these were combined into the single “compliant” group, with all others, Classes 3, 4, and 5, as the “non-compliant” group.

Let’s compare again to Muller et al 2012, but first, let’s establish the date of the document for certain, from the document properties dialog:

From line 32 of the abstract:

A histogram study of the temperature trends in groupings of stations in the NOAA categories shows no statistically significant disparity between stations ranked “OK” (CRN 1, 2, 3) and stations ranked as “Poor” (CRN 4, 5).

From the analysis:

FIG. 4. Temperature estimates for the contiguous United States, based on the classification of station quality of Fall et al. (2011) of the USHCN temperature stations, using the Berkeley Earth temperature reconstruction method described in Rohde et al. (2011). The stations ranked CRN 1, 2 or 3 are plotted in red and the poor stations (ranked 4 or 5) are plotted in blue.

Note the color key of the graph.

On line 108 they say this, apparently making up their own site quality grouping and ignoring the siting-class acceptability established in the previous peer reviewed literature.

We find that using what we term as OK stations (rankings 1, 2 and 3) does not yield a statistically meaningful difference in trend from using the poor stations (rankings 4 and 5).

They binned it wrong. BEST mixed an unacceptable station class, Class 3, which carries a 1°C error (per Leroy 1999, the CRN Handbook 2002, Menne et al 2010, Fall et al 2011, and of course Watts et al 2012), in with the acceptable station classes, Classes 1 and 2, calling the combined Class 1-2-3 group “OK”.

They mention their reasoning starting on line 163:

The Berkeley Earth methodology for temperature reconstruction method is used to study the combined groups OK (1+2+3) and poor (4+5). It might be argued that group 3 should not have been used in the OK group; this was not done, for example, in the analysis of Fall et al. (2011). However, we note from the histogram analysis shown in Figure 2 that group 3 actually has the lowest rate of temperature rise of any of the 5 groups. When added to the in “poor” group to make the group that consists of categories  3+4+5, it lowers the estimated rate of temperature rise, and thus it would result in an even lower level of potential station quality heat bias.

Maybe, but when Leroy 1999, the CRN Handbook 2002, Leroy 2010, the WMO’s endorsement of Leroy 2010 as a standard, Fall et al 2011, and now Watts et al 2012 all say that Classes 1 and 2 are acceptable, and Classes 3, 4, and 5 are not, can you really just make up your own ideas of what is and is not acceptable station siting? Maybe they were trying to be kind to me, I don’t know, but the correct way of binning is to use Classes 1 and 2 as acceptable, and Classes 3, 4, and 5 as unacceptable. The results should always be based on that, especially when siting standards have been established and endorsed by the World Meteorological Organization. To make up your own definition of acceptable station groups is capricious and arbitrary.
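To make the disagreement explicit, here is a short Python sketch (my own illustration, not anything from the BEST analysis code) comparing where each CRN class lands under the standard grouping versus the grouping Muller et al 2012 chose. The only class that moves between bins is Class 3, the one carrying a 1°C error:

```python
# Standard grouping (Leroy 1999, CRN Handbook 2002, Menne et al 2010, Fall et al 2011, Watts et al 2012)
STANDARD = {"good": {1, 2}, "poor": {3, 4, 5}}
# Grouping used in Muller et al 2012 (BEST)
BEST_2012 = {"OK": {1, 2, 3}, "poor": {4, 5}}

def assign(crn_class, scheme):
    """Return the bin name that a given CRN class falls into under a scheme."""
    return next(name for name, classes in scheme.items() if crn_class in classes)

for crn_class in range(1, 6):
    std = assign(crn_class, STANDARD)
    best = assign(crn_class, BEST_2012)
    note = "  <-- only class that changes bins" if (std == "poor") != (best == "poor") else ""
    print(f"Class {crn_class}: standard = {std:4s}  BEST = {best}{note}")
```

Run it and the mismatch is obvious: the “OK” bin in Muller et al 2012 contains stations that every prior grouping treated as unacceptably sited.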

Of course, none of this really matters much, because the data that BEST had (the same data as Fall et al 2011) was binned improperly anyway, since the surface area of the heat sinks and sources was not considered; that, combined with the binning assumption, renders the Muller/BEST paper pointless.

I wonder if Dr. Judith Curry will ask for her name to be taken off of this paper too?

This science lesson, from an “unqualified and inexpert morass”, is brought to you by the number 3.

JB

The difference between the “poor” (4+5) sites and the “OK” (1+2+3) sites is 0.09 ± 0.07 °C per century. We also tried other groupings; the difference between the (3+4+5) grouping and the “good” (1+2) sites is -0.04 ± 0.10 °C per century, i.e. the other sites are warming at a slower rate than are the good sites, although the effect is not larger than the statistical uncertainty. There is no evidence that the poor sites show a greater warming trend than do the OK sites.

I hope that Dr. Judith Curry does disassociate herself from the “BEST” paper. Have you put this suggestion to her, Anthony?

gator69

Excellent catch Mr Watts! Sneaky little bastards get their junk caught in the wringer again!!!

Bloke down the pub

Well I suppose the warmists have always thought that OK was good enough for Government work.

Congrats on the 2012 pre-print release Anthony.
Muller says:
“group 3 actually has the lowest rate of temperature rise of any of the 5 groups.”

Is this still true under the Leroy 2011 re-analysis?
[REPLY: No. But then Leroy (2010) is assigning stations to different groups than if Leroy (1999) is applied. Read the paper again. -REP]

Amid the flood of Figures, stats, claims, etc I may have missed the simplest direct demonstration of all, namely a curve for each class, rather than hiding behind bins.
REPLY: Excellent suggestion – Anthony

Superbly written.
Compared to Anthony I’m genuinely one of the “inexpert and unqualified morass” – yet I trust Anthony more because he writes in such a gentlemanly way. After all, people who throw mud usually have something to hide.
Even Anthony’s final pay-off line is a restrained but suitably humorous and assertive rebuttal: “This science lesson, from an “unqualified and inexpert morass”, is brought to you by the number 3.”
Trust? Who do I trust in this debate? Not the supposed “experts” that’s for sure.

tadchem

An even more fundamental error, one which can only be accounted for through mendacity, is that they are plotting a histogram of *anomalies* (transient deviations), while the siting errors resulting in the five site categories would be expected to create *biases* (systematic deviations).
Apples and oranges…

Bloke down the pub

If there is so little difference between the ok and the poor sites, why bother having sites at all. Why not just make the figures up? oh err mmm

Coach Springer

Questions from a non-scientist/non-statistician:
1. I see their discussion proving they knowingly binned it the way they did, but their discussion is why it supposedly didn’t “hurt” to exclude it from the poor category. But what was the purpose of including the unacceptable in the acceptable category contrary to established, standard practice in the first place?
2. What is Curry’s reason for not already taking her name off this paper too if “Of course none of this really matters much, because the data that BEST had (the same data from Fall et al 2011), was binned improperly anyway due to surface area of the heat sinks and sources not being considered…”?

cbltoo

As a layman, I don’t see a significant difference in the graphs – other laymen – like politicians – also will not see same. Can the scale of the graphs be adjusted to show the precise differences?

Bill Yarber

Anthony.
It would be interesting to see the histogram comparing the 1,2 and 3,4,5 groupings. If statistically different then the whole paper should be withdrawn. Can you show that? If not, the climate science cabal will just ignore your point. Hope you find something significant.
Also, it is deplorable that only 8% of the stations were deemed Good (CRN 1 & 2). Deplorable!
Bill

David Fogg

Somebody help me out. The very poor stations (4, 5) don’t show a different trend to the good, ok and slightly poor stations?? If the error of a class 4 is >=2deg, and the error of a class 5 is >=5deg, how is this possible?? Shouldn’t the comparison plot show that the poor stations have a temperature anomaly much higher than the good stations?

Bill Yarber

Another thought: The histogram should use only the raw data from those two groups. You have already pointed out that the “adjustments” impact good stations more to bring them in line with the bad station increases (think I got that the right way). In any case, only the raw, unadjusted data should be used to compare these two groupings.
Bill

gregole

BEST, upon close examination, appears as nothing more than confirmation bias writ large sponsored under the aegis of a well-oiled PR campaign. Shameful to claim it as a scientific endeavor. Nice catch Anthony and thank you – if BEST was legitimate that is the message you would be receiving from them.
tadchem says:
August 3, 2012 at 6:48 am
An even more fundamental error, one which can only be accounted for through mendacity, is that they are plotting a histogram of *anomalies* (transient deviations), while the siting errors resulting in the five site categories would be expected to create *biases* (systematic deviations).
+1

Kaboom

Trust no one is the proper setting when checking for airtightness of your space suit and the assertions of experts.

Jeremy

I see in their histogram study for those temperature trends, they also seem to have used the MEAN instead of the MEDIAN. Wouldn’t this create influence from the ends of the distribution that they would want to avoid? In the plots they have the words “cut outliers”, but wouldn’t that just bring up the need to justify where you make your cuts?

Somebody
Poriwoggu

I don’t see why all five classes aren’t plotted individually. Is there a reason this wasn’t done? Is it possible to get the data and plot all 5 classes?

Bill Illis

Berkeley has 43% higher trend for both the United States (back to 1895) and for the United Kingdom (back to 1753).
This is disturbing.
http://s16.postimage.org/9awf95qx1/Berkeley_US_vs_USHCN_1895.png
http://s18.postimage.org/k6t9xf4mh/Berkeley_UK_vs_Had_CET_1753.png

Gibby

Just my two cents, but it seems to me that in order to handle the disparity statement correctly they needed to test groups 3, 4, and 5 individually against a grouping of 1 and 2. Then you would truly be able to establish if the divergence is statistically significant and therefore say which adjustments/classifications are an issue and need to be reassessed.

EternalOptimist

If what they say starting on line 163 is accurate, and type 3 has the lowest rate of temperature rise of all five groups, and both bins produce an identical anomaly, then removing 3 from the ‘ok’ bin would mean that the ok bin would go up and the poor bin would go down.
And Muller would thereby prove that sites on UHI would return lower temperatures than those that are perfectly situated
only in the topsy turvy world of Muller

Steve Oregon

Here is a really dumb question.
Why can’t the climate science community acknowledge any significance in the error of their ways?
If nothing that challenges them is allowed to be significant then what’s the point of their pretending to be scientific?
Isn’t significance important?

On her blog, Climate Etc., Dr Curry on the thread
Observation-based (?) attribution
stated
“curryja | July 31, 2012 at 12:36 pm | Reply
Kip, Muller emailed this to me (he wrote it), I said it was ok to post. I am making my own statements about this, but I thought it was not unreasonable for them to want to post a joint statement since we disagree. They still seem to want me on the team in spite of public disagreements. And I like having an inside track on what is going on with the project.”
Maybe she still feels this way.

When in Rome you have to do what the Romans do. And, at this point in the AGW Hoax, it brings no virtue or righteousness to science but who cares: global warming stopped being about science a long time ago. It’s all about politics and being a skeptic is the last service an honest man can do for science. At least the politics have been corrected even as academia has lost all pretense to scholarship.

And others are there who go along heavily and creakingly, like carts taking stones downhill: they talk much of dignity and virtue–their drag they call virtue!
And others are there who are like eight-day clocks when wound up; they tick, and want people to call ticking–virtue.
Verily, in those have I mine amusement: wherever I find such clocks I shall wind them up with my mockery, and they shall even whirr thereby!
And others are proud of their modicum of righteousness, and for the sake of it do violence to all things: so that the world is drowned in their unrighteousness.
Ah! how ineptly cometh the word “virtue” out of their mouth! And when they say: “I am just,” it always soundeth like: “I am just–revenged!”
With their virtues they want to scratch out the eyes of their enemies; and they elevate themselves only that they may lower others.
~Nietzsche (Zarathustra)

Bryan Mulder

Would someone re-draw the graph with the binning adjusted for the number three temperature data, as you described? How about a five color plot, one color for each of the five categories, as Leif just suggested? (or is the raw data not available?)

Theo Goodwin

Smashing work, Anthony. They cannot hide the pea from you.

AnonyMoose

“However, we note from the histogram analysis shown in Figure 2 that group 3 actually has the lowest rate of temperature rise of any of the 5 groups.”
So they looked for the temperature rise which they were expecting, and tried to arrange their data based upon what they were expecting or what they wanted. In this case, a better match between the bins supports their hoping that the data shows a temperature rise, so they claim their arrangement is better because it produces the expected result and they can claim that the data is good.
Their noting that they are aware of the binning done by the other studies just makes it worse.

lowercasefred

They don’t care, they have the press carrying water for them, they don’t have to care.
“IDIOT, n. A member of a large and powerful tribe whose influence in human affairs has always been dominant and controlling. The Idiot’s activity is not confined to any special field of thought or action, but “pervades and regulates the whole.” He has the last word in everything; his decision is unappealable. He sets the fashions and opinion of taste, dictates the limitations of speech and circumscribes conduct with a dead-line.” – Ambrose Bierce

Something happened to these US temps since 1999.
Here is Hansen, the keeper of US temps and father of global warming, in 1999 trying to explain why US temperatures were lower in 1999 than in the 1930’s, by almost 0.5C.
http://www.giss.nasa.gov/research/briefs/hansen_07/
Then go to the link below and chart the data as it currently stands. Plug in Annual and from 1895 to 1999.
http://www.ncdc.noaa.gov/oa/climate/research/cag3/na.html
Now in the new, improved version of US temps, 1999 is 2-3F higher than 1895, and higher than the 1930’s. So somehow since Hansen published paper in 1999, temps got adjusted about 2-3F UP!
Now plug 1999-2011 into that NCDC site, and you’ll find 2011 about 1F lower than 1999. So if you join the 1999 Hansen chart with the 1999-2011 NCDC chart, you end up with 2011 being about as warm as 1895 and a full 1C cooler than the 1930’s. According to the Hansen 1999 data spliced with the NCDC data, the US is now colder than it has been for much of the past century.

Peeved

Are these differences between “adjusted” sets or between raw data?

gacooke

cut outliers??? You would expect some sort of discussion of “cut outliers” in the text. Did I miss it?

lowercasefred, quoted from ‘the Devil’s Dictionary’ of Ambrose Bierce. The dictionary is very clever and is available free of charge from the Gutenberg site; gutenberg.org.

Pamela Gray

For those of you who think the paper has not made a statistical mistake remixing bins. This has nothing to do with whether or not the bins are mixed this way and that and then track each other’s average anomaly, as BEST seems to want to say. It has to do with error bars being significantly different from each other. Bins 1 and 2 have small error bars. Bins 3, 4, and 5 have larger error bars. Comparing the average anomalies without error bars is deceptive. BEST wants to mix good data with crappy data (without telling us) and say there is nothing wrong, move along. Reminds me of the “hide the decline” trick.
The average of crappy data is meaningless. For a really good lesson in crappy data compared to tight data, plot just the error bars.

greg holmes

1900 = 0.5 , 2010 = 0, looks like a lot of fuss over no result, why the hell are we spending billions on this crap?

Dan in Nevada

Bill Yarber says:
August 3, 2012 at 7:03 am
Bill, I almost said the same thing until I saw your post. I think you nailed the real issue. If I understand correctly, the “adjusted” data used by BEST homogenizes the readings from all stations. In fact that’s a word you see a lot and it literally means blending everything together, good and bad. So why would there be any surprise when the trends from all categories plot right on top of one another?
Worse, and I want to see more on this, Watts 2012 expressly states that the homogenization largely consists of making the “good” stations look like the “poor” stations. That’s a rather bold claim and I don’t think they would make it if they couldn’t back it up.

TomL

The red curve and the blue curve look to me to be too close to identical in every detail to actually represent independent measurements made at different sites. Are we sure they were not “homogenized” to force them to match before the comparison was made?

Coach Springer says:
August 3, 2012 at 6:51 am
Questions from a non-scientist/non-statistician:
1. I see their discussion proving they knowingly binned it the way they did, but their discussion is why it supposedly didn’t “hurt” to exclude it from the poor category. But what was the purpose of including the unacceptable in the acceptable category contrary to established, standard practice in the first place?
=================
Exactly – why didn’t they at least show it both ways?
They must have tried including group 3 with 4 & 5, because they clearly discussed it. So, why didn’t they show it? The obvious answer is that it didn’t support what they were trying to show.
This appears to be a case of selection bias. BEST has almost certainly tried grouping 3 with 4 and 5, and didn’t like what they saw, so they “selected” (cherry picked) a different grouping to give them the results they wanted.
Isn’t “cherry picking” the results by creating an arbitrary, non-standard methodology a form of scientific fraud? Isn’t hiding the results of the standard methodology also a form of scientific fraud?

Paul Fischbeck

In the plot of anomalies, what are the anomalies based on? What baseline temperature are they using?
Are the 1+2+3 stations compared to some long-run average temperature of 1+2+3 stations or some average temperature of all stations? Likewise with the 4+5 stations.
What corrections have been made to the station temperatures before they are averaged and used to create the anomalies graph?

Disko Troop

I believe the point of this post is that it is just plain WRONG to include Class 3 in the good grouping. I don’t see why Muller needs to do it . It may or may not significantly alter the result. That is entirely irrelevant in my opinion. If it is WRONG now ,and if the results stand, all future papers based on these results are WRONG. Currently insignificant looking errors will be compounded. The further down the line you get the more the errors will be. WRONG is WRONG.
Take away their computers until they have re-learned the basics

Resourceguy

Keep your eye on the pea as we shuffle the cups. Just don’t call it science.

lowercasefred

@John K Sutherland 8:14 a.m.
Not only is Bierce clever, but he is one of the most accurate observers of mankind that has ever been.

If you’re a vessel, Anthony, you’re one full of learning, wisdom and real-world savvy. Your opponents, on the other hand, exhibit the most famous attribute of empty vessels.

Rud Istvan

The bigger problem is that the BEST binning definitions themselves are not correct. Menne and Fall papers also showed little quality impact because of that. Only when stations are properly binned using the new WMO standard does the magnitude of the problem become clear. Worse, the homogenization procedures exaggerate rather than dampen the problem. And I personally doubt that TOBS adjustments will change the result much, since that variable is orthogonal to micro site quality (so should randomly affect all bins about equally). This is why AW’s paper is so important concerning land records.

Bill Illis,
When doing a US comparison, make sure to use CONUS rather than total US temperatures, since USHCN is CONUS only.
http://berkeleyearth.lbl.gov/regions/contiguous-united-states


[;-)]

pochas

The use of ad hominem, innuendo and invective is characteristic of those with a hidden agenda (follow the money). Judith does not belong in that camp, and she should promptly get up and leave.

polistra

Vessel? What a strange epithet! Do they imagine Anthony to be a luxurious cruise ship? A four-masted schooner? A pirate clipper? Or, if they’re seeing him in a ‘morass’, maybe he’s one of those Louisiana swamp airboats.

BEST graph, “adjusted” vs raw data:

Fred

The new spelling for BEST is WORST.