The surfacestations paper – statistics primer

By John Nielsen-Gammon (from his blog Climate Abyss – be sure to bookmark it, highly recommended – Anthony)

As I mentioned in my last post, I did a lot of the statistical analysis in the recent paper reporting on the effect of station siting on surface temperature trends. For those who are curious or extremely bored, here’s how I did the testing:

I was invited to participate after the bulk of the analysis was completed. I decided to confirm the analysis by doing my own independent analysis. It showed some differences, and we concluded that the technique I was using was better, so after some more testing we went ahead and used it in the paper.

Trend Generating

One subtle point: we didn’t assess the differences in individual station measurements. Because the accuracy of US climate trends was the original motivation, we assessed the differences in estimates of US trends using different subsets of the USHCN data.

There are two basic requirements for getting a robust trend estimate over a geographical area. First, you have to work with anomalies or changes over time (first differences) rather than the raw temperatures themselves. This is because individual temperatures are very location-specific, whereas anomalies are more uniform. If it was a cold year in Amarillo, it was probably a cold year in Lubbock too by about the same amount, even though the average temperatures might be 2-3 C different.
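A minimal Python sketch of the anomaly idea follows. The temperatures are invented for illustration (they are not USHCN data), and the function name `anomalies` is hypothetical rather than anything taken from the paper's software.

```python
import numpy as np

def anomalies(temps):
    """Return each value minus the series' own long-term mean."""
    temps = np.asarray(temps, dtype=float)
    return temps - temps.mean()

# Invented annual mean temperatures (deg C) for two nearby cities.
amarillo = np.array([14.0, 13.2, 14.5, 13.8])
lubbock = amarillo + 2.5  # same year-to-year weather, warmer baseline

# The raw temperatures differ by 2.5 C, but the anomalies coincide.
print(anomalies(amarillo))
print(anomalies(lubbock))
```

Because the constant offset between the two locations drops out, the anomalies can be averaged across stations without the warmer site dominating.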

The second requirement is to take account of the uneven distribution of stations. For example, suppose you have climate stations in El Paso, Corpus Christi, and Dallas. An average of the anomalies at these three stations might be a good approximation to the statewide anomaly. But if another station gets added near El Paso, you wouldn’t want to do a straight four-station average because it would be too strongly influenced by weather goings-on near El Paso. A more reasonable approach might be to average the two El Paso stations together first. The more general principle is that a station should matter more in the overall average if it is far from other stations, and matter less if lots of other stations are nearby.
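The El Paso example can be sketched in a few lines of Python. All anomaly values are invented for illustration, and averaging co-located stations first is only the simple approach described above, not necessarily what any particular dataset does.

```python
import numpy as np

# Invented anomalies (deg C) for a single year; illustrative only.
el_paso_1, el_paso_2 = 1.0, 1.2
corpus_christi = -0.2
dallas = 0.1

# Straight four-station average: the El Paso area counts twice.
naive = np.mean([el_paso_1, el_paso_2, corpus_christi, dallas])

# Average the two El Paso stations first, then average the three locations.
el_paso = np.mean([el_paso_1, el_paso_2])
clustered = np.mean([el_paso, corpus_christi, dallas])

print(naive, clustered)
```

With these made-up numbers the straight average is pulled toward the warm El Paso anomaly, while the two-step average gives each location equal weight.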

We met the first requirement by taking anomalies from 30-year averages (we tested different periods and different ways of averaging and it didn’t matter much), and the second by averaging stations within the nine climate regions (see Fig. 2 of the paper) before computing a US average. There are plenty of other approaches; for example, NCDC’s preliminary analysis of siting quality used a gridded analysis, but we checked and our numbers weren’t very different.

So, for example, the CRN 1&2 trend was obtained by computing the anomalies at each CRN 1&2 (well-sited) station, averaging the anomalies within each climate region, then averaging nationally (using the size of each region as a relative weight), and finally computing the ordinary least-squares trend of those US averages.
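The four steps just described might look roughly like this in Python. This is a sketch under stated assumptions, not the code used for the paper: the function name, argument layout, region labels, areas, and synthetic temperatures are all hypothetical, and missing data are ignored.

```python
import numpy as np

def national_trend(years, station_temps, station_region, region_area, base):
    """OLS trend (deg per year) of an area-weighted national anomaly series.

    station_temps:  array (n_stations, n_years) of temperatures
    station_region: sequence (n_stations,) of region labels
    region_area:    dict mapping region label -> relative area (weight)
    base:           boolean mask over years selecting the baseline period
    """
    temps = np.asarray(station_temps, dtype=float)
    station_region = np.asarray(station_region)
    # Step 1: anomalies relative to each station's own baseline mean.
    anoms = temps - temps[:, base].mean(axis=1, keepdims=True)
    # Step 2: average the anomalies within each region.
    regions = sorted(set(station_region.tolist()))
    regional = np.array([anoms[station_region == r].mean(axis=0) for r in regions])
    # Step 3: national average, weighting each region by its area.
    weights = np.array([region_area[r] for r in regions], dtype=float)
    national = np.average(regional, axis=0, weights=weights)
    # Step 4: ordinary least-squares slope of the national series.
    slope, _intercept = np.polyfit(years, national, 1)
    return slope

# Synthetic check: three stations that all warm by 0.02 deg per year.
years = np.arange(1979, 2009)
offsets = np.array([10.0, 12.0, 20.0])[:, None]
temps = offsets + 0.02 * (years - years[0])
slope = national_trend(years, temps, ["west", "west", "south"],
                       {"west": 2.0, "south": 1.0},
                       base=np.ones(years.size, dtype=bool))
print(slope)
```

The synthetic stations have very different baseline temperatures, yet the recovered slope is the common warming rate, which is the point of working with anomalies before averaging.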

Difference Testing: Monte Carlo

The next task was to determine whether trends from different groups of stations were significantly different from each other. The standard statistical tests for this compare the difference in slopes with the scatter of points about the trend lines. But this isn’t appropriate for our data because of a crucial problem: the scatter about the trend line is not uncorrelated noise. There’s a bit of autocorrelation, but more importantly, the scatter in one set of points is always going to be highly correlated with the scatter in another set of points. If a particular year was cold, it was cold no matter what quality class of station you use to measure it.

Whatever test we used had to reflect the correlation between different station classes as well as the autocorrelation within a station class. It also, ideally, would take into account that the distribution of stations among climate regions was uneven so some regions might only have two stations within a class, with each station therefore having a big influence on the overall trends.

No standard test can deal with all that, so I used a Monte Carlo approach. Ritzy name, simple concept. In fact, it’s so simple you don’t need to know statistics to understand it. Given two classes of stations whose trends needed comparing, I randomly assigned stations to each class, while making sure that the total number of stations in each class stayed the same and that each climate region had at least two stations of each class. I then computed and stored the difference in trends. I then repeated this process a total of 10,000 times.

The result is 10,000 trend differences obtained from random sets of stations. The conventional criterion for statistical significance is that there be a less than 5% chance that a trend difference so large could have come about randomly. So all you do is look at the random trend differences and see what percentage of them are larger than the one you computed using the real classification. Since you don’t know ahead of time which trend should be larger, you use the absolute value of the trend differences, or, equivalently, require that fewer than 2.5% of the random trend differences be more positive (or more negative) than the observed trend difference.
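For readers who prefer code, here is a stripped-down Python sketch of the shuffling test. It makes several simplifying assumptions that are not in the paper: each station is reduced to a single trend number, a class trend is just the mean of its stations' trends (no regional weighting), the requirement of at least two stations per class in each climate region is omitted, and a random difference counts as "as large" when its absolute value is greater than or equal to the observed one. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def permutation_p_value(values, is_class_a, n_iter=10_000, seed=0):
    """Two-sided Monte Carlo p-value for a difference in class means.

    values:     per-station trend (a simplification, see the text above)
    is_class_a: boolean array marking the stations in the first class
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    is_class_a = np.asarray(is_class_a, dtype=bool)

    def diff(mask):
        return values[mask].mean() - values[~mask].mean()

    observed = diff(is_class_a)
    count = 0
    for _ in range(n_iter):
        # Shuffle the labels: class sizes stay fixed, membership is random.
        shuffled = rng.permutation(is_class_a)
        if abs(diff(shuffled)) >= abs(observed):
            count += 1
    return observed, count / n_iter

# Synthetic example: 50 stations in one class, 150 in the other.
rng = np.random.default_rng(1)
labels = np.arange(200) < 50
trends = np.where(labels, 0.30, 0.20) + rng.normal(0, 0.05, 200)
observed, p = permutation_p_value(trends, labels)
print(observed, p)
```

A p-value below 0.05 here corresponds to the conventional criterion described above; a faithful reproduction would recompute the full regional and national trend for every shuffle rather than a simple mean.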

Difference Testing: Proxy Stations

One assumption of our Monte Carlo approach is that the station locations are random. Now, random does not mean evenly spaced. But, as a reviewer pointed out, the good stations were often concentrated on one side or another of a climate region, more so than would seemingly be expected randomly, and maybe some of the differences were due to the peculiar geographical arrangement of stations.

To test this possibility, I identified “proxy stations”. For each CRN 1&2 station, I found the nearest CRN 3 or CRN 4 station to serve as its proxy. I then compared the trends calculated using the real CRN 1&2 stations to the trends calculated using the proxy CRN 1&2 stations. The test is as follows: if the trend estimates from the proxy stations match those from the larger CRN 3&4 group, then the trend isn’t sensitive to that particular station distribution. If, instead, the trend estimate from the proxies matches the trend estimate from the CRN 1&2 stations, then I can’t discard the possibility that the CRN 1&2 trends are due to the station distribution rather than the siting.
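Finding proxies is essentially a nearest-neighbor search. The Python sketch below assumes great-circle (haversine) distance and lets the same proxy serve more than one station; the post does not say which distance measure or tie-breaking rule was actually used. The function names are hypothetical, the coordinates are approximate city locations, and the assignment of cities to siting classes is invented.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def nearest_proxies(targets, candidates):
    """For each (lat, lon) target, return the index of the nearest candidate."""
    targets = np.asarray(targets, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    proxies = []
    for lat, lon in targets:
        d = haversine_km(lat, lon, candidates[:, 0], candidates[:, 1])
        proxies.append(int(np.argmin(d)))
    return proxies

# Pretend "well-sited" stations: Amarillo and Corpus Christi.
good = [(35.22, -101.83), (27.80, -97.40)]
# Pretend "typically sited" stations: El Paso, Dallas, Lubbock.
typical = [(31.76, -106.49), (32.78, -96.80), (33.58, -101.86)]

proxy_index = nearest_proxies(good, typical)
print(proxy_index)
```

The proxy group then has the same size and nearly the same geographical layout as the well-sited group, so any trend difference between the two is hard to blame on station distribution.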

Because of the small number of CRN 5 stations, I also created proxies for them and performed a similar test.

The proxy test didn’t affect our trend results much, but it did matter a lot with Section 4, where we tried to look at temperature differences directly. So I’m very grateful to the reviewer for insisting on more proof.

With the proxies, we were also able to do a neat little attribution analysis. Consider a little algebra:

CRN 1&2 – CRN 5 = (CRN 1&2 – CRN 1&2 Proxies) + (CRN 1&2 Proxies – CRN 5 Proxies) + (CRN 5 Proxies – CRN 5)

The temperature difference between the best and worst sited stations can be broken down into three terms: the first term shows how the best stations differ from their (typically-sited) neighbors, the second term shows how the difference in station distribution contributes, and the third term shows how the worst stations differ from their neighbors.
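The decomposition is a telescoping sum, which a few lines of Python can confirm. The four trend values are invented purely to exercise the identity; they are not results from the paper.

```python
# Invented trend values (deg C per decade); only the identity matters.
crn12, crn12_proxy, crn5_proxy, crn5 = 0.20, 0.24, 0.27, 0.32

term1 = crn12 - crn12_proxy        # best stations vs. their neighbors
term2 = crn12_proxy - crn5_proxy   # effect of station distribution
term3 = crn5_proxy - crn5          # neighbors vs. worst stations

total = crn12 - crn5
print(term1 + term2 + term3, total)
```

The proxy terms cancel in pairs, so the three pieces always add up to the full best-minus-worst difference regardless of the values chosen.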

By plotting these differences over time (Fig. 8 in the paper) we were able to show that most of the minimum temperature trend difference between best and worst comes from the third term, while most of the maximum temperature trend difference comes from the first term. There’s some info in there about the relative importance of different types of siting deficiencies on the maxes and mins, and we intend to explore this issue in more detail in a subsequent paper.

The same figure showed that the trend differences arise during the mid to late 1980s, when many stations underwent simultaneous instrumentation and siting changes.

The software I used for my analyses is going to be publicly posted by Anthony Watts once he gets all our supplementary information assembled. With a topic having such lay community interest, we thought it important to make it as easy as possible to duplicate (and go beyond) our results. I did my coding in Python, but it’s only the second Python program package I’ve ever written. I hope critical software engineers overlook the many fortranisms that are undoubtedly embedded in the code.

===============================================================

Note: I hope to have the SI completed later today or tomorrow at the latest. A separate announcement will be made here and also on surfacestations.org – Anthony

UPDATE 5/13: The SI has been posted on surfacestations.org; see the link on the main page.

Rhoda Ramirez

May 12, 2011 10:31 am

If nothing else, this study shows that climate research CAN be done using classic scientific techniques and tools. Thanks

Hector M.

May 12, 2011 11:24 am

I have not read the paper yet, of course, but I wonder whether identifying well-sited and wrong-sited stations today is sufficient to identify the effect over time of urban encroaching or the increase of other local-heat factors over time (such as the station being originally in a grass field that was subsequently paved with cement at a certain date).
Perhaps two stations situated nearby from each other started being rural and located on grass, but one of them subsequently became urban or sited on a paved lot: one should note differences between the trends in these stations, besides differences in the temperature they report today.
The volunteer work, invaluable as it is, as well as other station data, may or may not have recorded station history with such detail. What is the actual information available on the historical evolution of the station surroundings, and its correlation with trend differences?

Paul Deacon

May 12, 2011 12:03 pm

Anthony – Christchurch, New Zealand ought to be an interesting place to study UHI. Thanks to an earthquake on 22FEB11, the entire city centre was vacated. Before and after measurements from nearby stations ought to provide interesting data for UHI study.
All the best.

PhilJourdan

May 12, 2011 12:07 pm

Not boring at all! And very well written for the layman. I have some experience in Statistics (being an Economist), but nothing compared to the experts. Yet I found your piece to be easy to read and understand! Very well done!

kuhnkat

May 12, 2011 12:11 pm

The problem is STILL that ALL the sites are close to anthropogenic influences. when you have a set of stations at least 10 miles from the nearest town or village that spatially cover the country I will be more interested in the numbers you can come up with. Until then only the sorry satellite data is worth considering for coming up with areal trends.
Yes, I understand that means we have no long term data. That long term data has to be evaluated based on the fact that it IS contaminated with anthro effects. Please talk to Dr. Spencer, when he has some time, about his limited study showing that small towns had larger UHI than the big ones. The reason we have seen no temp increase in the last 10 years is more a flattening of the UHI effect than any real change in the climate.

morgo

May 12, 2011 12:12 pm

http://www.weatherzone.com.au/news/cold-blast-has-wide-reaching-effect/17347 australia having early snow coldest since 1970

Jeremy

May 12, 2011 12:34 pm

Oh god, Fortran…

1DandyTroll

May 12, 2011 12:45 pm

Would it be terrible hard for you guys to group stations into classes of urban and suburban locations based on standardized population sizes for villages, towns and cities, and probably a airport strip class, that could be compared to rural locations, which should be its own group of classes, to get a bunch of trend lines to compare?
Personally I think national or global average should be based only on completely rural readings, but it would be rather interesting to see how much civilization adds to the mix.

docduke

May 12, 2011 12:58 pm

I hope critical software engineers overlook the many fortranisms that are undoubtedly embedded in the code.!!! One never forgets one’s mother tongue! I’ve been coding in Python for over a decade, and the first draft of most packages still looks like FORTRAN, but with simpler declarations!
I have learned a lot of very valuable information from this blog, not the least of which is an introduction to R. If you want to mix the two, consider PypeR. I have tried several packages that try to mix Py and R, and I found this the simplest. I have recently downloaded strucchange and look forward to trying it out! I wait for your software with bated breath!

Hector M.

May 12, 2011 1:05 pm

To stress my point, anomalies in a poorly sited station that has always been poorly sited need not be different from anomalies in a well-sited station that has always been well sited. They should not yield much different trends. But the difference would arise whenever one of the stations has undergone more “urbanization” than the other one, i.e. increasing asphalt or steel or cement or engines in its vicinity. I wonder if anyone knows what is exactly the case about this in the new dataset.

Gator

May 12, 2011 1:15 pm

Hey Hector! There are many landlocked rural stations in the midwestern US that are still in grassy fields. Almost all of them show either no warming over the past 100 years, or even a slight cooling. Here is a good example…
http://data.giss.nasa.gov/cgi-bin/gistemp/gistemp_station.py?id=425745560020&data_set=1&num_neighbors=1
You must be careful when selecting stations, that you do not use a station that has suffered infrastructure increase in its surroundings.
All stations can be accessed here…
http://data.giss.nasa.gov/gistemp/station_data/
It becomes quite clear that UHI is the main driver of rising temperature records.

TonyG

May 12, 2011 1:22 pm

It’s good to see all this being explained so clearly – thanks

Hector M.

May 12, 2011 1:33 pm

Dandy Troll,
it is not only a matter of population. First, consider areas with little resident population but large concentration of people and cars (e.g. predominantly office areas, like Downtown Manhattan). Then consider areas with sparse population but much traffic, as near busy highways and their crossings. And finally, consider the fact (verified by the surfacestations.org volunteers) that many “rural” stations have suffered the encroachment of built structures in their surroundings, some of them formerly on grass but now sitting on a paved parking lot, or on top of a tin roof, although never changing site and always staying in a rural area and within the same institution (say, an agricultural research facility or a meteorological one).

Jim

May 12, 2011 1:34 pm

When you say “Given two classes of stations whose trends needed comparing, I randomly assigned stations to each class, while making sure that the total number of stations in each class stayed the same and that each climate region had at least two stations of each class.,” I take that to mean you assigned stations from class A to the sample class A, and assigned stations from class B into sample class B. Correct?

Dave Andrews

May 12, 2011 2:00 pm

Gator,
Anecdotal, I know, but here in the UK the BBC weather reports regularly, day after day, predict temperatures for London that are 2C or more than temps of the rest of the South East.

Max Hugoson

May 12, 2011 2:38 pm

Anthony:
Sorry to beg for “direct talk”. But is your work leading to the following two results:
1. Isolating those stations, across the USA, which fall into the “Category 1”, or virtually ZERO measuring error.
2. Processing those data over 70 years, and seeing if there is any trend beyond statistical noise (STANDARD use of S.D. and Chi Squared, Student’s Tee, etc. to compare and evaluate data for “statistically significant” variations?)
That would be primo.
If the result is ZERO trend, or if the result is a TREND with no “bifurcation” (say, at about 1940 to 1950), this would be prima facie evidence of “NO EVIDENCE IN THE TEMPERATURE RECORD”.
I only think the “average temperature” can be compared on a PLACE TO PLACE comparison, not as an AGGREGATE. I.e., “averaging” the temperatures across the UNITED STATES is averaging an “intensive variable” and worthless.
Comparing the changes at EACH STATION over time, may be a legitimate use of the data.

Gator

May 12, 2011 2:39 pm

Hey Dave! Good science is full of anecdotal observations. I live in a very rural county and commute to a major city for work, 20 miles away. The urban temperatures run up to 10 degrees F more than my property. Not only do cities and infrastructure retain heat, but out here plants have a cooling effect, photosynthesis is an endothermic (cooling) reaction.
John Muir was another who liked to make crude, yet correct observations.

steven mosher

May 12, 2011 3:34 pm

Please talk to Dr. Spencer, when he has some time, about his limited study showing that small towns had larger UHI than the big ones.
####
you mean the one where he confirmed CRUTEM.
I’ve done a study of PRISTINE rural sites. That is rural sites with no built areas within 20KM. answer? the planet is warming. There was a LIA and it’s warmer now than it was then. I’ve tried to replicate Spencer’s study by going far back to 1900 and looked at sites with 0 population and watched what happened as they grew.
No measurable effect.
The planet is getting warmer. The real argument is WHY. the stupid arguments that its not getting warmer, are a waste of your brain cells.

Sam Hall

May 12, 2011 3:46 pm

Jim says:
May 12, 2011 at 1:34 pm
When you say “Given two classes of stations whose trends needed comparing, I randomly assigned stations to each class, while making sure that the total number of stations in each class stayed the same and that each climate region had at least two stations of each class.,” I take that to mean you assigned stations from class A to the sample class A, and assigned stations from class B into sample class B. Correct?

The way I understand it, he put all the stations in a “box” and blindly pulled them out and put them into a sample class at random.

Ken Lydell

May 12, 2011 4:23 pm

The UHI problem is a very difficult variable to control. The best research available suggests that the impact of land use changes on local temperature is best described by an asymptotic curve. Very small changes can have profound impacts. The work of Oke cited here: http://icecap.us/images/uploads/URBAN_HEAT_ISLAND.pdf appears to be sound.
Surface station sites have always been selected with accessibility in mind. Rural means not far from a road but reasonably far from other land use influences. However, not far from a road isn’t good enough. The road will have influenced the local microclimate. I think it can be fairly argued that all surface stations anywhere in the world are and always have been influenced by land use changes that made those sites accessible in the first place.
If Oke is correct and the UHI is 0.73 times the Log10 of population size, teasing the UHI effect out of historical data is a daunting task. For instance, Central Park in New York shows little or no temperature trend over the last few decades. New York City suffers about as much UHI as you can get and has done so for quite a long time. Additional land use changes since the complete urbanization of Manhattan have had little effect. Erecting a gas station a few hundred meters from a rural station would have a much greater effect on that station than shoe-horning a million more people in Manhattan would have on the Central Park thermometer.
While I greatly admire Anthony and respect the enormous effort he and his volunteers have made to characterize surface stations I am not sure it will make much difference in the end. Quantifying the impact of land use changes over time is, with the exception of the satellite era, practically impossible. If Oke is correct and a mere 10 people living in an area can increase local recorded temperatures by 0.73 degrees, surface station data is entirely too noisy to be useful.

Ken Lydell

May 12, 2011 4:33 pm

@Steven Mosher
Would you be kind enough to provide links to the papers you have had published on this subject? While I agree with you that things got a bit warmer for a time it seems to me that the degree of warming is debatable and its adverse consequences have thus far been negligible.

Jim

May 12, 2011 4:47 pm

@Sam Hall – It didn’t make sense to me to put them all in a box then arbitrarily put them into CRN1, CRN2, CRN3, CRN4, and CRN5 boxes. It seems one would want to keep 1s with 1s, 2s with 2s, etc.

Jim

May 12, 2011 4:49 pm

It seems there should be a special WUWT link for surface station posts. Easy to find and refer people to.

John N-G

May 12, 2011 9:42 pm

Sam Hall’s got it right. Suppose you’ve got 200 stations, 50 of which are Class A and 150 of which are Class B. The trend you get from the Class A stations is different from the trend you get from the Class B stations. You want to know whether if you just ignored the classes and randomly separated the 200 stations into a group of 50 and a group of 150, you might get a similar trend difference. If the trend differences you get randomly are (mostly) smaller than the trend difference you get by separating stations by class, then maybe the classes really do have something to do with the trend difference.

ZZZ

May 12, 2011 11:31 pm

If you used Monte-Carlo methods, you probably relied heavily on one or more pseudo-random number generators. Pseudo-random number generators allow the programmer to ask for a new “random” number as many times as desired in the computer program. Unfortunately, they have “pseudo” in their name (or they should because lots of people misleadingly leave out the pseudo, calling them random number generators) because they are not truly random: after a very, very large number of requests for a new pseudo-random value, they start to repeat the sequence of numbers handed out. At this point, of course, the numbers are not random-looking at all, being 100% correlated with the earlier values.
If the total number of requests made by the program is very small compared to the total number of requests you would need to make before the generator began to repeat, then the pseudo-random generator is probably handing out numbers that look acceptably random (until someone finds a new statistical test it flunks, at which point it’s back to the drawing board for those trying to create something acceptable to picky statisticians). It is, however, surprisingly easy to end up during a large Monte-Carlo analysis asking for so many pseudo-random numbers that the program uses up to 5%, 10% or more of the sequence of pseudo-random values available before repetition begins. When this happens, it becomes questionable whether the pseudo-random numbers in the program behave in an acceptably random way.
One statistician I worked with said that, when using Monte Carlo methods, he always seemed to end up spending most of his time studying the properties of the pseudo-random number generator instead of the actual problem under investigation.
