Fall et al. 2011: The Statistics
By John Nielsen-Gammon (from his blog Climate Abyss – be sure to bookmark it, highly recommended – Anthony)
As I mentioned in my last post, I did a lot of the statistical analysis in the recent paper reporting on the effect of station siting on surface temperature trends. For those who are curious or extremely bored, here’s how I did the testing:
I was invited to participate after the bulk of the analysis was completed. I decided to confirm the analysis by doing my own independent analysis. Mine showed some differences from the original, and we concluded that the technique I was using was better, so after some more testing we went ahead and used it in the paper.
Trend Generating
One subtle point: we didn’t assess the differences in individual station measurements. Because the accuracy of US climate trends was the original motivation, we assessed the differences in estimates of US trends using different subsets of the USHCN data.
There are two basic requirements for getting a robust trend estimate over a geographical area. First, you have to work with anomalies or changes over time (first differences) rather than the raw temperatures themselves. This is because individual temperatures are very location-specific, whereas anomalies are more uniform. If it was a cold year in Amarillo, it was probably a cold year in Lubbock too by about the same amount, even though the average temperatures might be 2-3 C different.
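To make that concrete, here is a minimal Python sketch (not the paper's actual code; the station numbers are invented) showing how anomalies strip out each station's own baseline so nearby stations become comparable:

```python
import numpy as np

def anomalies(temps):
    """Convert raw temperatures to anomalies by removing the station's
    own long-term mean (the paper used 30-year base-period averages)."""
    temps = np.asarray(temps, dtype=float)
    return temps - temps.mean()

# Two hypothetical stations whose baselines differ by ~2.5 C but whose
# year-to-year weather moves together:
amarillo = np.array([13.8, 14.2, 12.9, 14.4, 13.6])
lubbock = amarillo + 2.5
print(anomalies(amarillo))  # identical to...
print(anomalies(lubbock))   # ...this, despite the 2.5 C offset
```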
The second requirement is to take account of the uneven distribution of stations. For example, suppose you have climate stations in El Paso, Corpus Christi, and Dallas. An average of the anomalies at these three stations might be a good approximation to the statewide anomaly. But if another station gets added near El Paso, you wouldn’t want to do a straight four-station average because it would be too strongly influenced by weather goings-on near El Paso. A more reasonable approach might be to average the two El Paso stations together first. The more general principle is that a station should matter more in the overall average if it is far from other stations, and matter less if lots of other stations are nearby.
We chose to meet the first requirement by taking 30-year averages (we tested different periods and different ways of averaging and it didn’t matter much) and averaging stations within the nine climate regions (see Fig. 2 of the paper) before computing a US average. There are plenty of other approaches; for example, NCDC’s preliminary analysis of siting quality used a gridded analysis, but we checked and our numbers weren’t very different.
So, for example, the CRN 1&2 trend was obtained by computing the anomalies at each CRN 1&2 (well-sited) station, averaging the anomalies within each climate region, then averaging nationally (using the size of each region as a relative weight), and finally computing the ordinary least-squares trend of those US averages.
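In schematic Python, that pipeline looks something like the following (a sketch only; the data structures here are invented for illustration, and the actual code will be in the supplementary information):

```python
import numpy as np

def us_trend(region_anoms, region_areas, years):
    """Sketch of the trend pipeline described above: average station
    anomalies within each climate region, form an area-weighted national
    average, then fit an ordinary least-squares trend.

    region_anoms: dict region -> array (n_stations, n_years) of anomalies
    region_areas: dict region -> relative area weight
    years:        array of length n_years
    """
    regions = sorted(region_anoms)
    # Region averaging first, so clusters of nearby stations don't
    # dominate the national mean.
    regional = np.array([region_anoms[r].mean(axis=0) for r in regions])
    w = np.array([region_areas[r] for r in regions], dtype=float)
    national = w @ regional / w.sum()          # area-weighted US series
    slope, _ = np.polyfit(years, national, 1)  # deg C per year
    return slope
```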
Difference Testing: Monte Carlo
The next task was to determine whether trends from different groups of stations were significantly different from each other. The standard statistical tests for this compare the difference in slopes with the scatter of points about the trend lines. But this isn’t appropriate for our data because of a crucial problem: the scatter about the trend line is not uncorrelated noise. There’s a bit of autocorrelation, but more importantly, the scatter in one set of points is always going to be highly correlated with the scatter in another set of points. If a particular year was cold, it was cold no matter what quality class of station you use to measure it.
Whatever test we used had to reflect the correlation between different station classes as well as the autocorrelation within a station class. It also, ideally, would take into account that the distribution of stations among climate regions was uneven so some regions might only have two stations within a class, with each station therefore having a big influence on the overall trends.
No standard test can deal with all that, so I used a Monte Carlo approach. Ritzy name, simple concept. In fact, it’s so simple you don’t need to know statistics to understand it. Given two classes of stations whose trends needed comparing, I randomly assigned stations to each class, while making sure that the total number of stations in each class stayed the same and that each climate region had at least two stations of each class. I then computed and stored the difference in trends. I then repeated this process a total of 10,000 times.
The result is 10,000 trend differences obtained from random sets of stations. The conventional criterion for statistical significance is that there be a less than 5% chance that a trend difference so large could have come about randomly. So all you do is look at the random trend differences and see what percentage of them are larger than the one you computed using the real classification. Since you don’t know ahead of time which trend should be larger, you use the absolute value of the trend differences, or, equivalently, require that only 2.5% of the random trend differences be more positive (or more negative) than the observed trend difference.
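A bare-bones version of that resampling test looks like this in Python (a sketch under assumptions: the names `stations`, `labels`, and `trend_diff` are invented, and the constraint of at least two stations per class per region is omitted for brevity):

```python
import numpy as np

def mc_significance(stations, labels, trend_diff, n_iter=10_000, seed=0):
    """Monte Carlo test of a trend difference between two station classes.

    labels:     one class label per station (class sizes are preserved
                when shuffling, as in the paper)
    trend_diff: function(stations, labels) -> trend(class A) - trend(class B)
    Returns the fraction of random reshufflings whose |trend difference|
    is at least as large as the observed one (a two-sided p-value).
    """
    rng = np.random.default_rng(seed)
    observed = abs(trend_diff(stations, np.asarray(labels)))
    hits = 0
    for _ in range(n_iter):
        shuffled = rng.permutation(labels)
        # NOTE: the paper's version also requires every climate region
        # to keep at least two stations of each class; omitted here.
        if abs(trend_diff(stations, shuffled)) >= observed:
            hits += 1
    return hits / n_iter  # significant at 5% if this is below 0.05
```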
Difference Testing: Proxy Stations
One assumption of our Monte Carlo approach is that the station locations are random. Now, random does not mean evenly spaced. But, as a reviewer pointed out, the good stations were often concentrated on one side or another of a climate region, more so than would be expected by chance, and maybe some of the differences were due to the peculiar geographical arrangement of stations.
To test this possibility, I identified “proxy stations”. For each CRN 1&2 station, I found the nearest CRN 3 or CRN 4 station to serve as its proxy. I then compared the trends calculated using the real CRN 1&2 stations to the trends calculated using the proxy CRN 1&2 stations. The test is as follows: if the trend estimates from the proxy stations match those from the larger CRN 3&4 group, then the trend isn’t sensitive to that particular station distribution. If, instead, the trend estimates from the proxies match the trend estimates from the CRN 1&2 stations, then I can’t rule out the possibility that the CRN 1&2 trends are due to the station distribution rather than the siting.
Because of the small number of CRN 5 stations, I also created proxies for them and performed a similar test.
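The proxy selection itself is just a nearest-neighbor search. A minimal sketch (assuming plain (lat, lon) coordinate pairs and, for brevity, flat-earth distances rather than the great-circle distances a real analysis would use):

```python
import numpy as np

def nearest_proxies(good_coords, candidate_coords):
    """For each well-sited (CRN 1&2) station, return the index of the
    nearest CRN 3/4 candidate station to serve as its proxy."""
    good = np.asarray(good_coords, dtype=float)      # shape (n_good, 2)
    cand = np.asarray(candidate_coords, dtype=float)  # shape (n_cand, 2)
    # Pairwise squared distances, shape (n_good, n_cand)
    d2 = ((good[:, None, :] - cand[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)
```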
The proxy test didn’t affect our trend results much, but it mattered a lot for Section 4, where we tried to look at temperature differences directly. So I’m very grateful to the reviewer for insisting on more proof.
With the proxies, we were also able to do a neat little attribution analysis. Consider a little algebra:
CRN 1&2 – CRN 5 = (CRN 1&2 – CRN 1&2 Proxies) + (CRN 1&2 Proxies – CRN 5 Proxies) + (CRN 5 Proxies – CRN 5)
The temperature difference between the best and worst sited stations can be broken down into three terms: the first term shows how the best stations differ from their (typically sited) neighbors, the second term shows how the difference in station distribution contributes, and the third term shows how the worst stations differ from their neighbors.
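Since the middle terms telescope, the identity is exact whatever the series contain. A quick check in Python, with placeholder series standing in for the real data:

```python
import numpy as np

# Four invented placeholder time series, 30 values each.
rng = np.random.default_rng(1)
best, best_proxy, worst_proxy, worst = rng.normal(size=(4, 30))

lhs = best - worst
rhs = (best - best_proxy) + (best_proxy - worst_proxy) + (worst_proxy - worst)
assert np.allclose(lhs, rhs)  # the three terms always sum to the total
```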
By plotting these differences over time (Fig. 8 in the paper) we were able to show that most of the minimum temperature trend difference between best and worst comes from the third term, while most of the maximum temperature trend difference comes from the first term. There’s some info in there about the relative importance of different types of siting deficiencies on the maxes and mins, and we intend to explore this issue in more detail in a subsequent paper.
The same figure showed that the trend differences arise during the mid-to-late 1980s, when many stations underwent simultaneous instrumentation and siting changes.
The software I used for my analyses is going to be publicly posted by Anthony Watts once he gets all our supplementary information assembled. With a topic of such interest to the lay community, we thought it important to make it as easy as possible to duplicate (and go beyond) our results. I did my coding in Python, but it’s only the second Python package I’ve ever written. I hope critical software engineers overlook the many fortranisms that are undoubtedly embedded in the code.
===============================================================
Note: I hope to have the SI completed later today or tomorrow at the latest. A separate announcement will be made here and also on surfacestations.org – Anthony
UPDATE 5/13: The SI has been posted on the surfacestations.org main page; see the link there.

Ken.
“Would you be kind enough to provide links to the papers you have had published on this subject? While I agree with you that things got a bit warmer for a time it seems to me that the degree of warming is debatable and its adverse consequences have thus far been negligible.”
1. Better than a paper: all my code is public and the data is public. You are welcome to download it and check for yourself. Papers ADVERTISE science; they are not science.
2. Things have gotten warmer? Good.
The current estimate is 0.8 C since 1850. Of course that number is debatable.
B. Do you think the truth is less than 0.8 C?
C. If yes, then stake out a debate position: how much less?
D. Why do you think the truth is different from the estimate? What’s the source of the bias?
You wanna debate? Sure, I’ll give you one. Answer those questions and we will see if we disagree. Simple, no need to confuse the issue; we might agree. So, your answers to the questions. Keep it simple, like this:
B. Yes
C. 0.1 C less
D. UHI
If you answer those questions we can have a debate; who knows, you might convince me of your position. But first, what’s your position?
Thanks for the primer – very useful.
Ken
“If Oke is correct and the UHI is 0.73 times the Log10 of population size, teasing the UHI effect out of historical data is a daunting task.”
Well, as Oke went on to look at UHI more and more over the years, he rather moved away from the simplistic log of population. The original work on that (1973) was done with a fairly limited dataset. From 1973 on, Oke and others (Grimmond, who frequently co-authored with him) came to understand that UHI is vastly more complicated than a simple log model of population. So nobody who actually does detailed work in the field would think of characterizing it that way. Log of population is extremely crude because it ignores key physical/morphology features. It’s more about the physical changes to the surface than the number of people. Also, log of population doesn’t really control for density, and density drives building height, and building height drives key factors like sky view, radiative canyons, and boundary-layer disruption. So, for example, in the study of Portland, after canopy cover the number-one regressor for predicting UHI was building height.
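For a sense of scale, the crude model quoted above works out like this (a toy illustration only, using the 0.73 coefficient from the quote; as this comment argues, real UHI estimation needs far more than population):

```python
import math

# Toy evaluation of the quoted log-population scaling: UHI ~ 0.73 * log10(P).
# It ignores density, building height, sky-view factor, and the other
# morphology controls discussed above -- which is exactly the objection.
for pop in (10_000, 100_000, 1_000_000):
    print(f"population {pop:>9,}: ~{0.73 * math.log10(pop):.1f} C")
```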
Start here
http://www.kcl.ac.uk/ip/suegrimmond/news.htm
Or for a very good overview read this
http://www.kcl.ac.uk/ip/suegrimmond/publishedpapers/GJ_Grimmond2007.pdf
I would spend some time looking at urban surface energy balance studies as well.
That’s enough reading for now.
PhilJourdan says: Not boring at all! And very well written for the layman. I have some experience in Statistics (being an Economist), but nothing compared to the experts. Yet I found your piece to be easy to read and understand! Very well done!
Since when do “statistics” and “economist” use an initial capital letter?
“If it was a cold year in Amarillo, it was probably a cold year in Lubbock too by about the same amount, even though the average temperatures might be 2-3 C different.”
I have a slight concern about this statement – especially the word “probably”. I wonder if local climatic effects can’t override trends, thus making this assumption dubious. For example, despite the world having just experienced a deep La Nina and world average temperatures well below normal, the west coast of Australia has been extremely hot. They (we) have had the hottest summer in their short record.
My point being that if temperature trends can be regionally specific, I wonder if by averaging neighbouring regions you can lose some of this detail.
Agnostic
I agree. The UK has had the ‘hottest April ever’ whilst, judging by this blog, other places are still in winter. Microclimates play a huge part in temperature differentials, and a large part of that is caused by wind direction and formation of cloud.
tonyb
Agnostic said My point being, that if temperature trends can be regionally specific, I wonder if by averaging neighbouring regions you can lose some of this detail.
Yes, of course. Averaging anything loses detail.
One interesting thing. I note that everyone chimes in with their favorite anecdote.
It’s cold in XYZ! Yes, it is.
If you looked at the US population in 1900 and looked at it today, you’d all agree that it went up. If you looked at the states individually, they too would have gone up.
Now, somebody out there in some podunk town says: hey, our population is flat! Or ours has gone down.
We’d look at these people as silly. So when the average for the globe goes up and somebody yells “but my town is cold,” I think about those guys who think the US population hasn’t gone up because their little town has stayed the same or gotten smaller. This is one of the funny blind spots people have in their thinking.
Mosh at 2.01
I agree with your general sentiments – many people think that the temperature of ‘their’ town in ‘their’ lifetime is emblematic of the whole world. However, in our desire to ‘average’ everything, one important message is being lost: a substantial number of places in the world are flat or cooling, and the term ‘global warming’ is therefore a misnomer.
This is not to ‘deny’ that a substantial proportion of the world is warming, but it would be more helpful if we acknowledged that various counter-trends can be identified that are all happening at the same time. In this respect it is not helpful for the IPCC to say that only South Greenland and a few places in the tropics are cooling, as that isn’t correct.
tonyb
Excellent article and one that did not reduce me to feeling statistically challenged and vaguely stupid.
My question is… if there is good evidence that North America, the geographic part of Europe that includes the UK, and other large geographic regions have differing climate trends which are not mirror images of each other, are some people making a fundamental error in making global extrapolations from any temperature series, including those used for the Surfacestations Project?
There is a very good posting on Jo Nova’s excellent blog about the Maunder Minimum and the terrible experiences it induced in Ireland – well worth a read, particularly for those who do not fear a cooler climate.
@John N-G –
Thanks John, now I see what you are doing.
“The more general principle is that a station should matter more in the overall average if it is far from other stations, and matter less if lots of other stations are nearby.”
I would suggest stern caution with this line of reasoning. Properties of the spatial gradient vary dramatically over space.
Congrats, Anthony.
What I’d like to see is some further analysis. I know that there are few truly rural stations in the database. However, putting any trust in “airport” temp stations, no matter how they’re classified, is way too much of a stretch for me. A subdivision & analysis of “good” stations into categories like airports, urban, suburban & rural would certainly be very interesting.
Alexander K
See my post directly above yours. Some places are cooling, some are static, some are warming. The average of the warming signal is greater than the average of the cooling signal, therefore there is said to be ‘global’ warming, which simply isn’t correct. A proportion of those that have ‘warmed’ have done so because they are measuring a different microclimate to the one they started off with (perhaps in a field on the edge of a small town, then moved for convenience’s sake to, say, an airfield – or perhaps the field became an airfield) and/or there is UHI as the site becomes urbanised.
Undoubtedly the ‘average’ of the world has warmed a little since the LIA but that disguises numerous counter trends.
tonyb
JNG, any thought as to who it was that reviewed your paper?
@Tonyb 7.49
So the term “global population increase” is not appropriate either, because there are some places which are not increasing? If the earth, on average, is getting warmer (or cooler) then we have global warming (or cooling).
gopher
I don’t think your analogy is appropriate. The term ‘global warming’ is used as a deliberate metaphor to suggest that the entire globe is warming. The IPCC support that view with their incorrect statement that dismisses the areas of cooling as trivial.
Ask any politician or policy maker that believes in AGW and they will trot out this belief that global means global.
The climate is much more nuanced than is being claimed. I don’t think the idea of a ‘global’ temperature is that useful, especially when it is based on so many variables and inconsistencies and on the choice of a particular time scale.
tonyb
John,
When you guys grouped stations into the quality categories and compared time trends, did you make the assumption that each station had always been in its given quality category?
The reason I ask is this: What if a station was at one time a CRN1 and is now a CRN4 or 5. Couldn’t that mean that the observed trends from that station are more indicative of a change in station siting quality than an actual temperature trend?
Maybe I’m just confused about the point of the study.
It’s certainly encouraging that the Menne et al. paper (using a subset of the surface station data) and the Fall et al. paper in press (using the full data set) generally agree with each other and with other published climate data. This is a good confirmation of the peer-reviewed literature.
@tonyb
“The term ‘global warming’ is used as a deliberate metaphor to suggest that the entire globe is warming.”
Even in the FAQ of the IPCC Assessment Report 4 it says, “Expressed as a global average, surface temperatures have increased by about 0.74°C over the past hundred years (between 1906 and 2005).”
If you look up “global warming” on Wikipedia, the first line is, “Global warming is the increase in the average temperature of Earth’s ….”
I can accept that perhaps there are problems with _how_ the data is combined to form an average… but not that we shouldn’t talk in averages and trends, with appropriate statistical uncertainties included of course.
steven mosher says:
May 12, 2011 at 3:34 pm
“I’ve done a study of PRISTINE rural sites. That is rural sites with no built areas within 20KM. answer? the planet is warming.”
That’s quite a feat, since stations meeting such a stringent criterion with intact records long enough (>120 yrs) to provide a credible indication of SECULAR trend are virtually nonexistent outside the remote outposts of some advanced nations. Exactly where are such pristine stations in Canada, Brazil, Africa, Spain, France, Poland, Ukraine, European Russia, the Arabian Peninsula, the Indian subcontinent, China and Mongolia? And how do you know that offsets due to station moves and instrument changes, along with land-use changes (deforestation, cultivation), haven’t introduced a spurious trend into the historical record? Inquiring minds want to know!
As Sky responded to Steve M., considering the “siting induced error” found by SurfaceStations.org, when one goes back to the temperatures – say, ANY AND ALL taken before 1945 – consider the errors caused by:
1. Reading the instruments (by eye).
2. Time of DAY for the reading.
3. Any other errors creeping in through “hand recording”. (I’m obliquely referencing the DEW Line winter measurements, where they “dry lab” obtained the results.)
I tend to believe the ERROR BARS are within that 0.8 C number.
Now as to the general “atmospheric energy balance” going up: yes, the retreat of the glaciers is evidence of this. However, as has been pointed out MANY times on this blog and others, that retreat began in the late 1800s and continues to this day. Because CO2 did not go up markedly until AFTER WWII, the cause-and-effect argument that CO2 increases are the cause of the ENERGY BALANCE shift is, on a logical basis, lacking.
It’s clear that the microclimate of most surface stations has changed in the last 100 years. Let’s see:
1. Average station temperature is on the order of 0.6-1.0 C warmer than 100 years ago.
2. Going from a rural location to an urban location increases temperature by 2-5 C or more. More than half of all stations currently in use have had major changes in microclimate in the last century. This can be demonstrated by change in minimum vs. change in maximum temperatures, date of peak vs. trough in temperature, and similar techniques using statistical analysis of known changes in urbanization.
3. (0.6 to 1.0) minus (2 to 5) gives a remainder of -1.0 to -4.4 C over the last century, which is clearly significant cooling, for any stations that have gone from rural to urban in the last century.
4. Therefore, any claim to global warming, based on the surface stations alone, is swamped by UHI effects. Other analysis, such as which crops and trees grow where, would be more useful, as would SST temperatures, which would potentially have much less variation due to human effects, as long as the technique for sampling remained the same.
Max Hugoson
According to the extensive studies by Hubert Lamb – first Director of CRU – the glaciers were melting by 1750.
tonyb