Guest Post by Willis Eschenbach
OK, quick gambler’s question. Suppose I flip seven coins in the air at once and they all seven come up heads. Are the coins loaded?
Near as I can tell, statistics was invented by gamblers to answer this type of question. The seven coins are independent events. If they are not loaded, the chance of heads on any one coin is fifty percent. The odds of seven heads is the product of the individual odds, or one-half to the seventh power. That's 1/128, less than one chance in a hundred that this is just a random result. Possible, but not very likely. As a man who is not averse to a wager, I'd say it's a pretty good bet the coins were loaded.
However, suppose we take the same seven coins, and we flip all seven of them not once, but ten times. Now what are our odds that seven heads show up in one of those ten flips?
Well, without running any numbers we can immediately see that the more seven-coin-flip trials we have, the better the chances are that seven heads will show up. I append the calculations below, but for the present just note that if we do the seven-coin-flip as few as ten times, the odds of finding seven heads by pure chance go up from less than 1% (a statistically significant result at the 99% significance level) to 7.5% (not statistically unusual in the slightest).
So in short, the more places you look, the more likely you are to find rarities, and thus the less significant they become. The practical effect is that you need to adjust your significance level for the number of trials. If your single-trial significance level is 95%, as is common in climate science, and you are looking at five trials, then to have a demonstrably unusual result you need to find something significant at the 99% level. Here's a quick table relating the number of trials to the required significance level, if you are looking for the equivalent of a single-trial significance level of 95%:
Trials, Required Significance Level
1, 95.0%
2, 97.5%
3, 98.3%
4, 98.7%
5, 99.0%
6, 99.1%
7, 99.3%
8, 99.4%
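The table above can be reproduced in a few lines. Here's a minimal Python sketch, assuming independent trials: for the chance of at least one spurious "hit" across N trials to stay at 5%, each individual trial must clear a level of 0.95 raised to the power 1/N.

```python
# Required per-trial significance level so that the overall
# (family-wide) significance stays at 95% across N independent trials.
def required_level(n_trials, overall=0.95):
    return overall ** (1.0 / n_trials)

for n in range(1, 9):
    print(f"{n} trials: {required_level(n):.1%}")
```
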
Now, with that as prologue, following my interest in things albedic I went to examine the following study, entitled Spring–summer albedo variations of Antarctic sea ice from 1982 to 2009:
ABSTRACT: This study examined the spring–summer (November, December, January and February) albedo averages and trends using a dataset consisting of 28 years of homogenized satellite data for the entire Antarctic sea ice region and for five longitudinal sectors around Antarctica: the Weddell Sea (WS), the Indian Ocean sector (IO), the Pacific Ocean sector (PO), the Ross Sea (RS) and the Bellingshausen– Amundsen Sea (BS).
Remember, the more places you look, the more likely you are to find rarities … so how many places are they looking?
Well, to start with, they’ve obviously split the dataset into five parts. So that’s five places they’re looking. Already, to claim 95% significance we need to find 99% significance.
However, they are also only looking at a part of the year. How much of the year? Well, most of the ice is north of 70°S, so it will get measurable sun eight months or so out of the year. That means they’re using half the yearly albedo data. The four months they picked are the four when the sun is highest, so it makes sense … but still, they are discarding data, and that affects the number of trials.
In any case, even if we completely set aside the question of how much the year has been subdivided, we know that the map itself is subdivided into five parts. That means that to be significant at 95%, you need to find one of them that is significant at 99%.
However, in fact they did find that the albedo in one of the five ice areas (the Pacific Ocean sector) has a trend that is significant at the 99% level, and another (the Bellingshausen-Amundsen sector) is significant at the 95% level. And these would be interesting and valuable findings … except for another problem. This is the issue of autocorrelation.
“Autocorrelation” is how similar the present is to the past. If the temperature can be -40°C one day and 30°C the next day, that would indicate very little autocorrelation. But if (as is usually the case) a -40°C day is likely to be followed by another very cold day, that would mean a lot of autocorrelation. And climate variables in general tend to be autocorrelated, often highly so.
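To make that concrete, here's a minimal numpy sketch (the series and seed are my own illustration, not data from the paper). It computes the lag-1 autocorrelation, which is the correlation between each value and the one just before it, for independent white noise and for a persistent random-walk series.

```python
import numpy as np

def lag1_autocorrelation(x):
    """Correlation of the series with itself shifted by one step."""
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[:-1], x[1:])[0, 1]

rng = np.random.default_rng(42)
white = rng.normal(size=1000)        # independent noise: r near zero
persistent = np.cumsum(white) * 0.1  # random walk: each value close to the last

print(lag1_autocorrelation(white))       # near 0: little autocorrelation
print(lag1_autocorrelation(persistent))  # near 1: heavy autocorrelation
```
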
Now, one oddity of autocorrelated datasets is that they tend to be “trendy”. You are more likely to find a trend in an autocorrelated dataset than in a perfectly random one. In fact, there was an article in the journals not long ago entitled Nature’s Style: Naturally Trendy . (I said “not long ago”, but when I looked it was 2005 … tempus fugit indeed.) It seems many people understood that concept of natural trendiness; the paper was widely discussed at the time.
What seems to have been less well understood is the following corollary:
Since nature is naturally trendy, finding a trend in observational datasets is less significant than it seems.
In this case, I digitized the trends. Their two “significant” trends, the Bellingshausen–Amundsen Sea (BS) at 95% and the Pacific Ocean sector (PO) at 99%, were as advertised and matched my calculations. Unfortunately, I also found that, as I suspected, they had indeed ignored autocorrelation.
Part of the reason autocorrelation is so important in this particular case is that we’re starting with only 27 annual data points. As a result, we begin with large uncertainties due to small sample size, and the effect of autocorrelation is to shrink that already inadequate sample size further, so the effective N is quite small. The effective N for the Bellingshausen–Amundsen Sea sector (BS) is 19, and the effective N for the Pacific Ocean sector (PO) is only 8. Once autocorrelation is taken into account, neither trend is statistically significant, as both fall to around the 90% significance level.
Combining the effect of autocorrelation with the effect of repeated trials means that, in fact, not one of their reported trends in “spring-summer albedo variations” is statistically significant, or even close to being significant.
Conclusions? Well, I’d have to say that in climate science we’ve got to up our statistical game. I’m no expert statistician, far from it. For that you want someone like Matt Briggs, Statistician to the Stars. In fact, I’ve never taken even one statistics class ever. I’m totally self-taught.
So if I know a bit about the effects of subdividing a dataset on significance levels, and the effects of autocorrelation on trends, how come these guys don’t? To be clear, I don’t think they’re doing it on purpose. I think this was just an honest mistake on their part; they simply didn’t realize the effect of their choices. But dang, seeing climate scientists make these same two mistakes over and over and over is getting boring.
To close on a much more positive note, I read that Science magazine is setting up a panel of statisticians to read the submissions in order to “help avoid honest mistakes and raise the standards for data analysis”.
Can’t say fairer than that.
In any case, the sun has just come out after a foggy, overcast morning. Here’s what my front yard looks like today …
The redwood tree is native here, the nopal cactus not so much … I wish just such sunny skies for you all.
Except those needing rain, of course …
AS ALWAYS: If you disagree with something I or someone else said, please quote their exact words that you disagree with. That way we can all understand the exact nature of what you find objectionable.
REPEATED TRIALS: The actual calculation of how much better the odds are with repeated trials takes advantage of the fact that if the odds of something happening are X, say 1/128 in the case of flipping seven heads, the odds of it NOT happening are 1 − X, which is 1 − 1/128, or 127/128. It turns out that the odds of it NOT happening in N trials is

(1 − X)^N

or (127/128)^N. For N = 10 flips of seven coins, this gives the odds of NOT getting seven heads as (127/128)^10, or 92.5%. This means that the odds of finding seven heads in ten flips is one minus the odds of it not happening, or about 7.5%.
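The same arithmetic in a few lines of Python, as a sanity check:

```python
# Chance of seven heads in one flip of seven fair coins,
# and in at least one of ten such flips.
p_seven_heads = 0.5 ** 7                  # 1/128, about 0.78%
p_not_in_ten = (1 - p_seven_heads) ** 10  # (127/128)^10, about 92.5%
p_in_ten = 1 - p_not_in_ten               # about 7.5%

print(f"one flip: {p_seven_heads:.2%}, ten flips: {p_in_ten:.1%}")
```
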
Similarly, if we are looking for the equivalent of a 95% confidence level in repeated trials, the required confidence level in each of N repeated trials is

0.95^(1/N)
AUTOCORRELATION AND TRENDS: I usually use the method of Nychka, which utilizes an “effective N”, a reduced number of degrees of freedom for calculating statistical significance:

neff = n × (1 − r) / (1 + r)

where n is the number of data points, r is the lag-1 autocorrelation, and neff is the effective n.
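A one-line version of that effective-N calculation, using the standard form neff = n(1 − r)/(1 + r). The r values below are hypothetical round numbers of my own, chosen only to show the scale of the shrinkage, not taken from the paper:

```python
# Nychka-style effective sample size: even moderate lag-1
# autocorrelation r cuts the usable number of data points sharply.
def effective_n(n, r):
    return n * (1.0 - r) / (1.0 + r)

print(effective_n(27, 0.17))  # roughly 19
print(effective_n(27, 0.54))  # roughly 8
```
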
However, if it were mission-critical, rather than using Nychka’s heuristic method I’d likely use a Monte Carlo method. I’d generate, say, 100,000 instances of ARMA (auto-regressive moving-average) pseudo-data whose statistics matched those of the actual data, and I’d investigate the distribution of trends across those instances.
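Here is a sketch of that Monte Carlo idea, simplified to an AR(1) process (a special case of ARMA) with made-up parameters rather than ones fitted to the actual albedo data. It asks: what range of trends does pure autocorrelated noise produce by itself?

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, n_sims = 27, 0.5, 100_000  # 27 annual points, hypothetical lag-1 r

# Generate all AR(1) pseudo-data series at once, stepping through time:
# each value is r times the previous value plus fresh noise.
noise = rng.normal(size=(n_sims, n))
x = np.empty((n_sims, n))
x[:, 0] = noise[:, 0]
for i in range(1, n):
    x[:, i] = r * x[:, i - 1] + noise[:, i]

# OLS slope of each series against time, computed in vectorized form
t = np.arange(n)
t_dev = t - t.mean()
slopes = (x - x.mean(axis=1, keepdims=True)) @ t_dev / (t_dev ** 2).sum()

# 95% of trends from pure AR(1) noise fall inside this interval; an
# observed trend is only "significant" if it falls outside it.
lo, hi = np.percentile(slopes, [2.5, 97.5])
print(f"95% of random AR(1) trends lie between {lo:.3f} and {hi:.3f}")
```

Note how wide the interval is with only 27 autocorrelated points; that is the Monte Carlo version of the shrunken effective N.
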