Repeated Trials, Autocorrelation, and Albedo

Guest Post by Willis Eschenbach

OK, quick gambler’s question. Suppose I flip seven coins in the air at once and they all seven come up heads. Are the coins loaded?

Near as I can tell, statistics was invented by gamblers to answer this type of question. The seven coins are independent events. If they are not loaded, the chance of heads on each coin is fifty percent. The odds of seven heads are the product of the individual odds, or one-half to the seventh power. This is 1/128, less than 1%, less than one chance in a hundred that this is just a random result. Possible but not very likely. As a man who is not averse to a wager, I’d say it’s a pretty good bet the coins were loaded.

However, suppose we take the same seven coins, and we flip all seven of them not once, but ten times. Now what are our odds that seven heads show up in one of those ten flips?

Well, without running any numbers we can immediately see that the more seven-coin-flip trials we have, the better the chances are that seven heads will show up. I append the calculations below, but for the present just note that if we do the seven-coin-flip as few as ten times, the odds of finding seven heads by pure chance go up from less than 1% (a statistically significant result at the 99% significance level) to 7.5% (not statistically unusual in the slightest).

So in short, the more places you look, the more likely you are to find rarities, and thus the less significant they become. The practical effect of this is that you need to adjust your significance level for the number of trials. If the significance level is 95%, as is common in climate science, then if you look at 5 trials, to have a demonstrably unusual result you need to find something significant at the 99% level. Here’s a quick table that relates number of trials to significance level, if you are looking for the equivalent of a single-trial significance level of 95%:

Trials, Required Significance Level

1, 95.0%

2, 97.5%

3, 98.3%

4, 98.7%

5, 99.0%

6, 99.1%

7, 99.3%

8, 99.4%

Now, with that as prologue, following my interest in things albedic I went to examine the following study entitled Spring–summer albedo variations of Antarctic sea ice from 1982 to 2009:

ABSTRACT: This study examined the spring–summer (November, December, January and February) albedo averages and trends using a dataset consisting of 28 years of homogenized satellite data for the entire Antarctic sea ice region and for five longitudinal sectors around Antarctica: the Weddell Sea (WS), the Indian Ocean sector (IO), the Pacific Ocean sector (PO), the Ross Sea (RS) and the Bellingshausen–Amundsen Sea (BS).

Remember, the more places you look, the more likely you are to find rarities … so how many places are they looking?

Well, to start with, they’ve obviously split the dataset into five parts. So that’s five places they’re looking. Already, to claim 95% significance we need to find 99% significance.

However, they are also only looking at a part of the year. How much of the year? Well, most of the ice is north of 70°S, so it will get measurable sun eight months or so out of the year. That means they’re using half the yearly albedo data. The four months they picked are the four when the sun is highest, so it makes sense … but still, they are discarding data, and that affects the number of trials.

In any case, even if we completely set aside the question of how much the year has been subdivided, we know that the map itself is subdivided into five parts. That means that to be significant at 95%, you need to find one of them that is significant at 99%.

However, in fact they did find that the albedo in one of the five ice areas (the Pacific Ocean sector) has a trend that is significant at the 99% level, and another (the Bellingshausen-Amundsen sector) is significant at the 95% level. And these would be interesting and valuable findings … except for another problem. This is the issue of autocorrelation.

“Autocorrelation” is how similar the present is to the past. If the temperature can be -40°C one day and 30°C the next day, that would indicate very little autocorrelation. But if (as is usually the case) a -40°C day is likely to be followed by another very cold day, that would mean a lot of autocorrelation. And climate variables in general tend to be autocorrelated, often highly so.
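To make the idea concrete, here is a minimal sketch of a lag-1 autocorrelation calculation. The two series are made-up illustrations, not albedo or temperature data, and the function name `lag1_autocorrelation` and the 0.9 persistence factor are assumptions chosen for the example.

```python
import random

def lag1_autocorrelation(series):
    """Correlation between each value and the value one step before it."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    numerator = sum(dev[i] * dev[i + 1] for i in range(n - 1))
    denominator = sum(d * d for d in dev)
    return numerator / denominator

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical "jumpy" series: every day is independent of the day before.
jumpy = [rng.gauss(0.0, 1.0) for _ in range(2000)]

# Hypothetical "persistent" series: each day keeps 90% of yesterday's anomaly.
persistent = [0.0]
for _ in range(1999):
    persistent.append(0.9 * persistent[-1] + rng.gauss(0.0, 1.0))

print(f"jumpy:      r = {lag1_autocorrelation(jumpy):+.2f}")
print(f"persistent: r = {lag1_autocorrelation(persistent):+.2f}")
```

The jumpy series should come out near zero and the persistent one near the 0.9 that was built into it.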

Now, one oddity of autocorrelated datasets is that they tend to be “trendy”. You are more likely to find a trend in autocorrelated datasets than in perfectly random datasets. In fact there was an article in the journals not long ago entitled Nature’s Style: Naturally Trendy. (I said “not long ago” but when I looked it was 2005 … carpe diem indeed.) It seems many people understood that concept of natural trendiness; the paper was widely discussed at the time.

What seems to have been less well understood is the following corollary:

Since nature is naturally trendy, finding a trend in observational datasets is less significant than it seems.

In this case, I digitized the trends. While I found their two “significant” trends in the Bellingshausen–Amundsen Sea (BS) at 95% and the Pacific Ocean sector (PO) at 99% were as advertised and they matched my calculations, unfortunately I also found that as I suspected, they had indeed ignored autocorrelation.

Part of the reason that the autocorrelation is so important in this particular case is that we’re only starting with 27 annual data points. As a result, we’re starting with large uncertainties due to small sample size. The effect of autocorrelation is to reduce that already inadequate sample size, so the effective N is quite small. The effective N for the Bellingshausen–Amundsen Sea sector (BS) is 19, and the effective N for the Pacific Ocean sector (PO) is only 8. Once autocorrelation is taken into account both of the trends were not statistically significant at all, as both were down around the 90% significance level.

Adding in the effects of autocorrelation with the effect of repeated trials means that in fact, not one of their reported trends in “spring-summer albedo variations” is statistically significant, nor even near to being significant.

Conclusions? Well, I’d have to say that in climate science we’ve got to up our statistical game. I’m no expert statistician, far from it. For that you want someone like Matt Briggs, Statistician to the Stars. In fact, I’ve never taken even one statistics class ever. I’m totally self-taught.

So if I know a bit about the effects of subdividing a dataset on significance levels, and the effects of autocorrelation on trends, how come these guys don’t? To be clear, I don’t think they’re doing it on purpose. I think that this was just an honest mistake on their part; they simply didn’t realize the effect of their actions. But dang, seeing climate scientists making these same two mistakes over and over and over is getting boring.

To close on a much more positive note, I read that Science magazine is setting up a panel of statisticians to read the submissions in order to “help avoid honest mistakes and raise the standards for data analysis”.

Can’t say fairer than that.

In any case, the sun has just come out after a foggy, overcast morning. Here’s what my front yard looks like today …

The redwood tree is native here, the nopal cactus not so much … I wish just such sunny skies for you all.

Except those needing rain, of course …

AS ALWAYS: If you disagree with something I or someone else said, please quote their exact words that you disagree with. That way we can all understand the exact nature of what you find objectionable.

REPEATED TRIALS: The actual calculation of how much better the odds are with repeated trials is done by taking advantage of the fact that if the odds of something happening are X, say 1/128 in the case of flipping seven heads, the odds of it NOT happening are 1-X, which is 1 – 1/128, or 127/128. It turns out that the odds of it NOT happening in N trials is

(1-X)^N

or (127/128)^N. For N = 10 flips of seven coins, this gives the odds of NOT getting seven heads as (127/128)¹⁰, or 92.5%. This means that the odds of finding seven heads in ten flips is one minus the odds of it not happening, or about 7.5%.
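The arithmetic above is easy to check in a few lines. This is only a sketch of the 1 - (1-X)^N calculation; the function name `prob_at_least_once` is invented for the example.

```python
def prob_at_least_once(p_single, n_trials):
    """Chance that an event with per-trial probability p_single
    shows up at least once in n_trials independent trials."""
    p_never = (1.0 - p_single) ** n_trials  # it fails to happen every time
    return 1.0 - p_never

p_seven_heads = 0.5 ** 7  # seven fair coins all heads: 1/128

print(f"one flip of seven coins:  {prob_at_least_once(p_seven_heads, 1):.1%}")
print(f"ten flips of seven coins: {prob_at_least_once(p_seven_heads, 10):.1%}")
```

With ten trials the result comes out at about 7.5%, matching the figure quoted in the head post.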

Similarly, if we are looking for the equivalent of a 95% confidence in repeated trials, the required confidence level in N repeated trials is

0.95^(1/N)
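As a check on the table in the head post, the N-th root of 0.95 can be tabulated directly. This is a sketch only; `required_confidence` is a name made up for the example, and the 0.95 default is simply the single-trial level used in the post.

```python
def required_confidence(n_trials, single_trial_confidence=0.95):
    """Per-trial confidence needed so that n_trials independent looks
    together are equivalent to one look at single_trial_confidence."""
    return single_trial_confidence ** (1.0 / n_trials)

print("Trials, Required Significance Level")
for n in range(1, 9):
    print(f"{n}, {required_confidence(n):.1%}")
```

For five trials this gives 99.0%, which is where the "five sectors means you need 99%" rule of thumb comes from.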

AUTOCORRELATION AND TRENDS: I usually use the method of Nychka, which utilizes an “effective N”, a reduced number of degrees of freedom for calculating statistical significance.

[Equation: Nychka’s formula for n_eff in terms of n and r]

where n is the number of data points, r is the lag-1 autocorrelation, and n_eff is the effective n.
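The equation itself did not survive in this copy of the post. As a sketch only, the code below assumes the form in which the Nychka adjustment is commonly quoted, n_eff = n (1 - r - 0.68/√n) / (1 + r + 0.68/√n). That form, the function name `effective_n`, and the sample values of r are all assumptions for illustration, not figures taken from the albedo paper.

```python
import math

def effective_n(n, r):
    """Effective sample size under the assumed Nychka-style adjustment.

    n is the number of data points and r is the lag-1 autocorrelation.
    The 0.68/sqrt(n) term is the small-sample correction in the
    commonly quoted form of the formula (an assumption here).
    """
    adjustment = 0.68 / math.sqrt(n)
    return n * (1.0 - r - adjustment) / (1.0 + r + adjustment)

# Hypothetical autocorrelation values, 27 annual points as in the post.
for r in (0.0, 0.2, 0.4):
    print(f"n = 27, r = {r:.1f}  ->  n_eff = {effective_n(27, r):.1f}")
```

The point of the sketch is the direction of the effect: the larger the lag-1 autocorrelation, the fewer effectively independent data points remain.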

However, if it were mission-critical, rather than using Nychka’s heuristic method I’d likely use a Monte Carlo method. I’d generate say 100,000 instances of ARMA model (auto-regressive moving-average model) pseudo-data which matched well with the statistics of the actual data, and I’d investigate the distribution of trends in that dataset.
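A rough sketch of that Monte Carlo idea follows. To keep it short it swaps in a plain AR(1) process for the full ARMA model and runs 2,000 instances rather than 100,000. The function names, the noise level `sigma`, and the example slope and autocorrelation values are all hypothetical, chosen only to show the mechanics.

```python
import random

def ols_slope(y):
    """Ordinary least-squares trend of y against time steps 0, 1, 2, ..."""
    n = len(y)
    x_mean = (n - 1) / 2.0
    y_mean = sum(y) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(y))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den

def ar1_series(n, phi, sigma, rng, burn_in=50):
    """Trendless AR(1) pseudo-data: x[t] = phi * x[t-1] + noise.
    sigma is the standard deviation of the noise term."""
    x, out = 0.0, []
    for t in range(n + burn_in):
        x = phi * x + rng.gauss(0.0, sigma)
        if t >= burn_in:  # discard the start-up transient
            out.append(x)
    return out

def trend_p_value(observed_slope, n, phi, sigma, n_sims=2000, seed=0):
    """Fraction of trendless AR(1) series whose fitted trend is at least
    as steep (either sign) as the observed one."""
    rng = random.Random(seed)
    hits = sum(
        abs(ols_slope(ar1_series(n, phi, sigma, rng))) >= abs(observed_slope)
        for _ in range(n_sims)
    )
    return hits / n_sims

# Same hypothetical trend, judged against uncorrelated and autocorrelated noise.
p_white = trend_p_value(0.05, n=27, phi=0.0, sigma=1.0)
p_red = trend_p_value(0.05, n=27, phi=0.6, sigma=1.0)
print(f"p with no autocorrelation: {p_white:.3f}")
print(f"p with phi = 0.6:          {p_red:.3f}")
```

The same slope that looks unusual against independent noise turns up far more often by chance once the pseudo-data is autocorrelated, which is the "naturally trendy" effect in miniature.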

[UPDATE] I found a better way to calculate effective N. See below.

A Way To Calculate Effective N

daveandrews723

June 27, 2015 3:40 pm

The way NOAA has adjusted the temperatures of the 20th century, based on what valid criteria I have no idea, how can anyone have any degree of confidence in what they put forward? And when you go back to the 19th century how does anyone purport to have an accurate record of global temperatures? What percentage of the world were temperatures even recorded in back then?

Santa Baby

Reply to daveandrews723

June 27, 2015 10:11 pm

The chance of getting heads or tails is a bit less than 50%. Because there is a small chance for the coin to end up on the edge?

The Ghost Of Big Jim Cooley

Reply to Santa Baby

June 28, 2015 12:45 am

And it can happen. We have a small key that opens our letterbox. We are refurbishing our hallway, and currently have a bare concrete floor covered with a membrane. I threw the key onto the floor and didn’t hear the familiar sound of it bouncing. I looked back and it was on its edge! It was so astounding that I quickly called my wife to witness it. What’s even weirder is that the key is very thin, so its edge is just about 1 millimetre. This was the third thing (of a sort) to happen in a few weeks. On two occasions, separated by about four weeks, I threw a dishwasher tablet into the dishwasher and it landed on its edge. The second time it landed on its top edge (even less likely). At least the tablet has a wide edge, so the odds aren’t bad, but a key landing on its edge when it is so thin? Surely the odds are extraordinary?

MCourtney

Reply to Santa Baby

June 28, 2015 1:00 am

Perhaps you have a natural magnet under your property?
Or maybe stuff just happens sometimes.

The Ghost Of Big Jim Cooley

Reply to Santa Baby

June 28, 2015 1:08 am

Yep, stuff happens sometimes, that’s all it is!

Craig

Reply to Santa Baby

June 28, 2015 1:50 am

Check out the classic Twilight Zone episode “A Penny for Your Thoughts” to see what happens if you manage to get a coin to land on its edge.

george e. smith

Reply to Santa Baby

July 1, 2015 3:04 am

Who said it is a small chance ?
Sometimes the coin toss before a game or such is done on a mat where the coin won’t bounce.
So what if you do the coin toss on a flat patch of beach sand, where an edge on coin can dig into the sand and stay there.
So now what is the probability of it being less than 45 degree tilt from perfectly edge on ??

Bill 2

June 27, 2015 3:56 pm

“the odds of finding seven heads by pure chance go up from less than 1% (a statistically significant result at the 99% significance level) to 7.5% (not statistically unusual in the slightest)”
What would you consider “statistically unusual in the slightest”? 5.0%? Would 5.00001% then not be “statistically unusual in the slightest”? The difference between the two is insignificant in itself. Certainly 7.5% is statistically unusual in some sense, just not as statistically unusual as the arbitrarily-chosen threshold of 5.0%.

donb

June 27, 2015 3:57 pm

I once read the following comment by a person who taught statistics.
He assigned his students a task of flipping a coin 100 times and recording the sequence of heads and tails. He then took the results and informed the class which students had done the exercise and which had “dry-labbed” the results, i.e. made them up. His answers were mostly correct.
His secret (which he conveyed to the students to make the point) was that most people think that the same result occurring three times in a row (e.g. three heads), and especially four or more times in a row, was very, very unlikely. The dry-lab results were those that had no (or very few) heads or tails occurring three or more times.
In reality these multiple occurrences are more common than people think.

noaaprogrammer

Reply to donb

June 27, 2015 9:23 pm

(A little OT, but similar): Before announcing that the class topic is on pseudo random number generators, I tell my students to write down 5 digits of their choice. We then plot the frequency distribution of those choices. The frequency of the middle digits is higher than that of the tails, 0 & 1, and 8 & 9; and the frequency of the odd digits is higher than that of the even digits. This illustrates that people are biased in their choice of digits, as one would expect a more or less even distribution if the choices were done randomly.
I then have them write down 100 consecutive digits chosen at random, (in lines of 10 digits each where the first digit of a succeeding line follows the last digit of the preceding line.) After they’re done, I have them circle the number of pairs of same digits, and the number of triplets of same digits. Most of the students are concentrating so hard on avoiding such occurrences in their efforts to “be random” that they fail to meet the statistical average of 10 pairs and 1 triplet.
The conclusion is that generating pseudo random sequences is difficult, as humans aren’t randomly inclined when they aren’t trying to be random, and they aren’t randomly inclined when they are trying to be random.

E.M.Smith

Editor

Reply to noaaprogrammer

June 28, 2015 1:55 pm

Well, then there is the whole question of “Ought the numbers actually BE random?”. Yes, for a random number generator or a truly fair coin, but real coins are not always evenly balanced and all…
As per the ice data set, this “natural bias” comes from the existence of a natural 60 year ocean / weather cycle, a roughly 8 year Southern Ocean Antarctic Circumpolar Wave
https://en.wikipedia.org/wiki/Antarctic_Circumpolar_Wave
the 18 ish year Saros Cycle of lunar tidal forces
https://en.wikipedia.org/wiki/Saros_%28astronomy%29
and the 1500 year cycle of tides caused by lunar cycles as longer term influences have effect.
So we KNOW there will be various interacting cycles causing observed pseudo-trends in the data as “heads and tails” of those sine waves line up, or not, with the start and end of the data observed…
So how does it make sense to apply a test of non-random to a non-random data set to find ‘trend’?

David A

Reply to noaaprogrammer

June 29, 2015 3:18 am

Yes EM. Is that not the reason for “Autocorrelation”? However the example Willis gave was,
““Autocorrelation” is how similar the present is to the past. If the temperature can be -40°C one day and 30°C the next day, that would indicate very little autocorrelation. But if (as is usually the case) a -40°C day is likely to be followed by another very cold day, that would mean a lot of autocorrelation.”
This is an example of non randomness WITHIN the period of study, but your example is of non randomness outside the period of study. I do not know if autocorrelation corrects for non random trends outside the period of study.
Willis?

Bob Armstrong

Reply to donb

June 27, 2015 9:29 pm

I spent a decade in grad school in a visual psychophysics lab . The bias of humans to disbelieve the frequency of long runs was pervasive . To make the point , here’s a couple of sequences of 100 pluses and minuses from a common random number generator which I’m sure will pass any simple chi^2 test or such .
++-----+-+-+-+-++-+-++-++-++--++++--+---+-+++-++-++-+++-----++---++++-+--++++--++-+-+-++---+++-+-+++
-------+----+++-+--+-+-+++--+++--+----+++-+----+++-+-++-+------+-+-++--+++-+++---+---+++++++--++++--

richard verney

Reply to donb

June 27, 2015 11:18 pm

Something that one can see every night on UK television where there is late night roulette.
One frequently sees runs of 3 and even 4 reds (or blacks) in a row. I have not looked for odd/even streaks because one would have to look at the numbers in detail, I merely channel hop over this late night gambling fad. But every night one can see this as they show the last dozen spins of the wheel, and certainly 3 of one of the two colours in a row is a common occurrence.

george e. smith

Reply to donb

June 30, 2015 1:51 pm

Well I have used statistical mathematics; or some aspects of it, for over 50 years, in my daily work; which often included recording the results of repeated experiments; or “trials” as Willis calls them.
But I don’t flip coins, so I haven’t done what Willis has.
So I am all the time (or have been) computing the “average value” of some data set of numbers, along with things like standard deviations.
But in my case, the numbers in my data sets, had a common property. All of the members of my typical data set, were supposed; in the absence of experimental errors, to be exactly the same number.
So my purpose in averaging, was to tend to reduce the random experimental error in my result. Systematic errors, of course posed additional uncertainties, but absent that, my expectation was that the probable random error in my result would diminish in about the square root of the total number of trials, or experiments.
With the caveat, of excluding systematic errors, my expectation is that this statistical average is the best result I can get from that experiment or measurement.
Now that is quite different from tossing a coin, as in Willis’s trial.
With my luck, if I tossed seven coins, just once, all seven of them would likely land on edge, just to annoy me.
So I don’t toss coins, just in case, that should happen.
But the use of statistics, and averaging, related to climate “science”, seems to be quite a different proposition altogether.
People in that field, seem to take single, non-repeatable measurements of different things (Temperatures ?? e.g. ) in quite different places, at quite different times, all of which should yield quite random unrelated results, with no expectation at all, that any of those measurements would be the same.
They then engage in what amounts to numerical Origami, during which they discard, all of these unrelated experimental observations, and replace them with either an entirely new and fictitious set of numbers, often referred to as “smoothing”, or else discard them all completely to be replaced by a single number; “the average.”
Now the algorithms of statistical mathematics are all described in detail, in numerous standard texts on that art form; and it is an art form, trying to make nothing out of something.
So you can take a very useful square of paper, on which you could write a nice poem, and by applying a simple algorithm, you can fold that useful piece of paper, and get an ersatz frog that can even jump; but is now much more difficult to write a poem on.
Well of course, the algorithms of statistical mathematics place very few restrictions on the elements of the data set.
The only requirement is that each of them is a real number. That is “real” in the mathematical sense, so NO imaginary, or complex numbers, and NO variables.
Each element is an exactly known number; although it is not necessarily the exact value of anything physically existing.
So the result of applying the algorithm is always exact, and any practitioner, applying the same algorithm, will get the exact same result from the same data set.
So statistics is an exact discipline, with NO uncertainty in the outcome.
And it always works on ANY set of numbers whatsoever that meet the “real” number condition.
The numbers of the data set, do not have to be related in any way. You could choose all the telephone numbers in your local phone yellow pages. Well you could also include the page numbers, or the street address numbers as well, or any subset of them.
So NO uncertainty surrounds the outcome of the application of ANY statistical mathematics algorithm.
Now if you like uncertainty, my suggestion is to look instead, not at the numbers you get from doing statistics, but at the absurd expectations for what meaning lies in that outcome.
There is NO inherent meaning, whatsoever. The result is just a number.
If you add the integers from 1 to 9 inclusive, you get a sum of 45 (always), and dividing by the number of elements (9) you get the average value of 4.5 (always) and as you can see it isn’t even one of the numbers in the data set.
The average value of all the phone numbers in your yellow pages, is likely to not even be a telephone number at all. Averaging numbers that aren’t even supposed to be the same, simply discards all those numbers in favor of a completely fictitious one.
So our self delusion, is in what we expect the outcome of our statistics to mean.
It means only what we choose it to mean. There is no inherent meaning.
Just my opinion of course. Most people (maybe 97%) would likely disagree with me.

lsvalgaard

June 27, 2015 4:05 pm

Even some well-known scientists fall in that trap. Several recent posts have quoted papers by Mike Lockwood, whose grip on statistics is not the best. In his famous paper in Nature that the coronal magnetic field has more than doubled in the last hundred years, Lockwood claimed that his finding was significant at the 99.999999999996% level [he ignored or didn’t know about auto-correlation].

GDauron

Reply to lsvalgaard

June 28, 2015 8:15 am

Doctor you are the best at getting to the point that I have ever seen! Please live many more years.

Gregory

June 27, 2015 4:06 pm

GeoLurking

Reply to Gregory

June 28, 2015 9:20 pm

Which is ASCII for * , the wildcard character on DOS machines… essentially, it means match everything. So, the ultimate answer, is “everything”.

Expat

June 27, 2015 4:08 pm

I took exactly 1 course in statistics when I studied engineering. It was an elective at that. Wish I had taken more. Who would think it’d be that useful? About the only thing I remember about it, besides how to calculate lottery odds (never have bought a ticket) is it’s usually easier to find the odds of something not happening and go from there – as you’ve shown above.
ps Willis, Plant some yellow leafed Japanese Maples under the Redwoods. The effect is excellent on those mostly cloudy days you have there.

Gamecock

Reply to Expat

June 28, 2015 4:20 am

The lottery is a tax on the mathematically challenged.

Louis Hooffstetter

Reply to Gamecock

June 29, 2015 9:03 pm

When it gets above $100 million, I get mathematically challenged.

EdA the New Yorker

Reply to Expat

June 28, 2015 8:15 am

Aside from being a highly regressive tax on people with poor math skills, state lotteries have some redeeming qualities. Students learning statistical mechanics can relate to the expectation value of the ticket exceeding its face value for a sufficiently high prize. For quantum mechanics, the ticket represents Schrödinger’s Cat; the wave function collapses at the drawing. This can then be extended to the probability density of electron position in an atom or molecule.
I also slip in that the students are about as likely to be killed in a traffic accident going to buy the ticket as they are to win the top prize. Bad sport I guess.

The other Casper

Reply to EdA the New Yorker

June 28, 2015 2:19 pm

It’s good having some new teaching models to replace the old ones. I’ve often found myself trying to explain levers and leverage using the playground “see-saw” that’s familiar from my childhood — only to be reminded that youngsters today (in the US, at least) have never seen these. Liability problems, I guess.
