This post over at Lucia’s Blackboard caught my interest because it shows you don’t need to operate heavy equipment (like Al Gore in his famous elevator scene) to get yourself a big data hockey stick. It is more evidence that Mann and Briffa’s selection criteria for trees lead to a spurious result.
It doesn’t matter whether you are using treemometers, thermometers, stock prices, or wheat futures: if your method isn’t carefully designed to prevent an artificial and unintentional selection of data that correlates with your theory, the whole study can turn out like a hockey stick. Anyone can do this; no special science skills are needed.
So, even though the method seems reasonable and the person doing it doesn’t intend to cherry-pick, rejecting trees that don’t correlate with the recent record biases an analysis unless some very sophisticated precautions are taken. It encourages spurious results and, in the context of the whole “hockey stick” controversy, effectively imposes hockey sticks on the results.
And she backs it up with a simple experiment anybody can do with Microsoft Excel.
Method of creating hockey stick reconstructions out of nothing
To create “hockey stick” reconstructions out of nothing, I’m going to do this:
- Generate roughly 148 years’ worth of monthly “tree-ring” data using rand() in EXCEL, corresponding to 1850-1998. I will impose autocorrelation with r=0.995. I’ll repeat this 154 times. (This number is chosen arbitrarily.) On the one hand, we know these series don’t correlate with Hadley because they are synthetically generated. However, we are going to pretend we believe “some” are sensitive to temperature and see what sort of reconstruction we get.
- To screen out the series that prove themselves insensitive to temperature, I’ll compute the correlation, R, between the Hadley monthly temperature data and the tree-ring data for each of the 154 series. To expose the problem with this method, I will compute the correlation only over the years 1960-1998. Then I’ll keep all series whose correlations have absolute values greater than 1.2 times the standard deviation of the 154 correlations R. I’ll assume the other randomly generated monthly series are “not sensitive” to temperature and ignore them. (Note: the series with negative values of R are the equivalent of “upside-down” proxies.)
- I’ll create a reconstruction by simply averaging over the “proxies” that passed the test just described. I’ll rebaseline so the average temperature and trend for the proxy reconstruction and Hadley match over 1960-1998.
- I’ll plot the average from the proxies and compare it to Hadley.
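For readers without Excel handy, the recipe above can be sketched in a few lines of Python with NumPy. This is my own paraphrase of Lucia’s steps, not her actual spreadsheet: the “temperature” series here is synthetic red noise with an added modern uptick standing in for the Hadley record, and the rebaseline step matches mean and scale rather than mean and trend.

```python
import numpy as np

rng = np.random.default_rng(0)

def red_noise(n, r=0.995):
    """AR(1) series: x[t] = r * x[t-1] + white noise."""
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = r * x[t - 1] + rng.standard_normal()
    return x

n_months = 148 * 12                        # monthly, ~1850-1998
n_series = 154
cal = slice(n_months - 39 * 12, n_months)  # last 39 years, the "1960-1998" window

# Stand-in for the instrumental record: mild red noise plus a modern uptick.
temp = red_noise(n_months, r=0.95) * 0.1
temp[cal] += np.linspace(0.0, 1.0, 39 * 12)

# 154 pure-noise "tree-ring" series.
proxies = np.array([red_noise(n_months) for _ in range(n_series)])

# Correlate each with the record over the calibration window only,
# and keep the tails of the correlation distribution.
r_cal = np.array([np.corrcoef(p[cal], temp[cal])[0, 1] for p in proxies])
keep = np.abs(r_cal) > 1.2 * r_cal.std()

# Flip the "upside-down" (negative-R) survivors and average.
recon = (proxies[keep] * np.sign(r_cal[keep])[:, None]).mean(axis=0)

# Rebaseline: match mean and scale over the calibration window.
recon = (recon - recon[cal].mean()) / recon[cal].std()
recon = recon * temp[cal].std() + temp[cal].mean()

print(keep.sum(), "of", n_series, "noise series passed screening")
print("calibration-window r:", round(np.corrcoef(recon[cal], temp[cal])[0, 1], 2))
```

Plotting recon against temp reproduces the qualitative picture: a decent match inside the screening window and a flat, uncorrelated shaft before it.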
The comparison from one (typical) case is shown below. The blue curve is the “proxy reconstruction”; the yellow is the Hadley data (all data are 12-month smoothed).
Notice that after 1960, the blue curve based on the average of “noise” that passed the test mimics the yellow observations. It looks good because I screened out all the noise that was “not sensitive to temperature”. (In reality, none of it is sensitive to temperature; I just kept the series that didn’t happen to fail.)
Because the “proxies” really are not sensitive to temperature, you will notice there is no correspondence between the blue “proxy reconstruction” and the yellow Hadley data prior to 1960. I could do this exercise a bajillion times and I’d always get the same result. After 1960, there are always some “proxies” that by random chance correlate well with Hadley. If I throw away the other “proxies” and average over the “sensitive” ones, the series looks like Hadley after 1960. But before 1960? No dice.
Also notice that when I do this, the blue “proxy reconstruction” prior to 1960 is quite smooth. In fact, because the proxies are not sensitive, the past history prior to the “calibration” period looks unchanging. If the current period has an uptick, applying this method to red noise will make the current uptick look “unprecedented”. (The same would happen if the current period had a downturn, except we’d have unprecedented cooling.)
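The flat “shaft” is just arithmetic: averaging K independent noise series shrinks the variance of the average by roughly a factor of K, so everything outside the screening window collapses toward the mean. A quick numerical check (illustrative values of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(1)

def red_noise(n, r=0.995):
    # AR(1) "proxy" noise
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = r * x[t - 1] + rng.standard_normal()
    return x

n, K = 1200, 40
series = np.array([red_noise(n) for _ in range(K)])

single_sd = series.std(axis=1).mean()  # typical wiggle size of one series
mean_sd = series.mean(axis=0).std()    # wiggle size of the average of all K

print(f"one series: sd ~ {single_sd:.2f}")
print(f"average of {K}: sd ~ {mean_sd:.2f}")
```

The average wiggles far less than any individual series, which is exactly why the pre-1960 portion of a screened-noise reconstruction looks smooth and “unprecedented-free”.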
The red curve
Are you wondering what the red curve is? Well, after screening once, I screened again. This time, I looked at all the proxies making up the “blue” curve and checked whether they correlated with Hadley over the period 1900-1960. If they did not, I threw them away. Then I averaged to get the red line. (I did not rebaseline again.)
The purpose of the second step is to “confirm” the temperature dependence.
Having done this, I get a curve that sort of looks like Hadley from 1900-1960. That is: the wiggles sort of match. The “red proxy reconstruction” looks very much like Hadley after 1960: both the “wiggles” and the “absolute values” match. It’s also “noisier” than the blue curve, because it contains fewer “proxies”.
But notice that prior to 1900, the wiggles in the red proxy and the yellow Hadley data don’t match. (Also, the red proxy wants to “revert to the mean”.)
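The double screening can be sketched the same way: screen once over the late window, then “confirm” the survivors over the middle window. Again, a synthetic series stands in for the Hadley data, and the window lengths are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

def red_noise(n, r=0.995):
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = r * x[t - 1] + rng.standard_normal()
    return x

n = 148 * 12
late = slice(n - 39 * 12, n)            # the "1960-1998" screening window
mid = slice(n - 99 * 12, n - 39 * 12)   # the "1900-1960" confirmation window

temp = red_noise(n, r=0.95) * 0.1       # synthetic stand-in for Hadley
temp[late] += np.linspace(0.0, 1.0, 39 * 12)

proxies = np.array([red_noise(n) for _ in range(154)])

def corr(p, w):
    return np.corrcoef(p[w], temp[w])[0, 1]

r_late = np.array([corr(p, late) for p in proxies])
blue = proxies[np.abs(r_late) > 1.2 * r_late.std()]  # first screen ("blue")

r_mid = np.array([corr(p, mid) for p in blue])
red = blue[r_mid > 0]                                 # "confirmation" screen ("red")

print(len(blue), "survive the first screen;", len(red), "survive the second")
```

The second screen thins the pool further, which is why the red curve is noisier than the blue one, and it still buys nothing before the earliest screening window.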
Can I do this again? Sure. Here are the two plots created on the next two “refreshes” of the EXCEL spreadsheet:
I can keep doing this over and over. Some “reconstructions” look better; some look worse. But these don’t look too shabby when you consider that none of the “proxies” are sensitive to temperature at all. This is what you get if you screen red noise.
Naturally, if you use real proxies that contain some signal, you should do better than this. But knowing you can get this close with nothing but noise should make you suspect that screening based on a known temperature record can bias your answers to:
- Make a “proxy reconstruction” based on nothing but noise match the thermometer record and
- Make the historical temperature variations look flat and unvarying.
So, Hu is correct. If you screen out “bad” proxies based on a match to the current temperature record, you bias your answer. Given the appearance of the thermometer record during the 20th century, you bias toward hockey sticks!
Does this mean it’s impossible to make a reliable reconstruction? No. It means you need to think very carefully about how you select your proxies. Just screening to match the current record is not an appropriate method.
Update
I modified the script to show the number of proxies in the “blue” and “red” reconstructions. Here is one case, the second will be uploaded in a ’sec.
Steve McIntyre writes in comments:
Steve McIntyre (Comment#21669)
October 15th, 2009 at 4:24 pm
Lucia, in addition to Jeff Id, this phenomenon has now been more or less independently reported by Lubos, David Stockwell and myself. David published an article on the phenomenon in AIG News, online at http://landshape.org/enm/wp-co…..6%2014.pdf . We cited this paper in our PNAS comment (as one of our 5 citations.) I don’t have a link for Lubos on it, but he wrote about it.
I mentioned this phenomenon in a post prior to the start of Climate Audit, carried forward from my old website from Dec 2004 http://www.climateaudit.org/?p=9, where I remarked on it in connection with Jacoby and D’Arrigo picking the 10 most “temperature-sensitive” of the 35 that they sampled, as follows:
If you look at the original 1989 paper, you will see that Jacoby “cherry-picked” the 10 “most temperature-sensitive” sites from 36 studied. I’ve done simulations to emulate cherry-picking from persistent red noise and consistently get hockey stick shaped series, with the Jacoby northern treeline reconstruction being indistinguishable from simulated hockey sticks. The other 26 sites have not been archived. I’ve written to Climatic Change to get them to intervene in getting the data. Jacoby has refused to provide the data. He says that his research is “mission-oriented” and, as an ex-marine, he is only interested in a “few good” series.
===
Read the whole post at Lucia’s blog here
I encourage readers to try these experiments in hockey stick construction themselves. – Anthony

Tom P>> Would you agree with this statement:
IF the data are random and the method described by Lucia is used, THEN the more series you add, the straighter and smoother the “unmatched” parts of the series will be. At the same time, the more series you add, the more the “uptick”/blade of the stick will resemble, in detail, the data you’re matching against.
?
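For what it’s worth, that statement is easy to check numerically. A rough sketch (my own construction, with arbitrary parameters): run the screening recipe with a small and a large pool of candidate noise series, then compare the shaft’s spread and the blade’s fit.

```python
import numpy as np

rng = np.random.default_rng(3)

def red_noise(n, r=0.995):
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = r * x[t - 1] + rng.standard_normal()
    return x

def screened_recon(n_candidates, n=1776, cal_len=468):
    """Screen n_candidates noise series against a synthetic record,
    then return (survivor count, shaft sd, blade correlation)."""
    cal = slice(n - cal_len, n)
    temp = red_noise(n, r=0.95) * 0.1
    temp[cal] += np.linspace(0.0, 1.0, cal_len)   # modern uptick
    proxies = np.array([red_noise(n) for _ in range(n_candidates)])
    r = np.array([np.corrcoef(p[cal], temp[cal])[0, 1] for p in proxies])
    keep = np.abs(r) > 1.2 * r.std()
    recon = (proxies[keep] * np.sign(r[keep])[:, None]).mean(axis=0)
    shaft_sd = recon[: n - cal_len].std()
    blade_r = np.corrcoef(recon[cal], temp[cal])[0, 1]
    return int(keep.sum()), shaft_sd, blade_r

k_small, sd_small, r_small = screened_recon(50)
k_big, sd_big, r_big = screened_recon(1000)

print(f"{k_small} survivors: shaft sd {sd_small:.2f}, blade r {r_small:.2f}")
print(f"{k_big} survivors: shaft sd {sd_big:.2f}, blade r {r_big:.2f}")
```

With more candidates you get more survivors, a flatter shaft (more independent noise averaged away), and a blade that hugs the target more tightly, consistent with the statement above.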
re Tom P ‘… The issue has indeed been thought about very carefully.’
Thinking about something in no way guarantees understanding.
Perhaps one might think on that.
CA posts on this topic include http://www.climateaudit.org/?p=4908 4216 3821 3838 3858.
While the matter may have been thought about “very carefully”, unfortunately with Team articles, that often means that you also have to examine it very carefully.
If something is a temperature “proxy”, then one would expect it to have significant correlation in both “early” and “late” periods as defined in M08. Only 342 of the 1209 do (and one of these is disqualified in my opinion because the “significant” correlation has a different sign in each period).
Further breaking down the 342 of 1209: 71 of 71 Luterbacher proxies pass, but these use instrumental data and thus are irrelevant. Of the 927 code 9000 dendro proxies (the majority), only 143 (15.4%) pass the first-cut test. There are issues pertaining to AR1 autocorrelation that would reduce this further. Also, Mann picks the better of two correlations, goosing the observed number up a bit without adjusting the benchmark. I didn’t test how to construct a proper benchmark; at this stage, one merely knows that Mann’s benchmark is fudged. However, 93 of 105 MXD proxies pass (these are the Rutherford RegEMed version of Briffa’s MXD proxies). These suffer from the divergence problem, and therefore Mann 2008 deleted the post-1960 portion of these proxies and substituted infilled data. I don’t recall offhand whether we figured out which version the correlations were calculated on. The Briffa MXD series are relatively short: none go back before AD1400, so they do not pertain to the MWP-modern comparison that is the main game.
Tom P, as usual, speaks with excessive certainty on a matter where he has not familiarized himself with the relevant analysis.
David Ball (13:12:03) : ” ‘Only interested in a few good series’ and ‘mission oriented’ [speak] volumes. The refusal to provide data for replication is inexcusable. When are the governing bodies going to admonish these people? After science is dead? Methinks the defibrillator is needed now.”
“She’s dead, Jim.”
Instead of throwing out data, should the calibration instead adjust factors which are applied to all of the data? If a linear or geometric relationship doesn’t exist, that’s just too bad.
The Nobel Prize in Chemistry will go to someone who very carefully thought about how to change lead into gold. Someone who tried very hard and very sincerely. That the results are not gold is a detail, some of them do look like yellow metal.
Excellent explanation Lucia.
This tale reminds me of days long ago when I was evaluating gold properties for a mining company. One class of property was the gold-quartz vein with pockets of free (visible) gold. This type of property was typically owned by a junior mining company with a name like Free Gold Inc. or GMITS Inc. The owners, typically a prospector, a promoter and a couple of dentists, could proudly point to sections of the vein and truthfully say “John Doe (a reputable geologist whom I knew) took a sample from here that ran 10 ounces to the ton, and over there he got 5 ounces to the ton.”
On examining and sampling the exposed portions of the vein I would note the small pockets containing visible gold (a sampling QA/QC and statistical nightmare) and long sections of barren-looking quartz vein (a mining/grade-control nightmare).
On receipt of the assay results from my sampling (typically one or two interesting values and a lot of traces or nils) I would write them a nice letter expressing our regrets that their property was not the type of gold property we were looking for, and as a courtesy enclose a copy of the assay results of my sampling.
The owners were still convinced that they had a high-grade gold vein and had more assay reports to prove it. In reality they had an essentially barren quartz vein with small pockets containing erratically distributed flakes of gold.
The interesting thing is that while they could never get a mining company to put the property into production (and make them millionaires), they could always find another dentist, doctor or lawyer to put more money into the company.
I guess that it will always be that some people (possibly most people) will fixate on data that conforms to their beliefs and reject or ignore that which doesn’t.
My apologies to dentists, doctors and lawyers, but they kept cropping up time after time.
Espen (14:34:30) : “I’ve just started to read all the articles on this, especially Steve McIntyre’s work, and am still just flabbergasted. It seems to me that the hockey players are breaking the most fundamental principles of statistics (wrt. randomness of samples and using separate sample sets for model building and the actual tests), but I need to read a little more on PCA on time series to get a grasp of this.”
Special rules seem to apply to climatology in general and to dendroclimatology in particular. The relevance of tree cores is extremely low, given that they are highly ambiguous per se and are being used to attempt to measure localized atmospheric temperatures. Any dendro climatic signals are easily obscured by weather signals. The oceans are ~1200 times as great a heatsink as the atmosphere. How significant can this type of study be? It’s astrology.
“_Jim (15:36:58) :
I see loading forks on a number of those Gradalls and the JLG showing in the picture (meaning: they were used in warehouse operations).”
Jim,
Now I know what the AGW’ers feel like when they get debunked.
I do have a picture very similar though of construction equipment up at auction.
There are cases in research where this is an acceptable practice (i.e., gathering data and then using only the subjects that are sensitive to the treatments). I did this in my research. It was necessary because we did not put our subjects to sleep (which would have caused nearly all subjects to be sensitive to the treatment). So instead we had to find subjects whose brains were quiet enough when awake to allow their auditory brainstem synaptic response to the signals to rise above the noisy synaptic brain when listening to high-frequency tone pips. I even had my brain tested and it turned out to be WAAAYYYYY too noisy for any use in the study. A few years later I had an EEG done (I was put to sleep). It confirmed that even when asleep, my brain just keeps talking (go figure). I was able to find 6 subjects who had quiet brains. As a result we were able to demonstrate that the auditory pathway was sensitive to narrow frequencies from the get-go (at the first synaptic junction as well as the later ones in the brainstem).
So I can understand why a researcher would think that looking through the trees to find the sensitive ones is okay to do. However the difference is this: In my case we were removing higher noisy responses that were not related to frequency response in the brainstem (the noise comes from higher synaptic junctions and could be the result of just “thinking”). So we needed to remove noisy brains or put all our subjects to sleep.
With trees, the rings demonstrate the tree’s response to the environment, which is the treatment under consideration. So all rings should be used within a stand that is subject to the same environment (meaning you should not mix trees in a meadow with trees growing next to a river bank). If the intent was to remove trees that were growing next to a stream from trees that were in the meadow, it would make sense to remove those rings from the data pool. Is it possible this is what was done?
“stevemcintyre (16:02:45) :
Of the 927 code 9000 dendro proxies (the majority), only 143 (15.4%) pass the first-cut test.”
Roughly what they claim would be expected by chance alone (~13%)
Joel Shore, speaking of things that have been understood for some time: the tree ring proxy has been known for at least three decades to have very little accuracy when dealing with temperature. As a child, I remember dinner table conversations between my father and his colleagues over this very issue. It was known to be problematic 30 years ago. Climate science should be way beyond this stuff, and yet here we are arguing its validity today. The use of tree ring proxies is a joke, yet the people using them are still trying to show they are valid. To discuss the interpretation of this proxy data is a waste of time. Examine your basic assumptions.
Very nice article by Lucia and very clear & valid comment by michel (13:47:43)
I remember when research on tree rings was an important field as a proxy for climate favorability to local farm or food production in areas without written records. Sometimes this was used as a precipitation proxy and at other times as a temperature proxy by lazy theorists to prove their point.
Tree rings are a good proxy for plant growth conditions and little else. With nearly 60 years of growing trees and other plants, I can attest to that fact.
[i]AnonyMoose (16:25:46) :
Instead of throwing out data, should the calibration instead adjust factors which are applied to all of the data? If a linear or geometric relationship doesn’t exist, that’s just too bad.[/i]
Yes, exactly. That’s the procedure that Pamela Gray describes. Call this Method A. You have a huge number of trees. You may have some reason for thinking that some trees, and not others, are thermometers. They may, for instance, be of a particular species, particularly large, old, undamaged, regular in shape, irregular in shape, in a particular region. Whatever.
For some reason connected to an account of their biology, you think you’ve found the thing that makes these particular ones accurate registers of temperature. So you pick them. Notice that so far you have not compared any of them to any temperature record.
THEN you plot the temperatures they indicate.
Now you check your results by comparing the temps they seem to indicate with a known real thermometer record you trust for part of the period you are interested in. They show a reasonably decent correlation for this period. Now you have some reason to find it plausible that this sample are thermometers, at least for some period. So you have some reason to think that they will also be thermometers in other periods.
The method Lucia is exposing, Method B, is this: you take the same initial large sample and the same temperature record, compare your sample against the temperature record, and throw out all the series that do not correlate. This is not legitimate.
Again, you can see why if you consider how likely the two methods are to lead to the same sub-sample. They will do so only under one condition: that a temperature match in your sample period is found only in trees with some biological reason for being good thermometers throughout their lives. But that is the question at issue, so you’ve assumed what you are trying to prove. Method A provides some kind of test of whether your sample is a sample of thermometers, precisely because you did the picking before doing any temperature checks.
It is possible that Method B could work, and yield a sample of thermometers. How would you know? Well, you’d have to do Method A, find the key biological variables, then see if Method B also selected the trees with them. Bit of a waste of time then, doing Method B. In short, you have no way of avoiding doing Method A. Method B adds nothing.
You can also see the problem if you consider how certain you can be, in the first case, that your sample will correlate with temperature in any given period. Not at all certain, a priori. It really will depend on whether your theory about the biology of the trees is correct, and that is one of the things you’ll be finding out in Method A when you compare your runs against known temps.
Therefore, in Method A, correlating against temperatures really may give you some new information. In Method B, as the red noise example shows, it does not. It just picks the samples that show what you want them to show in the period you’ve picked for correlation.
Take this a step further. Someone criticizes your procedure in Method B by arguing that you just picked trees that correlate in your temperature measured time, and that this shows nothing about whether this sample correlated earlier or later. Aha, you say. I have calculated the results of picking from my tree sample randomly, and correlating the resulting sample against temperature in my period. If I do this, I find that very few of my trees correlate with temps. This shows I am not picking at random. I am picking ones that really do correlate.
Which of course, we knew all along was being done, and the problem was not that at all. It was that the procedure gave us no reason to think this sample would show the same correlations outside the measured temp period that it did within it.
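michel’s point can be put numerically: series selected for correlation inside a screening window are guaranteed to correlate there, by construction, yet show no skill in a disjoint holdout window. A sketch with pure-noise “proxies” (the 0.5 cutoff and the window sizes are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(4)

def red_noise(n, r=0.995):
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = r * x[t - 1] + rng.standard_normal()
    return x

n = 1776
cal = slice(n - 468, n)    # screening window
hold = slice(0, n - 468)   # disjoint holdout window

temp = red_noise(n, r=0.95)
proxies = np.array([red_noise(n) for _ in range(500)])

# Select the noise series that happen to correlate in the screening window.
r_cal = np.array([np.corrcoef(p[cal], temp[cal])[0, 1] for p in proxies])
picked = proxies[r_cal > 0.5]   # arbitrary "good thermometer" cutoff

in_r = np.mean([np.corrcoef(p[cal], temp[cal])[0, 1] for p in picked])
out_r = np.mean([np.corrcoef(p[hold], temp[hold])[0, 1] for p in picked])

print(f"selected {len(picked)} of 500")
print(f"mean r inside the screening window: {in_r:.2f}")
print(f"mean r in the holdout window:       {out_r:.2f}")
```

The in-window correlation is high because we selected for it; the holdout correlation hovers near zero, which is the whole problem with Method B.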
There is no way around this one. You cannot do sampling like this. This is like the absorption spectrum of CO2, or gravity: this really is settled science. There is no point denying it. It’s just that because it’s statistics, it can be a bit hard to get your head around. But it really is settled.
It does seem very odd that Climate Scientists should refuse to accept settled science when it comes to statistics, or should misrepresent what the settled science actually is – they have form on this, have a look at Tamino’s series on PCA – but accuse others of denialism when they point out that the science on this matter is not what they are saying it is. Doing and defending statistics in this way really is denialism.
Tom P (14:31:51) :
“Although 484 (~40%) pass the temperature screening process over the full (1850–1995) calibration interval, one would expect that no more than ~150 (~13%) of the proxy series would pass the screening procedure described above by chance alone. This observation indicates that selection bias, although potentially problematic when employing screened predictors… does not appear a significant problem in our case.”
I’ve got an issue with this.
Of all the proxies captured, they concede that ~13% would pass by chance alone. They then discard the 60% that fail and keep the remaining 40%, but that kept set can still contain the chance passers. In effect, up to 13/40 ≈ 32.5% of the retained data could be there by luck.
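One way to probe the “~13% by chance” benchmark is a quick Monte Carlo: correlate pure-noise “proxies” against a noise “temperature” series and count how many clear a fixed significance threshold. The sketch below uses illustrative AR(1) coefficients and a nominal 5% two-sided threshold, not Mann et al.’s actual screening rule; the point is that persistence inflates the chance-pass rate well above the naive nominal level, which is why the benchmark itself matters.

```python
import numpy as np

rng = np.random.default_rng(5)

def ar1(n, r):
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = r * x[t - 1] + rng.standard_normal()
    return x

n, trials = 146, 2000        # 146 "years", 1850-1995
thresh = 0.162               # ~ two-sided p=0.05 for n=146 *independent* points

def pass_rate(r_proxy, r_temp):
    """Fraction of pure-noise proxies whose |correlation| with a noise
    'temperature' series clears the threshold."""
    temp = ar1(n, r_temp)
    hits = sum(abs(np.corrcoef(ar1(n, r_proxy), temp)[0, 1]) > thresh
               for _ in range(trials))
    return hits / trials

white = pass_rate(0.0, 0.0)  # independent data: ~5%, as the threshold intends
red = pass_rate(0.9, 0.9)    # persistent data: far more pass by luck

print(f"white-noise chance-pass rate: {white:.2f}")
print(f"red-noise chance-pass rate:   {red:.2f}")
```

With independent data the nominal threshold behaves as advertised; with persistent series, far more pure noise passes, so a chance-pass benchmark computed without accounting for autocorrelation is too lenient.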
A Mathematica notebook, and its PDF preview, showing how you get a hockey stick out of red noise if you prefer the series that show warming at the end:
http://cid-9cd81cfa06ff7718.skydrive.live.com/self.aspx/.Public/mann-hockey-fun.pdf
http://cid-9cd81cfa06ff7718.skydrive.live.com/self.aspx/.Public/mann-hockey-fun.nb
http://motls.blogspot.com/2009/09/beaten-with-hockey-sticks-yamal-tree.html
The problem of the MBH98, MBH99, and similar papers was (and is!) that the algorithm preferred proxies – or trees (or their equivalents) – that showed a warming trend in the 20th century, assuming that this condition guaranteed that the trees were sensitive to temperature.
But even if such a 20th century trend occurred by chance for a certain tree (and a fraction of the trees inevitably satisfies this condition), the corresponding tree would influence Mann’s final graphs a lot. Effectively, the algorithm picked a lot of trees that didn’t show any correlation with the temperature but rather were composed of random data (red noise) before 1900 plus an increasing trend in 1900-2000.
You can’t be surprised that the average of such trees looked like a hockey stick even if the temperature didn’t. The noise before 1900 averages to a constant temperature or something close to it while the 20th century warming survives.
stevemcintyre (16:02:45) :
“I didn’t test how to do a proper benchmark – at this stage, one merely knows that Mann’s benchmark is fudged.”
That’s what I call a statistical analysis! In the original posting you referenced, you promised back in January to look further into “questionable Mannian benchmarks”. Are you going to do this?
Why not extend the timeline of the random data to 2100? You can get future climate data this way much cheaper than modelling and at least the same quality, probably even better.
dhmo (13:07:32) :
If its all the same to you, I would prefer to keep the rectal thermometer I have.
Tom P>> Please enlighten us:
Please tell us how Mann et al know that one should expect ~13% conformity with measurements from a purely random source?
I’m not sure that I will understand the answer, but at least you owe the more mathematically gifted people here that.
Sooner or later someone is going to come up with something witty regarding Pres Washington and cherry trees… I can feel it in my waters. Somehow I sense a missed opportunity that he did not graph the rings.
Anywho… as for data & statistics, look no further than Aesop (and that IS going back a ways):
“We can easily represent things as we wish them to be.”
Tru dat.
Greg Cavanagh (00:47:35)>>
I’m not sure that is the context of the Mann quote. I reacted the same way as you first, but:
I believe they are saying that if the conforming sources amounted to only ~13%, then those series would probably be of a random nature.
This actually doesn’t mean that ~13% is always random noise.
I’m still wondering how on earth they would know that 13% is the magic number.
Perhaps there is some statistical algorithm that can resolve this likelihood?
I’m waiting for Tom P to give us this algorithm.
On the other hand, I really don’t see why such an algorithm would be of any significance, since:
A) The sources are organic and may well have developed differently during different time spans.
B) Point A) means that any source, or none, could be valid BEFORE ~1850, totally unrelated to any mathematical formulae.
I believe that A and B are supported by people who know much about trees and vegetation in general.
And if A and B are true, then this is also true:
C) The probability of pre-1850 data being noise, as opposed to temp data, must be resolved also.
If C) is resolved and it is found that the probability for these data being temp is sufficiently high, then one should proceed to:
D) Gather measured temp data from exactly those areas where the samples were collected (which would probably only leave satellite temp data from ~1980 to compare with) and then:
E) Start over again from A), and only do matching against the actually measured data (satellite) from the exact areas where one has gathered the proxy data.
I believe the above A-E process is impossible, since there will probably be a huge gap between available proxies and temp data from ~1980, but this process only would reveal how T might have fluctuated in these actual areas of proxy collection.
I believe there is a looong way to wander before any significance can be given to these pre-measurement data used in reconstructions.
I have always thought it ironic that Gore used a cherry picker (what we in the UK call an elevator machine) to demonstrate the Hockey Stick graph. Now, with Mr McIntyre lifting the lid on the diddled dendro revelations, the irony is all the more delicious.
I like Lucia’s simple article, very informative, and fun too! I like that it demonstrates things in a way a layperson can understand.
One of the things that has often troubled me about Climate Science, is the attitude that, the science is so complex that only experts are qualified to judge. And yet, sooner or later, one has to summarise, and the summary needs to make sense. I’m sure one needs to be expert to grasp the details, but one need only have a brain to grasp the logic.
As for suggestion that the biases have already been corrected or accounted for, well, what can I say, I’ve been taught a little something by various experts on how people deceive themselves, how I deceive myself, and I know that it takes a lot of introspection and testing to actually get out of that hole. The automatic response is to create a smoke screen, and I emphasise, it is automatic–the person under the illusion doesn’t know it. Just watch a friend say he’s planning on starting a diet on Jan 1st, as a New Year’s resolution, and then watch in January and February all the reasons the friend gives for why it isn’t the right time to start yet. All the reasons are reasonable and make sense. We all know this. No expertise required.
So when scientists are criticised, their supporters can claim, “oh, we already knew that! we’ve already corrected!” The more truthful can say, “oh, well, um… I’m sure they probably corrected for that already!”
The cool thing about being a layperson, or not having a vested interest in the topic, is that there is no pressure to be right about it. I mean, this is what lay AGW supporters keep saying: they keep claiming that various scientists are paid by the oil industry, in effect affirming that scientists can be easily biased. And yet any scientist in favour of AGW is somehow so immune to bias that he can self-correct for any and all bias; we are even to assume that they have done this, to the point that we should be the ones to jump through hoops to disprove it, rather than them jumping through hoops to show that they have.
I’m sure my friend is very right, that he is already perfectly well aware that he didn’t start the diet, and as he says that’s for some very good reasons–plus he’s already planning a new diet that’ll be twice as effective–and he doesn’t need to say what they are, for what business is it of mine anyway.