Guest Post by Willis Eschenbach
Since we’ve been discussing smoothing in datasets, I thought I’d repost something that Steve McIntyre had graciously allowed me to post on his amazing blog ClimateAudit back in 2008. Let me start by saying that when I got involved in climate science, the go-to blog was the late, great John Daly’s blog, “Still Waiting for Greenhouse”. Sadly, when I went today to get the URL, I got the “Account Suspended” message … Yikes! That was an international treasure trove of climate history! Can we reverse that? Or are we at the mercy of the Wayback Machine? Does his archive exist, and can a host for it be found? [UPDATE: thanks to quick action by Ric Werme, John Daly's site is back up again, and plans are afoot to move it to a more secure location. -w.]
Figure 1. Lagged cross-correlations of CO2 and temperature between smoothed (brown line) and unsmoothed (red line) data.
In any case, after John’s death in January 2004, there was a void for about a year, and then Steve McIntyre founded ClimateAudit. From the start his work has been thorough, transparent, and fascinating. To this day he continues to find staggering gaps in the published claims and papers.
In addition to serving as an exemplar of honest, clear, transparent science, there are several other things that I am grateful to Steve for. One is that after I had been commenting on his blog for some time, Steve offered me space on ClimateAudit as a guest author. What follows below is one of those pieces.
A second thing I appreciate about Steve is that he pushed me repeatedly to get off of Excel and move to the computer language “R”. R is far and away the finest computer language I’ve ever learned, and I’m hardly a novice. My first computer language was Algol in 1963. Since then I’ve learned Fortran, LISP/LOGO, C, C++, several dialects of Basic including Visual Basic, Pascal, Hypertalk, Datacom, Assembly Language, VectorScript, and a couple of the Mathematica languages. I could not have done a tenth of the work I’ve done using any other language except Mathematica, and the learning curve for Mathematica was so steep I got nosebleeds. Plus R is free, friends, free, and it’s cross-platform, and it has hosts and hosts of packages for all kinds of special purposes. I can only pass on Steve’s excellent advice: learn R, you won’t regret it. Let me digress quickly and give a quick example of just one of the many reasons why R is superior. You’re welcome to skip these couple of paragraphs if you desire.
Suppose we have a block of data called ClimateData. It has columns of measurements of temperature, pressure, and the like. The rows represent times, perhaps months. Let’s say we want to add 3 to all of the data. In almost all computer languages, you have to loop repeatedly through the data to do that. The “pseudocode” to add 3 would look like this, with comments indicated by the hashmark “#”:
nrows = RowCount(ClimateData)                # get the number of rows
ncols = ColumnCount(ClimateData)             # get the number of columns
for myRow varying from 1 to nrows            # step through each row
    for myColumn varying from 1 to ncols     # step through each column
        ClimateData[myRow, myColumn] = ClimateData[myRow, myColumn] + 3   # do the actual work
    next myColumn                            # end of the inner loop
next myRow                                   # end of the outer loop
Now, compare all of that opportunity for hidden errors with the corresponding actual code to do the same thing in R:
ClimateData = ClimateData + 3
I rest my case, and return to the subject at hand.
Finally, I acknowledge Steve McIntyre for being my guide to the elusive process of becoming more Canadian, that is to say less excitable, more reserved in speech, and not letting my blood get all angrified by the actions of less-than-well-meaning anonymous internet chuckleheads. Despite good intentions I make slow progress in that regard, I fear. Still, progress continues, however slow; I figure I’ll be eligible for Canadian citizenship sometime before my age hits triple digits … In any case, here’s my 2008 post on the question of smoothing and correlation.
Data Smoothing and Spurious Correlation
Allan Macrae has posted an interesting study at ICECAP. In the study he argues that the changes in temperature (tropospheric and surface) precede the changes in atmospheric CO2 by nine months. Thus, he says, CO2 cannot be the source of the changes in temperature, because it follows those changes.
Being a curious and generally disbelieving sort of fellow, I thought I’d take a look to see if his claims were true. I got the three datasets (CO2, tropospheric, and surface temperatures), and I have posted them up here. These show the actual data, not the month-to-month changes.
In the Macrae study, he used smoothed datasets (12 month average) of the month-to-month change in temperature (∆T) and CO2 (∆CO2) to establish the lag between the change in CO2 and temperature. Accordingly, I did the same. [My initial graph of the raw and smoothed data is shown above as Figure 1; I repeat it here with the original caption.]
Figure 1. Cross-correlations of raw and 12-month smoothed UAH MSU Lower Tropospheric Temperature change (∆T) and Mauna Loa CO2 change (∆CO2). Smoothing is done with a Gaussian average, with a “Full Width at Half Maximum” (FWHM) width of 12 months (brown line). Red line is correlation of raw unsmoothed data (referred to as a “0 month average”). Black circle shows peak correlation.
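For anyone who wants to try this kind of calculation, here is a minimal sketch in R of a Gaussian average specified by its FWHM, applied to month-to-month changes before computing lagged correlations. It is not the code behind Figure 1. The helper names (gaussian_weights, gaussian_smooth, lagged_cor) are made up for illustration, the kernel is cut off at about three standard deviations as an assumption, and the two series are random stand-ins rather than the UAH and Mauna Loa data.

```r
# Hypothetical helper: Gaussian weights for a given FWHM (in months).
gaussian_weights <- function(fwhm) {
  sigma <- fwhm / (2 * sqrt(2 * log(2)))   # FWHM = 2 * sqrt(2 * ln 2) * sigma
  half  <- ceiling(3 * sigma)              # assumption: truncate at about 3 sigma
  w <- dnorm(-half:half, sd = sigma)
  w / sum(w)                               # normalise so the weights sum to 1
}

# Centred Gaussian moving average. The ends, where the window
# runs off the data, come back as NA.
gaussian_smooth <- function(x, fwhm) {
  as.numeric(stats::filter(x, gaussian_weights(fwhm), sides = 2))
}

# Correlation of x[t] with y[t + k] for each lag k.
# A peak at positive k means y follows x by k steps.
lagged_cor <- function(x, y, lags) {
  n <- length(x)
  sapply(lags, function(k) {
    if (k >= 0) cor(x[1:(n - k)], y[(1 + k):n], use = "complete.obs")
    else        cor(x[(1 - k):n], y[1:(n + k)], use = "complete.obs")
  })
}

set.seed(1)
temp <- as.numeric(arima.sim(list(ar = 0.9), n = 360))  # stand-in, NOT the UAH series
co2  <- as.numeric(arima.sim(list(ar = 0.9), n = 360))  # stand-in, NOT the Mauna Loa series

dT   <- diff(temp)    # month-to-month change
dCO2 <- diff(co2)

lags      <- -24:24
raw_cc    <- lagged_cor(dT, dCO2, lags)                            # "0 month average"
smooth_cc <- lagged_cor(gaussian_smooth(dT, 12),
                        gaussian_smooth(dCO2, 12), lags)           # 12-month FWHM
peak_lag  <- lags[which.max(smooth_cc)]                            # the "black circle"
```

Plotting raw_cc and smooth_cc against lags gives the same kind of picture as the figure, with whatever data you feed in.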
At first glance, this seemed to confirm his study. The smoothed datasets do indeed have a strong correlation of about 0.6 with a lag of nine months (indicated by the black circle). However, I didn’t like the looks of the averaged data. The cycle looked artificial. And more to the point, I didn’t see anything resembling a correlation at a lag of nine months in the unsmoothed data.
Normally, if there is indeed a correlation that involves a lag, the unsmoothed data will show that correlation, although it will usually be stronger when it is smoothed. In addition, there will be a correlation on either side of the peak which is somewhat smaller than at the peak. So if there is a peak at say 9 months in the unsmoothed data, there will be positive (but smaller) correlations at 8 and 10 months. However, in this case, with the unsmoothed data there is a negative correlation for 7, 8, and 9 months lag.
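A small simulation illustrates the point. This is a sketch under stated assumptions, not an analysis of the real data: the driver series is positively autocorrelated (an AR(1) process with coefficient 0.8, chosen arbitrarily), the response is simply the driver delayed by nine steps plus noise, and lagged_cor is a made-up helper.

```r
# Correlation of x[t] with y[t + k] for each lag k (hypothetical helper).
lagged_cor <- function(x, y, lags) {
  n <- length(x)
  sapply(lags, function(k) {
    if (k >= 0) cor(x[1:(n - k)], y[(1 + k):n], use = "complete.obs")
    else        cor(x[(1 - k):n], y[1:(n + k)], use = "complete.obs")
  })
}

set.seed(42)
n        <- 2000
true_lag <- 9

# Assumption: a positively autocorrelated driver.
x <- as.numeric(arima.sim(list(ar = 0.8), n = n))

# y follows x by nine steps, plus noise. The first nine values
# of y have no driver available, so they are NA.
y <- c(rep(NA, true_lag), x[1:(n - true_lag)]) + rnorm(n)

lags <- 0:18
cc   <- lagged_cor(x, y, lags)            # raw data, no smoothing at all

peak <- lags[which.max(cc)]               # lag of the strongest correlation
round(cc[lags %in% c(8, 9, 10)], 2)       # the peak and its two neighbours
```

With a genuine lag built in, the unsmoothed data show the peak at the right place, flanked by smaller positive correlations on either side.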
Now Steve McIntyre has posted somewhere about how averaging can actually create spurious correlations (although my google-fu was not strong enough to find it). I suspected that the correlation between these datasets was spurious, so I decided to look at different smoothing lengths. These look like this:
Figure 2. Cross-correlations of raw and smoothed UAH MSU Lower Tropospheric Temperature change (∆T) and Mauna Loa CO2 change (∆CO2). Smoothing is done with a Gaussian average, with a “Full Width at Half Maximum” (FWHM) width as given in the legend. Black circles show peak correlation for various smoothing widths. As above, a “0 month” average shows the lagged correlations of the raw data itself.
Note what happens as the smoothing filter width is increased. What start out as separate tiny peaks at about 3-5 and 11-14 months end up being combined into a single large peak at around nine months. Note also how the lag of the peak correlation changes as the smoothing window is widened. It starts with a lag of about 4 months (purple and blue 2 month and 6 month smoothing lines). As the smoothing window increases, the lag increases as well, all the way up to 17 months for the 48 month smoothing. Which one is correct, if any?
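The width scan described above can be sketched in R as follows. This is an illustration only: the helper names are invented, the lag range of 0 to 24 months is an assumption, and the two series are random stand-ins, so the peak lags it prints say nothing about temperature or CO2. Rerunning it with a different seed shows how freely the "peak" wanders as the smoothing changes.

```r
# Centred Gaussian average with the given FWHM; 0 means "leave the data raw".
gaussian_smooth <- function(x, fwhm) {
  if (fwhm == 0) return(x)
  sigma <- fwhm / (2 * sqrt(2 * log(2)))
  half  <- ceiling(3 * sigma)              # assumption: truncate at about 3 sigma
  w <- dnorm(-half:half, sd = sigma)
  as.numeric(stats::filter(x, w / sum(w), sides = 2))
}

# Correlation of x[t] with y[t + k] for each non-negative lag k.
lagged_cor <- function(x, y, lags) {
  n <- length(x)
  sapply(lags, function(k)
    cor(x[1:(n - k)], y[(1 + k):n], use = "complete.obs"))
}

set.seed(3)
dA <- diff(as.numeric(arima.sim(list(ar = 0.9), n = 400)))  # random stand-in
dB <- diff(as.numeric(arima.sim(list(ar = 0.9), n = 400)))  # random stand-in

lags   <- 0:24                    # assumed lag range, in months
widths <- c(0, 2, 6, 12, 48)      # FWHM values, in months

# One row per smoothing width: where the peak is and how big it is.
peaks <- do.call(rbind, lapply(widths, function(f) {
  cc <- lagged_cor(gaussian_smooth(dA, f), gaussian_smooth(dB, f), lags)
  data.frame(fwhm = f, peak_lag = lags[which.max(cc)], peak_cor = max(cc))
}))
print(peaks)
```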
To investigate what happens with random noise, I constructed a pair of series with similar autoregressions, and I looked at the lagged correlations. The original dataset is positively autocorrelated (sometimes called “red” noise). In general, the change (∆T or ∆CO2) in a positively autocorrelated dataset is negatively autocorrelated (sometimes called “blue noise”). Since the data under investigation is blue, I used blue random noise with the same negative autocorrelation for my test of random data. However, the exact choice is immaterial to the smoothing issue.
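The red-to-blue flip is easy to check in R. The sketch below assumes an AR(1) process with coefficient 0.5 as the "red" series; that value is picked purely for illustration and is not fitted to the temperature or CO2 data.

```r
set.seed(7)

# Red noise: AR(1) with positive lag-1 autocorrelation (assumed coefficient 0.5).
red  <- as.numeric(arima.sim(list(ar = 0.5), n = 5000))

# Blue noise: the step-to-step change of the red series.
blue <- diff(red)

# Lag-1 autocorrelation; element 1 of $acf is lag 0, element 2 is lag 1.
lag1 <- function(x) acf(x, lag.max = 1, plot = FALSE)$acf[2]

red_r1  <- lag1(red)    # positive, near 0.5
blue_r1 <- lag1(blue)   # negative; for AR(1) the theory gives -(1 - 0.5) / 2 = -0.25
```

A pair of independent blue series for testing can be made by running the first two steps twice.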
This was my first result using random data:
Figure 3. Cross-correlations of raw and smoothed random (blue noise) datasets. Smoothing is done with a Gaussian average, with a “Full Width at Half Maximum” (FWHM) width as given in the legend. Black circles show peak correlations for various smoothings.
Note that as the smoothing window increases in width, we see the same kind of changes we saw in the temperature/CO2 comparison. There appears to be a correlation between the smoothed random series, with a lag of about 7 months. In addition, as the smoothing window widens, the maximum point is pushed over, until it occurs at a lag which does not show any correlation in the raw data.
After making the first graph of the effect of smoothing width on random blue noise, I noticed that the curves were still rising on the right. So I graphed the correlations out to 60 months. This is the result:
Figure 4. Rescaling of Figure 3, showing the effect of lags out to 60 months.
Note how, once again, the smoothing (even for as short a period as six months, green line) converts a nondescript region (say lag +30 to +60, right part of the graph) into a high correlation region, by the lumping together of individual peaks. Remember, this was just random blue noise; none of these represent real lagged relationships despite the high correlation.
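Because any single pair of random series can be a fluke, it helps to repeat the experiment many times. The R sketch below does that under assumed settings: blue noise made by differencing an AR(1) series with coefficient 0.5, 600 points per series, lags from 0 to 60, and 50 repetitions. All of those numbers and the helper names are illustrative choices, not the ones behind Figures 3 and 4.

```r
# Centred Gaussian average with the given FWHM; 0 means "leave the data raw".
gaussian_smooth <- function(x, fwhm) {
  if (fwhm == 0) return(x)
  sigma <- fwhm / (2 * sqrt(2 * log(2)))
  half  <- ceiling(3 * sigma)              # assumption: truncate at about 3 sigma
  w <- dnorm(-half:half, sd = sigma)
  as.numeric(stats::filter(x, w / sum(w), sides = 2))
}

# Correlation of x[t] with y[t + k] for each non-negative lag k.
lagged_cor <- function(x, y, lags) {
  n <- length(x)
  sapply(lags, function(k)
    cor(x[1:(n - k)], y[(1 + k):n], use = "complete.obs"))
}

# Blue noise: differences of an AR(1) series (assumed coefficient 0.5).
blue_noise <- function(n, ar = 0.5) {
  diff(as.numeric(arima.sim(list(ar = ar), n = n + 1)))
}

# Largest absolute lagged correlation between two INDEPENDENT blue series,
# both smoothed with the same FWHM.
peak_abs_cor <- function(fwhm, n = 600, lags = 0:60) {
  a <- gaussian_smooth(blue_noise(n), fwhm)
  b <- gaussian_smooth(blue_noise(n), fwhm)
  max(abs(lagged_cor(a, b, lags)))
}

set.seed(2008)
raw_peaks    <- replicate(50, peak_abs_cor(0))    # no smoothing
smooth_peaks <- replicate(50, peak_abs_cor(12))   # 12-month FWHM

c(raw = mean(raw_peaks), smoothed = mean(smooth_peaks))
```

The two series in each pair have nothing to do with each other, so every peak found here is spurious by construction; the comparison shows how much larger those spurious peaks get once the data are smoothed.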
My general conclusion from all of this is to avoid looking for lagged correlations in smoothed datasets: they’ll lie to you. I was surprised by the creation of apparent, but totally spurious, lagged correlations when the data is smoothed.
And for the $64,000 question … is the correlation found in the Macrae study valid, or spurious? I truly don’t know, although I strongly suspect that it is spurious. But how can we tell?
My best to everyone,