The Pitfalls of Data Smoothing

Guest Post by Willis Eschenbach

Since we’ve been discussing smoothing in datasets, I thought I’d repost something that Steve McIntyre had graciously allowed me to post on his amazing blog ClimateAudit back in 2008. Let me start by saying that when I got involved in climate science, the go-to blog was the late, great John Daly’s blog, “Still Waiting for Greenhouse”. Sadly, when I went today to get the URL, I got the “Account Suspended” message … Yikes! That was an international treasure trove of climate history! Can we reverse that? Or are we at the mercy of the Wayback Machine? Does his archive exist, and can a host for it be found? [UPDATE: thanks to quick action by Ric Werme, John Daly’s site is back up again, and plans are afoot to move it to a more secure location. -w.]

Figure 1. Lagged cross-correlations of CO2 and temperature between smoothed (brown line) and unsmoothed (red line) data.

In any case, after John’s death in January 2004, there was a void for about a year, and then Steve McIntyre founded ClimateAudit. From the start his work has been thorough, transparent, and fascinating. To this day he continues to find staggering gaps in the published claims and papers.

In addition to serving as an exemplar of honest, clear, transparent science, there are several other things that I am grateful to Steve for. One is that after I had been commenting on his blog for some time, Steve offered me space on ClimateAudit as a guest author. What follows below is one of those pieces.

A second thing I appreciate about Steve is that he pushed me repeatedly to get off of Excel and move to the computer language “R”. R is far and away the finest computer language I’ve ever learned, and I’m hardly a novice. My first computer language was Algol in 1963. Since then I’ve learned Fortran, LISP/LOGO, C, C++, several dialects of Basic (including Visual Basic), Pascal, Hypertalk, Datacom, assembly language, VectorScript, and a couple of the Mathematica languages. I could not have done a tenth of the work I’ve done using any other language except Mathematica, and the learning curve for Mathematica was so steep I got nosebleeds. Plus R is free, friends, free, and it’s cross-platform, and it has hosts and hosts of packages for all kinds of special purposes. I can only pass on Steve’s excellent advice: learn R, you won’t regret it. Let me digress and give a quick example of just one of the many reasons why R is superior. You’re welcome to skip the next couple of paragraphs if you desire.

Suppose we have a block of data called ClimateData. It has columns of measurements of temperature, pressure, and the like. The rows represent times, perhaps months. Let’s say we want to add 3 to all of the data. In almost all computer languages, you have to loop repeatedly through the data to do that. The “pseudocode” to add 3 would look like this, with comments indicated by the hashmark “#”:

```
nrows = RowCount(ClimateData)                  # get the number of rows
ncols = ColumnCount(ClimateData)               # get the number of columns
for myRow varying from 1 to nrows              # step through each row
    for myColumn varying from 1 to ncols       # step through each column
        ClimateData[myRow, myColumn] = ClimateData[myRow, myColumn] + 3   # do the actual work
    next myColumn                              # end of the inner loop
next myRow                                     # end of the outer loop
```

Now, compare all of that opportunity for hidden errors with the corresponding actual code to do the same thing in R:

`ClimateData = ClimateData + 3`

Finally, I acknowledge Steve McIntyre for being my guide to the elusive process of becoming more Canadian, that is to say less excitable, more reserved in speech, and not letting my blood get all angrified by the actions of less-than-well-meaning anonymous internet chuckleheads. Despite good intentions I make slow progress in that regard, I fear. Still, progress continues, however slow; I figure I’ll be eligible for Canadian citizenship sometime before my age hits triple digits … In any case, here’s my 2008 post on the question of smoothing and correlation.

—————————————————————————————–

Data Smoothing and Spurious Correlation

Allan Macrae has posted an interesting study at ICECAP. In the study he argues that the changes in temperature (tropospheric and surface) precede the changes in atmospheric CO2 by nine months. Thus, he says, CO2 cannot be the source of the changes in temperature, because it follows those changes.

Being a curious and generally disbelieving sort of fellow, I thought I’d take a look to see if his claims were true. I got the three datasets (CO2, tropospheric, and surface temperatures), and I have posted them up here. These show the actual data, not the month-to-month changes.

In the Macrae study, he used smoothed datasets (12-month average) of the month-to-month change in temperature (∆T) and CO2 (∆CO2) to establish the lag between the change in CO2 and temperature. Accordingly, I did the same. [My initial graph of the raw and smoothed data is shown above as Figure 1; I repeat it here with the original caption.]

Figure 1. Cross-correlations of raw and 12-month smoothed UAH MSU Lower Tropospheric Temperature change (∆T) and Mauna Loa CO2 change (∆CO2). Smoothing is done with a Gaussian average, with a “Full Width to Half Maximum” (FWHM) width of 12 months (brown line). Red line is correlation of raw unsmoothed data (referred to as a “0 month average”). Black circle shows peak correlation.
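The procedure behind these figures can be sketched in code. The analysis itself was done in R, but here is a minimal illustration in Python/NumPy; `gaussian_smooth` and `lagged_corr` are my own hypothetical helpers, and synthetic random series stand in for the actual UAH and Mauna Loa data.

```python
import numpy as np

def gaussian_smooth(x, fwhm):
    """Smooth a series with a Gaussian kernel of the given FWHM (in samples)."""
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))  # convert FWHM to std dev
    half = int(np.ceil(4 * sigma))
    kernel = np.exp(-0.5 * (np.arange(-half, half + 1) / sigma) ** 2)
    kernel /= kernel.sum()
    return np.convolve(x, kernel, mode="same")

def lagged_corr(x, y, lag):
    """Pearson correlation of x against y delayed by `lag` samples."""
    if lag > 0:
        x, y = x[:-lag], y[lag:]
    elif lag < 0:
        x, y = x[-lag:], y[:lag]
    return np.corrcoef(x, y)[0, 1]

# Synthetic stand-ins: two independent "blue" (negatively autocorrelated) series
rng = np.random.default_rng(0)
dT = np.diff(rng.standard_normal(361))    # placeholder for the actual ∆T series
dCO2 = np.diff(rng.standard_normal(361))  # placeholder for the actual ∆CO2 series

smooth_T = gaussian_smooth(dT, 12)        # 12-month FWHM, as in Figure 1
smooth_CO2 = gaussian_smooth(dCO2, 12)
corrs = {lag: lagged_corr(smooth_T, smooth_CO2, lag) for lag in range(-24, 25)}
```

Scanning `corrs` for its maximum is the black-circle step in the figures; with real data you would compute the raw-series correlations (skipping `gaussian_smooth`) alongside for comparison.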

At first glance, this seemed to confirm his study. The smoothed datasets do indeed have a strong correlation of about 0.6 with a lag of nine months (indicated by the black circle). However, I didn’t like the looks of the averaged data. The cycle looked artificial. And more to the point, I didn’t see anything resembling a correlation at a lag of nine months in the unsmoothed data.

Normally, if there is indeed a correlation that involves a lag, the unsmoothed data will show that correlation, although it will usually be stronger when the data is smoothed. In addition, there will be a correlation on either side of the peak which is somewhat smaller than at the peak. So if there is a peak at, say, 9 months in the unsmoothed data, there will be positive (but smaller) correlations at 8 and 10 months. However, in this case the unsmoothed data shows a negative correlation at 7, 8, and 9 months lag.

Now Steve McIntyre has posted somewhere about how averaging can actually create spurious correlations (although my google-fu was not strong enough to find it). I suspected that the correlation between these datasets was spurious, so I decided to look at different smoothing lengths. The results look like this:

Figure 2. Cross-correlations of raw and smoothed UAH MSU Lower Tropospheric Temperature change (∆T) and Mauna Loa CO2 change (∆CO2). Smoothing is done with a Gaussian average, with a “Full Width to Half Maximum” (FWHM) width as given in the legend. Black circles show peak correlation for various smoothing widths. As above, a “0 month” average shows the lagged correlations of the raw data itself.

Note what happens as the smoothing filter width is increased. What starts out as separate tiny peaks at about 3-5 and 11-14 months ends up combined into a single large peak at around nine months. Note also how the lag of the peak correlation changes as the smoothing window is widened. It starts at a lag of about 4 months (purple and blue 2-month and 6-month smoothing lines). As the smoothing window increases, the lag increases as well, all the way up to 17 months for the 48-month smoothing. Which one is correct, if any?

To investigate what happens with random noise, I constructed a pair of series with similar autoregressions, and I looked at the lagged correlations. The original dataset is positively autocorrelated (sometimes called “red” noise). In general, the change (∆T or ∆CO2) in a positively autocorrelated dataset is negatively autocorrelated (sometimes called “blue noise”). Since the data under investigation is blue, I used blue random noise with the same negative autocorrelation for my test of random data. However, the exact choice is immaterial to the smoothing issue.
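A pair of series like this is easy to construct. The sketch below (Python/NumPy, my own illustration, since the post doesn’t give the exact autoregression used) takes first differences of white noise, which have a lag-1 autocorrelation of exactly −0.5:

```python
import numpy as np

def blue_noise(n, rng):
    """First differences of white noise: a negatively autocorrelated series."""
    return np.diff(rng.standard_normal(n + 1))

def lag1_autocorr(x):
    """Sample autocorrelation at lag 1."""
    x = x - x.mean()
    return np.dot(x[:-1], x[1:]) / np.dot(x, x)

rng = np.random.default_rng(42)
a = blue_noise(100_000, rng)
b = blue_noise(100_000, rng)  # an independent second series for the lag tests

r1 = lag1_autocorr(a)  # sits near the theoretical value of -0.5
```

Smoothing `a` and `b` at various widths and scanning the lags then reproduces the spurious-peak behaviour shown in Figures 3 and 4.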

This was my first result using random data:

Figure 3. Cross-correlations of raw and smoothed random (blue noise) datasets. Smoothing is done with a Gaussian average, with a “Full Width to Half Maximum” (FWHM) width as given in the legend. Black circles show peak correlations for various smoothings.

Note that as the smoothing window increases in width, we see the same kind of changes we saw in the temperature/CO2 comparison. There appears to be a correlation between the smoothed random series, with a lag of about 7 months. In addition, as the smoothing window widens, the maximum point is pushed over, until it occurs at a lag which does not show any correlation in the raw data.

After making the first graph of the effect of smoothing width on random blue noise, I noticed that the curves were still rising on the right. So I graphed the correlations out to 60 months. This is the result:

Figure 4. Rescaling of Figure 3, showing the effect of lags out to 60 months.

Note how, once again, the smoothing (even for as short a period as six months, green line) converts a nondescript region (say lag +30 to +60, right part of the graph) into a high-correlation region by lumping together individual peaks. Remember, this is just random blue noise; none of these represent real lagged relationships, despite the high correlation.

My general conclusion from all of this is to avoid looking for lagged correlations in smoothed datasets; they’ll lie to you. I was surprised by the creation of apparent, but totally spurious, lagged correlations when the data is smoothed.

And for the \$64,000 question … is the correlation found in the Macrae study valid, or spurious? I truly don’t know, although I strongly suspect that it is spurious. But how can we tell?

My best to everyone,

w.

Phil

Correlation is not causation. Thanks, Willis

Geoff Sherrington

This has some similarities to an essay I wrote a few years ago when the first BEST results were made public. I was concerned by the large correlation coefficients between temperatures at large station separations; the graph is in the link below.
It started off with a bit of geostatistics, a sub-discipline that I think needs more examination for context in climate work. It deals a lot with lagged data and correlations.
For simplicity, I started with a single station and then lagged various blocks of Temperature data from daily to monthly to annual, separating Tmax from Tmin, showing that at this station (Melbourne BoM Central) they had different behaviour.
A four-part series was intended, but the first part (here) drifted off because there was too much noise in the data.
I’d really appreciate some feedback as I know Willis would also, because as you take these concepts further they end up interacting with procedures like gridding, interpolating, contouring, map making etc. I think that we have a current case in Australia where maps showing Australia temperature as a whole have some bad looks about them and some headlines that might not be supportable.
I will have to learn the R language. I started with machine language in 1969.
http://www.geoffstuff.com/GHS%20on%20chasing%20R%2c%20extended.pdf

geologyJim

Willis – I always enjoy your contributions. Particular thanks this time for noting the exemplar of McIntyre to thoroughness and the gentlemanly art of polite disagreement.
Both of you have the gift of droll wit, pointed irony, and damnation with faint praise.
From my geological perspective, I can only say that “Because the world has not gone to ruin in the past, it is highly unlikely to do so in the future. Any belief to the contrary is an arrogance of human influence.”

Jeff L

This post leaves me with more questions than answers & my gut says something is wrong with the calculations here, although not enough information is provided to tear this apart.
“Since the data under investigation is blue”. Is it really? Did you look at the power spectrum & did it have increasing power density with increasing frequency? Very few signals in nature have this characteristic. This would surprise me, but since the original datasets & their associated power spectra aren’t presented, I really can’t say if this is right or not (this is critical to the rest of the thoughts below). So, I would love to see a plot of the original raw data & its power spectrum if you could add those to this post – that would certainly help clarify things. Next, is this the character of both the CO2 signal & the temp signal vs time? ( That would be even more surprising !! )
All that being said, if the data has a blue characteristic to it, a Gaussian filter will hammer the data. Remember that a Gaussian filter is basically a high-cut / low-pass filter. If the data is blue, then most of the energy is in the higher frequencies, so if we run a Gaussian filter over the data, we will remove most of the energy from the data (and that’s likely where the signal is – the rest may be just noise). So, again, looking at the original datasets, filtered & unfiltered, would be instructive & useful. If the data is blue, the filtered data is going to look like a very lazy & very flat signal compared to the unfiltered signal – is that in fact the case? As described, it should be – since most of the energy (amplitude) was in the higher frequencies – which you filtered out – so the remaining signal has very little amplitude at all & may only be the noise component of the dataset.
Which brings us to the next point – a proper cross-correlation of signals pre-conditions the signal by dividing through by the mean, but the mean has now been completely changed by the filtering. Just because you are getting a strong cross-correlation peak with the filtered data doesn’t mean anything now – as again, if the data is blue, you have basically removed the majority of the energy from the signal – all it is saying is that there is some sort of correlation in the low frequencies, which supposedly don’t have much energy in them to start with – it could just be showing you some non-random noise in the datasets.
Again, the way this is all presented, it leaves me with a whole lot more questions than answers. A re-post showing all the intermediate steps, with datasets vs time, associated power spectra, filtered data sets & spectra & ultimately the cross-correlations, both filtered & unfiltered would be a lot more instructive & would help answer your question :
” … is the correlation found in the Macrae study valid, or spurious? ”
I don’t think you even need to do the random data set if you can set forth the above plots – it should be pretty obvious whether it is valid or not & exactly what the physical meaning of the cross-correlations are (both filtered & unfiltered).
BTW, thanks for the tip on R – I will be looking into that!

Bob Koss

Willis,
About a week ago I noticed John Daly’s site was suspended. I inquired at Jo Nova’s site and she said it had been down for about a week already at that time. John’s wife passed away last year. So it may be down permanently. Jo inquired of someone in the area who is trying to get more information, but she hadn’t heard back. I’m with you, it would be a shame to have John’s site gone permanently, but John passed in 2004 and eventually all good things come to an end.

AJ

“Since then I’ve learned… several dialects of Basic including… Assembly Language…”
I truly hope there was an editing problem here. Actually, probably should say “… and several others including Basic… Assembly Language…”
[Thanks, clarified I think. -w.]

Jon

Who let John Daly become suspended?

Steven Mosher

nice.
for those wanting to learn R. get Rstudio.
subscribe to the R list.

Richard Thal

Domain Name: JOHN-DALY.COM
Registrar: DNC HOLDINGS, INC.
Whois Server: whois.directnic.com
Referral URL: http://www.directnic.com
Name Server: DNS1.HRNOC.NET
Name Server: DNS2.HRNOC.NET
Status: clientDeleteProhibited
Status: clientTransferProhibited
Status: clientUpdateProhibited
Updated Date: 18-jul-2009
Creation Date: 06-apr-2001
Expiration Date: 06-apr-2014
>>> Last update of whois database: Sun, 31 Mar 2013 04:07:23 UTC <<<
Registrant:
Jerry Brennan
5 Craigmoor Terrace
Danbury, CT 06810
US
203 743 7899
Domain Name: JOHN-DALY.COM
Brennan, Jerry brennan@john-daly.com
5 Craigmoor Terrace
Danbury, CT 06810
US
203 743 7899
Technical Contact:
Brennan, Jerry brennan@john-daly.com
5 Craigmoor Terrace
Danbury, CT 06810
US
203 743 7899
Record last updated 03-20-2004 08:22:50 PM
Record expires on 04-06-2014
Record created on 04-06-2001
Domain servers in listed order:
DNS1.HRNOC.NET 216.120.225.19
DNS2.HRNOC.NET 216.120.238.254

_Jim

“Fortran, LISP/LOGO, C, C++, several dialects of Basic including Visual Basic, Pascal, Hypertalk, Datacom, Assembly Language, VectorScript, and a couple of the Mathematica languages.”
But no LabView; that, my friend, (LV) represents a paradigm shift …
BTW, for those interested, Mathics (*1), a Mathematica look-alike, “is a free, general-purpose computer algebra system featuring Mathematica-compatible syntax and functions”.
Mathics also offers an online calc engine; then there is Wolfram Alpha (*2), a “computational knowledge or answer engine developed by Wolfram Research”.
*1 – http://www.mathics.org/
*2 – http://www.wolframalpha.com/
.

Thanks, Willis. Most instructive.
And thanks for the “R” plug. I’ll add:
An Introduction to Statistics – R Tutorial: http://www.r-tutor.com/
The Blackboard » Learning R: http://rankexploits.com/musings/category/statistics/learning-r/

I believe William Briggs, professional statistician, counsels against smoothing data. Here is the link to his blog on this subject. http://wmbriggs.com/blog/?p=195

I have an idea why the smoothed data shows a correlation and a lag, and the unsmoothed does not.

There are annual cycles in global temperature and in CO2 level. The annual cycle in global temperature comes from the northern hemisphere having more land and less water than the southern hemisphere, so the northern hemisphere has greater seasonal variation in temperature. Global troposphere temperature probably peaks in August, when the northern hemisphere as a whole (land and sea, including temporarily ice-covered sea) is hottest. Or a little after northern hemisphere land temperature or maybe surface temperature peaks – the surface warms the troposphere, so the troposphere lags the surface – or at least lags land.

Also, seasons on northern hemisphere extratropical land affect that land’s production and capture of CO2. CO2 tends to peak in May, just before northern hemisphere vegetation gets busiest at converting CO2 to biomass.

As for lack of correlation in the unsmoothed data: I suspect the unsmoothed data has mainly short-term noisy or noise-like variations that the smoothing removes. I suspect that a spectrum analysis of the temperature and CO2 datasets will show most of the “AC content” at frequencies high enough for the smoothing to largely remove. And the short-term (few months or less) noise items and “noise-resembling signals” in one dataset are unlikely to have much all-same-lag correlation with each other, if any at all.

Wayne2

I thought that’s a pretty old rule: never analyze smoothed data. When you average over $n$ data points, you are causing correlation between each point and the $n-1$ others it was averaged with.

Willis Eschenbach

Steven Mosher says:
March 30, 2013 at 9:11 pm

nice.
for those wanting to learn R. get Rstudio.
subscribe to the R list.

Thanks, Mosh. Since I’d not heard of either one, let me add the links:
Rstudio I just took a look at that, very, very impressive. I’m migrating, at least I think so …
R list
I wasn’t clear which list you referred to, as the cite says there are four of them.
Regards, appreciated,
w.

I think that the breadth of features and ease of use of R can make it *too* simple for modelers and data analysts to achieve glib results from methods which they have not adequately analyzed.
The technology should perhaps be harder and more conducive to requiring careful thought about what is being done at each step.
Something like Haskell, which is a pure functional language and therefore very unforgiving of sloppy work, would be my preference.

Willis Eschenbach

_Jim says:
March 30, 2013 at 9:14 pm

” Fortran, LISP/LOGO, C, C++, several dialects of Basic including Visual Basic, Pascal, Hypertalk, Datacom, Assembly Language, VectorScript, and a couple of the Mathematica languages. ”
But no LabView; that, my friend, (LV) represents a paradigm shift …

True ‘dat … I played with it a little, never could afford the modules. I did like the paradigm, though. That kind of visual building-block programming was used as well in a database whose name now escapes me.
w.

John F. Hultquist

“Do not smooth times series, you hockey puck!”
3 in a series by William M. Briggs
Number I:
http://wmbriggs.com/blog/?p=195
Number II:
http://wmbriggs.com/blog/?p=86
Number III:
http://wmbriggs.com/blog/?p=735

I just realized something else: looking at smoothings of more than a year, the correlation time increases with smoothing time. I suspect the reason here is that for longer-term smoothing, annual cycles are smoothed out. When smoothing is Gaussian with FWHM of 9-12 months, CO2 lag is seasonal. With longer-term smoothing, the lag could increase due to the smoothing causing the correlation to concentrate more on longer-term correlations, such as with more lag when the (non-constant) positive feedbacks are greater.

Something else I noticed: the correlation curves for smoothing by 2 to 24 months appear to me to have a fair amount of symmetry about zero, both horizontally and vertically. I would expect seasonal variations to have a pair of correlation peaks, one leading and one lagging, 1 year apart – showing 1-year periodicity, rather than symmetry about the origin (zero-zero point). Or am I missing something? Perhaps temperature anomalies lasting a few months to a year have an effect on production and decomposition of biomass, causing biomass short-term accumulated decomposition to lag upward temperature anomalies by almost a year.

Something else I noted: Figure 4 shows positive correlation running high at longer correlation periods, when the two correlated datasets are random samples of “blue noise”. Is not “blue noise” something biased to higher-frequency spectral content? If random samples repeatedly show positive correlation towards longer periods of correlation, then I question the correlation method. Does the correlation method intrinsically have a bias to indicate positive correlation – even (and especially) for long lag periods and higher-frequency noise spectral content? Since Fig. 4 shows mostly positive correlation over all of the frequencies being considered, I would suspect the smoothing method has a bias to show positive correlation, especially at frequencies among the lower ones being considered.

By any chance, does the smoothing method use RMS calculations for smoothing, when calculations of averages instead of RMS could be what shows a type of random noise to be random?

MaxL

An interesting read on smoothing data:
http://wmbriggs.com/blog/?p=195

johanna

Slightly OT, but after reading this post I checked John Daly’s Wikipedia entry. What a shambles.
He gets less than this week’s reality TV nobody, and looking at the history of amendments, his entry has been a battleground for years even though he died in 2004.
I suppose that it’s a backhanded compliment (the Supreme Censor Connolly has been involved), but it’s just another reminder that Wiki is really useful for checking episode guides for your favourite TV show, but utterly unreliable when it comes to anything that is contested.

Geoff Sherrington

Shameless commercial plug – do read my little essay about 4 posts down from the top, because it reaches similar outcomes but without smoothing. It simply uses averaging, as in making days into weeks. And the process constructs artefacts from numbers. And people make these mistakes daily.

Mike McMillan

I learned Algol on the great god Burroughs B5500 back in 1967. Hollerith cards, overnight batch processing. The advanced Computer Science majors were using a new high-level language called BASIC.

@Willis: You’ll probably want to subscribe to -help and -announce.
cheers,
gary

Michel

For multi-variable correlation, use the software “Formulize”, available at http://www.nutonian.com (free for limited datasets; in earlier times it was totally free).
You can get amazing results as for example this: http://climate.mr-int.ch/NotesImages/Correlation_1.png which correlates observed monthly temperature anomalies (HADCRUT3) with Atlantic Multi Decadal Oscillations (AMO), El Niño-La Niña, transmitted solar radiation (which reveals volcanic eruptions almost as a Dirac impulse), CO2 atmospheric concentration, and solar spots. Caution: correlation does not necessarily imply causation!

wrt the delay from temperature to CO2:
There is a lot of noise in data for both temperature and CO2. However, the 1998 El Nino shows up quite clearly –
http://members.westnet.com.au/jonas1/CO2FocusOn1998.jpg
Temperature is RSS TLT Tropics Ocean for the given date.
CO2s are as at the given date, averaged over various stations in each of the 5 given regions, minus the same value as at 12 months earlier.
The delay from temperature to CO2 is clearly visible. Interestingly, there isn’t a large difference in travel times.
It’s easier to see if the CO2 data is smoothed –
http://members.westnet.com.au/jonas1/CO2FocusOn1998Smoothed.jpg
Is it OK to use smoothed data for this? It looks OK in this example, but as W shows, it’s best to check carefully, and to do proper calcs on the unsmoothed data if you’re using it for anything other than just seeing what it looks like.

PS. Tropic temperature is scaled in the 2 graphs for easy visual comparison. It isn’t smoothed.

Michel

Additional note to my previous post at 1:01 am: no smoothing was made prior to the correlation. But the Hadley dataset is anyway the result of data massaging to calculate global averages etc.

Silver Ralph

johanna says: March 31, 2013 at 12:09 am
Slightly OT, but after reading this post I checked John Daly’s Wikipedia entry. What a shambles.
____________________________
So why not update it? Unfortunately, I don’t know enough about him to do it myself, but surely someone here can tidy it up and explain things a bit more.
.

Greg Goodman

Willis, what you have discovered by this study is that “smoothers” don’t smooth, they corrupt.
Maybe you should have used a filter instead.
I say this because those who are using a “smoother” usually don’t even realise they are using a filter. They just want the data to look “smoother”. If they realised they needed to low-pass filter the data, they would realise they needed to design a filter or choose one based on some criterion. That would force them to decide what the criterion was and choose a filter that satisfies it.
Sadly, most times they just smooth and end up with crap.
This is one of my all-time biggest gripes about climate science: that they can not get beyond runny mean “smoothers”.
You have not shown that you should not filter data; what you have shown is that runny means are a crap filter. That’s why I call them runny mean filters. You use them and end up with crap everywhere.
The frequency response of the rectangular window used in a running mean is the sinc function. It has a zero ( the bit you want to filter out is bang on ) at pi and a negative lobe that peaks at pi*1.3317 ( tan(x)=x at 1.3317*pi if you were wondering ).
This means that it lets through stuff you imagined you “smoothed” away. Not only that, but it inverts it !!
Now guess what? 12 / 1.3317 = 8.97 BINGO
Your nine month correlation is right in the hole.
Now have a look at the data and the light 2m “smoother”. There is a peak either side and a negative around 8 months !! It is that 8m negative peak that is getting through the 12m smoother and being inverted.
Not only have you let through something you intended to remove , you turned it upside down and made a negative correlation into a positive one.
So Allan Macrae may (or may not) have found a true correlation, but if he did, its sign was probably inverted.
There was a similar article that got some applause here a while back, called something like “Don’t smooth, you hockey puck”, in which the author made similar claims based SOLELY on the problems of runny means. He totally failed to realise that the issue is not whether you filter but what filter you choose. But there again he was talking about “smoothers”, so he probably had not even realised the difference.
I emailed him explaining all this and got a polite but dismissive one-word reply: “thanks”.
I really ought to write this up formally and post it somewhere.
Bottom line: don’t smooth, filter. And if you don’t know how to filter either find out or get a job as a climate scientist 😉
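[Greg’s claim about the running mean’s frequency response can be checked numerically. A minimal sketch in Python (numpy assumed), using the exact Dirichlet form of the 12-point boxcar response, of which the continuous sinc is the approximation:]

```python
import numpy as np

# Frequency response of a 12-point running mean. For a boxcar of length N
# the response is the Dirichlet kernel
#     H(f) = sin(pi*f*N) / (N*sin(pi*f)),
# which has a zero at the 12-month period and a NEGATIVE side lobe
# (phase inversion) at a somewhat shorter period.
N = 12
f = np.linspace(1e-6, 0.5, 10001)          # frequency, cycles per month
H = np.sin(np.pi * f * N) / (N * np.sin(np.pi * f))

i12 = np.argmin(np.abs(f - 1 / 12))        # index of the 12-month period
print(abs(H[i12]) < 1e-3)                  # True: the 12-month cycle is removed

# First side lobe: the response between the zeros at 12 and 6 months
mask = (f > 1 / 12) & (f < 2 / 12)
f_lobe = f[mask][np.argmin(H[mask])]
print(H[mask].min() < 0)                   # True: that band comes through inverted
print(round(1 / f_lobe, 1))                # period of the inverted lobe, in months
```

[On this exact form the inverted lobe sits near 8.4 months rather than exactly 9, but the qualitative point stands: periods shorter than the window are not removed, and part of that band is passed with its sign flipped.]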

Greg Goodman

BTW there is +ve correlation in CO2 at about 3m, though 0.1 looks a bit low in terms of 95% confidence.
Of course the other problem is that he’s also starting with monthly averages, which are themselves sub-sampled 30-day running means. That’s two more data distortions: the mean, and then sub-sampling without a proper anti-alias filter.
With a method like that you’d be better off flipping a coin. There’s a better chance of getting the right answer.
And I kid you not, this is par for the course in climatology.
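[The sub-sampling point can be illustrated with a toy example in Python (numpy assumed; the 45-day cycle is invented purely for illustration): a cycle shorter than two months survives monthly averaging and reappears as a spurious slow oscillation.]

```python
import numpy as np

# A 45-day cycle sampled daily, then reduced to monthly means (a 30-day
# running mean sub-sampled every 30 days, with no anti-alias filter).
days = np.arange(360)                          # one year of 30-day "months"
x = np.sin(2 * np.pi * days / 45)              # true period: 1.5 months
monthly = x.reshape(12, 30).mean(axis=1)       # 12 monthly averages

# The 1.5-month cycle is above the monthly Nyquist limit (2 months),
# so it aliases: the spectrum of the monthly series peaks elsewhere.
spec = np.abs(np.fft.rfft(monthly))
k = int(spec[1:].argmax()) + 1                 # dominant frequency, cycles/year
print(12 / k)                                  # apparent period in months -> 3.0
```

[The 1.5-month cycle comes out looking like a clean 3-month cycle, and nothing in the monthly series warns you that the period is wrong.]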

kim

Nice discussion of sawtooth CO2 @ the end of that old thread.
========================

RERT

FWIW, I think the fact that temperature leads CO2 jumps out of the data.
Look here http://www.robles-thome.talktalk.net/carbontemp.pdf
This is just two charts: the twelve month change in atmospheric Carbon, and the twelve month change in temperature (HADCRUT3). These are the very noisy faint lines. The thick lines are the 12 month moving averages of each of these separately. Without doing any correlations, what leads what is very clear. My best fit is that temperature leads carbon by about 7 months.
There are no smoothed series being correlated here, so there can be no spurious correlations. I’ll read the article again more slowly to see if it shows some errors in my analysis.
In addition to the numbers, there is of course a good reason why temperature should lead CO2: the gas is less soluble in warmer water, so higher temp is (eventually) more CO2.
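[RERT’s differencing approach can be tested on synthetic data where the lag is known. A hedged sketch in Python (numpy assumed; the series, the 7-month lag, and the noise level are all invented for illustration, not taken from the real data):]

```python
import numpy as np

rng = np.random.default_rng(42)
n, lag_true = 600, 7                       # 50 years of months; CO2 trails T by 7
driver = rng.normal(size=n + lag_true)     # shared underlying signal
temp = driver[lag_true:]                   # temperature sees the signal first
co2 = 0.8 * driver[:n] + 0.3 * rng.normal(size=n)   # CO2 sees it 7 months later

# 12-month differences (RERT's "twelve month change"), no smoothing anywhere
dT = temp[12:] - temp[:-12]
dC = co2[12:] - co2[:-12]

# Lagged correlation of the *unsmoothed* differences recovers the true lag
m = len(dT)
r = [np.corrcoef(dT[: m - k], dC[k:])[0, 1] for k in range(13)]
print(int(np.argmax(r)))                   # best-fit lag in months -> 7
```

[With no smoothing in the pipeline, the recovered lag is the true one; repeating the exercise after running-mean smoothing both series is an instructive way to see how the filter changes the picture.]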

Bill Illis

The CO2 vs temperature lags are interesting.
But let’s remember CO2 has a seasonal cycle (which varies from location to location). It is tied to the vegetation growth and decay cycles which vary across the planet. It also moves across the planet with large-scale winds which also vary in time. CO2 also has a long-term exponentially increasing trend which should be taken into account.
Temperature, as well, has a seasonal cycle which varies from location to location. Normally we deal with anomalies that are adjusted for the known seasonal patterns, but both of these series have seasonal cycles which are offset from each other.
It’s hard to say CO2 lags X months behind temperature changes without properly accounting for all of these time series patterns.
If you are smoothing either of them improperly compared to their true seasonal and underlying increasing/decreasing trends, your X will not be the true one.
The Dangers of smoothing. (And if you are a climate scientist, a fabulous Opportunity to mislead, which is why nearly every climate science paper uses smoothed data ONLY. Reminds one of a recent Marcott and a recent Hansen paper).

DAV

RStudio is a step forward but Eclipse with the StatET add-on is more advanced. For example, multiple plot windows; ability to view multiple sections of code simultaneously; source code debugging with breakpoints; and views of variable space. Really great if you’re combining R with other languages such as C or Perl or Java. They can all be handled under Eclipse with appropriate add-ons.
Matt Briggs has a number of posts on the dangers inherent in smoothing, particularly when combined with prediction.
http://wmbriggs.com/blog/?s=smoothing&x=0&y=0
or just go to wmbriggs.com and search for “smoothing” if the above doesn’t work.

johanna

Silver Ralph says:
March 31, 2013 at 1:40 am
johanna says: March 31, 2013 at 12:09 am
Slightly OT, but after reading this post I checked John Daly’s Wikipedia entry. What a shambles.
____________________________
So why not update it? Unfortunately, I don’t know enough about him to do it myself, but surely someone here can tidy it up and explain things a bit more.
———————-
Ralph, people have been trying to do that for nearly a decade. That is my point.
Any attempt to write an objective account of John Daly’s work would immediately be jumped all over by the resident “rapid response team” on wikipedia.
I absolutely agree that someone who is young and wakeful and interested enough should take up the task. It is a worthy project.
As I am older, and need to husband my energy to what will get results (the 80/20 rule), this one is not for me. But, I will never forgive the bastards who sent, received, and subsequently acquiesced to (by silence) that awful email where they cheered John Daly’s death. That includes those who saw the first round of released emails, when it appeared, and said nothing.
Sorry, don’t have the reference at hand, but it is well known to Anthony and long term readers of WUWT.

Crispin in Waterloo but actually in Yogyakarta

MODS: I am willing to consider a hosting option for John Daly’s data mine.
Crispin

Steve Richards

I was taught to smooth data only prior to display for human consumption; all previous steps and calculations were performed on unfiltered data.
After all, the unknown signal we are looking for is in the original data; careless filtering/smoothing can lose or change these signals.

Pamela Gray

In its infancy, smoothing of brainwave patterns was also fraught with complications and could result in lost peaks that were valuable in calculating stimulus onset to peak and peak to peak measures. Worse, an industry standard was not set early on so it was difficult to compare results across studies completed by different labs. Climate science is still in its infancy and is hardly making gains to become anything other than an infant.

Jeff Alberts

Let me start by saying that when I got involved in climate science, the go-to blog was the late, great John Daly’s blog, “Still Waiting for Greenhouse”. Sadly, when I went today to get the URL, I got the “Account Suspended” message … Yikes! That was an international treasure trove of climate history! Can we reverse that? Or are we at the mercy of the Wayback Machine? Does his archive exist, and can a host for it be found?

I’d be happy to host it. I lease a dedicated Linux server and have plenty of space and bandwidth. No idea who I’d need to contact, so if anyone knows, my email is alberts dot jeff at gmail dot com.

Jeff L

Greg Goodman says:
March 31, 2013 at 2:21 am
“The frequency response of the rectangular window used in a running mean is the sinc function. It has a zero at pi (the bit you want to filter out is bang on it) and a negative lobe that peaks at pi*1.3371 (tan(x)=x at 1.3371*pi, if you were wondering).
This means that it lets through stuff you imagined you “smoothed” away. Not only that, but it inverts it!!
Now guess what? 12 / 1.3371 = 8.97. BINGO.
Your nine-month correlation is right in the hole.”
————————————————————————————
I think you may be onto something here. However, Willis states he used a Gaussian filter, implying a Gaussian operator / Gaussian weights were applied in the smoothing, which would get rid of the sinc-function / ringing / bleeding issues associated with a square-wave operator. Your assumption is that he basically used a square wave (no weights) in calculating the smoothing. Now, based on Willis’ results & your analysis, I think you might be on to something – that the actual filtering was a square wave & not a Gaussian filter as stated. So, once again, this raises more questions & increases my suspicion that there is something fundamentally wrong with the calculations presented here, as there are many inconsistencies. None of it really makes sense as presented. I would add to my list of what I would like to see the filter operator & its associated power spectrum.
Answering the question of “… is the correlation found in the Macrae study valid, or spurious?” should not be a very hard question to answer – it just needs a different analysis: plots of the raw data, the filter operator(s), the filtered data, the spectra of all of the above, & then the cross-correlations of both filtered & unfiltered data. If you could look at all of those together, anyone with some signal analysis background ought to be able to look at the plots & answer the question, quickly & definitively.
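[The two candidate filter operators have very different power spectra, which is Jeff L’s point. A hedged sketch in Python (numpy assumed; the 12-month-FWHM Gaussian is an assumption, since the kernel actually used is not shown in the thread):]

```python
import numpy as np

f = np.linspace(1e-6, 0.5, 5000)                 # frequency, cycles per month

# 12-point boxcar ("square wave", i.e. unweighted running mean)
box = np.sin(np.pi * f * 12) / (12 * np.sin(np.pi * f))

# Gaussian kernel with a 12-month FWHM, truncated at +/- 24 months
sigma = 12 / (2 * np.sqrt(2 * np.log(2)))        # FWHM -> sigma, ~5.1 months
t = np.arange(-24, 25)
g = np.exp(-t**2 / (2 * sigma**2))
g /= g.sum()                                     # unit gain at f = 0
gauss = np.array([np.sum(g * np.cos(2 * np.pi * fk * t)) for fk in f])

print(box.min() < -0.2)      # True: the boxcar has a sizeable NEGATIVE lobe
print(gauss.min() > -1e-3)   # True: the Gaussian response stays non-negative
```

[A negative response band means phase inversion; the Gaussian merely attenuates. Whichever filter was actually used, plotting exactly this for the real operator would settle the question.]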

John F. Hultquist

Greg Goodman says:
March 31, 2013 at 2:21 am

Yours is an intriguing comment. Anyone who can include crap, Bingo, and π in a few lines of text deserves a crack at a full-blown post. Set yourself down and have a go at getting your runny means and filtered points properly sorted out. I’ll suggest having a couple of others (Willis, Geoff S., ?) review it before posting. Why not ask Anthony if this would work for him, insofar as this is his site?

FrozenOut

As somebody involved professionally in the analysis of time series for over a decade, may I make a few points:
1, smoothing is of NO VALUE unless it is used to create a forecast; I don’t care what the “smooth trend” of past data is — the past data is the best presentation of the past data.
2, never-ever compute an auto-correlation function or cross-correlation function from data to which a process that induces auto-correlation has already been applied (i.e. from a smooth). The random errors of independent and identically distributed data are computable (or bootstrappable), and so the difference of your ACF or CCF from that expected for IID noise processes is also computable. Once you start throwing ad-hoc filters into the data, who knows how those errors are going to behave. Remember the window size of your filter is a degree of freedom that is being adjusted — are you using the standard error of that in your induced error covariance matrix?
3, there are so many ad-hoc smoothing windows thrown around because they make the data look “nice” to the analyst (see #1 above) that it makes one cringe.
Time series analysis was studied extensively by several excellent English statisticians. Kendall, Box and Jenkins made huge contributions. The Box-Jenkins book is really a gem. If you want to do any time-series analysis please read at least that — or Hamilton, for a more modern treatment. The Akaike Information Criterion (AICc) is an excellent tool to tune up Box-Jenkins style models to find the best approximating model for in-sample data. It is based upon a very well defined information-theoretic analysis of the estimation process.
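[FrozenOut’s second point is easy to demonstrate: pass IID noise through a running mean and the “data” acquires strong autocorrelation that was never there. A minimal sketch in Python (numpy assumed):]

```python
import numpy as np

rng = np.random.default_rng(1)
white = rng.normal(size=2000)                           # IID noise: no structure
smooth = np.convolve(white, np.ones(12) / 12, "valid")  # 12-point running mean

def acf1(x):
    """Lag-1 sample autocorrelation."""
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

print(abs(acf1(white)) < 0.1)   # True: raw noise, ACF within sampling error of zero
print(acf1(smooth) > 0.85)      # True: near 11/12, manufactured by the filter
```

[Any ACF or CCF computed after such a filter inherits this structure, which is exactly why the IID-based confidence bands no longer apply.]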

Mark T

I was taught to smooth data only prior to display for human consumption; all previous steps and calculations were performed on unfiltered data.

When you don’t know what the underlying structure is, yes. If you do, however, then there’s nothing wrong with the practice.

After all, the unknown signal we are looking for is in the original data; careless filtering/smoothing can lose or change these signals.

Exactly.
Mark

Willis Eschenbach

gary turner says:
March 31, 2013 at 12:55 am

@Willis: You’ll probably want to subscribe to -help and -announce.
cheers,
gary

Thanks, Gary, noted. Also thanks again to Mosh, Rstudio is awesome.
w.

Phil.

RERT says:
March 31, 2013 at 2:57 am
FWIW, I think the fact that temperature leads CO2 jumps out of the data.
Look here http://www.robles-thome.talktalk.net/carbontemp.pdf
This is just two charts: the twelve month change in atmospheric Carbon, and the twelve month change in temperature (HADCRUT3). These are the very noisy faint lines. The thick lines are the 12 month moving averages of each of these separately. Without doing any correlations, what leads what is very clear. My best fit is that temperature leads carbon by about 7 months.

What’s clear from that plot is that by the arbitrary shift of the CO2 axis by about -0.3% you’ve given the impression that the linear increase in CO2 independent of T doesn’t exist! What your graph actually shows is that CO2 increases steadily independently of temperature, with a superimposed modulation due to temperature. As far as the lag is concerned, you don’t say whether your data is global or not, but if so there’s a problem due to the differences between the hemispheres: the Arctic shows intra-annual fluctuations of ~10 ppm, Mauna Loa ~5 ppm, and the South Pole ~0 ppm.

Greg Goodman

Jeff L: “I think you may be onto something here. However, Willis states he used a Gaussian filter, implying a Gaussian operator / Gaussian weights were applied in the smoothing.”
Willis (article): “In the Macrae study, he used smoothed datasets (12 month average) of the month-to-month change in temperature (∆T) and CO2 (∆CO2) to establish the lag between the change in CO2 and temperature. Accordingly, I did the same.”
I read this to mean “running 12 month average”, since he is clearly still working with monthly data, not annual data, as would be the case if it were a plain (12 month average) as stated by Willis.
However, he does state later that it was done with Gaussian filters. So it appears that he was calling his 12m-FWHM Gaussian, which would be an average over 72 months of data, a “12 month average”. At least that’s the best I can make of it.
None of that goes against what I said about the problems with running means in general.
What seems rather odd about what is reported of the Macrae study is why anyone would look for a lag correlation of less than 12 months in data from which they have tried to remove all variations shorter than twelve months.
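[The width arithmetic behind Greg’s reading can be sketched in Python (numpy assumed; the ±3-sigma truncation point is an assumption, since the thread never shows the kernel Willis actually used):]

```python
import numpy as np

# How wide is a "12 month" Gaussian smoother really?
fwhm = 12.0
sigma = fwhm / (2 * np.sqrt(2 * np.log(2)))   # FWHM -> sigma: ~5.1 months

half = int(np.ceil(3 * sigma))                # keep weights out to +/- 3 sigma
t = np.arange(-half, half + 1)
w = np.exp(-t**2 / (2 * sigma**2))
w /= w.sum()                                  # normalise to unit gain at f = 0

print(round(sigma, 1))                        # 5.1
print(len(w))                                 # 33 months of data per output point
```

[So even a conservative ±3-sigma truncation draws on 33 months of data for each smoothed point; a kernel run out far enough to cover 72 months, as Greg estimates, is wider still.]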