I missed this article from July, and it deserves wide distribution. Steve McIntyre writes (excerpts):
In 2012, the then much ballyhoo-ed Australian temperature reconstruction of Gergis et al 2012 mysteriously disappeared from Journal of Climate after being criticized at Climate Audit. Now, more than four years later, a successor article has finally been published. Gergis says that the only problem with the original article was a “typo” in a single word. Rather than “taking the easy way out” and simply correcting the “typo”, Gergis instead embarked on a program that ultimately involved nine rounds of revision, 21 individual reviews, two editors and took longer than the American involvement in World War II. However, rather than Gergis et al 2016 being an improvement on or confirmation of Gergis et al 2012, it is one of the most extraordinary examples of data torture (Wagenmakers, 2011, 2012) that any of us will ever witness.
The re-appearance of Gergis’ Journal of Climate article was accompanied by an untrue account at Conversation of the withdrawal/retraction of the 2012 version. Gergis’ fantasies and misrepresentations drew fulsome praise from the academics and other commenters at Conversation. Gergis named me personally as having stated in 2012 that there were “fundamental issues” with the article, claims which she (falsely) said were “incorrect” and supposedly initiated a “concerted smear campaign aimed at discrediting [their] science”. Their subsequent difficulty in publishing the article, a process that took over four years, seems to me to be as eloquent a confirmation of my original diagnosis as one could expect.
I’ve drafted up lengthy notes on Gergis’ false statements about the incident, in particular, about false claims by Gergis and Karoly that the original authors had independently discovered the original error “two days” before it was diagnosed at Climate Audit. These claims were disproven several years ago by emails provided in response to an FOI request. Gergis characterized the FOI requests as “an attempt to intimidate scientists and derail our efforts to do our job”, but they arose only because of the implausible claims by Gergis and Karoly to priority over Climate Audit.
Although not made clear in Gergis et al 2016 (to say the least), its screened network turns out to be identical to the Australasian reconstructions in PAGES2K (Nature 2013), while the reconstructions are nearly identical. PAGES2K was published in April 2013 and one cannot help but wonder at why it took more than three years and nine rounds of revision to publish something so similar.
In addition, one of the expectations of the PAGES2K program was that it would identify and expand available proxy data covering the past two millennia. In this respect, Gergis and the AUS2K working group failed miserably. The lack of progress from the AUS2K working group is both astonishing and dismal, a failure unreported in Gergis et al 2016 which purported to “evaluate the Aus2k working group’s regional consolidation of Australasian temperature proxies”.
Detrended and Non-detrended Screening
The following discussion of data torture in Gergis et al 2016 draws on my previous and similar criticism of data torture in PAGES2K.
Responding to then recent scandals in social psychology, Wagenmakers (2011 pdf, 2012 pdf) connected the scandals to academics tuning their analysis to obtain a “desired result”, which he classified as a form of “data torture”:
we discuss an uncomfortable fact that threatens the core of psychology’s academic enterprise: almost without exception, psychologists do not commit themselves to a method of data analysis before they see the actual data. It then becomes tempting to fine tune the analysis to the data in order to obtain a desired result—a procedure that invalidates the interpretation of the common statistical tests. The extent of the fine tuning varies widely across experiments and experimenters but is almost impossible for reviewers and readers to gauge…
Some researchers succumb to this temptation more easily than others, and from presented work it is often completely unclear to what degree the data were tortured to obtain the reported confession.
As I’ll show below, it is hard to contemplate a better example of data torture, as described by Wagenmakers, than Gergis et al 2016.
The controversy over Gergis et al, 2012 arose over ex post screening of data, a wildly popular technique among IPCC climate scientists, but one that I’ve strongly criticized over the years. Jeff Id and Lucia have also written lucidly on the topic (e.g. Lucia here and, in connection with Gergis et al, here). I had raised the issue in my first post on Gergis et al on May 31, 2012. Closely related statistical issues arise in other fields under different terminology, e.g. sample selection bias, conditioning on a post-treatment variable, endogenous selection bias. The potential bias of ex post screening seems absurdly trivial if one considers the example of a drug trial, but, for some reason, IPCC climate scientists continue to obtusely deny the bias. (As a caveat, objecting to the statistical bias of ex post screening does not entail that opposite results are themselves proven. I am making the narrow statistical point that biased methods should not be used.)
Despite the public obtuseness of climate scientists about the practice, shortly after my original criticism of Gergis et al 2012, Karoly privately recognized the bias associated with ex post screening as follows in an email to Neukom (June 7, 2012; FOI K,58):
If the selection is done on the proxies without detrending ie the full proxy records over the 20th century, then records with strong trends will be selected and that will effectively force a hockey stick result. Then Stephen Mcintyre criticism is valid. I think that it is really important to use detrended proxy data for the selection, and then choose proxies that exceed a threshold for correlations over the calibration period for either interannual variability or decadal variability for detrended data…The criticism that the selection process forces a hockey stick result will be valid if the trend is not excluded in the proxy selection step.
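The mechanism Karoly describes can be illustrated with a toy simulation. The sketch below is not the Gergis et al code or data: the proxies are synthetic AR(1) "red noise" containing no temperature signal at all, and the persistence (rho = 0.9), the correlation cutoff (r > 0.3), the 70-year calibration window and the purely linear "instrumental" target are all illustrative assumptions. Screening on undetrended correlation keeps only the noise series that happen to rise over the calibration window, so their average acquires a rising "blade" by construction.

```python
import random


def ols_slope(y):
    """Least-squares slope of y against the time index 0..n-1."""
    n = len(y)
    t_bar = (n - 1) / 2
    y_bar = sum(y) / n
    num = sum((t - t_bar) * (v - y_bar) for t, v in enumerate(y))
    den = sum((t - t_bar) ** 2 for t in range(n))
    return num / den


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def red_noise(n, rho, rng):
    """AR(1) noise with lag-one coefficient rho (an assumed proxy model)."""
    x = [rng.gauss(0, 1)]
    for _ in range(n - 1):
        x.append(rho * x[-1] + rng.gauss(0, 1))
    return x


def screened_composite(n_proxies=500, n_years=300, calib=70,
                       rho=0.9, r_min=0.3, seed=1):
    """Screen signal-free proxies on UNDETRENDED correlation with a trend."""
    rng = random.Random(seed)
    # Assumed target: a purely linear warming trend over the calibration years.
    target = [float(t) for t in range(calib)]
    kept = []
    for _ in range(n_proxies):
        proxy = red_noise(n_years, rho, rng)
        if pearson(proxy[-calib:], target) > r_min:
            kept.append(proxy)
    if not kept:
        raise ValueError("no proxy passed screening; loosen r_min")
    # Simple unweighted average of the surviving proxies.
    composite = [sum(p[t] for p in kept) / len(kept) for t in range(n_years)]
    return len(kept), composite


if __name__ == "__main__":
    n_kept, comp = screened_composite()
    print("proxies kept:", n_kept)
    print("slope before calibration window:", ols_slope(comp[:-70]))
    print("slope inside calibration window:", ols_slope(comp[-70:]))
```

Every retained series has a positive least-squares slope over the calibration window, so the composite's slope there is positive no matter what the proxies contain. That is the sense in which the selection step, rather than the data, produces the uptick.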
Gergis et al 2012 had purported to avoid this bias by screening on detrended data, even advertising this technique as a method of “avoid[ing] inflating the correlation coefficient”:
For predictor selection, both proxy climate and instrumental data were linearly detrended over the 1921-1990 period to avoid inflating the correlation coefficient due to the presence of the global warming signal present in the observed temperature record. Only records that were significantly (p<0.05) correlated with the detrended instrumental target over the 1921-1990 period were selected for analysis. This process identified 27 temperature-sensitive predictors for the SONDJF warm season.
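For readers who want to see what the stated procedure amounts to, here is a minimal sketch of detrended screening for a single proxy. It is an illustration, not the authors' script. The two synthetic series share nothing but a linear trend, the 70-year length mirrors the stated 1921-1990 window, and the critical t value of 1.995 for 68 degrees of freedom is taken from standard tables and ignores autocorrelation. All function names are hypothetical.

```python
import random

# Assumed two-sided 5% critical t value for 68 degrees of freedom (n = 70).
T_CRIT_68 = 1.995


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def linear_detrend(y):
    """Return residuals of y after removing its least-squares line."""
    n = len(y)
    t_bar = (n - 1) / 2
    y_bar = sum(y) / n
    slope = (sum((t - t_bar) * (v - y_bar) for t, v in enumerate(y))
             / sum((t - t_bar) ** 2 for t in range(n)))
    return [v - (y_bar + slope * (t - t_bar)) for t, v in enumerate(y)]


def r_crit(n, t_crit=T_CRIT_68):
    """Correlation needed for p < 0.05, assuming independent observations."""
    df = n - 2
    return t_crit / (t_crit ** 2 + df) ** 0.5


def make_pair(n=70, slope=0.1, seed=0):
    """Two series that share a trend but have independent noise."""
    rng = random.Random(seed)
    a = [slope * t + rng.gauss(0, 1) for t in range(n)]
    b = [slope * t + rng.gauss(0, 1) for t in range(n)]
    return a, b


if __name__ == "__main__":
    proxy, temperature = make_pair()
    raw = pearson(proxy, temperature)
    det = pearson(linear_detrend(proxy), linear_detrend(temperature))
    print("threshold r:", round(r_crit(70), 3))
    print("undetrended r:", raw, "passes:", abs(raw) > r_crit(70))
    print("detrended r:", det, "passes:", abs(det) > r_crit(70))
```

The undetrended correlation is dominated by the shared trend, which is the inflation that the quoted passage says the detrending step was meant to avoid.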
As is now well known, they didn’t actually perform the claimed calculation. Instead, they calculated correlation coefficients on undetrended data. This error was first reported by CA commenter Jean S on June 5, 2012 (here). Two hours later (nearly 2 a.m. Swiss time), Gergis coauthor Raphi Neukom notified Gergis and Karoly of the error (FOI 2G, page 77). Although Karoly later (falsely) claimed that his coauthors were unaware of the Climate Audit thread, emails obtained through FOI show that Gergis had sent an email to her coauthors (FOI 2G, page 17) drawing attention to the CA thread, that Karoly himself had written to Myles Allen (FOI 2K, page 11) about comments attributed to him on the thread (linking to the thread) and that Climate Audit and/or myself are mentioned in multiple other contemporary emails (FOI 2G).
When correlation coefficients were re-calculated according to the stated method, only a handful actually passed screening, a point reported at Climate Audit by Jean S on June 5 and written up by me as a head post on June 6. According to my calculations, only six of the 27 proxies in the G12 network passed detrended screening. On June 8 (FOI 2G, page 112), Neukom reported to Karoly and Gergis that eight proxies passed detrended screening (with the difference between his results and mine perhaps due to drawing from the prescreened network or to difference in algorithm) and sent them a figure (not presently available) comparing the reported reconstruction with the reconstruction using the stated method:
Dashed reconstruction below is using only the 8 proxies that pass detrended screening. solid is our original one.
This figure was unfortunately not included in the FOI response. It would be extremely interesting to see.
As more people online began to be aware of the error, senior author Karoly decided that they needed to notify Journal of Climate. Gergis notified the journal of a “data processing error” on June 8 and their editor, John Chiang, immediately rescinded acceptance of the paper the following day as follows, stating his understanding that they would redo the analysis to conform with their described methodology:
After consulting with the Chief Editor regarding your situation, my decision is to rescind the acceptance of your manuscript for publication. My understanding is that you will be redoing your analysis to conform to your original description of the predictor selection, in which case you may arrive at a different conclusion from your original manuscript. Given this, I request that you withdraw the manuscript from consideration.
Contrary to her recent story at Conversation, Gergis tried to avoid redoing the analysis. Instead, she tried to persuade the editor that the error was purely semantic (“error in words”), rather than a programming error, invoking support for undetrended screening from Michael Mann, who was egging Gergis on behind the scenes:
Just to clarify, there was an error in the words describing the proxy selection method and not flaws in the entire analysis as suggested by amateur climate skeptic bloggers…People have argued that detrending proxy records when reconstructing temperature is in fact undesirable (see two papers attached provided courtesy of Professor Michael Mann).
The Journal of Climate editors were unpersuaded and pointedly asked Gergis to explain the difference between the first email in which the error was described as a programming error and the second email describing the error as semantic:
Your latest email to John characterizes the error in your manuscript as one of wording. But this differs from the characterization you made in the email you sent reporting the error. In that email (dated June 7) you described it as “an unfortunate data processing error,” suggesting that you had intended to detrend the data. That would mean that the issue was not with the wording but rather with the execution of the intended methodology. Would you please explain why your two emails give different impressions of the nature of the error?
Gergis tried to deflect the question. She continued to try to persuade the Journal of Climate to acquiesce in her changing the description of the methodology, as opposed to redoing the analysis with the described methodology, offering only to describe the differences in a short note in the Supplementary Information:
The message sent on 8 June was a quick response when we realised there was an inconsistency between the proxy selection method described in the paper and actually used. The email was sent in haste as we wanted to alert you to the issue immediately given the paper was being prepared for typesetting. Now that we have had more time to extensively liaise with colleagues and review the existing research literature on the topic, there are reasons why detrending prior to proxy selection may not be appropriate. The differences between the two methods will be described in the supplementary material, as outlined in my email dated 14 June. As such, the changes in the manuscript are likely to be small, with details of the alternative proxy selection method outlined in the supplementary material.
The Journal of Climate editor resisted, but reluctantly gave Gergis a short window of time (to end July 2012) to revise the article, while requiring that she directly address the sensitivity of the reconstruction to proxy selection method and “demonstrate the robustness” of her conclusions:
In the revision, I strongly recommend that the issue regarding the sensitivity of the climate reconstruction to the choice of proxy selection method (detrend or no detrend) be addressed. My understanding that this is what you plan to do, and this is a good opportunity to demonstrate the robustness of your conclusions.
Chiang’s offer was very generous under the circumstances. Gergis grasped at this opportunity and promised to revert by July 27 with a revised article showing the influence of this decision on resultant reconstructions:
Our team would be very pleased to submit a revised manuscript on or before the 27 July 2012 for reconsideration by the reviewers. As you have recommended below, we will extensively address proxy selection based on detrended and non detrended data and the influence on the resultant reconstructions.
Torturing and Waterboarding the Data
In the second half of 2012, Gergis and coauthors embarked on a remarkable program of data torture in order to salvage a network of approximately 27 proxies, while still supposedly using “detrended” screening. Their eventual technique for ex post screening bore no resemblance to the simplistic screening of (say) Mann and Jones, 2003.
One of their key data torture techniques was to compare proxy data correlations not simply to temperatures in the same year, but to temperatures of the preceding year and following year.
To account for proxies with seasonal definitions other than the target SONDJF season (e.g., calendar year averages), the comparisons were performed using lags of -1, 0, and +1 years for each proxy (Appendix A).
This mainly impacted tree ring proxies. In their practice, a lag of -1 year meant that a tree ring series is assigned one year earlier than the chronology (+1 is assigned one year later.) For a series with a lag of -1 year (e.g. Celery Top East), ring width in the summer of (say) 1989-90 is said to correlate with summer temperatures of the previous year. There is precedent for correlation to previous year temperatures in specialist studies. For example, Brookhouse et al (2008) (abstract here) says that the Baw Baw tree ring data (a Gergis proxy), correlates positively with spring temperatures from the preceding year. In this case, however, Gergis assigned zero lag to this series, as well as a negative orientation.
The lag of +1 years assigned to 5 sites is very hard to interpret in physical terms. Such a lag requires that (for example) Mangawhera ring widths assigned to the summer of 1989-1990 correlate to temperatures of the following summer (1990-1991) – ring widths in effect acting as a predictor of next year’s temperature. Gergis’ supposed justification in the text was nothing more than armwaving, but the referees do not seem to have cared.
Of the 19 tree ring series in the 51-series G16 network, an (unphysical) +1 lag was assigned to five series, a -1 lag to two series and a 0 lag to seven series, with five series being screened out. Of the seven series with 0 lag, two had inverse orientation in the PAGES2K. In detail, there is little consistency for trees and sites of the same species. For example, New Zealand LIBI composite-1 had a +1 lag, while New Zealand LIBI composite-2 had 0 lag. Another LIBI series (Urewara) is assigned an inverse orientation in the (identical) PAGES2K and thus presumably in the CPS version of G16. Two LIBI series (Takapari and Flanagan’s Hut) are screened out in G16, though Takapari was included in G12. Because the assignment of lags is nothing more than an ad hoc after-the-fact attempt to rescue the network, it is impossible to assign meaning to the results.
In addition, Gergis also borrowed from and expanded a data torture technique pioneered in Mann et al 2008. Mann et al 2008 had been dissatisfied with the number of proxies passing a screening test based on correlation to local gridcell, a commonly used criterion (e.g. Mann and Jones 2003). So Mann instead compared results to the two “nearest” gridcells, picking the highest of the two correlations but without modifying the significance test to reflect the “pick two” procedure. (See here for a contemporary discussion.) Instead of comparing only to the two nearest gridcells, Gergis expanded the comparison to all gridcells “within 500 km of the proxy’s location”, a technique which permitted comparisons to 2-6 gridcells depending both on the latitude and the closeness of the proxy to the edge of its gridcell:
As detailed in appendix A, only records that were significantly (p < 0.05) correlated with temperature variations in at least one grid cell within 500 km of the proxy’s location over the 1931-90 period were selected for further analysis.
As described in the article, both factors were crossed in the G16 comparisons. Multiplying three lags by 2-6 gridcells, Gergis appears to have made 6-18 detrended comparisons, retaining those proxies for which there was a “statistically significant” correlation. It doesn’t appear that any allowance was made in the benchmark for the multiplicity of tests. In any event, using this “detrended” comparison, they managed to arrive at a network of 28 proxies, one more than the network of Gergis et al 2012. Most of the longer proxies are the same in both networks, with a shuffling of about seven shorter proxies. No ice core data is included in the revised network and only one short speleothem. It consists almost entirely of tree ring and coral data.
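The multiplicity point is simple arithmetic and can be checked directly. The sketch below assumes, purely for illustration, that the 6 to 18 comparisons are statistically independent. They are not, since neighbouring gridcells and adjacent lags are correlated, so the figures are an upper bound on the false-positive rate rather than an estimate for the actual network. The function names are hypothetical.

```python
def familywise_rate(alpha, k):
    """Chance that at least one of k independent tests passes at level alpha
    when the proxy has no real relationship to temperature."""
    return 1 - (1 - alpha) ** k


def sidak_alpha(alpha, k):
    """Per-test level that would hold the overall rate at alpha (Sidak)."""
    return 1 - (1 - alpha) ** (1 / k)


if __name__ == "__main__":
    lags = 3  # lags of -1, 0 and +1 years
    for gridcells in (2, 6):  # range of gridcells within 500 km
        k = lags * gridcells
        print(k, "tests:",
              "overall rate", round(familywise_rate(0.05, k), 3),
              "| per-test level for an overall 5%", round(sidak_alpha(0.05, k), 4))
```

Under the independence assumption, a signal-free proxy has roughly a 26% chance of passing with 6 comparisons and roughly a 60% chance with 18, against the nominal 5%. Holding the overall rate at 5% would require a far stricter per-test threshold than p < 0.05.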
Obviously, Gergis et al’s original data analysis plan did not include a baroque screening procedure. It is evident that they concocted this bizarre screening procedure in order to populate the screened network with a similar number of proxies to Gergis et al 2012 (28 versus 27) and to obtain a reconstruction that looked like the original reconstruction, rather than the divergent version that they did not report. Who knows how many permutations, combinations and iterations were tested before they eventually settled on the final screening technique?
It is impossible to contemplate a clearer example of “data torture” (Mann et al 2008 included).
Nor does this fully exhaust the elements of data torture in the study, as torture techniques previously used in Gergis et al 2012 were carried forward to Gergis et al 2016. Using original and (still) mostly unarchived measurement data, Gergis et al 2012 had re-calculated all tree ring chronologies, except two, using an opaque method developed by the University of East Anglia. The two exceptions were the two long tree ring chronologies reaching back to the medieval period:
All tree ring chronologies were developed based on raw measurements using the signal-free detrending method (Melvin et al., 2007; Melvin and Briffa, 2008) …The only exceptions to this signal-free tree ring detrending method was the New Zealand Silver Pine tree ring composite (Oroko Swamp and Ahaura), which contains logging disturbance after 1957 (D’Arrigo et al., 1998; Cook et al., 2002a; Cook et al., 2006) and the Mount Read Huon Pine chronology from Tasmania which is a complex assemblage of material derived from living trees and sub-fossil material. For consistency with published results, we use the final temperature reconstructions provided by the original authors that includes disturbance-corrected data for the Silver Pine record and Regional Curve Standardisation for the complex age structure of the wood used to develop the Mount Read temperature reconstruction (E. Cook, personal communication, Cook et al., 2006).
This raises the obvious question why “consistency with published results” is an overriding concern for Mt Read and Oroko, but not for the other series, which also have published results. For example, Allen et al (2001), the reference for Celery Top East, shows the chronology at left for Blue Tier, while Gergis et al 2016 used the chronology at right for a combination of Blue Tier and a nearby site. Using East Anglia techniques, the chronology showed a sharp increase in the 20th century and “consistency” with the results shown in Allen et al (2001) was not a concern of the authors. One presumes that Gergis et al had done similar calculations for Mount Read and Oroko, but had decided not to use them. One can hardly avoid wondering whether the discarded calculations didn’t emphasize the desired story.
Nor is this the only ad hoc selection involving these two important proxies. Gergis et al said that their proxy inventory was a 62-series subset taken from the inventory of Neukom and Gergis, 2011. (I have been unable to exactly reconcile this number and no list of 62 series is given in Gergis et al 2016.) They then excluded records that “were still in development at the time of the analysis” (though elsewhere they say that the dataset was frozen as of July 2011 due to the “complexity of the extensive multivariate analysis”) or “with an issue identified in the literature or through personal communication”:
Of the resulting 62 records we also exclude records that were still in development at the time of the analysis .. and records with an issue identified in the literature or through personal communication
However, this criterion was applied inconsistently. Gergis et al acknowledge that the Oroko site was impacted by “logging disturbance after 1957” – a clear example of an “issue identified in the literature” but used the data nonetheless. In some popular Oroko versions (see CA discussion here), proxy data after 1957 was even replaced by instrumental data. Gergis et al 2016 added a discussion of this problem, arm-waving that the splicing of instrumental data into the proxy record didn’t matter:
Note that the instrumental data used to replace the disturbance-affected period from 1957 in the silver pine [Oroko] tree-ring record may have influenced proxy screening and calibration procedures for this record. However, given that our reconstructions show skill in the early verification interval, which is outside the disturbed period, and our uncertainty estimates include proxy resampling (detailed below), we argue that this irregularity in the silver pine record does not bias our conclusions.
There’s a sort of blind man’s buff in Gergis’ analysis here, since it looks to me like G16 may have used an Oroko version which did not splice instrumental data. However, because no measurement data has ever been archived for Oroko and a key version only became available through inclusion in a Climategate email, it’s hard to sort out such details.
Gergis has received much credulous praise from academics at Conversation, but none of them appear to have taken the trouble to actually evaluate the article before praising it. Rather than the 2016 version being a confirmation of or improvement on the 2012 article, it constitutes as clear an example of data torture as one could ever wish. We know Gergis’ ex ante data analysis plan, because it was clearly stated in Gergis et al 2012. Unfortunately, they made a mistake in their computer script and were unable to replicate their results using the screening methodology described in Gergis et al 2012.
One wonders whether the editors and reviewers of Journal of Climate fully understood the extreme data torture that they were asked to approve. Clearly, there seems to have been some resistance from editors and reviewers – otherwise there would not have been nine rounds of revision and 21 reviews. Since the various rounds of review left the network unchanged even one iota from the network used in the PAGES2K reconstruction (April 2013), one can only assume that Gergis et al eventually wore out a reluctant Journal of Climate, who, after four years of submission and re-submission, finally acquiesced.
As noted above, Wagenmakers defined data torture as succumbing to the temptation to “fine tune the analysis to the data in order to obtain a desired result” and diagnosed the phenomenon as being particularly likely when authors do not “commit themselves to a method of data analysis before they see the actual data”. In this case, Gergis et al had, ironically, committed themselves to a method of data analysis not just privately, but in the text of an accepted article. They obviously didn’t like the results, however.
One can understand why Gergis felt relief at finally getting approval for such a tortured manuscript, but, at the same time, the problems were entirely of her own making. Gergis took particular umbrage at my original claim that there were “fundamental issues” with Gergis et al 2012, a claim that she called “incorrect”. But there is nothing “incorrect” about the actual criticism:
One of the underlying mysteries of Gergis-style analysis is how one of seemingly equivalent proxies can be “significant” while another isn’t. Unfortunately, these fundamental issues are never addressed in the “peer reviewed literature”.
This comment remains as valid today as it was in 2012.
In her Conversation article, Gergis claimed that her “team” discovered the errors in Gergis et al 2012 independently of and “two days” before the errors were reported at Climate Audit. These claims are untrue. They did not discover the errors “independently” of Climate Audit or before Climate Audit. I will review their appropriation of credit in a separate post.
It is Mannian splicing and attribution denial all over again, just to obtain a “hockey stick”.
This isn’t science, it is advocacy for “the cause”.
Read the full post here: https://climateaudit.org/2016/07/21/joelle-gergis-data-torturer/
It is gobsmacking that this sort of thing continues, and I salute Steve McIntyre for the patience and microscopic detail he had to wade through to sort out this complete failure of peer review – twice. The Journal of Climate should retract Gergis 2016.
Update: For those that wish to contact Joelle Gergis, and/or the Journal of Climate about this paper, here are the respective web pages. If you do, please be factual and respectful; nothing is accomplished by boorish behavior.