Guest Essay by Kip Hansen
Preview: In this essay I discuss the efforts of various scientific bodies and individual scientists to regularize (that is, to bring into line with correct scientific procedure) the budding field of science investigating the effects of increasing atmospheric concentrations of CO2 on the oceans: their chemical make-up, including pH; the atmosphere/ocean carbon cycle; and what those changes might mean for ocean organisms over the next 100 years – a subject popularly known as Ocean Acidification (hereafter OA).
The 6 August 2015 issue of the journal Nature carried a highlight article under the subject heading Ocean Acidification entitled “Seawater studies come up short — Experiments fail to predict size of acidification’s impact.” (.pdf here)
The Nature highlight article, by Daniel Cressey (a full-time Nature reporter based in London), states:
“The United Nations has warned that ocean acidification could cost the global economy US$1 trillion per year by the end of the century, owing to losses in industries such as fisheries and tourism. Oyster fisheries in the United States are estimated to have already lost millions of dollars as a result of poor harvests, which can be partly blamed on ocean acidification.
The past decade has seen accelerated attempts to predict what these changes in pH will mean for the oceans’ denizens — in particular, through experiments that place organisms in water tanks that mimic future ocean-chemistry scenarios.
Yet according to a survey published last month by marine scientist Christopher Cornwall, who studies ocean acidification at the University of Western Australia in Crawley, and ecologist Catriona Hurd of the University of Tasmania in Hobart, Australia, most reports of such laboratory experiments either used inappropriate methods or did not report their methods properly.”
(all in reference to Cornwall and Hurd ICES J. Mar. Sci. http://dx.doi.org/10.1093/icesjms/fsv118 ; 2015 )
“Cornwall says that the “overwhelming evidence” from such studies of the negative effects of ocean acidification still stands. For example, more-acidic waters slow the growth and worsen the health of many species that build structures such as shells from calcium carbonate. But the pair’s discovery that many of the experiments are problematic makes it difficult to assess accurately the magnitude of effects of ocean acidification, and to combine results from individual experiments to build overall predictions for how the ecosystem as a whole will behave, he says.”
(Just to be clear, the two quotes above are from the Cressey Nature highlight.)
The paper by Cornwall and Hurd is a masterful piece of science of a type rarely seen in academia today (with a few exceptions to be discussed later). It investigated the experimental design of the current crop of papers in a scientific field and evaluated whether the study designs and analyses of results were appropriate to return scientifically meaningful results.
This paper was published in the International Council for the Exploration of the Sea (ICES) Journal of Marine Science. Here’s the abstract:
“Ocean acidification has been identified as a risk to marine ecosystems, and substantial scientific effort has been expended on investigating its effects, mostly in laboratory manipulation experiments. However, performing these manipulations correctly can be logistically difficult, and correctly designing experiments is complex, in part because of the rigorous requirements for manipulating and monitoring seawater carbonate chemistry.
To assess the use of appropriate experimental design in ocean acidification research, 465 studies published between 1993 and 2014 were surveyed, focusing on the methods used to replicate experimental units. The proportion of studies that had interdependent or non-randomly interspersed treatment replicates, or did not report sufficient methodological details was 95%. Furthermore, 21% of studies did not provide any details of experimental design, 17% of studies otherwise segregated all the replicates for one treatment in one space, 15% of studies replicated CO2 treatments in a way that made replicates more interdependent within treatments than between treatments, and 13% of studies did not report if replicates of all treatments were randomly interspersed. As a consequence, the number of experimental units used per treatment in studies was low (mean = 2.0).
In a comparable analysis, there was a significant decrease in the number of published studies that employed inappropriate chemical methods of manipulating seawater (i.e. acid–base only additions) from 21 to 3%, following the release of the “Guide to best practices for ocean acidification research and data reporting” in 2010; however, no such increase in the use of appropriate replication and experimental design was observed after 2010.
We provide guidelines on how to design ocean acidification laboratory experiments that incorporate the rigorous requirements for monitoring and measuring carbonate chemistry with a level of replication that increases the chances of accurate detection of biological responses to ocean acidification.”
(I have added paragraphing to the above for readability – kh)
Note: Despite heroic efforts, I have been unable to find a freely available full copy of C&H 2015 online. Chris Cornwall kindly supplied me with an Advance Access .pdf copy of the full study and the supplemental information file. Those wishing to read the full study should either email Dr. Cornwall requesting a copy or email me (my first name at the domain i4 dot net).
First, let me point out that Chris Cornwall and Catriona Hurd are OA research insiders. Unfortunately, the title of the Nature highlight makes their study sound like an indictment of OA research, which it is not.
Chris Cornwall tells me (in personal communication) that their study has been generally well received in the OA field and that “Many scientists have received the suggested solutions with open arms.” And while the Nature highlight will go a long way towards making the points raised in C&H 2015 clear to scientists all across the OA research field — a good thing — he felt that the Nature piece had elements that were “overly dramatic or incorrect” which had been latched onto by the popular press. Further, Chris says “Debates between scientists about improving a field of research do not invalidate that field, contrary to that reported by the Daily Mail.”
(Late addition: Chris Cornwall responds to the Daily Mail here. There is some slight contradiction between his public statement and his published paper, but he does have to continue to work in the field. Cornwall and Hurd intentionally did not publish the details of their analyses of the 465 papers – which ones were appropriate, which were inappropriate, and why – they published only, as a supplement, a list of the titles of the studies surveyed, for what I assume is the same reason.)
Just what have he and Catriona Hurd done? They have looked at published OA papers from 1993 to 2014 – 465 of them, which must have been an incredibly time-consuming task – mostly laboratory manipulation experiments (manipulating the atmospheric CO2 concentrations associated with tanks of ocean water, usually containing oceanic organisms, and manipulating the pH of the same). They evaluated each one for inappropriate experimental design and/or analysis of results. The main issue, and the major problem with the papers, though not the only one, was the replication – or lack thereof – of experimental units.
Definition: Experimental units for this discussion can be thought of as individual tanks of sea water + organisms to be studied + treatment (or lack of treatment, in the case of a control tank). In the following diagram, from Cornwall and Hurd (C&H 2015), only experimental designs preceded by the letter A are acceptable – all those preceded by B are not. (ref: Hurlbert 1984). (In 2013, Hurlbert used the definition “the smallest… unit of experimental material to which a single treatment (or treatment combination) is assigned by the experimenter and is dealt with independently …”).
The point being that “Regardless of the degree of precision that the treatment is applied and its effects measured, if treatment effects are confused with the effects of other factors not under investigation, then an accurate assessment of the effects of the treatment cannot be made.” If experimental units are not independent, if they are not truly randomized, if co-confounders can be seen to exist, then the results are not scientifically reliable.
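The cost of ignoring that independence requirement can be shown with a quick Monte Carlo sketch (standard-library Python only; the tank counts, organism counts, and noise levels below are my own illustrative assumptions, not figures from C&H 2015). Two tanks per treatment share tank-level noise, there is no true treatment effect at all, and yet a naive test that treats every organism as an independent replicate “finds” an effect far more often than the nominal 5%:

```python
import random
import math

random.seed(42)

def z_test_p(x, y):
    """Two-sample z-test p-value (normal approximation; adequate for n >= 30)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    vx = sum((v - mx) ** 2 for v in x) / (len(x) - 1)
    vy = sum((v - my) ** 2 for v in y) / (len(y) - 1)
    z = (mx - my) / math.sqrt(vx / len(x) + vy / len(y))
    # two-sided p from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def simulate_once(tanks_per_treatment=2, organisms_per_tank=15,
                  tank_sd=1.0, organism_sd=1.0):
    """Two treatments, NO true treatment effect; each tank adds shared noise."""
    groups = []
    for _ in range(2):  # two treatments
        vals = []
        for _ in range(tanks_per_treatment):
            tank_effect = random.gauss(0, tank_sd)
            vals += [tank_effect + random.gauss(0, organism_sd)
                     for _ in range(organisms_per_tank)]
        groups.append(vals)
    return groups

n_sims = 2000
false_pos = sum(z_test_p(*simulate_once()) < 0.05 for _ in range(n_sims))
print(f"False-positive rate treating organisms as replicates: "
      f"{false_pos / n_sims:.0%}  (nominal: 5%)")
```

That inflation is exactly Hurlbert’s pseudo-replication: the shared tank effect, a factor not under investigation, masquerades as a treatment effect.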
What did C&H find in this regard? Out of the 465 OA studies done between 1993 and 2014, “The proportion of studies that had interdependent or non-randomly interspersed treatment replicates, or did not report sufficient methodological details, was 95%.” That leaves just 5% of the studies judged to have appropriate experimental designs.
We all know that many things can go wrong in lab experiments such as these – experiments that take months, require constant monitoring of finicky details, and can be sabotaged by a lab assistant’s moment of inattention. These factors are understood and are part of the difficulty of all lab work. But when the original experimental design is insufficient for the purpose from the outset, then time, money, and effort are wasted, and the results become difficult or impossible to interpret – and certainly difficult or impossible to use in any sort of meta-analysis across studies.
Further, “the number of experimental units used per treatment in studies was low (mean = 2.0).” Think about that — imagine doing a medical study, an RCT, but using only 2 patients per cohort. Then consider that there are obvious co-confounders with the two patients, such as being siblings! No journal would touch the resultant paper – it would have no significance at all. Granted, one might get away with reporting it as a Case Study, but it would never be considered clinically important or predictive. And yet that is precisely the situation we find generally in OA research – very small numbers of experimental units poorly isolated, often with co-confounders that obfuscate or invalidate treatment effects.
C&H report (at the head of the discussion section):
“This analysis identified that most laboratory manipulation experiments in ocean acidification research used either an inappropriate experimental design and/or data analysis, or did not report these details effectively. Many studies did not report important methods, such as how treatments were created and the number of replicates of each treatment. The tendency for the use of inappropriate experimental design also undermines our confidence in accurately predicting the effects of ocean acidification on the biological responses of marine organisms.”
The authors maintain nonetheless that even poorly designed studies contain useful information, even if getting at it requires a full re-analysis of reported results. Some experiments however are hopelessly compromised by poor study design.
Having determined the biggest problem to be:
“Confusion regarding what constitutes an experimental unit is evident in ocean acidification research. This is demonstrated by a large proportion of studies that either treated the responses of individuals …. to treatments as experimental units, when multiple individuals were in each tank, or used tank designs where all experimental tanks of one treatment are more interconnected to each other than experimental tanks of other treatments (181 studies total).”
C&H proceed to give suggestions on proper experimental design that will prevent the problems found in the majority of previous studies as well as a series of suggestions regarding statistical evaluation of results. They attempt to set a gold standard for OA research in which known problems are avoided to improve reliability, significance, and usefulness of results.
C&H recommend: 1) various approaches to be determined and adopted before the OA manipulation system is designed; 2) lab layout and randomization of the positions of experimental tanks; 3) measurement schemes that avoid pseudo-replication and the statistical confusion caused by mistreating measurements – interdependent measurements treated as independent, multiple measurements of the same unit treated as independent measurements, and similar offenses; and 4) tips for reviewers (and for self-review by authors). Those interested in the details should read this section of the study – it is a valuable lesson in how complicated good experimental design can be, even for “simple” hypotheses (I give suggestions above on how to obtain a full copy of C&H 2015).
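As a rough sketch of two of those recommendations (my own illustrative code, not taken from C&H 2015): randomly intersperse tank positions on the bench, and collapse per-organism measurements to one value per tank before analysis, since the tank – not the organism – is the experimental unit:

```python
import random
from statistics import mean

def randomized_bench_layout(treatments, tanks_per_treatment, seed=None):
    """Assign every tank a random bench position so that no treatment's
    replicates end up clustered together (Hurlbert's 'interspersion')."""
    rng = random.Random(seed)
    layout = [t for t in treatments for _ in range(tanks_per_treatment)]
    rng.shuffle(layout)
    return layout  # index = bench position, value = treatment at that position

def tank_level_responses(measurements):
    """Collapse per-organism measurements to one value per tank; statistics
    are then run on these tank-level values, not on individual organisms."""
    return {tank: mean(vals) for tank, vals in measurements.items()}

print(randomized_bench_layout(["ambient", "high_CO2"], 4, seed=1))
print(tank_level_responses({"tank_A": [2.1, 2.3, 1.9],
                            "tank_B": [3.0, 2.8, 3.2]}))
```

The treatment names, tank labels, and numbers here are hypothetical; the point is only the pattern – randomize positions, then analyze at the level of the experimental unit.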
This study follows up on a major effort along the same lines in 2010 – by the European Project on OCean Acidification (EPOCA) – which produced the booklet “Guide to best practices for ocean acidification research and data reporting” (mentioned in C&H 2015). The guide gave strict guidelines meant to correct the 21% of pH perturbation experiments that, as of 2010, had been found using methods that did not properly replicate real ocean carbonate chemistry (see its sections on seawater carbonate chemistry). The good news from C&H 2015 is that the percentage of post-2010 OA studies containing gross carbonate chemistry errors was reduced to just 3% (down from 21% before 2010). The rest of C&H 2015 is the bad news: even though the “Guide to best practices…” contained an entire section on “Designing ocean acidification experiments to maximise inference” (Section 4 of the guide), 95% of the studies surveyed failed to meet minimal standards of experimental design (some of these, of course, must have been carried out before the guide was published – nonetheless, C&H report no improvement in experimental design between 2010 and 2014).
This new field is to be congratulated on its internal attempts to set itself right – to correct endemic errors in its research and educate those involved in better ways to conduct that research so that results will be significant and meaningful in the real world – results that not only are correct and get published, but that add to the sum total of human knowledge.
And yes, it is a shame that so much effort and so many research dollars have been spent for results that, so far, cannot tell us very much that is reliably useful and almost nothing that can be considered accurately predictive. But the hopeful thing is that this field of endeavor is actively engaged in self-correction.
Try to imagine such a thing happening in some other field of Climate Science – insider scientists producing a survey of research that points out that the majority of those studies about some aspect of Climate Science are seriously flawed and will have to be redone with experimental designs and statistical approaches that will actually produce dependable, scientific results.
Back in the OA world, Chris Cornwall has expressed his hope that their new paper in the ICES Journal of Marine Science (and the Nature editorial highlight, which significantly raised its profile) will bring improvements to OA experimental design over the next five years similar to those improvements they found for the chemistry aspects of OA studies post-2010.
I hope so too – Chris Cornwall and Catriona Hurd have my congratulations and I wish the entire OA field success, looking forward to new research based on proper experimental design and correct oceanic carbonate chemistry.
* * * *
And elsewhere in Science?
Psychology has been rocked by this NY Times story – “Many Social Science Findings Not as Strong as Claimed” which reports about the Reproducibility Project: Psychology. The original report summary is here: Estimating the Reproducibility of Psychological Science. The Times quotes Ioannidis (author of “Why Most Published Research Findings Are False”):
“Less than half [of 100 experiments were able to be replicated] – even lower than I thought,” said Dr. John Ioannidis, a director of Stanford University’s Meta-Research Innovation Center, who once estimated that about half of published results across medicine were inflated or wrong. Dr. Ioannidis said the problem was hardly confined to psychology and could be worse in other fields, including cell biology, economics, neuroscience, clinical medicine, and animal research.
The Reproducibility Project (RP) was attempting to validate studies, not invalidate them; it involved the original authors in the design of the replication attempts. Psychology has long known that many of its journal articles reported experiments that were unlikely to be correct – exaggerating effect sizes and significance, or reporting effects that were not real at all. The RP is trying to help Psychology, as a field of research, regain some semblance of reliability and public confidence, especially after a series of high-profile exposés of falsified data and the subsequent retractions.
Ioannidis gives a series of suggestions in his “Why Most Published….” paper on what could be done to improve this dismal record.
Reading Ioannidis will give you a lot of insight into what is wrong with CliSci research. Among the suggestions: “large studies with minimal bias should be performed on research findings that are considered relatively established, to see how often they are indeed confirmed. I suspect several established “classics” will fail the test.” For examples of this, see “Contradicted and initially stronger effects in highly cited clinical research” and, from the Mayo Clinic, “A Decade of Reversal: An Analysis of 146 Contradicted Medical Practices” .
Annemarie Zand Scholten & her team at the University of Amsterdam have produced an online course at https://www.coursera.org called Solid Science: Research Methods primarily aimed at social science students/researchers on the theory and practice of proper experimental design.
In the field of forecasting, J. Scott Armstrong has been leading the way with “Standards and Practices for Forecasting.” There are several leaders in Statistics as well, battling against the dreaded “P-value hacking” (and here).
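One common flavor of P-value hacking is easy to demonstrate in a few lines (an illustrative standard-library simulation of my own, not drawn from any of the papers linked above): measure many unrelated null outcomes and report whichever one happens to cross p < 0.05:

```python
import random
import math

random.seed(7)

def z_p(x):
    """One-sample z-test p-value against a true mean of 0 (normal approximation)."""
    n = len(x)
    m = sum(x) / n
    s = math.sqrt(sum((v - m) ** 2 for v in x) / (n - 1))
    z = m / (s / math.sqrt(n))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# The "hack": measure 20 unrelated outcomes, each pure noise with no real
# effect, and report any that comes up "significant" at p < 0.05.
trials = 1000
hits = 0
for _ in range(trials):
    ps = [z_p([random.gauss(0, 1) for _ in range(30)]) for _ in range(20)]
    if min(ps) < 0.05:
        hits += 1
print(f"Chance of at least one 'significant' null result: {hits / trials:.0%}")
```

With 20 independent null tests, the chance of at least one spurious “discovery” is roughly 1 − 0.95²⁰, around two in three – which is why unplanned multiple comparisons, unreported, can fill journals with effects that do not exist.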
Those of us (if there is an “us” amongst readers) who believe we need better (not just more) and Feynmanian-honest (not just correct) science should applaud these efforts to improve various fields of research, to point out their flaws while avoiding the temptation to “throw the baby out with the bathwater”.
Much of the ongoing conversation about what to do with the what-some-believe-to-be-broken peer-review system includes such things as advance registration of all proposed experiments, with their hypotheses, approvals, proposed methods, and metrics, along with repositories for all research data and results, raw and processed, and for all findings, resultant papers, and subsequent corrections. I believe all these efforts should be supported as well.
I encourage readers to share, in comments, other “self-correction of science” efforts that they are aware of.
It is long past time to end Climate Science’s standard approach which seems to be “Instead of Correction, Collusion.”
(and “Yes, you may quote me on that.”)
# # # # #
Author’s Comment Policy: I am happy to try to answer your questions about the topics I have brought up in this essay. I act here as a freelance science journalist, not a climate or oceanic scientist. Though somewhat knowledgeable, I am unable, and mostly unqualified, to answer questions regarding the science of AGW, CAGW, Global Warming, Global Cooling, Climate Change, Sunspot numbers, solar irradiation, ocean/CO2/carbonate chemistry, or other related topics – and will not engage in conversations on those issues. It would be nice if comments here could be about the positive side of self-correcting science efforts and not on the “I knew those blank-ity-blank ocean acidification guys were full of it” side. Thank you for reading.
# # # # #