The reproducibility crisis in science

Dorothy Bishop describes how threats to reproducibility, recognized but unaddressed for decades, might finally be brought under control.

From Nature:

More than four decades into my scientific career, I find myself an outlier among academics of similar age and seniority: I strongly identify with the movement to make the practice of science more robust. It’s not that my contemporaries are unconcerned about doing science well; it’s just that many of them don’t seem to recognize that there are serious problems with current practices. By contrast, I think that, in two decades, we will look back on the past 60 years — particularly in biomedical science — and marvel at how much time and money has been wasted on flawed research.

How can that be? We know how to formulate and test hypotheses in controlled experiments. We can account for unwanted variation with statistical techniques. We appreciate the need to replicate observations.

Yet many researchers persist in working in a way almost guaranteed not to deliver meaningful results. They ride with what I refer to as the four horsemen of the reproducibility apocalypse: publication bias, low statistical power, P-value hacking and HARKing (hypothesizing after results are known). My generation and the one before us have done little to rein these in.

In 1975, psychologist Anthony Greenwald noted that science is prejudiced against null hypotheses; we even refer to sound work supporting such conclusions as ‘failed experiments’. This prejudice leads to publication bias: researchers are less likely to write up studies that show no effect, and journal editors are less likely to accept them. Consequently, no one can learn from them, and researchers waste time and resources redundantly repeating experiments.

That has begun to change for two reasons. First, clinicians have realized that publication bias harms patients. If there are 20 studies of a drug and only one shows a benefit, but that is the one that is published, we get a distorted view of drug efficacy. Second, the growing use of meta-analyses, which combine results across studies, has started to make clear that the tendency not to publish negative results gives misleading impressions.
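For illustration (this sketch is not from Bishop’s article), here is a minimal Python simulation of the 20-studies scenario: with a small true drug effect and only the ‘positive and significant’ studies written up, the published literature overstates the benefit. All numbers are hypothetical.

```python
# Minimal sketch (hypothetical numbers): how publishing only 'significant'
# positive studies inflates the apparent effect of a drug.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect = 0.1                        # small true benefit, arbitrary units
n_per_arm, n_studies = 25, 20

all_estimates, published = [], []
for _ in range(n_studies):
    treated = rng.normal(true_effect, 1.0, n_per_arm)
    control = rng.normal(0.0, 1.0, n_per_arm)
    estimate = treated.mean() - control.mean()
    p_value = stats.ttest_ind(treated, control).pvalue
    all_estimates.append(estimate)
    if p_value < 0.05 and estimate > 0:  # only 'positive' results get written up
        published.append(estimate)

print(f"true effect:            {true_effect:.2f}")
print(f"average of all studies: {np.mean(all_estimates):.2f}")
if published:
    print(f"average of the {len(published)} published studies: {np.mean(published):.2f}")
else:
    print("no study reached significance")
```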

Low statistical power followed a similar trajectory. My undergraduate statistics courses had nothing to say about statistical power, and few of us realized we should take it seriously. Simply put, if a study has a small sample size and the effect of an experimental manipulation is small, then the odds are you won’t detect the effect, even if one is there.
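As a rough illustration (again, not from the article), a quick simulation of two-group studies shows how detection rates collapse when both the effect and the sample are small; the effect size and sample sizes below are arbitrary.

```python
# Minimal sketch (arbitrary numbers): a real but small effect is usually missed
# when the sample is small.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulated_power(effect, n_per_group, sims=5000, alpha=0.05):
    """Fraction of simulated two-group studies reaching p < alpha."""
    hits = 0
    for _ in range(sims):
        treatment = rng.normal(effect, 1.0, n_per_group)  # effect in SD units
        control = rng.normal(0.0, 1.0, n_per_group)
        if stats.ttest_ind(treatment, control).pvalue < alpha:
            hits += 1
    return hits / sims

print(simulated_power(effect=0.3, n_per_group=20))   # roughly 0.15: mostly missed
print(simulated_power(effect=0.3, n_per_group=175))  # roughly 0.80: usually detected
```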

I stumbled on the issue of P-hacking before the term existed. In the 1980s, I reviewed the literature on brain lateralization (how sides of the brain take on different functions) and developmental disorders, and I noticed that, although many studies described links between handedness and dyslexia, the definition of ‘atypical handedness’ changed from study to study — even within the same research group. I published a sarcastic note, including a simulation to show how easy it was to find an effect if you explored the data after collecting results (D. V. M. Bishop J. Clin. Exp. Neuropsychol. 12, 812–816; 1990). I subsequently noticed similar phenomena in other fields: researchers try out many analyses but report only the ones that are ‘statistically significant’.
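Bishop’s original simulation is in the 1990 paper cited above; as a stand-in, here is a minimal sketch of the same idea. On data with no real group difference, trying several post-hoc cutoffs for ‘atypical handedness’ and reporting only whichever one ‘works’ pushes the false-positive rate above the nominal 5%.

```python
# Minimal sketch (not the 1990 simulation itself): trying several post-hoc
# 'atypical handedness' cutoffs on null data and keeping the best one.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, trials = 50, 1000
cutoffs = [-0.5, -0.25, 0.0, 0.25, 0.5]   # five plausible-looking definitions

false_positives = 0
for _ in range(trials):
    # Continuous handedness scores; by construction there is NO real difference.
    dyslexic = rng.normal(0.0, 1.0, n)
    control = rng.normal(0.0, 1.0, n)
    p_values = []
    for c in cutoffs:
        table = [[(dyslexic < c).sum(), (dyslexic >= c).sum()],
                 [(control < c).sum(), (control >= c).sum()]]
        p_values.append(stats.chi2_contingency(table, correction=False)[1])
    if min(p_values) < 0.05:              # report only the definition that 'worked'
        false_positives += 1

print("nominal false-positive rate: 0.05")
print(f"rate after picking the best cutoff: {false_positives / trials:.2f}")
```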

This practice, now known as P-hacking, was once endemic to most branches of science that rely on P values to test significance of results, yet few people realized how seriously it could distort findings. That started to change in 2011, with an elegant, comic paper in which the authors crafted analyses to prove that listening to the Beatles could make undergraduates younger (J. P. Simmons et al. Psychol. Sci. 22, 1359–1366; 2011). “Undisclosed flexibility,” they wrote, “allows presenting anything as significant.”

Full story here. H/t to WUWT reader Dave Heider.



80 Comments
Robert B
May 9, 2019 3:57 pm

My favorite paper is one that I was only a coauthor of. A little different from the problems discussed, but still a big problem.

It was my simple model that explained a strange result of a colleague’s. It had three adjustable parameters, but the result was so strange that we were confident the model was right. Two of the values from an almost perfect fit to the data were well within expected ranges. We didn’t have a clue what the third should be, but I suggested a simple experiment to measure it. It came back an order of magnitude off.

The interesting part of the story is that we published our work in a good journal and included the evidence that we were wrong. It went down well with reviewers and readers. You don’t see scientists debunk their own work often enough.

David L. Hagen
May 9, 2019 6:09 pm

Separate the data used to test predictions from the data used to build the model.
McKitrick and Christy provide a clear example of how to explicitly separate the data used to tune global climate models from the data used to test their predictions. The IPCC’s models strongly fail their most sensitive prediction: the anthropogenic signature of maximum warming in tropical tropospheric temperatures.

McKitrick, R. and Christy, J., 2018. A Test of the Tropical 200- to 300-hPa Warming Rate in Climate Models. Earth and Space Science, 5(9), pp. 529–536.
https://agupubs.onlinelibrary.wiley.com/doi/pdf/10.1029/2018EA000401

Abstract
Overall climate sensitivity to CO2 doubling in a general circulation model results from a complex system of parameterizations in combination with the underlying model structure. We refer to this as the model’s major hypothesis, and we assume it to be testable. We explain four criteria that a valid test should meet: measurability, specificity, independence, and uniqueness. We argue that temperature change in the tropical 200- to 300-hPa layer meets these criteria. Comparing modeled to observed trends over the past 60 years using a persistence-robust variance estimator shows that all models warm more rapidly than observations and in the majority of individual cases the discrepancy is statistically significant. We argue that this provides informative evidence against the major hypothesis in most current climate models.

Third, the independence criterion means that the target of the prediction must not be an input to the empirical tuning of the model. Once a model has been tuned to match a target, its reproduction of the target is no longer a test of its validity. In the case of GCMs, this rules out using the global average surface temperature record for testing, since during development models are often adjusted to broadly match its evolution over time.

GCMs unanimously project that warming will reach a global maximum in the tropics near the 200- to 300-hPa layer, due to the so-called negative lapse rate feedback (National Academy of Sciences, 2003) and that the warming will occur rapidly in response to increased greenhouse forcing.

Third, by focusing on the 200- to 300-hPa layer we avoid contaminating the test by searching for a signal to which the models were already tuned. The surface temperature record is ruled out for this reason, but satellite-based lower- and middle-troposphere composites are also somewhat contaminated since they include the near-surface layer in their weighting functions. Radiosonde samples measure each layer of the atmosphere independently, not simply as a gradient against the surface.

Fourth, simulations in the IPCC AR4 Chapter 9 (Hegerl et al., 2007) indicate that, within the framework of mainstream GCMs, greenhouse forcing provides the only explanation for a strong warming trend in the target region.

Table 2 lists the model-specific comparisons against the average observational series. In the restricted case, 62 of 102 models reject, while in the general case, 87 of 102 models reject. It is striking that all model runs exhibit too much warming and in a clear majority of cases the discrepancies are statistically significant.

If tuning to the surface added empirical precision to a valid physical representation, we would expect to see a good fit between models and observations at the point where the model predicts the clearest and strongest thermodynamic response to greenhouse gases. Instead, we observe a discrepancy across all runs of all models, taking the form of a warming bias at a sufficiently strong rate as to reject the hypothesis that the models are realistic. Our interpretation of the results is that the major hypothesis in contemporary climate models, namely, the theoretically based negative lapse rate feedback response to increasing greenhouse gases in the tropical troposphere, is incorrect.
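For readers who want to see the shape of such a test, here is a simplified sketch. The two series are synthetic placeholders, not actual model output or radiosonde data, and plain Newey-West (HAC) standard errors stand in for the paper’s persistence-robust variance estimator.

```python
# Simplified sketch: testing whether a modeled tropical 200-300 hPa trend exceeds
# an observed one. Both series below are SYNTHETIC placeholders with AR(1) noise.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
years = np.arange(1958, 2018)
decades = (years - years[0]) / 10.0

def ar1_noise(n, phi=0.6, sigma=0.1):
    """Autocorrelated noise: annual temperature anomalies are not independent."""
    e = np.zeros(n)
    for i in range(1, n):
        e[i] = phi * e[i - 1] + rng.normal(0.0, sigma)
    return e

observed = 0.1 * decades + ar1_noise(len(decades))  # hypothetical 0.1 C/decade
modeled = 0.3 * decades + ar1_noise(len(decades))   # hypothetical 0.3 C/decade

# Regress the model-minus-observation difference on time; a significant positive
# slope means the model warms faster than the observations.
X = sm.add_constant(decades)
fit = sm.OLS(modeled - observed, X).fit(cov_type="HAC", cov_kwds={"maxlags": 5})
print(f"trend difference: {fit.params[1]:+.2f} C/decade, p = {fit.pvalues[1]:.3f}")
```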

For further such testing see:
Varotsos, C.A. and Efstathiou, M.N., 2019. Has global warming already arrived? Journal of Atmospheric and Solar-Terrestrial Physics, 182, pp.31-38.

Richard Feynman describes a similarly exemplary, exhaustive test of rats running a maze, one that sadly was ignored (“Cargo Cult Science”, Caltech commencement address, 1974):

For example, there have been many experiments running rats through all kinds of mazes, and so on—with little clear result. But in 1937 a man named Young did a very interesting one. He had a long corridor with doors all along one side where the rats came in, and doors along the other side where the food was. He wanted to see if he could train the rats to go in at the third door down from wherever he started them off. No. The rats went immediately to the door where the food had been the time before.
The question was, how did the rats know, because the corridor was so beautifully built and so uniform, that this was the same door as before? . . . He finally found that they could tell by the way the floor sounded when they ran over it. And he could only fix that by putting his corridor in sand. So he covered one after another of all possible clues and finally was able to fool the rats so that they had to learn to go in the third door. If he relaxed any of his conditions, the rats could tell.
Now, from a scientific standpoint, that is an A-Number-1 experiment. That is the experiment that makes rat-running experiments sensible, because it uncovers the clues that the rat is really using—not what you think it’s using. And that is the experiment that tells exactly what conditions you have to use in order to be careful and control everything in an experiment with rat-running.

Reply to  David L. Hagen
May 10, 2019 10:08 am

David –> A very good explanation.

I have always wondered how a model that is built by tuning it to match the output you desire could be a valid predictor of the future. It is an extreme case of bias.

PatrickB
May 9, 2019 6:22 pm

There was a time that playing too loose with data or claiming significance on questionable data would destroy your career. But now political patronage protects such behavior.

Caligula Jones
Reply to  PatrickB
May 10, 2019 6:27 am

Step 1: remove classical education (too many white males), which taught you to actually think
Step 2: create modern math (too hard), which requires belief, not proof
Step 3: wait 18 years until the students who have been “educated” in Steps 1 and 2 can vote
Step 4: get elected

Dr Deanster
May 9, 2019 7:23 pm

Wow … this article gets at the heart of why I left academia. I couldn’t stomach the dishonesty and lack of integrity. The problem is ego!! Scientists just can’t stand to be wrong. Medical scientists can’t stand the thought that maybe their pet idea wasn’t so grand after all. Climate scientists can’t accept that their CO2 climate theory is rubbish. So they fudge data, fudge statistics, and rush to publish positive results while hiding negative results.

It’s not surprising that “climate science” has found a home among the most dishonest group of people in existence: politicians. By default, the scientists themselves gravitate toward their worst egoistic qualities in an environment where such behavior is celebrated.

Sad …. just really sad.

Mortimer Phillip Zilch
May 10, 2019 5:12 am

Lawyers should save this article in an active file. It can serve them well as a general tool to debunk an “expert” in the courtroom.

Solomon Green
May 10, 2019 5:36 am

At least part of the problem is that too many papers appear without the raw data. Even when data is attached as an appendix, it has often evidently been redacted. Caligula Jones cites how it is done in the civil service.

I once had the job of assessing the risk to residents from aircraft flying over their properties on landing or take-off. The Civil Aviation Authority model ignored all such accidents that had not occurred within a few hundred meters of a runway. These discarded “outliers” were very rare, but the total number of non-passenger deaths they had caused far outweighed the number of non-passenger deaths caused by the rare, but still statistically significant, accidents in the CAA’s artificially curtailed sample.
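To make the arithmetic concrete, a tiny illustration with entirely made-up counts (not CAA data): if the rare far-from-runway accidents account for most third-party deaths, a sample curtailed to the runway’s vicinity captures only a small fraction of the true toll.

```python
# Tiny illustration with made-up counts, not CAA data: curtailing the sample to
# near-runway accidents discards most of the third-party fatalities.
near_runway = {"accidents": 40, "deaths": 20}      # hypothetical counts
far_from_runway = {"accidents": 5, "deaths": 60}   # rarer, but deadlier for bystanders

total_deaths = near_runway["deaths"] + far_from_runway["deaths"]
captured = near_runway["deaths"] / total_deaths
print(f"share of third-party deaths kept by the curtailed sample: {captured:.0%}")
```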

Fran
May 10, 2019 8:56 am

One of my best ones was when, at a conference, I complained to a ‘heavy’ that the compound he had claimed was specific for a particular receptor subtype was pretty ‘dirty’. He said the mistake I made was not to work with the new compounds ‘when they were still specific’.

DEEBEE
May 11, 2019 6:17 pm

Many, many new moons ago — almost 500 — I was told that you can write a “failure” thesis for your Masters, but never for your PhD. Thus begin the tiny acts of putting your pinky on the scale.