Dorothy Bishop describes how threats to reproducibility, recognized but unaddressed for decades, might finally be brought under control.
From Nature:
More than four decades into my scientific career, I find myself an outlier among academics of similar age and seniority: I strongly identify with the movement to make the practice of science more robust. It’s not that my contemporaries are unconcerned about doing science well; it’s just that many of them don’t seem to recognize that there are serious problems with current practices. By contrast, I think that, in two decades, we will look back on the past 60 years — particularly in biomedical science — and marvel at how much time and money has been wasted on flawed research.
How can that be? We know how to formulate and test hypotheses in controlled experiments. We can account for unwanted variation with statistical techniques. We appreciate the need to replicate observations.
Yet many researchers persist in working in a way almost guaranteed not to deliver meaningful results. They ride with what I refer to as the four horsemen of the reproducibility apocalypse: publication bias, low statistical power, P-value hacking and HARKing (hypothesizing after results are known). My generation and the one before us have done little to rein these in.
In 1975, psychologist Anthony Greenwald noted that science is prejudiced against null hypotheses; we even refer to sound work supporting such conclusions as ‘failed experiments’. This prejudice leads to publication bias: researchers are less likely to write up studies that show no effect, and journal editors are less likely to accept them. Consequently, no one can learn from them, and researchers waste time and resources redundantly repeating experiments.
That has begun to change for two reasons. First, clinicians have realized that publication bias harms patients. If there are 20 studies of a drug and only one shows a benefit, but that is the one that is published, we get a distorted view of drug efficacy. Second, the growing use of meta-analyses, which combine results across studies, has started to make clear that the tendency not to publish negative results gives misleading impressions.
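To make that distortion concrete, here is a minimal simulation sketch (not from Bishop's article; the zero-to-tiny true effect, trial sizes and z-test threshold are illustrative assumptions). It generates 20 small drug-versus-placebo trials of a drug with only a tiny real benefit and then "publishes" only the results that cross the significance threshold; whatever survives that filter badly misrepresents the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)

true_effect = 0.1                 # assume only a tiny real benefit (illustrative)
n_studies, n = 20, 30             # 20 small trials, 30 patients per arm

all_effects, published = [], []
for _ in range(n_studies):
    drug    = rng.normal(true_effect, 1.0, n)
    placebo = rng.normal(0.0, 1.0, n)
    diff = drug.mean() - placebo.mean()
    se   = np.sqrt(drug.var(ddof=1) / n + placebo.var(ddof=1) / n)
    all_effects.append(diff)
    if abs(diff / se) > 1.96:     # roughly p < 0.05, two-sided z-test
        published.append(diff)    # only 'significant' studies get written up

print("mean effect across all 20 studies:", round(float(np.mean(all_effects)), 3))
print("mean effect in the published ones:",
      round(float(np.mean(published)), 3) if published else "nothing reached significance")
```

The gap between the two averages is the misleading impression a meta-analysis of only the published studies would inherit.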
Low statistical power followed a similar trajectory. My undergraduate statistics courses had nothing to say about statistical power, and few of us realized we should take it seriously. Simply put, if a study has a small sample size and the effect of an experimental manipulation is small, then the odds are you won't detect the effect, even if one is there.
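As a rough illustration (my own toy numbers, not Bishop's), simulating a small true effect with a small sample shows how often a real effect goes undetected:

```python
import numpy as np

rng = np.random.default_rng(1)

effect, sigma = 0.3, 1.0          # a genuinely small true effect (illustrative)
n_per_group   = 20                # a small sample in each group
n_sims        = 10_000

hits = 0
for _ in range(n_sims):
    treated = rng.normal(effect, sigma, n_per_group)
    control = rng.normal(0.0,   sigma, n_per_group)
    diff = treated.mean() - control.mean()
    se   = np.sqrt(treated.var(ddof=1) / n_per_group +
                   control.var(ddof=1) / n_per_group)
    if abs(diff / se) > 1.96:     # 'significant' at roughly the 5% level
        hits += 1

# The real effect is present in every simulated study, yet with these
# numbers it is detected only a small minority of the time.
print(f"estimated power: {hits / n_sims:.2f}")
```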
…
I stumbled on the issue of P-hacking before the term existed. In the 1980s, I reviewed the literature on brain lateralization (how sides of the brain take on different functions) and developmental disorders, and I noticed that, although many studies described links between handedness and dyslexia, the definition of ‘atypical handedness’ changed from study to study — even within the same research group. I published a sarcastic note, including a simulation to show how easy it was to find an effect if you explored the data after collecting results (D. V. M. Bishop J. Clin. Exp. Neuropsychol. 12, 812–816; 1990). I subsequently noticed similar phenomena in other fields: researchers try out many analyses but report only the ones that are ‘statistically significant’.
This practice, now known as P-hacking, was once endemic to most branches of science that rely on P values to test significance of results, yet few people realized how seriously it could distort findings. That started to change in 2011, with an elegant, comic paper in which the authors crafted analyses to prove that listening to the Beatles could make undergraduates younger (J. P. Simmons et al. Psychol. Sci. 22, 1359–1366; 2011). “Undisclosed flexibility,” they wrote, “allows presenting anything as significant.”
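A toy version of that kind of simulation (my own sketch, not Bishop's 1990 code or the Simmons analysis; the cut-offs and sample size are invented) shows how trying several group definitions on pure noise and keeping whichever comparison "works" pushes the false-positive rate well past the nominal 5%:

```python
import numpy as np

rng = np.random.default_rng(2)

n_sims, n = 10_000, 40
false_positives = 0

for _ in range(n_sims):
    outcome   = rng.normal(0.0, 1.0, n)     # no real effect anywhere
    covariate = rng.normal(0.0, 1.0, n)     # e.g. a continuous handedness index
    # 'Undisclosed flexibility': try several cut-offs for defining the
    # groups and report whichever comparison comes out significant.
    for cutoff in (-0.5, 0.0, 0.5, 1.0):
        a, b = outcome[covariate > cutoff], outcome[covariate <= cutoff]
        if len(a) > 1 and len(b) > 1:
            diff = a.mean() - b.mean()
            se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
            if abs(diff / se) > 1.96:       # nominally p < 0.05
                false_positives += 1
                break                        # stop at the first 'finding'

print(f"share of studies finding a 'significant' effect in pure noise: "
      f"{false_positives / n_sims:.2f}")
```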
Full story here. H/t to WUWT reader Dave Heider.
My favorite paper is one on which I was only a coauthor. It is a little different from the problems discussed here, but still a big problem.
My simple model explained a strange result of a colleague's. It had three adjustable parameters, but the result was so strange that we were confident the model was right. Two of the values from an almost perfect fit to the data were well within expected ranges. We didn't have a clue what the third should be, so I suggested a simple experiment to measure it. It came back an order of magnitude off.
The interesting part of the story is that we published our work in a good journal and included the evidence that we were wrong. It went down well with reviewers and readers. You don't see scientists debunk their own work often enough.
Separate the data used to test predictions from the data used to build the model.
McKitrick and Christy provide a clear example of how to explicitly separate the data used to tune global climate models from the data used to test their predictions. By that test, the IPCC's models clearly fail on their most sensitive prediction: the anthropogenic signature of enhanced warming in the tropical troposphere.
McKitrick, R. and Christy, J., 2018. A Test of the Tropical 200- to 300-hPa Warming Rate in Climate Models. Earth and Space Science, 5(9), pp. 529-536.
https://agupubs.onlinelibrary.wiley.com/doi/pdf/10.1029/2018EA000401
For further such testing see:
Varotsos, C.A. and Efstathiou, M.N., 2019. Has global warming already arrived? Journal of Atmospheric and Solar-Terrestrial Physics, 182, pp.31-38.
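For readers who want the general principle rather than the climate specifics, here is a minimal sketch of the hold-out idea: tune on one part of a record, then judge on data the tuning never saw. (The series, split point, and straight-line "model" are invented for illustration; this is not what McKitrick and Christy did, which was to compare model trends against independent observations.)

```python
import numpy as np

rng = np.random.default_rng(3)

# An invented 'historical' series: a slow trend plus noise.
years = np.arange(1979, 2019)
temps = 0.01 * (years - years[0]) + rng.normal(0.0, 0.1, years.size)

# Tune the model (here just a straight line) on the first 30 years only...
split = 30
coef = np.polyfit(years[:split], temps[:split], deg=1)

# ...then score it on the held-out years it never saw.
pred = np.polyval(coef, years[split:])
rmse = np.sqrt(np.mean((pred - temps[split:]) ** 2))
print(f"out-of-sample RMSE on the held-out years: {rmse:.3f}")
```

A model judged only against the data it was tuned to will always look better than it deserves; the held-out comparison is what actually tests it.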
Richard Feynman describes a similarly exemplary, exhaustive test of rats running a maze that was, sadly, ignored (Cargo Cult Science, Caltech, 1974).
David –> A very good explanation.
I have always wondered how a model that is tuned so its output matches the result you desire could be a valid predictor of the future. It is an extreme case of bias.
There was a time when playing too loose with data, or claiming significance on questionable data, would destroy your career. Now political patronage protects such behavior.
Step 1: remove classical education (too many white males), which taught you to actually think
Step 2: create modern math (too hard), which requires belief, not proof
Step 3: wait 18 years until the students who have been “educated” in Steps 1 and 2 can vote
Step 4: get elected
Wow … this article gets at the heart of why I left academia. I couldn’t stomach the dishonesty and lack of integrity. The problem is ego!! Scientists just can’t stand to be wrong. Medical scientists can’t stand the thought that maybe their pet idea wasn’t so grand after all. Climate scientists can’t accept that their CO2 climate theory is rubbish. Sooooo … they fudge data, fudge statistics, rush to publish the positive results and hide the negative ones.
It’s not surprising that “Climate Science” has found a home among the most dishonest group of people in existence … politicians. In an environment where such qualities are celebrated, the scientists themselves default to their worst egoism.
Sad …. just really sad.
Lawyers should save this article in an active file. It can serve them well as a general tool for debunking an “expert” in the courtroom.
At least part of the problem is that too many papers appear without the raw data. Even when data is attached as an appendix, it has often evidently been redacted. Caligula Jones cites how it is done in the civil service.
I once had the job of assessing the risk to residents from aircraft flying over their properties on landing or take-off. The Civil Aviation Authority model ignored all such accidents that had not occurred within a few hundred meters of a runway. These discarded “outliers” were very rare, but the total number of non-passenger deaths they had caused far outweighed the non-passenger deaths caused by the rare, but still statistically significant, accidents left in the CAA’s artificially curtailed sample.
One of my best ones was when, at a conference, I complained to a ‘heavy’ that the compound he had claimed was specific for a particular receptor subtype was pretty ‘dirty’. He said the mistake I had made was not working with the new compounds ‘when they were still specific’.
Many, many new moons ago (almost 500) I was told that you can write a “failure” thesis for your Masters, but never for your PhD. Thus begin the tiny acts of putting your pinky on the scale.