The reproducibility crisis in science

Dorothy Bishop describes how threats to reproducibility, recognized but unaddressed for decades, might finally be brought under control.

From Nature:

More than four decades into my scientific career, I find myself an outlier among academics of similar age and seniority: I strongly identify with the movement to make the practice of science more robust. It’s not that my contemporaries are unconcerned about doing science well; it’s just that many of them don’t seem to recognize that there are serious problems with current practices. By contrast, I think that, in two decades, we will look back on the past 60 years — particularly in biomedical science — and marvel at how much time and money has been wasted on flawed research.

How can that be? We know how to formulate and test hypotheses in controlled experiments. We can account for unwanted variation with statistical techniques. We appreciate the need to replicate observations.

Yet many researchers persist in working in a way almost guaranteed not to deliver meaningful results. They ride with what I refer to as the four horsemen of the reproducibility apocalypse: publication bias, low statistical power, P-value hacking and HARKing (hypothesizing after results are known). My generation and the one before us have done little to rein these in.

In 1975, psychologist Anthony Greenwald noted that science is prejudiced against null hypotheses; we even refer to sound work supporting such conclusions as ‘failed experiments’. This prejudice leads to publication bias: researchers are less likely to write up studies that show no effect, and journal editors are less likely to accept them. Consequently, no one can learn from them, and researchers waste time and resources on repeating experiments, redundantly.

That has begun to change for two reasons. First, clinicians have realized that publication bias harms patients. If there are 20 studies of a drug and only one shows a benefit, but that is the one that is published, we get a distorted view of drug efficacy. Second, the growing use of meta-analyses, which combine results across studies, has started to make clear that the tendency not to publish negative results gives misleading impressions.
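The distortion described above can be sketched with a toy simulation. This is only an illustration under stated assumptions, not anything taken from the article: it assumes a drug with no true effect, studies with 25 patients per arm and unit-variance outcomes, and a hypothetical rule that only studies showing a statistically significant benefit get published. The function name and every parameter value are made up for the example.

```python
import random
from statistics import NormalDist, fmean


def simulate_literature(true_effect=0.0, n_per_group=25, n_studies=20000, seed=11):
    """Compare the average effect across all studies with the average
    across only the studies that would be 'published'.

    Hypothetical publication rule (an assumption): a study is published
    only if it shows a benefit significant at the one-sided 5% level.
    """
    rng = random.Random(seed)
    # Standard error of a difference in means, assuming unit variance per arm.
    standard_error = (2.0 / n_per_group) ** 0.5
    threshold = NormalDist().inv_cdf(0.95)  # one-sided 5% cutoff

    all_effects = []
    published = []
    for _ in range(n_studies):
        # Draw the observed effect directly from its sampling distribution.
        observed = rng.gauss(true_effect, standard_error)
        all_effects.append(observed)
        if observed / standard_error > threshold:
            published.append(observed)

    return {
        "mean_all": fmean(all_effects),
        "mean_published": fmean(published),
        "share_published": len(published) / n_studies,
    }


if __name__ == "__main__":
    result = simulate_literature()
    print(f"mean effect, all studies:       {result['mean_all']:.3f}")
    print(f"mean effect, published studies: {result['mean_published']:.3f}")
    print(f"share of studies published:     {result['share_published']:.3f}")
```

Under these assumptions roughly one study in 20 clears the bar, and the published studies report a sizeable average benefit even though the true effect is zero.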

Low statistical power followed a similar trajectory. My undergraduate statistics courses had nothing to say on statistical power, and few of us realized we should take it seriously. Simply, if a study has a small sample size, and the effect of an experimental manipulation is small, then odds are you won’t detect the effect — even if one is there.
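A rough way to see the point about sample size is to simulate it. The sketch below assumes a real but small effect (0.3 standard deviations), normally distributed outcomes with known unit variance, and a simple two-sided z-test; those choices, the function name and the parameter values are illustrative assumptions, not part of the article.

```python
import random
from statistics import NormalDist, fmean


def estimated_power(n_per_group, effect_size, alpha=0.05, trials=2000, seed=1):
    """Fraction of simulated experiments in which a real effect is detected."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Standard error of the difference in means (unit variance assumed known).
    standard_error = (2.0 / n_per_group) ** 0.5
    detections = 0
    for _ in range(trials):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(effect_size, 1.0) for _ in range(n_per_group)]
        z = (fmean(treated) - fmean(control)) / standard_error
        p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))
        if p_value < alpha:
            detections += 1
    return detections / trials


if __name__ == "__main__":
    for n in (20, 200):
        print(f"n per group = {n:3d}: power is about {estimated_power(n, 0.3):.2f}")
```

With these assumed numbers, 20 subjects per group detects the effect in only a small minority of runs, while 200 per group detects it most of the time.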

I stumbled on the issue of P-hacking before the term existed. In the 1980s, I reviewed the literature on brain lateralization (how sides of the brain take on different functions) and developmental disorders, and I noticed that, although many studies described links between handedness and dyslexia, the definition of ‘atypical handedness’ changed from study to study — even within the same research group. I published a sarcastic note, including a simulation to show how easy it was to find an effect if you explored the data after collecting results (D. V. M. Bishop J. Clin. Exp. Neuropsychol. 12, 812–816; 1990). I subsequently noticed similar phenomena in other fields: researchers try out many analyses but report only the ones that are ‘statistically significant’.

This practice, now known as P-hacking, was once endemic to most branches of science that rely on P values to test significance of results, yet few people realized how seriously it could distort findings. That started to change in 2011, with an elegant, comic paper in which the authors crafted analyses to prove that listening to the Beatles could make undergraduates younger (J. P. Simmons et al. Psychol. Sci. 22, 1359–1366; 2011). “Undisclosed flexibility,” they wrote, “allows presenting anything as significant.”
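The effect of trying many analyses and reporting only the "significant" one can be sketched in a few lines. This toy example assumes there is no real effect at all and that each analysis is an independent, well-calibrated test, so its P value is uniformly distributed between 0 and 1. Analyses run on the same data are usually correlated, which would typically inflate the rate less than shown here. The function names are hypothetical.

```python
import random


def chance_of_spurious_hit(n_analyses, alpha=0.05, trials=20000, seed=7):
    """Simulated probability that at least one of n_analyses tests of a
    true null hypothesis comes out with p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Assumption: under a true null each P value is uniform on [0, 1].
        p_values = [rng.random() for _ in range(n_analyses)]
        if min(p_values) < alpha:
            hits += 1
    return hits / trials


def closed_form_rate(n_analyses, alpha=0.05):
    """Same quantity for independent tests: 1 - (1 - alpha) ** n_analyses."""
    return 1.0 - (1.0 - alpha) ** n_analyses


if __name__ == "__main__":
    for k in (1, 5, 10, 20):
        simulated = chance_of_spurious_hit(k)
        exact = closed_form_rate(k)
        print(f"{k:2d} analyses: simulated {simulated:.3f}, closed form {exact:.3f}")
```

The closed form makes the scale of the problem plain: with a nominal 5% threshold, ten independent tries already give about a 40% chance of at least one spurious "finding".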

Full story here h/t to WUWT reader Dave Heider

80 thoughts on “The reproducibility crisis in science”

    • Not really. It is a factor, but “p-hacking” doesn’t simply follow grant money around. It occurs in most sciences and many of those sciences really don’t see that much grant money. The primary trouble has to do with low training standards for graduate and pre-degree students. Courses in logic, philosophy and statistical reasoning are not required or even recommended in many departments. The main driver is that tiny probabilities are understood as “meaningful” when they really are not necessarily anything but an accidental result. So, your typical young glory grabber sees all those zeros after the decimal and goes “OH, WOW!” His advisers, who are often no wiser, do as well. The department chair often gets to be listed as lead author when the paper is written, and everybody basks in the glory of discovery. But a well-known phenomenon in the sciences is that initial findings are often impressive, but as more studies are reported, the impressiveness declines and the number of zeros after the decimal starts shrinking. What’s happened is that quite often, a student researcher gets a result that raises their eyebrows. What they aren’t aware of is how many in the past did the same study and found nothing of interest. Then they get their awesome number and immediately publish it. Others jump in and struggle to find the same impressive results. Unless the “impressive” results are regarding something really controversial, of the operators attempting to replicate the results, only those with somewhat impressive results publish. Some who tried before try again. They get even less impressive results. Ultimately the effect either disappears or is written off as real but unimportant.

      • “Courses in logic, philosophy and statistical reasoning are not required or even recommended in many departments. ”

        One of my sons is a PhD in immunology. He echoes what you say. “We’ll get a math major to analyze the results” is a common refrain.

      • I recall the statistics course in my undergraduate studies in applied physics was the worst course we had. Intractable and impossible to follow. Probably because the doctor giving the course had little idea about it himself.

        • “Probably because the doctor giving the course had little idea about it himself.”

          It starts even earlier than that: most public school teachers have zero math skills. When the students get to high school, they might find one teacher in four years with some.

          BTW, best math teacher I had was nuts. Big Velikovsky fan…

    • IMO it’s the rise of managerialism in tertiary institutions and organisations like NOAA, BOM (Australia) etc. The managerialists set up ‘science communication’ outfits and science gets plugged into the 24 hour news cycle, ready to publish a sexed up summary of some LPU (Least Publishable Unit) to keep the topic and funding for further research in the public/political/msm eye. In its very essence it is a corruption, a prostitution of science. The money then just follows the corrupt furrows.

  1. There’s a xxxx load of bad science being done and my impression is that quality is declining. It could be that the brighter students are pursuing other areas (business, computer science, engineering) and that science, especially climate and ecological sciences, is being populated with snowflakes.

    • There’s a lot more “science” being done today than even a decade ago.

      It’s extremely difficult to significantly increase the output of anything, without reducing the overall quality of the output. This is especially true when the production process is heavily dependent on skilled human labor.

      It’s very difficult to, say, double the labor force in a skilled/professional field without resorting to hiring less talented/qualified/skilled workers.

  2. … and HARKing (hypothesizing after results are known)…

    There is nothing wrong about this practice unless one uses the earlier results to verify the hypothesis. There is not just “a” scientific method, but actually many. Some perfectly valid research came about from winding one’s way, spaghetti-like, through hypotheses and tests.

    • I have had one or two spectacular wins in the technical ceramic industry when designing a modification on equipment that had unexpected beneficial results as well as achieving the initial aim. I was happy to be lucky as well as good.
      As long as you understand the new information I don’t see a problem.

      • Agreed, but I don’t think the one phrase description of HARKing explains it sufficiently.

        If you design an experiment to test the efficacy of a drug and it comes out negative, but you then modify the hypothesis to test its effects on women over the age of 65 using the same data, you may produce false positives.

        If you see an unexpected effect in some data then try to hypothesise possible causes, that is legit. If you have a credible hypothesis you would then presumably construct new experiments to test it. I’m sure that’s what you would have done, rather than just relying on a post hoc hypothesis as proof.

    • A hypothesis is usually formed by an examination of prior data or the ramifications of an accepted finding. After that, an experiment is devised to best isolate the hypothetical mechanism/variable. The results from *that* experiment are what are to be used in publishing a finding.

      • “Patrick May 10, 2019 at 3:52 am

        A hypothesis is formulated by an examination of prior data or the ramifications of an accepted finding. After that, experiment is devised to isolate the hypothetical mechanism / variable. The results from * that * experiment are what are being used in publishing a finding. ”
        ____________________________________________

        Whole lotta work, Patrick, needs min. 3 classes / generations of (re) searchers AND time:

        – an idea for a hypothesis, working on it and getting exhausted because of time overrun or plain fatigue.

        – some (re) searchers finding idea + work on it and try to follow suit.

        – (re) searchers trying to collect, combine AND reproduce found.

        – a competent science history journalist and a competent scientist to make the long way home.

  3. In the full post, Prof. Bishop is very optimistic that the “four horsemen” she talks of are being slain. In the medical sciences she may be correct, but I am much less optimistic about other branches of science that are not as amenable to i) direct experimentation (as opposed to observational modelling) and ii) accepting refutation of hypotheses.

    Maybe these two points actually go together. Where experimental results are not the norm, there seems to be very little attention to refutation of hypotheses and much more focus on finding evidence to support a theory. People become much too invested in a pet theory and will defend it way beyond the point at which it clearly has no validity or utility.

    I remain quite pessimistic about the future of science.

    • Kinda been my point for quite some time: climate science, such as it is, appears to be totally immune to charges that it might have a reproducibility issue.

      Possibly because so much of the “science” actually exists only as programs in supercomputers?

    • One thing that they’ve stopped teaching (assuming that they ever did) to science majors is: “Never fall in love with a particular theory”.

      • http://www.gutenberg.org/files/45122/45122-h/45122-h.htm#Page_190

        Every true Science is like a hardy Alpine guide that leads us on from the narrow, though it may be the more peaceful and charming, valleys of our preconceived opinions, to higher points, apparently less attractive, nay often disappointing for a time, till, after hours of patient and silent climbing, we look round and see a new world around us.

        PROFESSOR MAX MÜLLER

  4. I’ve spent much of my professional career performing root cause analysis on industrial equipment failures. Much that is called “scientific research” would get rejected by people responsible for knowing why something happened so corrective actions will be assured of succeeding.

    Among root cause professionals, there are two sure-fire methods for getting your findings laughed out of the office. 1) Ignore data that contradicts your pet theory, and 2) Try to get a senior manager to order everyone to see it your way. Unfortunately, both of these have become all too common in junk science as contradictory data is said to come from “deniers,” and junk scientists appeal to the courts to have their theories enforced by a judge.

    • Good point. I have also been involved in structured approaches for acquiring and analyzing facts and data,
      which is effective in preventing the group from jumping to the wrong root cause.
      Interviewing multiple individuals who have first-hand exposure is especially helpful.

  5. P-hacking has been known for a while. One of our irregular posters here, Wm Briggs, has written about it, but even before him I read about criticisms of P-values in a book by Richard Royall (Statistical Evidence, 1997), a book I heartily recommend, who harks (not the sort mentioned in this note) back to A.W.F. Edwards (Likelihood, 1992) and even earlier work.

  6. Maybe the world needs an interdisciplinary online magazine called The Null Hypothesis that would publish everyone’s studies of “failed” experiments. Readership might be low, but it could act as a repository for studies that deserve to be in the public record.

    • Steve

      On the contrary, I imagine readership would be quite high with academics having an immediately searchable source of research that may save them a great deal of work.

      The ‘null hypothesis’ repository might contain several papers approaching the same subject, each from a different perspective. A researcher might gain a great deal of insight from these studies and perhaps with new technology, be able to take things a stage further.

      But then I’m not a scientist so it seems like a great idea to me, but perhaps not in the real world.

      • Thomas Edison was said to reply “I have learned 1000 things not to do” when he was criticized for all the failed experiments before he finally got the working invention. Sad to note, it was Tesla who looked down on Edison’s lack of mathematical and theoretical savvy, but history and their life histories have been kinder to Edison than Tesla.

        • Except for the fact that our entire electrical power system is now based on alternating current and transformers, thanks to Tesla, instead of the DC system of Edison.

  7. How about this fail-safe test. A) If the raw data doesn’t show at least encouragement that a relationship appears to have some “principal component” legs, statistical analysis is definitely a waste of time. B) If a convincing relationship can’t be demonstrated using existing suitable stat methods, inventing your own statistics to find something is scientific misconduct. A classic example of the latter is Mann’s hockey stick. A real statistician, S. McIntyre, showed the method developed by Mann automatically produced hockey sticks from red noise! This single study has cost the welfare of mankind 10s of $trillions.

    • “Calibration” is used in climate science and is considered a valid technique.

      However, in statistics it is called “selecting on the dependent variable” and is a mathematical error.

      For example, it is mathematically forbidden to filter temperature proxies using temperature as your filter. Temperature is the thing you are trying to study. You have begun your study by skewing the results.

      The calibration problem is widespread in climate science and few climate scientists appear to be aware of their faulty mathematics.

      Suffice to say, any climate paper involving proxy calibration is likely junk science.

      • Publish or Perish would be fine if there was no bias towards positive results.

        The problem is that spending a million dollars to find apricots don’t cure cancer is not likely to win you any future funding.

        Publish a study showing apricots cure cancer and now you have something.

        • “Publish or Perish would be fine if there was no bias towards positive results.”

          Sciency version of the MSM: If it bleeds, it leads.
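The concern raised a few replies up, about screening proxies against the very quantity being studied, can be illustrated with a toy simulation. This is a sketch under loud assumptions: every "proxy" here is pure white noise, the "temperature" target is a made-up linear rise over a 50-point calibration window, and the screening rule (keep proxies whose correlation with the target exceeds 0.25) is hypothetical. It says nothing about any particular published reconstruction, whose methods differ in many ways.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def composite(series_list):
    """Point-by-point average of several equal-length series."""
    return [sum(values) / len(series_list) for values in zip(*series_list)]


def screening_demo(n_proxies=1000, length=200, calibration=50, cutoff=0.25, seed=5):
    rng = random.Random(seed)
    # Hypothetical target: a steady rise over the calibration window.
    target = [i / (calibration - 1) for i in range(calibration)]
    # Every proxy is white noise, so none of them contains a real signal.
    proxies = [[rng.gauss(0.0, 1.0) for _ in range(length)]
               for _ in range(n_proxies)]

    # Screening step: keep only proxies that happen to correlate with the target.
    selected = [p for p in proxies if pearson(p[-calibration:], target) > cutoff]

    return {
        "n_selected": len(selected),
        "screened_r": pearson(composite(selected)[-calibration:], target),
        "unscreened_r": pearson(composite(proxies)[-calibration:], target),
    }


if __name__ == "__main__":
    for key, value in screening_demo().items():
        print(key, round(value, 3))
```

In this setup the average of the screened noise series tracks the target closely inside the calibration window, while the average of all series does not, even though no series carries any signal. Whether and how strongly this applies to a real analysis depends on details the toy model leaves out.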

    • The bottom line. If you see any study that separates data into “good” and “bad” the results are immediately suspect.

      Maybe they are stripping off outliers, or filtering or weighting the rows. Anytime the investigator in some fashion says “I believe this data more than I believe that data”.

      As soon as the experimenter looks at the data to decide the next step in the analysis, the results can no longer be trusted, because there is no need to look at the data.

      If your hypothesis is correct, there is no need to examine the data prior to testing the hypothesis.

      In the case where you are likely wrong, examining the data helps make your hypothesis appear correct.

      The exact opposite of why we test a hypothesis. But of course you want to prove your hypothesis correct, so the temptation to peek at the data is irresistible. After all, no one else will know.

      • “Maybe they are stripping off outliers, or filtering or weighting the rows. Anytime the investigator in some fashion says “I believe this data more than I believe that data”.”

        This was part of my job for a few years when I was a civil servant: had to make the bosses look good, or at least make the old bosses look bad (i.e., current government needs to blame former government).

        So…you had 10 years of data and it makes things look bad.

        You can cut a few of the earlier years because the data is “old and soft”, i.e., weren’t really looking at the same thing, apples and oranges, etc.

        Maybe build a proxy in for a year or so to smooth out those variables.

        Then you cut off a few current years because the data hasn’t been audited or verified completely.

        Boom! Good news…

      • If you see any study that separates data into “good” and “bad” the results are immediately suspect.

        What’s suspect here is that the holder of such pedantic, academic views has never taken any scientific instrument into the field. In actual practice, along with usual questions of calibration, there are many physical factors that can produce “data” that do not accurately capture what one wants to measure: man-made sources of heat impinging upon thermometers, birds sitting on directional vanes of anemometers, barnacles encrusting underwater pressure sensors, etc., etc.

        Data validation is an indispensable requirement in any scientific enterprise. What has been the chronic problem in “climate science” is the eagerness to grab any data that supports the grant-getting hypotheses and the reliance upon quaint, ad hoc methods of establishing “statistical significance” in the face of pitifully short records. Well-founded, universal methods of signal analysis remain terra incognita. And if the barriers to reliable causal inference presented by short, vetted records are not enough, there’s always the butchery of low-frequency components performed by “scalpeling” corrupted station records into disjointed snippets of data under the false premise that temperature signals are simply “red noise.”

    • Gary… Perhaps I am ‘way off base’ here, but I am struck by how your post relates (in my mind) to what the MSM spews to the masses, and the actual resulting benefits to humanity, of Trump’s policies.

  8. Well to Dr Bishop I would respond:

    When you create and perpetuate an academic tenure attainment system that is publish or perish, this is what you get.

  9. publishing bad papers is not even the biggest problem….left alone, they would self correct

    the biggest problem is once a paper is published….the misconception of “peer review”

    ..and that paper will be built on and cited a million times in following papers

    making it almost impossible to dispute

  10. Science has been reimagined with assumptions/assertions, recharacterized through inference, reconstructed as models, and held together with a “consistent with” mythos, a veritable conflation of logical domains. That said, we cannot even reach a [political] consensus when human life begins, or basic concepts of semantics, logic, ethics, morality, and biology, that survive urbane choices, which vary as the stork flies.

    • nn: There has been no serious discussion about when life begins. The pro-choice forces avoid this.

      A serious discussion would lead more assuredly than ever to the view that has prevailed for about 100 years: life begins at conception, or surely within an hour afterwards, if not at that moment. There are eight ways to Sunday to “cite” this. First, this has been in the embryology texts going back 60 years or more, except for recent years where the matter must be avoided for political reasons. Second: AMA held the position that life begins at conception, until they decided, based on political influences, that the matter should be ignored. Third: Planned Parenthood held the position that abortion killed a genuine human being in the 1950s, and a former president of PP clearly said this, in a magazine interview, in the 1990s.

      In a milestone PP event, 1959’s conference “Mechanisms Concerned with Conception,” a scholarly meeting to share knowledge of reproduction with the aim of advancing research in birth control, one author concludes their talk with a dismissive conclusion that “If, then, implantation, like life, is a continuity of mechanisms, how shall we say when it begins…Whether eventual control of implantation may be reserved the social advantage of being considered to prevent conception rather than to destroy an established pregnancy could depend upon something so simple as a prudent habit of speech.” Boving, Implantation Mechanisms, CG Hartmann, Ed., Mechanisms Concerned With Conception, 1963 Pergamon Press.

      Finally, you cannot find a serious analysis of whether we should draw the line at “heartbeat,” “viability,” etc., anywhere. The story of how Roe v. Wade arrived at second trimester – states have liberty to ban abortion at second trimester – is fairly well covered in B. Nathanson’s book “Aborting America,” where he tells how this “compromise” was arrived at in NY state politics, and then floated up to influence SC.

      The pro-choice crowd does not want a debate. They decide “there is no consensus” because they have decided, for political reasons, that it is better to look the other way than to solidly lose such arguments, that once they claim “controversy,” or “lack of consensus,” they can ignore the clear matter of when life begins and move on to use other principles to “guide” the evaluation of whether abortion should be legal.

      We could do the same for immunizations. “There is controversy,” therefore let us leave this “personal choice” to a “woman and her physician.”

      To prove I am right: start a discussion on the web, somewhere in social media, and stick to the point that life begins at conception, per scientific consensus, and see how an opponent carries on rhetoric. They will NOT stick to a discussion of when life begins. They will not provide some evidence or citation that it begins at implantation, or heartbeat, or viability, or personhood, or “sentience,” or other candidate demarcations.

      Rather, they have two tactics that are inevitable: “change the topic,” and “name-calling.”

  11. All carefully conducted research is valuable, even if the conclusion is that the theory under examination is not correct. This is just as valuable as finding something new, which is rare at best. So I wholeheartedly support publication of all meaningful research no matter the conclusion. I do wish negative results were put out for examination, as there may have been something in the original hypothesis that was not supported because of poor experimental design. Poor design of experiments is probably the most common issue we face, as doing it right is often difficult, expensive, or both. And everyone loves a shortcut.

  12. My unscientific study shows that Government controlled taxpayer dollars can and will buy any scientific results the Swamp requires along with the statistics to back it.

  13. show how easy it was to find an effect if you explored the data after collecting results
    =========
    Unfortunately there is no way to detect this. If I collect a bunch of data and then filter out the records that don’t support my hypothesis, there is no way anyone can detect this, unless I am dumb enough to place them in a folder labelled “censored”.

    • Or you can just respond that “No, you can’t see my raw data because you just want to find errors in my data”.

    • And if anything, the “censored” folder showed that the experimenter in question didn’t understand why this was mathematically forbidden.

      And why to this day the faulty practice continues. The researchers don’t know enough outside their own field to recognize their error.

      After all, it does sound reasonable. Filter or weight the proxy records to separate the good from the bad to improve the signal to noise level, to tease out a relationship everyone else had missed.

      And the beauty is, that so long as other researchers continue to use your faulty technique, they will be able to reproduce your faulty conclusions!

      This is the hidden problem. Faulty conclusions are reproducible when you use the same faulty methods.

      • “And if anything, the “censored” folder showed that the experimenter in question didn’t understand why this was mathematically forbidden.”

        If you think it was misunderstanding, then your rose-colored glasses are REALLY thick.

    • We can.

      If an article does not include description of the wording of questions, or how the sample was gathered, it is suspect. The way the “scientific community” should regard this is that a study is automatically suspect to be vulnerable to bias; there is no need to argue or prove the bias.

  14. Years ago, I had the same problem when handicapping horse races. I discovered that finding an apparently “statistically significant” correlation between past performance and winning probability was merely a FIRST step. Once I found such a correlation, I had to test it against NEW races to rule out the possibility of data fishing.

    • Same here, but with the dog races in Florida.

      One day I won $75 or so (a lot for a teen in the 80s), with a foolproof method (I believe it had to do with last three races and if it did its business on the track before the race).

      Next time, though, I lost it all. Seems that the first method only worked on Tuesdays (or when it was overcast, or overcast on a Tuesday). It certainly didn’t work on a sunny Thursday…

      Problem is, I wasn’t sophisticated enough to get a government grant to continue my studies.
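The retest-on-new-races step from the handicapping comment above can be sketched as a simulation. Everything here is an illustrative assumption: 40 candidate "factors" and 50 past races, all generated as unrelated noise so that no factor has real predictive value. The function names are hypothetical.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def noise(rng, n):
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def fish_then_retest(n_factors=40, n_races=50, repeats=200, seed=3):
    """Average strength of the best fished correlation on past races versus
    the correlation of that same (worthless) factor on new races."""
    rng = random.Random(seed)
    fished_total = 0.0
    retest_total = 0.0
    for _ in range(repeats):
        # Past races: outcome and every candidate factor are unrelated noise.
        outcome = noise(rng, n_races)
        factors = [noise(rng, n_races) for _ in range(n_factors)]
        # Data fishing: keep whichever factor looks best in hindsight.
        fished_total += max(abs(pearson(f, outcome)) for f in factors)
        # New races: the chosen factor is still unrelated to the outcome,
        # so fresh noise stands in for it.
        retest_total += abs(pearson(noise(rng, n_races), noise(rng, n_races)))
    return fished_total / repeats, retest_total / repeats


if __name__ == "__main__":
    fished, retested = fish_then_retest()
    print(f"average best correlation on past races: {fished:.2f}")
    print(f"average correlation on new races:       {retested:.2f}")
```

The fished correlation looks respectable only because it was the best of many tries; on new races it shrinks back toward what chance alone produces, which is why the second test on fresh data matters.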

  15. Anyone really interested in this topic, especially as it applies to biomedical research, should get a copy of the book Rigor Mortis: How Sloppy Science Creates Worthless Cures, Crushes Hope, and Wastes Billions by Richard F. Harris. I got my copy from the local library.

    If you want to know why so many of the authors here, and the growing number of mainstream scientists are pushing back so hard against the rising tide of terrifying predictions about future climate change, you should read this book and realize that biomedical research is orders of magnitude stricter in the present than climate science (where anything goes, even upside down data).

    Much of biomedical research is simply wrong — not entirely worthless, but more harmful than helpful.

    As for climate science? There are some scientists really trying to figure out how the climate system works, atmospheric dynamics and oceanic circulations folks. I believe they are hampered in their good work by the journals and their colleagues that demand that all results must toe the lines drawn by the IPCC.

    The vast majority of climate science that one reads about in the media is not only wrong — it is dangerously wrong and causes harm. (See recent press on “climate anxiety”)

    • My granddaughter just graduated from medical school and when she entered a dean told the class that half of what they would learn would be wrong. We just went to her graduation and were impressed with what we saw. It was interesting that although she had a degree and great interest in neurobiology, she and many others are going into family practice. Apparently this is advancing and attracting those who are interested in problem solving. I just saw a family example, hopefully successful, from a problem solving doctor.

      Replication in many areas of research is hard enough, especially so with human constrictions. Specialization is often important but the lateral type that crosses disciplines is often more successful. My grandfather was a successful problem solving auto mechanic. My son found one to fix his truck.

    • ” I got my copy from the local library.”

      That wasn’t your copy, it belongs to the library.

  16. Aside from the reasons cited below (above?) in the comments, the problem arises from the misunderstanding of ‘statistical significance.’ When Fisher concocted that notion, and the use of p=0.05 as a criterion, his idea was to use the results – almost always done on very small samples since computers were not available – to indicate when it might be worthwhile to do a deeper study with much larger sample sizes.

    When I learned design of experiments and statistical testing in graduate school, we were taught also to compute the false positive probability for every experiment. What most scientists and engineers don’t know is that the false positive rate in a small sample size with p=0.05 is about 0.6. In other words, 60% of ‘statistically significant’ findings are false positives (wrong).

    It’s not a solution – especially to the misunderstanding of what the p statistic means – but current doctrine is to require that a publishable positive result be shown to be ‘significant’ at p=0.005. That will disqualify about 90% of false positive results.

    • 60% of ‘statistically significant’ findings are false positives (wrong).
      ======
      Exactly. Test 100 athletes with a test that has a 5% false positive rate, where only 3% are actual drug users. You get 8 hits: 3 real, 5 false. That is 5 out of 8, a 62.5% chance that the accusation of drug use is false.
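The athlete arithmetic above is a base-rate calculation and can be written out explicitly. The sketch below assumes, as the comment implicitly does, that the test catches every real user (sensitivity of 1.0); the function name and parameters are hypothetical. Without rounding, the expected number of false hits is 4.85 rather than 5, so the exact share of false accusations is about 61.8% instead of 62.5%.

```python
def false_accusation_share(population, prevalence, false_positive_rate,
                           sensitivity=1.0):
    """Expected true hits, false hits, and the share of hits that are false.

    sensitivity=1.0 is an assumption: every real user tests positive.
    """
    users = population * prevalence
    clean = population - users
    true_hits = users * sensitivity
    false_hits = clean * false_positive_rate
    share_false = false_hits / (true_hits + false_hits)
    return true_hits, false_hits, share_false


if __name__ == "__main__":
    true_hits, false_hits, share = false_accusation_share(100, 0.03, 0.05)
    print(f"true hits:  {true_hits:.2f}")
    print(f"false hits: {false_hits:.2f}")
    print(f"share of positive tests that are false: {share:.1%}")
```

The same function shows why a stricter threshold helps: with a 0.5% false positive rate and everything else unchanged, the false share drops to roughly 14%.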

  17. “The reproducibility crisis in science”

    What is it we would need to reproduce for the AGW hypothesis?

    The original Groupthink? The original Cargo Cult Science?

    Maybe we could go back to basic physics and attempt to reproduce how a small anthropogenic addition to a trace gas which has the property of being able to absorb and emit photons in the 15 micron wavelength is able to raise the temperature of that object which emitted the IR in the first place? And also take into consideration that while these radiative exchanges are taking place, atmospheric molecules are messing things up by exchanging energy by conduction at a vastly faster rate and moving it around by convection.

    It’s BS, Jim, but not as we know it.

  18. Perhaps it is the change from “testing your hypothesis” in hope of finding a correct answer to “proving your hypothesis” in hope of finding fame and fortune.

  19. At this stage it is worth remembering that climate ‘science’ doesn’t follow the rules seen in other sciences and that, oddly, these issues are not a problem in this area. That by failing to follow these rules it proves it is not a science at all and is closer to marketing, astrology and religion is another problem.
    But failing to meet the standards required in practicing good science is not a problem when you simply do not practice good, or any other type of, science in reality.

    • closer to marketing, astrology and religion is
      ======
      Astrology has a bad rap. Astrology was used to predict the seasons long before humans had any idea of the cause. Astrology is still used to predict the chaotic ocean tides.

      In that respect it could be said that astrology is one of the few approaches that can provide an accurate solution to solving a chaotic system.

      Look outside your house at night. If you can’t see the moon or stars, odds are it is going to rain. The astrologer’s guide to weather forecasting. No fancy computer required.

  20. It’s not only the 4 horsemen that are the problem. The huge elephant in the room is the politicization of science. Something Eisenhower saw coming.

    • That’s it, exactly.

      As science becomes more about advocacy than facts, it’s gonna get worse, too.

      I don’t like hyperbole, but the current drift into “climate extinction” and the ridiculous street theatre that is being performed by grown adults (let alone uneducated children) reminds me of suicide cults. It truly does.

      What will happen when even more people turn off the hysterical messaging and just get on with their lives? What stunts can they come up with next?

      I am legitimately worried that self-harm in the name of “saving the world” could very well become a theme among impressionable teens and younger. And it only takes a few to get a horrible trend started.

      Seriously, the Tide Pod Challenge generation scares me.

  21. Tabloid Science is an embarrassment that has few ways of being addressed. It will continue until professional orgs step up. But alas, they have been compromised by agenda science.

  22. Burn it over. Decimate the funding. Most of it is crap whose absence we would never notice. Science has become a cancer threatening the people. Aggressive chemo is in order. You cannot reform a cesspool. What, too much for you? What did you expect? Sanity fights back and it ain’t always polite.

  23. The author talks as if this affects all sciences; in actuality this crisis affects mainly the “soft sciences” like psychology. Some of it has spilled over into medicine, however. That is big money.

      • Usually attributed to Ernest Rutherford rather than Lord Kelvin: “In science there is only physics; all the rest is stamp collecting.”

    • “That is big money.”

      Yes, that is part of why most of the bad papers are in medicine; Big Pharma and Big Fast Food are obviously easy targets.

      There is also big money now in climate alarmism but, as mentioned, apparently the science there is pristine, as we don’t see the same vigilance from the likes of:

      https://retractionwatch.com/

  24. My favorite paper is one that I was only a coauthor of. A little different to the problems discussed but still a big problem.

    It was my simple model that explained a strange result of a colleague. It had three variable parameters but such a strange result that we were confident that it was right. Two values from an almost perfect fit to the data were well within expected values. We didn’t have a clue about what the third should be but I suggested a simple experiment to measure it. It came back an order of magnitude off.

    The interesting part to the story is that we published our work in a good journal and included the evidence that we were wrong. Went down well with reviewers and readers. You don’t see scientists debunk their own work enough.

  25. Separate data to test prediction from data used to model.
    McKitrick and Christy provide a clear example of how to explicitly separate the data used to tune global climate models from the data used to test their predictions. The IPCC’s models strongly fail their most sensitive prediction of the anthropogenic signature of the highest warming in the tropical tropospheric temperature.

    McKitrick, R. and Christy, J., 2018. A Test of the Tropical 200- to 300-hPa Warming Rate in Climate Models. Earth and Space Science, 5(9), pp.529-536.
    https://agupubs.onlinelibrary.wiley.com/doi/pdf/10.1029/2018EA000401

    Abstract
    Overall climate sensitivity to CO2 doubling in a general circulation model results from a complex system of parameterizations in combination with the underlying model structure. We refer to this as the model’s major hypothesis, and we assume it to be testable. We explain four criteria that a valid test should meet: measurability, specificity, independence, and uniqueness. We argue that temperature change in the tropical 200- to 300-hPa layer meets these criteria. Comparing modeled to observed trends over the past 60 years using a persistence-robust variance estimator shows that all models warm more rapidly than observations and in the majority of individual cases the discrepancy is statistically significant. We argue that this provides informative evidence against the major hypothesis in most current climate models.

    Third, the independence criterion means that the target of the prediction must not be an input to the empirical tuning of the model. Once a model has been tuned to match a target, its reproduction of the target is no longer a test of its validity. In the case of GCMs, this rules out using the global average surface temperature record for testing, since during development models are often adjusted to broadly match its evolution over time.

    GCMs unanimously project that warming will reach a global maximum in the tropics near the 200- to 300-hPa layer, due to the so-called negative lapse rate feedback (National Academy of Sciences, 2003) and that the warming will occur rapidly in response to increased greenhouse forcing.

    Third, by focusing on the 200- to 300-hPa layer we avoid contaminating the test by searching for a signal to which the models were already tuned. The surface temperature record is ruled out for this reason, but satellite-based lower- and middle-troposphere composites are also somewhat contaminated since they include the near-surface layer in their weighting functions. Radiosonde samples measure each layer of the atmosphere independently, not simply as a gradient against the surface.

    Fourth, simulations in the IPCC AR4 Chapter 9 (Hegerl et al., 2007) indicate that, within the framework of mainstream GCMs, greenhouse forcing provides the only explanation for a strong warming trend in the target region.

    Table 2 lists the model-specific comparisons against the average observational series. In the restricted case, 62 of 102 models reject, while in the general case, 87 of 102 models reject. It is striking that all model runs exhibit too much warming and in a clear majority of cases the discrepancies are statistically significant.

    If tuning to the surface added empirical precision to a valid physical representation, we would expect to see a good fit between models and observations at the point where the model predicts the clearest and strongest thermodynamic response to greenhouse gases. Instead, we observe a discrepancy across all runs of all models, taking the form of a warming bias at a sufficiently strong rate as to reject the hypothesis that the models are realistic. Our interpretation of the results is that the major hypothesis in contemporary climate models, namely, the theoretically based negative lapse rate feedback response to increasing greenhouse gases in the tropical troposphere, is incorrect.
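
    The independence criterion quoted above (do not test a model on the data it was tuned to) can be shown with a toy example. This is a sketch only: the "observations" are synthetic numbers made up for illustration, not climate data, and the one-parameter line `y = k * x` and the helper names `tune_slope` and `rmse` are assumptions, not anything taken from the paper.

```python
import math

def rmse(predicted, observed):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))

def tune_slope(xs, ys):
    """Least-squares slope for y = k * x (a line through the origin)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Synthetic "observations" that level off over time (made up for illustration).
xs = list(range(20))
ys = [10.0 * (1.0 - math.exp(-x / 5.0)) for x in xs]

# Tune on the first half only; hold the second half back as the independent test.
tune_x, tune_y = xs[:10], ys[:10]
test_x, test_y = xs[10:], ys[10:]

k = tune_slope(tune_x, tune_y)
tuning_rmse = rmse([k * x for x in tune_x], tune_y)
holdout_rmse = rmse([k * x for x in test_x], test_y)

print(f"tuned slope k = {k:.3f}")
print(f"RMSE on tuning data   = {tuning_rmse:.2f}")
print(f"RMSE on held-out data = {holdout_rmse:.2f}")
```

    The error on the tuning data is small by construction, so it says little about whether the model is right. Only the held-out error is a test, and here it is several times larger.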

    For further such testing see:
    Varotsos, C.A. and Efstathiou, M.N., 2019. Has global warming already arrived? Journal of Atmospheric and Solar-Terrestrial Physics, 182, pp.31-38.

    Richard Feynman describes a similar exemplary exhaustive test of rats running a maze – that sadly was ignored. Cargo Cult Science, Caltech 1974

    For example, there have been many experiments running rats through all kinds of mazes, and so on—with little clear result. But in 1937 a man named Young did a very interesting one. He had a long corridor with doors all along one side where the rats came in, and doors along the other side where the food was. He wanted to see if he could train the rats to go in at the third door down from wherever he started them off. No. The rats went immediately to the door where the food had been the time before.
    The question was, how did the rats know, because the corridor was so beautifully built and so uniform, that this was the same door as before? . . .He finally found that they could tell by the way the floor sounded when they ran over it. And he could only fix that by putting his corridor in sand. So he covered one after another of all possible clues and finally was able to fool the rats so that they had to learn to go in the third door. If he relaxed any of his conditions, the rats could tell.
    Now, from a scientific standpoint, that is an A‑Number‑1 experiment. That is the experiment that makes rat‑running experiments sensible, because it uncovers the clues that the rat is really using, not what you think it’s using. And that is the experiment that tells exactly what conditions you have to use in order to be careful and control everything in an experiment with rat‑running.

    • David –> A very good explanation.

      I have always wondered about how a model that is built to match an input to the output you desire could be a valid predictor of the future. It is an extreme case of bias.

  26. There was a time that playing too loose with data or claiming significance on questionable data would destroy your career. But now political patronage protects such behavior.

    • Step 1: remove classical education (too many white males), which taught you to actually think
      Step 2: create modern math (too hard), which requires belief, not proof
      Step 3: wait 18 years until the students who have been “educated” in Steps 1 and 2 can vote
      Step 4: get elected

  27. Wow …. this article gets at the heart of why I left academia. I couldn’t stomach the dishonesty and lack of integrity. The problem is ego!! Scientists just can’t stand to be wrong. Medical scientists can’t stand the thought that maybe their pet idea wasn’t so grand after all. Climate scientists can’t take that their CO2 climate theory is rubbish. So they fudge data, fudge statistics, and jump to publish the positive results and hide the negative results.

    It’s not surprising that “Climate Science” has found a home among the most dishonest group of people in existence …… Politicians. By default, the scientists themselves gravitate to their lowest qualities of egoism in an environment where such is celebrated.

    Sad …. just really sad.

  28. Lawyers should save this article in an active file. It can serve them well as a general tool to de-bunk an “expert” in the courtroom.

  29. At least part of the problem is caused by the fact that too many papers appear without the raw data. Even when data is attached as an appendix, it has often evidently been redacted. Caligula Jones cites how it is done in the civil service.

    I once had the job of assessing risk to residents of aircraft flying over their properties on landing or take-off. The Civil Aviation Authority model ignored all such accidents that had not occurred within a few hundred meters of a runway. These discarded “outliers” were very rare but the total number of non-passenger deaths that they had caused far outweighed the total number of non-passenger deaths that had been caused by the rare, but still statistically significant, accidents in the CAA’s artificially curtailed sample.

  30. One of my best ones was when, at a conference I complained to a ‘heavy’ that the compound he had claimed was specific for a particular receptor subtype was pretty ‘dirty’. He said the mistake I made was not to work with the new compounds ‘when they were still specific’.

  31. Many, many new moons ago (almost 500) I was told that you can write a “failure” thesis for your Masters, but never for your PhD. Thus begin the tiny acts of putting your pinky on the scale.
