Raising the bar on statistical significance

I was searching the early edition of PNAS for the abstract of yet another sloppy “science by press release” that didn’t bother to give the title of the paper or the DOI, and came across this paper, so it wasn’t a wasted effort.

Steve McIntyre recently mentioned:

Mann rose to prominence by supposedly being able to detect “faint” signals using “advanced” statistical methods. Lewandowsky has taken this to a new level: using lew-statistics, lew-scientists can deduce properties of population with no members.

Josh (N=0) humor aside, this new paper makes me wonder how many climate science findings would fail evidence thresholds under this new proposed standard?

Revised standards for statistical evidence

Valen E. Johnson

Significance

The lack of reproducibility of scientific research undermines public confidence in science and leads to the misuse of resources when researchers attempt to replicate and extend fallacious research findings. Using recent developments in Bayesian hypothesis testing, a root cause of nonreproducibility is traced to the conduct of significance tests at inappropriately high levels of significance. Modifications of common standards of evidence are proposed to reduce the rate of nonreproducibility of scientific research by a factor of 5 or greater.

Abstract

Recent advances in Bayesian hypothesis testing have led to the development of uniformly most powerful Bayesian tests, which represent an objective, default class of Bayesian hypothesis tests that have the same rejection regions as classical significance tests. Based on the correspondence between these two classes of tests, it is possible to equate the size of classical hypothesis tests with evidence thresholds in Bayesian tests, and to equate P values with Bayes factors. An examination of these connections suggest that recent concerns over the lack of reproducibility of scientific studies can be attributed largely to the conduct of significance tests at unjustifiably high levels of significance. To correct this problem, evidence thresholds required for the declaration of a significant finding should be increased to 25–50:1, and to 100–200:1 for the declaration of a highly significant finding. In terms of classical hypothesis tests, these evidence standards mandate the conduct of tests at the 0.005 or 0.001 level of significance.
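To get a feel for how the abstract equates test sizes with evidence thresholds, here is a rough sketch in Python. It assumes the simplest case, a one-sided z-test, and assumes the correspondence gamma = exp(z_alpha^2 / 2) between the size alpha and the Bayes-factor threshold gamma. That formula is my reading of the z-test case and is not quoted in the abstract above; the function name is made up for illustration.

```python
import math
from statistics import NormalDist


def umpbt_evidence_threshold(alpha: float) -> float:
    """Bayes-factor threshold matched to a one-sided z-test of size alpha.

    ASSUMPTION: uses gamma = exp(z_alpha**2 / 2), the one-sided z-test
    correspondence; other test types would need a different formula.
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)  # upper-tail critical value
    return math.exp(z_alpha ** 2 / 2.0)


for alpha in (0.05, 0.005, 0.001):
    gamma = umpbt_evidence_threshold(alpha)
    print(f"alpha = {alpha}: evidence threshold of roughly {gamma:.1f}:1")
```

Under that assumption, the 0.05 level lands in the low single digits, while 0.005 and 0.001 land inside the 25–50:1 and 100–200:1 bands the abstract proposes.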

From the discussion:

The correspondence between P values and Bayes factors based on UMPBTs suggest that commonly used thresholds for statistical significance represent only moderate evidence against null hypotheses. Although it is difficult to assess the proportion of all tested null hypotheses that are actually true, if one assumes that this proportion is approximately one-half, then these results suggest that between 17% and 25% of marginally significant scientific findings are false. This range of false positives is consistent with nonreproducibility rates reported by others (e.g., ref. 5). If the proportion of true null hypotheses is greater than one-half, then the proportion of false positives reported in the scientific literature, and thus the proportion of scientific studies that would fail to replicate, is even higher.

In addition, this estimate of the nonreproducibility rate of scientific findings is based on the use of UMPBTs to establish the rejection regions of Bayesian tests. In general, the use of other default Bayesian methods to model effect sizes results in even higher assignments of posterior probability to rejected null hypotheses, and thus to even higher estimates of false-positive rates.

This phenomenon is discussed further in SI Text, where Bayes factors obtained using several other default Bayesian procedures are compared with UMPBTs (see Fig. S1). These analyses suggest that the range 17–25% underestimates the actual proportion of marginally significant scientific findings that are false.

Finally, it is important to note that this high rate of nonreproducibility is not the result of scientific misconduct, publication bias, file drawer biases, or flawed statistical designs; it is simply the consequence of using evidence thresholds that do not represent sufficiently strong evidence in favor of hypothesized effects.
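The 17–25% range quoted in the discussion can be reproduced with a few lines of Bayes' rule. The sketch below is only illustrative: it assumes that a marginally significant result corresponds to a Bayes factor of about 3 to 5 in favor of the alternative (my assumption, chosen because it reproduces the quoted range), and the function name is hypothetical.

```python
def posterior_null_probability(bayes_factor: float, prior_null: float = 0.5) -> float:
    """Posterior probability that the null is true after a 'significant' result.

    bayes_factor: evidence for the alternative over the null (e.g. 3 means 3:1).
    prior_null:   assumed proportion of tested null hypotheses that are true.
    """
    prior_alt = 1.0 - prior_null
    return prior_null / (prior_null + prior_alt * bayes_factor)


# ASSUMPTION: Bayes factors of 3 and 5 stand in for "marginally significant".
for bf in (3.0, 5.0):
    print(f"BF {bf:.0f}:1, half of nulls true -> "
          f"{posterior_null_probability(bf):.0%} of such findings are false")

# If more than half of the tested nulls are true, the share gets worse.
print(f"BF 3:1, 80% of nulls true -> {posterior_null_probability(3.0, 0.8):.0%}")
```

With half of all tested nulls true, the two assumed Bayes factors give 25% and about 17%. Raising the assumed share of true nulls pushes the false-finding rate higher, which is the point made in the quoted paragraph.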

=================================================================

The full paper is here: http://www.pnas.org/content/early/2013/10/28/1313476110.full.pdf

The SI is here: Download Supporting Information (PDF)

For our layman readers who might be a bit behind on statistics, here is a primer on statistical significance and P-values as it relates to weight loss/nutrition, which is something that you can easily get your mind around.

Gross failure of scientific nutritional studies is another topic McIntyre recently discussed: A Scathing Indictment of Federally-Funded Nutrition Research

So, while some dicey science findings might simply be low threshold problems, there are real human conduct problems in science too.

219 Comments
Chris4692
November 12, 2013 10:00 am

polistra says:
November 12, 2013 at 4:28 am

The only proper standard for science is NO STATS. If a result has to be reached by statistics, it’s not a scientific result. Only well-calibrated measurements plus a well-balanced experiment can give scientific results.

Measurements, no matter how well calibrated and careful, are subject to observational errors and accuracy inherent in the equipment used for measurement. The mathematical method used to evaluate those errors is statistics. Until the results of the experiment are evaluated in light of the potential and real errors of measurement, the results are not scientific.

pouncer
November 12, 2013 10:01 am

rgbatduke:
Are any of the courses you formally teach available for my kids online?

Chris4692
November 12, 2013 10:05 am

A problem inherent in increasing the level of confidence required to accept the hypothesis is that you increase the possibility of rejecting a hypothesis that is actually true. This has not been considered here, but I’ve not yet read the original paper to see if or how that is approached.
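The trade-off described here can be put into numbers with a small power calculation. This is a sketch for a one-sided z-test with known variance only; the effect size of 0.3 standard deviations and the sample size of 50 are made-up illustrative values, not figures from the paper.

```python
from statistics import NormalDist


def one_sided_z_power(effect_size: float, n: int, alpha: float) -> float:
    """Chance of detecting a real effect with a one-sided z-test of size alpha.

    effect_size: true effect in standard-deviation units (hypothetical input).
    n:           number of observations.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha)  # rejection cutoff
    return 1.0 - std_normal.cdf(z_alpha - effect_size * n ** 0.5)


# Same (assumed) real effect and sample size, two different thresholds.
for alpha in (0.05, 0.005):
    power = one_sided_z_power(effect_size=0.3, n=50, alpha=alpha)
    print(f"alpha = {alpha}: power = {power:.2f}, missed real effects = {1 - power:.2f}")
```

At a fixed sample size, the stricter threshold always misses more real effects. Recovering the lost power means collecting more data.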

Gail Combs
November 12, 2013 10:25 am

Chris says: November 12, 2013 at 8:25 am
Please check in on the Warsaw meeting.
http://www.worldcoal.org/
>>>>>>>>>>>>>>>>>>>>>
GACK, Even coal is jumping on the CAGW bandwagon.

November 12, 2013 10:26 am

ferd berple asked, “10 thousand Olympic athletes are tested for drug use with a test that is 99.99% accurate. what are the odds that an athlete identified as a drug user is a drug user?”
In the first place, there are normally two accuracy measures for a test: the percentage of true positives which are correctly reported as positives (i.e., not false negatives), and the percentage of true negatives which are correctly reported as negatives (i.e., not false positives). The two accuracy measures are not often the same, though in some cases it is possible to adjust cut-off thresholds to make them the same.
For real-world tests, there may also be some results reported as “inconclusive.”
But, for the sake of this conversation & simplicity, let’s assume that for this hypothetical drug test both accuracy measures are 99.99% (i.e., 1 in 10,000 erroneous results), and test results are never reported as inconclusive.
Then the answer to your question depends on how many actual drug users there are in the population of 10,000.
Example 1: 10,000 “clean” athletes, zero drug users.
The most likely outcome is that one athlete will be identified as a drug user.
The odds that an athlete identified as a drug user is a drug user are zero.
Example 2: zero “clean” athletes, 10,000 drug users.
The most likely outcome is that 9999 athletes will be identified as drug users.
The odds that an athlete identified as a drug user is a drug user are 100%.
Example 3: 5000 “clean” athletes, 5000 drug users.
Approximately 5000 athletes will be identified as drug users.
The odds that an athlete identified as a drug user is a drug user are 99.99%.
Example 4: 9999 “clean” athletes, one drug user.
The most likely outcome is that two athletes will be identified as drug users.
The odds that an athlete identified as a drug user is a drug user are 50%.
Example 5: 9995 “clean” athletes, five drug users.
The most likely outcome is that six athletes will be identified as drug users.
The odds that an athlete identified as a drug user is a drug user are about 83%.
Note, though, that if you know the accuracy of the test with high confidence, you can infer the approximate percentage of the population which are drug users, even if your confidence in any particular result is inadequate. For instance, for your hypothetical test case, if your test identified six athletes as drug users, you could say with reasonable confidence that some of the 10,000 athletes (best est. 0.05%) are drug users, even though you would not have sufficient evidence to disqualify any of the six suspected drug users from competition.
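For anyone who wants to check the five examples above, here is a short sketch that works them out from expected counts. It uses the same simplifying assumptions as the comment (both accuracy measures equal to 99.99%, no inconclusive results); the function name is made up for illustration.

```python
def positive_predictive_value(users: int, clean: int,
                              sensitivity: float = 0.9999,
                              specificity: float = 0.9999):
    """Return (expected number flagged, share of flagged who really are users)."""
    true_positives = users * sensitivity            # users correctly flagged
    false_positives = clean * (1.0 - specificity)   # clean athletes wrongly flagged
    flagged = true_positives + false_positives
    share = true_positives / flagged if flagged > 0 else float("nan")
    return flagged, share


# (drug users, clean athletes) for Examples 1 through 5.
examples = [(0, 10_000), (10_000, 0), (5_000, 5_000), (1, 9_999), (5, 9_995)]
for users, clean in examples:
    flagged, share = positive_predictive_value(users, clean)
    print(f"{users:>6} users, {clean:>6} clean: "
          f"about {flagged:.1f} flagged, {share:.2%} of them actual users")
```

The answer depends heavily on how many real users are in the pool, which is the whole point of the question.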

Chris R.
November 12, 2013 10:28 am

As pointed out by others, John Brignell’s posts at
http://www.numberwatch.co.uk also argue this point.
He argues persuasively that in the
old days, statistical giants such as R.A. Fisher would
regard the p=0.05 level as merely the first gate
that some hypothesis would have to get through,
not as some kind of “final proof”.

November 12, 2013 10:36 am

Tighter stats requirements do not address the very real problem that correlation does not equal causality.
This problem is especially important whenever system-response metrics are chosen to be mega-scopic functionals; functions of functions that map important details to a single number. By its very name, the global average surface temperature is not a valid system-response metric.
Weather is local, climate is local, valid system response metrics should focus at this scale.
Equally, so long as the energy content for the various sub-systems remains a focus, metrics directly related to the physical scale of the data for each sub-system should be developed. The production, transport, and storage of the phases (solid, liquid, vapor) of water, a critically important sub-system, as calculated by GCMs should be compared with data for that sub-system. Conservation of mass is a good response metric.
Internally to the GCMs, verification of the numerical solution methods is a mathematical problem that does not need stats. Verification, meaning that the coded equations are correctly solved, is a critically important metric that is completely ignored in Climate Science. While ignoring verification, Climate Science insists that evaluation of the models is a valid process. It is not. Verification must always, without exception, precede validation.
For me, looking at tighter stats for mega-scopic metrics has got the problems upside down.

November 12, 2013 10:45 am

Using recent developments in Bayesian hypothesis testing, a root cause of nonreproducibility is traced to the conduct of significance tests at inappropriately high levels of significance.
Huh? “Inappropriately High?” as in Too High? or not high enough?
Finally, it is important to note that this high rate of nonreproducibility is not the result of scientific misconduct, publication bias, file drawer biases, or flawed statistical designs;
Well, not necessarily the result. These do happen. There just may be other reasons.
it is simply the consequence of using evidence thresholds that do not represent sufficiently strong evidence in favor of hypothesized effects.
Ah…. this would classify as flawed statistical design and a bias toward publication of insignificant results. I’m not willing to rule out scientific misconduct at this point. Insufficient skepticism of one’s own work is probably in the mix, too.

Chris
November 12, 2013 10:55 am

John Brignell. A tribute to John Daly (still waiting for greenhouse)
http://www.john-daly.com/
“Most of his admirers around the world never met him, but nevertheless held him in great esteem, simply on the basis of his writings. Forget all the pornographers, mass mailers and virus producers; one Daly is sufficient justification for the existence of the World Wide Web.”
Daly, McKitrick, McIntyre, Pielke, Watts, ……..

William Abbott
November 12, 2013 10:59 am

bwanajohn, I also thank you for posting the economist article. It takes a while but if any of you have time you ought to read it. And they call economics the dismal science.
Here is the excellent article on non self-correcting science and the misuse of statistics.
http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble

TheLastDemocrat
November 12, 2013 11:13 am

M Simon says: “This is the reason so many nutritional findings are reversed. You know. Trans fats are good. Trans fats are bad. Soon to be followed by meat eating is banned. Too many natural trans fats. ”
Often true. Another problem is the problem of the “healthy user” benefiting from the “healthy user effect.” People who TEND to do things that are believed to be healthy also TEND to do other supposedly healthy habits. They will, overall, reap benefits of what really contributes to a long, healthy life – probably regular exercise, a decent diet, and good stress management in the face of life’s slings and arrows via economic stability and/or social relationships.
So, a behavior can have a somewhat BAD influence, but in some longitudinal study, be mathematically seen as healthy. The healthy habits carry the day, and the bad habit is a fellow traveler.
The extreme is the now-well-recognized reversal of hormone replacement therapy from being preventive of heart disease to being a predictor of heart disease: health-conscious women accepted and sustained their hormone prescriptions, until a randomized controlled trial tested this with much less ‘healthy user’ bias.

November 12, 2013 11:21 am

DirkH November 12, 2013 at 6:06 am
I have read anecdotes that go both ways. The metabolism of the body can change over time. Do post menopausal women metabolize differently? We do know that the metabolism changes with age.
I have no dog in the fight. I personally prefer a meat diet.
And what are the odds that the body is “trained” by early consumption? How big is that effect? Is it an effect?
Are cyclists representative of the general population?
More study is required.

TheLastDemocrat
November 12, 2013 11:23 am

Regarding Mann: assuming he did everything properly, his model in the end is not quite scientific. And no forecast or estimate of the future ever can be. Likewise, it will be quite a day when “evolution” is scientifically confirmed.
In the long run, science depends upon making an a priori prediction, specifying a disprovable test of that prediction, then gathering that actual evidence to observe whether the prediction is accurate or not. We do not know what will happen in the future. A prediction can be based on some good science, but a prediction of what the future will be like can never be observable – until it happens, at which time it is no longer the future. So, patently, a prediction cannot quite ever be “confirmed,” or “scientific” in the way other things can be scientific facts.
Beyond micro-evolution, evolutionary theory suffers the same weakness. We will never see the cow-like animal again adapt itself back to underwater life as the whale-like animal. Evolution of the various species makes sense, and has a lot of evidence to support it, but it is not observed. If our species pays attention and keeps record long enough, say, a million years, sure, we might observe a new species emerge from recognized species. But we have not yet.

JEM
November 12, 2013 11:25 am

John Whitman – the paragraph about ‘this isn’t scientific misconduct’ really sounds to me like it was helicoptered in to mollify some editor or reviewer.
Clearly, there are several potential reasons for choosing a ‘relaxed’ confidence interval. Maybe it’s appropriate to the situation. Maybe it’s just “the way it’s always been done” in that field. Maybe it’s what a grad student found in some statistical cookbook. Or someone fiddled up a spreadsheet and plugged in numbers until they got the results they “needed”. At some point it DOES slide far enough down the continuum to move past laziness and ignorance into misconduct.

November 12, 2013 11:31 am

Bill Marsh at 4:42 am: +5 “Numbers are like people…” Made my day!
chris y 5:05 am, +1 briggs quote.
Bill Illis 5:17 am, +2 The only statistics that count: $$$$. +2
rgbatduke 6:45 am +1 Always worth a read (and reread).
MCT 6:53 am +1 statistics (physics, causal) vs. stochastics (climate, correlative). I like the concept.
Jquip 9:31 am, +1 But this paper is precisely about non-replication.
Brad Tittle 9:16 am “Where is the damn Like button, + ?” Do it yourself. 😉
Gail Combs 9:53 am +1 Independent confirmation testing/splitting.

Chris4692
November 12, 2013 11:33 am

Dan Hughes says:
November 12, 2013 at 10:36 am

Tighter stats requirements do not address the very real problem that correlation does not equal causality.

Correlation is the first step in a sorting process. If there is causation there will be correlation. If there is no correlation look elsewhere for possible mechanisms of cause between variables. It’s more productive to look for causation first among the variables that are most highly correlated.
Correlation is just a clue. The problem is when it is taken as proof, which is a problem with the scientist not the procedure.

Janice Moore
November 12, 2013 11:33 am

JEM — in case you didn’t see my post soon after yours on the heart attack thread on 11/10 (perhaps you did…) — HOW ARE YOU DOING? I hope all is going very well. I’ll keep praying, but, IT WOULD BE NICE TO KNOW!
#(:))
JM

rogerknights
November 12, 2013 11:35 am

ferd berple says:
November 12, 2013 at 7:25 am
Perhaps the biggest assumption in modern science is the assumption that the purpose of science is to conduct research, to discover “The Truth”.
What if the true purpose of science is to attract funding? What scientific body would welcome a study that said “nothing to worry about, you can cut our funding”?

I posted the following on CA yesterday:

Daniel S. Greenberg is a Washington-based journalist who has recently turned to fiction after a long career writing about science policy and politics. He is the author of three non-fiction books, “The Politics of Pure Science,” “Science, Money, and Politics,” and “Science for Sale,” all published by the University of Chicago Press. His novel, “Tech Transfer: Science, Money, Love, and the Ivory Tower,” published in 2010, was described by the New York Times as “a hilarious” and “mordant satire about scientists and universities and how they do business.” (NY Times, Science section, book review, May 25, 2010).
Greenberg has served as a reporter for the Washington Post, as news editor of Science magazine, and as a columnist for the New England Journal of Medicine and The Lancet. For many years, he wrote an op-ed column that appeared in the Washington Post and other newspapers, and contributed to many publications, including the New York Times, the Economist, Harper’s, Smithsonian, Nature, and The Chronicle of Higher Education. He founded and for 25 years edited Science & Government Report, an international newsletter which was acquired by John Wiley & Sons in 1997.

Some of his books are cheap in used versions, available here:
http://www.amazon.com/Daniel-S.-Greenberg/e/B001HD15GW/ref=sr_ntt_srch_lnk_1?qid=1384202495&sr=1-1

November 12, 2013 11:35 am

TheLastDemocrat November 12, 2013 at 9:39 am
The Environment and Disease: Association or Causation?
Austin Bradford Hill
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1898525/

November 12, 2013 11:39 am

Evolution of the various species makes sense, and has a lot of evidence to support it, but it is not observed.
It is observed for fast reproducing species. Bacteria come to mind. From this the rest is inferred. Some do not care for the inference. They may (or may not) have a point.

November 12, 2013 11:40 am

Gail Combs says:
November 12, 2013 at 9:53 am
“Taking ferd berple’s example of drug testing.”
Drug testing of employees is not science, and in any case a drug test with 95% confidence is a cheap screening test that needs a retest whenever it comes back positive.
“In science it is independent replication usually by at least two independent labs. If it can not be replicated it gets tossed into the dustbin of history.
I see no reason to change this but the confirmation testing by an independent lab or two or three is an absolute must and this is what is missing in Climate Science.”
Confirmation is matter of church, not science.
The point in science is not confirmation or replication; the point is falsification. If something is confirmed, then good, but if not, then it belongs in the dustbin of history. The point of the scientific method is falsifiability (a hypothesis which is not falsifiable is not a scientific hypothesis by definition). A claim such as “man is responsible for global warming, because we made model XYZ123, tuned it, and it shows he is, and we predict such-and-such a temperature rise with it” is a dangerous claim in science, because as soon as reality shows that the temperature is significantly not following the model, the model is falsified and belongs in the dustbin of history. You actually don’t need anything like “confirmation” for it.

November 12, 2013 11:48 am

DirkH November 12, 2013 at 9:53 am
Harald Eia and his series Hjernevask or Brainwash

KNR
November 12, 2013 12:28 pm

In climate ‘science’ the only effect that raising the statistical bar will have is more run time on the computers to ‘ensure’ they get the results ‘required’.
It’s simple, really: start with your ‘results’, then produce the ‘data’ that supports them.

Jquip
November 12, 2013 12:51 pm

M Simon: ” Bacteria come to mind. From this the rest is inferred. Some do not care for the inference.”
You missed the point in that we haven’t observed bacteria turn into bivalves. And we will not reunobserve bivalves going backwards through time into bacteria. As we go through time in the other direction. It is not testable, and in a condition of passive observation that requires time machines, impossible. E.g., when you can’t slap it up on a lab table, on demand, the best you get is passive observation. And in this, unless you have time machines, time goes one way and at its pace; not ours.
People get a rather religious burr about it when you mention the ‘E’ word from the great Chuck D. But it remains that anything that is inherently stateful and chaotic, and that cannot be built on demand in a lab, has no valid inferences but an absolute crapload of observations. Simply look at the condition of astronomy and the ages of time it took to get from rather impressive Neolithic devices for measuring the heavens on to about Galileo. And that’s for something as simple as a first-order approximation of an ellipse. This remains true in every similar case, even for treemometers and IPCC models.

Adam
November 12, 2013 1:30 pm

You mean you want us to run more simulations with the same model, or set of models, to bring those error bars in tight? I hear you man! That’s why we are writing a grant proposal for $1 bn more computing power – so that we can report the output of our models to even more decimal places. Hopefully more decimal places will make our simulations better because we do not know the difference between accuracy and precision. [/sarc]