Raising the bar on statistical significance

I was searching the early edition of PNAS for the abstract of yet another sloppy “science by press release” that didn’t bother to give the title of the paper or the DOI, and came across this paper, so it wasn’t a wasted effort.

Steve McIntyre recently mentioned:

Mann rose to prominence by supposedly being able to detect “faint” signals using “advanced” statistical methods. Lewandowsky has taken this to a new level: using lew-statistics, lew-scientists can deduce properties of population with no members.

Josh (N=0) humor aside, this new paper makes me wonder how many climate science findings would fail evidence thresholds under this new proposed standard?

Revised standards for statistical evidence

Valen E. Johnson

Significance

The lack of reproducibility of scientific research undermines public confidence in science and leads to the misuse of resources when researchers attempt to replicate and extend fallacious research findings. Using recent developments in Bayesian hypothesis testing, a root cause of nonreproducibility is traced to the conduct of significance tests at inappropriately high levels of significance. Modifications of common standards of evidence are proposed to reduce the rate of nonreproducibility of scientific research by a factor of 5 or greater.

Abstract

Recent advances in Bayesian hypothesis testing have led to the development of uniformly most powerful Bayesian tests, which represent an objective, default class of Bayesian hypothesis tests that have the same rejection regions as classical significance tests. Based on the correspondence between these two classes of tests, it is possible to equate the size of classical hypothesis tests with evidence thresholds in Bayesian tests, and to equate P values with Bayes factors. An examination of these connections suggest that recent concerns over the lack of reproducibility of scientific studies can be attributed largely to the conduct of significance tests at unjustifiably high levels of significance. To correct this problem, evidence thresholds required for the declaration of a significant finding should be increased to 25–50:1, and to 100–200:1 for the declaration of a highly significant finding. In terms of classical hypothesis tests, these evidence standards mandate the conduct of tests at the 0.005 or 0.001 level of significance.

From the discussion:

The correspondence between P values and Bayes factors based on UMPBTs suggest that commonly used thresholds for statistical significance represent only moderate evidence against null hypotheses. Although it is difficult to assess the proportion of all tested null hypotheses that are actually true, if one assumes that this proportion is approximately one-half, then these results suggest that between 17% and 25% of marginally significant scientific findings are false. This range of false positives is consistent with nonreproducibility rates reported by others (e.g., ref. 5). If the proportion of true null hypotheses is greater than one-half, then the proportion of false positives reported in the scientific literature, and thus the proportion of scientific studies that would fail to replicate, is even higher.

In addition, this estimate of the nonreproducibility rate of scientific findings is based on the use of UMPBTs to establish the rejection regions of Bayesian tests. In general, the use of other default Bayesian methods to model effect sizes results in even higher assignments of posterior probability to rejected null hypotheses, and thus to even higher estimates of false-positive rates.

This phenomenon is discussed further in SI Text, where Bayes factors obtained using several other default Bayesian procedures are compared with UMPBTs (see Fig. S1). These analyses suggest that the range 17–25% underestimates the actual proportion of marginally significant scientific findings that are false.

Finally, it is important to note that this high rate of nonreproducibility is not the result of scientific misconduct, publication bias, file drawer biases, or flawed statistical designs; it is simply the consequence of using evidence thresholds that do not represent sufficiently strong evidence in favor of hypothesized effects.
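
The arithmetic behind the percentages quoted above can be sketched as follows. This is only an illustration, assuming (as the discussion supposes) that half of all tested null hypotheses are true, and treating the Bayes factor as the odds in favor of the alternative hypothesis. The function names are made up for this example and do not come from the paper.

```python
def posterior_null_probability(bayes_factor, prior_null=0.5):
    """Posterior probability that the null is true after a 'significant' result.

    bayes_factor: evidence for the alternative over the null (e.g. 25 for 25:1).
    prior_null:   assumed proportion of tested null hypotheses that are true.
    """
    prior_odds_alt = (1.0 - prior_null) / prior_null
    posterior_odds_alt = bayes_factor * prior_odds_alt
    return 1.0 / (1.0 + posterior_odds_alt)


def implied_bayes_factor(false_rate, prior_null=0.5):
    """Bayes factor that would produce a given posterior probability of the null."""
    prior_odds_alt = (1.0 - prior_null) / prior_null
    return ((1.0 - false_rate) / false_rate) / prior_odds_alt


if __name__ == "__main__":
    # Proposed thresholds from the abstract: 25:1, 50:1, 100:1, 200:1.
    for bf in (25, 50, 100, 200):
        print(bf, round(posterior_null_probability(bf), 4))
    # Bayes factors implied by the quoted 17% and 25% false-finding rates.
    for rate in (0.17, 0.25):
        print(rate, round(implied_bayes_factor(rate), 2))
```

Under these assumptions, the quoted 17% to 25% range corresponds to Bayes factors of roughly 5:1 down to 3:1, while a 25:1 to 50:1 threshold would leave roughly 2% to 4% of declared findings false.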

=================================================================

The full paper is here: http://www.pnas.org/content/early/2013/10/28/1313476110.full.pdf

The SI is here: Download Supporting Information (PDF)

For our layman readers who might be a bit behind on statistics, here is a primer on statistical significance and P-values as it relates to weight loss/nutrition, which is something that you can easily get your mind around.

Gross failure of scientific nutritional studies is another topic McIntyre recently discussed: A Scathing Indictment of Federally-Funded Nutrition Research

So, while some dicey science findings might simply be low threshold problems, there are real human conduct problems in science too.

219 Comments

M Simon

November 12, 2013 4:23 am

This is the reason so many nutritional findings are reversed. You know. Trans fats are good. Trans fats are bad. Soon to be followed by meat eating is banned. Too many natural trans fats.
The Atkins diet? There are problems.
A few small, short studies suggest Atkins raises HDL cholesterol and lowers blood pressure, reducing the risk of heart attack and stroke. But many of the studies were small and short, and some of the positive findings did not carry enough statistical weight to be trustworthy. And all that fat worries most experts.
http://health.usnews.com/best-diet/atkins-diet

polistra

November 12, 2013 4:28 am

Stats can’t be saved by tweaking.
The only proper standard for science is NO STATS. If a result has to be reached by statistics, it’s not a scientific result. Only well-calibrated measurements plus a well-balanced experiment can give scientific results.

Vincent

November 12, 2013 4:38 am

Hasn’t John Brignell over at Number Watch been banging on about exactly this for years?

Bill Marsh

Editor

November 12, 2013 4:42 am

Still my favorite Stats prof saying, “Numbers are like people, torture them enough and they’ll tell you whatever you want to hear.”

jimmi_the_dalek

November 12, 2013 4:48 am

“The only proper standard for science is NO STATS. If a result has to be reached by statistics, it’s not a scientific result. Only well-calibrated measurements plus a well-balanced experiment can give scientific results.”
Nonsense. Many areas of science are inherently statistical. Chemical reactions for example, or quantum effects, or thermodynamics. The Second Law of Thermodynamics is entirely statistical in origin.

steveta_uk

November 12, 2013 4:54 am

Interesting that in climate scientology, you can be 95% sure on little evidence, whereas in particle physics, 6-sigma results are required to come close to verifying something.

chris y

November 12, 2013 5:05 am

polistra says-
“If a result has to be reached by statistics, it’s not a scientific result.”
William Briggs, statistician to the stars, doesn’t go that far, but what he writes is still pretty damning-
“If you need statistics to prove something, and you have no proof except statistical proof, then what you have proved probably isn’t true. Statistical evidence is the lowest form of evidence there is. What a depressing conclusion.”
William Briggs, statistician, October 5, 2011

Bill Illis

November 12, 2013 5:17 am

In fiscal year 2012-13 ended September, the US Government spent $2.5 billion on climate research (among $20 billion more spent on climate change for clean energy, international assistance, tax credits etc.)
http://www.powerlineblog.com/archives/2013/11/why-does-the-global-warming-hoax-persist.php
If you assume this $2.5 billion supports X number of researchers at $125,000 each, there would be 20,000 climate researchers. It is just an industry which has no incentive to say “oops, we got it wrong.” If a single scientist said so, he would be side-lined in short order because there are 20,000 other researchers in the US that depend on the $2.5 billion of income to keep coming in.
So, that is just the US. Globally, $359 billion was spent on climate change last year. If the same ratios applied to this number, there would be $40.4 billion spent on climate research supporting 320,000 researchers.
http://climatepolicyinitiative.org/publication/global-landscape-of-climate-finance-2013/
Those are the only statistics that count in this field.

Doug Huffman

November 12, 2013 5:17 am

Thanks for the citations that I will read in light of my slim understanding of E. T. Jaynes, who did science – physics – with Bayesian statistics.
I appreciated the “torture” comment above; statistics, people or the internet can be tortured into providing any desired statement.

Carlo Napolitano

November 12, 2013 5:22 am

Well, this is not new in biology and medical research (the one I do). Statistical significance in clinical trials sometimes leads to weird conclusions, but only those that are based on biologically relevant responses are eventually brought to the bedside. The majority of clinical trials (those based on “unexpected” findings) are simply forgotten over the years.
Thus, going beyond the statistical significance and making clinical studies that are intrinsically mechanistic is the only way to produce good medical science.
In a nutshell….. it is better to understand what you are talking about

Mickey Reno

November 12, 2013 5:23 am

Well, it’s a good start….

Finally, it is important to note that this high rate of nonreproducibility is not the result of scientific misconduct, publication bias, file drawer biases, or flawed statistical designs; it is simply the consequence of using evidence thresholds that do not represent sufficiently strong evidence in favor of hypothesized effects.

This is a false statement, IF scientists intentionally refuse to archive or share their data for use by would-be replicators, the motives of whom the original scientist seems to think he has a right to impugn. It’s also false IF the statistical results coincidentally match reality. A high degree of correlation STILL won’t prove causation. With smoothing of temperature graphs using multi-year averages, CO2 being released by the oceans during warmer periods would match a hypothesis that human emissions caused the higher CO2 levels. That’s the difficulty with correlations, you need to test EVERYTHING, EVEN THE THINGS YOU DON’T KNOW, to see whether or not other things might also be correlated. This is, for the most part, impossible when testing a system as complex as the Earth’s climate. Ergo, statistics will always leave us needing more. Conclusions based on statistics will always be suspect if they are used to deduce causality.

bwanajohn

November 12, 2013 5:25 am

Here is another excellent article on non self-correcting science and the misuse of statistics.
http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble

Espen

November 12, 2013 5:26 am

This seems like real non-news to anyone with a little background in statistics. It has always puzzled me that some branches of science seem content with p-values as high as 0.05!

Peter Ward

November 12, 2013 5:28 am

John Brignell has indeed been talking about this sort of thing for years. His critique of the “evidence” against secondary smoking, for example, is particularly withering.

vukcevic

November 12, 2013 5:31 am

[snip – stop trying to push your solar stuff on unrelated threads – Anthony]

James Strom

November 12, 2013 5:41 am

I’m not sure I understand this proposal. The idea is to reduce occurrences of nonconfirmation by requiring a significance level of, say, 0.005 instead of 0.05. But under these new standards there will be results which just barely pass the test. When those results are subjected to a repeat study for reconfirmation wouldn’t they be equally likely to miss significance as marginal results under the older regime?
Of course, adopting this stricter standard would result in far fewer studies achieving significance in the first place, and that would of itself reduce non-confirmation.
I’m not a statistician, so I’m happy to admit that I may have missed something.

bob sykes

November 12, 2013 5:46 am

Particle physicists don’t make claims unless the exceedance probability is over 6 sigma.

Col Mosby

November 12, 2013 6:06 am

Actually, whether or not a causal linkage exists between two factors is almost never of any importance, regardless of whether your tests can detect same. What’s important is the strength of the linkage. Using the word “significant” in the term “statistical significance” has confused more folks than any other phrase in science, misleading them into believing that the detected linkage is “significant,” i.e. a strong linkage, which the test results actually say nothing about. Simply using a larger sample size makes for a more powerful statistical test and allows trivial linkages to be detected as “statistically significant.” Just assuring ourselves that real, actual linkages exist is rarely of any value. We need to know HOW strong the linkage is, not simply whether one exists or not (i.e. whether the null hypothesis can be rejected).
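
The point about sample size can be illustrated with a rough sketch. It assumes a one-sample z-test with a known standard deviation, which is a simplification chosen for this example rather than anything specified in the comment, and the function name is hypothetical.

```python
import math


def two_sided_p_value(effect, sd, n):
    """Two-sided p-value of a one-sample z-test.

    effect: true mean shift being tested against zero (assumed observed exactly).
    sd:     known standard deviation of a single observation.
    n:      sample size.
    """
    z = effect / (sd / math.sqrt(n))
    # For a standard normal, P(|Z| > z) equals erfc(z / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # A trivial effect of 1% of a standard deviation, at growing sample sizes.
    for n in (100, 10_000, 1_000_000):
        print(n, two_sided_p_value(0.01, 1.0, n))
```

The effect size never changes in this sketch; only the sample size does, yet the p-value moves from clearly non-significant to far below any conventional threshold.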

DirkH

November 12, 2013 6:06 am

M Simon says:
November 12, 2013 at 4:23 am
“This is the reason so many nutritional findings are reversed. You know. Trans fats are good. Trans fats are bad. Soon to be followed by meat eating is banned. Too many natural trans fats.
The Atkins diet? There are problems.
A few small, short studies suggest Atkins raises HDL cholesterol and lowers blood pressure, reducing the risk of heart attack and stroke. But many of the studies were small and short, and some of the positive findings did not carry enough statistical weight to be trustworthy. And all that fat worries most experts.”
Well, maybe you should first point to a study that shows a correlation between fat consumption and heart attacks, without leaving out all data points that would destroy the correlation.
And please, the site you linked to says “The theory:The body is an engine; carbs are the gas that makes it go.”
Oh please. Ask any bicyclist on what fuel he runs most of the time. That site is a bad joke and misrepresents what they attack.
Atkins BTW has not invented the low carb diet.
http://www.lowcarb.ca/corpulence/corpulence_full.html

ferd berple

November 12, 2013 6:41 am

You have 100 people in a company, of which 5 are actually using drugs. You employ a drug test that is 95% reliable. It will deliver 5 false positives. There is a 50-50 chance that someone identified as using drugs is actually using drugs.
What scientists routinely fail to account for is that statistical significance needs to be considered in the context of how “rare” the thing you are looking for is. When you are looking for hay in a haystack, 95% reliability works perfectly well. You will very rarely find a needle instead of hay.
However, when you are looking for the needle in the haystack, then 95% reliability is worthless. Most of what you identify as “needles” will in fact be hay.
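
The drug-test numbers above can be checked with a short sketch. It assumes that “95% reliable” means the test catches 95% of real users and also clears 95% of non-users; that reading, and the function name, are assumptions made for this illustration.

```python
def positive_predictive_value(population, true_cases, sensitivity, specificity):
    """Fraction of positive test results that are real cases.

    population:  number of people tested.
    true_cases:  how many of them really have the condition.
    sensitivity: probability a real case tests positive.
    specificity: probability a non-case tests negative.
    """
    true_positives = true_cases * sensitivity
    false_positives = (population - true_cases) * (1.0 - specificity)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Needle in a haystack: 5 users among 100 employees.
    print(positive_predictive_value(100, 5, 0.95, 0.95))
    # Hay in a haystack: 95 of 100 are real cases.
    print(positive_predictive_value(100, 95, 0.95, 0.95))
```

With 5 users in 100, the expected counts are 4.75 true positives and 4.75 false positives, which is where the 50-50 figure comes from.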
The problem is that for most of climate research, what they are looking for is the needles in the haystack. They want to, for example, identify the very small human temperature signal from the much larger daily and annual temperature signal. So when they apply the 95% test they get hockey sticks when the reality is closer to hockey pucks.

rgbatduke

November 12, 2013 6:45 am

Since this is something of my game, I’ll summarize the idea. In Bayesian probability analysis, one rarely just states hypotheses. One states hypotheses based on various assumptions. Sometimes these assumptions are actually stated, sometimes they are not. Sometimes they are (in some rigorously defensible sense) true and sufficiently accurate beyond reasonable doubt (the assumption of Galilean or Newtonian gravity for near-Earth-surface physics problems, for example) and sometimes they are basically little more than educated guesses with little or no solid evidence.
The simplest learning example of the latter is a coin flip. Suppose you meet up with a total stranger, who wants to bet you $1 a flip on 100 successive flips, and you get heads. You say to yourself “Hmm, a two sided coin, zero sum game, random walk, I expect to win or lose no more than $10-15 (with equal probability) and it will help to pass the time, sure.” You have intuitively used Bayesian reasoning. You picked a prior probability for a two-sided coin of 0.5 heads or tails, from symmetry and maximum entropy principles.
You didn’t have to. You could have said to yourself “Gee, a complete stranger wants to play a game of ‘chance’ with me with his coin, where he gets to pick heads for me. I’ll bet that he’s playing with a biased coin so that if I take this sucker bet, my expectation is to lose $100 or close to it.” This too is a Bayesian prior. You go ahead and take the bet anyway because you are bored and because you can count on proving that he cheated if the winning gets too lopsided.
They are also the basis for the null hypothesis — The coin has p(heads) = 0.5 or the coin has p(heads) = 0.0 respectively.
In traditional hypothesis testing, one assumes the null hypothesis, conducts an experiment, and determines the probability of obtaining the observed sequence given the null hypothesis. If you assumed p = 0.5 and the results of the first 8 flips were all tails, the probability of this outcome is 1/256 ≈ 0.004. Based on additional prior knowledge you have (basically, the reasoning in the second case that it is unwise to gamble with strangers, especially with their cards, dice, coins) you might well have rejected the null hypothesis at the seventh flip or even the sixth. Since the bet is cheap, you might even go two more flips, but at the tenth flip you’ve already reached your expected win/loss threshold based on a random sequence of flips of an unbiased coin and the p-value is down to less than 0.001. Time to offer the stranger a counter-bet: using the same coin, for the rest of the bets you get tails and he gets heads OR nobody pays off and everybody walks away.
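
The run probabilities quoted above are simple powers of one-half. A minimal sketch, assuming a fair coin and independent flips; the function name is invented for this illustration.

```python
from fractions import Fraction


def run_probability(n_flips, p_tails=Fraction(1, 2)):
    """Probability that n_flips consecutive independent flips all come up tails."""
    return p_tails ** n_flips


if __name__ == "__main__":
    for n in (6, 8, 10):
        prob = run_probability(n)
        print(n, prob, float(prob))
```

Exact fractions are used so that 1/64, 1/256 and 1/1024 come out without rounding.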
In Bayesian probability theory, one doesn’t reason exactly this way. Instead, one can allow the results of the experiment to modify your assertion of prior probabilities so that they basically asymptotically agree with the data however you start them out. Before the first flip your prior estimate of p(heads) is 0.5, but after a single flip, whether it comes out heads or tails, this is no longer true. If you get tails at the end of one flip, the probability p(heads) is strictly less than 0.5, and it monotonically descends with every sequential flip of tails.
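
One common way to make this updating concrete is a Beta prior on p(heads). The comment does not name a prior distribution, so the uniform Beta(1, 1) prior used below is purely an assumption for illustration, as is the function name.

```python
def posterior_mean_heads(heads, tails, prior_heads=1.0, prior_tails=1.0):
    """Posterior mean of p(heads) under a Beta(prior_heads, prior_tails) prior.

    With a Beta prior and coin-flip data, the posterior is again a Beta
    distribution whose parameters are the prior values plus the observed counts.
    """
    return (prior_heads + heads) / (prior_heads + prior_tails + heads + tails)


if __name__ == "__main__":
    # Estimate of p(heads) after 0, 1, 2, ... straight tails.
    for n_tails in range(0, 11):
        print(n_tails, round(posterior_mean_heads(0, n_tails), 4))
```

Starting from 0.5, the estimate drops to 1/3 after one tail and keeps falling with each further tail, matching the monotonic descent described above. A more strongly held prior (larger prior_heads and prior_tails) would fall more slowly.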
This approach (really, no approach) is not going to be particularly reliable with only five or six flips because (in the words of George Marsaglia, a giant of hypothesis testing and random numbers) “p happens”. If you conduct many such experiments, not only will sequences of six tails in a row occur, but they’ll occur in about one in sixty-four randomly selected sequences of six coin flips, even if the coin is unbiased!
This is precisely the kind of reasoning that is not being conducted in climate science. In part this is because it is a house of cards — if you knock a single card loose on the ground floor, you risk the whole thing tumbling down. The list of (Bayesian prior) assumptions is staggering, and in the case of many of them there is simply no possibility of directly measuring them and setting them on the basis of direct empirical evidence, or there is some evidentiary basis for a number, but there is a rather wide range of possible/probable error in that number.
This multiplicity of Bayesian priors has the following effect. It becomes quite possible to obtain good agreement with a limited sequence of data with incorrect prior assumptions. This, in turn, means that limits on p have to be made more stringent. If one uses the coin metaphor, an additional (unstated) Bayesian prior assumption is that it is possible for you to detect it if your opponent switches coins! Of course, your opponent in addition to being a scoundrel could be an unemployed magician, and swapping coins “invisibly” could be child’s play to him. In that event, he might play you with a fair coin for the first ten or twenty throws, allowing you to conclude that the coin is a fair coin, and then swap in the unfair coin for a few throws out of every ten for the rest of the way. You end up losing with complete certainty, but at a rate just over the expected win/loss threshold. He makes (say) $20 or $25 instead of $100, but he kept you in the game until the end because you could have just been unlucky with a fair coin and you had direct evidence of apparently fair sequences.
In more mathematical theories the same thing often happens if you have nonlinear functions with parametric partially cancelling parts contributing to some result, function forms with a lot of covariance. One can often get good agreement with a segment of data with a mutually varying range of certain parameters, so that any prior assumption in this range will apparently work. It’s only when one applies the theory outside of the range where the accidental cancellation “works” that the problem emerges. Once again naively applying a hypothesis test to the range of data used to justify a Bayesian assumption of parameters is completely incorrect, as is applying the assumption to a naively selected trial set, and the requirements for passing a hypothesis test at all have to be made much more stringent in the meantime to account for the lack of evidence for the Bayesian priors themselves. The evidence has to both agree with the null hypothesis and affirm the prior assumptions, and to the extent that it has to do the latter without outside support it takes a lot more evidence to affirm the priors in the sense of not rejecting the null hypothesis. You have many “and” operators in the Bayesian logic — if the primary theory is correct AND this parameter is correct AND this second parameter is correct AND this third parameter is correct, what are the chances of observing the data — and the computation of that chance has to be diluted by accounting for all of the OTHER values of the prior assumptions that might lead to the same agreement with the data but effectively constitute a different theory.
And then there is data dredging, which is very nearly another version of the same thing.
So when there are 30+ GCMs, each with dozens of parameters in a complex theory with many countervarying nonlinear contributions, that are in good agreement with an indifferently selected monotonic training set of data, that is not yet particularly good evidence that the theory is correct. The real test comes when one applies it to new data outside of the training set, ideally data that does not conform to the monotonic behavior of the training set. If the theory (EACH GCM, one at a time) can correctly predict the alteration of the monotonic behavior, then it becomes more plausible. To the extent that it fails to do so, it becomes less plausible. Finally, as theories become less plausible, a Bayesian would go in and modify the prior assumptions to try to reconstruct good agreement between the theory and observation, just as you might change your prior beliefs about the fairness of the coin above, if you start seeing highly improbable outcomes (given the assumptions).
It is this latter step that is proceeding far too slowly in climate science.
rgb

MCT

November 12, 2013 6:53 am

In defence of the claim that statistics has been used successfully in science, jimmi-the-dalek says above that:
“Nonsense. Many areas of science are inherently statistical. Chemical reactions for example, or quantum effects, or thermodynamics. The Second Law of Thermodynamics is entirely statistical in origin.”
This is true in one sense but rather misleading. The sort of ‘statistics’ used in quantum physics and classical thermodynamics is radically different from that used in climate science. The first is deductive, the second inductive. In thermodynamics we make assumptions about the probabilities of elementary events and then use the calculus of probabilities to derive probabilities for complex events. This is a process as rigorous as anything in mathematics, and the results are compared with experiment – e.g., time-of-flight observations for the Maxwell-Boltzmann distribution. Compared with what happens in climate science this can hardly be called ‘statistics’ at all – ‘stochastic’ might be a better word for it. In climate science they start with the experimental results – time-series and so on – and use genuinely statistical methods to try to find trends, correlations and causality.
Of course, in the experimental testing of quantum results, say, statistics in this second sense is used – as we saw with the detection of the Higgs particle – but that is another matter. I think the reliance of any science on statistical analyses, to the extent we see in climate science, is a real cause for concern – especially when the level of competence shown has been criticised so heavily by those whose competence non-experts such as myself have no reason to doubt.

mkelly

November 12, 2013 7:07 am

published in German in 1854, is known as the Clausius statement:
Heat can never pass from a colder to a warmer body without some other change, connected therewith, occurring at the same time.
Statistical mechanics was initiated in 1870 with the work of Austrian physicist Ludwig Boltzmann…
Both the above from Wiki.
jimmi_the_dalek says:
November 12, 2013 at 4:48 am
The Second Law of Thermodynamics is entirely statistical in origin.
Since the Clausius statement preceded the initiation of statistical mechanics by some 16 years, I find it hard to understand how you can say that the second law is entirely statistical in origin.

ferd berple

November 12, 2013 7:14 am

rgbatduke says:
November 12, 2013 at 6:45 am
and sometimes they are basically little more than educated guesses with little or no solid evidence.
=============
the assumption that some trees make better thermometers than other trees, and that one tree in particular is a better thermometer than all other trees.
if you look at 1000 trees, you will find that one tree better follows the thermometer records than all the others. and if you repeat this study, you will find that in another set of 1000 trees, there is one tree that better matches the thermometer records.
climate science teaches the reason that one tree in 1000 better matches the thermometer records is that this tree is a better thermometer. thus this tree can be used reliably as a thermometer proxy outside the time in which we have thermometers.
chance tells us that if you look at 1000 trees there will always be 1 tree that better matches the thermometer records, and the reason has nothing to do with being a better thermometer. climate science rejects chance as the cause. Of this they are 97% certain.
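
The selection effect described here can be sketched with a small simulation. It is only an illustration under loud assumptions: both the “temperature” record and every “tree” are independent random noise, the series lengths and the seed are arbitrary, and the function names are made up. It is not a model of any real proxy study.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_tree_by_chance(n_trees=1000, calib_years=50, test_years=50, seed=0):
    """Pick the random 'tree' that best matches a random 'temperature' record
    during a calibration period, then score it on years it was not picked on.

    Returns (calibration correlation, out-of-sample correlation) for that tree.
    """
    rng = random.Random(seed)
    total = calib_years + test_years
    # Assumption: pure noise everywhere, so no tree carries any real signal.
    temperature = [rng.gauss(0.0, 1.0) for _ in range(total)]
    trees = [[rng.gauss(0.0, 1.0) for _ in range(total)] for _ in range(n_trees)]
    calib = [pearson(t[:calib_years], temperature[:calib_years]) for t in trees]
    best = max(range(n_trees), key=calib.__getitem__)
    held_out = pearson(trees[best][calib_years:], temperature[calib_years:])
    return calib[best], held_out


if __name__ == "__main__":
    in_sample, out_of_sample = best_tree_by_chance()
    print("calibration correlation of best tree:", round(in_sample, 3))
    print("same tree outside calibration period:", round(out_of_sample, 3))
```

Because the winner is chosen from 1000 noise series, its calibration correlation is sizeable by construction, while its correlation outside the calibration period has no reason to be anything but noise.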

Jim Rose

November 12, 2013 7:19 am

I don’t understand. P=.05 should mean that the null hypothesis is correct only 5% of the time. How does this turn into error rates of approximately 20% — supposing that biases are accounted for?
What is the basic idea that changes 5% to 20%?
