I was searching the early edition of PNAS for the abstract of yet another sloppy “science by press release” that didn’t bother to give the title of the paper or the DOI, and came across this paper, so it wasn’t a wasted effort.
Steve McIntyre recently mentioned:
Mann rose to prominence by supposedly being able to detect “faint” signals using “advanced” statistical methods. Lewandowsky has taken this to a new level: using lew-statistics, lew-scientists can deduce properties of a population with no members.
Josh (N=0) humor aside, this new paper makes me wonder: how many climate science findings would fail the evidence thresholds under this newly proposed standard?
Revised standards for statistical evidence
Valen E. Johnson
Significance
The lack of reproducibility of scientific research undermines public confidence in science and leads to the misuse of resources when researchers attempt to replicate and extend fallacious research findings. Using recent developments in Bayesian hypothesis testing, a root cause of nonreproducibility is traced to the conduct of significance tests at inappropriately high levels of significance. Modifications of common standards of evidence are proposed to reduce the rate of nonreproducibility of scientific research by a factor of 5 or greater.
Abstract
Recent advances in Bayesian hypothesis testing have led to the development of uniformly most powerful Bayesian tests, which represent an objective, default class of Bayesian hypothesis tests that have the same rejection regions as classical significance tests. Based on the correspondence between these two classes of tests, it is possible to equate the size of classical hypothesis tests with evidence thresholds in Bayesian tests, and to equate P values with Bayes factors. An examination of these connections suggests that recent concerns over the lack of reproducibility of scientific studies can be attributed largely to the conduct of significance tests at unjustifiably high levels of significance. To correct this problem, evidence thresholds required for the declaration of a significant finding should be increased to 25–50:1, and to 100–200:1 for the declaration of a highly significant finding. In terms of classical hypothesis tests, these evidence standards mandate the conduct of tests at the 0.005 or 0.001 level of significance.
From the discussion:
The correspondence between P values and Bayes factors based on UMPBTs suggests that commonly used thresholds for statistical significance represent only moderate evidence against null hypotheses. Although it is difficult to assess the proportion of all tested null hypotheses that are actually true, if one assumes that this proportion is approximately one-half, then these results suggest that between 17% and 25% of marginally significant scientific findings are false. This range of false positives is consistent with nonreproducibility rates reported by others (e.g., ref. 5). If the proportion of true null hypotheses is greater than one-half, then the proportion of false positives reported in the scientific literature, and thus the proportion of scientific studies that would fail to replicate, is even higher.
In addition, this estimate of the nonreproducibility rate of scientific findings is based on the use of UMPBTs to establish the rejection regions of Bayesian tests. In general, the use of other default Bayesian methods to model effect sizes results in even higher assignments of posterior probability to rejected null hypotheses, and thus to even higher estimates of false-positive rates.
This phenomenon is discussed further in SI Text, where Bayes factors obtained using several other default Bayesian procedures are compared with UMPBTs (see Fig. S1). These analyses suggest that the range 17–25% underestimates the actual proportion of marginally significant scientific findings that are false.
Finally, it is important to note that this high rate of nonreproducibility is not the result of scientific misconduct, publication bias, file drawer biases, or flawed statistical designs; it is simply the consequence of using evidence thresholds that do not represent sufficiently strong evidence in favor of hypothesized effects.
=================================================================
The full paper is here: http://www.pnas.org/content/early/2013/10/28/1313476110.full.pdf
The SI is here: Download Supporting Information (PDF)
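Where does the paper’s 17–25% figure come from? Here is a minimal back-of-the-envelope sketch, not the paper’s UMPBT derivation: it just applies Bayes’ theorem, assuming (as the paper does) that about half of all tested null hypotheses are true. The power values are illustrative assumptions of mine, not numbers from the paper.

def false_discovery_rate(alpha, power, prior_null=0.5):
    """P(null is true | test rejects), by Bayes' theorem."""
    false_pos = prior_null * alpha        # true nulls rejected by chance
    true_pos = (1 - prior_null) * power   # false nulls correctly rejected
    return false_pos / (false_pos + true_pos)

for power in (0.15, 0.20, 0.25):
    fdr = false_discovery_rate(alpha=0.05, power=power)
    print(f"power={power:.2f}: {fdr:.0%} of marginal findings are false")
# powers between ~0.15 and ~0.25 reproduce the paper's 17-25% range

Note that rerunning the same calculation at alpha = 0.005, as the paper proposes, drops the false-discovery rate to roughly 2–3%.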
For our layman readers who might be a bit behind on statistics, here is a primer on statistical significance and P-values as they relate to weight loss/nutrition, something you can easily get your mind around.
The gross failure of scientific nutrition studies is another topic McIntyre recently discussed: A Scathing Indictment of Federally-Funded Nutrition Research
So, while some dicey science findings might simply be low threshold problems, there are real human conduct problems in science too.
Perhaps the biggest assumption in modern science is that the purpose of science is to conduct research in order to discover “The Truth”.
What if the true purpose of science is to attract funding? What scientific body would welcome a study that said “nothing to worry about, you can cut our funding”?
Jim Rose says:
November 12, 2013 at 7:19 am
What is the basic idea that changes 5% to 20%?
=======
You have 100 people in a company, of which 5 are actually using drugs. You employ a drug test that is 95% reliable. Applied to the 95 non-users, it will deliver about 5 false positives, so there is roughly a 50-50 chance that someone identified as using drugs is actually using drugs.
You believe your test is 95% accurate, but in fact about 50% of the time it incorrectly identifies someone as a drug user. Since you don’t know there are only 5 drug users, when the test tells you there are 10 drug users, you believe the test. 5 innocent employees get fired along with the 5 guilty.
Now suppose 100 scientists conduct studies, of which 5 produce real positives. At the 95% significance level, about 5 more produce false positives. Only the positives get published, as negative results do not attract attention: 5 true positives along with 5 false positives, so 50% of the published papers are false.
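A quick simulation of that argument (a sketch using the comment’s idealized numbers: a 5% false-positive rate, and the assumption that every real effect is detected):

import random

random.seed(1)
N, real, alpha = 100, 5, 0.05   # 100 studies, 5 real effects
false_pos = true_pos = 0
for _ in range(10_000):                      # repeat the thought experiment
    for study in range(N):
        if study < real:
            true_pos += 1                    # comment assumes all real effects found
        elif random.random() < alpha:
            false_pos += 1                   # a true null rejected by chance
print(f"{false_pos / (false_pos + true_pos):.0%} of 'positives' are false")
# prints ~49%: about half of the published positives are spurious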
I don’t believe raising thresholds helps much, and I don’t believe 0.005 or 0.001 levels are needed to avoid bad science.
In fact, the conclusion that raising thresholds to such levels will mean better science looks to me more insane than logical. In my opinion it defies probability theory, and it would inevitably generate more false negatives than we now have false positives – at least for the normal distribution.
Statistics should in any case be considered only supporting, indirect evidence. Just as people shouldn’t be sentenced on indirect evidence alone, scientists should also have other, direct evidence for their hypotheses. There are also hypotheses which aren’t testable by statistics at all. And where they are, statistics is powerful, but still not omnipotent – it can’t replace common sense. Science shouldn’t be an industry that produces knowledge deemed true merely because it passes statistical tests.
On the other hand, it is essential to the scientific method that it be based on falsification – one comes up with a hypothesis, tests it, comes to a conclusion, publishes it, and then others come along, find the flaws, and falsify it. And that falsification should then be taken most seriously. Not like in climate science, where the GST shows no change or even cooling, yet the global reports still talk about global warming, insist it is due to CO2, and demand even more money and power anyway.
Science is not here for itself; it is a social phenomenon driven both by competence and by competition. It is a belief, not the nature of science, that raising thresholds will automatically mean less junk science and less nonreproducibility. It doesn’t work like that: there are no “shortcuts” from ignorance to knowledge through restriction of science. In my opinion the only way there goes through honesty, dedication, and freedom of conscience.
Jim Rose: I don’t understand. P=.05 should mean that the null hypothesis is correct only 5% of the time. How does this turn into error rates of approximately 20% — supposing that biases are accounted for?
What is the basic idea that changes 5% to 20%?
The p-value is the conditional probability of obtaining a result at least as extreme as the one observed, given that the null hypothesis is true. Only rarely would that translate into 5% of the true null hypotheses being falsely rejected in the literature, or the literature containing false rejections of the null hypothesis in only 5% of published papers. What changes the 5% to 20% (speaking loosely) is the fact that the null hypotheses are actually true (or at least very accurate) more often than the researchers expect them to be.
The proposal in the focus paper and the reasoning behind it are not that new; the debate about what significance level to use is probably at least a century old now. The penalty for insisting on low p-values is a higher-than-desirable rate of false non-rejection of false null hypotheses, aka a “high rate of Type II errors”.
What has to be recognized is that single reported studies, without replication, are not as reliable as you would like them to be. That is not a problem solved by adopting smaller p-values, but by a deeper appreciation of the prevalence and size of random variation (the agglomeration of all the variation that is not directly relatable to the focus of the study and its measurable covariates).
10 thousand Olympic athletes are tested for drug use with a test that is 99.99% accurate. What are the odds that an athlete identified as a drug user is actually a drug user?
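Taking the bait on that closing question: the answer depends entirely on the base rate of actual drug use, which the question leaves unstated. A sketch, reading “99.99% accurate” as both sensitivity and specificity, and assuming (purely for illustration) that 1 athlete in 1,000 actually uses drugs:

def p_user_given_positive(sensitivity, specificity, base_rate):
    """Bayes' theorem: P(actual user | positive test)."""
    true_pos = base_rate * sensitivity
    false_pos = (1 - base_rate) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

# the 1-in-1,000 base rate is an illustrative assumption
print(f"{p_user_given_positive(0.9999, 0.9999, 0.001):.2f}")  # ~0.91

Even at 99.99% accuracy, roughly 1 positive in 11 is a false accusation at that base rate; at a 1-in-10,000 base rate the odds fall to 50:50.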
ferd berple says:
November 12, 2013 at 7:14 am
“…climate science rejects chance as the cause. Of this they are 97% certain.”
Well said.
Why Most Published Research Findings Are False
http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.0020124
Nice explanation of applying Bayesian statistics to the flip of a coin. There is one huge difference between your example and the use of statistics in climate science. In your example, the statistician knows what the event space is. The event space consists of readily observable and readily definable physical phenomena, namely, flips of the coin. Climate science does not have the luxury of a readily observable or readily definable event space.

Permit me to illustrate by reference to Michael Mann’s work on tree rings. What is (was) an event for Mann? It was the measurement of the width of a tree ring. For Mann, a series of events that together make up a proxy record consists of measurements of tree ring widths for the same kind of tree, taken from trees scattered about the surface of the Earth. Does that count as a legitimate event space? Clearly not, because tree ring growth is determined not by the tree but by the ever-changing environment in which the tree grows. Did Mann investigate the several environments in order to establish readily observable and definable phenomena? Not in the least. All he did was a little hand waving to the effect that the sample trees were pretty much located at the tree line (on mountains, of course).
The usual Alarmist response to my point is that all the differences among the samples can be accommodated through powerful statistical techniques. This response does not climb above childishness.
I have “attacked” the soft underbelly of climate science by using Mann as my example. But the same is true throughout climate science.

Climate scientists sometimes talk about heat content and sometimes talk about temperature measurement, and they make the egregious assumption that the two are interdefinable. Climate scientists sometimes talk about temperatures taken in the atmosphere and sometimes about temperatures taken on the surface or beneath the surface of the oceans.

To summarize, climate scientists are the only so-called hard scientists now practicing who are quite willing to take any two series of measurements and assume that they are comparable. The worst offenders are the “paleoclimatologists”, who are quite willing to take 50 existing “paleo” series and treat them as comparable while giving no thought to the actual events of measurement that occurred at some time in some readily observable and definable environment. Most climate scientists have no idea what their event space is.
This post and discussion highlight one thing for sure: climate science is inherently statistical (which we all knew), and they make you want to have a statistician as co-author for any publication, given the complexity of the statistical analysis needed to come up with a credible conclusion.
While Bayes’s theorem is correct, Bayesian parameter estimation generally suffers from the lack of uniqueness of the prior probability density function with consequent violation of Aristotle’s law of non-contradiction. In particular, the method by which climatologists extract numerical values for the equilibrium climate sensitivity (TECS) from global temperature data violates non-contradiction.
There is an exception to the rule that Bayesian methods violate non-contradiction. It occurs in the circumstance that a sequence of events, for example a sequence of coin flips, underlies the model. In this case, the so-called “uninformative” prior probability density function is unique and non-contradiction is not violated.
However, no events underlie the IPCC climate models. In setting up this state of affairs, climatologists have ensured that the conclusions of their arguments violate non-contradiction. Rather than there being one value for TECS, for example, there are infinitely many.
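For concreteness, here is what the coin-flip exception looks like in practice – a minimal sketch of conjugate Bayesian updating for a Bernoulli sequence. The Jeffreys prior Beta(1/2, 1/2) is used as the “uninformative” prior (its uniqueness in this setting is the commenter’s claim), and the flip data are made up for illustration:

a, b = 0.5, 0.5               # Jeffreys prior: Beta(1/2, 1/2)
flips = [1, 0, 1, 1, 0, 1]    # 1 = heads; illustrative data
for x in flips:
    a, b = a + x, b + (1 - x)  # conjugate update after each observed flip
print(f"posterior mean P(heads) = {a / (a + b):.3f}")  # 4.5/7 ~ 0.643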
@Vincent
Thanks for reminding me of Number Watch and John Brignell.
The answer is Yes!
@All of You
Please check in on the Warsaw meeting.
http://www.worldcoal.org/
rgb
Nice post
Lies, damned lies – and statistics 🙂
Mr. Berple. Are you not confusing probability with statistics?
polistra says:
November 12, 2013 at 4:28 am
“Stats can’t be saved by tweaking. The only proper standard for science is NO STATS. If a result has to be reached by statistics, it’s not a scientific result. Only well-calibrated measurements plus a well-balanced experiment can give scientific results.”
In fact, a number of scientific disciplines are observational and are unable to reach direct results by experiment. Major examples: astronomy, paleontology, evolutionary biology. Some of their findings are “consistent with” some experiments done in another context, but most of their findings are not reached through experimental methods. Of course, observational sciences still require careful measurements, and thus the other condition set by Polistra (well-calibrated measurements) generally applies, although one has to be careful about the basis for calibration (cf climate models calibrated to the rapidly warming 1970-2000 period).
And what about a sentence like “we have 95% confidence that a fact is true”…
so if the fact is not true, we are not wrong, just unlucky….
What about a sentence like “extreme events will be more extreme… we can say that with low confidence”… I can say with low confidence that it will not rain tomorrow…
They can never be wrong…
Mike Maguire says:
November 12, 2013 at 7:55 am
“Why Most Published Research Findings Are False ”
See also Sturgeon’s law
http://en.wikipedia.org/wiki/Sturgeon%27s_law
– – – – – – –
That statement ignores the situation where the researcher(s) consciously and intentionally choose to use “evidence thresholds that do not represent sufficiently strong evidence in favor of hypothesized effects” and then obfuscate that they have done so.
John
Where is the damn like button, +, whatever on the comments. There is a nice list above of people who get it. I was watching a show about angels a long time ago. A person came on and said, “I fell through the ice, an angel appeared and said, ‘Go this way’, and I followed her advice and found the hole to get back out!”
My wife said … “What about all the people who fell through the ice and saw the angel and followed their directions and didn’t get out!”
This exactly corresponds to the claim that “One tree in 1000 matches temperature better!”
This is why you look at the absolute value of data and do not get too enamored of anomaly data. If the anomaly data lets you make accurate predictions on the unknown, awesome. Keep that absolute chart posted next to you, though, to remind you of the underlying magnitudes.
The inverse of the risk ratio is the survivability factor. Even with smoking, survivability with respect to lung cancer is 92% over 60 years: there is a 92% chance you won’t get lung cancer.
This is one of the reasons I abandoned a PhD in Sociology, a discipline that regularly works at the p=0.05 or even p=0.1 level. Very little was reproducible, and hardly anyone even tried.
“This approach (really, no approach) is going to be particularly reliable with only five or six flips because …” — rgbatduke @6:45
rgb wrote a great bit on this, but it is distinctly different from what the paper is proposing. The paper’s stated problem, and its proposed solution, is to singly publish more papers that are reproducible.
Without reproducing them.
This is absurd at first order, for if you make one mistake, one deceit, one oversight, or one bad choice with outliers and otherwise cleaning your dataset in one paper, you have flipped the coin exactly once. To flip the coin five or six times you need to repeat the exact experiment of the paper another four or five times.
Bayesian notions are pretty clever and terribly interesting, and they point out exactly why experiments that are not replicated are nonsense. But this paper is precisely about non-replication. Faith, trust, whatever you like – but it’s a misplaced solution.
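A sketch of that replication point, with illustrative numbers of my own (half of tested nulls true, a modest real effect, n = 30 per study): it asks how often an exact, independent repeat of a “significant” study is also significant.

import random

random.seed(2)
z_crit, effect, n = 1.96, 0.35, 30   # two-sided p<0.05; assumed effect & sample size
se = 1 / n ** 0.5

def study(is_real):
    """One study: does a z-test reject at two-sided p < 0.05?"""
    xbar = random.gauss(effect if is_real else 0.0, se)
    return abs(xbar / se) > z_crit

published = replicated = 0
for i in range(20_000):
    is_real = i % 2 == 0               # half of hypotheses are real effects
    if study(is_real):                 # "significant" -> gets published
        published += 1
        replicated += study(is_real)   # flip the coin a second time
print(f"replication rate of significant findings: {replicated / published:.0%}")
# with these numbers, under half of the 'findings' survive one exact repeat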
Thank you, bwanajohn, for posting a link to that excellent article in The Economist. I’ve added both it and the Johnson paper to my collection of such articles, on my web site, here:
http://www.sealevel.info/papers.html#whitherscience
Great issue. A probability estimate, a “p” level, is derived internally, based only on the data of the study. It therefore cannot be any sort of statement about the world outside of that data set.
A p value only helps contribute to the evaluation of how convincing the study outcome is.
The problem is that some people value it for more than that.
Bradford Hill, instrumental in fingering tobacco smoking as a cause of lung cancer, wrote the well-recognized 1965 paper “Association or Causation?”. He gives nine well-recognized aspects for helping assess observational data. “Statistical likelihood” is not among them, and is briefly mentioned and put well in its place toward the end of the article.
Robert says:
November 12, 2013 at 9:17 am
“This is one of the reasons I abandoned a PhD in Sociology, a discipline that regularly works at the p=0.05 or even p=0.1 level. Very little was reproducible, and hardly anyone even tried.”
Find Harald Eia’s series Hjernevask (“Brainwash”) on YouTube. He’s a Norwegian sociologist and comedian and has fun confronting his Norwegian sociologist colleagues with inconvenient facts.
tumetuestumefaisdubien1 says: November 12, 2013 at 7:36 am
I don’t believe raising thresholds helps much, and I don’t believe 0.005 or 0.001 levels are needed to avoid bad science.
In fact, the conclusion that raising thresholds to such levels will mean better science looks to me more insane than logical….
>>>>>>>>>>>>>>>>>>>>>
You should be doing what we did in QC. (This depends of course on the cost of the item and the cost of the tests.)
Taking ferd berple’s example of drug testing.
The sample from each person gets split before testing. Part of the sample is used for testing; the rest is retained. You know that out of the ten who tested positive, approximately five may be false positives. You then use the more expensive but more accurate test to retest the retained splits for all ten, to confirm the positives. (Splitting the sample is what is done for drug testing truck drivers.)
In science, the equivalent is independent replication, usually by at least two independent labs. If a result cannot be replicated, it gets tossed into the dustbin of history.
I see no reason to change this, but confirmation testing by an independent lab or two or three is an absolute must, and this is what is missing in climate science.
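A sketch of that two-stage procedure, using ferd berple’s numbers (the 99.9% accuracy of the expensive confirmatory test is an assumption for illustration):

import random

random.seed(3)
employees = [True] * 5 + [False] * 95   # 5 actual users among 100

def test(is_user, accuracy):
    """Positive with probability `accuracy` for users, 1-accuracy for non-users."""
    return is_user if random.random() < accuracy else not is_user

flagged = [u for u in employees if test(u, 0.95)]     # stage 1: cheap screen
confirmed = [u for u in flagged if test(u, 0.999)]    # stage 2: retest retained split
print(f"{sum(confirmed)} guilty and {len(confirmed) - sum(confirmed)} innocent confirmed")
# typically ~5 guilty, 0 innocent: the retest weeds out stage-1 false positives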
I think in climate science (and, to be fair, quite often in other science areas!) it would be fair to say that many researchers use the “P” value simply because, without it, the casual reader/interpreter would realise that the research doesn’t pass the “sniff” test. In other words, it’s like an imaginary forcefield, protecting/deflecting the actual research from external attack… (hmm, is there a cartoon in there somewhere?)
Anyway, for some as yet unknown reason, many folk are incapable of ‘reverse analysis’ from statistical ‘results’, especially when reading data from ‘surveys’.
As for P values in climate science (which, in such a large, complex and chaotic system, cannot actually be adequately defined) – I think it can be adequately summed up in the simple phrase:
‘They are taking the P…’