Raising the bar on statistical significance

I was searching the early edition of PNAS for the abstract of yet another sloppy “science by press release” that didn’t bother to give the title of the paper or the DOI, and came across this paper, so it wasn’t a wasted effort.

Steve McIntyre recently mentioned:

Mann rose to prominence by supposedly being able to detect “faint” signals using “advanced” statistical methods. Lewandowsky has taken this to a new level: using lew-statistics, lew-scientists can deduce properties of population with no members.

Josh (N=0) humor aside, this new paper makes me wonder: how many climate science findings would fail the evidence thresholds under this newly proposed standard?

Revised standards for statistical evidence

Valen E. Johnson

Significance

The lack of reproducibility of scientific research undermines public confidence in science and leads to the misuse of resources when researchers attempt to replicate and extend fallacious research findings. Using recent developments in Bayesian hypothesis testing, a root cause of nonreproducibility is traced to the conduct of significance tests at inappropriately high levels of significance. Modifications of common standards of evidence are proposed to reduce the rate of nonreproducibility of scientific research by a factor of 5 or greater.

Abstract

Recent advances in Bayesian hypothesis testing have led to the development of uniformly most powerful Bayesian tests, which represent an objective, default class of Bayesian hypothesis tests that have the same rejection regions as classical significance tests. Based on the correspondence between these two classes of tests, it is possible to equate the size of classical hypothesis tests with evidence thresholds in Bayesian tests, and to equate P values with Bayes factors. An examination of these connections suggest that recent concerns over the lack of reproducibility of scientific studies can be attributed largely to the conduct of significance tests at unjustifiably high levels of significance. To correct this problem, evidence thresholds required for the declaration of a significant finding should be increased to 25–50:1, and to 100–200:1 for the declaration of a highly significant finding. In terms of classical hypothesis tests, these evidence standards mandate the conduct of tests at the 0.005 or 0.001 level of significance.

From the discussion:

The correspondence between P values and Bayes factors based on UMPBTs suggest that commonly used thresholds for statistical significance represent only moderate evidence against null hypotheses. Although it is difficult to assess the proportion of all tested null hypotheses that are actually true, if one assumes that this proportion is approximately one-half, then these results suggest that between 17% and 25% of marginally significant scientific findings are false. This range of false positives is consistent with nonreproducibility rates reported by others (e.g., ref. 5). If the proportion of true null hypotheses is greater than one-half, then the proportion of false positives reported in the scientific literature, and thus the proportion of scientific studies that would fail to replicate, is even higher.

In addition, this estimate of the nonreproducibility rate of scientific findings is based on the use of UMPBTs to establish the rejection regions of Bayesian tests. In general, the use of other default Bayesian methods to model effect sizes results in even higher assignments of posterior probability to rejected null hypotheses, and thus to even higher estimates of false-positive rates.

This phenomenon is discussed further in SI Text, where Bayes factors obtained using several other default Bayesian procedures are compared with UMPBTs (see Fig. S1). These analyses suggest that the range 17–25% underestimates the actual proportion of marginally significant scientific findings that are false.

Finally, it is important to note that this high rate of nonreproducibility is not the result of scientific misconduct, publication bias, file drawer biases, or flawed statistical designs; it is simply the consequence of using evidence thresholds that do not represent sufficiently strong evidence in favor of hypothesized effects.
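
To put numbers on the passage just quoted: with a 50:50 prior on the null being true, a posterior probability of 17–25% that the null is true corresponds to Bayes factors of roughly 5:1 to 3:1 against it. Here is a minimal sketch (in Python) of that arithmetic; the specific Bayes factors are implied by the quoted range rather than taken from the paper’s tables, so treat them as illustrative.

# Posterior probability that the null is true, given a Bayes factor against it
# and a prior probability that the null is true. The BF values of ~3 and ~5 are
# implied by the quoted 17-25% range under a 50:50 prior; the larger values are
# the evidence thresholds the paper proposes.
def posterior_null(bf_against_null, prior_null=0.5):
    prior_odds_null = prior_null / (1.0 - prior_null)
    posterior_odds_null = prior_odds_null / bf_against_null
    return posterior_odds_null / (1.0 + posterior_odds_null)

for bf in (3.0, 5.0, 25.0, 50.0, 100.0, 200.0):
    print(f"BF {bf:>5.0f}:1 against the null -> P(null | 'significant') = {posterior_null(bf):.3f}")

Under that 50:50 prior, moving from Bayes factors of 3–5 to the proposed 25–50 cuts the false-positive share of “significant” findings from roughly a fifth to a few percent, which squares with the factor-of-5-or-greater reduction claimed in the Significance statement.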

=================================================================

The full paper is here: http://www.pnas.org/content/early/2013/10/28/1313476110.full.pdf

The SI is here: Download Supporting Information (PDF)

For our layman readers who might be a bit behind on statistics, here is a primer on statistical significance and P-values as they relate to weight loss and nutrition, something you can easily get your mind around.

The gross failure of scientific nutrition studies is another topic McIntyre recently discussed: A Scathing Indictment of Federally-Funded Nutrition Research

So, while some dicey science findings might simply be low threshold problems, there are real human conduct problems in science too.

M Simon

This is the reason so many nutritional findings are reversed. You know. Trans fats are good. Trans fats are bad. Soon to be followed by meat eating is banned. Too many natural trans fats.
The Atkins diet? There are problems.
A few small, short studies suggest Atkins raises HDL cholesterol and lowers blood pressure, reducing the risk of heart attack and stroke. But many of the studies were small and short, and some of the positive findings did not carry enough statistical weight to be trustworthy. And all that fat worries most experts.
http://health.usnews.com/best-diet/atkins-diet

polistra

Stats can’t be saved by tweaking.
The only proper standard for science is NO STATS. If a result has to be reached by statistics, it’s not a scientific result. Only well-calibrated measurements plus a well-balanced experiment can give scientific results.

Vincent

Hasn’t John Brignell over at Number Watch been banging on about exactly this for years?

DC Cowboy

Still my favorite Stats prof saying, “Numbers are like people, torture them enough and they’ll tell you whatever you want to hear.”

jimmi_the_dalek

The only proper standard for science is NO STATS. If a result has to be reached by statistics, it’s not a scientific result. Only well-calibrated measurements plus a well-balanced experiment can give scientific results.
Nonsense. Many areas of science are inherently statistical. Chemical reactions for example, or quantum effects, or thermodynamics. The Second Law of Thermodynamics is entirely statistical in origin.

steveta_uk

Interesting that in climate scientology you can be 95% sure on little evidence, whereas in particle physics, 6-sigma results are required to come close to verifying something.

chris y

polistra says-
“If a result has to be reached by statistics, it’s not a scientific result.”
William Briggs, statistician to the stars, doesn’t go that far, but what he writes is still pretty damning-
“If you need statistics to prove something, and you have no proof except statistical proof, then what you have proved probably isn’t true. Statistical evidence is the lowest form of evidence there is. What a depressing conclusion.”
William Briggs, statistician, October 5, 2011

Bill Illis

In fiscal year 2012-13 ended September, the US Government spent $2.5 billion on climate research (among $20 billion more spent on climate change for clean energy, international assistance, tax credits etc.)
http://www.powerlineblog.com/archives/2013/11/why-does-the-global-warming-hoax-persist.php
If you assume this $2.5 billion supports X number of researchers at $125,000 each, there would be 20,000 climate researchers. It is just an industry which has no incentive to say “oops, we got it wrong.” If a single scientist said so, he would be side-lined in short order because there are 20,000 other researchers in the US that depend on the $2.5 billion of income to keep coming in.
So, that is just the US. Globally, $359 billion was spent on climate change last year. If the same ratios applied to this number, there would be $40.4 billion spent on climate research supporting 320,000 researchers.
http://climatepolicyinitiative.org/publication/global-landscape-of-climate-finance-2013/
Those are the only statistics that count in this field.

Thanks for the citations, which I will read in light of my slim understanding of E. T. Jaynes, who did science – physics – with Bayesian statistics.
I appreciated the “torture” comment above; statistics, people, or the internet can be tortured into providing any desired statement.

Carlo Napolitano

Well, this is not new in biology and medical research (the kind I do). Statistical significance in clinical trials sometimes leads to weird conclusions, but only those that are based on biologically relevant responses are eventually brought to the bedside. The majority of clinical trials (those based on “unexpected” findings) are simply forgotten over the years.
Thus, going beyond the statistical significance and making clinical studies that are intrinsically mechanistic is the only way to produce good medical science.
In a nutshell….. it is better to understand what you are talking about

Mickey Reno

Well, it’s a good start….

Finally, it is important to note that this high rate of nonreproducibility is not the result of scientific misconduct, publication bias, file drawer biases, or flawed statistical designs; it is simply the consequence of using evidence thresholds that do not represent sufficiently strong evidence in favor of hypothesized effects.

This is a false statement, IF scientists intentionally refuse to archive or share their data for use by would-be replicators, the motives of whom the original scientist seems to think he has a right to impugn. It’s also false IF the statistical result coincidentally matches reality. A high degree of correlation STILL won’t prove causation. With smoothing of temperature graphs using multi-year averages, CO2 being released by the oceans during warmer periods would match a hypothesis that human emissions caused the higher CO2 levels. That’s the difficulty with correlations: you need to test EVERYTHING, EVEN THE THINGS YOU DON’T KNOW, to see whether or not other things might also be correlated. This is, for the most part, impossible when testing a system as complex as the Earth’s climate. Ergo, statistics will always leave us needing more. Conclusions based on statistics will always be suspect if they are used to deduce causality.

bwanajohn

Here is another excellent article on non self-correcting science and the misuse of statistics.
http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble

Espen

This seems like real non-news to anyone with a little background in statistics. It has always puzzled me that some branches of science seem content with p-values as high as 0.05!

John Brignell has indeed been talking about this sort of thing for years. His critique of the “evidence” against secondary smoking, for example, is particularly withering.

[snip – stop trying to push your solar stuff on unrelated threads – Anthony]

James Strom

I’m not sure I understand this proposal. The idea is to reduce occurrences of nonconfirmation by requiring a significance level of, say, 0.005 instead of 0.05. But under these new standards there will be results which just barely pass the test. When those results are subjected to a repeat study for reconfirmation, wouldn’t they be just as likely to miss significance as marginal results were under the older regime?
Of course, adopting this stricter standard would result in far fewer studies achieving significance in the first place, and that would of itself reduce non-confirmation.
I’m not a statistician, so I’m happy to admit that I may have missed something.

bob sykes

Particle physicists don’t make claims unless the exceedance probability is over 6 sigma.

Actually, whether or not a causal linkage exists between two factors is almost never of any importance, regardless of whether your tests can detect it. What’s important is the strength of the linkage. Using the word “significant” in the term “statistical significance” has confused more folks than any other phrase in science, misleading them into believing that the detected linkage is “significant,” i.e., a strong linkage, which the test results actually say nothing about. Simply using a larger sample size makes for a more powerful statistical test and allows trivial linkages to be detected as “statistically significant.” Just assuring ourselves that real, actual linkages exist is rarely of any value. We need to know HOW strong the linkage is, not simply whether one exists or not (i.e., whether the null hypothesis can be rejected).

DirkH

M Simon says:
November 12, 2013 at 4:23 am
“This is the reason so many nutritional findings are reversed. You know. Trans fats are good. Trans fats are bad. Soon to be followed by meat eating is banned. Too many natural trans fats.
The Atkins diet? There are problems.
A few small, short studies suggest Atkins raises HDL cholesterol and lowers blood pressure, reducing the risk of heart attack and stroke. But many of the studies were small and short, and some of the positive findings did not carry enough statistical weight to be trustworthy. And all that fat worries most experts.”
Well, maybe you should first point to a study that shows a correlation between fat consumption and heart attacks, without leaving out all data points that would destroy the correlation.
And please, the site you linked to says “The theory: The body is an engine; carbs are the gas that makes it go.”
Oh please. Ask any bicyclist what fuel he runs on most of the time. That site is a bad joke and misrepresents what it attacks.
Atkins, BTW, did not invent the low-carb diet.
http://www.lowcarb.ca/corpulence/corpulence_full.html

ferd berple

You have 100 people in a company, of which 5 are actually using drugs. You employ a drug test that is 95% reliable. It will deliver 5 false positives. There is a 50-50 chance that someone identified as using drugs is actually using drugs.
What scientists routinely fail to account for is that statistical significance needs to be considered in the context of how “rare” the thing you are looking for is. When you are looking for hay in a haystack, 95% reliability works perfectly well. You will very rarely find a needle instead of hay.
However, when you are looking for the needle in the haystack, then 95% reliability is worthless. Most of what you identify as “needles” will in fact be hay.
The problem is that for most of climate research, what they are looking for is the needles in the haystack. They want to, for example, identify the very small human temperature signal from the much larger daily and annual temperature signal. So when they apply the 95% test they get hockey sticks when the reality is closer to hockey pucks.
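
ferd berple’s drug-test arithmetic in a minimal Python sketch; the figures (100 employees, 5 actual users, a test read as 95% accurate in both directions) come straight from the comment.

# Positive predictive value of a "95% reliable" test when only 5 of 100
# people are actually users (the comment's example).
users, non_users = 5, 95
sensitivity = 0.95            # chance a real user tests positive
false_positive_rate = 0.05    # chance a non-user tests positive

true_positives = users * sensitivity                # about 4.75
false_positives = non_users * false_positive_rate   # about 4.75

ppv = true_positives / (true_positives + false_positives)
print(f"P(actually a user | positive test) = {ppv:.2f}")   # about 0.50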

rgbatduke

Since this is something of my game, I’ll summarize the idea. In Bayesian probability analysis, one rarely just states hypotheses. One states hypotheses based on various assumptions. Sometimes these assumptions are actually stated, sometimes they are not. Sometimes they are (in some rigorously defensible sense) true and sufficiently accurate beyond reasonable doubt — the assumption of Galilean or Newtonian gravity for near-Earth-surface physics problems — and sometimes they are basically little more than educated guesses with little or no solid evidence.
The simplest learning example of the latter is a coin flip. Suppose you meet up with a total stranger, who wants to bet you $1 a flip on 100 successive flips, and you get heads. You say to yourself “Hmm, a two sided coin, zero sum game, random walk, I expect to win or lose no more than $10-15 (with equal probability) and it will help to pass the time, sure.” You have intuitively used Bayesian reasoning. You picked a prior probability for a two-sided coin of 0.5 heads or tails, from symmetry and maximum entropy principles.
You didn’t have to. You could have said to yourself “Gee, a complete stranger wants to play a game of ‘chance’ with me with his coin, where he gets to pick heads for me. I’ll bet that he’s playing with a biased coin so that if I take this sucker bet, my expectation is to lose $100 or close to it.” This too is a Bayesian prior. You go ahead and take the bet anyway because you are bored and because you can count on proving that he cheated if the winning gets too lopsided.
They are also the basis for the null hypothesis — The coin has p(heads) = 0.5 or the coin has p(heads) = 0.0 respectively.
In traditional hypothesis testing, one assumes the null hypothesis, conducts an experiment, and determines the probability of obtaining the observed sequence given the null hypothesis. If you assumed p = 0.5 and the results of the first 8 flips were all tails, the probability of this outcome is 1/256 ≈ 0.004. Based on additional prior knowledge you have (basically, the reasoning in the second case that it is unwise to gamble with strangers, especially with their cards, dice, coins) you might well have rejected the null hypothesis at the seventh flip or even the sixth. Since the bet is cheap, you might even go two more flips, but at the tenth flip you’ve already reached your expected win/loss threshold based on a random sequence of flips of an unbiased coin and the p-value is down to less than 0.001. Time to offer the stranger a counter-bet — using the same coin, for the rest of the bets you get tails and he gets heads OR nobody pays off and everybody walks away.
In Bayesian probability theory, one doesn’t reason exactly this way. Instead, one can allow the results of the experiment to modify your assertion of prior probabilities so that they basically asymptotically agree with the data however you start them out. Before the first flip your prior estimate of p(heads) is 0.5, but after a single flip — whether it comes out heads or tails — this is not true. If you get tails at the end of one flip, the probability p(heads) is strictly less than 0.5, and it monotonically descends with every sequential flip of tails.
This approach (really, no approach) is going to be particularly reliable with only five or six flips because (in the words of George Marsaglia, a giant of hypothesis testing and random numbers) “p happens”. If you conduct many such experiments, not only will sequences of six tails in a row occur, but they’ll occur one in sixty-four randomly selected sequences of six coin flips, even if the coin is unbiased!
This is precisely the kind of reasoning that is not being conducted in climate science. In part this is because it is a house of cards — if you knock a single card loose on the ground floor, you risk the whole thing tumbling down. The list of (Bayesian prior) assumptions is staggering, and in the case of many of them there is simply no possibility of directly measuring them and setting them on the basis of direct empirical evidence, or there is some evidentiary basis for a number, but there is a rather wide range of possible/probable error in that number.
This multiplicity of Bayesian priors has the following effect. It becomes quite possible to obtain good agreement with a limited sequence of data with incorrect prior assumptions. This, in turn, means that limits on p have to be made more stringent. If one uses the coin metaphor, an additional (unstated) Bayesian prior assumption is that it is possible for you to detect it if your opponent switches coins! Of course, your opponent in addition to being a scoundrel could be an unemployed magician, and swapping coins “invisibly” could be child’s play to him. In that event, he might play you with a fair coin for the first ten or twenty throws, allowing you to conclude that the coin is a fair coin, and then swap in the unfair coin for a few throws out of every ten for the rest of the way. You end up losing with complete certainty, but at a rate just over the expected win/loss threshold. He makes (say) $20 or $25 instead of $100, but he kept you in the game until the end because you could have just been unlucky with a fair coin and you had direct evidence of apparently fair sequences.
In more mathematical theories the same thing often happens if you have nonlinear functions with parametric partially cancelling parts contributing to some result, function forms with a lot of covariance. One can often get good agreement with a segment of data with a mutually varying range of certain parameters, so that any prior assumption in this range will apparently work. It’s only when one applies the theory outside of the range where the accidental cancellation “works” that the problem emerges. Once again naively applying a hypothesis test to the range of data used to justify a Bayesian assumption of parameters is completely incorrect, as is applying the assumption to a naively selected trial set, and the requirements for passing a hypothesis test at all have to be made much more stringent in the meantime to account for the lack of evidence for the Bayesian priors themselves. The evidence has to both agree with the null hypothesis and affirm the prior assumptions, and to the extent that it has to do the latter without outside support it takes a lot more evidence to affirm the priors in the sense of not rejecting the null hypothesis. You have many “and” operators in the Bayesian logic — if the primary theory is correct AND this parameter is correct AND this second parameter is correct AND this third parameter is correct, what are the chances of observing the data — and the computation of that chance has to be diluted by accounting for all of the OTHER values of the prior assumptions that might lead to the same agreement with the data but effectively constitute a different theory.
And then there is data dredging, which is very nearly another version of the same thing.
So when there are 30+ GCMs, each with dozens of parameters in a complex theory with many countervarying nonlinear contributions, that are in good agreement with an indifferently selected monotonic training set of data, that is not yet particularly good evidence that the theory is correct. The real test comes when one applies it to new data outside of the training set, ideally data that does not conform to the monotonic behavior of the training set. If the theory (EACH GCM, one at a time) can correctly predict the alteration of the monotonic behavior, then it becomes more plausible. To the extent that it fails to do so, it becomes less plausible. Finally, as theories become less plausible, a Bayesian would go in and modify the prior assumptions to try to reconstruct good agreement between the theory and observation, just as you might change your prior beliefs about the fairness of the coin above, if you start seeing highly improbable outcomes (given the assumptions).
It is this latter step that is proceeding far too slowly in climate science.
rgb
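
A minimal sketch of the updating rgb describes, using a uniform Beta(1,1) prior on p(heads) as a stand-in for the symmetry/maximum-entropy prior (the specific choice of prior is an assumption for illustration, not something rgb specifies):

# Conjugate Beta-Binomial updating of p(heads) as successive tails are observed.
# The uniform Beta(1,1) prior has mean 0.5; rgb's point only needs the posterior
# mean to fall monotonically with every additional tail.
alpha, beta = 1.0, 1.0    # Beta prior parameters (heads, tails)
p_fair_run = 1.0          # probability of the observed run under a fair coin

for flip in range(1, 9):                 # eight tails in a row
    beta += 1.0                          # each tail increments the 'tails' count
    p_fair_run *= 0.5
    print(f"after {flip} tails: E[p(heads)] = {alpha / (alpha + beta):.3f}, "
          f"P(run | fair coin) = {p_fair_run:.4f}")
# After 8 tails, E[p(heads)] = 0.1 and P(run | fair coin) = 1/256, about 0.004.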

MCT

In defence of the claim that statistics has been used successfully in science, jimmi-the-dalek says above that:
“Nonsense. Many areas of science are inherently statistical. Chemical reactions for example, or quantum effects, or thermodynamics. The Second Law of Thermodynamics is entirely statistical in origin.”
This is true in one sense but rather misleading. The sort of ‘statistics’ used in quantum physics and classical thermodynamics is radically different from that used in climate science. The first is deductive, the second inductive. In thermodynamics we make assumptions about the probabilities of elementary events and then use the calculus of probabilities to derive probabilities for complex events. This is a process as rigorous as anything in mathematics, and the results are compared with experiment – e.g. time-of-flight observations for the Maxwell-Boltzmann distribution. Compared with what happens in climate science this can hardly be called ‘statistics’ at all – ‘stochastic’ might be a better word for it. In climate science they start with the experimental results – time-series and so on – and use genuinely statistical methods to try to find trends, correlations and causality.
Of course, in the experimental testing of quantum results, say, statistics in this second sense is used – as we saw with the detection of the Higgs particle – but that is another matter. I think the reliance of any science on statistical analyses, to the extent we see in climate science, is a real cause for concern – especially when the level of competence shown has been criticised so heavily by those whose competence non-experts such as myself have no reason to doubt.

mkelly

The following, published in German in 1854, is known as the Clausius statement:
Heat can never pass from a colder to a warmer body without some other change, connected therewith, occurring at the same time.
Statistical mechanics was initiated in 1870 with the work of Austrian physicist Ludwig Boltzmann…
Both the above from Wiki.
jimmi_the_dalek says:
November 12, 2013 at 4:48 am
The Second Law of Thermodynamics is entirely statistical in origin.
Since the Clausius statement preceded the initiation of statistical mechanics by some 16 years, I find it hard to understand how you can say that the second law is entirely statistical in origin.

ferd berple

rgbatduke says:
November 12, 2013 at 6:45 am
and sometimes they are basically little more than educated guesses with little or no solid evidence.
=============
the assumption that some trees make better thermometers than other trees, and that one tree in particular is a better thermometer than all other trees.
if you look at 1000 trees, you will find that one tree better follows the thermometer records than all the others. and if you repeat this study, you will find that in another set of 1000 trees, there is one tree that better matches the thermometer records.
climate science teaches the reason that one tree in 1000 better matches the thermometer records is that this tree is a better thermometer. thus this tree can be used reliably as a thermometer proxy outside the time in which we have thermometers.
chance tells us that if you look at 1000 trees there will always be 1 tree that better matches the thermometer records, and the reason has nothing to do with being a better thermometer. climate science rejects chance as the cause. Of this they are 97% certain.
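
ferd berple’s selection-effect point is easy to check with a small simulation; the series lengths and the pure-noise “trees” below are illustrative choices, not a description of any actual proxy study:

# Screen 1000 pure-noise "trees" against a 50-point "thermometer record":
# the best match looks significant in the calibration window purely by chance
# and then fails out of sample.
import numpy as np

rng = np.random.default_rng(0)
n_trees, n_cal, n_holdout = 1000, 50, 50

thermometer = rng.standard_normal(n_cal)
trees = rng.standard_normal((n_trees, n_cal + n_holdout))

cal = trees[:, :n_cal]
cors = [np.corrcoef(cal[i], thermometer)[0, 1] for i in range(n_trees)]
best = int(np.argmax(cors))
# for 50 points, |r| above roughly 0.28 would nominally be "significant" at the 5% level
print(f"best of {n_trees} noise 'trees': r = {cors[best]:.2f} in calibration")

holdout_target = rng.standard_normal(n_holdout)
r_out = np.corrcoef(trees[best, n_cal:], holdout_target)[0, 1]
print(f"same 'tree' out of sample: r = {r_out:.2f}")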

Jim Rose

I don’t understand. P=.05 should mean that the null hypothesis is correct only 5% of the time. How does this turn into error rates of approximately 20% — supposing that biases are accounted for?
What is the basic idea that changes 5% to 20%?

ferd berple

Perhaps the biggest assumption in modern science is the assumption that the purpose of science is to conduct research, to discover “The Truth”.
What if the true purpose of science is to attract funding? What scientific body would welcome a study that said “nothing to worry about, you can cut our funding”?

ferd berple

Jim Rose says:
November 12, 2013 at 7:19 am
What is the basic idea that changes 5% to 20%?
=======
You have 100 people in a company, of which 5 are actually using drugs. You employ a drug test that is 95% reliable. It will deliver 5 false positives. There is a 50-50 chance that someone identified as using drugs is actually using drugs.
You believe your test will be 95% accurate, but in fact 50% of the time it incorrectly identifies the drug user. Since you don’t know there are only 5 drug users, when the test tells you there are 10 drug users, you believe the test. 5 innocent employees get fired along with the 5 guilty.
100 scientists conduct studies, of which 5 produce real positives. At the 95% significance level, 5 produce false positives. Only the positives get published, as negative results do not attract attention. 5 true positives along with 5 false positives: 50% of the published papers are false.

tumetuestumefaisdubien1

I don’t believe raising thresholds much helps and I don’t believe 0.005 or 0.001 levels are needed for avoiding bad science.
In fact such a conclusion that raising thresholds to such levels will mean better science looks to me more insane than logical. In my opinion it defies probability theory, and it would inevitably generate more false negatives than there now are false positives – at least for a normal distribution.
Statistics should anyway be considered only supporting, indirect evidence. Just as people shouldn’t be sentenced on indirect evidence alone, scientists should also have other, direct evidence for their hypotheses. There are also hypotheses which aren’t testable by statistics at all. And where they are, statistics is powerful but still not omnipotent – it can’t replace common sense. Science shouldn’t be an industry that produces knowledge deemed true just for passing statistical tests.
It is, on the other hand, essential to the scientific method that it be based on falsification – one comes up with a hypothesis, tests it, comes to a conclusion, publishes it, and then another comes along, finds the flaws, and falsifies it. And the falsification should then be taken most seriously. Not like in climate science, where the GST shows no change or even cooling and the global reports still, without change, talk about global warming, attribute it to CO2, and demand even more money and power anyway.
Science is not here for itself; it is a social phenomenon driven both by competence and by competition. That raising thresholds will automatically mean less junk science and less nonreproducibility is just a belief, not the nature of science. It doesn’t work like that, and there are no “shortcuts” from ignorance to knowledge through restricting science; the only way there goes through honesty, dedication and freedom of conscience, in my opinion.

Matthew R Marler

Jim Rose: I don’t understand. P=.05 should mean that the null hypothesis is correct only 5% of the time. How does this turn into error rates of approximately 20% — supposing that biases are accounted for?
What is the basic idea that changes 5% to 20%?

The p-value is the conditional probability of a result at least as far from the null prediction as the one observed, given that the null hypothesis is true. Only rarely would that translate into false rejections of the null hypothesis appearing in just 5% of published papers. What changes the 5% to 20% (speaking loosely) is the fact that the null hypotheses are actually true (or at least very accurate) more often than the researchers expect them to be.
The proposal in the focus paper and the reasoning behind it are not that new. The debate about what significance level to use is probably at least a century old now. The penalty for insisting on low p-values is a higher than desirable rate of false non-rejection of false null hypotheses, aka “high rate of type ii errors”.
What has to be recognized is that single reported studies, without replication, are not as reliable as you would like them to be. That is not a problem solved by adopting smaller p-values, but by a deeper appreciation of the prevalence and size of random variation (the agglomeration of all the variation that is not directly relatable to the focus of the study and the measurable covariates).
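
A small sketch of the arithmetic behind Jim Rose’s question, using the standard screening calculation (the same logic as ferd berple’s drug-test example) rather than the Bayes-factor calculation in the paper. The 50:50 prior follows the quoted paper; the power values are illustrative assumptions, and only positive results are assumed to get published:

# Fraction of published "significant" results that are false positives, as a
# function of the significance level and the average power of the studies,
# assuming half of all tested nulls are true and only positives get published.
def false_discovery_fraction(alpha, power, fraction_true_nulls=0.5):
    false_pos = fraction_true_nulls * alpha
    true_pos = (1.0 - fraction_true_nulls) * power
    return false_pos / (false_pos + true_pos)

for power in (0.2, 0.5, 0.8):
    for alpha in (0.05, 0.005):
        fdf = false_discovery_fraction(alpha, power)
        print(f"power {power:.0%}, alpha {alpha}: {fdf:.1%} of positives are false")

With modest average power, the false share at alpha = 0.05 lands in the same ballpark as the 17–25% quoted from the paper, and tightening alpha to 0.005 cuts it by roughly a factor of 8–10.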

ferd berple

10 thousand Olympic athletes are tested for drug use with a test that is 99.99% accurate. what are the odds that an athlete identified as a drug user is a drug user?

ferd berple says:
November 12, 2013 at 7:14 am
“…climate science rejects chance as the cause. Of this they are 97% certain.”
Well said.

Mike Maguire
Theo Goodwin

Nice explanation of applying Bayesian statistics to the flip of a coin. There is one huge difference between your example and the use of statistics in climate science. In your example, the statistician knows what the event space is. The event space consists of readily observable and readily definable physical phenomena, namely, flips of the coin. Climate science does not have the luxury of a readily observable or readily definable event space. Permit me to illustrate by reference to Michael Mann’s work on tree rings. What is (was) an event for Mann? It was the measurement of the width of a tree ring. For Mann, a series of events that together make up a proxy record consists of measurements of tree ring widths for the same kind of tree taken from trees scattered about the surface of the Earth. Does that count as a legitimate event space? Clearly not, because tree ring growth is not determined by the tree but by the ever changing environment in which the tree grows. Did Mann investigate the several environments in order to establish readily observable and definable phenomena? Not in the least. All he did was a little hand waving to the effect that the sample trees were pretty much located at the tree line (on mountains, of course).
The usual Alarmist response to my point is that all the differences among the samples can be accommodated through powerful statistical techniques. This response does not climb above childishness.
I have “attacked” the soft underbelly of climate science by using Mann as my example. But the same is true throughout climate science. Climate scientists sometimes talk about heat content and sometimes talk about temperature measurement and they make the egregious assumption that the two are interdefinable. Climate scientists sometimes talk about temperatures taken in the atmosphere and sometimes about temperature taken on the surface or beneath the surface of the oceans. To summarize, climate scientists are the only so-called hard scientists now practicing who are quite willing to take any two series of measurements and assume that they are comparable. The worst offenders are the “paleoclimatologists” who are quite willing to take 50 existing “paleo” series and treat them as comparable while giving no thought to the actual events of measurement that occurred at some time in some readily observable and definable environment. Most climate scientists have no idea what their event space is.

Jeff L

This post & discussion highlights one thing for sure – climate science is inherently statistical (which we all knew) but also it makes you want to have a statistician as co-author for any publication, given the complexities of the statistical analysis needed to come up with a credible conclusion.

While Bayes’s theorem is correct, Bayesian parameter estimation generally suffers from the lack of uniqueness of the prior probability density function with consequent violation of Aristotle’s law of non-contradiction. In particular, the method by which climatologists extract numerical values for the equilibrium climate sensitivity (TECS) from global temperature data violates non-contradiction.
There is an exception to the rule that Bayesian methods violate non-contradiction. It occurs in the circumstance that a sequence of events, for example a sequence of coin flips, underlies the model. In this case, the so-called “uninformative” prior probability density function is unique and non-contradiction is not violated.
However, no events underlie the IPCC climate models. In setting up this state of affairs, climatologists have ensured that the conclusions of their arguments violate non-contradiction. Rather than being one value for TECS, for example, values are of infinite number.

Chris

@Vincent
Thanks for reminding me of Number Watch and John Brignell.
The answer is Yes!
@All of You
Please check in on the Warsaw meeting.
http://www.worldcoal.org/

The other Phil

rgb
Nice post

Lies, damned lies – and statistics 🙂
Mr. Berple. Are you not confusing probability with statistics?

Huemul

@ polistra says:
November 12, 2013 at 4:28 am
“Stats can’t be saved by tweaking. The only proper standard for science is NO STATS. If a result has to be reached by statistics, it’s not a scientific result. Only well-calibrated measurements plus a well-balanced experiment can give scientific results.”
In fact, a number of scientific disciplines are observational and are unable to reach direct results by experiment. Major examples: astronomy, paleontology, evolutionary biology. Some of their findings are “consistent with” some experiments done in another context, but most of their findings are not reached through experimental methods. Of course, observational sciences still require careful measurements, and thus the other condition set by Polistra (well-calibrated measurements) generally applies, although one has to be careful about the basis for calibration (cf. climate models calibrated to the rapidly warming 1970-2000 period).

lemiere jacques

And what about a sentence like “we have 95% confidence that a fact is true”…
so if the fact is not true, we are not wrong but unlucky….
What about a sentence like “extreme events will be more extreme… we can say that… with low confidence”… I can say with low confidence that it will not rain tomorrow…
they can never be wrong…

DirkH

Mike Maguire says:
November 12, 2013 at 7:55 am
“Why Most Published Research Findings Are False ”
See also Sturgeon’s law
http://en.wikipedia.org/wiki/Sturgeon%27s_law

From ‘Revised Standards for Statistical Evidence’ by Valen E. Johnson published in PNAS,
“Finally, it is important to note that this high rate of nonreproducibility is not the result of scientific misconduct, publication bias, file drawer biases, or flawed statistical designs; it is simply the consequence of using evidence thresholds that do not represent sufficiently strong evidence in favor of hypothesized effects.”

– – – – – – –
That statement ignores the situation where the researcher(s) choose consciously and intentionally to use “evidence thresholds that do not represent sufficiently strong evidence in favor of hypothesized effects” and then obfuscate that they have done so.
John

Where is the damn like button, +, whatever on the comments. There is a nice list above of people who get it. I was watching a show about angels a long time ago. A person came on and said, “I fell through the ice, an angel appeared and said, ‘Go this way’, and I followed her advice and found the hole to get back out!”
My wife said … “What about all the people who fell through the ice and saw the angel and followed their directions and didn’t get out!”
This exactly corresponds to the “One tree in 1000 matches temperature better!”
This is why you look at the absolute value of data and do not get too enamored of Anomaly Data. If the anomaly data lets you make accurate predictions on the unknown, awesome. Keep that absolute chart posted next to you, though, to remind you of the underlying magnitudes.
The inverse of Risk Ratio is Survivability Factor. Even with smoking, the survivability of smoking and lung cancer is 92% over 60 years. There is a 92% chance you won’t get lung cancer.

Robert

This is one of the reasons I abandoned a PhD in Sociology, a discipline that regularly works at the p=0.05 or even p=0.1 level. Very little was reproducible, and hardly anyone even tried.

Jquip

“This approach (really, no approach) is going to be particularly reliable with only five or six flips because …” — rgbatduke @6:45
rgb wrote a great bit on this, but it is distinctly different from what the paper is proposing. The paper’s stated problem, and its proposed solution, is to singly publish more papers that are reproducible.
Without reproducing them.
This is absurd in the first order, for if you make one mistake, one deceit, or one oversight, one bad choice with outliers and otherwise cleaning your dataset, with one paper: You have flipped the coin exactly once. To flip the coin five or six times you need to repeat the exact experiment of the paper, another four or five times.
Bayesian notions are pretty clever and terribly interesting: And point out exactly why experiments that are not replicated are nonsense. But this paper is precisely about non-replication. Faith, Trust, whatever you like. But it’s a misplaced solution.

Thank you, bwanajohn, for posting a link to that excellent article in The Economist. I’ve added both it and the Johnson paper to my collection of such articles, on my web site, here:
http://www.sealevel.info/papers.html#whitherscience

TheLastDemocrat

Great issue. A probability estimate, a “p” level, is derived internally – based on the data of the study, only. It therefore cannot be any sort of statement about the world outside of that data set.
A p value only helps contribute to the evaluation of how convincing the study outcome is.
The problem is that some people value it for more than that.
Bradford Hill, instrumental in fingering tobacco smoking as a cause of lung cancer, wrote the well-recognized 1965 paper “Association or Causation?”, in which he gives his criteria for helping assess observational data. “Statistical likelihood” is not in that list of nine aspects, and is briefly mentioned and put well in its place toward the end of the article.

DirkH

Robert says:
November 12, 2013 at 9:17 am
“This is one of the reasons I abandoned a PhD in Sociology, a discipline that regularly works at the p=0.05 or even p=0.1 level. Very little was reproducible, and hardly anyone even tried.”
Find on youtube Harald Eia and his series Hjernevask or Brainwash. He’s a Norwegian sociologist and comedian and has fun confronting his Norwegian sociologist colleagues with inconvenient facts.

Gail Combs

tumetuestumefaisdubien1 says: @ November 12, 2013 at 7:36 am
I don’t believe raising thresholds much helps and I don’t believe 0.005 or 0.001 levels are needed for avoiding bad science.
In fact such a conclusion that raising thresholds to such levels will mean better science looks to me more insane than logical….
>>>>>>>>>>>>>>>>>>>>>
You should be doing what we did in QC. (This depends of course on the cost of the item and the cost of the tests.)
Taking ferd berple’s example of drug testing.
The sample from each person gets split before testing. Part of the sample is used for testing; the rest is retained. You know that of the ten who tested positive, approximately five may be false positives. You then use the more expensive but more accurate test to retest the retains for all ten to confirm the positives. (Splitting the sample is what is done for drug testing truck drivers.)
In science, the equivalent is independent replication, usually by at least two independent labs. If it cannot be replicated, it gets tossed into the dustbin of history.
I see no reason to change this, but confirmation testing by an independent lab or two or three is an absolute must, and this is what is missing in Climate Science.
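
Gail Combs’ two-stage scheme is easy to quantify with the same toy numbers ferd berple used; the 1% error rate assumed for the confirmatory test is my own illustrative figure:

# A 95%-reliable screen followed by an independent, more accurate confirmation
# run only on the positives. The 1% confirmation error rate is assumed.
users, non_users = 5, 95
screen_err, confirm_err = 0.05, 0.01   # false-positive/false-negative rates

screened_true = users * (1 - screen_err)          # users who pass the screen
screened_false = non_users * screen_err           # non-users flagged by the screen
confirmed_true = screened_true * (1 - confirm_err)
confirmed_false = screened_false * confirm_err    # flagged twice in error

print(f"P(user | positive screen)       = {screened_true / (screened_true + screened_false):.2f}")
print(f"P(user | screen + confirmation) = {confirmed_true / (confirmed_true + confirmed_false):.3f}")

The second, independent test multiplies down the false positives rather than merely trimming them, which is the replication logic the comment is describing.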

Kev-in-Uk

I think in climate science (and, to be fair, quite often in other science areas!) it would be fair to say that many researchers use the ‘P’ value simply because, without it, the casual reader/interpreter would realise that the research doesn’t pass the ‘sniff’ test. In other words, it’s like an imaginary forcefield, protecting/deflecting the actual research from external attack (hmm, is there a cartoon in there somewhere?).
Anyway, for some as yet unknown reason, many folk are incapable of ‘reverse analysis’ from statistical ‘results’, especially when reading data from ‘surveys’.
As for P values in climate science (which, in such a large, complex and chaotic system, cannot actually be adequately defined) – I think it can be adequately summed up in the simple phrase:
‘They are taking the P…’