Statistical flaws in science: p-values and false positives

To make science better, watch out for statistical flaws

by Tom Siegfried

First of two parts

As Winston Churchill once said about democracy, it’s the worst form of government, except for all the others. Science is like that. As commonly practiced today, science is a terrible way to gather knowledge about nature, especially in messy realms like medicine. But it would be very unwise to vote science out of office, because all the other methods are so much worse.

Still, science has room for improvement, as its many critics are constantly pointing out. Some of those critics are, of course, lunatics who simply prefer not to believe solid scientific evidence if they dislike its implications. But many critics of science have the goal of making the scientific enterprise better, stronger and more reliable. They are justified in pointing out that scientific methodology — in particular, statistical techniques for testing hypotheses — has more flaws than Facebook’s privacy policies. One especially damning analysis, published in 2005, claimed to have proved that more than half of published scientific conclusions were actually false.

A few months ago, though, some defenders of the scientific faith produced a new study claiming otherwise. Their survey of five major medical journals indicated a false discovery rate among published papers of only 14 percent. “Our analysis suggests that the medical literature remains a reliable record of scientific progress,” Leah Jager of the U.S. Naval Academy and Jeffrey Leek of Johns Hopkins University wrote in the journal Biostatistics.

Their finding is based on an examination of P values, the probability of getting a positive result if there is no real effect (an assumption called the null hypothesis). By convention, if the results you get (or more extreme results) would occur less than 5 percent of the time by chance (P value less than .05), then your finding is “statistically significant.” Therefore you can reject the assumption that there was no effect, conclude you have found a true effect and get your paper published.
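To make the convention concrete, here is a minimal sketch in Python. The data are simulated under the null hypothesis; nothing in it comes from Jager and Leek’s paper, and the group sizes are illustrative assumptions.

```python
# Minimal sketch of the P value convention described above. Two groups are
# drawn from the SAME distribution (the null hypothesis is true by
# construction), yet roughly 1 run in 20 still comes out "significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
false_positives = 0
n_experiments = 10_000

for _ in range(n_experiments):
    treatment = rng.normal(0.0, 1.0, 30)   # no real effect in either group
    control = rng.normal(0.0, 1.0, 30)
    _, p = stats.ttest_ind(treatment, control)
    if p < 0.05:
        false_positives += 1

print(f"'significant' results with no real effect: "
      f"{false_positives / n_experiments:.1%}")   # about 5%, by construction
```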

As Jager and Leek acknowledge, though, this method has well-documented flaws. “There are serious problems with interpreting individual P values as evidence for the truth of the null hypothesis,” they wrote.

For one thing, a 5 percent significance level isn’t a very stringent test. Using that rate you could imagine getting one wrong result for every 20 studies, and with thousands of scientific studies going on, that adds up to a lot. But it’s even worse. If there actually is no real effect in most experiments, you’ll reach a wrong conclusion far more than 5 percent of the time. Suppose you test 100 drugs for a given disease, when only one actually works. Using a P value of .05, those 100 tests could give you six positive results — the one correct drug and five flukes. More than 80 percent of your supposed results would be false.
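The arithmetic of that 100-drug example can be sketched directly, assuming (for simplicity, as the article implicitly does) perfect power to detect the one real drug:

```python
# Sketch of the 100-drug example above: one effective drug, a 5% false
# positive rate, and the simplifying assumption that the real effect is
# always detected. Most "positive" findings are then flukes.
n_drugs = 100
n_effective = 1
alpha = 0.05

false_positives = (n_drugs - n_effective) * alpha   # 99 * 0.05, about 5 flukes
true_positives = n_effective                        # assume the real one is found
total_positives = false_positives + true_positives

print(f"expected positives: {total_positives:.0f}")                      # ~6
print(f"share that are false: {false_positives / total_positives:.0%}")  # ~83%
```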

Read more here: https://www.sciencenews.org/blog/context/make-science-better-watch-out-statistical-flaws

Comments

February 7, 2014 11:13 pm

Steven Mosher says: February 7, 2014 at 7:17 pm
“just banish the nonsense of a result being statistically significant. There is no such thing. It’s a bogus tradition.”
I sort of agree. I wish to stress that this type of uncertainty expression is about the spread of values about a mean of some sort: precision, if you like.
I am more concerned about whether the mean is in the right place than about the scatter around it being 95% enclosed by a certain curve. That is bias.
My thinking is that when you compare a number of temperature data sets with adjustments, and there is an envelope around the various adjustments of say +/- 1 deg C, it is rather immaterial to concentrate on precision, because the imprecision just adds on top of the bias, which can often be the larger of the two.
Like this graph from Alice Springs in the centre of Australia – I guess I should update it now we have BEST and CRUTEM4 and even Google.
http://www.geoffstuff.com/Spaghetti_Alice_JPG.jpg
Of course, these concepts are as old as time, but it’s remarkable how, in climate work, the bias aspect is so seldom considered properly, if at all. One gets the impression that ye olde thermometers could be read to 0.05 degrees, when 1 degree was more like it.
Or that Argo floats are accurate to 0.004 deg C as I recall. Utter BS.
But then, you’d broadly agree with me, I suspect.

richardscourtney
February 7, 2014 11:15 pm

Nick Stokes:
At February 7, 2014 at 6:30 pm

“There are serious problems with interpreting individual P values as evidence for the truth of the null hypothesis”

Yes, that’s elementary. P<0.05 can reject the null hypothesis. But otherwise the test fails. You can’t deduce the null hypothesis is true. You just don’t know.
That’s why all this talk of “statistically significant warming” is misconceived. You can test whether a trend is significantly different from zero, and maybe deduce something if it is. But if it isn’t, your test failed to reject. No result.

NO!
That is warmist sophistry which pretends the ‘pause’ is not happening.

A linear ‘trend’ can be computed from any data time series. At issue here is whether the trend in global average surface temperature anomaly (GASTA) differs from zero (i.e. no discernible global warming or cooling) and – if so – for how long before the present.
Climastrology uses linear trends and 95% confidence. There are good reasons to dispute each of these conventions, but they are the conventions used by climastrology so they are the appropriate conventions in this case.
So, in this case the null hypothesis is that a linear trend in GASTA does not differ from zero at 95% confidence and, therefore, there is no discernible warming. And the period to be determined of no discernible global warming or cooling is up to the present. Therefore, the end point is now and the data is assessed back in time until a linear trend over the period differs from zero at 95% confidence.
Each of the several time series of GASTA indicates no trend which differs from zero (i.e. no global warming or cooling) for at least 17 years until now; RSS indicates 24.5 years.
And it is not reasonable to remove data from the data set(s). 1998 had a high value, and there is no possibility of justifying its removal from the data set, whatever the cause of that high value. This is because the assessment is of how long there has been no discernible warming or cooling, and any distortion of the analysed data produces a distortion of the result of the analysis.
Importantly, 17 years takes us back to 1997 and there was statistically significant warming over the previous 17 years. Therefore, discernible global warming stopped at least 17 years ago.
Richard
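A minimal sketch of the procedure described in the comment above (fix the endpoint at the present and extend the window back until the trend differs from zero at 95% confidence). The series below is synthetic, standing in for any real GASTA data set, and the confidence interval assumes independent errors, which real temperature series do not have.

```python
# Hedged sketch of the "how far back is the trend indistinguishable from
# zero?" procedure. Synthetic monthly anomalies: warming for 20 years, then
# a plateau, plus noise. Only the bookkeeping is illustrated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_months = 420                               # 35 years, made up
t = np.arange(n_months)
series = np.where(t < 240, 0.0015 * t, 0.36) + rng.normal(0, 0.1, n_months)

years_without_significant_trend = 0.0
for start in range(n_months - 24, -1, -12):  # step the start back a year at a time
    window = series[start:]
    res = stats.linregress(np.arange(window.size), window)
    # 95% CI half-width for the slope (assumes independent errors;
    # autocorrelation in real data widens this considerably)
    ci_half = stats.t.ppf(0.975, window.size - 2) * res.stderr
    if abs(res.slope) > ci_half:             # trend differs from zero at 95%
        break
    years_without_significant_trend = (n_months - start) / 12

print(f"no trend distinguishable from zero for "
      f"~{years_without_significant_trend:.0f} years before the endpoint")
```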

Toto
February 7, 2014 11:25 pm

Nobody has said the magic word (‘model’). You know that p-value thing? You know how it is calculated? Using a model, which in some cases is only an assumption.
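A small illustration of Toto’s point, assuming nothing beyond textbook tests: the same skewed, made-up sample yields different P values depending on the model behind the test.

```python
# Sketch of how the P value depends on the model used to compute it. The
# same skewed sample is tested against a zero mean two ways: a one-sample
# t-test (which assumes approximate normality) and a sign-flip permutation
# test (which assumes symmetry instead). The data are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.exponential(1.0, 15) - 0.8      # small, skewed sample

_, p_ttest = stats.ttest_1samp(data, 0.0)

observed = data.mean()
flips = rng.choice([-1, 1], size=(20_000, data.size))
perm_means = (flips * data).mean(axis=1)
p_perm = np.mean(np.abs(perm_means) >= abs(observed))

print(f"t-test P value:      {p_ttest:.3f}")
print(f"permutation P value: {p_perm:.3f}")   # often noticeably different
```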

Admin
February 7, 2014 11:35 pm

The father of modern statistics was an ardent eugenics catastrophist – he developed the field of statistics to find mathematical support for his passion.
http://en.wikipedia.org/wiki/Ronald_Fisher
I am not suggesting that statistics is useless because its origins are tainted. What I am suggesting is that if someone with the genius to invent an entire mathematical discipline can be fooled by his own invention, then anyone can get it wrong.

craig
February 7, 2014 11:41 pm

Hence, one of the principal things I do when I’m detailing doctors is to make the doctor aware of the statistical (or non-statistical) significance of a value, and then raise the more important and most relevant part of the discussion: is the value change from the placebo or active-ingredient arm CLINICALLY MEANINGFUL? Clinical meaningfulness of a number is a more practical way to understand a drug’s effect on a subject.

Mindert Eiting
February 8, 2014 12:17 am

Eric Worrall: ‘The father of modern statistics was an ardent eugenics catastrophist’.
Consider the historical context. The inconvenient truth is that many of his contemporaries were eugenicists. Equally embarrassing is the fact that many people in those days were anti-Semites. Perhaps the most difficult thing, even for geniuses, is to check assumptions and to think about consequences.

JDN
February 8, 2014 12:32 am

Some points I’ve been thinking about:
1) The people rubbishing medical research don’t release their data. There’s no way to confirm what you’ve been hearing. It’s basically hearsay.
2) Medical journals are the worst for publishing & executing methods. The statistical tests are the least of the problems. Why does this keep coming up? We learn to dismiss articles based on faulty methods or experimental construction, no matter what the statistical significance.
3) People keep trying to examine the outcome of medical research based on whether drugs that work in the lab work in the clinic. This doesn’t measure the search for knowledge, it measures the search for financial success. There’s something to be said for knowledge so reliable you can take it to the bank. However, clinical trials can fail for reasons that have nothing to do with the reliability of scientific knowledge. These exercises looking at the monetization of science are a waste of time. Everything is worthless until it’s not. If you perform an evaluation of the evaluators as in Science mag this week (paywalled unfortunately, http://www.sciencemag.org/content/343/6171/596), you’ll find out that these evaluations are not worth much.

Greg Goodman
February 8, 2014 12:36 am

” Therefore you can reject the assumption that there was no effect, conclude you have found a true effect and get your paper published.”
The p-value is not what decides whether a paper gets published. A negative result is still a valid scientific result. But this raises the other problem of the ‘literature’: publication bias. Only publishing positive results falsifies the record as well.
A recent case was Tamiflu, a supposed protection against certain strains of influenza. Tony Blair, then prime minister, invested something like 4 billion pounds sterling in a stockpile of treatments in case an epidemic of bird flu struck the UK.
It has recently been found that about half the studies on the drug were negative, but they did not get published.
Good article BTW, thanks.

Carlo Napolitano
February 8, 2014 12:41 am

Well, I think it is not correct to reduce scientific methodology to statistics. As others have pointed out in this discussion, statistics is useful for cutting through the noise. However, no statistics can replace a mechanistic explanation. To take an example familiar from my field of work: genetic association studies (those exploring the association of common genetic variants in the population with some clinical phenotype) often obtain P values below 10^-5 or even 10^-8. However, only when the finding is biologically explained (with functional experiments) can one claim that the discovery is a scientific advancement. Otherwise you can claim the association, but not that the association has any biological meaning.
I think this should also happen in climate science. Perhaps funding agencies should invest their money better, sponsoring studies aimed at understanding the physics underlying the observed phenomena rather than thousands of useless studies finding statistical associations and then building theories on what they find statistically significant. Science based on statistics actually reverses what scientific methodology should be; it is really the prototype of a fishing expedition (you get something but you don’t know why).
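A minimal sketch of why association studies demand such tiny thresholds; the test count of one million is an illustrative assumption, not a figure from the comment.

```python
# Why genome-wide association studies use thresholds near 5e-8: with about
# 1,000,000 tests, a per-test alpha of 0.05 would yield tens of thousands
# of false positives under the null. Bonferroni correction divides alpha
# by the number of tests.
n_tests = 1_000_000
alpha = 0.05

expected_false_positives = n_tests * alpha     # ~50,000 flukes at alpha=0.05
bonferroni_threshold = alpha / n_tests         # 5e-8, the usual GWAS cutoff

print(f"expected false positives at alpha=0.05: {expected_false_positives:,.0f}")
print(f"Bonferroni-corrected threshold: {bonferroni_threshold:.0e}")
```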

Nick Stokes
February 8, 2014 1:01 am

richardscourtney says: February 7, 2014 at 11:15 pm
“So, in this case the null hypothesis is that a linear trend in GASTA does not differ from zero at 95% confidence and, therefore, there is no discernible warming.”
No, that’s nonsense, and putting in bold doesn’t improve it. Where I am, it was 41°C today. Was that statistically significant? Well, maybe not; it’s late summer. But it was still discernibly warm.
You have tested whether the observed trend could have happened with an underlying zero trend and natural variation. And the answer is that that can’t be rejected. But it is not the only possible explanation.
The UAH index shows a trend of 1.212°C/century since Jan 1996. That’s not quite significant relative to zero, so we can’t rule out an underlying zero trend. But we also can’t rule out the upper limit of 2.44°C/century (or anything in between). In fact 2.44 is as likely as 0, and that would be highly discernible warming. Indeed, the observed 1.212°C/century is considerable.
What we know is that the measured trend was 1.212°C/century. That’s what actually happened, and it is discernible. The rest is theorising about what might have happened if we could run it again.
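A minimal sketch of the trend-plus-confidence-interval calculation Stokes describes, using synthetic monthly anomalies in place of the actual UAH data; the numbers produced are not his.

```python
# Hedged sketch: OLS trend and its 95% confidence interval for a monthly
# anomaly series. The series is synthetic (small trend plus noise); it
# stands in for real satellite data, which is not bundled here.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
months = np.arange(216)                      # 18 years of monthly data
anomalies = 0.001 * months + rng.normal(0, 0.15, months.size)

res = stats.linregress(months, anomalies)
trend_per_century = res.slope * 12 * 100     # deg C/month -> deg C/century

# 95% CI from the slope's standard error (assumes independent errors;
# autocorrelated temperature data would widen this interval)
t_crit = stats.t.ppf(0.975, months.size - 2)
half_width = t_crit * res.stderr * 12 * 100

print(f"trend = {trend_per_century:.2f} +/- {half_width:.2f} deg C/century")
# If the interval straddles zero, a zero underlying trend cannot be
# rejected; but neither can the upper limit. That is exactly Stokes' point.
```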

Greg Goodman
February 8, 2014 1:02 am

Carlo Napolitano, I agree. Statistical testing is a good safeguard, but it is not the be-all and end-all of scientific analysis. Too much of what is used in climate science has been inherited from econometrics rather than the physical sciences, and it is often rather poorly understood at that.

Ed Zuiderwijk
February 8, 2014 1:45 am

There is another way of judging the veracity of published results: look after, say, ten years and see how many times a result is referenced. If it isn’t, either the result wasn’t important, or nobody believes it anymore, or it has been superseded by later results.
They did such a sobering analysis some decades ago with astronomy/astrophysics papers and found that only a few percent survive the ravages of time.

February 8, 2014 2:09 am

John Brignell put the problems of statistical significance into layman’s language several years ago. His website, www.numberwatch.co.uk, is worth spending an hour on, and his book “Sorry, Wrong Number!” is excellent. One of his constant points is that in epidemiology a relative risk of at least 2 (a doubling of the effect) should be seen before the result is taken as important, owing to the number of conflicting factors in any open system (doesn’t it sound like AGW?).
Here are a few relevant pages from the website:
http://www.numberwatch.co.uk/statistical_bludgeon.htm
http://www.numberwatch.co.uk/Stuff.htm
http://www.numberwatch.co.uk/an_exercise_in_critical_reading_.htm
He also has several essays on the ridiculousness of AGW and a review of Watermelons :).

John Shade
February 8, 2014 2:31 am

Hypothesis testing gets a hard time every now and then, from those who think the p-value is the probability that the alternative hypothesis is wrong, or who think that such testing provides proof or disproof of some kind. It does neither. It is merely a means of assessing the strength of evidence in a particular data set, considered in isolation. In general, the null hypothesis is for ‘no effect’, e.g. that the hair on the left side of your head has the same mean diameter as that on the right. We know that is not true; generally we know the null hypothesis is not true. We are not trying to prove or disprove it. All we are doing is going through a ritual whereby we ask: if the null were true (and other conditions deemed applicable hold), what is the probability of getting some statistic as or more extreme than the one computed for this particular set of data? That’s it. A small p does not mean the null is false, and a large p does not mean that it is true. The test is making a far more modest contribution than that.

basicstats
February 8, 2014 2:33 am

It’s surprising the Ioannidis paper has created such a stir, since it basically just says that a small probability of error in an individual experiment/study, when compounded over thousands of different experiments/studies, results in a much larger probability of error. Or: the probability of at least one head in 1000 spins of a coin greatly exceeds the probability of a head on one spin. Pretty obvious, although pinning down the precise probability of error/false positives over thousands of very different kinds of studies is definitely a hard problem.
The relevance to climatology lies in the proliferation of different measures global warmers are coming up with – sea levels, ice volumes, ocean heat content etc etc. Keep data mining and you will find something still going up steadily! Especially as at least some of these are probably correlated to global average temperature anomaly with a time lag. Not to forget there are half a dozen such global anomalies to begin with.
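The compounding basicstats describes is a one-liner, assuming independent tests at a per-test alpha of 0.05:

```python
# How a 5% per-test false-positive rate compounds across many independent
# looks at the data: P(at least one false positive) = 1 - (1 - alpha)^n.
alpha = 0.05
for n in (1, 10, 20, 100):
    p_at_least_one = 1 - (1 - alpha) ** n
    print(f"{n:>4} tests: P(at least one false positive) = {p_at_least_one:.1%}")
# 1 test: 5.0%; 20 tests: ~64%; 100 tests: ~99.4%.
# Keep data mining and you will find "something", as the comment says.
```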

richardscourtney
February 8, 2014 2:33 am

Nick Stokes:
I am replying to your post at February 8, 2014 at 1:01 am, which was in reply to my post at February 7, 2014 at 11:15 pm.
In my post I rightly said of your assertion

That’s why all this talk of “statistically significant warming” is misconceived. You can test whether a trend is significantly different from zero, and maybe deduce something if it is. But if it isn’t, your test failed to reject. No result.

NO!
That is warmist sophistry which pretends the ‘pause’ is not happening.

I explained

Climastrology uses linear trends and 95% confidence. There are good reasons to dispute each of these conventions, but they are the conventions used by climastrology so they are the appropriate conventions in this case.

Those conventions were used by climastrology to claim there was global warming. What matters is to use THOSE SAME conventions when assessing the ‘pause’. And it is sophistry to say that different conventions should be used when the result does not fit an agenda.
I stated that “There are good reasons to dispute each of these conventions” but, so what? The only pertinent fact is that those are the conventions used by climastrology. It is ‘moving the goal posts’ to now say those conventions should not be used because they are wrong.
Your reply which I am answering says

You have tested whether the observed trend could have happened with an underlying zero trend and natural variation. And the answer is that that can’t be rejected. But lt is not the only possible explanation.

That is more sophistry!
Whatever the cause of the ‘pause’ may be, it is not pertinent to a determination of the existence of the pause.

The same conventions of climastrology used to determine that there was global warming were used to determine the start of the ‘pause’. And the conclusion of that analysis is as I said

Each of the several time series of GASTA indicates no trend which differs from zero (i.e. no global warming or cooling) for at least 17 years until now; RSS indicates 24.5 years.

and

Importantly, 17 years takes us back to 1997 and there was statistically significant warming over the previous 17 years. Therefore, discernible global warming stopped at least 17 years ago.

The conventions adopted by climastrology may be mistaken (I think they are) but it is not “science” to choose when and when not to use conventions depending on the desired result.
Richard

February 8, 2014 2:44 am

From the article linked in Tom Siegfried’s essay-
“Others proposed similar methods but with different interpretations for the P value. Fisher said a low P value merely means that you should reject the null hypothesis; it does not actually tell you how likely the null hypothesis is to be correct. Others interpreted the P value as the likelihood of a false positive: concluding an effect is real when it actually isn’t. ”
It seems that Tom Siegfried and many other commenters on this thread, such as Nick Stokes and Steven Mosher, have made the same misinterpretation of what Fisher’s p value actually is, just as is alluded to in that article.
Alpha values are what determine Type 1 errors, or false positives, per Neyman–Pearson. Fisher’s p values are about whether to reject the null hypothesis, not about Type 1 and 2 errors, as Tom Siegfried suggests.
What Leonard Lane says at February 7, 2014 at 9:51 pm is spot on, if he means consult a statistician using Bayesian methods.
I am interested in seeing if Tom Siegfried figures out what a p value actually is before he writes part 2 of his essay.

Admin
February 8, 2014 2:58 am

Mindert Eiting
Eric Worrall: ‘The father of modern statistics was an ardent eugenics catastrophist’.
Consider the historical context. The inconvenient truth is that many of his contemporaries were eugenicists. Equally embarrassing is the fact that many people in those days were anti-Semites. Perhaps the most difficult thing, even for geniuses, is to check assumptions and to think about consequences.

A historical example of GIGO – the statistical techniques were well applied, but the data and assumptions were rubbish.
Fast forward to the present day, and the climate “geniuses” can’t even get the statistics right.

Admin
February 8, 2014 3:01 am

richardscourtney, everyone should try the SkS trend calculator, but instead of using the latest figures, feed in 30 year time periods before and after the 1940 – 1970 cooling.
Whatever Foster and Rahmstorf’s method is calculating, it is not a reliable guide as to whether the world is experiencing a downturn in global temperatures.

Nick Stokes
February 8, 2014 3:08 am

richardscourtney says: February 8, 2014 at 2:33 am
“Importantly, 17 years takes us back to 1997 and there was statistically significant warming over the previous 17 years. Therefore, discernible global warming stopped at least 17 years ago.”
Well, that makes absolutely no sense, despite the bold face. Yes, trend from 1980 to 1997 was significantly different from zero. So was the trend from Jan 1995 to Dec 2012. Does that mean discernible global warming stopped a year ago?

David L
February 8, 2014 3:20 am

A p-value only gives confidence in rejecting the null hypothesis; it is not proof of an effect. You can propose an alternative hypothesis and test for that as well.
In clinical studies a p-value of 0.01 is typically used, but, more important, studies have to be properly powered beforehand, and the results have to either agree or disagree with the baseline measurements within their prior agreed-upon confidence intervals.
If AGW research followed the rules required of pharmaceutical research, the entire dogma would have been rejected by the FDA years ago.
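A minimal sketch of the kind of power calculation David L means, with an assumed (purely illustrative) effect size of Cohen’s d = 0.5:

```python
# Sketch of "properly powered beforehand": the sample size per arm needed
# to detect an assumed effect at alpha = 0.01 with 80% power. The effect
# size here is an illustrative assumption, not from any real trial.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_arm = analysis.solve_power(effect_size=0.5, alpha=0.01, power=0.80)
print(f"required sample size per arm: {n_per_arm:.0f}")
```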

February 8, 2014 3:21 am

Statistician William M. Briggs wrote:

The problem with statistics is the astonishing amount of magical thinking tolerated. A statistician—or his apprentice; this means you—waving a formula over a dataset is little different than an alchemist trying his luck with a philosopher’s stone and a pile of lead. That gold sometimes emerges says more about your efforts than it does about the mystical incantations wielded.
Statistics, which is to say probability, is supposed to be about uncertainty. You would think, then, that the goal of the procedures developed would be to quantify uncertainty to the best extent possible in the matters of interest to most people. You would be wrong. Instead, statistics answers questions nobody asked. Why? Because of mathematical slickness and convenience, mostly.
The result is a plague, an epidemic, a riot of over-certainty. This means you, too. Even if you’re using the newest of the new algorithms, even if you have “big” data, even if you call your statisticians “data scientists”, and even if you are pure of heart and really, really care.

More at the link, in a very good essay: http://wmbriggs.com/blog/?p=11305
By the way, Briggs has written extensively about the problem of people misusing statistics. His blog site is a treasure trove of wonderful essays on the issue.

richardscourtney
February 8, 2014 3:26 am

Nick Stokes:
Your post at February 8, 2014 at 3:08 am is yet more of your sophistry.
My post addressed to you at February 7, 2014 at 11:15 pm explained the derivation of my statement saying

Importantly, 17 years takes us back to 1997 and there was statistically significant warming over the previous 17 years. Therefore, discernible global warming stopped at least 17 years ago.

But you ignore that and introduce a Red Herring by saying

Well, that makes absolutely no sense, despite the bold face. Yes, trend from 1980 to 1997 was significantly different from zero. So was the trend from Jan 1995 to Dec 2012. Does that mean discernible global warming stopped a year ago?

That is complete nonsense!
As I said in my post at February 8, 2014 at 2:33 am which you claim to be replying

The conventions adopted by climastrology may be mistaken (I think they are) but it is not “science” to choose when and when not to use conventions depending on the desired result.

Richard

February 8, 2014 3:29 am

Oops, I messed up that last. The link is indeed to the rest of the essay quoted from, but the “very good” essay I wanted to point out is the one before it, and the link is: http://wmbriggs.com/blog/?p=11261
It would be nice to be able to edit, but WordPress says that could lead to problems. They are most likely correct. 🙁