Statistical flaws in science: p-values and false positives

To make science better, watch out for statistical flaws

by Tom Siegfried

First of two parts

As Winston Churchill once said about democracy, it’s the worst form of government, except for all the others. Science is like that. As commonly practiced today, science is a terrible way of gathering knowledge about nature, especially in messy realms like medicine. But it would be very unwise to vote science out of office, because all the other methods are so much worse.

Still, science has room for improvement, as its many critics are constantly pointing out. Some of those critics are, of course, lunatics who simply prefer not to believe solid scientific evidence if they dislike its implications. But many critics of science have the goal of making the scientific enterprise better, stronger and more reliable. They are justified in pointing out that scientific methodology — in particular, statistical techniques for testing hypotheses — has more flaws than Facebook’s privacy policies. One especially damning analysis, published in 2005, claimed to have proved that more than half of published scientific conclusions were actually false.

A few months ago, though, some defenders of the scientific faith produced a new study claiming otherwise. Their survey of five major medical journals indicated a false discovery rate among published papers of only 14 percent. “Our analysis suggests that the medical literature remains a reliable record of scientific progress,” Leah Jager of the U.S. Naval Academy and Jeffrey Leek of Johns Hopkins University wrote in the journal Biostatistics.

Their finding is based on an examination of P values, the probability of getting a positive result if there is no real effect (an assumption called the null hypothesis). By convention, if the results you get (or more extreme results) would occur less than 5 percent of the time by chance (P value less than .05), then your finding is “statistically significant.” Therefore you can reject the assumption that there was no effect, conclude you have found a true effect and get your paper published.
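A minimal sketch of that convention in Python, with invented numbers rather than data from any real study:

```python
# Hypothetical two-group comparison: does a treatment shift the outcome?
# Numbers are invented for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=0.0, scale=1.0, size=50)  # null world: no effect
treated = rng.normal(loc=0.4, scale=1.0, size=50)  # a small real effect

t_stat, p_value = stats.ttest_ind(treated, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# The convention described above: p < .05 means "statistically significant",
# so you reject the null hypothesis and your paper gets published.
print("significant" if p_value < 0.05 else "not significant")
```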

As Jager and Leek acknowledge, though, this method has well-documented flaws. “There are serious problems with interpreting individual P values as evidence for the truth of the null hypothesis,” they wrote.

For one thing, a 5 percent significance level isn’t a very stringent test. Using that rate you could imagine getting one wrong result for every 20 studies, and with thousands of scientific studies going on, that adds up to a lot. But it’s even worse. If there actually is no real effect in most experiments, you’ll reach a wrong conclusion far more than 5 percent of the time. Suppose you test 100 drugs for a given disease, when only one actually works. Using a P value of .05, those 100 tests could give you six positive results — the one correct drug and five flukes. More than 80 percent of your supposed results would be false.
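The arithmetic of that 100-drug example, sketched in Python (assuming, for simplicity, that the test always detects the one real drug):

```python
# 100 candidate drugs, only 1 of which truly works, each tested at p < .05.
n_drugs = 100
n_real  = 1
alpha   = 0.05  # false-positive rate per test
power   = 1.0   # simplifying assumption: the real drug is always detected

false_pos = (n_drugs - n_real) * alpha  # ~5 flukes among the 99 duds
true_pos  = n_real * power              # the 1 real drug
total_pos = false_pos + true_pos

print(f"expected positive results: {total_pos:.1f}")            # ~6
print(f"fraction that are false:   {false_pos/total_pos:.0%}")  # ~83%
```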

read more here:  https://www.sciencenews.org/blog/context/make-science-better-watch-out-statistical-flaws

ferdberple
February 8, 2014 11:10 am

richardscourtney says:
February 8, 2014 at 2:33 am
The conventions adopted by climastrology may be mistaken (I think they are) but it is not “science” to choose when and when not to use conventions depending on the desired result.
=============
changing your methods is cherry picking. If you apply 20 different methods to analyze the data, the odds are that 1 method will deliver a false positive at the 95% confidence level. If you then report only this 1 method, you are committing scientific fraud. However, this sort of fraud is almost impossible to detect or prove.
So, when we see climate science using one method to analyze warming, and a different method to analyze the pause, this makes it likely that we are witnessing cherry picking of the methods, and that what is being reported are false positives. Specifically, it is very likely the previous warming was not significant. It was an artifact of the statistical model.
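A rough simulation of that point, assuming the 20 methods are independent and the data contain no real signal (so each method's p-value on null data is uniform on [0, 1]):

```python
# How often does at least one of 20 analysis methods cross the 95%
# confidence threshold on pure noise?
import numpy as np

rng = np.random.default_rng(1)
trials, methods, alpha = 100_000, 20, 0.05

p_vals = rng.uniform(size=(trials, methods))   # null p-values are uniform
hit = (p_vals < alpha).any(axis=1).mean()

print(f"P(at least one false positive) ~ {hit:.2f}")
print(f"analytic: {1 - (1 - alpha)**methods:.2f}")  # 1 - 0.95**20 ~ 0.64
# The expected number of false positives is 20 * 0.05 = 1, as the comment says.
```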
To equate climate science to astrology is of course an insult to Astrology. We calculate the earth’s future tides to great precision using the techniques developed by astrology. These were the same techniques that early humans used to predict the seasons. Astrology has a bad name because these techniques have also been applied to personal horoscopes, where they have proven less successful.
Climate Science adds “Science” to its name because it isn’t science. It is only pretending to be one. None of the true sciences need add “Science” to their name. Climate Science is like the Peoples Democratic Republic. The PDR adds “Democratic” to its name because it isn’t a democracy. It is only pretending to be one.
Astrology, in contrast, allows us to predict the specific future state of chaotic systems with a reasonably high degree of accuracy. Something that remains for all practical purposes impossible in all other branches of science.

george e. smith
February 8, 2014 12:11 pm

Geoff Sherrington says:
February 7, 2014 at 11:13 pm

Steven Mosher says: February 7, 2014 at 7:17 pm
“just banish the nonsense of a result being statistically significant. There is no such thing. It’s a bogus tradition.”

One gets the impression that ye olde thermometers could be read to 0.05 degrees, when 1 degree was more like it.
Or that Argo floats are accurate to 0.004 deg C as I recall. Utter BS.
Don’t know what sort of thermometers would be considered olde, or good to one degree or perhaps 0.05 deg.
But it is not that improbable that the thermometers in the Argo floats can resolve, and are repeatable to, 0.004 deg C.
Absolute calibration accuracy is somewhat difficult, but that does not matter a jot in climate studies. When you are measuring “anomalies” who cares what the calibration accuracy is; repeatability and resolution are all that matter; the absolute Temperatures are thrown out with the bath water.
A more important question is: just WHAT Temperature is the thermometer measuring? Is it measuring only its own temperature, or the temperature of something else you are more interested in?
So I don’t know that Argos are reading what people think, to 0.004 deg C.
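A toy demonstration of the anomaly point, with invented numbers: a constant calibration offset cancels out of anomalies, so only repeatability and resolution matter:

```python
import numpy as np

true_temps = np.array([10.00, 10.10, 10.25, 10.15])  # deg C, hypothetical
offset     = 0.8                  # constant calibration error, deg C
readings   = true_temps + offset  # biased but perfectly repeatable sensor

true_anom = true_temps - true_temps.mean()
read_anom = readings - readings.mean()

print(np.allclose(true_anom, read_anom))  # True: the offset drops out
```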

george e. smith
February 8, 2014 12:16 pm

As for P values and null hypotheses.
I believe that a P value is an intrinsic statistical property of some known data set, defined precisely in statistical mathematics text books.
I don’t believe it tells you anything about anything else not in that data set.
And we all know Einstein’s exhortation: “No amount of experiment can prove me right, but a single experiment can prove me wrong.” Or words to that effect.

February 8, 2014 12:45 pm

It takes zero stats to understand that co2 is a problem.
Mosher,
This comment is completely sense-free and indicative of your mindset. All the models agree that without a positive feedback increasing atmospheric water vapor, CO2 by itself is no problem at all. Have you checked to see if atmospheric water vapor has increased as CO2 increases? News flash: It has NOT.
Check out NASA’s NVAP-M study, which is full of bad news for you and your ilk.

Matthew R Marler
February 8, 2014 1:22 pm

Steven Mosher: just banish the nonsense of a result being statistically significant. There is no such thing. It’s a bogus tradition.
You do your test. You report the uncertainty.

The 5% and 1% conventions persist only because a workable improvement has never been demonstrated. Claims that this or that procedure (e.g. some variety of Bayesian inference, or false discovery rate) is an improvement fall apart when you consider that millions of statistical decisions (including medical diagnoses and decisions whether to publish papers) are made daily.
Because of random variability and the vast scale of research and applications, every procedure carries the risk of a high rate of error. And that’s even when the procedures (diagnostics, analyses) are performed without flaw.
Of far greater concern is the suppression of reports of vast numbers of uncounted “negative” results.
Legions of papers and book chapters have been published on these problems.

Frank
February 8, 2014 2:28 pm

Tom wrote: “One especially damning analysis, published in 2005, claimed to have proved that more than half of published scientific conclusions were actually false.”
In science, we never prove that a theory (or conclusion) is correct; we accumulate a body of experimental evidence that is consistent or inconsistent with the theory (or conclusion). When we are unable to reject the null hypothesis using the data with p<0.05 (or some other more appropriate level), the experimental data is usually considered to be INCONCLUSIVE – a result which generally doesn't make a theory or conclusion FALSE. For example, experiments at the Tevatron were unable to provide conclusive evidence for the existence of the Higgs boson, but that work certainly didn't prove that the Higgs didn't exist. There is a big difference between a conclusion being false and the more appropriate phrase you used: "false discovery rate". The fact that experiments that don't produce statistically significant results frequently don't get published certainly means that we need to interpret p values carefully.
If a large clinical trial with a drug fails to show efficacy with p<0.05, that doesn't mean that the drug didn't provide some benefit. That clinical trial was preceded by years of smaller trials in people and animals indicating a large [very expensive] clinical trial was warranted. Based on the earlier trials, a sponsor chooses how many patients to enroll in a large clinical trial so that they have good prospects of showing efficacy with p<0.05. Only an idiot would think that drug companies are running large clinical trials solely based on the hope of obtaining a positive result by chance in 1 out of 20. Note that the FDA usually requires TWO such clinical trials for approval (1 out of 400). If a new drug treats a life-threatening condition for which no other treatment has been shown to be effective, they request a second large clinical trial be run after approval. The authorities now require that all data from all clinical trials be placed in a public repository (www.clinicaltrials.gov). Only the naive would automatically conclude that a failure to demonstrate efficacy with p<0.05 in one or more clinical trials proves that a drug would not be useful for a different patient population or at a different dose.
The propagandists who say that statistically significant warming was or was not observed over periods X or Y don't really understand statistics. If they presented the 95% confidence intervals, they would find that the confidence intervals for many periods overlap to a significant extent! Whether a warming rate of 0 is or is not included in one of these confidence intervals isn't particularly important; the central value and our confidence in the central value are what matter.

Anthony Zeeman
February 8, 2014 3:06 pm

Real world science: two wrongs don’t make a right.
Climate science: a whole lot of wrongs modeled, averaged, filtered, tweaked and adjusted are somehow right.

Nullius in Verba
February 8, 2014 4:12 pm

“Only an idiot would think that drug companies are running large clinical trial solely based on the hope of obtaining a positive result by chance in 1 out of 20.”
Dunno. If the trial costs $1m, and the drug has a potential market able to return $100m, it makes sense from a financial point of view, if not an ethical one. Not that I think very many of them do – unconscious cognitive biases are more than sufficient for people to fool themselves without anyone playing such dangerous games.
One thing that can happen is that the company does a trawl of ten thousand different preparations looking for efficacy, with maybe a 1-in-20,000 chance for each. In vitro tests narrow it down to a sample with 1-in-1000 odds, animal tests whittle it down to 1-in-50, then a human trial gets you to 1-in-2. Considering where you started, that’s a hell of an improvement in confidence, and a pretty fair improvement in the odds, which for a life-threatening condition is not to be dismissed. But the point remains – where you end up depends on where you started. P-values only give a rough idea of the size of jump in confidence, they don’t tell you what the final confidence is. And people are assuming wrongly that 5% means a 5% probability of error.
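That pipeline expressed as odds updates, using the stage-by-stage chances quoted above (the likelihood ratios are implied by those numbers, not stated in the comment):

```python
# Each screening stage multiplies the odds that a candidate drug works.
stages = [("initial trawl",      1 / 20_000),
          ("after in vitro",     1 / 1_000),
          ("after animal tests", 1 / 50),
          ("after human trial",  1 / 2)]

prev_odds = None
for name, p in stages:
    odds = p / (1 - p)
    note = ""
    if prev_odds is not None:
        note = f"  (odds multiplied by ~{odds / prev_odds:.0f})"
    print(f"{name:18s} P(works) = {p:.5f}{note}")
    prev_odds = odds
# The final 1-in-2 looks weak in isolation, but it is a huge climb
# from the 1-in-20,000 starting point.
```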
The way they actually do it is the right thing to do. The problem is that people not only misunderstand p-values, they misunderstand the purpose of the scientific journals too. The idea of the journals is not to provide a stamp of authority on reliable science – it is to report interim results for checking by one’s scientific peers. The journal peer review is a purely editorial function to confirm that it is worth the journal audience’s time to look at. But the purpose in publishing it is to allow other researchers to try to replicate it, extend it, debunk it, generalise it, etc. Only after it has survived this challenge can it be considered ‘accepted’. And as such, the confidence level required is not that needed for science to be ‘accepted’ (the idea that a 5% error rate would be tolerable in science is laughable!), but it only needs to be sufficient to say ‘this is worth looking at’. A p-value of 5% ought to shift your confidence (from wherever it starts) by a noticeable amount. It might say “this formerly unlikely possibility is now somewhat less unlikely”, or it might say “this former contender is now the leader”, or it might say “what was formerly only the most likely explanation is now quite strongly confirmed.”
And there’s absolutely nothing wrong with them doing that, so long as you don’t go round thinking that papers in journals are to be considered “settled science”, or even that they’re 95% sure. They’re work in progress, and we *expect* a large fraction of them to be wrong. They’re supposed to be. We’re only saying they’re worth checking out, we’re not saying they’re true.
The appropriate trade-off point depends primarily on how many potential results arise. You need to cut the number down so that there are enough to keep everyone busy, but not so many that people can’t keep up with the field. Some fields like particle physics generate a huge number of possible results, so they set stringent levels in order that people only spend time on the very best prospects. Other areas can afford to be more relaxed. It depends too on the potential benefits if it happens to be true – long shots are sometimes worthwhile.
Clearly, a 5% error rate is not sufficiently low, since science commonly chains together many results in longer arguments. A mere 14-step argument in which every step is only 95% likely to be right is more likely to be flawed than not. Even at 99% certainty per step, an argument can run only about 69 steps before it, too, is as likely flawed as sound. But many scientific arguments rely on hundreds of results. If CAGW was genuinely 95% confident (and it’s not), that might arguably be enough for politics (depending also on costs and benefits), but it’s *far* from enough to be “settled science”.
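Checking that arithmetic, assuming the steps are independent:

```python
# Probability that an n-step argument holds when each step holds with
# probability p (independence assumed).
print(0.95 ** 14)  # ~0.488: 14 steps at 95% each is already likelier wrong than right
print(0.99 ** 69)  # ~0.500: even at 99% per step, 69 steps reach coin-flip territory
```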

Evan Jones
Editor
February 8, 2014 4:41 pm

So are those P < 0.05 surface station results we found just lucky, Anthony?
Ah gots to know!

John Finn
February 8, 2014 5:05 pm

Having read the comments on this thread I think I’m broadly in agreement with Nick Stokes. I’m not sure if this helps or makes things any clearer but it might be worth looking at an example.
The SKS trend calculator gives the 1996-2013 UAH trend as
0.120 ±0.188 °C/decade (2σ)
Using the conventional P threshold of 0.05 the trend is not significant. However, the result suggests that the probability that the trend is greater than ZERO is around 90%. In other words it is far more likely to be warming than not. Even the RSS trend since 1996 has a higher probability (~60%) that the trend is greater than ZERO.
That said, it’s unlikely that the trends are as high as those projected by the IPCC models.
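Where the ~90% figure plausibly comes from, assuming the trend estimate is normal with the quoted mean and 2σ width; the reply from Nullius in Verba below explains why this reading of the interval is contested:

```python
from scipy.stats import norm

mean, two_sigma = 0.120, 0.188  # SKS 1996-2013 UAH trend, degC/decade
sigma = two_sigma / 2

# Mass of the sampling distribution above zero, read (naively) as
# "probability the trend is greater than zero".
print(f"{norm.cdf(mean / sigma):.2f}")  # ~0.90
```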

DAV
February 8, 2014 5:27 pm

p-value isn’t a probability and it certainly doesn’t show your theory has any merit.
There are many things wrong with using p-values.
For starters, I suggest reading:
http://wmbriggs.com/blog/?p=11261
http://wmbriggs.com/public/briggs.p-value.fallacies.pdf
http://wmbriggs.com/blog/?p=8295
And a zillion others just like them.

Mac the Knife
February 8, 2014 8:15 pm

Mark T says:
February 8, 2014 at 9:10 am
Maybe that really is what separates engineers from the rest of the scientific community (as noted above): we are specifically trained to root out meaningless (spurious) correlations, and as such, always look for alternative answers that better describe the problem at hand.
Mark,
Truth!
Mac

richardscourtney
February 8, 2014 11:27 pm

John Finn:
At February 8, 2014 at 5:05 pm you say

… I think I’m broadly in agreement with Nick Stokes.

Well, yes. Everybody knows that. But do you have anything to add to the thread?
Richard

Adam
February 9, 2014 12:06 am

It is the age old case. Statistics should be left to statisticians. Do not expect a scientist to be a statistician. The job of the Scientist is different to that of the statistician. The scientist gathers the data and interprets it the best he can in terms of proposing mechanisms and theories.
It is left to separate works by statisticians to carefully decipher the statistical significance of the result.
Finally, even if some result was statistically insignificant, that does not mean that it is incorrect and should be rejected. There IS ALWAYS the possibility that it is real.

John Finn
February 9, 2014 2:52 am

richardscourtney says:
February 8, 2014 at 11:27 pm
John Finn:
At February 8, 2014 at 5:05 pm you say
… I think I’m broadly in agreement with Nick Stokes.
Well, yes. Everybody knows that. But do you have anything to add to the thread?

Just trying to clarify things for you, Richard. You appear to think statistical significance is a “knife-edge” issue but it’s a bit more blurred than that. In an earlier post Nick writes

The UAH index shows a trend of 1.212°C/century since Jan 1996. That’s not quite significant re 0, so we can’t rule out an underlying zero trend. But we also can’t rule out the upper limit of 2.44°C/century (or anything in between). In fact 2.44 is as likely as 0. Now that would be highly discernible warming. In fact, the observed 1.212°C/cen is considerable.

Nick’s right. A “true” trend above 2.4 degrees per century is just as likely as a trend below ZERO – though neither is very likely.
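The symmetry behind that claim, sketched with an assumed normal sampling distribution (the σ here is implied by “not quite significant”, not quoted by Nick):

```python
from scipy.stats import norm

est   = 1.212  # degC/century, the observed UAH trend since Jan 1996
sigma = 0.62   # assumed: est/1.96, i.e. just shy of significance

# 0 and 2*est sit symmetrically about the estimate, so they fit the
# data equally well under this model.
print(norm.pdf(0.0,     loc=est, scale=sigma))
print(norm.pdf(2 * est, loc=est, scale=sigma))  # same density
```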

Nullius in Verba
February 9, 2014 3:01 am

“Using the conventional P threshold of 0.05 the trend is not significant. However, the result suggests that the probability that the trend is greater than ZERO is around 90%. In other words it is far more likely to be warming than not.”
No, that’s the misunderstanding that everyone keeps making about p-values. It doesn’t mean that the probability of the trend being greater than zero is 90%. It means that the probability of seeing a less extreme slope if the true trend is actually zero (and the noise fits a certain statistical model) is 90%. (The SkS calculator apparently assumes ARMA(1,1) noise.)
The two probabilities are different. Consider this silly example. Suppose I want to know what the true slope of the global mean temperature anomaly is, and I decide to estimate it using the rather strange method of throwing two dice and adding the results up. I get a 5 and a 6 making 11. If the true temperature trend is zero, what is the probability of me getting a less extreme value than this? The answer is 33/36 = 92%. Not quite ‘significant’, but not far off. But what does that tell me about the probability of the true trend being zero?
That’s an extreme example, but illustrates the point that there’s not necessarily any connection between p-values and the probabilities of the null and alternative hypotheses. The p-value is one of the most commonly misunderstood statistics.
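The dice arithmetic, enumerated:

```python
from itertools import product

# All 36 equally likely rolls of two dice.
sums = [a + b for a, b in product(range(1, 7), repeat=2)]
less_extreme = sum(1 for s in sums if s < 11)

print(f"{less_extreme}/36 = {less_extreme / 36:.3f}")  # 33/36 = 0.917
```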

John Finn
February 9, 2014 3:08 am

DAV says:
February 8, 2014 at 5:27 pm
p-value isn’t a probability and it certainly doesn’t show your theory has any merit.

Is this intended for me?
First, I haven’t got any theory. I’m just providing a very basic analysis of the UAH trend test statistic. Secondly, if the P-value doesn’t represent a probability, what does it represent? A P-value of less than 0.05 suggests that the probability of obtaining a given test result by chance is less than 5%.

There are many things wrong with using p-values.

There aren’t “many things wrong with using p-values”. There are many ways that p-values can be misinterpreted. There are also times when the value of p-values can be exaggerated. For example, the claims that warming has stopped since 1997 or 1998 or whatever are stretching things a bit.

Chas
February 9, 2014 3:48 am

Eric Worral, thanks for the thought-provoking link to RA Fisher. From the chronology in Wiki it looks as if his time spent thinking about Eugenics preceded his stats work. As luck would have it, the first article that I came across from the Eugenics Review Journal was by Fisher:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2986993/pdf/eugenrev00372-0017.pdf
It is salutary to consider what would be happening now if the concept of evolutionary theory had emerged only in the 1950s: the Eugenic catastrophists would have loads of colour graphics from giant computer simulations of ‘the future’ to convince the decision makers. The simulations would all be based on straightforward mathematics [with just a few parameterisations ;-)].
I can see quite a few of our present day scientists and politicians being dragged onto the rocks.
It’s kind of fun mapping the actors across from the current global warming farce; will Mosher make it through?? Not with thoughts like “With regard to the ipcc. It takes zero stats to understand that co2 is a problem”.

John Finn
February 9, 2014 4:07 am

Nullius in Verba says:
February 9, 2014 at 3:01 am
No, that’s the misunderstanding that everyone keeps making about p-values. It doesn’t mean that the probability of the trend being greater than zero is 90%.

Ok, I understand the difference, but I deduced the 90% from the error bars, not the p-value.
However, it is still true to say that the likelihood that the observed UAH data is from a population where the true trend is ZERO is the same as the likelihood that it is from a population where the true trend is 0.24 degrees per decade.

Kevin Kilty
February 9, 2014 8:39 am

The problem is that statistical tests, in three different versions, actually supply answers to three different questions:
Neyman-Pearson inference answers the question “What should I do?”
Bayesian methodology answers the question “What should I believe?”
Likelihood answers the question “How strong is this evidence?”
Any inference method based on p values is probably inappropriate for answering the sorts of questions that science and engineering generally ask. You all do know that Ph.D. scientists often have a pretty thin to nil background in probability and statistics, do you not? At the university where I currently work we have a graduate-level course aimed at Ph.D. candidates and Post-Docs. A few faculty even sign up. Even so, I think this course covers mainly the classical sort of inferential statistics using p-values, confidence intervals, and rejection regions.

Kevin Kilty
February 9, 2014 9:06 am

Consider this silly example. Suppose I want to know what the true slope of the global mean temperature anomaly is, and I decide to estimate it using the rather strange method of throwing two dice and adding the results up. I get a 5 and a 6 making 11. If the true temperature trend is zero, what is the probability of me getting a less extreme value than this? The answer is 33/36 = 92%. Not quite ‘significant’, but not far off. But what does that tell me about the probability of the true trend being zero?

I read examples of this sort all the time, and it appears that no one observes that the “method” in question here has no construct validity. The dice have less (very much less) validity in measuring temperature trends than, say, using a thermometer. Yes, it is a silly example. A person can point me toward all sorts of statistical measures that demonstrate significance of many things, but without a testable theory that explains the physical mechanism involved, I’m unlikely to be convinced.

Nullius in Verba
February 9, 2014 11:14 am

“However, it is still true to say that the likelihood that the observed UAH data is from a population where the true trend is ZERO is the same as the likelihood that it is from a population where the true trend is 0.24 degrees per decade.”
Only if the data happens to be of the form of a linear trend plus ARMA(1,1) noise, which is to some degree begging the question. If you assume trend+ARMA(1,1) you get one answer; if you assume, say, no trend+ARIMA(3,1,0) as Doug Keenan did, you get a completely different answer, which actually fits the data somewhat better, but for which the trend is zero by definition.
When you get out what you put in, that’s an indication that the data doesn’t contain a definitive answer, and the answer you appear to be getting is illusory; an artefact of the method you’re using. *If* there’s a non-zero trend, this gives you an estimate of it, but it can’t tell you if there’s a trend.
For that, you need an accurate, validated physical model of the background noise statistics, which we don’t have. So all this analysis is a waste of time.
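A sketch of that model-dependence using statsmodels on a placeholder series (not real temperature data, and not Keenan’s actual analysis):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(2)
y = 0.05 * np.cumsum(rng.normal(size=200))  # placeholder series, not real temps

# Model A: linear trend plus ARMA(1,1) noise; yields a trend estimate.
trend_fit = ARIMA(y, order=(1, 0, 1), trend="ct").fit()
# Model B: ARIMA(3,1,0) with no trend term; trend is zero by construction.
drift_free = ARIMA(y, order=(3, 1, 0), trend="n").fit()

print("trend + ARMA(1,1) AIC:", round(trend_fit.aic, 1))
print("ARIMA(3,1,0)      AIC:", round(drift_free.aic, 1))
# The two fits imply opposite answers about a trend, and AICs across
# different differencing orders aren't directly comparable anyway;
# that is the point: the data alone can't adjudicate the noise model.
```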
“I read examples of this sort all the time, and it appears that no one observes that the “method” in question here has no construct validity. The dice have less (very much less) validity in measuring temperature trends, than say, using a thermometer.”
Of course. That was the point. All measurement methods provide some balance of signal and noise. I picked a method for my example that shifted the balance all the way to the end – so that it was *no* signal and *all* noise. And yet, you can still get a significant p-value from it.
The point is that a p-value doesn’t tell you if your measurement method is any good – it assumes it. So even a totally rubbish method can still seem to work, and people who only look at p-values mechanically, without understanding, can be easily misled.

DAV
February 9, 2014 11:56 am

John Finn February 9, 2014 at 3:08 am: Is this intended for me?
Not particularly but I see some reading might be in order.
if the P-value doesn’t represent a probability what does it represent? A P-value of less than 0.05 suggests that the probability of obtaining a given test result by chance is less than 5%.
In frequentist terms you aren’t permitted to call it a probability, if only because you aren’t allowed to assign probabilities to unobservables. Its definition is rather wordy and not readily available online that I can see, but it goes something like: the long-run frequency of seeing a statistic (the one from which the p-value is derived) equal to or larger than the one actually found, given that the parameter in question is at its null value (equal means, or a zero slope in a linear regression), if the experiment (whatever) were repeated an infinite number of times.
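That definition illustrated by brute force, with a constructed one-sample t statistic (nothing from this thread):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
observed_t = 2.0  # hypothetical statistic from the "real" experiment
n, reps = 30, 100_000

# Repeat the experiment many times in a world where the null is true.
samples = rng.normal(size=(reps, n))  # the mean really is 0
t_stats = samples.mean(axis=1) / (samples.std(axis=1, ddof=1) / np.sqrt(n))

# Long-run fraction at least as extreme as the observed statistic...
print((np.abs(t_stats) >= observed_t).mean())
# ...matches the two-sided p-value computed analytically.
print(2 * stats.t.sf(observed_t, df=n - 1))  # ~0.055
```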
Even so, why on Earth would you want a regression if the purpose is not prediction? Can’t you just look at the data with your eyes? Are temps higher on the left or right? Are they equal? Just look!
There aren’t “many things wrong with using p-values”. There are many ways that p-values can be misinterpreted.
Their interpretation is pointless. The p-value is answering a question regarding the quality of the model parameters and not at all answering the question of whether the model is useful or valid. The latter answer is what people generally want to know. The former: who cares?

Matthew R Marler
February 9, 2014 12:25 pm

DAV: Their interpretation is pointless. The p-value is answering a question regarding the quality of the model parameters and not at all answering the question of whether the model is useful or valid. The latter answer is what people generally want to know. The former: who cares?
This happens all the time in science, not just in statistics: the question that you can answer is different from the question that you want to answer.

DAV
February 9, 2014 12:35 pm

Matthew R Marler, This happens all the time in science, not just in statistics: the question that you can answer is different from the question that you want to answer.
Sad, isn’t it?