**To make science better, watch out for statistical flaws**

*First of two parts*

As Winston Churchill once said about democracy, it’s the worst form of government, except for all the others. Science is like that. As commonly practiced today, science is a terrible way of gathering knowledge about nature, especially in messy realms like medicine. But it would be very unwise to vote science out of office, because all the other methods are so much worse.

Still, science has room for improvement, as its many critics are constantly pointing out. Some of those critics are, of course, lunatics who simply prefer not to believe solid scientific evidence if they dislike its implications. But many critics of science have the goal of making the scientific enterprise better, stronger and more reliable. They are justified in pointing out that scientific methodology — in particular, statistical techniques for testing hypotheses — has more flaws than Facebook’s privacy policies. One especially damning analysis, published in 2005, claimed to have proved that more than half of published scientific conclusions were actually false.

A few months ago, though, some defenders of the scientific faith produced a new study claiming otherwise. Their survey of five major medical journals indicated a false discovery rate among published papers of only 14 percent. “Our analysis suggests that the medical literature remains a reliable record of scientific progress,” Leah Jager of the U.S. Naval Academy and Jeffrey Leek of Johns Hopkins University wrote in the journal *Biostatistics*.

Their finding is based on an examination of P values, the probability of getting a positive result if there is no real effect (an assumption called the null hypothesis). By convention, if the results you get (or more extreme results) would occur less than 5 percent of the time by chance (P value less than .05), then your finding is “statistically significant.” Therefore you can reject the assumption that there was no effect, conclude you have found a true effect and get your paper published.
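By way of illustration (a made-up example, not from the study): a coin that lands heads 60 times in 100 flips gives the following one-sided P value under the null hypothesis of a fair coin.

```python
from math import comb

# Hypothetical example: 60 heads in 100 flips of a coin.
# Null hypothesis: the coin is fair (no real effect).
# The P value is the probability of a result at least this extreme,
# here one-sided: 60 or more heads out of 100.
n, k = 100, 60
p_value = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
print(round(p_value, 4))  # ~0.028: below .05, hence "statistically significant"
```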

As Jager and Leek acknowledge, though, this method has well-documented flaws. “There are serious problems with interpreting individual P values as evidence for the truth of the null hypothesis,” they wrote.

For one thing, a 5 percent significance level isn’t a very stringent test. Using that rate you could imagine getting one wrong result for every 20 studies, and with thousands of scientific studies going on, that adds up to a lot. But it’s even worse. If there actually is no real effect in most experiments, you’ll reach a wrong conclusion far more than 5 percent of the time. Suppose you test 100 drugs for a given disease, when only one actually works. Using a P value of .05, those 100 tests could give you six positive results — the one correct drug and five flukes. More than 80 percent of your supposed results would be false.
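That arithmetic is easy to check with a small simulation; the 80 percent detection rate assumed for the one real drug is an illustrative figure, not something from the article.

```python
import random

random.seed(0)

# Simulate the thought experiment: 99 useless drugs, 1 real one.
# Each useless drug has a 5% chance of a fluke "significant" result;
# the real drug is assumed (for illustration) to be detected 80% of the time.
runs = 10_000
flukes = real_hits = 0
for _ in range(runs):
    flukes += sum(random.random() < 0.05 for _ in range(99))
    real_hits += random.random() < 0.80

avg_flukes = flukes / runs            # roughly 5 flukes per run of 100 tests
fdr = flukes / (flukes + real_hits)   # fraction of all "positives" that are false
print(round(avg_flukes, 2), round(fdr, 2))
```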

read more here: https://www.sciencenews.org/blog/context/make-science-better-watch-out-statistical-flaws

The biggest flaw in the methodology being used is that it presumes that rejection of the null hypothesis, even if true, has any appreciable significance, or is important. Use a large enough sample size and one can often detect very small (and insignificant) effects. One is almost never interested in whether the null hypothesis is actually true, but whether the actual effect is significant, not merely whether it’s statistically significant at some p level. One should instead demonstrate that the effect is of a significant magnitude, using statistical tests to do so.
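A rough sketch of this point, with illustrative numbers: a true difference of 0.02 standard deviations is practically nil, yet with 200,000 observations per group it sails past the 5 percent significance bar, and only the confidence interval reveals how trivial the effect is.

```python
import math
import random
import statistics

random.seed(1)

# Two groups of 200,000 observations; the true difference in means is
# a trivial 0.02 standard deviations.
n = 200_000
a = [random.gauss(0.00, 1) for _ in range(n)]
b = [random.gauss(0.02, 1) for _ in range(n)]

diff = statistics.fmean(b) - statistics.fmean(a)
se = math.sqrt(statistics.variance(a) / n + statistics.variance(b) / n)
z = diff / se

print(abs(z) > 1.96)   # "statistically significant" at the 5% level
print(round(diff - 1.96 * se, 3), round(diff + 1.96 * se, 3))  # CI shows a tiny effect
```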

“There are serious problems with interpreting individual P values as evidence for the truth of the null hypothesis”

Yes, that’s elementary. P<0.05 can reject the null hypothesis. But otherwise the test fails. You can’t deduce the null hypothesis is true. You just don’t know.

That’s why all this talk of “statistically significant warming” is misconceived. You can test whether a trend is significantly different from zero, and maybe deduce something if it is. But if it isn’t, your test failed to reject. No result.

“For one thing, a 5 percent significance level isn’t a very stringent test. Using that rate you could imagine getting one wrong result for every 20 studies…”

OK, use the number you think is right. What if that means that nineteen times you discover a useful new drug, and one time you get a dud? Sounds good to me.

“Suppose you test 100 drugs for a given disease, when only one actually works.”

Sounds like you probably shouldn’t be doing that. But OK, if you know those are the odds and it’s still worthwhile, you should adjust the P-value accordingly. Which means, incidentally, that the target drug has to be very good to stand out from the noise.

“Suppose you test 100 drugs for a given disease, when only one actually works. Using a P value of .05, those 100 tests could give you six positive results — the one correct drug and five flukes.”

It is a bit hard to say anything at all about the significance of testing without knowing the trial design, number of replicates or the variance of the response, but note that on the above basis you must also have some chance of missing the detection of the correct drug.

The statistics above are even worse than we thought.

What is the probability of a false negative? I don’t know, but whatever it is it has a chance of occurring.

We might miss our one right answer. So the odds are more than four to one against a test correctly identifying that one true result.

Nick Stokes makes some good points above. But unless I’m misreading him, he’s assuming that we know ahead of time how many of our drugs are effective, so we can adjust the p-value of the trial. That assumption wasn’t specified in the lead article, and is relevant to a completely different statistical situation.

just banish the nonsense of a result being statistically significant. There is no such thing. It’s a bogus tradition.

You do your test. You report the uncertainty.

There will of course be cases where .05 is not good enough. Think particle physics.

There will of course be cases where one would make a decision on 75% certainty or less.

The notion that there is some special value that allows us to automatically accept or reject tests is the culprit. You report 95%. From that one cannot deduce anything. You can observe that the result might be wrong 1 in 20 times. That tells you nothing about the truth of the matter.

Of course some will accept 95% and build on the foundation. Chances are this is the right pragmatic decision. Others will exercise skepticism and perhaps win a grand prize 1 out of 20 times. But the number 95 tells you nothing about which choice to make. It doesn’t direct you or order you to accept the science and it doesn’t order you to reject it. The question is …

A very common practice is to test multiple potential effects, one or more of which have a p-value <0.05. You can no longer necessarily say you have rejected the null hypothesis, because the multiple tests made it easier for a null result to sneak below 0.05. The Bonferroni correction, IIRC, says to divide your significance level by the number of tests to find your target. Testing 5 separate effects requires that you reach p<0.01 to claim significance. Bonferroni has his detractors, but the general idea is clear.
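A minimal sketch of the Bonferroni idea, with made-up p-values:

```python
# Bonferroni correction: with m tests at overall level alpha, each
# individual test must clear alpha / m to claim significance.
alpha = 0.05
p_values = [0.030, 0.008, 0.045, 0.200, 0.011]   # hypothetical results of 5 tests
threshold = alpha / len(p_values)                # 0.05 / 5 = 0.01

significant = [p for p in p_values if p < threshold]
print(threshold)    # 0.01
print(significant)  # only 0.008 survives; 0.030 and 0.045 no longer count
```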

To test the success rate of science you only have to get hold of some scientific magazines — New Scientist or Scientific American, say — from twenty years ago, and judge for yourself how many of the world-shattering events and discoveries announced there have actually had any impact on everyday life. Five per cent would be a generous estimate. But as others have already pointed out, that’s five per cent better than any other method.

To be remembered (this one is from a link online; sometimes other words are used, but the link includes all there is to be remembered before one goes on analysing one’s results):

The strength of the evidence for the alternative hypothesis is often summed up in a ‘P value’ (also called the significance level) – and this is the point where the explanation has to become technical. If an outcome O is said to have a P value of 0.05, for example, this means that O falls within the 5% of possible outcomes that represent the strongest evidence in favour of the alternative hypothesis rather than the null. If O has a P value of 0.01 then it falls within the 1% of possible cases giving the strongest evidence for the alternative. So the smaller the P value the stronger the evidence.

Of course an outcome may not have a small significance level (or P value) at all. Suppose the outcome is not significant at the 5% level. This is sometimes – and quite wrongly – interpreted to mean that there is strong evidence in favour of the null hypothesis. The proper interpretation is much more cautious: we simply don’t have strong enough evidence against the null. The alternative may still be correct, but we don’t have the data to justify that conclusion.

(“Statistical significance”, getstats.org.uk)

That said, it’s also important to analyse the question one has put forward to falsify the hypothesis in question. As Gerhard Vollmer wrote in 1993,

Wissenschaftstheorie im Einsatz, Stuttgart 1993: „Die Wichtigkeit oder Bedeutung eines Problems hängt immer auch von subjektiven, bewertenden Elementen ab.“

Quick English translation: The importance or significance of a problem always depends on subjective, evaluative elements.

In other words, one has to remember that none of us is without tendencies in our backpack. This means that we have to be careful not to mix black, grey and white alternatives, nor to ask dependent questions. Remember, in every analysis of a result that tries to live up to the theory of science, it’s better to use Chebyshev’s inequality next in the analysis.

While all this might give you more than a hint of a certain type of observation, the ‘fact’ observed in curves that two types of observation interact significantly with each other is a totally different thing.

If A can be shown to lead up to B in X studies, and at the same time some B lead up to C, no null hypothesis whatsoever is enough to prove that A leads to C.

You had better use set theory and number theory on your two variables/curves in order to be able to draw a more than probable conclusion.

Please deposit your p-values in the appropriate receptacles in the restrooms. Thank You! ;-)

There’s misunderstanding of the whole concept of statistical significance. P < 0.05 is really the point at which it is considered feasible to take an interest in the effect. It’s a threshold at which to do research work. If the effect is P > 0.05 we conclude there is no point in further inquiry. It is unfortunate that many see statistical significance as proof of the effect. It’s not. It merely lends evidence to the effect.

Publishing a study that claims to reject the null hypothesis at P>0.02 would be largely ignored in my field – it would attract sharp critique in peer review and probably be rejected unless the notion that there is any probability of an effect at all is of strong interest. For example, in new lines of cancer research where there may be a possibility that multiple drugs may interact in a way that the sum effect may be greater than its individual parts. A P value between 0.02 and 0.05 could suggest there’s a new drug to play with in further research.

Statistical significance serves another purpose. It also avoids 95% of the noise in published literature where an effect is claimed. It is a tacit agreement among researchers that there’s a gatekeeper and filter that we use to avoid wasting everyone’s time talking about “my great discovery” when it’s not at all important.

Wow, using the less than sign followed later in the sentence with a greater than sign in WordPress serves to delete what was between the two symbols. My second sentence was supposed to say:

“P [less than] 0.05 is really the point at which it is considered feasible to take an interest in the effect. It’s a threshold at which to do research work. If the effect is P [greater than] 0.05 we conclude there is no point in further inquiry.

HankH, WordPress interpreted your symbols as a nonsense HTML tag, ignoring them and everything in between.

Gary, thanks. The next time I’ll know better than to talk statistics without checking the layout of my symbols. ;-)

The study is a waste of time. It assumes that the largest problem is false positives. It isn’t.

P-values only measure *statistical* effects. They say nothing about the trial itself. P-values don’t detect:

— biased experiments

— experimental errors in set-up, measurement or collation

— “correct” values that are, unfortunately, based on incorrect theoretical underpinnings (there were a host of experiments to test the aether that were sadly never going to be correct just because the result happened to pass some silly significance test)

— cherry picked or “corrected” measurements because the experimenter knew what the result should be

— maths errors in calculating the p-values

— maths errors in any other part of the experiment

— repeated runs of similar experiments until a “result” occurs (yes, the p-values should be corrected to allow for this, but there is no way the authors of our studies bothered to check if this is being done)

— incorrect conclusions from the result (that the p-value shows significance doesn’t mean that the experiment says what people think it says)

— and, not least, outright fraud.

It says something about the blindness of modern science that this stupid paper passed peer review.

Just because it thinks it refutes Ioannidis doesn’t mean that scientific papers are mostly correct — a perfect example of my point above, that just because you get a statistical result doesn’t mean that it gets interpreted correctly. Papers can be wrong in so many ways that the authors apparently didn’t even consider.
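One item on the list above, repeated runs until a “result” occurs, is easy to simulate: under a true null the p-value is uniform on (0, 1), so persistence alone manufactures a discovery.

```python
import random

random.seed(2)

# Under a true null hypothesis the p-value is uniform on (0, 1), so an
# experimenter who simply re-runs until p < 0.05 always "succeeds" eventually.
attempts = []
for _ in range(1_000):
    tries = 1
    while random.random() >= 0.05:   # the p-value of one null experiment
        tries += 1
    attempts.append(tries)

avg = sum(attempts) / len(attempts)
print(round(avg, 1))   # about 20 re-runs, on average, to a fake "discovery"
```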

Most of you must have lived through the discovery of the Higgs (particle physics, CERN, LHC). Maybe also saw the dithering for a year over whether it was the Higgs or not. This last was due to the fact that the statistical significance was lower than 5 standard deviations (assuming a normal distribution, a chance of error of one in almost two million). The five percent quoted above corresponds to 2 standard deviations. In particle physics the road to hell is paved with lower-sigma “resonances/particles.”

One cannot expect a study to be done on a million people; that happens, as with thalidomide, when the medicine is released, and then they go “oops.” But a hundred is too low a number to avoid random correlations. Of course they have their control samples and that makes a difference, as does the time taken during the observations; the statistics increase and the correlations are checked. Still, it is ironic that people’s lives are less precious than discovering the Higgs.

The problem is even worse. You choose a confidence level and then see if your results are significant at that level. The IPCC in AR4 used a confidence level of 90%. That means that you have a 10% chance of a false positive. However, if you have two papers/conclusions that use a confidence level of 90% and the results of one depend on the other, then the chance of a false positive becomes 1-(0.9 x 0.9) or 19%. If you have three levels, then your chance of a false positive using a confidence level of 90% becomes 27%. For 4 levels, it is 34% and for five levels it is 41%. Beyond six levels, your chance of a false positive becomes better than a coin toss (at seven levels it is 52%).

There are probably at least six levels of conclusions/results in the IPCC reports (I haven’t counted them), so the IPCC reports as a whole probably have a roughly 50% chance of a false positive. If the confidence level used were 95%, then for six levels you would still have about a 26% chance of a false positive. Only if you use a confidence level of 99% would you have a less than 10% chance of a false positive for six levels (actually about 6%).

However, if climate science were to use a confidence level of 99% to test for significance, I wouldn’t be surprised if most of the field’s results were deemed insignificant. The IPCC rolled back the significance level to 90% from 95% between the TAR and AR4, IIRC.
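The compounding arithmetic in the comment above can be checked in a couple of lines:

```python
# Chance of at least one false positive in a chain of `levels` dependent
# conclusions, each tested at the same confidence level: 1 - conf**levels.
for conf in (0.90, 0.95, 0.99):
    chain = [round(1 - conf ** k, 2) for k in range(1, 8)]
    print(conf, chain)   # the 90% row crosses the coin-toss line at seven levels
```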

Thanks Bob! I’m learning more about HTML tags every day. Using the arrows a lot in publication I forget you can’t get away with it so much in blogs.

I believe that using the following will get you printable less-than and greater-than arrows without confusing WordPress (delete the space after the ampersand in actual use)

& lt; & gt;

Testing: < >

Depending upon the underlying probability distribution, sample size, etc., selecting a smaller p value does not give stronger results. There are Type I errors and Type II errors. The smaller p becomes, the greater the probability of a Type II error. If alpha and beta and Type I and Type II errors do not mean anything to you, then you should consult a competent statistician before you even start the experiment, and get help in understanding the error properties of the statistical tests before you set the significance level (p value) and the sample size.
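A sketch of that trade-off, with arbitrary illustrative choices (a real effect of 0.5 standard deviations, samples of 20): tightening the level from .05 to .01 visibly raises the rate of missed real effects.

```python
import math
import random
import statistics

random.seed(3)

def z_statistic(n=20, effect=0.5):
    """One small experiment on a real effect of 0.5 standard deviations."""
    x = [random.gauss(effect, 1) for _ in range(n)]
    return statistics.fmean(x) / (statistics.stdev(x) / math.sqrt(n))

runs = 2_000
miss_05 = sum(z_statistic() < 1.96 for _ in range(runs)) / runs  # alpha = .05
miss_01 = sum(z_statistic() < 2.58 for _ in range(runs)) / runs  # alpha = .01
print(round(miss_05, 2), round(miss_01, 2))  # Type II rate grows as alpha shrinks
```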

With apologies to The Bard:

To p(), or not to p(),

that is the question — Whether ’tis Nobler in the analytical mind to suffer

The Slings and Arrows of outrageous Probability,

Or to take Statistics against a Sea of Data,

And by opp(o)sing nullify them?

What Mr. Lane said @9:51.


In performing the most simple statistical test, you always test at least two assumptions: that the null hypothesis is true and that the sample is taken randomly. A violation of the latter assumption, usually taken for granted, may explain many false positives. Another problem with the Fisher type of test is that the null hypothesis concerns a point value, almost always a priori false. In a sample sufficiently large, the a priori false null hypothesis will be rejected at any significance level you like.
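That last point is easy to demonstrate; the 0.01 “hair’s-breadth” true mean below is an arbitrary illustration, not data from any study.

```python
import math
import random
import statistics

random.seed(4)

# The true mean is 0.01: not zero, but negligibly different from it.
# The point null "mean = 0" is a priori false, and a big enough sample
# rejects it.
true_mean = 0.01
for n in (100, 10_000, 1_000_000):
    x = [random.gauss(true_mean, 1) for _ in range(n)]
    z = statistics.fmean(x) * math.sqrt(n)   # known sd = 1
    print(n, round(z, 1), abs(z) > 1.96)     # "significant" once n is large
```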


Steven Mosher says: February 7, 2014 at 7:17 pm

“just banish the nonsense of a result being statistically significant. There is no such thing. It’s a bogus tradition.”

Sort of agree. Wish to stress that this type of uncertainty expression is about the spread of values about a mean of some sort, precision if you like.

More concerned about whether the mean is in the right place, than with the scatter about it being 95% enclosed by a certain curve. Bias.

I’m thinking that when you compare a number of temperature data sets with adjustments and there is an envelope around the various adjustments of say +/- 1 deg C, it is rather immaterial to concentrate on precision because it just adds on top of the bias which can often be the larger.

Like this graph from Alice Springs in the centre of Australia – I guess I should update it now we have BEST and CRUTEM4 and even Google.

http://www.geoffstuff.com/Spaghetti_Alice_JPG.jpg

Of course, these concepts are as old as time, but it’s remarkable how, in climate work, the bias aspect is so seldom considered properly, if at all. One gets the impression that ye olde thermometers could be read to 0.05 degrees, when 1 degree was more like it.

Or that Argo floats are accurate to 0.004 deg C as I recall. Utter BS.

But then, you’d broadly agree with me, I suspect.

Nick Stokes:

At February 7, 2014 at 6:30 pm

NO! That is warmist sophistry which pretends the ‘pause’ is not happening.

A linear ‘trend’ can be computed from any data time series. At issue here is whether the trend in global atmospheric temperature anomaly (GASTA) differs from zero (i.e. no discernible global warming or cooling) and – if so – for how long before the present.

Climastrology uses linear trends and 95% confidence. There are good reasons to dispute each of these conventions, but they are the conventions used by climastrology so they are the appropriate conventions in this case.

So, in this case the null hypothesis is that a linear trend in GASTA does not differ from zero at 95% confidence and, therefore, there is no discernible warming. And the period to be determined of no discernible global warming or cooling is up to the present. Therefore, the end point is now and the data is assessed back in time until a linear trend over the period differs from zero at 95% confidence.

Each of the several time series of GASTA indicates no trend which differs from zero (i.e. no global warming or cooling) for at least 17 years until now; RSS indicates 24.5 years.

And it is not reasonable to remove data from the data set(s). 1998 had a high value and there is no possibility of justifying its removal from the data set whatever the cause of it being a high value. This is because the assessment is of how long there has been no discernible warming or cooling, and any distortion of the analysed data provides a distortion of the result of the analysis.

Importantly, 17 years takes us back to 1997 and there was statistically significant warming over the previous 17 years. Therefore, discernible global warming stopped at least 17 years ago.

Richard

Nobody has said the magic word (‘model’). You know that p-value thing? You know how it is calculated? Using a model, which in some cases is only an assumption.

The father of modern statistics was an ardent Eugenics catastrophist – he developed the field of statistics to find mathematical support for his passion.

http://en.wikipedia.org/wiki/Ronald_Fisher

I am not suggesting that statistics are useless because its origins are tainted, what I am suggesting is, if someone with the genius to invent an entire mathematical discipline can be fooled by his own invention, then anyone can get it wrong.

Hence, one of the principal reasons, when I’m detailing doctors, is that the doctor is made aware of the statistical or non-statistical significance of a value, and of the more important and most relevant part of the discussion: is the value change from the placebo or the active-ingredient arm CLINICALLY MEANINGFUL? Clinical meaningfulness of a number is a more practical way to understand a drug’s effect on a subject.

Eric Worrall: ‘The father of modern statistics was an ardent Eugenics catastrophist’.

Consider the historical context. The inconvenient truth is that many of his contemporaries were eugenicists. As embarrassing is the fact that many people in those days were anti-semites. Perhaps the most difficult thing, even for geniuses, is to check assumptions and to think about consequences.

Some points I’ve been thinking about:

1) The people rubbishing medical research don’t release their data. There’s no way to confirm what you’ve been hearing. It’s basically hearsay.

2) Medical journals are the worst for publishing & executing methods. The statistical tests are the least of the problems. Why does this keep coming up? We learn to dismiss articles based on faulty methods or experimental construction, no matter what the statistical significance.

3) People keep trying to examine the outcome of medical research based on whether drugs that work in the lab work in the clinic. This doesn’t measure the search for knowledge, it measures the search for financial success. There’s something to be said for knowledge so reliable you can take it to the bank. However, clinical trials can fail for reasons that have nothing to do with the reliability of scientific knowledge. These exercises looking at the monetization of science are a waste of time. Everything is worthless until it’s not. If you perform an evaluation of the evaluators as in Science mag this week (paywalled unfortunately, http://www.sciencemag.org/content/343/6171/596), you’ll find out that these evaluations are not worth much.

” Therefore you can reject the assumption that there was no effect, conclude you have found a true effect and get your paper published.”

The p-value is not what decides whether a paper gets published. A negative result is still a valid scientific result. But this raises the other problem of the ‘literature’: publication bias. Only publishing positive results falsifies the record as well.

A recent case was Tamiflu, a supposed protection against certain strains of influenza, in which Tony Blair, then prime minister, invested something like 4 billion pounds sterling for a stock of treatments in case an epidemic of bird flu struck the UK.

It has recently been found that about half the studies on the drug were negative but they did not get published.

Good article BTW, thanks.

Well, I think it is not correct to limit science methodology to statistics. As others have pointed out in this discussion, statistics is useful to cut off the noise. However, no statistics can replace a mechanistic explanation. To take an example familiar to my field of work: genetic association studies (those exploring the association of common genetic variants in the population with some clinical phenotype) often get P values below 10^-5 or even 10^-8 (sorry, I am not familiar with tags either). However, only when the finding is biologically explained (with functional experiments) can one claim that the discovery is a scientific advancement. Otherwise you can claim the association, but not that the association has any biological meaning.

I think this should also happen in climate science. Perhaps funding agencies should better invest their money to sponsor studies aimed at understanding the physics underlying the observed phenomena, rather than thousands of useless studies finding statistical associations and then building theories on what is found statistically significant. Science based on statistics actually reverses what science methodology should be, and it is really the prototype of a fishing expedition (you get something but you don’t know why).

richardscourtney says: February 7, 2014 at 11:15 pm

“So, in this case the null hypothesis is that a linear trend in GASTA does not differ from zero at 95% confidence and, therefore, there is no discernible warming.”

No, that’s nonsense, and putting it in bold doesn’t improve it. Where I am, it was 41°C today. Was that statistically significant? Well, maybe not; it’s late summer. But it was still discernibly warm.

You have tested whether the observed trend could have happened with an underlying zero trend and natural variation. And the answer is that that can’t be rejected. But it is not the only possible explanation.

The UAH index shows a trend of 1.212°C/century since Jan 1996. That’s not quite significant re 0, so we can’t rule out an underlying zero trend. But we also can’t rule out the upper limit of 2.44°C/century (or anything in between). In fact 2.44 is as likely as 0. Now that would be highly discernible warming. In fact, the observed 1.212°C/cen is considerable.

What we know is that the measured trend was 1.212°C/cen. That’s what actually happened, and is discernible. The rest is theorising about what might have happened if we could run it again.
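The straddling-interval situation Nick describes can be sketched with synthetic data (18 made-up “annual anomalies”; nothing here is the actual UAH series):

```python
import math
import random

random.seed(5)

# 18 annual anomalies with a true trend of 1.2 deg C/century plus noise.
n = 18
t = [i / 100 for i in range(n)]                      # time in centuries
y = [1.2 * ti + random.gauss(0, 0.15) for ti in t]

# Ordinary least-squares slope and its standard error.
tbar = sum(t) / n
ybar = sum(y) / n
sxx = sum((ti - tbar) ** 2 for ti in t)
slope = sum((ti - tbar) * (yi - ybar) for ti, yi in zip(t, y)) / sxx
resid = [yi - (ybar + slope * (ti - tbar)) for ti, yi in zip(t, y)]
se = math.sqrt(sum(r * r for r in resid) / (n - 2) / sxx)

half_width = 1.96 * se
print(round(slope, 2), "+/-", round(half_width, 2), "C/century")
# With noise this large the interval is typically wide enough to contain
# both zero and roughly double the estimate.
```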

Carlo Napolitano, I agree. Statistical testing is a good safeguard but is not the be-all and end-all of scientific analysis. Too much of what is used in climate science has been inherited from econometrics rather than the physical sciences, and often rather poorly understood at that.

There is another way of judging the veracity of published results: look after, say, ten years and see how many times a result is referenced. If it isn’t, either the result wasn’t important, or nobody believes it anymore, or it has been superseded by later results.

They did such a sobering analysis some decades ago with astronomy/astrophysics papers and found that only a few percent survive the ravages of time.

John Brignell put the problems of statistical significance into layman’s language several years ago. His website, www.numberwatch.co.uk, is worth spending an hour on, and his book “Sorry Wrong Number” is excellent. One of his constant points is that in epidemiology a relative risk of at least 2 (a doubling of the effect) should be seen before the result is taken as important, due to the number of conflicting factors in any open system (doesn’t it sound like AGW?).

Here are a few relevant pages from the website:

http://www.numberwatch.co.uk/statistical_bludgeon.htm

http://www.numberwatch.co.uk/Stuff.htm

http://www.numberwatch.co.uk/an_exercise_in_critical_reading_.htm

He also has several essays on the ridiculousness of AGW and a review of Watermelons :).

Hypothesis testing gets a hard time every now and then from those who think the p-value is the probability that the alternative hypothesis is wrong, or who think that such testing provides proof or disproof of some kind. It does neither. It is merely a means of assessing the strength of evidence in a particular data set, considered in isolation. In general, the null hypothesis is for ‘no effect’, e.g. that the hair on the left side of your head has the same mean diameter as that on the right. We know that is not true. Generally we know the null hypothesis is not true. We are not trying to prove or disprove it. All we are doing is going through a ritual whereby we say, if the null were true (and other conditions deemed applicable hold) what is the probability of getting some statistic as or more extreme than the one computed for this particular set of data? That’s it. A small p does not mean the null is false, a large p does not mean that it is true. The test is making a far more modest contribution than that.

It’s surprising the Ioannidis paper has created such a stir, since it basically just says that a small probability of error in an individual experiment/study, when compounded over thousands of different experiments/studies, results in a much larger probability of error. Or: the probability of at least one head in 1000 spins of a coin greatly exceeds the probability of a head on one spin. Pretty obvious, although pinning down the precise probability of error/false positives over thousands of very different kinds of studies is definitely a hard problem.

The relevance to climatology lies in the proliferation of different measures global warmers are coming up with – sea levels, ice volumes, ocean heat content etc etc. Keep data mining and you will find something still going up steadily! Especially as at least some of these are probably correlated to global average temperature anomaly with a time lag. Not to forget there are half a dozen such global anomalies to begin with.

Nick Stokes:

I am replying to your post at February 8, 2014 at 1:01 am which is here and is in reply to my post at February 7, 2014 at 11:15 pm which is here.

In my post I rightly said of your assertion

I explained

Those conventions were used by climastrology to claim there was global warming. What matters is to use THOSE SAME conventions when assessing the ‘pause’. And it is sophistry to say that different conventions should be used when the result does not fit an agenda.

I stated that “There are good reasons to dispute each of these conventions” but, so what? The only pertinent fact is that those are the conventions used by climastrology. It is ‘moving the goal posts’ to now say those conventions should not be used because they are wrong.

Your reply which I am answering says

That is more sophistry! Whatever the cause of the ‘pause’ may be, it is not pertinent to a determination of the existence of the pause.

The same conventions of climastrology used to determine that there was global warming were used to determine the start of the ‘pause’. And the conclusion of that analysis is as I said

and

The conventions adopted by climastrology may be mistaken (I think they are) but it is not “science” to choose when and when not to use conventions depending on the desired result.

Richard

From the article linked in Tom Siegfried’s essay-

“Others proposed similar methods but with different interpretations for the P value. Fisher said a low P value merely means that you should reject the null hypothesis; it does not actually tell you how likely the null hypothesis is to be correct. Others interpreted the P value as the likelihood of a false positive: concluding an effect is real when it actually isn’t. ”

Seems like Tom Siegfried and many other commenters on this thread, such as Nick Stokes and Steven Mosher, have made the same misinterpretation of what Fisher’s p value actually is, just as is alluded to in that article.

Alpha values are what determine Type 1 errors, or False Positives, per Neyman–Pearson. Fisher p values are about acceptance of the null hypothesis, not about Type 1 and 2 errors, as Tom Siegfried suggests.

What Leonard Lane says at February 7, 2014 at 9:51 pm is spot on, if he means consult a statistician using Bayesian methods.

I am interested in seeing if Tom Siegfried figures out what a p value actually is before he writes part 2 of his essay.

Mindert Eiting:

Eric Worrall: ‘The father of modern statistics was an ardent Eugenics catastrophist’.

Consider the historical context. The inconvenient truth is that many of his contemporaries were eugenicists. As embarrassing is the fact that many people in those days were anti-Semites. Perhaps the most difficult thing, even for geniuses, is to check assumptions and to think about consequences.

A historical example of GIGO – the statistical techniques were well applied, but the data and assumptions were rubbish.

Fast forward to the present day, and the climate “geniuses” can’t even get the statistics right.

richardscourtney, everyone should try the SkS trend calculator, but instead of using the latest figures, feed in 30 year time periods before and after the 1940–1970 cooling.

Whatever Foster and Rahmstorf’s method is calculating, it is not a reliable guide as to whether the world is experiencing a downturn in global temperatures.

richardscourtney says: February 8, 2014 at 2:33 am

“Importantly, 17 years takes us back to 1997 and there was statistically significant warming over the previous 17 years. Therefore, discernible global warming stopped at least 17 years ago.”

Well, that makes absolutely no sense, despite the bold face. Yes, the trend from 1980 to 1997 was significantly different from zero. So was the trend from Jan 1995 to Dec 2012. Does that mean discernible global warming stopped a year ago?

A p-value only gives confidence in rejecting the null hypothesis; it is not proof of an effect. You can propose an alternative hypothesis and test for that as well.

In clinical studies a p-value of 0.01 is typically used, but more important studies have to be properly powered beforehand, and the results have to either agree or disagree with the baseline measurements within their prior agreed-upon confidence intervals.

If AGW research followed the rules required of Pharmaceutical research, the entire dogma would have been rejected by the FDA years ago.

Statistician William M. Briggs wrote:

More at link in a very good essay: http://wmbriggs.com/blog/?p=11305

By the way, Briggs has written extensively about the problem of people misusing statistics. His blog site is a treasure trove of wonderful essays on the issue.

Nick Stokes:

Your post at February 8, 2014 at 3:08 am is yet more of your sophistry.

My post addressed to you at February 7, 2014 at 11:15 pm is here and explained the derivation of my statement saying

But you ignore that and introduce a Red Herring by saying

That is complete nonsense!

As I said in my post at February 8, 2014 at 2:33 am which you claim to be replying

Richard

Oops. I messed up that last. The link is indeed to the rest of the essay quoted from, but the “very good” essay I wanted to point out is the one before that, and the link is: http://wmbriggs.com/blog/?p=11261

It would be nice to be able to edit, but WordPress says that could lead to problems. They are most likely correct. :-(

Eric Worrall:

I agree with your post at February 8, 2014 at 3:01 am, which says

However, that has nothing to do with my dispute with Nick Stokes in this thread.

The statistical conventions adopted by climastrology are nonsense. However, they were used to show the existence of discernible global warming in the last century. Stokes is now claiming that those same conventions should not now be used because they now demonstrate that there has not been discernible global warming for at least 17 years.

This is an important issue which goes to the heart of the subject of this thread.

Appropriate statistical methods need to be applied to assess the time series of GASTA. And their appropriateness needs to be defined technically and not on the basis that it fulfills an agenda.

Richard

Hypothesis testing is best thought of in Bayesian terms. You start off with a prior belief in the conclusion. You perform an experiment or make an observation that adds or subtracts from your belief. Your posterior belief, the belief you should have that the conclusion is true after seeing the experiment, is your initial belief *plus* the increment from the experiment.

The p-value is an approximation to the size of the experimental increment. It is *not* the probability of the conclusion being true. The reason for using it as a filter on publication is to say “this experimental result is strong enough to shift your opinion significantly.” It does *not* say “this experiment shows that the conclusion is true.”

The 100 drugs trial example above is a classic example. You start off with a 1% confidence in each of the drugs. You perform the test, and at the end you have a 20% confidence in each of the drugs that passed. That’s a big increase in confidence, and well worth reporting, but if you started at 1% you’re only going to get to 20%, and as noted, that still means there’s an 80% chance you’re wrong.
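The 1% to 20% jump described above is just Bayes’ rule applied to test counts. A minimal sketch of that arithmetic (the function name and the assumed power of 0.8 are mine, for illustration; with these particular numbers the posterior comes out nearer 14 percent, and the exact figure always depends on the assumed power and prior):

```python
# False-discovery arithmetic for a screen of candidate drugs.
# Illustrative assumptions: 1% of candidates truly work,
# significance threshold alpha = 0.05, test power = 0.8.

def posterior_true(prior=0.01, alpha=0.05, power=0.8):
    """P(drug really works | test came out significant), via Bayes' rule."""
    true_pos = power * prior          # truly working drugs that pass the test
    false_pos = alpha * (1 - prior)   # duds that pass purely by chance
    return true_pos / (true_pos + false_pos)

p = posterior_true()
print(f"P(real effect | significant result) = {p:.3f}")  # ~0.14 here
```

The point survives any reasonable choice of numbers: starting from a low prior, a single p &lt; 0.05 result leaves you far more likely wrong than right.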

For the mathematicians:

Bayes says that for two hypotheses H1 and H2 and an observation O…

P(H1|O) = P(O|H1) P(H1)/P(O)

P(H2|O) = P(O|H2) P(H2)/P(O)

so dividing one equation by the other

P(H1|O) / P(H2|O) = [P(O|H1) / P(O|H2)] [P(H1)/P(H2)]

Take logarithms

log[P(H1|O) / P(H2|O)] = log[P(O|H1) / P(O|H2)] + log[P(H1)/P(H2)]

and we interpret this as

log[P(H1)/P(H2)] = prior confidence in H1 over H2

log[P(O|H1) / P(O|H2)] = confidence added by observation O in favour of H1 over H2

log[P(H1|O) / P(H2|O)] = posterior confidence in H1 over H2 after seeing the observation

If H2 is just the opposite of H1, then P(H2) = 1-P(H1), and we can translate the logarithmic confidence scale to probabilities using c = log[p/(1-p)] and back again with p = 1/(1+b^(-c)) where b is the base of the logarithms.

The p-value is just P(O|H2), the probability of the observation under the null hypothesis, and the smaller it is the more confidence we’ve just gained in the alternative hypothesis H1. As you can see, this assumes that the observation is fairly certain to occur under H1, so log[P(O|H1)] is small. If it’s not, p-values taken too literally can give misleading results. However, it’s usually intuitively obvious if that’s the case, and this sort of thing is only a big problem when researchers apply statistical calculations blindly without understanding how the evidence works.

Not that I’m saying that never happens…
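The log-odds bookkeeping above can be checked numerically. This sketch uses natural logarithms, so b = e, and the helper names are mine:

```python
import math

def to_logodds(p):
    """c = ln[p / (1 - p)]"""
    return math.log(p / (1 - p))

def to_prob(c):
    """p = 1 / (1 + e^(-c)), the inverse of to_logodds."""
    return 1 / (1 + math.exp(-c))

# Prior belief of 1%, then an observation whose likelihood ratio
# P(O|H1)/P(O|H2) is 16: posterior log-odds = prior + increment.
prior = to_logodds(0.01)
increment = math.log(16)
posterior = to_prob(prior + increment)
print(round(posterior, 3))  # 0.139: a big shift, yet still probably wrong
```

As in the drugs example, a substantial evidential increment added to a low prior still leaves a modest posterior.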

” Bob says:

February 7, 2014 at 9:05 pm

HankH says:

February 7, 2014 at 7:58 pm

“Wow, using the less than sign followed later in the sentence with a greater than sign in WordPress serves to delete what was between the two symbols”

Hank: It was probably not WordPress that interpreted your arrows to be html tags. The browser is the software charged with that duty, and pretty much all browsers will ignore non-existent html tags without an error statement. So, the left arrow “<” and right arrow “>” with something in between is interpreted as a tag. I don’t know of any way around it. It seems there should be an escape character that could be used.”

________

WordPress parses your comment through a filtered HTML generator before it passes to the comment file, deletes what it sees as incompatible code, and recodes other things, like URLs. The angle brackets are especially tricky because, if they were not screened, the entire blog page could collapse depending on what’s between the brackets. The angle brackets are the fundamental code delimiter in HTML files. It’s not a browser issue, as it happens before your browser sees the parsed code. Sometimes it’s just easier to spell things out. For “greater than” and “less than” you could use (GT) or (LT), e.g. (LT).05. Square brackets are equally cumbersome, since some systems (PHPBB notably) use those as code delimiters.

“Sometimes it’s just easier to spell things out. For “greater than” and “less than” you could use (GT) or (LT), e.g. (LT).05.”

Yes, it’s not WordPress, just HTML. You can use special sequences, given here. For less than, use & lt ; but without spaces (< see). For GT, & gt ;.

David L said:

If AGW research followed the rules required of Pharmaceutical research, the entire dogma would have been rejected by the FDA years ago.

~ ~ ~ ~ ~ ~ ~ ~

The FDA gave its stamp of approval to Merck for Vioxx.

The FDA tried to close down Dr Burzynski’s cancer clinic, keeping him mired in legal battles for years while the Department of Health and Human Services was busy stealing his antineoplaston patents. (ref)

That’s the FDA and pharmaceutical industry research standards that you hold up as a role model. Not a good model.

Orkneygirl (@Orkneygal) says: February 8, 2014 at 2:44 am

“Nick Stokes and Steven Mosher, have made the same misinterpretation of what Fisher’s p value actually [is]”

Not me. Fisher is quoted as saying:

“a low P value merely means that you should reject the null hypothesis; it does not actually tell you how likely the null hypothesis is to be correct.”

That’s exactly what I’m saying. P value tests can’t prove the null hypothesis correct. They can only usefully persuade you to reject it.

So when you say:

“Fisher p values are about acceptance of the null hypothesis”

that’s exactly against what your quote is saying.

From the article:

Ah yes, the Slippery Slope Fallacy. If they stop believing in the results of medical science, they’ll believe in something worse. When I attended church I would get regular sermons on religious beliefs in things which were ipso facto preposterous, with failure to believe in them leading to immoral behaviour or even worse, atheism.

The answer is, of course, that there are too many (medical) science articles that are unreproducible and/or with unreliable marginal experimental results based on poor use of statistical techniques. And those results are trumpeted by a small coterie of scientific journals which trade on “impact” instead of verifiability.

It’s an unvirtuous circle that scientific academies, if they had any use at all, would be trying to break. Instead, scientific academies themselves are stuffed with people who produced the poor research in the first place, and are co-opted to promote orthodoxy of mediocre results and pour calumny on critics.

Khwarizmi on February 8, 2014 at 4:40 am

David L said:

If AGW research followed the rules required of Pharmaceutical research, the entire dogma would have been rejected by the FDA years ago.

~ ~ ~ ~ ~ ~ ~ ~

The FDA gave the stamp of the approval to Merck for Vioxx.

The FDA tried to close down Dr Burzynski’s cancer clinic, keeping him mired in legal battles for years while the Department of Health and Human Services was busy stealing his antineoplaston patents. (ref)

That’s the FDA and pharmaceutical industry research standards that you hold up as a role model. Not a good model.

———–

You proved my point better than I did! Even as crappy as the FDA and Pharma are, applying their less-than-optimal standards AGW would still not hold weight!

John A : “…. leading to immoral behaviour or even worse, atheism.”

why is an atheist “worse” than an immoral christian?

Do you think that no one is capable of being moral without having a ‘representative’ of god to tell him what to do?

I think you need to check your null hypothesis.

As an engineer, I am somewhat bemused by all this statistical theory.

How many so-called “scientists” would let their children fly on an aeroplane that had a 95% probability of completing its journey, that is, one that arrived safely 19 times out of 20 and crashed in flames the 20th time? Or even drive across a bridge that had a 1 in 10,000 chance of falling down when a car drove across it?

Would Mosher depend on feeling lucky if the fate of his offspring was at stake and permit them to fly on the aforesaid aeroplane that had been designed and built by a bunch of climate McScientists who solemnly assured him that the p-value was <0.05 of it falling in pieces in mid-air? Would Grant Foster? Or Michael Mann? I seriously doubt it. And yet they expect us to destroy our economies and hand over ever-increasing quantities of our hard-earned cash to the likes of Al Gore on equally flimsy evidence.

And then scientists look down on engineers for getting their hands dirty by applying science – of which in practically every case they require a vastly more profound understanding, for obvious reasons – to real world problems.

"Scientists" appear to resent the fact that engineers often regard much of their prognostication with amusement verging on contempt – AGW is a case in point – and then wonder why.

Think on, as we say up here in Yorkshire.

A recent paper argues that the p-value threshold should be reduced from 0.05 to 0.005 or 0.001. Revised standards for statistical evidence, Valen E. Johnson. http://www.pnas.org/content/110/48/19313

“Recent advances in Bayesian hypothesis testing have led to the development of uniformly most powerful Bayesian tests, which represent an objective, default class of Bayesian hypothesis tests that have the same rejection regions as classical significance tests. Based on the correspondence between these two classes of tests, it is possible to equate the size of classical hypothesis tests with evidence thresholds in Bayesian tests, and to equate P values with Bayes factors. An examination of these connections suggest that recent concerns over the lack of reproducibility of scientific studies can be attributed largely to the conduct of significance tests at unjustifiably high levels of significance. To correct this problem, evidence thresholds required for the declaration of a significant finding should be increased to 25–50:1, and to 100–200:1 for the declaration of a highly significant finding. In terms of classical hypothesis tests, these evidence standards mandate the conduct of tests at the 0.005 or 0.001 level of significance. ”
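A related calibration makes the same point from the other direction. The Sellke–Berger bound (a different construction from Johnson’s uniformly most powerful Bayesian tests, brought in here only as an illustration) says that for p < 1/e the Bayes factor in favour of the null is at least -e·p·ln(p), so a p-value of 0.05 corresponds to odds against the null of at most about 2.5:1:

```python
import math

def min_bayes_factor(p):
    """Sellke-Berger lower bound on the Bayes factor for H0 (valid for p < 1/e)."""
    assert 0 < p < 1 / math.e
    return -math.e * p * math.log(p)

# Smaller p-values translate to stronger, but still bounded, evidence.
for p in (0.05, 0.005, 0.001):
    bf = min_bayes_factor(p)
    print(f"p = {p}: odds against the null at most about {1 / bf:.1f}:1")
```

Under this bound p = 0.05 is remarkably weak evidence, which is consistent with Johnson’s call for 0.005 or 0.001 thresholds.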

I’ve been working in industrial labs for 35 years and I’ve learned to start any statistical analysis the way I would a statics analysis of forces: start with a drawing. With statistics, that means a plot of the data and a description of the statistical analysis and justification for the assumptions inherent in that analysis. Don’t even calculate a standard deviation if you haven’t looked to see whether the data are normally distributed! Did you see the assumption you made before calculating that standard deviation? (Hint: sometimes a transform of the data will give you a normal distribution.)
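That advice (plot first, check the distribution, consider a transform) can be illustrated with a quick skewness check before trusting a standard deviation. This is a sketch with made-up data; the helper function is mine, and the log transform is the transform hinted at above:

```python
import math
import random

def skewness(xs):
    """Sample skewness: third central moment over the cubed standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sum((x - mean) ** 3 for x in xs) / (n * sd ** 3)

random.seed(42)
data = [random.lognormvariate(0, 1) for _ in range(5000)]  # heavily right-skewed

print(f"skewness of raw data:        {skewness(data):+.2f}")
print(f"skewness of log-transformed: {skewness([math.log(x) for x in data]):+.2f}")
# The log-transformed data is far closer to symmetric, so a standard
# deviation computed on it actually means what you think it means.
```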

Industry is a little different than academia; from Ghost Busters

I’ve added this article to my collection.

See also Revised standards for statistical evidence, by Valen E. Johnson,

Proc. Natl Acad. Sci. USA, Oct. 9, 2013 (print Nov. 11, 2013), doi:10.1073/pnas.1313476110, and the discussion at WUWT, esp. Prof. Robert G. Brown’s comment, plus the article in Nature.

Steven Mosher:

“The notion that there is some special value that allows us to automatically accept or reject tests, is the culprit. you report 95%. from that one cannot deduce anything. You can observe that the result might be wrong 1 in 20 times. That tells you nothing about the truth of the matter.”

Actually I thought that 95% was the figure that permitted the IPCC to propagate CAGW, which I understand is a hypothesis espoused by Mr. Mosher.

http://www.bbc.co.uk/news/science-environment-24292615

Solomon Green:

In your post at February 8, 2014 at 6:05 am you say

Yes! That was the point of my debate with Nick Stokes in this thread.

As I said to Eric Worrall at February 8, 2014 at 3:35 am

Richard

I like this article and would like to make a small point.

Researchers HAVE TO make progress in order to be successful. No paper is done in isolation. One paper builds upon another and you have to hurry up. There’s no time to take so much data as to be metaphysically sure that the relation is adequately described, i.e. p<0.00…01. There’s no real need, because the next bit of progress is going to be built upon the last bit, and if the last bit is wrong then the next bit won’t pan out. You’ll quickly discover that a mistake was made somewhere along the way. You won’t know where, but with a good understanding of the subject you can make some good educated guesses and efficiently reexamine past conclusions.

This is in contrast to Climate Research. New data doesn’t invalidate prior conclusions very well, because the new data isn’t generated by the thought process of the researcher. The feedback on the thought process is much thinner.

Greg Goodman

A small factoid: I am an atheist of 15 years’ standing.

What I am pointing out is that fallacious reasoning is not limited to churches, and the same fallacies wheeled out regularly to religious believers to keep them on the straight and narrow path also appear in scientific journals.

Sorry John, it’s often hard to detect satire in blog posts.

The parallels between religion and science are many.

In particular, those who are part of the flock of the church of AGW believe in “the science” like Christians believe in “the word”.

Bald headed monks like brother Michael are revered as wise men.

Even outside the mess of climatology, science has become very much like the Church it has replaced.

Catweazel

You missed the point entirely. I’m basically with Briggs on this matter.

Example. Suppose my null is that buttered toast will fall butter-side down half of the time.

Am I going to require high statistical certainty on such a matter? Nope.

With regard to the IPCC: it takes zero stats to understand that CO2 is a problem.

Say they test 1,000 hypotheses, all of which are false but for which a p value can be derived. Using a 95% confidence level, about 50 of the hypothesis tests will fall in the tails of the distribution, and thus 50 “discoveries” will have been made.

The problem is that it is those 50 results that will be written up and submitted to journals for publication. The editors of the journal will be, by definition, looking at 50 false conclusions, from which they will choose to publish the ones that…well, it really doesn’t matter which ones they publish, does it? They’re all wrong in this case, by definition.

Perhaps this is one reason so few published results survive the test of time?
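The 1,000-false-hypotheses arithmetic is easy to simulate: under a true null, a well-calibrated test’s p-value is uniformly distributed on (0, 1), so roughly 5 percent clear the 0.05 bar by chance. A sketch, using that uniformity property directly:

```python
import random

random.seed(1)
N_TESTS = 1000
ALPHA = 0.05

# Under the null hypothesis, the p-value of a well-calibrated test is
# uniform on (0, 1), so we can draw the p-values directly.
false_discoveries = sum(random.random() < ALPHA for _ in range(N_TESTS))

print(f"{false_discoveries} of {N_TESTS} true-null tests came out 'significant'")
# Expect about 50. Every one of them is a false discovery, and those are
# exactly the results that get written up and submitted.
```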

“For one thing, a 5 percent significance level isn’t a very stringent test. Using that rate you could imagine getting one wrong result for every 20 studies”

=====

It can be even worse than that. Sometimes there is no expected result, or there are multiple possibilities. For example, you might be curious whether blood donation raises or lowers the donor’s blood pressure. One might expect it to lower blood pressure by decreasing the volume of blood to pump. Or one might expect it to increase blood pressure, because many folks find sticking a needle in their arm to be stressful. If one starts with no expectation, there are two p=.05 tails (increase or decrease), not one, and one chance in 10 rather than one in 20 of achieving a “significant” result.
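The doubling described above is just the two tails of the test statistic’s distribution. A sketch for a normal test statistic, keeping the one-sided 5% cutoff but accepting a result in either direction:

```python
from statistics import NormalDist

z = NormalDist()   # standard normal distribution of the test statistic
crit = 1.645       # one-sided 5% critical value

one_sided = 1 - z.cdf(crit)        # chance of a 'significant' increase only
two_sided = 2 * (1 - z.cdf(crit))  # an increase OR a decrease both count

print(f"one-sided: {one_sided:.3f}, two-sided: {two_sided:.3f}")
# Accepting either direction at the same cutoff doubles the
# false-positive rate from ~0.05 to ~0.10: one chance in 10.
```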

Looks like it’s time for a reminder:

Over on the right side nav bar is a link to Ric Werme’s guide to WUWT. Among other goodies is a good list of HTML notes for getting characters like ‘<‘ to display.

Also, if you want to try out <pre> and what not, please do it at the “Test” page – see the link at the top nav bar. That has most of my HTML notes too.

Sorry about breaking the font size and indentation there last year. I forget how I did it.

Greg said:

“Sorry John, it’s often hard to detect satire in blog posts.”

It wasn’t actually satire. John A was merely pointing out the position from the church view point. Indeed, atheism is probably worse than any sin in many, if not most, church eyes.

Ah, Mosher, ever the myopic prognosticator w.r.t. climate and that evil demon CO2. Indeed, it is obvious that CO2 is a problem, if that is what you started out believing. True scientists (read: few associated with climatology) don’t actually start out with a belief: they see something interesting then devise tests they can conduct (including analyses of existing data) to attempt a better understanding of the phenomenon. People like you, however, are all about belief. Maybe that really is what separates engineers from the rest of the scientific community (as noted above): we are specifically trained to root out meaningless (spurious) correlations, and as such, always look for alternative answers that better describe the problem at hand.

Mark

Greg says:

February 8, 2014 at 8:04 am

“The parallels between religion and science are many. In particular, those who are part of the flock of the church of AGW believe in “the science” like christians believe in “the word”. Bald headed monks like brother Michael are revered as wise men.

Even outside the mess of climatology, science has become very much like the Church it has replaced.”

The parallel between science and religion is this:

data is to science as text is to belief.

In the case of Christianity, all belief is to be tested with the text. A priest class, or expert class, which interprets the text for everyone else has always historically led to distortions and abuses, with the eventual outcome that traditions developed by the priest class teach the very opposite of the text. I believe this pattern exists in Hinduism and Buddhism, as well as in Christianity. Literacy, and translations into spoken languages has allowed believers to read the text themselves and judge the claims of the priest class. The individual then goes to the church which reflects their interpretation.

In the case of the example of medical science, the answer is similar. Each individual should have the freedom to research claims and choose a path to healing. No government board or exchange should be forcing medical decisions on doctors and patients. The treatments of medical doctors are often drastic, and have many side effects and unintended consequences, and may be worse than the disease it is meant to treat. Iatrogenic illnesses and deaths are very possibly the most under reported area of science. In the case of your physical life, or your eternal life, literacy and liberty, along with personal responsibility for outcomes, are optimal for human well being.

Steve Mosher: “It takes zero stats to understand that co2 is a problem.” Based on what? CO2 concentration in the atmosphere is 400 PPM. 400 sounds like a big, dangerous number. But 400 parts per million is .04%. According to warmistas, mankind has caused the CO2 concentration to increase from around 375 PPM to 400 PPM, with 375 PPM as the Goldilocks standard: not too hot, not too cold, but just right. 400 PPM is catastrophe. The increase in concentration is .000025 in absolute terms, and .0025% in percentage terms. Most people would look at the percentage numbers and percentage increase and draw the obvious conclusion, as Dr. Richard Lindzen has, that CO2 is a trace gas that has increased by a trace percentage, and so what? In addition, the increase in global warming may be a good thing. More CO2 combined with warmer weather means that crop production in North and South Dakota, MN and southern Canada will increase tremendously, helping to alleviate global hunger.

Yes some oceanfront property may be lost, but at such a glacial pace that mitigation and the necessity to move inland will take place over such a long time frame that the economic cost is easily absorbed compared to the economic gains. Man lives where the climate is warm and wet, warm and dry, cold and wet, and cold and dry, and often experiences all four in one location. We are a remarkably adaptive species, and we all have feet. The bigger threat is Government wasting billions in capital on a chimera, all of which will increase the cost of energy, and for the first time in human history, we are making intentional policy choices that will lower the standard of living for future generations.

@Steven Mosher says: February 8, 2014 at 8:05 am

“With regard to the IPCC: it takes zero stats to understand that CO2 is a problem”

Mosher, that’s your problem. CO2 is the fundamental building block of all carbon-based life forms; it is not the problem!!

richardscourtney says:

February 8, 2014 at 2:33 am

The conventions adopted by climastrology may be mistaken (I think they are) but it is not “science” to choose when and when not to use conventions depending on the desired result.

=============

Changing your methods is cherry picking. If you apply 20 different methods to analyze the data, the odds are that one method will deliver a false positive at the 95% confidence level. If you then report only this one method, you are committing scientific fraud. However, this sort of fraud is almost impossible to detect or prove.

So, when we see climate science using one method to analyze warming, and a different method to analyze the pause, this makes it likely that we are witnessing cherry picking of the methods, and what is being reported are false positives. Specifically, it is very likely the previous warming was not significant. It was an artifact of the statistical model.

To equate climate science to astrology is of course an insult to Astrology. We calculate the earth’s future tides to great precision using the techniques developed by astrology. These were the same techniques that early humans used to predict the seasons. Astrology has a bad name because these techniques have also been applied to personal horoscopes, where they have proven less successful.

Climate Science adds “Science” to its name because it isn’t science. It is only pretending to be one. None of the true sciences need add “Science” to their name. Climate Science is like the People’s Democratic Republic. The PDR adds “Democratic” to its name because it isn’t a democracy. It is only pretending to be one.

Astrology in contrast allows us to predict the specific future state of chaotic systems with a reasonably high degree of accuracy. Something that remains for all practical purposes impossible in all other branches of science.

“…Geoff Sherrington says: February 7, 2014 at 11:13 pm

Steven Mosher says: February 7, 2014 at 7:17 pm

‘just banish the nonsense of a result being statistically significant. There is no such thing. It’s a bogus tradition.’

……………………..

One gets the impression that ye olde thermometers could be read to 0.05 degrees, when 1 degree was more like it.

Or that Argo floats are accurate to 0.004 deg C as I recall. Utter BS…”

Don’t know what sort of thermometers would be considered olde, or good to one degree or perhaps 0.05 deg.

But it is not that improbable that the thermometers in the Argo floats can resolve, and are repeatable to, 0.004 deg C.

Absolute calibration accuracy is somewhat difficult, but that does not matter a jot in climate studies. When you are measuring “anomalies”, who cares what the calibration accuracy is; repeatability and resolution are all that matter; the absolute temperatures are thrown out with the bath water.

A more important question is: just WHAT temperature is the thermometer measuring? Is it measuring only its own temperature, or the temperature of something else you are more interested in?

So I don’t know that Argo floats are reading what people think, to 0.004 deg C.

As for P values and null hypotheses.

I believe that a P value is an intrinsic statistical property of some known data set, defined precisely in statistical mathematics textbooks.

I don’t believe it tells you anything about anything that is not in that data set.

And we all know Einstein’s exhortation: “No amount of experimentation can ever prove me right; a single experiment can prove me wrong.” Or words to that effect.

“It takes zero stats to understand that CO2 is a problem.”

Mosher,

This comment is completely sense-free and indicative of your mindset. All the models agree that without a positive feedback increasing atmospheric water vapor, CO2 by itself is no problem at all. Have you checked to see whether atmospheric water vapor has increased as CO2 increases? News flash: it has NOT.

Check out NASA’s NVAP-M study, which is full of bad news for you and your ilk.

Steven Mosher:

“just banish the nonsense of a result being statistically significant. There is no such thing. It’s a bogus tradition. You do your test. You report the uncertainty.”

The 5% and 1% conventions persist only because a workable improvement has never been demonstrated. Claims that this or that procedure (e.g. some variety of Bayesian inference, or false discovery rate) is an improvement fall apart when you consider that millions of statistical decisions (including medical diagnoses and decisions whether to publish papers) are made daily.

Because of random variability and the vast scale of research and applications, every procedure carries the risk of a high rate of error. And that is when the procedures (diagnostics, tests) are performed without flaw.

Of far greater concern is the suppression of vast, uncounted numbers of “negative” results.

Legions of papers and book chapters have been published on these problems.

Tom wrote: “One especially damning analysis, published in 2005, claimed to have proved that more than half of published scientific conclusions were actually false.”

In science, we never prove that a theory (or conclusion) is correct; we accumulate a body of experimental evidence that is consistent or inconsistent with the theory (or conclusion). When we are unable to reject the null hypothesis using the data with p<0.05 (or some other more appropriate level), the experimental data is usually considered to be INCONCLUSIVE, a result which generally doesn’t make a theory or conclusion FALSE. For example, experiments at the Tevatron were unable to provide conclusive evidence for the existence of the Higgs boson, but that work certainly didn’t prove that the Higgs didn’t exist. There is a big difference between a conclusion being false and the more appropriate phrase you used: “false discovery rate”. The fact that experiments that don’t produce statistically significant results frequently don’t get published certainly means that we need to interpret p values carefully.

If a large clinical trial with a drug fails to show efficacy with p<0.05, that doesn’t mean that the drug didn’t provide some benefit. That clinical trial was preceded by years of smaller trials in people and animals indicating a large [very expensive] clinical trial was warranted. Based on the earlier trials, a sponsor chooses how many patients to enroll in a large clinical trial so that they have good prospects of showing efficacy with p<0.05. Only an idiot would think that drug companies are running large clinical trials solely based on the hope of obtaining a positive result by chance in 1 out of 20. Note that the FDA usually requires TWO such clinical trials for approvals (1 out of 400). If a new drug treats a life-threatening condition for which no other treatment has been shown to be effective, they may request that a second large clinical trial be run after approval. The authorities now require that all data from all clinical trials be placed in a public repository (www.clinicaltrials.gov). Only the naive would automatically conclude that a failure to demonstrate efficacy with p<0.05 in one or more clinical trials proves that a drug would not be useful for a different patient population or at a different dose.

The propagandists who say that statistically significant warming was or was not observed over periods X or Y don’t really understand statistics. If they presented the 95% confidence intervals, they would find that the confidence intervals for many periods overlap to a significant extent! Whether a warming rate of 0 is or is not included in one of these confidence intervals isn’t particularly important; the central value and our confidence in the central value are what matter.

Real world science: two wrongs don’t make a right.

Climate science: a whole lot of wrongs modeled, averaged, filtered, tweaked and adjusted are somehow right.

“Only an idiot would think that drug companies are running large clinical trials solely based on the hope of obtaining a positive result by chance in 1 out of 20.”

Dunno. If the trial costs $1m, and the drug has a potential market able to return $100m, it makes sense from a financial point of view, if not an ethical one. Not that I think very many of them do – unconscious cognitive biases are more than sufficient for people to fool themselves without anyone playing such dangerous games.

One thing that can happen is that the company does a trawl of ten thousand different preparations looking for efficacy, with maybe a 1-in-20,000 chance for each. In vitro tests narrow it down to a sample with 1-in-1000 odds, animal tests whittle it down to 1-in-50, then a human trial gets you to 1-in-2. Considering where you started, that’s a hell of an improvement in confidence, and a pretty fair improvement in the odds, which for a life-threatening condition is not to be dismissed. But the point remains – where you end up depends on where you started. P-values only give a rough idea of the size of jump in confidence, they don’t tell you what the final confidence is. And people are assuming wrongly that 5% means a 5% probability of error.
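The screening ladder described above can be made concrete in a short Python sketch. The stage probabilities are the commenter's illustrative numbers, and the Bayes-factor framing is just one way to read them: each stage multiplies the odds by a modest factor, so where you end up depends on where you started.

```python
# The screening ladder sketched above: the commenter's illustrative
# probabilities that a candidate drug is effective at each stage.
stages = [
    ("initial trawl", 1 / 20_000),
    ("in vitro",      1 / 1_000),
    ("animal tests",  1 / 50),
    ("human trial",   1 / 2),
]

def odds(p):
    """Convert a probability into odds in favour."""
    return p / (1 - p)

prev = None
for name, p in stages:
    if prev is None:
        print(f"{name:13s} P = {p:.5f}")
    else:
        # How much this stage multiplied the odds (an implied Bayes factor).
        factor = odds(p) / odds(prev)
        print(f"{name:13s} P = {p:.5f}  (odds multiplied ~{factor:.0f}x)")
    prev = p
```

Note that the final P of 1/2 is a huge improvement over 1/20,000, yet each individual stage only shifted the odds by a factor of roughly 20 to 50 – which is the comment's point about p-values measuring the jump in confidence, not the final confidence.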

The way they actually do it is the right thing to do. The problem is that people not only misunderstand p-values, they misunderstand the purpose of the scientific journals too. The idea of the journals is not to provide a stamp of authority on reliable science – it is to report interim results for checking by one's scientific peers. The journal peer review is a purely editorial function to confirm that it is worth the journal audience's time to look at. But the purpose in publishing it is to allow other researchers to try to replicate it, extend it, debunk it, generalise it, etc. Only after it has survived this challenge can it be considered ‘accepted’. And as such, the confidence level required is not that needed for science to be ‘accepted’ (the idea that a 5% error rate would be tolerable in science is laughable!), but it only needs to be sufficient to say ‘this is worth looking at’. A p-value of 5% ought to shift your confidence (from wherever it starts) by a noticeable amount. It might say “this formerly unlikely possibility is now somewhat less unlikely”, or it might say “this former contender is now the leader”, or it might say “what was formerly only the most likely explanation is now quite strongly confirmed.”

And there’s absolutely nothing wrong with them doing that, so long as you don’t go round thinking that papers in journals are to be considered “settled science”, or even that they’re 95% sure. They’re work in progress, and we *expect* a large fraction of them to be wrong. They’re supposed to be. We’re only saying they’re worth checking out, we’re not saying they’re true.

The appropriate trade-off point depends primarily on how many potential results arise. You need to cut the number down so that there are enough to keep everyone busy, but not so many that people can’t keep up with the field. Some fields like particle physics generate a huge number of possible results, so they set stringent levels in order that people only spend time on the very best prospects. Other areas can afford to be more relaxed. It depends too on the potential benefits if it happens to be true – long shots are sometimes worthwhile.

Clearly, a 5% error rate is not sufficiently low, since science commonly chains together many results in longer arguments. A mere 14-step argument in which every step is only 95% likely to be right is more likely to be flawed than not. Even with 99% certainty per step, an argument can only be sustained for about 69 steps before it too becomes more likely flawed than not. But many scientific arguments rely on hundreds of results. If CAGW was genuinely 95% confident (and it's not), that might arguably be enough for politics (depending also on costs and benefits), but it's *far* from enough to be “settled science”.
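A minimal sketch of the chained-argument arithmetic, assuming the steps are independent:

```python
# Probability that a chain of k results is entirely correct, if each
# result is independently right with probability p.
def chain_reliability(p, k):
    return p ** k

print(round(chain_reliability(0.95, 14), 3))  # 0.488: more likely flawed than not
print(round(chain_reliability(0.99, 69), 3))  # 0.5: ~69 steps at 99% is a coin flip
```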

So are those P 0.5 surface station results we found just lucky, Anthony?

Ah gots to know!

Having read the comments on this thread I think I’m broadly in agreement with Nick Stokes. I’m not sure if this helps or makes things any clearer but it might be worth looking at an example.

The SKS trend calculator gives the 1996-2013 UAH trend as

0.120 ±0.188 °C/decade (2σ)

Using the conventional P threshold of 0.05 the trend is not significant. However, the result suggests that the probability that the trend is greater than ZERO is around 90%. In other words it is far more likely to be warming than not. Even the RSS trend since 1996 has a higher probability (~60%) that the trend is greater than ZERO.
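For what it's worth, the ~90% figure can be reproduced by treating the trend estimate as normally distributed with the quoted 2σ width. A minimal Python sketch of that (naive) calculation, which replies elsewhere in the thread argue conflates a confidence interval with a probability statement:

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

trend = 0.120       # UAH 1996-2013 trend, degrees C/decade (SkS calculator)
two_sigma = 0.188
sigma = two_sigma / 2.0
p_above_zero = normal_cdf(trend / sigma)
print(f"P(trend > 0) ~ {p_above_zero:.2f}")   # ~0.90
```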

That said, it’s unlikely that the trends are as high as those projected by the IPCC models.

p-value isn’t a probability and it certainly doesn’t show your theory has any merit.

There are many things wrong with using p-values.

For starters, I suggest reading:

http://wmbriggs.com/blog/?p=11261

http://wmbriggs.com/public/briggs.p-value.fallacies.pdf

http://wmbriggs.com/blog/?p=8295

And a zillion others just like them.

Mark T says:

February 8, 2014 at 9:10 am

Maybe that really is what separates engineers from the rest of the scientific community (as noted above): we are specifically trained to root out meaningless (spurious) correlations, and as such, always look for alternative answers that better describe the problem at hand.

Mark,

Truth!

Mac

John Finn:

At February 8, 2014 at 5:05 pm you say

Well, yes. Everybody knows that. But do you have anything to add to the thread?

Richard

It is the age-old case. Statistics should be left to statisticians. Do not expect a scientist to be a statistician. The job of the scientist is different to that of the statistician. The scientist gathers the data and interprets it the best he can in terms of proposing mechanisms and theories.

It is left to separate works by statisticians to carefully decipher the statistical significance of the result.

Finally, even if some result was statistically insignificant, that does not mean that it is incorrect and should be rejected. There IS ALWAYS the possibility that it is correct.

Just trying to clarify things for you, Richard. You appear to think statistical significance is a “knife-edge” issue but it’s a bit more blurred than that. In an earlier post Nick writes

Nick’s right. A “true” trend above 2.4 degrees per century is just as likely as a trend below ZERO – though neither is very likely.

“Using the conventional P threshold of 0.05 the trend is not significant. However, the result suggests that the probability that the trend is greater than ZERO is around 90%. In other words it is far more likely to be warming than not.”

No, that's the misunderstanding that everyone keeps making about p-values. It doesn't mean that the probability of the trend being greater than zero is 90%. It means that the probability of seeing a less extreme slope if the true trend is actually zero (and the noise fits a certain statistical model) is 90%. (The SkS calculator apparently assumes ARMA(1,1) noise.)

The two probabilities are different. Consider this silly example. Suppose I want to know what the true slope of the global mean temperature anomaly is, and I decide to estimate it using the rather strange method of throwing two dice and adding the results up. I get a 5 and a 6 making 11. If the true temperature trend is zero, what is the probability of me getting a less extreme value than this? The answer is 33/36 = 92%. Not quite ‘significant’, but not far off. But what does that tell me about the probability of the true trend being zero?

That’s an extreme example, but illustrates the point that there’s not necessarily any connection between p-values and the probabilities of the null and alternative hypotheses. The p-value is one of the most commonly misunderstood statistics.
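The dice arithmetic above is easy to check exhaustively in Python:

```python
from itertools import product

# All 36 equally likely totals from two dice; the "observation" was 11.
totals = [a + b for a, b in product(range(1, 7), repeat=2)]
less_extreme = sum(1 for t in totals if t < 11)
print(less_extreme, "of", len(totals), "=", round(less_extreme / len(totals), 3))
```

The dice carry no information about temperature at all, yet the machinery still produces a near-significant-looking 92% – which is exactly why a p-value by itself says nothing about whether the measurement method is any good.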

Is this intended for me?

First, I haven't got any theory. I'm just providing a very basic analysis of the UAH trend test statistic. Secondly, if the P-value doesn't represent a probability, what does it represent? A P-value of less than 0.05 suggests that the probability of obtaining a given test result by chance is less than 5%.

There aren’t “many things wrong with using p-values”. There are many ways that p-values can be misinterpreted. There are also times when the value of p-values can be exaggerated. For example, the claims that warming has stopped since 1997 or 1998 or whatever are stretching things a bit.

Eric Worral, thanks for the thought-link to RA Fisher. From the chronology in Wiki it looks as if his time spent thinking about Eugenics preceded his stats work. As luck would have it, the first article that I came across from the Eugenics Review Journal was by Fisher:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2986993/pdf/eugenrev00372-0017.pdf

It is salutary to consider what would be happening now if the concept of evolutionary theory had emerged only in the 1950s: the Eugenic catastrophists would have loads of colour graphics from giant computer simulations of ‘the future’ to convince the decision makers. The simulations would all be based on straightforward mathematics. [with just a few parameterisations ;-) ]

I can see quite a few of our present day scientists and politicians being dragged onto the rocks.

It's kind of fun mapping the actors across from the current global warming farce; will Mosher make it through? Not with thoughts like “With regard to the ipcc. It takes zero stats to understand that co2 is a problem”.

OK, I understand the difference, but I deduced the 90% from the error bars, not the p-value.

However, it is still true to say that the likelihood that the observed UAH data is from a population where the true trend is ZERO is the same as the likelihood that it is from a population where the true trend is 0.24 degrees per decade.

The problem is that statistical tests, in three different versions, actually supply answers to three different questions:

Neyman-Pearson inference answers the question “What should I do?”

Bayesian methodology answers the question “What should I believe?”

Likelihood answers the question “How strong is this evidence?”

Any inference method based on p values is probably inappropriate for answering the sorts of questions that science and engineering generally ask. You all do know that Ph.D. scientists often have a pretty thin to nil background in probability and statistics, do you not? At the university where I currently work we have a graduate level course aimed at Ph.D. candidates and Post-Docs. A few faculty even sign up. Even so, I think this course covers mainly the classical sort of inferential statistics using p-values, confidence intervals, and rejection regions.

I read examples of this sort all the time, and it appears that no one observes that the “method” in question here has no construct validity. The dice have less (very much less) validity in measuring temperature trends, than say, using a thermometer. Yes, it is a silly example. A person can point me toward all sorts of statistical measures that demonstrate significance of many things, but without a testable theory that explains the physical mechanism involved, I’m unlikely to be convinced.

“However, it is still true to say that likelihood that the observed UAH data is from a population where the true trend is ZERO is the same as the likelihood that it is from a population where the true trend is 0.24 degrees per decade.”

Only if the data happens to be of the form of a linear trend plus ARMA(1,1) noise, which is to some degree begging the question. If you assume trend+ARMA(1,1) you get one answer; if you assume, say, no trend+ARIMA(3,1,0) as Doug Keenan did, you get a completely different answer, which actually fits the data somewhat better, but for which the trend is zero by definition.

When you get out what you put in, that’s an indication that the data doesn’t contain a definitive answer, and the answer you appear to be getting is illusory; an artefact of the method you’re using. *If* there’s a non-zero trend, this gives you an estimate of it, but it can’t tell you if there’s a trend.

For that, you need an accurate, validated physical model of the background noise statistics, which we don’t have. So all this analysis is a waste of time.

“I read examples of this sort all the time, and it appears that no one observes that the “method” in question here has no construct validity. The dice have less (very much less) validity in measuring temperature trends, than say, using a thermometer.”

Of course. That was the point. All measurement methods provide some balance of signal and noise. I picked a method for my example that shifted the balance all the way to the end – so that it was *no* signal and *all* noise. And yet, you can still get a significant p-value from it.

The point is that a p-value doesn't tell you if your measurement method is any good – it assumes it. So even a totally rubbish method can still seem to work, and people who only look at p-values mechanically, without understanding, can be easily misled.

John Finn February 9, 2014 at 3:08 am:

“Is this intended for me?”

Not particularly, but I see some reading might be in order.

“if the P-value doesn’t represent a probability what does it represent? A P-value of less than 0.05 suggests that the probability of obtaining a given test result by chance is less than 5%.”

In frequentist terms you aren't permitted to call it a probability, if only because you aren't allowed to assign probabilities to unobservables. Its definition is rather wordy and not readily available online that I can see, but goes something like: the probability of seeing the statistic (from which the p-value is derived) equal to or larger than the one found, *given* the parameters in question (means or slopes in a linear regression) are equal, *if* the experiment (whatever) was repeated an *infinite* number of times.

Even so, why on Earth would you want a regression if the purpose is not prediction? Can't you just look at the data with your eyes? Are temps higher on the left or right? Are they equal? Just look!

“There aren’t “many things wrong with using p-values”. There are many ways that p-values can be misinterpreted.”

Their interpretation is pointless. The p-value is answering a question regarding the quality of the model parameters and not at all answering the question of whether the model is useful or valid. The latter answer is what people generally want to know. The former: who cares?

DAV:

“Their interpretation is pointless. The p-value is answering a question regarding the quality of the model parameters and not at all answering the question of whether the model is useful or valid. The latter answer is what people generally want to know. The former: who cares?”

This happens all the time in science, not just in statistics: the question that you can answer is different from the question that you want to answer.

Matthew R Marler,

“This happens all the time in science, not just in statistics: the question that you can answer is different from the question that you want to answer.”

Sad, isn't it?

Hypothesis testing is often mis-used. It is great for initial studies that are looking to see if further study is warranted. Of course, you have to also consider the power of the test, which is the likelihood of correctly detecting an effect of a certain size. But, design the initial study correctly, with an appropriate power, using a well-developed and well-understood methodology, and you can get a good idea of whether you’ve been suckered by randomness or not.
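The power point above can be illustrated with a standard textbook approximation for a two-sided one-sample z-test (the 0.3σ effect size and sample sizes below are invented for illustration, not taken from any study discussed here):

```python
import math

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def approx_power(effect_size, n, z_crit=1.96):
    """Approximate power of a two-sided one-sample z-test for a true
    effect of `effect_size` (in sigma units) with n observations.
    Ignores the negligible far-tail term."""
    return normal_cdf(effect_size * math.sqrt(n) - z_crit)

# Power grows with sample size for a fixed, modest effect of 0.3 sigma.
for n in (25, 50, 90, 150):
    print(n, round(approx_power(0.3, n), 2))
```

Running this shows that for a 0.3σ effect you need on the order of 90 observations before power reaches the conventional 80% target, which is the sense in which an under-powered initial study can easily be "suckered by randomness."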

There is something not mentioned in the article that goes beyond the issue of statistical false positives and false negatives. I think scientific studies, tests, evaluations, and so forth, are often beset by methodology issues, and so the statistical analysis ends up being garbage-in garbage-out. A statistical test/technique is only as good as its assumptions, and one assumption is always that the researchers/engineers know their test apparatus inside and out, and that they thoroughly understand how it interacts with the various controlled and uncontrolled factors.

In reply to Nick Stoke’s 5% comment.

http://motls.blogspot.com/2010/03/defending-statistical-methods.html

Lubos Motl states physicists routinely look for 5 sigma significance, else the results aren’t worth publishing.
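For reference, the sigma-to-p conversion is a one-liner (two-sided tail probability of a standard normal):

```python
import math

def two_sided_p(sigma_level):
    """Two-sided normal tail probability for a given number of sigma."""
    return math.erfc(sigma_level / math.sqrt(2.0))

print(f"2 sigma: p ~ {two_sided_p(2):.2g}")   # ~0.046
print(f"5 sigma: p ~ {two_sided_p(5):.1g}")   # ~6e-07
```

So the physicists' 5-sigma convention corresponds to a p-value around 6 in 10 million, roughly eighty thousand times stricter than p < 0.05.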

Alan McIntire,

You might also add that Lubos is talking about discrepancies between observation (O) and prediction (P), and not about the niceness of the parameters used to create P.

Nullius in Verba wrote: “Dunno. If the trial costs $1m, and the drug has a potential market able to return $100m, it makes sense from a financial point of view, if not an ethical one. Not that I think very many of them do – unconscious cognitive biases are more than sufficient for people to fool themselves without anyone playing such dangerous games.”

You are forgetting that the FDA usually requires two clinical trials at 20 to 1 odds. It also helps to recognize that there are at least two layers of doctors between a drug company and the patients needed for a clinical trial: the research doctors at the hospital agreeing to host part of a clinical trial, and the doctors referring patients to those clinical trials. Hospitals with an excellent reputation can often choose which promising drug they are willing to help develop. It is hard to rapidly accrue patients when there isn't much excitement about a new candidate, and slow trials eat up patent life and potential future profits. However, I suspect that one can now buy a clinical trial somewhere in Asia for almost any drug these days, given enough money.

Nullius also wrote: “unconscious cognitive biases are more than sufficient for people to fool themselves”. Exactly. And if several careers and the company’s stock price depend on the success of a shaky clinical candidate, the biases are not unconscious.

Doesn’t an ARIMA(3,1,0) model imply non-stationarity? Non-stationarity appears to me to be physically unreasonable for our planet, given the negative Planck feedback and the fact that we haven’t experienced a runaway greenhouse like Venus. We’ve got a 4 billion year history (about 0.5 billion of that with less adaptable large multicellular plants and animals living on land) for temperatures to randomly drift further and further from the initial conditions and Keenan wants us to worry about non-stationarity in the past century?


Part 1

The AGW hypothesis is a blatant disregard of statistics. It is the most glaring example of a false positive. NOAA global temperature data from 1880-2013 show a two-sigma deviation (P < 0.05) did not occur until 1998. All previous warming (positive anomalies) is trivial and statistically insignificant. However, the margin of error in the data is +/- 0.09 C. If we take this into consideration, NONE of the data is statistically significant.

Part 2

The largest positive anomaly on record is 0.66 C in 2010. The threshold value to be statistically significant is > 0.69 C including margin of error. Regression analysis is misleading because it is essentially curve fitting a trend line in a scatter diagram. It is well known that a random walk function can create trend lines. Even technical analysts can be fooled by a random walk function.

In physical, biomedical and social sciences (meaning all of sciences) the data will be dismissed as trivial and the hypothesis (AGW) as inconclusive. But this is climate science with its own rules. The debate whether global warming is caused by man or nature is moot and academic. It’s like debating if your poker winnings are due to superior strategy or to cheating when in fact your winnings are not different from what you would expect from chance alone.
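The random-walk point can be illustrated with a purely synthetic Python sketch (the step size, length, and seed are arbitrary illustrative choices): a driftless walk, fitted by ordinary least squares, will usually produce a nonzero "trend".

```python
import random

random.seed(42)

# A driftless random walk: the textbook generator of illusory "trends".
n = 134                       # one step per "year", 1880-2013
walk = [0.0]
for _ in range(n - 1):
    walk.append(walk[-1] + random.gauss(0, 0.1))

# Ordinary least-squares slope, which implicitly assumes independent noise.
xs = list(range(n))
mx = sum(xs) / n
my = sum(walk) / n
num = sum((x - mx) * (y - my) for x, y in zip(xs, walk))
den = sum((x - mx) ** 2 for x in xs)
slope = num / den
print(f"fitted 'trend': {slope:+.4f} per step (true trend: 0)")
```

Rerunning with different seeds produces "trends" of either sign, which is the sense in which curve-fitting a trend line to a scatter diagram can be fooled by a random walk.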

Steven Mosher says:

February 7, 2014 at 7:17 pm

just banish the nonsense of a result being statistically significant. There is no such thing. It’s a bogus tradition.

You do your test. You report the uncertainty.

<<<<<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Amen to that. Statistics have become a means of propping up weak science.

Correction:

You do your test; you do your statistics; you have your data; there IS no uncertainty.

Unless you discard your test data.

Frank,

“You are forgetting that the FDA usually requires two clinical trials at 20 to 1 odds.”

The claim I was responding to was “Only an idiot would think that drug companies are running large clinical trials solely based on the hope of obtaining a positive result by chance in 1 out of 20.” My point being that 1-in-20 can work if the potential profits are more than 20 times the cost of the trial. The principle still stands with two trials: 1-in-400 works if the profits are more than 400 times the cost of the trial. So it's not a silly question, although one would hope that medical researchers are doing it for more than the money.

“Hospitals with an excellent reputation can often choose which promising drug they are willing to help develop.”

So the trial subjects and treatment regime for selected drugs are atypical? That sounds… concerning. Not necessarily a problem, but if drug companies push the most potentially profitable candidates on the best hospitals, you could get selection effects. But that's a different issue.

“Doesn’t an ARIMA(3,1,0) model imply non-stationarity? Non-stationarity appears to me to be physically unreasonable for our planet.”

Yes, and yes it is, but in time-series analysis it is common to use a non-stationary model for data that has to be stationary for physical reasons, because what you're really doing is approximating the behaviour of time segments too short to resolve their characteristic roots, and doing so avoids all the mathematical difficulties associated with such situations.

The situation arises when one of the roots of a stationary process is very close to, but not on, the unit circle. If you collect enough data, you can locate it precisely enough to tell that it's actually inside the circle and the series is stationary. But if you only have a short segment of data, too short to fully explore its range of behaviour, the result appears indistinguishable from a non-stationary process. And all the same problems that mess up estimation for non-stationary processes also mess up these only-approximately-non-stationary ones.
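A small simulation illustrates the near-unit-root point (the coefficient 0.99, series lengths, and seed are illustrative only): over a short segment the stationary process and the random walk look alike, and only a long record reveals that one of them mean-reverts.

```python
import random
import statistics

random.seed(1)

def simulate(phi, n):
    """AR(1): x[t] = phi * x[t-1] + unit white noise."""
    x = [0.0]
    for _ in range(n - 1):
        x.append(phi * x[-1] + random.gauss(0, 1))
    return x

# A root just inside the unit circle (stationary) vs one on it (random walk).
near_unit = simulate(0.99, 50_000)
walk = simulate(1.00, 50_000)

# The stationary series settles to a finite spread (theoretically
# 1/sqrt(1 - 0.99**2), about 7.1); the walk's spread keeps growing.
print("phi=0.99 sample std:", round(statistics.stdev(near_unit), 1))
print("phi=1.00 sample std:", round(statistics.stdev(walk), 1))
```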

It’s kind of like the way some people fit a linear trend line to a temperature graph. If you extend the graph back far enough, you wind up with temperatures below absolute zero, which make no physical sense. Extend the linear trend far enough forwards and you get temperatures of thousands of degrees, which is not very realistic either. So for exactly the same sort of physical reasons, a linear trend is physically impossible too. And yet nobody objects when people propose linear trend+ARMA(1,1), while they always raise the objection with ARIMA(3,1,0). I suspect it’s just a question of familiarity, but it’s still darn inconsistent. ;-)

ARIMA(3,1,0) is an approximation for a short stretch of data in the same way that “linear trend + whatever” is. Everyone agrees it's not physically plausible as a general model – but until we have a validated physics-based model of the noise, it's about as good as we're going to get. It's what applying the standard textbook methods of time series analysis will give you, anyway.

I should perhaps add that Koutsoyiannis has some interesting ideas on long term persistence models as even better fits than simple ARIMA, that are worth knowing about. But it’s still a form of curve-fitting, rather than physics.

Nullius in Verba says: February 10, 2014 at 11:58 am

“So for exactly the same sort of physical reasons, a linear trend is physically impossible too. And yet nobody objects when people propose linear trend+ARMA(1,1), while they always raise the objection with ARIMA(3,1,0).”

There is a physical theory as to why there is a trend that was not there before. It's the theory we're trying to test.

ARIMA(3,1,0) builds the unphysicality into the noise model. There is no reason to expect that the noise has changed in nature in the last 100 years. That’s a null hypothesis which itself needs explaining.

Nick Stokes has done a good job of illustrating how ‘scientists’ get themselves muddled up with statistics; they can calculate a P-value well enough, but that doesn't mean they have got the null hypothesis right, nor that they understand what their statistical calculations mean. We won't get into the misapplication of statistical methodology: linear trends for climate? Give me a break…

Mosher is simply illustrating his lack of critical thinking. “CO2 is a problem”: for whom? How and why? Does it matter?

Fairly typical of Mosher though: cryptic (or even throwaway) lines that don't stand up to deeper scrutiny.

As for the subject at hand, I think it's well understood by anyone with proper statistical training that the scientific literature is littered with poor and misapplied statistics. However, merely using math seems to give people greater confidence that something must be correct. We all need a reminder that math is simply another language, and just because something is consistent or works mathematically doesn't mean that there is a physical translation or truth behind said mathematics. Statistics can be included in this broader statement.

“There is a physical theory as to why there is a trend that was not there before. It's the theory we're trying to test.”

There's a physical theory for why the temperature was below absolute zero a few thousand years ago?! I don't think so.

“ARIMA(3,1,0) builds the unphysicality into the noise model.”

No more so than a linear trend.

The classic non-stationary process is the random walk, proposed to explain Brownian motion. And yet it is obviously the case that a pollen grain or a molecule of air cannot wander infinitely far. Why do you suppose it is that Einstein “built the unphysicality into the noise model”? Obviously, because it is a good approximation.

We make such unphysical approximations all the time – infinite perfectly flat planes, straight lines, perfect spheres, frictionless surfaces, rigid bodies, elastic collisions, flat spacetime, infinite crystal lattices, instantaneously propagating gravity, point particles, point velocities, monochromatic plane waves, etc. etc. It’s unphysical only because it’s an approximation. This is just another one.

It doesn’t matter that it’s *technically* unphysical, because it’s a close approximation of something that *is* physical (but mathematically messy).

“There is no reason to expect that the noise has changed in nature in the last 100 years.”

There's no reason to expect it hasn't, either.

Consider a process x(t) = k(t-1)*x(t-1) + random(t-1), where x(t) is the data at time t and k(t) is a function that varies slowly over time, hovering around and usually just below 1. Sometimes it will exceed 1 and the process wanders away for a bit, but after a while k will drop down again and x will return back towards the origin.

The AR(1) process is just the above when k(t) has a constant value, but what physical reason do you have to assume it’s constant? What physical reason do we have to suppose the coefficients in the SkS ARMA(1,1) are constant? Don’t they depend on physical parameters, which can change?
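A sketch of the hypothetical process described above (every parameter value here is invented purely for illustration; the slow sinusoidal wander is just one convenient way to make k(t) hover around, and occasionally exceed, 1):

```python
import math
import random

random.seed(7)

# x(t) = k(t-1) * x(t-1) + noise, with the coefficient k drifting slowly
# around, and usually just below, 1.
n = 2_000
x = [0.0]
k = []
for t in range(n - 1):
    kt = 0.97 + 0.04 * math.sin(2 * math.pi * t / 500)  # slow wander of k
    k.append(kt)
    x.append(kt * x[-1] + random.gauss(0, 1))

# While k(t) > 1 the process wanders away locally; once k(t) drops back
# below 1 it mean-reverts. No single constant AR(1) coefficient fits.
print("k ranges from", round(min(k), 2), "to", round(max(k), 2))
```

Fitting a constant-coefficient AR(1) or ARMA(1,1) to such a record answers a question the data never posed, which is the comment's point: the constancy of the noise-model coefficients is itself a physical assumption.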