Statistical flaws in science: p-values and false positives

To make science better, watch out for statistical flaws

by Tom Siegfried

First of two parts

As Winston Churchill once said about democracy, it’s the worst form of government, except for all the others. Science is like that. As commonly practiced today, science is a terrible way of gathering knowledge about nature, especially in messy realms like medicine. But it would be very unwise to vote science out of office, because all the other methods are so much worse.

Still, science has room for improvement, as its many critics are constantly pointing out. Some of those critics are, of course, lunatics who simply prefer not to believe solid scientific evidence if they dislike its implications. But many critics of science have the goal of making the scientific enterprise better, stronger and more reliable. They are justified in pointing out that scientific methodology — in particular, the statistical techniques for testing hypotheses — has more flaws than Facebook’s privacy policies. One especially damning analysis, published in 2005, claimed to have proved that more than half of published scientific conclusions were actually false.

A few months ago, though, some defenders of the scientific faith produced a new study claiming otherwise. Their survey of five major medical journals indicated a false discovery rate among published papers of only 14 percent. “Our analysis suggests that the medical literature remains a reliable record of scientific progress,” Leah Jager of the U.S. Naval Academy and Jeffrey Leek of Johns Hopkins University wrote in the journal Biostatistics.

Their finding is based on an examination of P values, the probability of getting a positive result if there is no real effect (an assumption called the null hypothesis). By convention, if the results you get (or more extreme results) would occur less than 5 percent of the time by chance (P value less than .05), then your finding is “statistically significant.” Therefore you can reject the assumption that there was no effect, conclude you have found a true effect and get your paper published.
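
To see concretely where such a number comes from, here is a minimal sketch in Python (using numpy and scipy; the data and the assumed 0.5-standard-deviation effect are hypothetical, not from any study discussed here). It compares a made-up treatment group against a placebo group with a two-sample t-test and reports the P value that the significance convention is applied to.

    # Minimal illustration of how a P value is obtained for a two-group comparison.
    # Hypothetical data only; any real trial design would be more involved.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    placebo = rng.normal(loc=0.0, scale=1.0, size=50)   # null: no effect
    treated = rng.normal(loc=0.5, scale=1.0, size=50)   # assumed true effect of 0.5 SD

    t_stat, p_value = stats.ttest_ind(treated, placebo)
    print(f"t = {t_stat:.2f}, P = {p_value:.4f}")
    # If P < .05, the result is declared "statistically significant" by the
    # convention described above; that label means nothing more than this.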

As Jager and Leek acknowledge, though, this method has well-documented flaws. “There are serious problems with interpreting individual P values as evidence for the truth of the null hypothesis,” they wrote.

For one thing, a 5 percent significance level isn’t a very stringent test. Using that rate you could imagine getting one wrong result for every 20 studies, and with thousands of scientific studies going on, that adds up to a lot. But it’s even worse. If there actually is no real effect in most experiments, you’ll reach a wrong conclusion far more than 5 percent of the time. Suppose you test 100 drugs for a given disease, when only one actually works. Using a P value of .05, those 100 tests could give you six positive results — the one correct drug and five flukes. More than 80 percent of your supposed results would be false.
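
The arithmetic behind that example can be written out directly. A back-of-the-envelope sketch in Python (it assumes the one real drug is actually detected and ignores false negatives, which the comments below rightly note would make things worse):

    # False-discovery arithmetic for the "100 drugs, 1 real effect" example.
    alpha = 0.05           # significance threshold
    n_tested = 100
    n_real = 1             # only one drug actually works

    expected_false_pos = (n_tested - n_real) * alpha    # about 5 flukes
    expected_true_pos = n_real                           # assume the real drug is detected
    total_pos = expected_false_pos + expected_true_pos

    false_discovery_rate = expected_false_pos / total_pos
    print(f"expected positives: {total_pos:.1f}")
    print(f"share that are false: {false_discovery_rate:.0%}")   # roughly 83 percent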

read more here:  https://www.sciencenews.org/blog/context/make-science-better-watch-out-statistical-flaws


The biggest flaw in the methodology being used is that it presumes that rejection of the null hypothesis, even if correct, has any appreciable significance or importance. Use a large enough sample size and one can often detect very small (and insignificant) effects. One is almost never interested in whether the null hypothesis is actually true, but in whether the actual effect is significant, not merely whether it’s statistically significant at some p level. One should instead demonstrate that the effect is of a significant magnitude, using statistical tests to do so.
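
To put a number on that point, here is a minimal sketch in Python (numpy/scipy, entirely synthetic data): with a million observations per group, a difference of one hundredth of a standard deviation (negligible for most practical purposes) still comes out far below the .05 threshold.

    # A trivially small effect becomes "statistically significant" with a large sample.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n = 1_000_000
    control = rng.normal(loc=0.00, scale=1.0, size=n)
    treated = rng.normal(loc=0.01, scale=1.0, size=n)   # true effect: 0.01 SD

    t_stat, p_value = stats.ttest_ind(treated, control)
    effect = treated.mean() - control.mean()
    print(f"estimated effect = {effect:.4f} SD, P = {p_value:.2e}")
    # P is far below 0.05 even though the effect is practically negligible, which is
    # why the magnitude of the effect (with an interval), not the P value alone, matters.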

Nick Stokes

“There are serious problems with interpreting individual P values as evidence for the truth of the null hypothesis”
Yes, that’s elementary. P<0.05 can reject the null hypothesis. But otherwise the test fails. You can’t deduce the null hypothesis is true. You just don’t know.
That’s why all this talk of “statistically significant warming” is misconceived. You can test whether a trend is significantly different from zero, and maybe deduce something if it is. But if it isn’t, your test failed to reject. No result.
“For one thing, a 5 percent significance level isn’t a very stringent test. Using that rate you could imagine getting one wrong result for every 20 studies…”
OK, use the number you think is right. What if that means that nineteen times you discover a useful new drug, and one time you get a dud? Sounds good to me.
“Suppose you test 100 drugs for a given disease, when only one actually works. “
Sounds like you probably shouldn’t be doing that. But OK, if you know those are the odds and it’s still worthwhile, you should adjust the P-value accordingly. Which means, incidentally, that the target drug has to be very good to stand out from the noise.

markx

Suppose you test 100 drugs for a given disease, when only one actually works. Using a P value of .05, those 100 tests could give you six positive results — the one correct drug and five flukes.
It is a bit hard to say anything at all about the significance of testing without knowing the trial design, the number of replicates or the variance of the response, but note that on the above basis you must also have some chance of missing the detection of the correct drug.

Leo Morgan

The statistics above are even worse than we thought.
What is the probability of a false negative? I don’t know, but whatever it is it has a chance of occurring.
We might miss our one right answer. So the odds are more than four to one against a test correctly identifying that one true result.
Nick Stokes makes some good points above. But unless I’m misreading him, he’s assuming that we know ahead of time how many of our drugs are effective, so we can adjust the p-value of the trial. That assumption wasn’t specified in the lead article, and is relevant to a completely different statistical situation.

Steven Mosher

just banish the nonsense of a result being statistically significant. There is no such thing. It’s a bogus tradition.
You do your test. You report the uncertainty.
There will of course be cases where .05 is not good enough. Think particle physics.
There will of course be cases where one would make a decision on 75% certainty or less.
The notion that there is some special value that allows us to automatically accept or reject tests is the culprit. You report 95%. From that one cannot deduce anything. You can observe that the result might be wrong 1 in 20 times. That tells you nothing about the truth of the matter.
Of course some will accept 95% and build on the foundation. Chances are this is the right pragmatic decision. Others will exercise skepticism and perhaps win a grand prize 1 out of 20 times. But the number 95 tells you nothing about which choice to make. It doesn’t direct you or order you to accept the science and it doesn’t order you to reject it. The question is

For one thing, a 5 percent significance level isn’t a very stringent test. Using that rate you could imagine getting one wrong result for every 20 studies
This is probably OK, but it reads like the p value spans multiple studies. I think that a given p value is established only one study at a time. Plus, Mosher and Stokes are correct. You can manipulate the p value with the sample size.
Dr Briggs has lots to say on the subject.
http://wmbriggs.com/blog/?p=8295

Lance Wallace

A very common practice is to test multiple potential effects, one or more of which come in with a p-value <0.05. You can no longer necessarily say you have rejected the null hypothesis, because the multiple tests made it easier for a null result to sneak below 0.05. The Bonferroni correction, IIRC, says to divide the significance level by the number of tests to find your significance target. Testing 5 separate effects requires that you reach p<0.01 to claim significance. Bonferroni has his detractors, but the general idea is clear.
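
A minimal sketch of that adjustment in Python (the five p-values are made up for illustration):

    # Bonferroni correction: divide the significance level by the number of tests.
    alpha = 0.05
    p_values = [0.030, 0.012, 0.041, 0.004, 0.060]   # hypothetical results of 5 tests

    adjusted_alpha = alpha / len(p_values)           # 0.01 for 5 tests
    for i, p in enumerate(p_values, start=1):
        verdict = "significant" if p < adjusted_alpha else "not significant"
        print(f"test {i}: p = {p:.3f} -> {verdict} at adjusted alpha {adjusted_alpha:.3f}")
    # Only p = 0.004 survives; the nominally "significant" 0.030, 0.012 and 0.041 do not.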

Jon

To test the success rate of science you only have to get hold of some scientific magazines — New Scientist or Scientific American, say — from twenty years ago, and judge for yourself how many of the world-shattering events and discoveries announced there have actually had any impact on everyday life. Five per cent would be a generous estimate. But as others have already pointed out, that’s five per cent better than any other method.

norah4you

To be remembered (this one is from a link online; sometimes other words are used, but the link includes all there is to be remembered BEFORE one goes on analysing one’s results):
The strength of the evidence for the alternative hypothesis is often summed up in a ‘P value’ (also called the significance level) – and this is the point where the explanation has to become technical. If an outcome O is said to have a P value of 0.05, for example, this means that O falls within the 5% of possible outcomes that represent the strongest evidence in favour of the alternative hypothesis rather than the null. If O has a P value of 0.01 then it falls within the 1% of possible cases giving the strongest evidence for the alternative. So the smaller the P value the stronger the evidence.
Of course an outcome may not have a small significance level (or P value) at all. Suppose the outcome is not significant at the 5% level. This is sometimes – and quite wrongly – interpreted to mean that there is strong evidence in favour of the null hypothesis. The proper interpretation is much more cautious: we simply don’t have strong enough evidence against the null. The alternative may still be correct, but we don’t have the data to justify that conclusion.
Statistical significance, getstats.org.uk page
That said, it’s also important to analyse the question one has put forward to falsify the hypothesis in question. As Gerhard Vollmer wrote in Wissenschaftstheorie im Einsatz (Stuttgart, 1993):
Die Wichtigkeit oder Bedeutung eines Problems hängt immer auch von subjektiven, bewertenden Elementen ab.
Quick English translation: The importance or significance of a problem always also depends on subjective, evaluative elements.
In other words, one has to remember that none of us is without bias in our backpack. This means that we have to be careful not to mix black, grey and white alternatives, nor to ask dependent questions. Remember that in every analysis of a result that tries to live up to the standards of the theory of science, it is better to use Chebyshev’s inequality next in the analysis.
While all this might give you more than a hint about a certain type of observation, the ‘fact’ observed in curves that two types of observation interact significantly with each other is a totally different thing.
If A can be shown to lead to B in X number of studies, and at the same time some B lead to C, no null hypothesis whatsoever is enough to prove that A leads to C.
You had better use set theory and number theory on your two variables/curves in order to be able to draw a more than probable conclusion.

MattS

Please deposit your p-values in the appropriate receptacles in the restrooms. Thank You! 😉

HankH

There’s misunderstanding of the whole concept of statistical significance. P<0.05 is really the point at which it is considered feasible to take an interest in the effect. It’s a threshold at which to do research work. If the effect is P>0.05 we conclude there is no point in further inquiry. It is unfortunate that many see statistical significance as proof of the effect. It’s not. It merely lends evidence to the effect.
Publishing a study that claims to reject the null hypothesis at P>0.02 would be largely ignored in my field – it would attract sharp critique in peer review and probably be rejected unless the notion that there is any probability of an effect at all is of strong interest; for example, in new lines of cancer research where there may be a possibility that multiple drugs interact in a way that the sum effect is greater than its individual parts. A P value between 0.02 and 0.05 could suggest there’s a new drug to play with in further research.
Statistical significance serves another purpose. It also filters out 95% of the noise in published literature where an effect is claimed. It is a tacit agreement among researchers that there’s a gatekeeper and filter we use to avoid wasting everyone’s time talking about “my great discovery” when it’s not at all important.

HankH

Wow, using the less than sign followed later in the sentence with a greater than sign in WordPress serves to delete what was between the two symbols. My second sentence was supposed to say:
“P [less than] 0.05 is really the point at which it is considered feasible to take an interest in the effect. It’s a threshold at which to do research work. If the effect is P [greater than] 0.05 we conclude there is no point in further inquiry.

Gary Hladik

HankH, WordPress interpreted your symbols as a nonsense HTML tag, ignoring them and everything in between.

HankH

Gary, thanks. The next time I’ll know better than to talk statistics without checking the layout of my symbols. 😉

Mooloo

The study is a waste of time. It assumes that the largest problem is false positives. It isn’t.
P-values only measure statistical effects. They say nothing about the trial itself.
P-values don’t detect:
— biased experiments
— experimental errors in set-up, measurement or collation
— “correct” values that are, unfortunately, based on incorrect theoretical underpinnings (there were a host of experiments to test the aether that were sadly never going to be correct just because the result happened to pass some silly significance test)
— cherry picked or “corrected” measurements because the experimenter knew what the result should be
— maths errors in calculating the p-values
— maths errors in any other part of the experiment
— repeated runs of similar experiments until a “result” occurs (yes, the p-values should be corrected to allow for this, but there is no way the authors of our studies bothered to check if this is being done)
— incorrect conclusions from the result (that the p-value shows significance doesn’t mean that the experiment says what people think it says)
— and, not least, outright fraud.
It says something about the blindness of modern science that this stupid paper passed peer review.
Just because it thinks it refutes Ioannidis doesn’t mean that scientific papers are mostly correct — a perfect example of my point above, that just because you get a statistical result doesn’t mean that it gets interpreted correctly. They can be wrong in so many ways that they apparently didn’t even consider.

anna v

Most of you must have lived through the discovery of the Higgs (particle physics, CERN, LHC). Maybe you also saw the dithering for a year over whether it was the Higgs or not. That was because the statistical significance was lower than 5 standard deviations, assuming a normal distribution (a chance of error of roughly one in two million). The five percent quoted above corresponds to 2 standard deviations. In particle physics the road to hell is paved with lower-sigma “resonances/particles.”
One cannot expect a study to be done on a million people; that happens, as with thalidomide, when the medicine is released, and then they go “oops”. But a hundred is too low a number to avoid random correlations. Of course they have their control samples and that makes a difference, as does the time taken during the observations: the statistics increase and the correlations are checked. Still, it is ironic that people’s lives are less precious than discovering the Higgs.
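
For anyone who wants the conversion between “sigmas” and P values, here is a minimal sketch in Python (scipy’s normal distribution; whether the one-sided or two-sided figure is quoted depends on the convention of the field):

    # Converting standard deviations ("sigmas") of a normal distribution into P values.
    from scipy.stats import norm

    for sigma in (2, 3, 5):
        one_sided = norm.sf(sigma)        # P(Z > sigma)
        two_sided = 2 * norm.sf(sigma)    # P(|Z| > sigma)
        print(f"{sigma} sigma: one-sided P = {one_sided:.2e}, two-sided P = {two_sided:.2e}")
    # 2 sigma corresponds roughly to the 5 percent (two-sided) threshold discussed above;
    # 5 sigma is on the order of one chance in a few million.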

HankH says:
February 7, 2014 at 7:58 pm
“Wow, using the less than sign followed later in the sentence with a greater than sign in WordPress serves to delete what was between the two symbols”
Hank: It was probably not WordPress that interpreted your arrows to be html tags. The browser is the software charged with that duty, and pretty much all browsers will ignore non-existent html tags without an error statement. So, the left arrow “<” followed later by a right arrow “>” with something in between is interpreted as a tag. I don’t know of any way around it. It seems there should be an escape character that could be used.

Phil

The problem is even worse. You choose a confidence level and then see if your results are significant at that level. The IPCC in AR4 used a confidence level of 90%. That means that you have a 10% chance of a false positive. However, if you have two papers/conclusions that use a confidence level of 90% and the results of one depend on the other, then the chance of a false positive becomes 1-(0.9 x 0.9) or 19%. If you have three levels, then your chance of a false positive using a confidence level of 90% becomes 27%. For 4 levels, it is 33% and for five levels it is 41%. After 6 levels, your chance of a false positive becomes better than a coin toss.
There are probably at least six levels of conclusions/results in the IPCC reports (I haven’t counted them), so the IPCC reports as a whole probably have a 50% or greater chance of a false positive. If the confidence level used were 95%, then for six levels you would still have a 30% chance of a false positive. Only if you use a confidence level of 99%, would you have a less than 10% chance of a false positive for six levels (actually 7%).
However, if climate science were to use a confidence level of 99% to test for significance, I wouldn’t be surprised if most of the field’s results were deemed to be insignificant. The IPCC rolled back the significance level to 90% from 95% between the TAR and AR4, IIRC.
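
The arithmetic behind those figures is just the complement of compounding the confidence level. A sketch in plain Python (it assumes the stacked conclusions are independent, as the comment does):

    # Chance of at least one false positive when conclusions are stacked in levels,
    # each tested at the same confidence level (independence assumed).
    def false_positive_chance(confidence, levels):
        return 1 - confidence ** levels

    for levels in range(1, 8):
        p = false_positive_chance(0.90, levels)
        print(f"{levels} level(s) at 90% confidence: {p:.0%} chance of a false positive")
    # 2 levels -> 19%, 3 -> 27%, and beyond 6 levels the chance passes 50%.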

HankH

Thanks Bob! I’m learning more about HTML tags every day. Using the arrows a lot in publication I forget you can’t get away with it so much in blogs.

rogerknights

Bob says:
February 7, 2014 at 9:05 pm

HankH says:
February 7, 2014 at 7:58 pm
“Wow, using the less than sign followed later in the sentence with a greater than sign in WordPress serves to delete what was between the two symbols”

Hank: It was probably not WordPress that interpreted your arrows to be html tags. The browser is the software charged with that duty, and pretty much all browsers will ignore non-existent html tags without an error statement. So, the left arrow “<” followed later by a right arrow “>” with something in between is interpreted as a tag. I don’t know of any way around it. It seems there should be an escape character that could be used.

I believe that using the following will get you printable less-than and greater-than arrows without confusing WordPress (delete the space after the ampersand in actual use)
& lt; & gt;
Testing: < >

Leonard Lane

Depending upon the underlying probability distribution, sample size, etc., selecting a smaller p value does not give stronger results. There are Type I errors and Type II errors. The smaller p becomes, the greater the probability of a Type II error. If alpha and beta and Type I and Type II errors do not mean anything to you, then you should consult a competent statistician before you even start the experiment and get help in understanding the error properties of the statistical tests before you set the significance level (p value) and the sample size.
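
A minimal sketch of that tradeoff in Python (scipy, a one-sided z-test of a mean with an assumed true effect of 0.3 standard deviations and n = 50, purely illustrative numbers): tightening the significance level alpha cuts Type I errors but, at a fixed sample size, inflates the Type II error rate beta.

    # Type I / Type II tradeoff for a one-sided z-test of a mean.
    # Assumed true effect of 0.3 SD and n = 50; illustrative numbers only.
    from math import sqrt
    from scipy.stats import norm

    effect_sd = 0.3
    n = 50
    z_effect = effect_sd * sqrt(n)          # how far the true mean sits, in standard-error units

    for alpha in (0.05, 0.01, 0.001):
        z_crit = norm.isf(alpha)            # critical value of the test
        power = norm.sf(z_crit - z_effect)  # P(reject | the effect is real)
        beta = 1 - power                    # Type II error rate
        print(f"alpha = {alpha:<6} power = {power:.2f}  beta = {beta:.2f}")
    # Smaller alpha (fewer false positives) means larger beta (more missed real effects)
    # unless the sample size is increased.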

Mac the Knife

With apologies to The Bard:
To p(), or not to p(), that is the question—
Whether ’tis Nobler in the analytical mind to suffer
The Slings and Arrows of outrageous Probability,
Or to take Statistics against a Sea of Data,
And by opp(o)sing nullify them?

Chris4692

What Mr. Lane said @9:51.

Hoser

Testing the handy-dandy pre tag

	open IN, "< $hCVfile" or die "Cannot open information file $hCVfile\n";
	if($RS)
	{
	    while($line = <IN>)               # read the information file line by line
	    {
		chomp $line;
		@SNPdata = split("\t",$line);
		$SNP_ID = uc($SNPdata[1]);
		next if($SNP_ID !~ m/RS/);    # keep only RS-type SNP identifiers
		$lastpos = $pos;
		$pos = tell(IN);              # file position of the next record
		$key = uc($SNPdata[1]);
		$key =~ s/^(RS.{4}).*$/$1/;   # truncate the key to "RS" plus four characters
		if(length($key) > 6)
		{
		    print LOG "Warning: RS DB key $key, length greater than 6.\n";
		    next;
		}
		next if(exists $hCV{$key});
		$hCV{$key} = $lastpos;        # index: key -> start position of its record
	    }
	    close IN;
	}
Mindert Eiting

In performing the most simple statistical test, you always test at least two assumptions: that the null hypothesis is true and that the sample is taken randomly. A violation of the latter assumption, usually taken for granted, may explain many false positives. Another problem with the Fisher type of test is that the null hypothesis concerns a point value, which is almost always a priori false. In a sample sufficiently large, the a priori false null hypothesis will be rejected at any significance level you like.

Hoser

Well so much for that. The <pre> tag failed to preserve the input file handle <IN> in the 4th line above. Testing outside WP, the same result is obtained in IE, FF, and GC. PRE won’t save anything looking like a tag even across multiple lines.

Geoff Sherrington

Steven Mosher says: February 7, 2014 at 7:17 pm
“just banish the nonsense of a result being statistically significant. There is no such thing. It’s a bogus tradition.”
Sort of agree. Wish to stress that this type of uncertainty expression is about the spread of values about a mean of some sort, precision if you like.
More concerned about whether the mean is in the right place, than with the scatter about it being 95% enclosed by a certain curve. Bias.
I’m thinking that when you compare a number of temperature data sets with adjustments and there is an envelope around the various adjustments of say +/- 1 deg C, it is rather immaterial to concentrate on precision because it just adds on top of the bias which can often be the larger.
Like this graph from Alice Springs in the centre of Australia – I guess I should update it now we have BEST and CRUTEM4 and even Google.
http://www.geoffstuff.com/Spaghetti_Alice_JPG.jpg
Of course, these concepts are as old as time, but it’s remarkable how, in climate work, the bias aspect is so seldom considered properly, if at all. One gets the impression that ye olde thermometers could be read to 0.05 degrees, when 1 degree was more like it.
Or that Argo floats are accurate to 0.004 deg C as I recall. Utter BS.
But then, you’d broadly agree with me, I suspect.

richardscourtney

Nick Stokes:
At February 7, 2014 at 6:30 pm

“There are serious problems with interpreting individual P values as evidence for the truth of the null hypothesis”

Yes, that’s elementary. P<0.05 can reject the null hypothesis. But otherwise the test fails. You can’t deduce the null hypothesis is true. You just don’t know.
That’s why all this talk of “statistically significant warming” is misconceived. You can test whether a trend is significantly different from zero, and maybe deduce something if it is. But if it isn’t, your test failed to reject. No result.

NO!
That is warmist sophistry which pretends the ‘pause’ is not happening.

A linear ‘trend’ can be computed from any data time series. At issue here is whether the trend in global atmospheric temperature anomaly (GASTA) differs from zero (i.e. no discernible global warming or cooling) and – if so – for how long before the present.
Climastrology uses linear trends and 95% confidence. There are good reasons to dispute each of these conventions, but they are the conventions used by climastrology so they are the appropriate conventions in this case.
So, in this case the null hypothesis is that a linear trend in GASTA does not differ from zero at 95% confidence and, therefore, there is no discernible warming. And the period to be determined of no discernible global warming or cooling is up to the present. Therefore, the end point is now and the data is assessed back in time until a linear trend over the period differs from zero at 95% confidence.
Each of the several time series of GASTA indicates no trend which differs from zero (i.e. no global warming or cooling) for at least 17 years until now; RSS indicates 24.5 years.
And it is not reasonable to remove data from the data set(s). 1998 had a high value and there is no possibility of justifying its removal from the data set whatever the cause of it being a high value. This is because the assessment is of how long there has been no discernible warming or cooling, and any distortion of the analysed data provides a distortion of the result of the analysis.
Importantly, 17 years takes us back to 1997 and there was statistically significant warming over the previous 17 years. Therefore, discernible global warming stopped at least 17 years ago.
Richard

Toto

Nobody has said the magic word (‘model’). You know that p-value thing? You know how it is calculated? Using a model, which in some cases is only an assumption.

Eric Worrall

The father of modern statistics was an ardent Eugenics catastrophist – he developed the field of statistics to find mathematical support for his passion.
http://en.wikipedia.org/wiki/Ronald_Fisher
I am not suggesting that statistics is useless because its origins are tainted; what I am suggesting is that if someone with the genius to invent an entire mathematical discipline can be fooled by his own invention, then anyone can get it wrong.

craig

Hence, when I’m detailing doctors, one of the principal points is to make the doctor aware of the statistical or non-statistical significance of a value, and then the more important and most relevant part of the discussion: is the value change from the placebo or active-ingredient arm CLINICALLY MEANINGFUL? Clinical meaningfulness of a number is a more practical way to understand a drug’s effect on a subject.

Mindert Eiting

Eric Worrall: ‘The father of modern statistics was an ardent Eugenics catastrophist’.
Consider the historical context. The inconvenient truth is that many of his contemporaries were eugenicists. As embarrassing is the fact that many people in those days were anti-Semites. Perhaps the most difficult thing, even for geniuses, is to check assumptions and to think about consequences.

JDN

Some points I’ve been thinking about:
1) The people rubbishing medical research don’t release their data. There’s no way to confirm what you’ve been hearing. It’s basically hearsay.
2) Medical journals are the worst for publishing & executing methods. The statistical tests are the least of the problems. Why does this keep coming up? We learn to dismiss articles based on faulty methods or experimental construction, no matter what the statistical significance.
3) People keep trying to examine the outcome of medical research based on whether drugs that work in the lab work in the clinic. This doesn’t measure the search for knowledge, it measures the search for financial success. There’s something to be said for knowledge so reliable you can take it to the bank. However, clinical trials can fail for reasons that have nothing to do with the reliability of scientific knowledge. These exercises looking at the monetization of science are a waste of time. Everything is worthless until it’s not. If you perform an evaluation of the evaluators as in Science mag this week (paywalled unfortunately, http://www.sciencemag.org/content/343/6171/596), you’ll find out that these evaluations are not worth much.

Greg Goodman

” Therefore you can reject the assumption that there was no effect, conclude you have found a true effect and get your paper published.”
The p-value is not what decides whether a paper gets published. A negative result is still a valid scientific result. But this raises the other problem of the ‘literature’: publication bias. Only publishing positive results falsifies the record as well.
A recent case was Tamiflu, a supposed protection against certain strains of influenza; Tony Blair, then prime minister, invested something like 4 billion pounds sterling in a stock of treatments in case an epidemic of bird flu struck the UK.
It has recently been found that about half the studies on the drug were negative but they did not get published.
Good article BTW, thanks.

Carlo Napolitano

Well, I think it is not correct to limit science methodology to statistics. As others have pointed out in this discussion, statistics is useful to cut off the noise. However, no statistics can replace a mechanistic explanation. To take an example familiar from my field of work: genetic association studies (those exploring the association of common genetic variants in the population with some clinical phenotype) often get p values below 10^-5 or even 10^-8 (sorry, I am not familiar with tags). However, only when the finding is biologically explained (with functional experiments) can one claim that the discovery is a scientific advancement. Otherwise you can claim the association, but not that the association has any biological meaning.
I think this should also happen in climate science. Perhaps funding agencies should invest their money better, sponsoring studies aimed at understanding the physics underlying the observed phenomena rather than thousands of useless studies finding statistical associations and then building theories on what is found to be statistically significant. Science based on statistics actually reverses what science methodology should be, and it is really the prototype of a fishing expedition (you get something but you don’t know why).

richardscourtney says: February 7, 2014 at 11:15 pm
“So, in this case the null hypothesis is that a linear trend in GASTA does not differ from zero at 95% confidence and, therefore, there is no discernible warming.”
No, that’s nonsense, and putting it in bold doesn’t improve it. Where I am, it was 41°C today. Was that statistically significant? Well, maybe not; it’s late summer. But it was still discernibly warm.
You have tested whether the observed trend could have happened with an underlying zero trend and natural variation. And the answer is that that can’t be rejected. But it is not the only possible explanation.
The UAH index shows a trend of 1.212°C/century since Jan 1996. That’s not quite significant re 0, so we can’t rule out an underlying zero trend. But we also can’t rule out the upper limit of 2.44°C/century (or anything in between). In fact 2.44 is as likely as 0. Now that would be highly discernible warming. In fact, the observed 1.212°C/cen is considerable.
What we know is that the measured trend was 1.212°C/cen. That’s what actually happened, and is discernible. The rest is theorising about what might have happened if we could run it again.
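
For readers who want to see what such a trend-versus-zero test looks like mechanically, here is a minimal sketch in Python (numpy/scipy, on a synthetic monthly anomaly series, not the UAH data, and ignoring the autocorrelation that widens the interval for real temperature series):

    # Fit a linear trend to a synthetic monthly anomaly series and ask whether
    # the trend differs from zero at roughly 95% confidence.
    # Synthetic data only; real series are autocorrelated, which widens the interval.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n_months = 18 * 12
    years = np.arange(n_months) / 12.0
    true_trend = 0.012                      # assumed 1.2 C/century, for illustration only
    anomalies = true_trend * years + rng.normal(0.0, 0.25, size=n_months)

    fit = stats.linregress(years, anomalies)
    half_width = 1.96 * fit.stderr          # normal approximation to the 95% interval
    low, high = fit.slope - half_width, fit.slope + half_width
    print(f"trend = {fit.slope*100:.2f} C/century, 95% CI = [{low*100:.2f}, {high*100:.2f}]")
    # If the interval includes zero the test "fails to reject" a zero trend, but it
    # equally fails to reject the upper end of the interval (the point made above).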

Greg Goodman

Carlo Napolitano, I agree. Statistical testing is a good safeguard but it is not the be-all and end-all of scientific analysis. Too much of what is used in climate science has been inherited from econometrics rather than from the physical sciences. And often rather poorly understood at that.

Ed Zuiderwijk

There is another way of judging the veracity of published results: look after, say, ten years and see how many times a result is referenced. If it isn’t, either the result wasn’t important, or nobody believes it anymore, or it has been superseded by later results.
They did such a sobering analysis some decades ago with astronomy/astrophysics papers and found that only a few percent survive the ravages of time.

John Brignell put the problems of statistical significance into layman’s language several years ago. His website, www.numberwatch.co.uk, is worth spending an hour on, and his book “Sorry Wrong Number” is excellent. One of his constant points is that in epidemiology a relative risk of at least 2 (a doubling of the effect) should be seen before the result is taken as important, due to the number of conflicting factors in any open system (doesn’t it sound like AGW?).
Here are a few relevant pages from the website:
http://www.numberwatch.co.uk/statistical_bludgeon.htm
http://www.numberwatch.co.uk/Stuff.htm
http://www.numberwatch.co.uk/an_exercise_in_critical_reading_.htm
He also has several essays on the ridiculousness of AGW and a review of Watermelons :).

John Shade

Hypothesis testing gets a hard time every now and then. By those who think the p-value is the probability that the alternative hypothesis is wrong, or who think that such testing provides proof or disproof of some kind. It does neither. It is merely a means of assessing the strength of evidence in a particular data set, considered in isolation. In general, the null hypothesis is for ‘no effect’, e.g. that the hair on the left side of your head has an equal mean diameter as that on the right. We know that is not true. Generally we know the null hypothesis is not true. We are not trying to prove or disprove it. All we are doing is going through a ritual whereby we say, if the null were true (and other conditions deemed applicable hold) what is the probability of getting some statistic as or more extreme than the one computed for this particular set of data? That’s it. A small p does not mean the null is false, a large p does not mean that it is true. The test is making a far more modest contribution than that.
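
A small simulation makes that modest role concrete (Python with numpy/scipy; purely synthetic data with the null hypothesis true by construction): when there is genuinely no effect, the p-values are spread evenly between 0 and 1, and about 5 percent of them still fall below .05.

    # When the null hypothesis is true, p-values are uniform and ~5% still fall below .05.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    n_experiments = 10_000
    p_values = []
    for _ in range(n_experiments):
        a = rng.normal(size=30)    # both groups drawn from the same distribution,
        b = rng.normal(size=30)    # so the null hypothesis is true by construction
        p_values.append(stats.ttest_ind(a, b).pvalue)

    p_values = np.array(p_values)
    print(f"fraction with p < 0.05: {(p_values < 0.05).mean():.3f}")   # close to 0.05
    print(f"mean p-value: {p_values.mean():.2f}")                      # close to 0.5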

basicstats

It’s surprising the Ioannidis paper has created such a stir, since it basically just says that a small probability of error in an individual experiment/study, when compounded over thousands of different experiments/studies, results in a much larger probability of error. Or: the probability of at least one head in 1000 spins of a coin greatly exceeds the probability of a head on one spin. Pretty obvious, although pinning down the precise probability of error/false positives over thousands of very different kinds of studies is definitely a hard problem.
The relevance to climatology lies in the proliferation of different measures global warmers are coming up with – sea levels, ice volumes, ocean heat content etc etc. Keep data mining and you will find something still going up steadily! Especially as at least some of these are probably correlated to global average temperature anomaly with a time lag. Not to forget there are half a dozen such global anomalies to begin with.

richardscourtney

Nick Stokes:
I am replying to your post at February 8, 2014 at 1:01 am which is here and is in reply to my post at February 7, 2014 at 11:15 pm which is here.
In my post I rightly said of your assertion

That’s why all this talk of “statistically significant warming” is misconceived. You can test whether a trend is significantly different from zero, and maybe deduce something if it is. But if it isn’t, your test failed to reject. No result.

NO!
That is warmist sophistry which pretends the ‘pause’ is not happening.

I explained

Climastrology uses linear trends and 95% confidence. There are good reasons to dispute each of these conventions, but they are the conventions used by climastrology so they are the appropriate conventions in this case.

Those conventions were used by climastrology to claim there was global warming. What matters is to use THOSE SAME conventions when assessing the ‘pause’. And it is sophistry to say that different conventions should be used when the result does not fit an agenda.
I stated that “There are good reasons to dispute each of these conventions” but, so what? The only pertinent fact is that those are the conventions used by climastrology. It is ‘moving the goal posts’ to now say those conventions should not be used because they are wrong.
Your reply which I am answering says

You have tested whether the observed trend could have happened with an underlying zero trend and natural variation. And the answer is that that can’t be rejected. But lt is not the only possible explanation.

That is more sophistry!
Whatever the cause of the ‘pause’, it is not pertinent to a determination of the existence of the pause.

The same conventions of climastrology used to determine that there was global warming were used to determine the start of the ‘pause’. And the conclusion of that analysis is as I said

Each of the several time series of GASTA indicates no trend which differs from zero (i.e. no global warming or cooling) for at least 17 years until now; RSS indicates 24.5 years.

and

Importantly, 17 years takes us back to 1997 and there was statistically significant warming over the previous 17 years. Therefore, discernible global warming stopped at least 17 years ago.

The conventions adopted by climastrology may be mistaken (I think they are) but it is not “science” to choose when and when not to use conventions depending on the desired result.
Richard

From the article linked in Tom Siegfried’s essay-
“Others proposed similar methods but with different interpretations for the P value. Fisher said a low P value merely means that you should reject the null hypothesis; it does not actually tell you how likely the null hypothesis is to be correct. Others interpreted the P value as the likelihood of a false positive: concluding an effect is real when it actually isn’t. ”
Seems like Tom Siegfried and many other commenters on this thread, such as Nick Stokes and Steven Mosher, have made the same misinterpretation of what Fisher’s p value actually is, just as is alluded to in that article.
Alpha values are what determine Type I errors, or false positives, per Neyman–Pearson. Fisher p values are about acceptance of the null hypothesis, not about Type I and Type II errors, as Tom Siegfried suggests.
What Leonard Lane says at February 7, 2014 at 9:51 pm is spot on, if he means consult a statistician using Bayesian methods.
I am interested in seeing if Tom Siegfried figures out what a p value actually is before he writes part 2 of his essay.

Mindert Eiting
Eric Worrall: ‘The father of modern statistics was an ardent Eugenics catastrophist’.
Consider the historical context. The inconvenient truth is that many of his contemporaries were eugenicists. As embarrassing is the fact that many people in those days were anti-Semites. Perhaps the most difficult thing, even for geniuses, is to check assumptions and to think about consequences.

A historical example of GIGO – the statistical techniques were well applied, but the data and assumptions were rubbish.
Fast forward to the present day, and the climate “geniuses” can’t even get the statistics right.

richardscourtney, everyone should try the SkS trend calculator, but instead of using the latest figures, feed in 30 year time periods before and after the 1940 – 1970 cooling.
Whatever Foster and Rahmstorf’s method is calculating, it is not a reliable guide as to whether the world is experiencing a downturn in global temperatures.

richardscourtney says: February 8, 2014 at 2:33 am
“Importantly, 17 years takes us back to 1997 and there was statistically significant warming over the previous 17 years. Therefore, discernible global warming stopped at least 17 years ago.”
Well, that makes absolutely no sense, despite the bold face. Yes, trend from 1980 to 1997 was significantly different from zero. So was the trend from Jan 1995 to Dec 2012. Does that mean discernible global warming stopped a year ago?

David L

A p-value only gives confidence in rejecting the null hypothesis; it is not proof of an effect. You can propose an alternative hypothesis and test for that as well.
In clinical studies a p-value of 0.01 is typically used but, more importantly, studies have to be properly powered beforehand, and the results have to either agree or disagree with the baseline measurements within their previously agreed-upon confidence intervals.
If AGW research followed the rules required of Pharmaceutical research, the entire dogma would have been rejected by the FDA years ago.
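
A minimal sketch of what “properly powered beforehand” means in practice, in Python (normal-approximation formula with scipy; the 0.5-standard-deviation “clinically meaningful” effect is an assumption for illustration): choose the sample size so that a meaningful effect would be detected with, say, 80 percent power.

    # Back-of-the-envelope sample size for a two-arm trial (normal approximation):
    #   n per arm = 2 * ((z_{alpha/2} + z_{beta}) / effect_size)^2
    from math import ceil
    from scipy.stats import norm

    alpha = 0.01          # two-sided significance level, as mentioned above
    power = 0.80
    effect_size = 0.5     # assumed clinically meaningful difference, in SD units

    z_alpha = norm.isf(alpha / 2)
    z_beta = norm.isf(1 - power)
    n_per_arm = ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)
    print(f"about {n_per_arm} subjects per arm for {power:.0%} power at alpha = {alpha}")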

Statistician William M. Briggs wrote:

The problem with statistics is the astonishing amount of magical thinking tolerated. A statistician—or his apprentice; this means you—waving a formula over a dataset is little different than an alchemist trying his luck with a philosopher’s stone and a pile of lead. That gold sometimes emerges says more about your efforts than it does about the mystical incantations wielded.
Statistics, which is to say probability, is supposed to be about uncertainty. You would think, then, that the goal of the procedures developed would be to quantify uncertainty to the best extent possible in the matters of interest to most people. You would be wrong. Instead, statistics answers questions nobody asked. Why? Because of mathematical slickness and convenience, mostly.
The result is a plague, an epidemic, a riot of over-certainty. This means you, too. Even if you’re using the newest of the new algorithms, even if you have “big” data, even if you call your statisticians “data scientists”, and even if you are pure of heart and really, really care.

More at link in a very good essay: http://wmbriggs.com/blog/?p=11305
By the way, Briggs has written extensively about the problem of people misusing statistics. His blog site is a treasure trove of wonderful essays on the issue.

richardscourtney

Nick Stokes:
Your post at February 8, 2014 at 3:08 am is yet more of your sophistry.
My post addressed to you at February 7, 2014 at 11:15 pm is here and explained the derivation of my statement saying

Importantly, 17 years takes us back to 1997 and there was statistically significant warming over the previous 17 years. Therefore, discernible global warming stopped at least 17 years ago.

But you ignore that and introduce a Red Herring by saying

Well, that makes absolutely no sense, despite the bold face. Yes, trend from 1980 to 1997 was significantly different from zero. So was the trend from Jan 1995 to Dec 2012. Does that mean discernible global warming stopped a year ago?

That is complete nonsense!
As I said in my post at February 8, 2014 at 2:33 am which you claim to be replying

The conventions adopted by climastrology may be mistaken (I think they are) but it is not “science” to choose when and when not to use conventions depending on the desired result.

Richard

Oops. I messed up that last. The link is indeed to the rest of that essay quoted from, but the “very good” essay I wanted to point out is the one before that, and the link is: http://wmbriggs.com/blog/?p=11261
It would be nice to be able to edit, but WordPress says that could lead to problems. They are most likely correct. 🙁