**To make science better, watch out for statistical flaws**

*First of two parts*

As Winston Churchill once said about democracy, it’s the worst form of government, except for all the others. Science is like that. As commonly practiced today, science is a terrible way for gathering knowledge about nature, especially in messy realms like medicine. But it would be very unwise to vote science out of office, because all the other methods are so much worse.

Still, science has room for improvement, as its many critics are constantly pointing out. Some of those critics are, of course, lunatics who simply prefer not to believe solid scientific evidence if they dislike its implications. But many critics of science have the goal of making the scientific enterprise better, stronger and more reliable. They are justified in pointing out that scientific methodology — in particular, statistical techniques for testing hypotheses — have more flaws than Facebook’s privacy policies. One especially damning analysis, published in 2005, claimed to have proved that more than half of published scientific conclusions were actually false.

A few months ago, though, some defenders of the scientific faith produced a new study claiming otherwise. Their survey of five major medical journals indicated a false discovery rate among published papers of only 14 percent. “Our analysis suggests that the medical literature remains a reliable record of scientific progress,” Leah Jager of the U.S. Naval Academy and Jeffrey Leek of Johns Hopkins University wrote in the journal *Biostatistics*.

Their finding is based on an examination of P values, the probability of getting a positive result if there is no real effect (an assumption called the null hypothesis). By convention, if the results you get (or more extreme results) would occur less than 5 percent of the time by chance (P value less than .05), then your finding is “statistically significant.” Therefore you can reject the assumption that there was no effect, conclude you have found a true effect and get your paper published.

As Jager and Leek acknowledge, though, this method has well-documented flaws. “There are serious problems with interpreting individual P values as evidence for the truth of the null hypothesis,” they wrote.

For one thing, a 5 percent significance level isn’t a very stringent test. Using that rate you could imagine getting one wrong result for every 20 studies, and with thousands of scientific studies going on, that adds up to a lot. But it’s even worse. If there actually is no real effect in most experiments, you’ll reach a wrong conclusion far more than 5 percent of the time. Suppose you test 100 drugs for a given disease, when only one actually works. Using a P value of .05, those 100 tests could give you six positive results — the one correct drug and five flukes. More than 80 percent of your supposed results would be false.
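The arithmetic in the 100-drug example can be checked with a few lines of Python. This is only a sketch under stated assumptions: the function name and the `power` parameter (the chance that a test detects a drug that really works) are illustrative and not from the article, which implicitly assumes the one real drug is always detected.

```python
def false_discovery_share(n_tests, n_real, alpha, power=1.0):
    """Expected share of 'significant' results that are flukes.

    power is a hypothetical parameter: the probability that a real
    effect is detected. The article's example corresponds to power=1.0.
    """
    true_hits = n_real * power                # real effects that test positive
    false_hits = (n_tests - n_real) * alpha   # null effects that test positive by chance
    return false_hits / (true_hits + false_hits)


# The article's scenario: 100 drugs, one works, significance level 0.05.
share = false_discovery_share(n_tests=100, n_real=1, alpha=0.05)
print(f"Share of positives that are false: {share:.1%}")

# If the real drug is only detected 80% of the time, the share is higher still.
print(f"With 80% power: {false_discovery_share(100, 1, 0.05, power=0.8):.1%}")
```

With perfect detection the expected counts are about five flukes to one real hit, which is where the "more than 80 percent" figure comes from.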

read more here: https://www.sciencenews.org/blog/context/make-science-better-watch-out-statistical-flaws

The biggest flaw in the methodology being used is that it presumes that rejection of the null hypothesis, even if justified, has any appreciable significance, or is important. Use a large enough sample size and one can often detect very small (and insignificant) effects. One is almost never interested in whether the null hypothesis is actually true, but in whether the actual effect is significant, not merely whether it’s statistically significant at some p level. One should instead demonstrate that the effect is of a significant magnitude, using statistical tests to do so.
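A rough illustration of the sample-size point, assuming a simple one-sample z-test with known standard deviation (the commenter names no particular test, so the test, the function name and the numbers are all assumptions made for the sketch). The same tiny effect goes from unremarkable to "highly significant" purely because the sample grows.

```python
from math import erfc, sqrt


def two_sided_p(effect, sigma, n):
    """Two-sided p-value of a z-test when the sample mean equals `effect`.

    Hypothetical setup: null mean 0, known standard deviation sigma.
    """
    z = effect * sqrt(n) / sigma
    return erfc(abs(z) / sqrt(2))  # two-sided tail area of the standard normal


# An effect of 1% of a standard deviation, at three sample sizes.
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}: p = {two_sided_p(effect=0.01, sigma=1.0, n=n):.3g}")
```

The effect size never changes here; only the p-value does, which is why a small p-value on its own says nothing about whether the effect matters.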

“There are serious problems with interpreting individual P values as evidence for the truth of the null hypothesis”

Yes, that’s elementary. P<0.05 can reject the null hypothesis. But otherwise the test fails. You can’t deduce the null hypothesis is true. You just don’t know.

That’s why all this talk of “statistically significant warming” is misconceived. You can test whether a trend is significantly different from zero, and maybe deduce something if it is. But if it isn’t, your test failed to reject. No result.

“For one thing, a 5 percent significance level isn’t a very stringent test. Using that rate you could imagine getting one wrong result for every 20 studies…”

OK, use the number you think is right. What if that means that nineteen times you discover a useful new drug, and one time you get a dud? Sounds good to me.

“Suppose you test 100 drugs for a given disease, when only one actually works.”

Sounds like you probably shouldn’t be doing that. But OK, if you know those are the odds and it’s still worthwhile, you should adjust the P-value accordingly. Which means, incidentally, that the target drug has to be very good to stand out from the noise.

“Suppose you test 100 drugs for a given disease, when only one actually works. Using a P value of .05, those 100 tests could give you six positive results: the one correct drug and five flukes.”

It is a bit hard to say anything at all about the significance of testing without knowing a trial design, number of replicates or the variance of the response, but note that on the above basis you must also have some chance of missing the detection of the correct drug.

The statistics above are even worse than we thought.

What is the probability of a false negative? I don’t know, but whatever it is it has a chance of occurring.

We might miss our one right answer. So the odds are more than four to one against a test correctly identifying that one true result.

Nick Stokes makes some good points above. But unless I’m misreading him, he’s assuming that we know ahead of time how many of our drugs are effective, so we can adjust the p-value of the trial. That assumption wasn’t specified in the lead article, and is relevant to a completely different statistical situation.

Just banish the nonsense of a result being statistically significant. There is no such thing. It’s a bogus tradition.

You do your test. You report the uncertainty.

There will of course be cases where .05 is not good enough. Think particle physics.

There will of course be cases where one would make a decision on 75% certainty or less.

The notion that there is some special value that allows us to automatically accept or reject tests is the culprit. You report 95%. From that one cannot deduce anything. You can observe that the result might be wrong 1 in 20 times. That tells you nothing about the truth of the matter.

Of course some will accept 95% and build on the foundation. Chances are this is the right pragmatic decision. Others will exercise skepticism and perhaps win a grand prize 1 out of 20 times. But the number 95 tells you nothing about which choice to make. It doesn’t direct you or order you to accept the science and it doesn’t order you to reject it. The question is

A very common practice is to test multiple potential effects, one or more of which have a p-value <0.05. You can no longer necessarily say you have rejected the null hypothesis, because the multiple tests made it easier for a null result to sneak below 0.05. The Bonferroni correction, IIRC, says to divide the significance level by the number of tests to find your significance target. Testing 5 separate effects requires that you reach p<0.01 to claim significance. Bonferroni has his detractors, but the general idea is clear.
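The Bonferroni rule described above is simple enough to sketch directly. The function names are made up for illustration, and the family-wise error calculation assumes the tests are independent and every null hypothesis is true, which the comment does not state.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance target under the Bonferroni correction."""
    return alpha / n_tests


def familywise_error(alpha_per_test, n_tests):
    """Chance of at least one false positive across n_tests.

    Assumes independent tests in which every null hypothesis is true.
    """
    return 1 - (1 - alpha_per_test) ** n_tests


threshold = bonferroni_threshold(alpha=0.05, n_tests=5)
print(f"Per-test target for 5 tests: {threshold:.3f}")
print(f"Uncorrected chance of a fluke: {familywise_error(0.05, 5):.1%}")
print(f"Corrected chance of a fluke:   {familywise_error(threshold, 5):.1%}")
```

Without the correction, five tests at 0.05 leave a better than one-in-five chance of at least one spurious "discovery"; with it, the overall chance stays just under 5 percent.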

To test the success rate of science you only have to get hold of some scientific magazines — New Scientist or Scientific American, say — from twenty years ago, and judge for yourself how many of the world-shattering events and discoveries announced there have actually had any impact on everyday life. Five per cent would be a generous estimate. But as others have already pointed out, that’s five per cent better than any other method.

To be remembered (this is from an online link; other wording is sometimes used, but the link covers everything one should remember BEFORE going on to analyse one’s results):

“The strength of the evidence for the alternative hypothesis is often summed up in a ‘P value’ (also called the significance level) – and this is the point where the explanation has to become technical. If an outcome O is said to have a P value of 0.05, for example, this means that O falls within the 5% of possible outcomes that represent the strongest evidence in favour of the alternative hypothesis rather than the null. If O has a P value of 0.01 then it falls within the 1% of possible cases giving the strongest evidence for the alternative. So the smaller the P value the stronger the evidence.

Of course an outcome may not have a small significance level (or P value) at all. Suppose the outcome is not significant at the 5% level. This is sometimes – and quite wrongly – interpreted to mean that there is strong evidence in favour of the null hypothesis. The proper interpretation is much more cautious: we simply don’t have strong enough evidence against the null. The alternative may still be correct, but we don’t have the data to justify that conclusion.”

(Statistical significance, getstats.org.uk page)

That said, it is also important to analyse the question one has oneself put forward to falsify the hypothesis in question. As Gerhard Vollmer wrote in 1993,

*Wissenschaftstheorie im Einsatz*, Stuttgart 1993: “Die Wichtigkeit oder Bedeutung eines Problems hängt immer auch von subjektiven, bewertenden Elementen ab.”

Quick English translation: The importance or significance of a problem always depends on subjective, evaluative elements.

In other words, one has to remember that none of us is without bias (“Tendenz”) in our backpack. This means that we have to be careful not to mix black, grey and white alternatives, nor to ask dependent questions. Remember that in any analysis of a result that tries to be faithful to the theory of science, it is better to use Chebyshev’s inequality as the next step of the analysis.

While all this might give you more than a hint about a certain type of observation, the “fact” observed in curves that two types of observation interact significantly with each other is a totally different thing.

If A can be shown to lead to B in X number of studies, and at the same time some B leads to C, no null hypothesis whatsoever is enough to prove that A leads to C.

You had better use set theory and number theory on your two variables/curves in order to be able to draw a more than probable conclusion.

Please deposit your p-values in the appropriate receptacles in the restrooms. Thank You! ;-)

There’s misunderstanding of the whole concept of statistical significance. P0.05 we conclude there is no point in further inquiry. It is unfortunate that many see statistical significance as proof of the effect. It’s not. It merely lends evidence to the effect.

Publishing a study that claims to reject the null hypothesis at P>0.02 would be largely ignored in my field – it would attract sharp critique in peer review and probably be rejected unless the notion that there is any probability of an effect at all is of strong interest. For example, in new lines of cancer research where there may be a possibility that multiple drugs may interact in a way that the sum effect may be greater than its individual parts. A P value between 0.02 and 0.05 could suggest there’s a new drug to play with in further research.

Statistical significance serves another purpose. It also avoids 95% of the noise in published literature where an effect is claimed. It is a tacit agreement among researchers that there’s a gatekeeper and filter that we use to avoid wasting everyone’s time talking about “my great discovery” when it’s not at all important.

Wow, using the less than sign followed later in the sentence with a greater than sign in WordPress serves to delete what was between the two symbols. My second sentence was supposed to say:

“P [less than] 0.05 is really the point at which it is considered feasible to take an interest in the effect. It’s a threshold at which to do research work. If the effect is P [greater than] 0.05 we conclude there is no point in further inquiry.”

HankH, WordPress interpreted your symbols as a nonsense HTML tag, ignoring them and everything in between.

Gary, thanks. The next time I’ll know better than to talk statistics without checking the layout of my symbols. ;-)

The study is a waste of time. It assumes that the largest problem is false positives. It isn’t.

P-values only measure *statistical* effects. They say nothing about the trial itself.

P-values don’t detect:

— biased experiments

— experimental errors in set-up, measurement or collation

— “correct” values that are, unfortunately, based on incorrect theoretical underpinnings (there were a host of experiments to test the aether that were sadly never going to be correct just because the result happened to pass some silly significance test)

— cherry picked or “corrected” measurements because the experimenter knew what the result should be

— maths errors in calculating the p-values

— maths errors in any other part of the experiment

— repeated runs of similar experiments until a “result” occurs (yes, the p-values should be corrected to allow for this, but there is no way the authors of our studies bothered to check if this is being done)

— incorrect conclusions from the result (that the p-value shows significance doesn’t mean that the experiment says what people think it says)

— and, not least, outright fraud.

It says something about the blindness of modern science that this stupid paper passed peer review.

Just because it thinks it refutes Ioannidis doesn’t mean that scientific papers are mostly correct. That is a perfect example of my point above: just because you get a statistical result doesn’t mean that it gets interpreted correctly. Papers can be wrong in so many ways that the authors apparently didn’t even consider.

Most of you must have lived through the discovery of the Higgs (particle physics, CERN, LHC). Maybe you also saw the dithering for a year over whether it was the Higgs or not. This was because the statistical significance was lower than 5 standard deviations, which, assuming a normal distribution, means a chance of error of one in almost two million. The five percent quoted above corresponds to 2 standard deviations. In particle physics the road to hell is paved with lower-sigma “resonances/particles.”

One cannot expect a study to be done on a million people; that happens, as with thalidomide, when the medicine is released, and then they go “oops”. But a hundred is too low a number to avoid random correlations. Of course they have their control samples and that makes a difference, as does the time taken during the observations: the statistics increase and the correlations are checked. Still, it is ironic that people’s lives are less precious than discovering the Higgs.
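The sigma levels in the Higgs comment above can be converted to probabilities with the standard library alone. This sketch assumes a normal distribution and a two-sided tail, which is the convention that matches the "almost two million" figure; particle physicists often quote the one-sided value instead, which is half as large.

```python
from math import erfc, sqrt


def two_sided_tail(z):
    """Probability of landing more than z standard deviations from the mean,
    in either direction, under a normal distribution."""
    return erfc(z / sqrt(2))


for z in (2, 5):
    p = two_sided_tail(z)
    print(f"{z} sigma: p = {p:.3g}, roughly 1 in {1 / p:,.0f}")
```

Two sigma comes out close to the conventional 5 percent, while five sigma is about one chance in 1.7 million.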

The problem is even worse. You choose a confidence level and then see if your results are significant at that level. The IPCC in AR4 used a confidence level of 90%. That means that you have a 10% chance of a false positive. However, if you have two papers/conclusions that use a confidence level of 90% and the results of one depend on the other, then the chance of a false positive becomes 1-(0.9 x 0.9) or 19%. If you have three levels, then your chance of a false positive using a confidence level of 90% becomes 27%. For 4 levels, it is 34% and for five levels it is 41%. With more than six levels, your chance of a false positive becomes better than a coin toss.

There are probably at least six levels of conclusions/results in the IPCC reports (I haven’t counted them), so the IPCC reports as a whole probably have a chance of a false positive close to or above 50%. If the confidence level used were 95%, then for six levels you would still have a 26% chance of a false positive. Only if you use a confidence level of 99% would you have a less than 10% chance of a false positive for six levels (actually 6%).

However, if climate science were to use a confidence level of 99% to test for significance, I wouldn’t be surprised if most of the field’s results were deemed to be insignificant. The IPCC rolled back the significance level to 90% from 95% between the TAR and AR4, IIRC.
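The chained percentages in this comment follow from the formula 1 - c^k, where c is the confidence level and k the number of dependent levels. A minimal sketch for checking them, assuming (as the comment implicitly does) that an error at each level is independent of the others and that a single error invalidates the final conclusion; the function name is illustrative.

```python
def chained_false_positive(confidence, levels):
    """Chance that at least one of `levels` stacked conclusions is a false positive.

    Assumes each level errs independently with probability 1 - confidence.
    """
    return 1 - confidence ** levels


for confidence in (0.90, 0.95, 0.99):
    row = ", ".join(
        f"{k}: {chained_false_positive(confidence, k):.0%}" for k in range(1, 8)
    )
    print(f"confidence {confidence:.0%} -> {row}")
```

If the errors at different levels are correlated, the true figure could be lower than this simple product suggests, so these numbers are best read as a rough upper-end illustration.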

Thanks Bob! I’m learning more about HTML tags every day. Using the arrows a lot in publication I forget you can’t get away with it so much in blogs.

I believe that using the following will get you printable less-than and greater-than arrows without confusing WordPress (delete the space after the ampersand in actual use)

& lt; & gt;

Testing: < >

Depending upon the underlying probability distribution, sample size, etc., selecting a smaller significance level does not give stronger results. There are Type I errors and Type II errors. The smaller the significance level becomes, the greater the probability of a Type II error, all else being equal. If alpha and beta and Type I and Type II errors do not mean anything to you, then you should consult a competent statistician before you even start the experiment and get help in understanding the error properties of the statistical tests before you set the significance level (alpha) and the sample size.
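The trade-off between the two error types can be made concrete with a small calculation. This is a sketch only: it assumes a one-sided z-test with known standard deviation, and the effect size and sample size are invented for illustration rather than taken from the comment.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def type_ii_error(alpha, effect, sigma, n):
    """Beta: chance of missing a real effect with a one-sided z-test.

    Hypothetical setup: null mean 0, true mean `effect`, known sigma.
    """
    critical_z = STD_NORMAL.inv_cdf(1 - alpha)  # rejection cutoff for this alpha
    shift = effect * sqrt(n) / sigma            # how far the true effect moves z
    return STD_NORMAL.cdf(critical_z - shift)


# Same experiment, progressively stricter significance levels.
for alpha in (0.10, 0.05, 0.01, 0.001):
    beta = type_ii_error(alpha, effect=0.5, sigma=1.0, n=25)
    print(f"alpha = {alpha:<5}: beta = {beta:.3f}")
```

Tightening alpha with the sample size held fixed raises beta; the only way to lower both at once in this setup is to collect more data.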

With apologies to The Bard:

To p(), or not to p(),

that is the question:

Whether ’tis Nobler in the analytical mind to suffer

The Slings and Arrows of outrageous Probability,

Or to take Statistics against a Sea of Data,

And by opp(o)sing nullify them?

What Mr. Lane said @9:51.

Testing the handy-dandy pre tag

In performing the simplest statistical test, you always test at least two assumptions: that the null hypothesis is true and that the sample is taken randomly. A violation of the latter assumption, usually taken for granted, may explain many false positives. Another problem with the Fisher type of test is that the null hypothesis concerns a point value, which is almost always a priori false. In a sufficiently large sample, the a priori false null hypothesis will be rejected at any significance level you like.

Well so much for that. The <pre> tag failed to preserve the input file handle <IN> in the 4th line above. Testing outside WP, the same result is obtained in IE, FF, and GC. PRE won’t save anything looking like a tag even across multiple lines.

Steven Mosher says: February 7, 2014 at 7:17 pm

“just banish the nonsense of a result being statistically significant. There is no such thing. It’s a bogus tradition.”

Sort of agree. Wish to stress that this type of uncertainty expression is about the spread of values about a mean of some sort, precision if you like.

More concerned about whether the mean is in the right place, than with the scatter about it being 95% enclosed by a certain curve. Bias.

I’m thinking that when you compare a number of temperature data sets with adjustments and there is an envelope around the various adjustments of say +/- 1 deg C, it is rather immaterial to concentrate on precision because it just adds on top of the bias which can often be the larger.

Like this graph from Alice Springs in the centre of Australia – I guess I should update it now we have BEST and CRUTEM4 and even Google.

Of course, these concepts are as old as time, but it’s remarkable how, in climate work, the bias aspect is so seldom considered properly, if at all. One gets the impression that ye olde thermometers could be read to 0.05 degrees, when 1 degree was more like it.

Or that Argo floats are accurate to 0.004 deg C as I recall. Utter BS.

But then, you’d broadly agree with me, I suspect.

Nick Stokes:

At February 7, 2014 at 6:30 pm

NO!

That is warmist sophistry which pretends the ‘pause’ is not happening.

A linear ‘trend’ can be computed from any data time series. At issue here is whether the trend in global average surface temperature anomaly (GASTA) differs from zero (i.e. no discernible global warming or cooling) and – if so – for how long before the present.

Climastrology uses linear trends and 95% confidence. There are good reasons to dispute each of these conventions, but they are the conventions used by climastrology so they are the appropriate conventions in this case.

So, in this case the null hypothesis is that a linear trend in GASTA does not differ from zero at 95% confidence and, therefore, there is no discernible warming. And the period to be determined of no discernible global warming or cooling is up to the present. Therefore, the end point is now and the data is assessed back in time until a linear trend over the period differs from zero at 95% confidence.

Each of the several time series of GASTA indicates no trend which differs from zero (i.e. no global warming or cooling) for at least 17 years until now; RSS indicates 24.5 years.

And it is not reasonable to remove data from the data set(s). 1998 had a high value and there is no possibility of justifying its removal from the data set whatever the cause of it being a high value. This is because the assessment is of how long there has been no discernible warming or cooling, and any distortion of the analysed data provides a distortion of the result of the analysis.

Importantly, 17 years takes us back to 1997 and there was statistically significant warming over the previous 17 years. Therefore, discernible global warming stopped at least 17 years ago.

Richard

Nobody has said the magic word (‘model’). You know that p-value thing? You know how it is calculated? Using a model, which in some cases is only an assumption.

The father of modern statistics was an ardent Eugenics catastrophist – he developed the field of statistics to find mathematical support for his passion.

http://en.wikipedia.org/wiki/Ronald_Fisher

I am not suggesting that statistics is useless because its origins are tainted. What I am suggesting is that if someone with the genius to invent an entire mathematical discipline can be fooled by his own invention, then anyone can get it wrong.

Hence, one of my principal aims when I’m detailing doctors is that the doctor is made aware of the statistical significance (or non-significance) of a value, and then of the more important and most relevant part of the discussion: is the change in value between the placebo arm and the active-ingredient arm CLINICALLY MEANINGFUL? The clinical meaningfulness of a number is a more practical way to understand a drug’s effect on a subject.

Eric Worrall: ‘The father of modern statistics was an ardent Eugenics catastrophist’.

Consider the historical context. The inconvenient truth is that many of his contemporaries were eugenicists. Just as embarrassing is the fact that many people in those days were anti-semites. Perhaps the most difficult thing, even for geniuses, is to check assumptions and to think about consequences.

Some points I’ve been thinking about:

1) The people rubbishing medical research don’t release their data. There’s no way to confirm what you’ve been hearing. It’s basically hearsay.

2) Medical journals are the worst for publishing & executing methods. The statistical tests are the least of the problems. Why does this keep coming up? We learn to dismiss articles based on faulty methods or experimental construction, no matter what the statistical significance.

3) People keep trying to examine the outcome of medical research based on whether drugs that work in the lab work in the clinic. This doesn’t measure the search for knowledge, it measures the search for financial success. There’s something to be said for knowledge so reliable you can take it to the bank. However, clinical trials can fail for reasons that have nothing to do with the reliability of scientific knowledge. These exercises looking at the monetization of science are a waste of time. Everything is worthless until it’s not. If you perform an evaluation of the evaluators as in Science mag this week (paywalled unfortunately, http://www.sciencemag.org/content/343/6171/596), you’ll find out that these evaluations are not worth much.

“Therefore you can reject the assumption that there was no effect, conclude you have found a true effect and get your paper published.”

The p-value is not what decides whether a paper gets published. A negative result is still a valid scientific result. But this raises the other problem of the ‘literature’: publication bias. Only publishing positive results falsifies the record as well.

A recent case was Tamiflu, a supposed protection against certain strains of influenza. Tony Blair, then prime minister, invested something like 4 billion pounds sterling in a stock of treatments in case an epidemic of bird flu struck the UK.

It has recently been found that about half the studies on the drug were negative but they did not get published.

Good article BTW, thanks.

Well, I think it is not correct to limit the scientific method to statistics. As others have pointed out in this discussion, statistics is useful to cut off the noise. However, no statistics can replace a mechanistic explanation. To give an example familiar from my field of work: genetic association studies (those exploring the association of common genetic variants in the population with some clinical phenotype) often get p values of less than 10^-5 / 10^-8 (sorry, I am not familiar with tags either). However, only when the finding is biologically explained (with functional experiments) can one claim that the discovery is a scientific advancement. Otherwise you can claim the association but not that the association has any biological meaning.

I think this should also happen in climate science. Perhaps funding agencies should invest their money better, sponsoring studies aimed at understanding the physics underlying the observed phenomena rather than thousands of useless studies finding statistical associations and then building theories on whatever they find statistically significant. Science based on statistics actually reverses what the scientific method should be, and it is really the prototype of a fishing expedition (you get something but you don’t know why).

richardscourtney says: February 7, 2014 at 11:15 pm

“So, in this case the null hypothesis is that a linear trend in GASTA does not differ from zero at 95% confidence and, therefore, there is no discernible warming.”

No, that’s nonsense, and putting in bold doesn’t improve it. Where I am, it was 41°C today. Was that statistically significant? Well, maybe not; it’s late summer. But it was still discernibly warm.

You have tested whether the observed trend could have happened with an underlying zero trend and natural variation. And the answer is that that can’t be rejected. But it is not the only possible explanation.

The UAH index shows a trend of 1.212°C/century since Jan 1996. That’s not quite significant re 0, so we can’t rule out an underlying zero trend. But we also can’t rule out the upper limit of 2.44°C/century (or anything in between). In fact 2.44 is as likely as 0. Now that would be highly discernible warming. In fact, the observed 1.212°C/cen is considerable.

What we know is that the measured trend was 1.212°C/cen. That’s what actually happened, and is discernible. The rest is theorising about what might have happened if we could run it again.
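The claim that 2.44 is about as likely as 0 can be checked from the two numbers given in the comment. This is a sketch under assumptions the comment does not spell out: that the interval is symmetric about the estimate, that 2.44 is the upper end of a 95% interval, and that the estimate is approximately normally distributed.

```python
from math import exp

estimate = 1.212     # trend in °C/century, from the comment
upper_limit = 2.44   # upper confidence limit in °C/century, from the comment

# Assumption: symmetric interval, so the lower limit mirrors the upper one.
half_width = upper_limit - estimate
lower_limit = estimate - half_width
print(f"Implied interval: {lower_limit:.3f} to {upper_limit:.2f} °C/century")

# Assumption: 95% interval from a normal estimate, so half-width = 1.96 standard errors.
standard_error = half_width / 1.96


def relative_likelihood(trend):
    """Normal likelihood of a true trend, relative to the best estimate."""
    return exp(-0.5 * ((trend - estimate) / standard_error) ** 2)


ratio = relative_likelihood(0.0) / relative_likelihood(upper_limit)
print(f"Likelihood of a zero trend relative to 2.44: {ratio:.2f}")
```

Under these assumptions zero sits just inside the lower end of the interval, and the two extremes come out almost equally plausible, which is the point the comment is making.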

Carlo Napolitano, I agree. Statistical testing is a good safeguard but is not the be all and end all of scientific analysis. Too much of what is used in climate science has been inherited from econometrics rather than the physical sciences. And often rather poorly understood at that.

There is another way of judging the veracity of published results: look after, say, ten years and see how many times a result is referenced. If it isn’t, either the result wasn’t important or nobody believes it anymore, or it has been superseded by later results.

They did such a sobering analysis some decades ago with astronomy/astrophysics papers and found that only a few percent survive the ravages of time.

John Brignell put the problems of statistical significance into layman’s language several years ago. His website, www.numberwatch.co.uk, is worth spending an hour on, and his book “Sorry Wrong Number” is excellent. One of his constant points is that in epidemiology a relative risk of at least 2 (a doubling of the effect) should be seen before the result is taken as important due to the number of conflicting factors in any open system (doesn’t it sound like AGW?).

Here are a few relevant pages from the website:

http://www.numberwatch.co.uk/statistical_bludgeon.htm

http://www.numberwatch.co.uk/Stuff.htm

http://www.numberwatch.co.uk/an_exercise_in_critical_reading_.htm

He also has several essays on the ridiculousness of AGW and a review of Watermelons :).

Hypothesis testing gets a hard time every now and then, from those who think the p-value is the probability that the alternative hypothesis is wrong, or who think that such testing provides proof or disproof of some kind. It does neither. It is merely a means of assessing the strength of evidence in a particular data set, considered in isolation. In general, the null hypothesis is for ‘no effect’, e.g. that the hair on the left side of your head has the same mean diameter as that on the right. We know that is not true. Generally we know the null hypothesis is not true. We are not trying to prove or disprove it. All we are doing is going through a ritual whereby we say, if the null were true (and other conditions deemed applicable hold) what is the probability of getting some statistic as or more extreme than the one computed for this particular set of data? That’s it. A small p does not mean the null is false, a large p does not mean that it is true. The test is making a far more modest contribution than that.

It’s surprising the Ioannidis paper has created such a stir, since it basically just says that a small probability of error in an individual experiment/study, when compounded over thousands of different experiments/studies, results in a much larger probability of error. Put another way, the probability of getting at least one head in 1000 spins of a coin greatly exceeds the probability of a head on 1 spin. Pretty obvious, although pinning down the precise probability of error/false positives over thousands of very different kinds of studies is definitely a hard problem.

The relevance to climatology lies in the proliferation of different measures global warmers are coming up with – sea levels, ice volumes, ocean heat content etc etc. Keep data mining and you will find something still going up steadily! Especially as at least some of these are probably correlated to global average temperature anomaly with a time lag. Not to forget there are half a dozen such global anomalies to begin with.

Nick Stokes:

I am replying to your post at February 8, 2014 at 1:01 am which is here and is in reply to my post at February 7, 2014 at 11:15 pm which is here.

In my post I rightly said of your assertion

I explained

Those conventions were used by climastrology to claim there was global warming. What matters is to use THOSE SAME conventions when assessing the ‘pause’. And it is sophistry to say that different conventions should be used when the result does not fit an agenda.

I stated that “There are good reasons to dispute each of these conventions” but, so what? The only pertinent fact is that those are the conventions used by climastrology. It is ‘moving the goal posts’ to now say those conventions should not be used because they are wrong.

Your reply which I am answering says

That is more sophistry!

Whatever the cause of the ‘pause’ is not pertinent to a determination of the existence of the pause.

The same conventions of climastrology used to determine that there was global warming were used to determine the start of the ‘pause’. And the conclusion of that analysis is as I said

and

The conventions adopted by climastrology may be mistaken (I think they are) but it is not “science” to choose when and when not to use conventions depending on the desired result.

Richard

From the article linked in Tom Siegfried’s essay-

“Others proposed similar methods but with different interpretations for the P value. Fisher said a low P value merely means that you should reject the null hypothesis; it does not actually tell you how likely the null hypothesis is to be correct. Others interpreted the P value as the likelihood of a false positive: concluding an effect is real when it actually isn’t. ”

Seems like Tom Siegfried and many other commenters on this thread, such as Nick Stokes and Steven Mosher, have made the same misinterpretation of what Fisher’s p value actually is, just as is alluded to in that article.

Alpha values are what determine Type 1 errors, or False Positives, per Neyman–Pearson. Fisher p values are about acceptance of the null hypothesis, not about Type 1 and 2 errors, as Tom Siegfried suggests.

What Leonard Lane says at February 7, 2014 at 9:51 pm is spot on, if he means consult a statistician using Bayesian methods.

I am interested in seeing if Tom Siegfried figures out what a p value actually is before he writes part 2 of his essay.

Mindert Eiting

Eric Worrall: ‘The father of modern statistics was an ardent Eugenics catastrophist’. Consider the historical context. The inconvenient truth is that many of his contemporaries were eugenicists. As embarrassing is the fact that many people in those days were anti-semites. Perhaps the most difficult thing, even for geniuses, is to check assumptions and to think about consequences.

A historical example of GIGO – the statistical techniques were well applied, but the data and assumptions were rubbish.

Fast forward to the present day, and the climate “geniuses” can’t even get the statistics right.

richardscourtney, everyone should try the SkS trend calculator, but instead of using the latest figures, feed in 30 year time periods before and after the 1940 – 1970 cooling. Whatever Foster and Rahmstorf’s method is calculating, it is not a reliable guide as to whether the world is experiencing a downturn in global temperatures.

richardscourtney says: February 8, 2014 at 2:33 am

“Importantly, 17 years takes us back to 1997 and there was statistically significant warming over the previous 17 years. Therefore, discernible global warming stopped at least 17 years ago.”

Well, that makes absolutely no sense, despite the bold face. Yes, trend from 1980 to 1997 was significantly different from zero. So was the trend from Jan 1995 to Dec 2012. Does that mean discernible global warming stopped a year ago?

A p-value only gives confidence in rejecting the null hypothesis, it is not proof of an effect. You can propose an alternative hypothesis and test for that as well.

In clinical studies a p-value of 0.01 is typically used but, more importantly, studies have to be properly powered beforehand, and the results have to either agree or disagree with the baseline measurements within their prior agreed upon confidence intervals.

If AGW research followed the rules required of Pharmaceutical research, the entire dogma would have been rejected by the FDA years ago.

Statistician William M. Briggs wrote:

More at link in a very good essay: http://wmbriggs.com/blog/?p=11305

By the way, Briggs has written extensively about the problem of people misusing statistics. His blog site is a treasure trove of wonderful essays on the issue.

Nick Stokes:

Your post at February 8, 2014 at 3:08 am is yet more of your sophistry.

My post addressed to you at February 7, 2014 at 11:15 pm explained the derivation of my statement saying

But you ignore that and introduce a Red Herring by saying

That is complete nonsense!

As I said in my post at February 8, 2014 at 2:33 am to which you claim to be replying

Richard

Oops. I messed up that last. The link is indeed to the rest of that essay quoted from, but the “very good” essay I wanted to point out is the one before that and the link is: http://wmbriggs.com/blog/?p=11261

It would be nice to be able to edit, but WordPress says that could lead to problems. They are most likely correct. :-(

Eric Worrall:

I agree with your post at February 8, 2014 at 3:01 am which says

However, that has nothing to do with my dispute with Nick Stokes in this thread.

The statistical conventions adopted by climastrology are nonsense. However, they were used to show the existence of discernible global warming in the last century. Stokes is now claiming that those same conventions should not now be used because they now demonstrate that there has not been discernible global warming for at least 17 years.

This is an important issue which goes to the heart of the subject of this thread.

Appropriate statistical methods need to be applied to assess the time series of GASTA. And their appropriateness needs to be defined technically and not on the basis that it fulfills an agenda.

Richard

Hypothesis testing is best thought of in Bayesian terms. You start off with a prior belief in the conclusion. You perform an experiment or make an observation that adds or subtracts from your belief. Your posterior belief, the belief you should have that the conclusion is true after seeing the experiment, is your initial belief *plus* the increment from the experiment.

The p-value is an approximation to the size of the experimental increment. It is *not* the probability of the conclusion being true. The reason for using it as a filter on publication is to say “this experimental result is strong enough to shift your opinion significantly.” It does *not* say “this experiment shows that the conclusion is true.”

The 100 drugs trial example above is a classic example. You start off with a 1% confidence in each of the drugs. You perform the test, and at the end you have a 20% confidence in each of the drugs that passed. That’s a big increase in confidence, and well worth reporting, but if you started at 1% you’re only going to get to 20%, and as noted, that still means there’s an 80% chance you’re wrong.
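The arithmetic behind that jump from 1% to roughly 20% can be sketched in a few lines of Python. This is only an illustration under assumptions that are mine rather than part of the example: a 5% false positive rate, and a drug that really works always passing the test (power of 1). The function name is made up for the sketch. Under those assumptions the answer comes out a little under 20%, in the same ballpark.

```python
def posterior_given_pass(prior, power=1.0, alpha=0.05):
    """Probability that a drug really works, given that it passed the test.

    prior -- belief that the drug works before the trial (e.g. 0.01)
    power -- assumed chance a working drug passes (hypothetical: 1.0)
    alpha -- assumed chance a useless drug passes anyway (hypothetical: 0.05)
    """
    true_pass = prior * power            # working drugs that pass
    false_pass = (1.0 - prior) * alpha   # useless drugs that pass by chance
    return true_pass / (true_pass + false_pass)


p = posterior_given_pass(0.01)
print(f"{p:.3f}")  # prints 0.168
```

With a lower assumed power the posterior is lower still, so the point stands either way: a drug that starts at 1% does not get anywhere near 95% from a single p<0.05 result.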

For the mathematicians:

Bayes says that for two hypotheses H1 and H2 and an observation O…

P(H1|O) = P(O|H1) P(H1)/P(O)

P(H2|O) = P(O|H2) P(H2)/P(O)

so dividing one equation by the other

P(H1|O) / P(H2|O) = [P(O|H1) / P(O|H2)] [P(H1)/P(H2)]

Take logarithms

log[P(H1|O) / P(H2|O)] = log[P(O|H1) / P(O|H2)] + log[P(H1)/P(H2)]

and we interpret this as

log[P(H1)/P(H2)] = prior confidence in H1 over H2

log[P(O|H1) / P(O|H2)] = confidence added by observation O in favour of H1 over H2

log[P(H1|O) / P(H2|O)] = posterior confidence in H1 over H2 after seeing the observation

If H2 is just the opposite of H1, then P(H2) = 1-P(H1), and we can translate the logarithmic confidence scale to probabilities using c = log[p/(1-p)] and back again with p = 1/(1+b^(-c)) where b is the base of the logarithms.
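For anyone who wants to play with the confidence scale, here is a minimal Python sketch of the two conversions, c = log[p/(1-p)] and p = 1/(1+b^(-c)). The function names are my own, and the worked numbers (a 1% prior, an observation that is certain under H1 and has probability 0.05 under H2) are assumptions chosen to match the drugs example, not anything measured.

```python
import math


def confidence(p, base=math.e):
    """Probability -> logarithmic confidence, c = log[p / (1 - p)]."""
    return math.log(p / (1.0 - p), base)


def probability(c, base=math.e):
    """Logarithmic confidence -> probability, p = 1 / (1 + b^(-c))."""
    return 1.0 / (1.0 + base ** (-c))


prior = confidence(0.01)             # log[P(H1)/P(H2)] for a 1% prior
# Assumed likelihoods: P(O|H1) = 1 and P(O|H2) = 0.05
increment = math.log(1.0 / 0.05)     # log[P(O|H1)/P(O|H2)]
posterior = probability(prior + increment)
print(round(posterior, 3))           # prints 0.168
```

On this scale the experiment simply adds a fixed amount of confidence, and where you finish depends on where you started.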

The p-value is just P(O|H2), the probability of the observation under the null hypothesis, and the smaller it is the more confidence we’ve just gained in the alternative hypothesis H1. As you can see, this assumes that the observation is fairly certain to occur under H1, so log[P(O|H1)] is small. If it’s not, p-values taken too literally can give misleading results. However, it’s usually intuitively obvious if that’s the case, and this sort of thing is only a big problem when researchers apply statistical calculations blindly without understanding how the evidence works.

Not that I’m saying that never happens…

” Bob says:

February 7, 2014 at 9:05 pm

HankH says:

February 7, 2014 at 7:58 pm

“Wow, using the less than sign followed later in the sentence with a greater than sign in WordPress serves to delete what was between the two symbols”

Hank: It was probably not WordPress that interpreted your arrows to be html tags. The browser is the software charged with that duty, and pretty much all browsers will ignore non-existent html tags without an error statement. So, the left arrow “<” and the right arrow “>” with something in between are interpreted as a tag. I don’t know of any way around it. It seems there should be an escape character that could be used.”

________

WordPress parses your comment into a filtered HTML generator before it passes to the comment file and deletes what it sees as incompatible code, and recodes other things, like urls. The angular brackets are especially tricky because, if they were not screened, the entire blog page can collapse depending on what’s between the brackets. The angle brackets are the fundamental code delimiter in html files. It’s not a browser issue, as it happens before your browser sees the parsed code. Sometimes it’s just easier to spell things out. For “greater than” and “less than” you could use (GT) or (LT), e.g. (LT).05. Square brackets are equally cumbersome, since some systems (PHPBB notably) use those as code delimiters.

“Sometimes it’s just easier to spell things out. “For greater” than and “less than” you could use (GT) or (LT), e.g (LT).05.”

Yes, it’s not WordPress, just HTML. You can use special sequences, given here. For less than, use & lt ; but without spaces (< see). For GT, & gt ;.

David L said:

If AGW research followed the rules required of Pharmaceutical research, the entire dogma would have been rejected by the FDA years ago.

~ ~ ~ ~ ~ ~ ~ ~

The FDA gave its stamp of approval to Merck for Vioxx.

The FDA tried to close down Dr Burzynski’s cancer clinic, keeping him mired in legal battles for years while the Department of Health and Human Services was busy stealing his antineoplaston patents. (ref)

That’s the FDA and pharmaceutical industry research standards that you hold up as a role model. Not a good model.

Orkneygirl (@Orkneygal) says: February 8, 2014 at 2:44 am

“Nick Stokes and Steven Mosher, have made the same misinterpretation of what Fisher’s p value actually”

Not me. Fisher is quoted as saying:

“a low P value merely means that you should reject the null hypothesis; it does not actually tell you how likely the null hypothesis is to be correct.”

That’s exactly what I’m saying. P value tests can’t prove the null hypothesis correct. They can only usefully persuade you to reject.

So when you say:

“Fisher p values are about acceptance of the null hypothesis”

that’s exactly against what your quote is saying.

From the article:

Ah yes, the Slippery Slope Fallacy. If they stop believing in the results of medical science, they’ll believe in something worse. When I attended church I would get regular sermons on religious beliefs in things which were ipso facto preposterous and failure to believe in them leading to immoral behaviour or even worse, atheism.

The answer is, of course, that there are too many (medical) science articles that are unreproducible and/or with unreliable marginal experimental results based on poor use of statistical techniques. And those results are trumpeted by a small coterie of scientific journals which trade on “impact” instead of verifiability.

It’s an unvirtuous circle that scientific academies, if they had any use at all, would be trying to break. Instead scientific academies themselves are stuffed with people who produced the poor research in the first place, and are co-opted to promote orthodoxy of mediocre results and pour calumny on critics.

Khwarizmi on February 8, 2014 at 4:40 am

David L said:

If AGW research followed the rules required of Pharmaceutical research, the entire dogma would have been rejected by the FDA years ago.

~ ~ ~ ~ ~ ~ ~ ~

The FDA gave its stamp of approval to Merck for Vioxx.

The FDA tried to close down Dr Burzynski’s cancer clinic, keeping him mired in legal battles for years while the Department of Health and Human Services was busy stealing his antineoplaston patents. (ref)

That’s the FDA and pharmaceutical industry research standards that you hold up as a role model. Not a good model.

———–

You proved my point better than I did!!! Even as crappy as the FDA and Pharma are and applying their less than optimal standards, the AGW would not hold weight!!!!

John A : “…. leading to immoral behaviour or even worse, atheism.”

Why is an atheist “worse” than an immoral christian?

Do you think that no one is capable of being moral without having a ‘representative’ of god to tell him what to do?

I think you need to check your null hypothesis.

As an engineer, I am somewhat bemused by all this statistical theory.

How many so-called “scientists” would let their children fly on an aeroplane that had a 95% probability of completing its journey 19 times out of 20, and crashed in flames the 20th time? Or even drive across a bridge that had a 1 in 10,0000 chance of falling down when a car drove across it?

Would Mosher depend on feeling lucky if the fate of his offspring was at stake and permit them to fly on the aforesaid aeroplane that had been designed and built by a bunch of climate McScientists who solemnly assured him that the p-value was <0.05 of it falling in pieces in mid-air? Would Grant Foster? Or Michael Mann? I seriously doubt it. And yet they expect us to destroy our economies and hand over ever-increasing quantities of our hard-earned cash to the likes of Al Gore on equally flimsy evidence.

And then scientists look down on engineers for getting their hands dirty by applying science – of which in practically every case they require a vastly more profound understanding, for obvious reasons – to real world problems.

"Scientists" appear to resent the fact that engineers often regard much of their prognostication with amusement verging on contempt – AGW is a case in point – then wonder why.

Think on, as we say up here in Yorkshire.

A recent paper argues that p-values should be reduced to 0.005 to 0.001. Revised standards for statistical evidence, Valen E. Johnson. http://www.pnas.org/content/110/48/19313

“Recent advances in Bayesian hypothesis testing have led to the development of uniformly most powerful Bayesian tests, which represent an objective, default class of Bayesian hypothesis tests that have the same rejection regions as classical significance tests. Based on the correspondence between these two classes of tests, it is possible to equate the size of classical hypothesis tests with evidence thresholds in Bayesian tests, and to equate P values with Bayes factors. An examination of these connections suggest that recent concerns over the lack of reproducibility of scientific studies can be attributed largely to the conduct of significance tests at unjustifiably high levels of significance. To correct this problem, evidence thresholds required for the declaration of a significant finding should be increased to 25–50:1, and to 100–200:1 for the declaration of a highly significant finding. In terms of classical hypothesis tests, these evidence standards mandate the conduct of tests at the 0.005 or 0.001 level of significance. ”

I’ve been working in industrial labs for 35 years and I’ve learned to start any statistical analysis similar to a statics analysis for forces: start with a drawing. With statistics, a plot of the data and a description of the statistical analysis and justification for the assumptions inherent in that analysis. Don’t even calculate a standard deviation if you haven’t looked to see if the data is normally distributed! Did you see the assumption you made before calculating that standard deviation? (hint: Sometimes a transform of the data will give you a normal distribution.)

Industry is a little different than academia; from Ghost Busters

I’ve added this article to my collection.

See also Revised standards for statistical evidence, by Valen E. Johnson,

Proc. Natl Acad. Sci. USA, Oct. 9, 2013 (print Nov. 11, 2013), doi:10.1073/pnas.1313476110, and the discussion at WUWT, esp. Prof. Robert G. Brown’s comment, plus the article in Nature.

Steven Mosher:

“The notion that there is some special value that allows us to automatically accept or reject tests, is the culprit. you report 95%. from that one cannot deduce anything. You can observe that the result might be wrong 1 in 20 times. That tells you nothing about the truth of the matter.”

Actually I thought that 95% was the figure that permitted the IPCC to propagate CAGW, which I understand is a hypothesis espoused by Mr. Mosher.

http://www.bbc.co.uk/news/science-environment-24292615

Solomon Green:

In your post at February 8, 2014 at 6:05 am you say

Yes! That was the point of my debate with Nick Stokes in this thread.

As I said to Eric Worrall at February 8, 2014 at 3:35 am

Richard

I like this article and would like to make a small point.

Researchers HAVE TO make progress in order to be successful. No paper is done in isolation. One paper builds upon another and you have to hurry up. There’s no time to take so much data as to be metaphysically sure that the relation is adequately described, i.e. p<0.00…01. There's no real need because the next bit of progress is going to be built upon the last bit and if the last bit is wrong then the next bit won't pan out. You'll quickly discover that a mistake was made somewhere along the way. You won't know where but with a good understanding of the subject, you can make some good educated guesses and efficiently reexamine past conclusions.

This is in contrast to Climate Research. New data doesn't invalidate prior conclusions very well because the new data isn't generated by the thought process of the researcher. The feedback on the thought process is much thinner.

Greg Goodman

A small factoid: I am an atheist of 15 years’ standing.

What I am pointing out is that fallacious reasoning is not limited to churches, and the same fallacies wheeled out regularly to religious believers to keep them on the straight and narrow path also happens in scientific journals.

Sorry John, it’s often hard to detect satire in blog posts.

The parallels between religion and science are many.

In particular, those who are part of the flock of the church of AGW believe in “the science” like christians believe in the “the word”.

Bald headed monks like brother Michael are revered as wise men.

Even outside the mess of climatology, science has become very much like the Church it has replaced.

Catweazel

You missed the point entirely. I’m basically with Briggs on this matter.

Example. Suppose my null is the buttered toast will fall butterside down half of the time.

Am I going to require high statistical certainty on such a matter? Nope.

With regard to the IPCC: it takes zero stats to understand that CO2 is a problem.

Say they test 1,000 hypotheses, all of which are false but for which a p value can be derived. Using a 95% confidence interval on the sample, about 50 of the tests of hypotheses will fall in the tails of the distribution and thus 50 “discoveries” will have been made.

The problem is that it is those 50 results that will be written up and submitted to journals for publication. The editors of the journal will be, by definition, looking at 50 false conclusions, from which they will choose to publish the ones that…well, it really doesn’t matter which ones they publish, does it? They’re all wrong in this case, by definition.

Perhaps this is one reason so few published results survive the test of time?
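A quick way to see the 50 “discoveries” appear is to simulate them. The sketch below is my own illustration, not anything from the article: it assumes each test yields a p value that is uniformly distributed between 0 and 1 when the null hypothesis is true (which holds for a continuous test statistic), and the function name and seed are arbitrary.

```python
import random


def count_false_discoveries(n_tests=1000, alpha=0.05, seed=0):
    """Count 'significant' results when every tested hypothesis is false.

    Assumption: under a true null, each p value is uniform on [0, 1],
    so it falls below alpha with probability alpha.
    """
    rng = random.Random(seed)
    return sum(1 for _ in range(n_tests) if rng.random() < alpha)


expected = 1000 * 0.05                 # the 50 expected false discoveries
observed = count_false_discoveries()   # one simulated batch of 1,000 tests
print(expected, observed)
```

The simulated count bounces around 50 from run to run, but every one of those “hits” is wrong by construction.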

“For one thing, a 5 percent significance level isn’t a very stringent test. Using that rate you could imagine getting one wrong result for every 20 studies”

=====

It can be even worse than that. Sometimes, there is no expected result, or multiple possibilities. For example, you might be curious if blood donation raises or lowers the donor’s blood pressure. One might expect it to lower blood pressure by decreasing the volume of blood to pump. Or one might expect it to increase blood pressure because many folks find sticking a needle in their arm to be stressful. If one starts with no expectation, there are two p=.05 probabilities — increase or decrease, not one. And one chance in 10 rather than one in 20 of achieving a “significant” result.
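The one-in-10 figure can be checked with Python’s standard library. This is a sketch under assumptions of my own: the test statistic is standard normal when there is no effect, and a one-sided 5% cutoff is applied in whichever direction the data happen to point.

```python
from statistics import NormalDist

std_normal = NormalDist()                  # assumed null distribution of the statistic

one_sided_cut = std_normal.inv_cdf(0.95)   # about 1.645
p_increase = 1.0 - std_normal.cdf(one_sided_cut)   # "significant" rise
p_decrease = std_normal.cdf(-one_sided_cut)        # "significant" fall
p_either = p_increase + p_decrease                 # about 0.10, one chance in 10

# A two-sided test keeps the total at 5% by putting 2.5% in each tail.
two_sided_cut = std_normal.inv_cdf(0.975)  # about 1.96

print(round(p_either, 3), round(one_sided_cut, 3), round(two_sided_cut, 3))
```

The usual remedy when there is no prior expectation is the two-sided cutoff, which demands a somewhat larger effect in exchange for keeping the overall rate at one in 20.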

Looks like it’s time for a reminder:

Over on the right side nav bar is a link to Ric Werme’s guide to WUWT. Among other goodies is a good list of HTML notes for getting characters like ‘<‘ to display.

Also, if you want to try out <pre> and what not, please do it at the “Test” page – see the link at the top nav bar. That has most of my HTML notes too.

Sorry about breaking the font size and indentation there last year. I forget how I did it.

Greg said:

“Sorry John, it’s often hard to detect satire in blog posts.”

It wasn’t actually satire. John A was merely pointing out the position from the church view point. Indeed, atheism is probably worse than any sin in many, if not most, church eyes.

Ah, Mosher, ever the myopic prognosticator w.r.t. climate and that evil demon CO2. Indeed, it is obvious that CO2 is a problem, if that is what you started out believing. True scientists (read: few associated with climatology) don’t actually start out with a belief: they see something interesting then devise tests they can conduct (including analyses of existing data) to attempt a better understanding of the phenomenon. People like you, however, are all about belief. Maybe that really is what separates engineers from the rest of the scientific community (as noted above): we are specifically trained to root out meaningless (spurious) correlations, and as such, always look for alternative answers that better describe the problem at hand.

Mark

Greg says:

February 8, 2014 at 8:04 am “The parallels between religion and science are many. In particular, those who are part of the flock of the church of AGW believe in “the science” like christians believe in the “the word”. Bald headed monks like brother Michael are revered as wise men.

Even outside the mess of climatology, science has become very much like the Church it has replaced.”

The parallel between science and religion is this:

data is to science as text is to belief.

In the case of Christianity, all belief is to be tested with the text. A priest class, or expert class, which interprets the text for everyone else has always historically led to distortions and abuses, with the eventual outcome that traditions developed by the priest class teach the very opposite of the text. I believe this pattern exists in Hinduism and Buddhism, as well as in Christianity. Literacy, and translations into spoken languages has allowed believers to read the text themselves and judge the claims of the priest class. The individual then goes to the church which reflects their interpretation.

In the case of the example of medical science, the answer is similar. Each individual should have the freedom to research claims and choose a path to healing. No government board or exchange should be forcing medical decisions on doctors and patients. The treatments of medical doctors are often drastic, and have many side effects and unintended consequences, and may be worse than the disease it is meant to treat. Iatrogenic illnesses and deaths are very possibly the most under reported area of science. In the case of your physical life, or your eternal life, literacy and liberty, along with personal responsibility for outcomes, are optimal for human well being.

Steve Mosher-“It takes zero stats to understand that co2 is a problem.” Based on what? CO2 concentration in the atmosphere is 400 PPM. 400 sounds like a big, dangerous number. But 400 parts per million is .04%. According to warmistas, mankind has caused the CO2 concentration to increase from around 375 PPM to 400 PPM, with 375 PPM as the Goldilocks standard, not too hot, not too cold, but just right. 400 PPM is catastrophe. The increase in concentration is .000025 in absolute terms, and .0025% in percentage terms. Most people would look at the percentage numbers and percentage increase and draw the obvious conclusion, as Dr. Richard Lindzen has, that CO2 is a trace gas that has increased by a trace percentage, and so what? In addition, increase in global warming may be a good thing. More CO2 combined with warmer weather means that crop production in North and South Dakota, MN and southern Canada will increase tremendously, helping to alleviate global hunger.

Yes some oceanfront property may be lost, but at such a glacial pace that mitigation and the necessity to move inland will take place over such a long time frame that the economic cost is easily absorbed compared to the economic gains. Man lives where the climate is warm and wet, warm and dry, cold and wet, and cold and dry, and often experiences all four in one location. We are a remarkably adaptive species, and we all have feet. The bigger threat is Government wasting billions in capital on a chimera, all of which will increase the cost of energy, and for the first time in human history, we are making intentional policy choices that will lower the standard of living for future generations.

@Steven Mosher says: February 8, 2014 at 8:05 am

“With regard to the ipcc. It takes zero stats to understand that co2 is a problem”

Mosher, that’s your problem. CO2 is the fundamental building block of all carbon based life forms, it is not the problem!!

richardscourtney says:

February 8, 2014 at 2:33 am

The conventions adopted by climastrology may be mistaken (I think they are) but it is not “science” to choose when and when not to use conventions depending on the desired result.

=============

Changing your methods is cherry picking. If you apply 20 different methods to analyse the data, the odds are that 1 method will deliver a false positive at the 95% confidence level. If you then report only this 1 method, you are committing scientific fraud. However, this sort of fraud is almost impossible to detect or prove.

So, when we see climate science using one method to analyze warming, and a different method to analyze the pause, this makes it likely that we are witnessing cherry picking of the methods, and what is being reported are false positives. Specifically, it is very likely the previous warming was not significant. It was an artifact of the statistical model.
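To put a number on the 20-methods point, here is a minimal Python sketch. It assumes the 20 methods behave like independent tests at the 5% level, which is my simplification; methods applied to the same data are usually correlated, so treat the result as a rough guide rather than an exact figure. The function name is made up.

```python
def chance_of_any_false_positive(n_methods, alpha=0.05):
    """Chance that at least one of n independent tests is 'significant' by luck.

    Assumes no real effect and independence between the methods.
    """
    return 1.0 - (1.0 - alpha) ** n_methods


expected_hits = 20 * 0.05   # on average one false positive among 20 methods
print(expected_hits, round(chance_of_any_false_positive(20), 3))  # prints 1.0 0.642
```

So under these assumptions, trying 20 methods gives roughly a two-in-three chance of turning up at least one “significant” result when nothing is there.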

To equate climate science to astrology is of course an insult to Astrology. We calculate the earth’s future tides to great precision using the techniques developed by astrology. These were the same techniques that early humans used to predict the seasons. Astrology has a bad name because these techniques have also been applied to personal horoscopes, where they have proven less successful.

Climate Science adds “Science” to its name because it isn’t science. It is only pretending to be one. None of the true sciences need add “Science” to their name. Climate Science is like the Peoples Democratic Republic. The PDR adds “Democratic” to its name because it isn’t a Democracy. It is only pretending to be one.

Astrology in contrast allows us to predict the specific future state of chaotic systems with a reasonably high degree of accuracy. Something that remains for all practical purposes impossible in all other branches of science.

“””””…..Geoff Sherrington says:

February 7, 2014 at 11:13 pm

Steven Mosher says: February 7, 2014 at 7:17 pm

“just banish the nonsense of a result being statistically significant. There is no such thing. It’s a bogus tradition.”

……………………..

One gets the impression that ye olde thermometers could be read to 0.05 degrees, when 1 degree was more like it.

Or that Argo floats are accurate to 0.004 deg C as I recall. Utter BS.

……………….”””

Don’t know what sort of thermometers would be considered olde, or good to one degree or perhaps 0.05 deg.

But it is not that improbable that the thermometers in the Argo floats can resolve, and are repeatable to, 0.004 deg. C

Absolute calibration accuracy is somewhat difficult, but that does not matter a jot in climate studies. When you are measuring “anomalies” who cares what the calibration accuracy is; repeatability and resolution are all that matter; the absolute Temperatures are thrown out with the bath water.

A more important question is; just WHAT Temperature is the thermometer measuring ?? Is it measuring only its own temperature, or the temperature of something else you are more interested in.

So I don’t know that Argos are reading what people think, to 0.004 deg. C

As for P values and null hypotheses.

I believe that a P value is an intrinsic statistical property of some known data set, defined precisely in statistical mathematics text books.

I don’t believe it tells you anything about anything else; not in that data set.

And we all know Einstein’s exhortation; “No amount of experiment can prove me right, but a single experiment can prove me wrong.” Or words to that effect.

It takes zero stats to understand that co2 is a problem.

Mosher,

This comment is completely sense-free and indicative of your mindset. All the models agree that without a positive feedback increasing atmospheric water vapor, CO2 by itself is no problem at all. Have you checked to see if atmospheric water vapor has increased as CO2 increases? News flash: It has NOT.

Check out NASA’s NVAP-M study, which is full of bad news for you and your ilk.

Steven Mosher:

just banish the nonsense of a result being statistically significant. There is no such thing. It’s a bogus tradition. You do your test. You report the uncertainty.

The 5% and 1% conventions persist only because a workable improvement has never been demonstrated. Claims that this or that procedure (e.g. some variety of Bayesian inference, or false discovery rate) is an improvement fall apart when you consider that millions of statistical decisions (including medical diagnoses and decisions whether to publish papers) are made daily.

Because of random variability and the vast scale of research and applications, every procedure has the risk of a high rate of error. And that’s when the procedures (diagnostics, procedures) are performed without flaw.

Of far greater concern is the suppression of reports of vast numbers, of uncounted “negative” results.

Legions of papers and book chapters have been published on these problems.

Tom wrote: “One especially damning analysis, published in 2005, claimed to have proved that more than half of published scientific conclusions were actually false.”

In science, we never prove that a theory (or conclusion) is correct; we accumulate a body of experimental evidence that is consistent or inconsistent with the theory (or conclusion). When we are unable to reject the null hypothesis using the data with p<0.05 (or some other more appropriate level), the experimental data is usually considered to be INCONCLUSIVE – a result which generally doesn't make a theory or conclusion FALSE. For example, experiments at the Tevatron were unable to provide conclusive evidence for the existence of the Higgs boson, but that work certainly didn't prove that the Higgs didn't exist. There is a big difference between a conclusion being false and the more appropriate phrase you used: "false discovery rate". The fact that experiments that don't produce statistically significant results frequently don't get published certainly means that we need to interpret p values carefully.

If a large clinical trial with a drug fails to show efficacy with p<0.05, that doesn't mean that the drug didn't provide some benefit. That clinical trial was preceded by years of smaller trials in people and animals indicating a large [very expensive] clinical trial was warranted. Based on the earlier trials, a sponsor chooses how many patients to enroll in a large clinical trial so that they have good prospects of showing efficacy with p<0.05. Only an idiot would think that drug companies are running large clinical trials solely based on the hope of obtaining a positive result by chance in 1 out of 20. Note that the FDA usually requires TWO such clinical trials for approvals (1 out of 400). (If a new drug treats a life-threatening condition for which no other treatment has been shown to be effective, they request a second large clinical trial be run after approval.) The authorities now require that all data from all clinical trials be placed in a public repository (www.clinicaltrials.gov). Only the naive would automatically conclude that a failure to demonstrate efficacy with p<0.05 in one or more clinical trials proves that a drug would not be useful for a different patient population or at a different dose.

The propagandists who say that statistically significant warming was or was not observed over periods X or Y don't really understand statistics. If they presented the 95% confidence intervals, they would find that the confidence intervals for many periods overlap to a significant extent! Whether a warming rate of 0 is or is not included in one of these confidence intervals isn't particularly important; the central value and our confidence in the central value is important.

Real world science: two wrongs don’t make a right.

Climate science: a whole lot of wrongs modeled, averaged, filtered, tweaked and adjusted are somehow right.

“Only an idiot would think that drug companies are running large clinical trial solely based on the hope of obtaining a positive result by chance in 1 out of 20.”

Dunno. If the trial costs $1m, and the drug has a potential market able to return $100m, it makes sense from a financial point of view, if not an ethical one. Not that I think very many of them do – unconscious cognitive biases are more than sufficient for people to fool themselves without anyone playing such dangerous games.

One thing that can happen is that the company does a trawl of ten thousand different preparations looking for efficacy, with maybe a 1-in-20,000 chance for each. In vitro tests narrow it down to a sample with 1-in-1000 odds, animal tests whittle it down to 1-in-50, then a human trial gets you to 1-in-2. Considering where you started, that’s a hell of an improvement in confidence, and a pretty fair improvement in the odds, which for a life-threatening condition is not to be dismissed. But the point remains – where you end up depends on where you started. P-values only give a rough idea of the size of jump in confidence, they don’t tell you what the final confidence is. And people are assuming wrongly that 5% means a 5% probability of error.
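A rough Python sketch of why the starting point matters, using Bayes' rule in odds form. The function name `update` and the likelihood ratio of 20 are hypothetical choices made only for illustration; they are not taken from any real trial.

```python
def update(prior_prob, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior_prob / (1.0 - prior_prob)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)

# The same piece of evidence (assumed likelihood ratio of 20) applied to the
# four starting points mentioned in the comment above.
for prior in (1 / 20000, 1 / 1000, 1 / 50, 1 / 2):
    print(f"prior {prior:.5f} -> posterior {update(prior, 20):.5f}")
```

The jump in confidence is the same in every row, but the final confidence ranges from about one in a thousand to better than nineteen in twenty.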

The way they actually do it is actually the right thing to do. The problem is that people not only misunderstand p-values, they misunderstand the purpose of the scientific journals too. The idea of the journals is not to provide a stamp of authority on reliable science – it is to report interim results for checking by one’s scientific peers. The journal peer review is a purely editorial function to confirm that it is worth the journal audience’s time to look at. But the purpose in publishing it is to allow other researchers to try to replicate it, extend it, debunk it, generalise it, etc. Only after it has survived this challenge can it be considered ‘accepted’. And as such, the confidence level required is not that needed for science to be ‘accepted’ (the idea that a 5% error rate would be tolerable in science is laughable!), but it only needs to be sufficient to say ‘this is worth looking at’. A p-value of 5% ought to shift your confidence (from wherever it starts) by a noticeable amount. It might say “this formerly unlikely possibility is now somewhat less unlikely”, or it might say “this former contender is now the leader”, or it might say “what was formerly only the most likely explanation is now quite strongly confirmed.”

And there’s absolutely nothing wrong with them doing that, so long as you don’t go round thinking that papers in journals are to be considered “settled science”, or even that they’re 95% sure. They’re work in progress, and we *expect* a large fraction of them to be wrong. They’re supposed to be. We’re only saying they’re worth checking out, we’re not saying they’re true.

The appropriate trade-off point depends primarily on how many potential results arise. You need to cut the number down so that there are enough to keep everyone busy, but not so many that people can’t keep up with the field. Some fields like particle physics generate a huge number of possible results, so they set stringent levels in order that people only spend time on the very best prospects. Other areas can afford to be more relaxed. It depends too on the potential benefits if it happens to be true – long shots are sometimes worthwhile.

Clearly, a 5% error rate is not sufficiently low, since science commonly chains together many results in longer arguments. A mere 14-step argument in which every step is only 95% likely to be right is more likely to be flawed than not. Even at 99% certainty per step, an argument can only be sustained for about 69 steps before the same is true. But many scientific arguments rely on hundreds of results. If CAGW was genuinely 95% confident (and it’s not), that might arguably be enough for politics (depending also on costs and benefits), but it’s *far* from enough to be “settled science”.
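The 14-step and 69-step figures can be checked with a few lines of Python. This is a sketch that assumes the steps are independent and equally reliable, which real arguments need not be; the function names are made up for the example.

```python
import math

def chain_confidence(p_step, n_steps):
    """Probability that every one of n independent steps is right."""
    return p_step ** n_steps

def steps_until_coin_flip(p_step):
    """Smallest number of steps at which the chain is no better than 50/50."""
    return math.ceil(math.log(0.5) / math.log(p_step))

for p in (0.95, 0.99):
    n = steps_until_coin_flip(p)
    print(p, n, round(chain_confidence(p, n), 4))
```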

So are those P 0.5 surface station results we found just lucky, Anthony?

Ah gots to know!

Having read the comments on this thread I think I’m broadly in agreement with Nick Stokes. I’m not sure if this helps or makes things any clearer but it might be worth looking at an example.

The SKS trend calculator gives the 1996-2013 UAH trend as

0.120 ±0.188 °C/decade (2σ)

Using the conventional P threshold of 0.05 the trend is not significant. However, the result suggests that the probability that the trend is greater than ZERO is around 90%. In other words it is far more likely to be warming than not. Even the RSS trend since 1996 has a higher probability (~60%) that the trend is greater than ZERO.

That said, it’s unlikely that the trends are as high as those projected by the IPCC models.
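The figure of around 90% quoted above can be reproduced with a short Python sketch. It assumes the trend estimate is normally distributed with the quoted 2σ half-width, and it reads that distribution as a statement about the true trend, an interpretation that is itself disputed in this thread.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

trend = 0.120      # °C/decade, central estimate quoted above
two_sigma = 0.188  # °C/decade, quoted 2-sigma half-width
sigma = two_sigma / 2.0

z = trend / sigma                  # distance of zero from the estimate, in sigmas
prob_above_zero = normal_cdf(z)    # area of the assumed normal curve above zero
print(round(z, 3), round(prob_above_zero, 3))
```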

p-value isn’t a probability and it certainly doesn’t show your theory has any merit.

There are many things wrong with using p-values.

For starters, I suggest reading:

http://wmbriggs.com/blog/?p=11261

http://wmbriggs.com/public/briggs.p-value.fallacies.pdf

http://wmbriggs.com/blog/?p=8295

And a zillion others just like them.

Mark T says:

February 8, 2014 at 9:10 am

Maybe that really is what separates engineers from the rest of the scientific community (as noted above): we are specifically trained to root out meaningless (spurious) correlations, and as such, always look for alternative answers that better describe the problem at hand.

Mark,

Truth!

Mac

John Finn:

At February 8, 2014 at 5:05 pm you say

Well, yes. Everybody knows that. But do you have anything to add to the thread?

Richard

It is the age old case. Statistics should be left to statisticians. Do not expect a scientist to be a statistician. The job of the Scientist is different to that of the statistician. The scientist gathers the data and interprets it the best he can in terms of proposing mechanisms and theories.

It is left to separate works by statisticians to carefully decipher the statistical significance of the result.

Finally, even if some result was statistically insignificant, that does not mean that it is incorrect and should be rejected. There IS ALWAYS the possibility.

Just trying to clarify things for you, Richard. You appear to think statistical significance is a “knife-edge” issue but it’s a bit more blurred than that. In an earlier post Nick writes

Nick’s right. A “true” trend above 2.4 degrees per century is just as likely as a trend below ZERO – though neither is very likely.

“Using the conventional P threshold of 0.05 the trend is not significant. However, the result suggests that the probability that the trend is greater than ZERO is around 90%. In other words it is far more likely to be warming than not.”

No, that’s the misunderstanding that everyone keeps making about p-values. It doesn’t mean that the probability of the trend being greater than zero is 90%. It means that the probability of seeing a less extreme slope if the true trend is actually zero (and the noise fits a certain statistical model) is 90%. (The SkS calculator apparently assumes ARMA(1,1) noise.)

The two probabilities are different. Consider this silly example. Suppose I want to know what the true slope of the global mean temperature anomaly is, and I decide to estimate it using the rather strange method of throwing two dice and adding the results up. I get a 5 and a 6 making 11. If the true temperature trend is zero, what is the probability of me getting a less extreme value than this? The answer is 33/36 = 92%. Not quite ‘significant’, but not far off. But what does that tell me about the probability of the true trend being zero?

That’s an extreme example, but illustrates the point that there’s not necessarily any connection between p-values and the probabilities of the null and alternative hypotheses. The p-value is one of the most commonly misunderstood statistics.
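The 33/36 figure in the dice example can be checked by enumerating all outcomes. A minimal Python sketch (the variable names are just for illustration):

```python
from itertools import product

# All 36 equally likely sums of two fair six-sided dice.
outcomes = [a + b for a, b in product(range(1, 7), repeat=2)]

observed = 11
# Count the outcomes that are less extreme (smaller) than the observed sum.
less_extreme = sum(1 for s in outcomes if s < observed)

print(less_extreme, len(outcomes), round(less_extreme / len(outcomes), 3))
```

Nothing in this calculation refers to temperature at all, which is the point of the example.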

Is this intended for me?

First I haven’t got any theory. I’m just providing a very basic analysis of the UAH trend test statistic. Secondly, if the P-value doesn’t represent a probability what does it represent? A P-value of less than 0.05 suggests that the probability of obtaining a given test result by chance is less than 5%.

There aren’t “many things wrong with using p-values”. There are many ways that p-values can be misinterpreted. There are also times when the value of p-values can be exaggerated. For example, the claims that warming has stopped since 1997 or 1998 or whatever are stretching things a bit.

Eric Worral, thanks for thought-link to RA Fisher. From the chronology in Wiki it looks as if his time spent thinking about Eugenics preceded his stats work. As luck would have it, the first article that I came across from the Eugenics Review Journal was by Fisher:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2986993/pdf/eugenrev00372-0017.pdf

It is salutary to consider what would be happening now, if the concept of evolutionary theory had emerged only in the 1950s: the Eugenic catastrophists would have loads of colour graphics from giant computer simulations of ‘the future’ to convince the decision makers. The simulations would all be based on straightforward mathematics [with just a few parameterisations ;-) ].

I can see quite a few of our present day scientists and politicians being dragged onto the rocks.

It’s kind of fun mapping the actors across from the current global warming farce; will Mosher make it through?? Not with thoughts like “With regard to the ipcc. It takes zero stats to understand that co2 is a problem”.

Ok- I understand the difference but I deduced the 90% from the error bars not the p-value.

However, it is still true to say that likelihood that the observed UAH data is from a population where the true trend is ZERO is the same as the likelihood that it is from a population where the true trend is 0.24 degrees per decade.

The problem is that statistical tests, in three different versions, actually supply answers to three different questions:

Pearson-Neyman inference answers the question “What should I do?”

Bayesian methodology answers the question “What should I believe?”

Likelihood answers the question “How strong is this evidence?”

Any inference method based on p values is probably inappropriate for answering the sorts of questions that science and engineering generally ask. You all do know that Ph.D. scientists often have a pretty thin to nil background in probability and statistics do you not? At the university where I currently work we have a graduate level course aimed for Ph.D. candidates and Post-Docs. A few faculty even sign up. Even so, I think this course covers mainly the classical sort of inferential statistics using p-values, confidence intervals, and rejection regions.

I read examples of this sort all the time, and it appears that no one observes that the “method” in question here has no construct validity. The dice have less (very much less) validity in measuring temperature trends, than say, using a thermometer. Yes, it is a silly example. A person can point me toward all sorts of statistical measures that demonstrate significance of many things, but without a testable theory that explains the physical mechanism involved, I’m unlikely to be convinced.

“However, it is still true to say that likelihood that the observed UAH data is from a population where the true trend is ZERO is the same as the likelihood that it is from a population where the true trend is 0.24 degrees per decade.”

Only if the data happens to be of the form of a linear trend plus ARMA(1,1) noise, which is to some degree begging the question. If you assume trend+ARMA(1,1) you get one answer, if you assume, say, no trend+ARIMA(3,1,0) as Doug Keenan did you get a completely different answer, which actually fits the data somewhat better, but for which the trend is zero by definition.

When you get out what you put in, that’s an indication that the data doesn’t contain a definitive answer, and the answer you appear to be getting is illusory; an artefact of the method you’re using. *If* there’s a non-zero trend, this gives you an estimate of it, but it can’t tell you if there’s a trend.

For that, you need an accurate, validated physical model of the background noise statistics, which we don’t have. So all this analysis is a waste of time.

“I read examples of this sort all the time, and it appears that no one observes that the “method” in question here has no construct validity. The dice have less (very much less) validity in measuring temperature trends, than say, using a thermometer.”

Of course. That was the point. All measurement methods provide some balance of signal and noise. I picked a method for my example that shifted the balance all the way to the end – so that it was *no* signal and *all* noise. And yet, you can still get a significant p-value from it.

The point is that a p-value doesn’t tell you if your measurement method is any good – it assumes it. So even a totally rubbish method can still seem to work, and people who only look at p-values mechanically, without understanding, can be easily misled.

John Finn February 9, 2014 at 3:08 am:

“Is this intended for me?”

Not particularly but I see some reading might be in order.

“if the P-value doesn’t represent a probability what does it represent? A P-value of less than 0.05 suggests that the probability of obtaining a given test result by chance is less than 5%.”

In frequentist terms you aren’t permitted to call it a probability if only because you aren’t allowed to assign probabilities to unobservables. Its definition is rather wordy and not readily available online that I can see but goes something like: the probability of seeing the statistic (from which the p-value is derived) equal to or larger than the one found *given* the parameters in question (means or slopes in a linear regression) are equal *if* the experiment (whatever) was repeated an *infinite* number of times.

Even so, why on Earth would you want a regression if the purpose is not prediction? Can’t you just look at the data with your eyes? Are temps higher on the left or right? Are they equal? Just look!

“There aren’t ‘many things wrong with using p-values’. There are many ways that p-values can be misinterpreted.”

Their interpretation is pointless. The p-value is answering a question regarding the quality of the model parameters and not at all answering the question of whether the model is useful or valid. The latter answer is what people generally want to know. The former: who cares?

DAV:

“Their interpretation is pointless. The p-value is answering a question regarding the quality of the model parameters and not at all answering the question of whether the model is useful or valid. The latter answer is what people generally want to know. The former: who cares?”

This happens all the time in science, not just in statistics: the question that you can answer is different from the question that you want to answer.

Matthew R Marler,

“This happens all the time in science, not just in statistics: the question that you can answer is different from the question that you want to answer.”

Sad, isn’t it?

Hypothesis testing is often mis-used. It is great for initial studies that are looking to see if further study is warranted. Of course, you have to also consider the power of the test, which is the likelihood of correctly detecting an effect of a certain size. But, design the initial study correctly, with an appropriate power, using a well-developed and well-understood methodology, and you can get a good idea of whether you’ve been suckered by randomness or not.
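As a rough illustration of what “power” means here, the following Python sketch computes the power of a two-sided one-sample z-test. The test type, the effect size of 0.5 standard deviations and the sample sizes are all assumptions picked for the example, not anything taken from the article.

```python
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a two-sided one-sample z-test to detect a true mean shift `effect`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for alpha = 0.05
    shift = effect * n ** 0.5 / sigma               # true shift in standard-error units
    # Probability that the test statistic lands in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

for n in (8, 16, 32, 64):
    print(n, round(z_test_power(effect=0.5, sigma=1.0, n=n), 3))
```

With no real effect the function returns alpha itself, which is the false positive rate the test was designed to have.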

There is something not mentioned in the article that goes beyond the issue of statistical false positives and false negatives. I think scientific studies, tests, evaluations, and so forth, are often beset by methodology issues and so the statistical analysis ends up being garbage-in garbage-out. A statistical test/technique is only as good as its assumptions, and one assumption is always that the researchers/engineers know their test apparatus inside and out, and they thoroughly understand how it interacts with the various controlled and uncontrolled factors.

In reply to Nick Stokes’s 5% comment.

http://motls.blogspot.com/2010/03/defending-statistical-methods.html

Lubos Motl states physicists routinely look for 5 sigma significance, else the results aren’t worth publishing.

Alan McIntire,

You might also add that Lubos is talking about discrepancies between observation (O) and prediction (P) and not about the niceness of the parameters used to create P.

Nullius in Verba wrote: “Dunno. If the trial costs $1m, and the drug has a potential market able to return $100m, it makes sense from a financial point of view, if not an ethical one. Not that I think very many of them do – unconscious cognitive biases are more than sufficient for people to fool themselves without anyone playing such dangerous games.”

You are forgetting that the FDA usually requires two clinical trials at 20 to 1 odds. It also helps to recognize that there are at least two layers of doctors between a drug company and the patients needed for a clinical trial, the research doctors at the hospital agreeing to host part of a clinical trial and the doctors referring patients to those clinical trials. Hospitals with an excellent reputation can often choose which promising drug they are willing to help develop. It is hard to rapidly accrue patients when there isn’t much excitement about a new candidate, and slow trials eat up patent life and potential future profits. However, I suspect that one can now buy a clinical trial somewhere in Asia for almost any drug for enough money.

Nullius also wrote: “unconscious cognitive biases are more than sufficient for people to fool themselves”. Exactly. And if several careers and the company’s stock price depend on the success of a shaky clinical candidate, the biases are not unconscious.

Doesn’t an ARIMA(3,1,0) model imply non-stationarity? Non-stationarity appears to me to be physically unreasonable for our planet, given the negative Planck feedback and the fact that we haven’t experienced a runaway greenhouse like Venus. We’ve got a 4 billion year history (about 0.5 billion of that with less adaptable large multicellular plants and animals living on land) for temperatures to randomly drift further and further from the initial conditions and Keenan wants us to worry about non-stationarity in the past century?

Anthony

Part 1

AGW hypothesis is a blatant disregard of statistics. The most glaring example of false positive. NOAA global temperature data from 1880-2013 show a two-sigma deviation (P < 0.05) did not occur until 1998. All previous warming (positive anomalies) are trivial and statistically insignificant. However, the margin of error in the data is +/- 0.09 C. If we take this into consideration, NONE of the data is statistically significant.

Part 2

The largest positive anomaly on record is 0.66 C in 2010. The threshold value to be statistically significant is > 0.69 C including margin of error. Regression analysis is misleading because it is essentially curve fitting a trend line in a scatter diagram. It is well known that a random walk function can create trend lines. Even technical analysts can be fooled by a random walk function.
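The remark that a random walk can produce trend lines is easy to check by simulation. The Python sketch below fits an ordinary least-squares line to pure random walks and counts how often a naive t-test, which wrongly treats the residuals as independent, calls the slope significant. The series length, number of series and seed are arbitrary choices for illustration, and the function names are made up.

```python
import math
import random

def ols_slope_t(y):
    """t-statistic of the OLS slope of y against time 0..n-1 (naive, assumes iid errors)."""
    n = len(y)
    x_mean = (n - 1) / 2.0
    y_mean = sum(y) / n
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    sxy = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    rss = sum((v - (intercept + slope * i)) ** 2 for i, v in enumerate(y))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return slope / std_err

def fraction_spuriously_significant(n_series=500, length=100, seed=0):
    """Share of trendless random walks whose fitted slope passes |t| > 1.96."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_series):
        walk, level = [], 0.0
        for _ in range(length):
            level += rng.gauss(0.0, 1.0)  # each step is pure noise, no drift
            walk.append(level)
        if abs(ols_slope_t(walk)) > 1.96:
            hits += 1
    return hits / n_series

print(fraction_spuriously_significant())
```

A correctly specified test would flag about 5% of these series; the naive one flags far more, because the steps of a random walk accumulate.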

In physical, biomedical and social sciences (meaning all of sciences) the data will be dismissed as trivial and the hypothesis (AGW) as inconclusive. But this is climate science with its own rules. The debate whether global warming is caused by man or nature is moot and academic. It’s like debating if your poker winnings are due to superior strategy or to cheating when in fact your winnings are not different from what you would expect from chance alone.

Steven Mosher says:

February 7, 2014 at 7:17 pm

just banish the nonsense of a result being statistically significant. There is no such thing. It’s a bogus tradition.

You do your test. You report the uncertainty.

<<<<<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Amen to that. Statistics have become a means of propping up weak science.

Correction:

You do your test; you do your statistics; you have your data; there IS no uncertainty.

Unless you discard your test data.

Frank,

“You are forgetting that the FDA usually requires two clinical trials at 20 to 1 odds.”

The claim I was responding to was “Only an idiot would think that drug companies are running large clinical trial solely based on the hope of obtaining a positive result by chance in 1 out of 20.” My point being that 1-in-20 can work if the potential profits are more than 20 times the cost of the trial. The principle still stands with two trials: 1-in-400 works if the profits are more than 400 times the cost of the trial. So it’s not a silly question, although one would hope that medical researchers are doing it for more than the money.
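The break-even arithmetic can be written out in a couple of lines of Python. The $1m cost and $100m payoff are the illustrative numbers from earlier in the thread, not real figures, and the function name is hypothetical.

```python
def expected_gain(p_success, payoff, cost):
    """Expected profit of a gamble that costs `cost` and pays `payoff` with probability p_success."""
    return p_success * payoff - cost

cost, payoff = 1e6, 100e6  # hypothetical numbers from the comment thread

print(expected_gain(1 / 20, payoff, cost))   # one trial passed by chance alone
print(expected_gain(1 / 400, payoff, cost))  # two trials passed by chance alone
```

With these numbers the 1-in-20 gamble has a positive expected return and the 1-in-400 gamble does not, matching the 20x and 400x thresholds described above.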

“Hospitals with an excellent reputation can often choose which promising drug they are willing to help develop.”

So the trial subjects and treatment regime for selected drugs are atypical? That sounds… concerning. Not necessarily a problem, but if drug companies push the most potentially profitable candidates on the best hospitals, you could get selection effects. But that’s a different issue.

“Doesn’t an ARIMA(3,1,0) model imply non-stationarity? Non-stationarity appears to me to be physically unreasonable for our planet”

Yes, and yes it is, but in time-series analysis it is common to use a non-stationary model for data that has to be stationary for physical reasons, because what you’re really doing is approximating the behaviour of time segments too short to resolve their characteristic roots, and doing so avoids all the mathematical difficulties associated with such situations.

The situation arises when one of the roots of a stationary process is very close to, but not on, the unit circle. If you collect enough data, you can locate it precisely enough to tell that it’s actually inside the circle and the series is stationary. But if you only have a short segment of data, too short to fully explore its range of behaviour, the result appears indistinguishable from a non-stationary process. And all the same problems that mess up estimation for non-stationary processes also mess up estimation for processes that are only approximately non-stationary.

It’s kind of like the way some people fit a linear trend line to a temperature graph. If you extend the graph back far enough, you wind up with temperatures below absolute zero, which make no physical sense. Extend the linear trend far enough forwards and you get temperatures of thousands of degrees, which is not very realistic either. So for exactly the same sort of physical reasons, a linear trend is physically impossible too. And yet nobody objects when people propose linear trend+ARMA(1,1), while they always raise the objection with ARIMA(3,1,0). I suspect it’s just a question of familiarity, but it’s still darn inconsistent. ;-)

ARIMA(3,1,0) is an approximation for a short stretch of data in the same way that “linear trend + whatever” is. Everyone agrees it’s not physically plausible as a general model – but until we have a validated physics-based model of the noise, it’s about as good as we’re going to get. It’s what applying the standard textbook methods of time series analysis will give you, anyway.

I should perhaps add that Koutsoyiannis has some interesting ideas on long term persistence models as even better fits than simple ARIMA, that are worth knowing about. But it’s still a form of curve-fitting, rather than physics.

Nullius in Verba says: February 10, 2014 at 11:58 am

“So for exactly the same sort of physical reasons, a linear trend is physically impossible too. And yet nobody objects when people propose linear trend+ARMA(1,1), while they always raise the objection with ARIMA(3,1,0).”

There is a physical theory as to why there is a trend that was not there before. It’s the theory we’re trying to test.

ARIMA(3,1,0) builds the unphysicality into the noise model. There is no reason to expect that the noise has changed in nature in the last 100 years. That’s a null hypothesis which itself needs explaining.

Nick Stokes has done a good job of illustrating how ‘scientists’ get themselves muddled up with statistics; they can calculate a P-value well enough but that doesn’t mean they have got the null hypothesis right nor that they understand what their statistical calculations mean. We won’t get into the misapplication of statistical methodology: linear trends for climate? Give me a break…

Mosher is simply illustrating his lack of critical thinking. “CO2 is a problem”: for who? How and why? Does it matter?

Fairly typical of Mosher though, cryptic (or even throwaway) lines that don’t stand up to deeper scrutiny.

As for the subject at hand, I think it’s well understood by anyone with proper statistical training that scientific literature is littered with poor and misapplied statistics. However, merely using math seems to give people a greater confidence that something must be correct. We all need a reminder that math is simply another language, and just because something is consistent or works mathematically doesn’t mean that there is a physical translation or truth behind said mathematics. Statistics can be included in this broader statement.

“There is a physical theory as to why there is a trend that was not there before. It’s the theory we’re trying to test.”

There’s a physical theory for why the temperature was below absolute zero a few thousand years ago?! I don’t think so.

“ARIMA(3,1,0) builds the unphysicality into the noise model.”

No more so than a linear trend.

The classic non-stationary process is the random walk, proposed to explain Brownian motion. And yet it is obviously the case that a pollen grain or a molecule of air cannot wander infinitely far. Why do you suppose it is that Einstein “built the unphysicality into the noise model”? Obviously, because it is a good approximation.

We make such unphysical approximations all the time – infinite perfectly flat planes, straight lines, perfect spheres, frictionless surfaces, rigid bodies, elastic collisions, flat spacetime, infinite crystal lattices, instantaneously propagating gravity, point particles, point velocities, monochromatic plane waves, etc. etc. It’s unphysical only because it’s an approximation. This is just another one.

It doesn’t matter that it’s *technically* unphysical, because it’s a close approximation of something that *is* physical (but mathematically messy).

“There is no reason to expect that the noise has changed in nature in the last 100 years.”

There’s no reason to expect it hasn’t, either.

Consider a process x(t) = k(t-1)*x(t-1) + random(t-1) where x(t) is the data at time t and k(t) is a function that varies slowly over time, hovering around and usually just below 1. Sometimes it will exceed 1 and the process wanders away for a bit, but after a while it will drop down again and x will return towards the origin.

The AR(1) process is just the above when k(t) has a constant value, but what physical reason do you have to assume it’s constant? What physical reason do we have to suppose the coefficients in the SkS ARMA(1,1) are constant? Don’t they depend on physical parameters, which can change?
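A small Python sketch of the process described above, for anyone who wants to play with it. The Gaussian noise, the sinusoidal form of k(t) and all the numbers are assumptions made up for the example; nothing here is fitted to temperature data.

```python
import math
import random

def simulate(k_values, seed=0):
    """Simulate x(t) = k(t-1) * x(t-1) + noise(t-1), starting from x(0) = 0."""
    rng = random.Random(seed)
    x = [0.0]
    for k in k_values:
        x.append(k * x[-1] + rng.gauss(0.0, 1.0))
    return x

steps = 2000
# Hypothetical slowly varying coefficient: mostly just below 1, occasionally above it.
k_slow = [0.98 + 0.03 * math.sin(2.0 * math.pi * t / 500.0) for t in range(steps)]
k_const = [0.98] * steps  # ordinary AR(1) with a constant coefficient, for comparison

varying = simulate(k_slow)
constant = simulate(k_const)
print(max(abs(v) for v in varying), max(abs(v) for v in constant))
```

Both runs use the same noise sequence, so any difference between them comes only from letting the coefficient drift.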