To make science better, watch out for statistical flaws
First of two parts
As Winston Churchill once said about democracy, it’s the worst form of government, except for all the others. Science is like that. As commonly practiced today, science is a terrible way of gathering knowledge about nature, especially in messy realms like medicine. But it would be very unwise to vote science out of office, because all the other methods are so much worse.
Still, science has room for improvement, as its many critics are constantly pointing out. Some of those critics are, of course, lunatics who simply prefer not to believe solid scientific evidence if they dislike its implications. But many critics of science have the goal of making the scientific enterprise better, stronger and more reliable. They are justified in pointing out that scientific methodology — in particular, statistical techniques for testing hypotheses — has more flaws than Facebook’s privacy policies. One especially damning analysis, published in 2005, claimed to have proved that more than half of published scientific conclusions were actually false.
A few months ago, though, some defenders of the scientific faith produced a new study claiming otherwise. Their survey of five major medical journals indicated a false discovery rate among published papers of only 14 percent. “Our analysis suggests that the medical literature remains a reliable record of scientific progress,” Leah Jager of the U.S. Naval Academy and Jeffrey Leek of Johns Hopkins University wrote in the journal Biostatistics.
Their finding is based on an examination of P values, the probability of getting a positive result if there is no real effect (an assumption called the null hypothesis). By convention, if the results you get (or more extreme results) would occur less than 5 percent of the time by chance (P value less than .05), then your finding is “statistically significant.” Therefore you can reject the assumption that there was no effect, conclude you have found a true effect and get your paper published.
As Jager and Leek acknowledge, though, this method has well-documented flaws. “There are serious problems with interpreting individual P values as evidence for the truth of the null hypothesis,” they wrote.
For one thing, a 5 percent significance level isn’t a very stringent test. Using that rate you could imagine getting one wrong result for every 20 studies, and with thousands of scientific studies going on, that adds up to a lot. But it’s even worse. If there actually is no real effect in most experiments, you’ll reach a wrong conclusion far more than 5 percent of the time. Suppose you test 100 drugs for a given disease, when only one actually works. Using a P value of .05, those 100 tests could give you six positive results — the one correct drug and five flukes. More than 80 percent of your supposed results would be false.
read more here: https://www.sciencenews.org/blog/context/make-science-better-watch-out-statistical-flaws
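To put rough numbers on the article’s 100-drugs example, here is a minimal simulation sketch (my illustration, not the article’s; assuming, for simplicity, that the one working drug is always detected):

```python
import numpy as np

rng = np.random.default_rng(0)
n_screens = 10_000   # repeat the 100-drug screen many times
alpha = 0.05         # conventional significance threshold

# 99 useless drugs: each has a 5% chance of a fluke "positive" result.
flukes = (rng.random((n_screens, 99)) < alpha).sum(axis=1)

# Assume the 1 working drug always tests positive (simplifying assumption).
false_share = (flukes / (flukes + 1)).mean()

print(f"average flukes per screen: {flukes.mean():.1f}")         # ~5
print(f"share of 'positives' that are false: {false_share:.0%}")  # ~80%
```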
Eric Worrall:
I agree with your post at February 8, 2014 at 3:01 am, which says
However, that has nothing to do with my dispute with Nick Stokes in this thread.
The statistical conventions adopted by climastrology are nonsense. However, they were used to show the existence of discernible global warming in the last century. Stokes is now claiming that those same conventions should not be used, because they now demonstrate that there has been no discernible global warming for at least 17 years.
This is an important issue which goes to the heart of the subject of this thread.
Appropriate statistical methods need to be applied to assess the time series of GASTA. And their appropriateness needs to be defined technically, not on the basis that they fulfill an agenda.
Richard
Hypothesis testing is best thought of in Bayesian terms. You start off with a prior belief in the conclusion. You perform an experiment or make an observation that adds or subtracts from your belief. Your posterior belief, the belief you should have that the conclusion is true after seeing the experiment, is your initial belief *plus* the increment from the experiment.
The p-value is an approximation to the size of the experimental increment. It is *not* the probability of the conclusion being true. The reason for using it as a filter on publication is to say “this experimental result is strong enough to shift your opinion significantly.” It does *not* say “this experiment shows that the conclusion is true.”
The 100 drugs trial example above is a classic example. You start off with a 1% confidence in each of the drugs. You perform the test, and at the end you have a 20% confidence in each of the drugs that passed. That’s a big increase in confidence, and well worth reporting, but if you started at 1% you’re only going to get to 20%, and as noted, that still means there’s an 80% chance you’re wrong.
For the mathematicians:
Bayes says that for two hypotheses H1 and H2 and an observation O…
P(H1|O) = P(O|H1) P(H1)/P(O)
P(H2|O) = P(O|H2) P(H2)/P(O)
so dividing one equation by the other
P(H1|O) / P(H2|O) = [P(O|H1) / P(O|H2)] [P(H1)/P(H2)]
Take logarithms
log[P(H1|O) / P(H2|O)] = log[P(O|H1) / P(O|H2)] + log[P(H1)/P(H2)]
and we interpret this as
log[P(H1)/P(H2)] = prior confidence in H1 over H2
log[P(O|H1) / P(O|H2)] = confidence added by observation O in favour of H1 over H2
log[P(H1|O) / P(H2|O)] = posterior confidence in H1 over H2 after seeing the observation
If H2 is just the opposite of H1, then P(H2) = 1-P(H1), and we can translate the logarithmic confidence scale to probabilities using c = log[p/(1-p)] and back again with p = 1/(1+b^(-c)) where b is the base of the logarithms.
The p-value is just P(O|H2), the probability of the observation under the null hypothesis, and the smaller it is the more confidence we’ve just gained in the alternative hypothesis H1. As you can see, this assumes that the observation is fairly certain to occur under H1, so that log[P(O|H1)] is close to zero. If it’s not, p-values taken too literally can give misleading results. However, it’s usually intuitively obvious if that’s the case, and this sort of thing is only a big problem when researchers apply statistical calculations blindly without understanding how the evidence works.
Not that I’m saying that never happens…
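A minimal numeric sketch of the log-odds bookkeeping above, applied to the 100-drugs example (the 1% prior is the figure used above; taking P(O|H1)=1 and P(O|H2)=0.05 is an assumption for illustration):

```python
import math

def logit(p, base=10):       # c = log[p/(1-p)]
    return math.log(p / (1 - p), base)

def inv_logit(c, base=10):   # p = 1/(1 + b^(-c))
    return 1 / (1 + base ** (-c))

p_prior = 0.01    # 1% prior confidence that this drug works (H1)
p_obs_h1 = 1.0    # assume a working drug is certain to pass the test
p_obs_h2 = 0.05   # the p-value: chance a useless drug passes anyway

# Confidence added by the observation, in log-odds units.
increment = math.log(p_obs_h1 / p_obs_h2, 10)

posterior = inv_logit(logit(p_prior) + increment)
print(f"posterior confidence: {posterior:.0%}")  # ~17%, roughly the 20% quoted
```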
” Bob says:
February 7, 2014 at 9:05 pm
HankH says:
February 7, 2014 at 7:58 pm
“Wow, using the less than sign followed later in the sentence with a greater than sign in WordPress serves to delete what was between the two symbols”
Hank: It was probably not WordPress that interpreted your arrows to be html tags. The browser is the software charged with that duty, and pretty much all browsers will ignore non-existent html tags without an error statement. So, the left arrow “<” and right arrow “>” with something in between is interpreted as a tag. I don’t know of any way around it. It seems there should be an escape character that could be used.”
________
WordPress parses your comment through a filtered HTML generator before it passes to the comment file, deleting what it sees as incompatible code and recoding other things, like urls. The angle brackets are especially tricky because, if they were not screened, the entire blog page could collapse depending on what’s between them. The angle brackets are the fundamental code delimiters in html files. It’s not a browser issue, as it happens before your browser sees the parsed code. Sometimes it’s just easier to spell things out. For “greater than” and “less than” you could use (GT) or (LT), e.g. (LT).05. Square brackets are equally cumbersome, since some systems (PHPBB notably) use those as code delimiters.
“Sometimes it’s just easier to spell things out. For ‘greater than’ and ‘less than’ you could use (GT) or (LT), e.g. (LT).05.”
Yes, it’s not WordPress, just HTML. You can use special sequences, given here. For less than, use & lt ; but without spaces (< see). For GT, & gt ;.
David L said:
If AGW research followed the rules required of Pharmaceutical research, the entire dogma would have been rejected by the FDA years ago.
~ ~ ~ ~ ~ ~ ~ ~
The FDA gave the stamp of approval to Merck for Vioxx.
The FDA tried to close down Dr Burzynski’s cancer clinic, keeping him mired in legal battles for years while the Department of Health and Human Services was busy stealing his antineoplaston patents. (ref)
That’s the FDA and pharmaceutical industry research standards that you hold up as a role model. Not a good model.
Orkneygirl (@Orkneygal) says: February 8, 2014 at 2:44 am
“Nick Stokes and Steven Mosher, have made the same misinterpretation of what Fisher’s p value actually”
Not me. Fisher is quoted as saying:
“a low P value merely means that you should reject the null hypothesis; it does not actually tell you how likely the null hypothesis is to be correct.”
That’s exactly what I’m saying. P value tests can’t prove the null hypothesis correct. They can only usefully persuade you to reject.
So when you say:
“Fisher p values are about acceptance of the null hypothesis”
that’s exactly the opposite of what your quote is saying.
From the article:
Ah yes, the Slippery Slope Fallacy: if they stop believing in the results of medical science, they’ll believe in something worse. When I attended church I would get regular sermons on religious beliefs in things which were ipso facto preposterous, with failure to believe in them supposedly leading to immoral behaviour or, even worse, atheism.
The answer, of course, is that there are too many (medical) science articles that are unreproducible and/or rest on unreliable, marginal experimental results produced by poor use of statistical techniques. And those results are trumpeted by a small coterie of scientific journals which trade on “impact” instead of verifiability.
It’s an unvirtuous circle that scientific academies, if they had any use at all, would be trying to break. Instead, scientific academies are themselves stuffed with people who produced the poor research in the first place, and are co-opted to promote an orthodoxy of mediocre results and pour calumny on critics.
Khwarizmi on February 8, 2014 at 4:40 am
David L said:
If AGW research followed the rules required of Pharmaceutical research, the entire dogma would have been rejected by the FDA years ago.
~ ~ ~ ~ ~ ~ ~ ~
The FDA gave the stamp of approval to Merck for Vioxx.
The FDA tried to close down Dr Burzynski’s cancer clinic, keeping him mired in legal battles for years while the Department of Health and Human Services was busy stealing his antineoplaston patents. (ref)
That’s the FDA and pharmaceutical industry research standards that you hold up as a role model. Not a good model.
———–
You proved my point better than I did!!! Even as crappy as the FDA and Pharma are, applying their less-than-optimal standards, AGW would not hold water!!!!
John A : “…. leading to immoral behaviour or even worse, atheism.”
why is an atheist “worse” than an immoral christian?
Do you think that no one is capable of being moral without having a ‘representative’ of god to tell him what to do?
I think you need to check your null hypothesis.
As an engineer, I am somewhat bemused by all this statistical theory.
How many so-called “scientists” would let their children fly on an aeroplane that had a 95% probability of completing its journey, i.e. one that completed it 19 times out of 20 and crashed in flames the 20th time? Or even drive across a bridge that had a 1 in 10,000 chance of falling down when a car drove across it?
Would Mosher depend on feeling lucky if the fate of his offspring was at stake, and permit them to fly on the aforesaid aeroplane that had been designed and built by a bunch of climate McScientists who solemnly assured him that the p-value for it falling to pieces in mid-air was <0.05? Would Grant Foster? Or Michael Mann? I seriously doubt it. And yet they expect us to destroy our economies and hand over ever-increasing quantities of our hard-earned cash to the likes of Al Gore on equally flimsy evidence.
And then scientists look down on engineers for getting their hands dirty by applying science – of which in practically every case they require a vastly more profound understanding, for obvious reasons – to real world problems.
"Scientists" appear to resent the fact that engineers often regard much of their prognostication with amusement verging on contempt – AGW is a case in point, then wonder why.
Think on, as we say up here in Yorkshire.
A recent paper argues that the significance threshold for p-values should be reduced to 0.005 or even 0.001. Revised standards for statistical evidence, Valen E. Johnson. http://www.pnas.org/content/110/48/19313
“Recent advances in Bayesian hypothesis testing have led to the development of uniformly most powerful Bayesian tests, which represent an objective, default class of Bayesian hypothesis tests that have the same rejection regions as classical significance tests. Based on the correspondence between these two classes of tests, it is possible to equate the size of classical hypothesis tests with evidence thresholds in Bayesian tests, and to equate P values with Bayes factors. An examination of these connections suggest that recent concerns over the lack of reproducibility of scientific studies can be attributed largely to the conduct of significance tests at unjustifiably high levels of significance. To correct this problem, evidence thresholds required for the declaration of a significant finding should be increased to 25–50:1, and to 100–200:1 for the declaration of a highly significant finding. In terms of classical hypothesis tests, these evidence standards mandate the conduct of tests at the 0.005 or 0.001 level of significance. ”
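To see what those evidence thresholds mean in practice, here is a quick arithmetic sketch (my illustration, not Johnson’s calculation; the even prior odds are an assumption):

```python
# Posterior probability implied by a Bayes factor (evidence ratio),
# assuming even (1:1) prior odds between the hypotheses.
def posterior_from_bf(bayes_factor, prior_odds=1.0):
    post_odds = bayes_factor * prior_odds
    return post_odds / (1 + post_odds)

# 25-50:1 is Johnson's "significant" threshold, 100-200:1 his "highly
# significant"; a p of 0.05 corresponds to roughly 3-5:1 in his analysis.
for bf in (5, 25, 50, 100, 200):
    print(f"{bf:>4}:1 evidence -> posterior {posterior_from_bf(bf):.1%}")
```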
I’ve been working in industrial labs for 35 years and I’ve learned to start any statistical analysis the way you’d start a statics analysis of forces: with a drawing. With statistics, that means a plot of the data, plus a description of the statistical analysis and a justification for the assumptions inherent in it. Don’t even calculate a standard deviation if you haven’t looked to see whether the data are normally distributed! Did you see the assumption you made before calculating that standard deviation? (Hint: sometimes a transform of the data will give you a normal distribution.)
Industry is a little different from academia; see Ghost Busters.
I’ve added this article to my collection.
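A minimal sketch of that plot-first workflow (hypothetical skewed data; scipy’s Shapiro-Wilk test as one assumed way to check the normality assumption):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.lognormal(mean=0.0, sigma=0.7, size=200)  # skewed, lab-like data

# Step 1: look at the data before summarising it (histogram, probability plot).
# Step 2: test the normality assumption instead of silently making it.
stat, p = stats.shapiro(data)
print(f"Shapiro-Wilk, raw data: p = {p:.2g}")        # tiny p: not normal

# Step 3: a transform (here, log) often restores normality for skewed data.
logged = np.log(data)
stat, p = stats.shapiro(logged)
print(f"Shapiro-Wilk, log data: p = {p:.2g}")        # large p: looks normal
print(f"sd is meaningful on the log scale: {logged.std(ddof=1):.2f}")
```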
See also Revised standards for statistical evidence, by Valen E. Johnson, Proc. Natl Acad. Sci. USA, Oct. 9, 2013 (print Nov. 11, 2013), doi:10.1073/pnas.1313476110, and the discussion at WUWT, esp. Prof. Robert G. Brown’s comment, plus the article in Nature.
Steven Mosher:
“The notion that there is some special value that allows us to automatically accept or reject tests, is the culprit. you report 95%. from that one cannot deduce anything. You can observe that the result might be wrong 1 in 20 times. That tells you nothing about the truth of the matter.”
Actually, I thought that 95% was the figure that permitted the IPCC to propagate CAGW, which I understand is a hypothesis espoused by Mr. Mosher.
http://www.bbc.co.uk/news/science-environment-24292615
Solomon Green:
In your post at February 8, 2014 at 6:05 am you say
Yes! That was the point of my debate with Nick Stokes in this thread.
As I said to Eric Worrall at February 8, 2014 at 3:35 am
Richard
I like this article and would like to make a small point.
Researchers HAVE TO make progress in order to be successful. No paper is done in isolation. One paper builds upon another and you have to hurry up. There’s no time to take so much data as to be metaphysically sure that the relation is adequately described, i.e. p<0.00…01. There’s no real need, because the next bit of progress is going to be built upon the last bit, and if the last bit is wrong then the next bit won’t pan out. You’ll quickly discover that a mistake was made somewhere along the way. You won’t know where, but with a good understanding of the subject you can make some good educated guesses and efficiently reexamine past conclusions.
This is in contrast to Climate Research. New data doesn’t invalidate prior conclusions very well, because the new data isn’t generated by the thought process of the researcher. The feedback on the thought process is much thinner.
Greg Goodman
A small factoid: I am an atheist of 15 years’ standing.
What I am pointing out is that fallacious reasoning is not limited to churches, and the same fallacies wheeled out regularly to religious believers to keep them on the straight and narrow path also appear in scientific journals.
Sorry John, it’s often hard to detect satire in blog posts.
The parallels between religion and science are many.
In particular, those who are part of the flock of the church of AGW believe in “the science” like christians believe in “the word”.
Bald headed monks like brother Michael are revered as wise men.
Even outside the mess of climatology, science has become very much like the Church it has replaced.
Catweazel
You missed the point entirely. I’m basically with Briggs on this matter.
Example: suppose my null is that buttered toast will fall butter-side down half of the time.
Am I going to require high statistical certainty on such a matter? Nope.
With regard to the IPCC: it takes zero stats to understand that CO2 is a problem.
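For what it’s worth, the buttered-toast null above is easy to test formally; here is a minimal sketch with made-up numbers (scipy’s binomial test as the assumed tool):

```python
from scipy import stats

# Null: buttered toast lands butter-side down exactly half the time (p = 0.5).
drops, butter_down = 100, 58   # hypothetical data
result = stats.binomtest(butter_down, n=drops, p=0.5)
print(f"two-sided p-value: {result.pvalue:.3f}")  # ~0.13: no reason to reject

# The point being made: how much certainty you demand should scale with
# the stakes, and nothing much rides on toast.
```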
Say they test 1,000 hypotheses, all of which are false but for which a p value can be derived. Testing at the 95% confidence level, about 50 of the hypothesis tests will fall in the tails of the distribution, and thus 50 “discoveries” will have been made.
The problem is that it is those 50 results that will be written up and submitted to journals for publication. The editors of the journal will be, by definition, looking at 50 false conclusions, from which they will choose to publish the ones that…well, it really doesn’t matter which ones they publish, does it? They’re all wrong in this case, by definition.
Perhaps this is one reason so few published results survive the test of time?
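A minimal simulation sketch of that publication filter (under a true null a p-value is uniform on [0, 1], so the arithmetic is immediate):

```python
import numpy as np

rng = np.random.default_rng(42)
n_hypotheses = 1_000   # all false: the null is true in every case
alpha = 0.05

# Under a true null, p-values are uniformly distributed on [0, 1].
p_values = rng.random(n_hypotheses)
discoveries = (p_values < alpha).sum()

print(f"'discoveries' available to submit: {discoveries}")  # ~50, all false
```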
“For one thing, a 5 percent significance level isn’t a very stringent test. Using that rate you could imagine getting one wrong result for every 20 studies”
=====
It can be even worse than that. Sometimes there is no expected result, or there are multiple possibilities. For example, you might be curious whether blood donation raises or lowers the donor’s blood pressure. One might expect it to lower blood pressure by decreasing the volume of blood to pump. Or one might expect it to increase blood pressure, because many folks find sticking a needle in their arm to be stressful. If one starts with no expectation, there are two p=.05 probabilities — increase or decrease, not one. And one chance in 10 rather than one in 20 of achieving a “significant” result.
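A quick simulation sketch of that effect (assumed setup: no real effect, with “significant” claimed whenever either one-sided tail comes in below .05):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_studies, n_donors = 10_000, 30

# Blood-pressure changes with NO real effect: pure noise around zero.
changes = rng.normal(loc=0.0, scale=1.0, size=(n_studies, n_donors))
t, p_two_sided = stats.ttest_1samp(changes, popmean=0.0, axis=1)

# Claiming significance for whichever direction showed up halves the p-value.
one_sided = p_two_sided / 2
rate = (one_sided < 0.05).mean()

print(f"false positive rate, either direction counted: {rate:.1%}")  # ~10%
```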
Looks like it’s time for a reminder:
Over on the right side nav bar is a link to Ric Werme’s guide to WUWT. Among other goodies is a good list of HTML notes for getting characters like ‘<‘ to display.
Also, if you want to try out <pre> and what not, please do it at the “Test” page – see the link at the top nav bar. That has most of my HTML notes too.
Sorry about breaking the font size and indentation there last year. I forget how I did it.
Greg said:
“Sorry John, it’s often hard to detect satire in blog posts.”
It wasn’t actually satire. John A was merely pointing out the position from the church’s viewpoint. Indeed, atheism is probably worse than any sin in the eyes of many, if not most, churches.
Ah, Mosher, ever the myopic prognosticator w.r.t. climate and that evil demon CO2. Indeed, it is obvious that CO2 is a problem, if that is what you started out believing. True scientists (read: few associated with climatology) don’t actually start out with a belief: they see something interesting then devise tests they can conduct (including analyses of existing data) to attempt a better understanding of the phenomenon. People like you, however, are all about belief. Maybe that really is what separates engineers from the rest of the scientific community (as noted above): we are specifically trained to root out meaningless (spurious) correlations, and as such, always look for alternative answers that better describe the problem at hand.
Mark
Greg says:
February 8, 2014 at 8:04 am “The parallels between religion and science are many. In particular, those who are part of the flock of the church of AGW believe in “the science” like christians believe in the “the word”. Bald headed monks like brother Michael are revered as wise men.
Even outside the mess of climatology, science has become very much like the Church it has replaced.”
The parallel between science and religion is this:
data is to science as text is to belief.
In the case of Christianity, all belief is to be tested against the text. A priest class, or expert class, which interprets the text for everyone else has always historically led to distortions and abuses, with the eventual outcome that traditions developed by the priest class teach the very opposite of the text. I believe this pattern exists in Hinduism and Buddhism, as well as in Christianity. Literacy, and translations into spoken languages, have allowed believers to read the text themselves and judge the claims of the priest class. The individual then goes to the church which reflects their interpretation.
In the case of the example of medical science, the answer is similar. Each individual should have the freedom to research claims and choose a path to healing. No government board or exchange should be forcing medical decisions on doctors and patients. The treatments of medical doctors are often drastic, have many side effects and unintended consequences, and may be worse than the diseases they are meant to treat. Iatrogenic illnesses and deaths are very possibly the most underreported area of science. In the case of your physical life, or your eternal life, literacy and liberty, along with personal responsibility for outcomes, are optimal for human well-being.
Steve Mosher: “It takes zero stats to understand that CO2 is a problem.” Based on what? CO2 concentration in the atmosphere is 400 PPM. 400 sounds like a big, dangerous number, but 400 parts per million is .04%. According to warmistas, mankind has caused the CO2 concentration to increase from around 375 PPM to 400 PPM, with 375 PPM as the Goldilocks standard: not too hot, not too cold, but just right; 400 PPM is catastrophe. The increase in concentration is .000025 of the atmosphere in absolute terms, and .0025% in percentage terms. Most people would look at the percentage numbers and percentage increase and draw the obvious conclusion, as Dr. Richard Lindzen has, that CO2 is a trace gas that has increased by a trace percentage, and so what? In addition, some global warming may be a good thing. More CO2 combined with warmer weather means that crop production in North and South Dakota, MN and southern Canada would increase tremendously, helping to alleviate global hunger.
Yes, some oceanfront property may be lost, but at such a glacial pace that mitigation and the necessity to move inland will take place over such a long time frame that the economic cost is easily absorbed compared to the economic gains. Man lives where the climate is warm and wet, warm and dry, cold and wet, and cold and dry, and often experiences all four in one location. We are a remarkably adaptive species, and we all have feet. The bigger threat is government wasting billions in capital on a chimera, all of which will increase the cost of energy; and for the first time in human history, we are making intentional policy choices that will lower the standard of living for future generations.
@Steven Mosher says: February 8, 2014 at 8:05 am
“With regard to the IPCC: it takes zero stats to understand that CO2 is a problem”
Mosher, that’s your problem. CO2 is the fundamental building block of all carbon-based life forms; it is not the problem!!