To make science better, watch out for statistical flaws
First of two parts
As Winston Churchill once said about democracy, it’s the worst form of government, except for all the others. Science is like that. As commonly practiced today, science is a terrible way of gathering knowledge about nature, especially in messy realms like medicine. But it would be very unwise to vote science out of office, because all the other methods are so much worse.
Still, science has room for improvement, as its many critics are constantly pointing out. Some of those critics are, of course, lunatics who simply prefer not to believe solid scientific evidence if they dislike its implications. But many critics of science have the goal of making the scientific enterprise better, stronger and more reliable. They are justified in pointing out that scientific methodology — in particular, statistical techniques for testing hypotheses — has more flaws than Facebook’s privacy policies. One especially damning analysis, published in 2005, claimed to have proved that more than half of published scientific conclusions were actually false.
A few months ago, though, some defenders of the scientific faith produced a new study claiming otherwise. Their survey of five major medical journals indicated a false discovery rate among published papers of only 14 percent. “Our analysis suggests that the medical literature remains a reliable record of scientific progress,” Leah Jager of the U.S. Naval Academy and Jeffrey Leek of Johns Hopkins University wrote in the journal Biostatistics.
Their finding is based on an examination of P values, the probability of getting a positive result if there is no real effect (an assumption called the null hypothesis). By convention, if the results you get (or more extreme results) would occur less than 5 percent of the time by chance (P value less than .05), then your finding is “statistically significant.” Therefore you can reject the assumption that there was no effect, conclude you have found a true effect and get your paper published.
As Jager and Leek acknowledge, though, this method has well-documented flaws. “There are serious problems with interpreting individual P values as evidence for the truth of the null hypothesis,” they wrote.
For one thing, a 5 percent significance level isn’t a very stringent test. Using that rate you could imagine getting one wrong result for every 20 studies, and with thousands of scientific studies going on, that adds up to a lot. But it’s even worse. If there actually is no real effect in most experiments, you’ll reach a wrong conclusion far more than 5 percent of the time. Suppose you test 100 drugs for a given disease, when only one actually works. Using a P value of .05, those 100 tests could give you six positive results — the one correct drug and five flukes. More than 80 percent of your supposed results would be false.
read more here: https://www.sciencenews.org/blog/context/make-science-better-watch-out-statistical-flaws
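The 100-drugs arithmetic above is easy to check with a quick simulation. The sketch below is a minimal illustration, not anyone's actual analysis: it assumes a two-sample t-test per drug, 50 patients per arm, and an effect size of 0.8 for the single drug that works. All of those numbers are made up for the example.

```python
# Sketch of the "100 drugs, one that works" argument: with a 0.05 threshold
# and low prior odds of a real effect, most "significant" results are false.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_drugs, n_patients = 100, 50
true_effect = 0.8          # assumed effect size for the single working drug
n_runs = 100               # repeat the whole exercise to average the outcome

false_hits, true_hits = 0, 0
for _ in range(n_runs):
    for drug in range(n_drugs):
        effect = true_effect if drug == 0 else 0.0   # only drug 0 works
        control = rng.normal(0.0, 1.0, n_patients)
        treated = rng.normal(effect, 1.0, n_patients)
        _, p = stats.ttest_ind(treated, control)
        if p < 0.05:
            if drug == 0:
                true_hits += 1
            else:
                false_hits += 1

total = true_hits + false_hits
print(f"false discovery rate ~ {false_hits / total:.0%}")
# With these assumptions the rate lands around 83%, in line with the
# "more than 80 percent" figure in the article.
```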
Hypothesis testing is often misused. It is great for initial studies that are looking to see if further study is warranted. Of course, you also have to consider the power of the test, which is the likelihood of correctly detecting an effect of a certain size. But design the initial study correctly, with an appropriate power, using a well-developed and well-understood methodology, and you can get a good idea of whether you’ve been suckered by randomness or not.
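For what it's worth, here is a minimal sketch of that kind of power calculation using statsmodels' two-sample t-test calculator. The effect size (Cohen's d = 0.5), the 80% power target and the pilot size of 20 per group are illustrative assumptions, not recommendations.

```python
# Hedged sketch: sizing an initial study for a chosen power, using
# statsmodels' power calculator for a two-sample t-test.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"~{n_per_group:.0f} subjects per group")   # roughly 64 per group

# The flip side: the power you actually get from a small pilot study.
power = analysis.power(effect_size=0.5, nobs1=20, alpha=0.05)
print(f"power with 20 per group: {power:.2f}")    # well under 50%
```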
There is something not mentioned in the article that goes beyond the issue of statistical false positives and false negatives. I think scientific studies, tests, evaluations, and so forth, are often beset by methodology issues, so the statistical analysis ends up being garbage-in, garbage-out. A statistical test or technique is only as good as its assumptions, and one assumption is always that the researchers or engineers know their test apparatus inside and out, and thoroughly understand how it interacts with the various controlled and uncontrolled factors.
In reply to Nick Stokes’ 5% comment.
http://motls.blogspot.com/2010/03/defending-statistical-methods.html
Lubos Motl states that physicists routinely look for 5-sigma significance; otherwise the results aren’t considered worth publishing.
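For readers more used to P values than sigmas, here is a quick conversion sketch (one-sided normal tail, which as far as I know is the usual particle-physics convention):

```python
# Sketch: what a "5 sigma" discovery threshold means in p-value terms,
# compared with the 0.05 convention discussed above.
from scipy.stats import norm

for sigma in (2, 3, 5):
    p_one_sided = norm.sf(sigma)          # upper-tail probability
    print(f"{sigma} sigma  ->  p = {p_one_sided:.2e}")
# 2 sigma ~ 0.023, 3 sigma ~ 1.3e-3, 5 sigma ~ 2.9e-7 (one-sided)
```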
Alan McIntire,
You might also add that Lubos is talking about discrepancies between observation (O) and prediction (P), not about the niceness of the parameters used to create P.
Nullius in Verba wrote: “Dunno. If the trial costs $1m, and the drug has a potential market able to return $100m, it makes sense from a financial point of view, if not an ethical one. Not that I think very many of them do – unconscious cognitive biases are more than sufficient for people to fool themselves without anyone playing such dangerous games.”
You are forgetting that the FDA usually requires two clinical trials at 20 to 1 odds. It also helps to recognize that there are at least two layers of doctors between a drug company and the patients needed for a clinical trial: the research doctors at the hospital agreeing to host part of a clinical trial, and the doctors referring patients to those trials. Hospitals with an excellent reputation can often choose which promising drug they are willing to help develop. It is hard to rapidly accrue patients when there isn’t much excitement about a new candidate, and slow trials eat up patent life and potential future profits. However, I suspect that these days one can buy a clinical trial somewhere in Asia for almost any drug, given enough money.
Nullius also wrote: “unconscious cognitive biases are more than sufficient for people to fool themselves”. Exactly. And if several careers and the company’s stock price depend on the success of a shaky clinical candidate, the biases are not unconscious.
Doesn’t an ARIMA(3,1,0) model imply non-stationarity? Non-stationarity appears to me to be physically unreasonable for our planet, given the negative Planck feedback and the fact that we haven’t experienced a runaway greenhouse like Venus. We’ve got a 4 billion year history (about 0.5 billion of that with less adaptable large multicellular plants and animals living on land) for temperatures to randomly drift further and further from the initial conditions and Keenan wants us to worry about non-stationarity in the past century?
Anthony
Part 1
The AGW hypothesis is a blatant disregard of statistics and the most glaring example of a false positive. NOAA global temperature data from 1880-2013 show that a two-sigma deviation (P < 0.05) did not occur until 1998. All previous warming (positive anomalies) is trivial and statistically insignificant. However, the margin of error in the data is +/- 0.09 C. If we take this into consideration, NONE of the data is statistically significant.
Part 2
The largest positive anomaly on record is 0.66 C in 2010. The threshold for statistical significance, once the margin of error is included, is > 0.69 C. Regression analysis is misleading because it is essentially curve-fitting a trend line to a scatter diagram. It is well known that a random walk can create trend lines; even technical analysts can be fooled by a random walk.
In the physical, biomedical and social sciences (meaning all of science), such data would be dismissed as trivial and the hypothesis (AGW) as inconclusive. But this is climate science, with its own rules. The debate over whether global warming is caused by man or nature is moot and academic. It’s like debating whether your poker winnings are due to superior strategy or to cheating when in fact your winnings are no different from what you would expect from chance alone.
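On the random-walk point above: a few lines of simulation show how easily a pure random walk, with no underlying trend at all, produces series that a least-squares fit will read as a "trend". The step size and the 134-step length (roughly 1880-2013 in years) are arbitrary illustrative choices.

```python
# Sketch: pure random walks (no underlying trend) routinely produce
# stretches that look like trends, and OLS will happily fit them.
import numpy as np

rng = np.random.default_rng(42)
n_years = 134
t = np.arange(n_years)

for walk in range(5):
    steps = rng.normal(0.0, 0.1, n_years)   # arbitrary step size
    y = np.cumsum(steps)                    # random walk: no real trend
    slope = np.polyfit(t, y, 1)[0]
    print(f"walk {walk}: fitted 'trend' = {slope:+.4f} per step")
# Each run yields a nonzero fitted slope purely from accumulated noise.
```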
Steven Mosher says:
February 7, 2014 at 7:17 pm
just banish the nonsense of a result being statistically significant. There is no such thing. It’s a bogus tradition.
You do your test. You report the uncertainty.
<<<<<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Amen to that. Statistics have become a means of propping up weak science.
Correction:
You do your test; you do your statistics; you have your data; there IS no uncertainty.
Unless you discard your test data.
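In the spirit of Mosher's "report the uncertainty": a minimal sketch of reporting an estimated effect with a 95% confidence interval instead of a bare significant/not-significant verdict. The data are simulated, the assumed true effect of 0.3 is purely illustrative, and the interval uses an approximate t-based standard error.

```python
# Sketch: give the estimated effect and a confidence interval rather than
# a yes/no significance verdict.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
control = rng.normal(0.0, 1.0, 40)
treated = rng.normal(0.3, 1.0, 40)      # assumed small true effect of 0.3

diff = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / 40 + control.var(ddof=1) / 40)
ci = stats.t.interval(0.95, 78, loc=diff, scale=se)   # approximate df
print(f"estimated effect = {diff:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
# The interval shows both the size of the effect and how precisely it is known.
```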
Frank,
“You are forgetting that the FDA usually requires two clinical trials at 20 to 1 odds.”
The claim I was responding to was “Only an idiot would think that drug companies are running large clinical trial solely based on the hope of obtaining a positive result by chance in 1 out of 20.” My point being that 1-in-20 can work if the potential profits are more than 20 times the cost of the trial. The principle still stands with two trials: 1-in-400 works if the profits are more than 400 times the cost of the trial. So it’s not a silly question, although one would hope that medical researchers are doing it for more than the money.
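The arithmetic behind that, using the hypothetical $1m-per-trial and 1-in-20 figures from the earlier exchange (nothing here reflects real trial economics):

```python
# Back-of-the-envelope sketch of the expected-value argument above.
trial_cost = 1e6           # $1m per trial (illustrative)
p_fluke = 0.05             # chance a single trial passes on a useless drug

for n_trials in (1, 2):
    p_pass = p_fluke ** n_trials               # all required trials fluke through
    breakeven_market = n_trials * trial_cost / p_pass
    print(f"{n_trials} trial(s): the gamble pays off only above "
          f"${breakeven_market / 1e6:,.0f}m in potential profit")
# One trial: $20m, i.e. 20x the trial cost. Two trials: $800m, i.e. 400x the
# combined cost of both trials, matching the thresholds in the comment.
```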
“Hospitals with an excellent reputation can often choose which promising drug they are willing to help develop.”
So the trial subjects and treatment regime for selected drugs are atypical? That sounds… concerning. Not necessarily a problem, but if drug companies push the most potentially profitable candidates on the best hospitals, you could get selection effects. But that’s a different issue.
“Doesn’t an ARIMA(3,1,0) model imply non-stationarity? Non-stationarity appears to me to be physically unreasonable for our planet”
Yes, and yes it is. But in time-series analysis it is common to use a non-stationary model for data that must be stationary for physical reasons: what you are really doing is approximating the behaviour of time segments too short to resolve their characteristic roots, and doing so avoids all the mathematical difficulties associated with such situations.
The situation arises when one of the roots of a stationary process is very close to, but not on, the unit circle. If you collect enough data, you can locate it precisely enough to tell that it’s actually inside the circle and the series is stationary. But if you only have a short segment of data, too short to fully explore its range of behaviour, the result appears indistinguishable from a non-stationary process. And all the same problems that mess up estimation for non-stationary processes also mess up estimation for these nearly non-stationary ones.
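A quick way to see this: simulate a stationary AR(1) with a coefficient just inside the unit circle alongside a genuine random walk, over a window roughly a century long. The 0.98 coefficient, the noise scale and the 130-step window are arbitrary choices for the sketch.

```python
# Sketch: a stationary AR(1) with a root just inside the unit circle is
# hard to tell from a random walk on a short segment.
import numpy as np

rng = np.random.default_rng(7)
n = 130                                # short segment, ~a century of annual data

def ar1(phi, n, rng):
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal(0.0, 0.1)
    return x

near_unit = ar1(0.98, n, rng)     # stationary, but only just
random_walk = ar1(1.00, n, rng)   # genuinely non-stationary

# On a window this short, both look strongly persistent.
for name, x in (("AR(1), phi=0.98", near_unit), ("random walk", random_walk)):
    r1 = np.corrcoef(x[:-1], x[1:])[0, 1]
    print(f"{name}: lag-1 autocorrelation = {r1:.3f}")
```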
It’s kind of like the way some people fit a linear trend line to a temperature graph. If you extend the graph back far enough, you wind up with temperatures below absolute zero, which make no physical sense. Extend the linear trend far enough forwards and you get temperatures of thousands of degrees, which is not very realistic either. So for exactly the same sort of physical reasons, a linear trend is physically impossible too. And yet nobody objects when people propose linear trend+ARMA(1,1), while they always raise the objection with ARIMA(3,1,0). I suspect it’s just a question of familiarity, but it’s still darn inconsistent. 😉
ARIMA(3,1,0) is an approximation for a short stretch of data in the same way that “linear trend + whatever” is. Everyone agrees it’s not physically plausible as a general model – but until we have a validated physics-based model of the noise, it’s about as good as we’re going to get. It’s what applying the standard textbook methods of time series analysis will give you, anyway.
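To make that concrete, here is a hedged sketch of what those "standard textbook methods" look like in practice, fitting both ARIMA(3,1,0) and a linear trend plus ARMA(1,1) with statsmodels. The series y here is a synthetic stand-in (a weak trend plus AR(1) noise); swap in a real annual anomaly series to compare the fits for yourself.

```python
# Sketch: two textbook fits to an annual series, ARIMA(3,1,0) versus a
# linear trend plus ARMA(1,1). The data below are a synthetic placeholder.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
n = 134
noise = np.zeros(n)
for t in range(1, n):
    noise[t] = 0.6 * noise[t - 1] + rng.normal(0.0, 0.1)   # AR(1) noise
y = 0.005 * np.arange(n) + noise                           # weak linear trend

fit_arima = ARIMA(y, order=(3, 1, 0)).fit()                # difference-stationary
fit_trend = ARIMA(y, order=(1, 0, 1), trend="ct").fit()    # linear trend + ARMA(1,1)

print(f"ARIMA(3,1,0)        AIC = {fit_arima.aic:.1f}")
print(f"trend + ARMA(1,1)   AIC = {fit_trend.aic:.1f}")
# On short series the two often fit comparably well, which is the point:
# both are curve-fitting approximations, not physical models.
```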
I should perhaps add that Koutsoyiannis has some interesting ideas on long term persistence models as even better fits than simple ARIMA, that are worth knowing about. But it’s still a form of curve-fitting, rather than physics.
Nullius in Verba says: February 10, 2014 at 11:58 am
“So for exactly the same sort of physical reasons, a linear trend is physically impossible too. And yet nobody objects when people propose linear trend+ARMA(1,1), while they always raise the objection with ARIMA(3,1,0).”
There is a physical theory as to why there is a trend that was not there before. It’s the theory we’re trying to test.
ARIMA(3,1,0) builds the unphysicality into the noise model. There is no reason to expect that the noise has changed in nature in the last 100 years. That’s a null hypothesis which itself needs explaining.
Nick Stokes has done a good job of illustrating how ‘scientists’ get themselves muddled up with statistics; they can calculate a P-value well enough, but that doesn’t mean they have got the null hypothesis right, nor that they understand what their statistical calculations mean. We won’t get into the misapplication of statistical methodology: linear trends for climate? Give me a break…
Mosher is simply illustrating his lack of critical thinking. “CO2 is a problem”: for whom? How and why? Does it matter?
Fairly typical of Mosher, though: cryptic (or even throwaway) lines that don’t stand up to deeper scrutiny.
As for the subject at hand, I think it’s well understood by anyone with proper statistical training that the scientific literature is littered with poor and misapplied statistics. However, merely using math seems to give people greater confidence that something must be correct. We all need a reminder that math is simply another language, and just because something is consistent or works mathematically doesn’t mean that there is a physical translation or truth behind said mathematics. Statistics can be included in this broader statement.
“There is a physical theory as to why there is a trend that was not there before. It’s the theory we’re trying to test.”
There’s a physical theory for why the temperature was below absolute zero a few thousand years ago?! I don’t think so.
“ARIMA(3,1,0) builds the unphysicality into the noise model.”
No more so than a linear trend.
The classic non-stationary process is the random walk, proposed to explain Brownian motion. And yet it is obviously the case that a pollen grain or a molecule of air cannot wander infinitely far. Why do you suppose it is that Einstein “built the unphysicality into the noise model”? Obviously, because it is a good approximation.
We make such unphysical approximations all the time – infinite perfectly flat planes, straight lines, perfect spheres, frictionless surfaces, rigid bodies, elastic collisions, flat spacetime, infinite crystal lattices, instantaneously propagating gravity, point particles, point velocities, monochromatic plane waves, etc. etc. It’s unphysical only because it’s an approximation. This is just another one.
It doesn’t matter that it’s *technically* unphysical, because it’s a close approximation of something that *is* physical (but mathematically messy).
“There is no reason to expect that the noise has changed in nature in the last 100 years.”
There’s no reason to expect it hasn’t, either.
Consider a process x(t) = k(t-1)*x(t-1) + random(t-1), where x(t) is the data at time t and k(t) is a function that varies slowly over time, hovering around and usually just below 1. Sometimes k exceeds 1 and the process wanders away for a bit, but after a while k drops back below 1 and x returns toward the origin.
The AR(1) process is just the above when k(t) has a constant value, but what physical reason do you have to assume it’s constant? What physical reason do we have to suppose the coefficients in the SkS ARMA(1,1) are constant? Don’t they depend on physical parameters, which can change?
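For anyone who wants to see that process in action, here is a minimal simulation of it. The drift rate of k, its bounds and the noise scale are all made-up illustrative values.

```python
# Sketch of the process described above: an AR(1)-like recursion whose
# coefficient k(t) drifts slowly around values just below 1, occasionally
# crossing it.
import numpy as np

rng = np.random.default_rng(3)
n = 2000
x = np.zeros(n)
k = np.zeros(n)
k[0] = 0.97

for t in range(1, n):
    # k performs its own slow, bounded drift around ~0.97
    k[t] = np.clip(k[t - 1] + rng.normal(0.0, 0.005), 0.90, 1.02)
    x[t] = k[t - 1] * x[t - 1] + rng.normal(0.0, 0.1)

print(f"k ranged over [{k.min():.3f}, {k.max():.3f}]")
print(f"x wandered as far as {np.abs(x).max():.2f} and ended at {x[-1]:.2f}")
# While k sits above 1 the series wanders away; once k drops back below 1
# it relaxes toward the origin again, as the comment describes.
```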