Keep doing that and you'll go blind

Statistical failure of A Population-Based Case–Control Study of Extreme Summer Temperature and Birth Defects

Guest Post by Willis Eschenbach

The story of how global warming causes congenital cataracts in newborn babies has been getting wide media attention. So I thought I’d take a look at the study itself. It’s called A Population-Based Case–Control Study of Extreme Summer Temperature and Birth Defects, and it is available from the usually-scientific National Institutes of Health here.

Figure 1. Dice with various numbers of sides. SOURCE

I have to confess, I laughed out loud when I read the study. Here’s what I found so funny.

When doing statistics, one thing you have to be careful about is whether your result happened by pure random chance. Maybe you just got lucky. Or maybe that result you got happens by chance a lot.

Statisticians use the “p-value” to estimate how likely it is that a result at least as strong as the one observed would occur by random chance alone. A small p-value means it is unlikely that the result occurred by chance. So a p-value less than, say, 0.05 means that there is less than a 5% probability of getting such a result by random chance.

This 5% level is commonly taken to indicate what is called “statistical significance”. If the p-value is below 0.05, the result is deemed to be statistically significant. However, there’s nothing magical about 5%; some scientific fields use a stricter criterion of 1% for statistical significance. But in this study, the significance level was chosen as a p-value less than 0.05.

Another way of stating this same thing is that with a p-value threshold of 0.05, one time in twenty (1 / 0.05 = 20) the result you are looking for will occur by random chance. One time in twenty you’ll get what is called a “false positive”—the bell rings, but the result is not real, it occurred randomly.
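You can check that one-in-twenty rate directly by testing pure random noise. Here’s a minimal sketch in R (an illustration of the principle, not code from the study):

    # Run 10,000 t-tests on pure random noise (no real effect anywhere)
    # and count how often the bell rings at p < 0.05.
    set.seed(42)
    false_pos <- replicate(10000, t.test(rnorm(30))$p.value < 0.05)
    mean(false_pos)  # comes out close to 0.05, one in twenty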

Here’s the problem. If I have a one in twenty chance of a false positive when looking at one single association (say heat with cataracts), what are my odds of finding a false positive if I look at say five associations (heat with spina bifida, heat with hypoplasia, heat with cataracts, etc.)? Because obviously, the more cases I look at, the greater my chances are of hitting a false positive.

To calculate that, the formula that gives the odds of finding at least one false positive is

FP = 1 - (1 - p)^N

where FP is the odds of finding a false positive, p is the p-value (in this case 0.05), and N is the number of trials. For my example of five trials, that gives us

FP = 1 - (1 - 0.05)^5 ≈ 0.23

So about one time in four (23%) you’ll find at least one false positive using a p-value of 0.05 and five trials.
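The same arithmetic as a one-line helper in R (purely illustrative):

    # Probability of at least one false positive among n independent
    # tests, each using significance threshold p.
    false_pos_prob <- function(p, n) 1 - (1 - p)^n
    false_pos_prob(0.05, 5)   # 0.226, about one in four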

How does this apply to the cataract study?

Well, to find the one correlation that was significant at the 0.05 level, they compared temperature to no fewer than 28 different variables. As they describe it (emphasis mine):

Outcome assessment. Using International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM; Centers for Disease Control and Prevention 2011a) diagnoses codes from the CMR records, birth defect cases were classified into the 45 birth defects categories that meet the reporting standards of the National Birth Defects Prevention Network (NBDPN 2010). Of these, we selected the 28 groups of major birth defects within the six body systems with prior animal or human studies suggesting an association with heat: central nervous system (e.g., neural-tube defects, microcephaly), eye (e.g., microphthalmia, congenital cataracts), cardiovascular, craniofacial, renal, and musculoskeletal defects (e.g., abdominal wall defects, limb defects).

So they are looking at the relationship between temperature and no fewer than 28 separate variables.

Using the formula above, with N = 28 different variables we will get at least one false positive about three times out of four (76%).

So it is absolutely unsurprising, and totally lacking in statistical significance, that in a comparison with 28 variables, someone would find that temperature is correlated with one of them at a p-value below 0.05. In fact, it is more likely than not that they would find at least one.
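A quick simulation confirms the formula. This sketch (t-tests on random noise standing in for whatever tests the authors ran; the numbers are illustrative) repeats the 28-outcome snipe hunt many times:

    # Simulate many "studies", each testing 28 outcomes that have no
    # real association with temperature, and count how often at least
    # one comes up "significant" at p < 0.05.
    set.seed(1)
    one_study <- function(n_outcomes = 28) {
      p_values <- replicate(n_outcomes, t.test(rnorm(50), rnorm(50))$p.value)
      any(p_values < 0.05)
    }
    mean(replicate(5000, one_study()))  # about 0.76, three times in four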

They thought they found something rare, something to beat skeptics over the head with, but it happens three times out of four. That’s what I found so funny.

Next, a simple reality check. The authors say:

Among 6,422 cases and 59,328 controls that shared at least 1 week of the critical period in summer, a 5-degree [F] increase in mean daily minimum UAT was significantly associated with congenital cataracts (aOR = 1.51; 95% CI: 1.14, 1.99).

A 5°F (2.8°C) increase in summer temperature is significantly associated with congenital cataracts? Really? Now, think about that for a minute.

This study was done in New York. There’s about a 20°F difference in summer temperature between New York and Phoenix. That’s four times the 5°F increase they claim caused cataracts in the study group. So if their claim were true, that heating up means your kids will be born blind, we should be seeing lots of congenital cataracts, not only in Phoenix but in Florida, in Cairo, and in tropical areas, deserts, and hot zones all around the world … not happening, as far as I can tell.

Like I said, reality check. Sadly, this is another case where the Venn diagram of the intersection of the climate science fraternity and the statistical fraternity gives us the empty set …

w.

UPDATE: Statistician William Briggs weighs in on this train wreck of a paper.

153 Comments
mpainter
December 20, 2012 10:49 am

EMSmith says
We used to play in the fog of pesticide behind the “mosquito trucks”
==========================
Myself, back in the days of DDT. Can’t lay an egg.

jorgekafkazar
December 20, 2012 10:57 am

rgbatduke says: “Jason T, it is good to see that they did a better analysis than appears “at first blush” as you put it, but they ignored the elephant while focusing on the mouse…”
Just want to say I always enjoy your comments.

mpainter
December 20, 2012 11:13 am

Willis Eschenbach: “agenda-driven deception”
Again Willis has put his finger on the nub of the problem. As he points out, the whole issue is settled by comparisons with incidences in warmer climes, so easy to do.
Could it be that this did not occur to the researchers? I doubt it. What this study is actually about is the art of hooking grant money, and these researchers are applying their own brand of AGW panic to shake down some loot.
I wish I had the means to put some kind of trace on the authors, to see if they succeed.

Brian H
December 20, 2012 11:22 am

Willis Eschenbach says:
December 20, 2012 at 2:31 am

So while you are correct that the odds of that particular state are small, the odds that we’ll have at least one result of p = 0.05 are quite large, as I calculated above
w.

Yes, just to make the point clearer, you’d have to pick the specified significant variable in advance of doing the study to get kashua’s result. Saying, “This variable will be significant” is a much tougher standard than “Some variable will (appear to) be significant.”

rgbatduke
December 20, 2012 11:30 am

Heh, if I took a pair of dice and rolled snake eyes on my first roll, should I label the dice as biased?
Possibly. It depends on your Bayesian priors. Bayes theorem actually tells you HOW MUCH you should presume that they are biased given the data and any given initial prior. That is, if you start out with the prior assumption of an unbiased die, held with some numerically stated degree of confidence, and then roll a 6, how much “should” that alter your best estimate of the probabilities for the die on the next roll? Bayes theorem will actually answer that, if you let it. If you get the book “Doing Bayesian Data Analysis” with its cute-puppy cover, you can even walk through how to answer this quantitatively using R.
See “Polya’s Urn”, or (as I first encountered it) how to use Bayes theorem to evaluate the most likely probability for a two-sided coin given the data, in the context of learning statistical mechanics. Answer: maximize (information) entropy with your answer.
That’s the other great tragedy — aside from the fact that the article could be a poster child for scientists who need a twelve-step program to stop data dredging (see the xkcd comic above, a perfect fit right down to the headline) — Bayes alters everything. We don’t actually do statistics any more by just counting — we know too much for that to make any sense. Bayesian priors — especially prevalence — are critical to any assessment of false positive/false negative rates in epidemiology, in addition to helping you understand the Let’s Make a Deal paradox: when you are shown three doors (one of which hides a treasure) and you choose one, and then somebody opens one of the two unchosen ones to show that it is blank, you should switch to the other (remaining) door to maximize your chances of winning. Assuming, of course, that the person isn’t trying to game you by anticipating that you will do this, and is required to always offer you the choice.
Statistics without Bayes is like, well, often just plain wrong, especially in the arena of hypothesis testing.
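As a minimal sketch of what that update looks like in R (assuming a Beta prior on the chance of rolling a six, with purely illustrative numbers):

    # Beta-Binomial updating for a possibly loaded die. Prior mean is
    # 1/6 (a fair die); a + b sets how strongly we believe it up front.
    a <- 10; b <- 50             # prior mean 10/(10+50) = 1/6
    rolls <- c(1, 0, 0, 1, 1)    # 1 = rolled a six, 0 = anything else
    a_post <- a + sum(rolls)     # add observed sixes
    b_post <- b + sum(1 - rolls) # add observed non-sixes
    a_post / (a_post + b_post)   # posterior estimate of P(six) next roll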
rgb

Gail Combs
December 20, 2012 12:06 pm

I second Willis’s thanks.
I have just enough statistics training (a couple of semesters) to know when to scream for help. Too bad most scientists who write these papers are either too arrogant or ignorant to do the same.

page488
December 20, 2012 12:29 pm

I’ve learned so much from your guest posts here!
I’m wondering about blindness in New York city. The other day, I was watching a cable channel that was giving the projected high temps for the New York City area. I don’t remember what they were, but the projected “HI” for the city was 10 degrees F higher than the suburbs – I guess this means that kids in the city are all going blind and God only knows what all else.
Seriously, if they’re going to do studies like this, it seems a lot smarter to test people living inside and outside the urban heat island effect in the same region.

michaeljmcfadden
December 20, 2012 1:11 pm

Rg, I’ve always had trouble with that Deal paradox. Say I had a thousand doors and 999 of them had nothing behind them and one had a zillion samolians. I pick one and clearly have one chance in a thousand of winning. Monty Hall then randomly opens up 998 of them and they all display nothing. The chances at that point of the door on my left (which I had picked originally) having the zillion should be, as far as I can intuitively be comfortable with, 50%, i.e. the same as for the other door. And yet the Monty Hall “explanation” *seems* to tell me I should switch my choices to the door on my right.
Somehow that has NEVER sat comfortably in my gut. It simply feels like a joke being played by statisticians to see how gullible people can be.
– MJM

rgbatduke
December 20, 2012 1:18 pm

Statistics is bloody difficult. He says, sitting here with three books on R and Bayesian analysis and two books on javascript at hand, with a dozen more in the house and at work, trying to learn R and javascript both so I can learn to use mongo (a huge database program with a huge database that speaks JSON) and R in order to do predictive models both within R and with a proprietary neural network I wrote, once I integrate the db contents with all three. Yesterday. I know a lot, and I understand its basis (which is worth more), but there are huge chunks of stats I still don’t know, or learn only as I need them. I’m at the bleeding edge as far as Markov processes and multivariate predictive modeling are concerned, way less expert in boring old regression, pretty far out there in Bayesian analysis, and yet there are plenty of tricks used in mundane stats I’ve never used and don’t know. And I can do calculus and algebra and all that (as well as code, obviously, in too-many-languages-to-count+2 new ones, truly expert in C).
Pity the poor suckers like Michael Mann who actually wrote a PCA package in fortran so he could do his tree-ring work. Along comes M&M who actually use, um, R (because other people wrote it, and those people are really, really good at statistics and keep rewriting it and improving it and adding onto it as bugs are discovered, and besides, it’s bone simple to use once you’ve learned its data structures as it does nearly anything you want, the way you should be wanting to do it, automatically and by default once you get the data loaded in at all) and show that his results are bullshit with a few well chosen simulations and exercises.
Back when I was a grad student and early postdoc, I used to write my own numerical code (and yeah, usually in Fortran, as that was sort of the default physics language in those days). Fortran on IBM mainframes, fortran on PDP 1s and 11s, fortran on a Harris 800 (with the Vulcan operating system no less and 3 byte word boundaries), fortran back on a mainframe, fortran on an original 64K motherboard IBM PC, fortran on several AT clones before I converted to C in 1986 and never looked back. My excuse back then is simple — numerical libraries cost a fortune and nobody had them. Nowadays, I use the Gnu Scientific Library and try NOT to write my own code for standard stuff like statistics (although I’ve written a ton of stats routines in the past) because all it takes is one boneheaded bit of code and you’re screwed — garbage in OR garbage code equals garbage out.
Pity also physicians who are trying to do research that involves a serious statistical analysis. One stats course does not an expert make, and that is precisely what most of them have. If that — a lot of them took calculus in University, not EVEN one course in stats. As a consequence, they probably have never even heard of “data dredging”.
If they had, they would never claim that green jellybeans cause acne in a snipe hunt involving 28 possible hypotheses against one binary variable. In fact, they wouldn’t publish the results of the snipe hunt at all — they might legitimately use it to formulate a hypothesis to be tested against completely new data after looking carefully at some of the issues listed above, e.g. the lack of any corresponding differentiation of total prevalence (already well known) with climate, which suggests that if the effect they observe is not cosmic debris (a purely random occurrence, which it more than likely is), it is confounded by other factors that are much stronger.
The hypothesis itself is perfectly reasonable. But you can’t discover the association and test it with the same data set, certainly not in a snipe hunt among 28 snipes with p set at 0.05, and certainly not when the differentiation is a small change in a well-known probability.
I mean, which is it? Did the heat wave make the probability 0.0006 (a 50% increase) compared to the control population and was the control population still at 0.0003-whatever (since AFAIK it is only known to be “between 0.0003 and 0.0004” in the US in general)? The smaller subgroups they are trying to analyze have an even smaller representative population. Not only were the jellybeans green, but they worked the best on people believed to have acne that is sensitive to sugar.
I wonder why the blue ones don’t?
rgb

Tim Clark
December 20, 2012 2:20 pm

Presumably, under proper process, the subjects/studies under analysis should have answered a questionnaire designed to eliminate extraneous and/or conflicting variables. Did that questionnaire include appropriate questions and sufficient information to determine if the other threatened cause of blindness was not being consummated?

Alan Bates
December 20, 2012 2:25 pm

Sorry. I am 67 and don’t have the time to go through 110 comments so if anyone has already made this point, ignore me!
I believe the correct way to proceed would be to say:
“There is no real evidence for a link between 27 out of the 28 conditions, hence we can ignore those and look at the only 1 which has triggered what might be a significant link. Two questions come up. Is there a causal link between temperature and the condition we are studying? Common sense suggests there isn’t (based on the Phoenix/NY argument). However, if you are really sure this is a critically important problem, do the test again using other data from another place. Surely there are plenty of big cities which have an increase in temperature (UHI effect) and equally surely there are other sets of health studies. If you find that of the 20 sets only 1 does NOT show a significant correlation then you have some kind of evidence that there is a genuine (i.e. p = 0.05) correlation.”
THEN you have something to publish and get your fame and glory. But only, of course, if you can find a causal link between temperature and your condition.
For example, there is a strong correlation between the temperature in London and the percentage of men in Germany wearing coats. Does this mean that wearing overcoats in Germany is caused by the temperature in the London streets? Or could it just be that the common causative factor is winter in the Northern Hemisphere?

AndyG55
December 20, 2012 2:50 pm

Down under, the outback aboriginals do have an increased level of eye problems, and they do live in very warm places. But the two are not causally linked.
As soon as decent medical care is available from a young age, the problem lessens considerably.
The problem does not exist anywhere near as much for urban Aborigines.
So yes, Alan, coincidence does not imply correlation, and correlation does not imply causation.

clipe
December 20, 2012 3:02 pm

Meanwhile, back at the UN ranch.
This would have been the second surprise. Throughout the epidemic, science has been the last thing the UN’s political leaders have wanted to talk about.
http://www.winnipegfreepress.com/opinion/westview/UN-fakes-effort-to-help-Haiti-184151891.html

GlynnMhor
December 20, 2012 3:19 pm

Blind? Please… just until I need glasses…

Francisco
December 20, 2012 3:44 pm

From the Readings section of Harper’s magazine, January 2013 issue.
The following comes from the instruction manual of a board game called “The Settlers of Catan,” developed by the Worldwatch Institute and the game’s manufacturers.
During your turn, you can convert one oil into two non-oil resources of your choosing. Alternatively, you may choose to forgo the usage of oil, sacrificing some growth for increased environmental security and the prestige of being a sustainability leader. The first player to have sequestered three oils gains the “Champion of the Environment” token.
For every five oils used, an environmental disaster results. Roll the two six-sided dice to determine where disaster strikes. If a seven is rolled, a natural disaster triggered by climate change floods the coasts. Settlements bordering a sea are removed from the board, and cities are reduced to settlements. Roads are not affected. A metropolis (because of its seawalls and other advanced design) is also not affected. If any other number is rolled, industrial pollution has struck. If the affected hex does not contain an oil spring, remove the production-number token from the hex. That hex no longer produces resources.
If the fifth number token is removed from one of the hexes, flooding has overwhelmed Catan and all inhabitants are forced to abandon the island, thus ending the game. While no player truly wins, the player who currently holds the Champion of the Environment token is recognized by the international community for his/her efforts to mitigate climate change and is granted the most attractive land on a neighboring island to resettle.

ttfn
December 20, 2012 4:17 pm

michaeljmcfadden says:
December 20, 2012 at 1:11 pm
“Rg, I’ve always had trouble with that Deal paradox. Say I had a thousand doors and 999 of them had nothing behind them and one had a zillion samolians. I pick one and clearly have one chance in a thousand of winning. Monty Hall then randomly opens up 998 of them and they all display nothing.”
Monty’s not opening the doors randomly. He’s opening them because he knows they don’t contain the zillion samolians. Each empty door he opens gives you a little more information than you had when you made your original choice. If he were opening them randomly, your gut would be right.

MattS
December 20, 2012 4:53 pm

@michaeljmcfadden
“Rg, I’ve always had trouble with that Deal paradox. Say I had a thousand doors and 999 of them had nothing behind them and one had a zillion samolians. I pick one and clearly have one chance in a thousand of winning. Monty Hall then randomly opens up 998 of them and they all display nothing. The chances at that point of the door on my left (which I had picked originally) having the zillion should be, as far as I can intuitively be comfortable with, 50%, i.e. the same as for the other door. And yet the Monty Hall “explanation” *seems* to tell me I should switch my choices to the door on my right.
Somehow that has NEVER sat comfortably in my gut. It simply feels like a joke being played by statisticians to see how gullible people can be.”
You are incorrect and the Monty Hall explanation is correct. This has been tested by Mythbusters.
Look at it this way. For any N-door contest with only one prize, the probability that the door you select does not contain the prize is (N − 1)/N.
When the host is opening empty doors he knows which door has the prize and won’t open that door. Because of this it doesn’t matter how many doors are opened, the odds you selected the wrong door don’t change, and all the extra probability goes to the remaining unselected door.
You will always have better odds switching, and the greater N is, the greater the advantage to switching.
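A quick simulation bears this out. Here is a minimal sketch in R (my own illustration, assuming the host always knows where the prize is and only opens empty, unchosen doors):

    # Monty Hall with n doors: the host opens every unchosen empty
    # door, leaving your pick and one other door.
    monty <- function(n_doors, switch, trials = 100000) {
      wins <- 0
      for (i in seq_len(trials)) {
        prize <- sample(n_doors, 1)
        pick  <- sample(n_doors, 1)
        if (!switch) {
          final <- pick
        } else if (pick == prize) {
          # switching off the prize door: every other door loses
          final <- sample(setdiff(seq_len(n_doors), pick), 1)
        } else {
          # the host left only the prize door to switch to
          final <- prize
        }
        wins <- wins + (final == prize)
      }
      wins / trials
    }
    monty(3, switch = FALSE)     # about 1/3
    monty(3, switch = TRUE)      # about 2/3
    monty(1000, switch = TRUE)   # about 999/1000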

RACookPE1978
Editor
December 20, 2012 5:56 pm

But ONLY
(1) if the host knows which door has the prize as he opens the remaining 998 doors
AND
(2) if the host decides to act on that knowledge.
If either of the above is not true, then switching gains you nothing.

F. Ross
December 20, 2012 8:28 pm

Long story short —

rgbatduke says:
December 20, 2012 at 1:18 pm
“…
Pity the poor suckers like Michael Mann who actually wrote a PCA package in fortran so he could do his tree-ring work. Along comes M&M who actually use, um, R … and show that his results are bullshit with a few well chosen simulations and exercises.
…”

Put nicely that is.

michaeljmcfadden
December 20, 2012 10:30 pm

RA, Matt, TT … Thank you! Your comments, along with my own posing of the puzzle in extreme form, have enabled me to finally break through on both an intellectual AND a gut level. Given the host’s deliberate avoidance of the prize, it would appear to be a near-certainty (Well, 999 out of 1,000?) that the other door holds the prize. Clearly, the chance for the one that I picked originally was only 1 in a thousand. I’m still a *little* fuzzy on the fine points — as I’ve freely admitted at other times in statistical discussions on the net, although I had graduate level statistics, it was never my strongest suit. (Heh, plus it was a while ago.) — but at least now I feel comfortable in looking at the problem and telling myself, “Yup. It’s real. Switch yer choice!”
:>
MJM

theduke
December 20, 2012 10:40 pm

willis and rgbatduke make for a helluva one-two punch.

Frank
December 20, 2012 11:14 pm

Small biotechnology companies with new drug candidates use a slightly different trick when they can’t demonstrate with 95% confidence that their candidate drug is beneficial. They will look inside the group of patients in their clinical trial for a sub-population of patients that did show an unambiguous effect (p<0.05). Perhaps men responded better than women, or healthier patients better than sicker ones, or those who hadn’t tried anti-cancer drug A responded better to their new drug than those that had. If you look at 20 or so sub-populations, the chances of finding one that clearly benefited (p<0.05) go up a lot. So the company goes to the FDA and requests approval to sell their new drug with information indicating that it should only be given to the sub-population that responded. (Doctors are allowed to prescribe an approved drug to anyone they think will benefit, no matter what the label says.) The FDA, which employs many statisticians and understands these tricks, says: run another large, expensive clinical trial consisting only of patients you expect to benefit, and demonstrate that your drug is beneficial with p<0.05. The company then runs to the WSJ and gets them to write an editorial about how the FDA’s arbitrary and capricious rules are going to bankrupt another small company with a drug of proven efficacy in a particular group of patients.
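The formula from the post puts numbers on this: with 20 sub-populations and no real effect at all, the chance of at least one falsely “responding” subgroup is nearly two in three.

    # Chance of at least one false "responder" subgroup among 20,
    # at p < 0.05, when the drug actually does nothing:
    1 - (1 - 0.05)^20   # about 0.64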

michaeljmcfadden
December 20, 2012 11:26 pm

I don’t remember if it was the WSJ at the moment, but your story reminds me of a study where the researchers got headlined as showing that exposure to a certain substance caused “hypertension” in “boys as young as eight years old.” The “hypertension” was an increase of roughly 1 unit systolic, and the pool of boys was actually a range of 8 to 17 years old, but the headlines didn’t go into that sort of detail of course.
The stories weren’t much better, although one of them *did* mention that the deadly exposure happened to DEcrease systolic readings by almost TWO points among girls. The researcher was asked about that and noted that decreased blood pressures could also be a health threat.
No, I’m not kidding. I can dig out the refs if anyone wants.
– MJM

JazzyT
December 21, 2012 3:33 am

I thought that biological plausibility of the results for congenital cataracts lent some support to the idea that there could be a causal relationship, but I’m backing off of that. It’s true that all of the statistically significant connections between heat and cataracts fall within the critical period for cataracts, 4-8 weeks, and none of them outside. But that’s probably more likely than it sounds (looking at weeks 1-12 on figure 2, getting within weeks 4-8 is a 5/12 chance or 0.417, and, given three such weeks, getting them all in that window is a chance of 0.072; but for four different birth defects, the chance of getting one hit randomly is a less impressive 0.29). The calculation is more complex allowing for different numbers of weeks showing significance within the critical period, and allowing for the fact that heat waves were defined as three successive days over the 90th percentile in temperature. Also, we don’t actually know the sensitivity for congenital cataracts for each week in weeks 4-8, so calculating the odds of getting figure 2 by chance becomes impossible.
The point about comparing populations in warmer climates with the study group is a good one. I would add, though, that our ability to deal with heat does include both some genetic basis and some acclimation. The authors quoted animal studies on hyperthermia causing birth defects, though now I’ll apologize for not taking the time to look them up and find out how extreme that hyperthermia was. Also, hyperthermia itself might not be the causative agent; stress hormones or other reactions to unaccustomed heat might be involved as well. Just to muddy the waters a bit … still, my opinion of this paper has dropped.
Regarding the Monty Hall problem … the best way to understand this is to simply draw out a decision tree. You pick door #1, with a 33% chance at the prize. The game show host then chooses door #2, which is not the prize. 0% chance. Door #1 is still 33%, so door #3 must have a 67% chance of having the prize behind it. Sounds bogus. But draw out each possibility, e.g., say the prize is behind door #1. Pick door 1, see door 2, stay with door 1. Pick door 1, see door 2, change to door 3. Pick door 1, and this time, see door 3, with 2 choices. Then, pick door 2 and look at those choices. Multiply through the probabilities for each. Then, for all the winning paths (door with the prize and door you end up with are the same) add up the probability for that path, calculated as the product of probabilities for each step. When this hit the news, because a columnist had commented on it, I was sure it was wrong. My brother, who was a math grad student, told me to try drawing out that tree-structure of probabilities. He also told me that he’d tried this with the students in the college algebra class he was teaching, with playing cards, and, in over 100 trials, it came out pretty close to 1/3 wins if you stayed with your first choice, and 2/3 wins if you switched.
I finally got it, and I was really annoyed that he’d gotten one up on me. At the time, I’d just typed a random number generator into the computer (from Marsaglia and co-authors, as it happened) to use for a Monte Carlo program** I was modifying and testing. So, just to outdo my brother in some way, when I left the bar, I wrote a quick program and tested the Monty Hall effect (in a very simple Monte Carlo simulation) over five million trials.
Well, whatever else its failings, this paper’s given us an interesting thread.
[**Monte Carlo simulation involves exploring a simulated system or solving complex equations using many random trials. A simple example would be estimating the area inside some region by defining a rectangle containing that region, then looking at many random points within the rectangle. The fraction of points falling in the region gives its area relative to that of the rectangle.]
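For instance, here is a minimal sketch of that idea in R (not JazzyT’s original program), estimating the area of the unit circle, which is pi:

    # Estimate the area of the unit circle from the fraction of random
    # points in the enclosing 2 x 2 square (area 4) that fall inside.
    set.seed(7)
    n <- 1000000
    x <- runif(n, -1, 1)
    y <- runif(n, -1, 1)
    4 * mean(x^2 + y^2 <= 1)   # comes out close to pi, 3.14159...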