Statistical failure of A Population-Based Case–Control Study of Extreme Summer Temperature and Birth Defects
Guest Post by Willis Eschenbach
The story of how global warming causes congenital cataracts in newborn babies has been getting wide media attention. So I thought I’d take a look at the study itself. It’s called A Population-Based Case–Control Study of Extreme Summer Temperature and Birth Defects, and it is available from the usually-scientific National Institutes of Health here.
Figure 1. Dice with various numbers of sides. SOURCE
I have to confess, I laughed out loud when I read the study. Here’s what I found so funny.
When doing statistics, one thing you have to be careful about is whether your result happened by pure random chance. Maybe you just got lucky. Or maybe that result you got happens by chance a lot.
Statisticians use the “p-value” to estimate how likely it is that the result occurred by random chance. A small p-value means it is unlikely that it occurred by chance. The p-value is the odds (as a percentage) that your result occurred by random chance. So a p-value less than say 0.05 means that there is less than 5% odds of that occurring by random chance.
This 5% level is commonly taken to be a level indicating what is called “statistical significance”. If the p-value is below 0.05, the result is deemed to be statistically significant. However, there’s nothing magical about 5%; some scientific fields use a stricter criterion of 1% for statistical significance. But in this study, the significance level was chosen as a p-value less than 0.05.
Another way of stating this same thing is that a p-value of 0.05 means that one time in twenty (1.0 / 0.05), the result you are looking for will occur by random chance. One time in twenty you’ll get what is called a “false positive”—the bell rings, but it is not actually significant, it occurred randomly.
Here’s the problem. If I have a one in twenty chance of a false positive when looking at one single association (say heat with cataracts), what are my odds of finding a false positive if I look at say five associations (heat with spina bifida, heat with hypoplasia, heat with cataracts, etc.)? Because obviously, the more cases I look at, the greater my chances are of hitting a false positive.
To calculate that, the formula that gives the odds of finding at least one false positive is
FP = 1 – (1 – p)^N
where FP is the odds of finding a false positive, p is the p-value (in this case 0.05), and N is the number of trials. For my example of five trials, that gives us
FP = 1 – (1 – 0.05)^5 ≈ 0.23

So nearly one time in four (about 23%) you’ll find at least one false positive using a p-value of 0.05 and five trials.
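The arithmetic is easy to check. Here is a minimal Python sketch of that formula (the function name is mine, purely for illustration):

```python
def familywise_fp(p, n):
    """Chance of at least one false positive among n independent
    tests, each run at significance level p."""
    return 1 - (1 - p) ** n

# Five independent tests at the 0.05 level:
print(round(familywise_fp(0.05, 5), 3))  # prints 0.226
```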
How does this apply to the cataract study?
Well, to find the one correlation that was significant at the 0.05 level, they compared temperature to no less than 28 different variables. As they describe it (emphasis mine):
Outcome assessment. Using International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM; Centers for Disease Control and Prevention 2011a) diagnoses codes from the CMR records, birth defect cases were classified into the 45 birth defects categories that meet the reporting standards of the National Birth Defects Prevention Network (NBDPN 2010). Of these, we selected the 28 groups of major birth defects within the six body systems with prior animal or human studies suggesting an association with heat: central nervous system (e.g., neural-tube defects, microcephaly), eye (e.g., microphthalmia, congenital cataracts), cardiovascular, craniofacial, renal, and musculoskeletal defects (e.g., abdominal wall defects, limb defects).
So they are looking at the relationship between temperature and no less than 28 independent variables.
Using the formula above, if we look at the case of N = 28 different variables, we will get a false positive about three times out of four (76%).
So it is absolutely unsurprising, and totally lacking in statistical significance, that in a comparison involving 28 variables, someone would find that temperature is correlated with one of them at a p-value below 0.05. In fact, it is more likely than not that they would find at least one such correlation.
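You don’t have to take the formula’s word for it; a quick simulation shows the same thing. This is just a sketch, assuming 28 independent tests run on pure noise (all names are mine):

```python
import random

random.seed(42)  # for repeatability

def at_least_one_hit(n_tests=28, alpha=0.05):
    """Simulate n_tests independent null comparisons. Under the null
    hypothesis a p-value is uniform on [0, 1], so each test 'rings
    the bell' with probability alpha. Return True if any test does."""
    return any(random.random() < alpha for _ in range(n_tests))

runs = 10_000
rate = sum(at_least_one_hit() for _ in range(runs)) / runs
print(rate)  # close to the theoretical 0.76
```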
They thought they found something rare, something to beat skeptics over the head with, but it happens three times out of four. That’s what I found so funny.
Next, a simple reality check. The authors say:
Among 6,422 cases and 59,328 controls that shared at least 1 week of the critical period in summer, a 5-degree [F] increase in mean daily minimum UAT was significantly associated with congenital cataracts (aOR = 1.51; 95% CI: 1.14, 1.99).
A 5°F (2.8°C) increase in summer temperature is significantly associated with congenital cataracts? Really? Now, think about that for a minute.
This study was done in New York. There’s about a 20°F difference in summer temperature between New York and Phoenix, four times the 5°F increase they associate with cataracts in the study group. So if their claim were right, that heating up means your kids risk being born blind, we should be seeing lots of congenital cataracts, not only in Phoenix, but in Florida and in Cairo and in tropical areas, deserts, and hot zones all around the world … not happening, as far as I can tell.
Like I said, reality check. Sadly, this is another case where the Venn diagram of the intersection of the climate science fraternity and the statistical fraternity gives us the empty set …
w.
UPDATE: Statistician William Briggs weighs in on this train wreck of a paper.
Congenital cataracts? It’s worse than we thought!
Sorry, couldn’t resist.
“We assigned meteorologic data based on maternal residence at birth . . .”
While they . . .
“summarized universal apparent temperature (UAT; degrees Fahrenheit) across the critical period of embryogenesis . . .”
and
“ particularly for exposures during weeks 4–7”
This isn’t likely a big issue but folks do move — so residence at birth and residence at 4-7 weeks may not be the same.
Then there is the “maternal fever” thing. Is it possible that some women had a fever during weeks 4-7? Would they remember months later even if asked? Were they asked?
Again, maybe this isn’t too important but I didn’t see how they controlled for either of these things.
The use of multiple significance tests is a well-known problem in statistics. The use of many variables is problematic if they are correlated. If they are independent, the binomial distribution applies, and its expected value Np is already telling: with p = 0.05 and N = 28 we expect 1.4 false positives among independent trials. I once tried to explain that to a researcher who used 60 variables in 60 significance tests and found 3 ‘significant’ results. I did not succeed, and the same BS is still around. The problem seems to be very difficult to root out.
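The commenter’s binomial point is easy to verify. A short sketch (function names mine, assuming independent tests):

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of k or more
    false positives among n independent tests at level p."""
    return 1 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))

# Expected number of chance hits in 28 tests at the 0.05 level:
print(round(28 * 0.05, 2))  # prints 1.4

# The chance of 3 or more 'significant' hits out of 60 null tests:
print(round(prob_at_least(3, 60, 0.05), 2))  # prints 0.58
```

So three chance hits among 60 null tests is actually more likely than not, which is exactly the commenter’s point.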
Forgive me Willis for not sharing in your amusement. Some people take these people seriously. Even worse, a lot of people take the people seriously that drool over the prospect of using something like this to advance the cause of climate alarmism.
Once again poor science gets ‘Eschenbached.’
I am sorry. I was born at night, when the temp was so much lower. Maybe that is why I am not quite blind, yet? If Anthony has to watch his time of observation, shouldn’t these folks with all sorts of degrees (and perhaps even cataracts) have to watch theirs?
Someone I know very well, who has an important role interpreting published science in a medical area for public policy makers, tells me that 90% of the published papers in his field are worthless because of problems like this. He tells me most of it comes from doctors who think they are scientists.
I googled two of the authors and both are epidemiologists, a discipline one would expect to know something about statistics.
I always thought people were homeotherms.
And that an unborn baby would be growing in a steady 37C environment.
How could a baby in the womb notice any change in temperature outside?
And how would it affect him?
And going from outside -20C in winter to +20C in your own house would be bad too?
I won’t bother to read the daft paper then! More seriously, this is exactly the kind of crap science we have come to expect. What bothers me is that this passes muster and is published.
Also, in relation to something that is non-linear and semi-chaotic in nature, I dunno, let’s say something like ‘climate’, it would be relatively easy to pick a variable and any number of the vast number of climatic ‘events’ and ‘interdependent variables’ and probably make a case for one ‘important’ variable being the sole cause of said events … Oh wait, now I get it, that’s why CO2 is responsible for everything!
To correct for multiple comparisons, some people use the Bonferroni correction, which divides your desired significance level (e.g. 0.05) by the number of comparisons. So 0.05/28 = 0.0018, which becomes the new threshold for finding any single comparison significant.
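As a sketch of what that correction looks like in practice (names mine, not from the study):

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that caps the family-wise
    error rate at alpha across n_tests comparisons."""
    return alpha / n_tests

t = bonferroni_threshold(0.05, 28)
print(round(t, 4))  # prints 0.0018

# Sanity check: the chance of any false positive at the corrected
# threshold stays just under the original 0.05 level.
print(round(1 - (1 - t) ** 28, 3))  # prints 0.049
```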
First off, they used regression analysis, which can show a false correlation with one variable, and they had several. Second, they cannot prove causation, as this was not a controlled study. Maybe I don’t have the background to understand these statistical methods? The Rahmstorf and Foster paper also used a method similar to this one to get their “temps match IPCC” graph, and I recently read a study on second-hand smoke that was touted in the media as proof of the dangers, when it was actually a meta-analysis, which really blew my mind. I just finished my first stats course, and from just playing around with some regression analysis and sampling I was able to see the inconsistencies in these methods. If you don’t like your results, take another sample, remove outliers and try again, or change the scale of your scatter plot. Lots of room to introduce bias…
Warmer temperatures will increase the growing range of golden rice, which prevents blindness in children, but not in politicians.
When I was growing up I had access to AC; it was called a window in summer and a fire in winter. But this study is just ridiculous. Of course this had no effect on internal body temperature.
Clearly they didn’t bother with Wikipedia: “A congenital disorder, or congenital disease, is a condition existing at birth and often before birth, or that develops during the first month of life (neonatal disease), regardless of causation. Of these diseases, those characterized by structural deformities are termed ‘congenital anomalies’ and involve defects in or damage to a developing fetus.”
More tripe just in time for Christmas!
Great post, Willis. Readers may like to visit Prof John Brignell’s wonderful Numberwatch site, where he has done an excellent demonstration of the statistical analysis carried out today, in which large numbers of studies display a Relative Risk ratio of less than 2. Twenty years ago, before the invoking of the Precautionary Principle, any paper with an RR ratio of less than 3 was cut into small rectangles, bundled together with a small hole pierced in the top left-hand corner, hung on a looped length of string, and placed in the smallest room in the laboratory!!!!! I find it sad that, as “ALL CHEMICALS” cause cancer especially by PDREU/UESR standards, scientists pump poor old lab rats full of a substance in volumes beyond a realistic level to supposedly simulate prolonged exposure over time, and when the rat develops a tumour, little mention seems to be made of the likely toxic shock to the system the rat may have suffered from chemical overload in a very short time frame! The press release is usually done before peer review has taken place, and the press/media have a field day with yet another “we’re all going to die if we don’t do as we’re told” scare story!
I remember being taught about the Bonferroni correction (and other options) for this multiple-testing problem in my first-year undergrad general stats course about 20 years ago. It’s a very well known problem. Surprised (well, not really) that neither the researchers nor any of the reviewers picked it up.
I hate to be picky, but two points on statistical terminology. In statistical significance testing, the p-value isn’t the probability that the result happened by random chance, but the probability of observing such an extreme result if the null hypothesis is true. If the null hypothesis is ‘random chance’ then it’s the same thing, but if I’m comparing two groups, one null hypothesis might be that there’s no difference between them, while an alternative null hypothesis is that there is a difference of a specified amount. For the same data, I’d get a different p-value depending on which null hypothesis I’d identified a priori as the one to test.
Secondly, a p-value is a probability, not an odds. An odds is the ratio of the probability of success to the probability of failure, i.e., p/(1-p). While people often use them interchangeably, in statistics they’re two different things. Admittedly there’s not much difference if you’re talking about a probability of 0.05, but if the probability is 0.75, the odds is 3 (i.e., 0.75/0.25).
Here endeth the lesson. 😉
“The p-value is the odds (as a percentage) that your result occurred by random chance. So a p-value less than say 0.05 means that there is less than 5% odds of that occurring by random chance.”
That’s not right, is it? If the p-value were a percentage, it would be ‘5%’, not ‘0.05’. It can be converted to a percentage, but it is expressed as a decimal fraction.
As usual, Mr. Eschenbach got things wrong.
There’s actually no problem with his article, all the math he presents is perfectly correct.
What’s wrong are his assumptions on methods used in the PubMed paper which he apparently didn’t read carefully enough.
First of all, while the paper is inspired by IPCC conclusions, it is not about ambient temperatures at all. It is about heat waves and it even gives detailed description about what is considered a heat wave. And that definition makes pretty good sense both in New York and in Phoenix.
Second, while Mr. Eschenbach is calculating how likely it is that such a result may come out of totally random data, he might also have calculated it for the results they actually present. That means: what is the total statistical significance of one result out of 28 having p<0.05 on three separate criteria at once, one result having p<0.05 on two at once, and the remaining 26 being not statistically significant on any of the three criteria? Because done correctly, that state coming out of random data is actually pretty unlikely.
The article does not deserve this kind of treatment from Mr. Eschenbach; it is an interesting study. It may not be relevant to global warming, for all of the reasons which have been discussed on these pages thousands of times already, but it still may be relevant to what people may want to care about during pregnancy.
Kasuha says:
December 20, 2012 at 1:48 am
Gosh, Kasuha, that’s a charming way to enter a discussion. You do realize that it reduces your chances of having people care about your opinion, don’t you?
Oh, please. How is a “heat wave” different from “heat”? Are you saying that the heat is a problem only if it doesn’t last long? I don’t understand this objection at all.
For example, a 5° summer heat wave in New York might be 10° colder than an average summer day in Phoenix, and last for a much shorter time … how can a cooler, shorter “heat wave” in NY cause congenital cataracts, but not the longer-lasting, hotter temperatures in Phoenix?
Since you haven’t had the courage to present your mathematics, I fear that I can’t respond to that.
The odds that I have given are for one or more occurrences of p equal to 0.05. So it includes the other “significant” result already.
In addition, you have fallen into another trap that I didn’t discuss in the head post. This is that they compared the various congenital problems to several different measures of temperature. The measures were the mean, minimum, and maximum temperatures.
What neither they nor you seem to have thought about, Kasuha, is that we would expect the mean, minimum, and maximum temperature to be highly correlated with each other. If something is well correlated with one of them, it is likely to be correlated with the other two.
As a result, your claim that they are significant “at three criteria at once” is not meaningful. Because the three criteria are well correlated, that is no more significant than temperature being correlated with any one of them.
Finally, you list out the peculiarities of the dataset. You have one condition significantly correlated with three measures, one condition being significantly correlated with two measures, and 26 with no correlation. You say that the chances of “that state coming out of random data is actually pretty unlikely.” And you are right, the chances of any particular specified state are low.
But we’re not interested in just that state. We are interested in all possible states that have one or more results of p = 0.05. There are many, many more of them than just the particular one that occurred in this dataset.
So while you are correct that the odds of that particular state are small, the odds that we’ll have at least one result of p = 0.05 are quite large, as I calculated above.
w.
I concur with Kasuha’s comment above. This is a hypothesis-generating study and the authors stress the need to confirm the possible association in further studies. The authors seem to be well aware of the issue raised by Willis Eschenbach since they discussed it themselves: “Last, because we performed multiple tests to examine the relationships between 28 birth defects groups and various heat exposure indicators in this hypothesis-generating study, statistically significant findings may have been attributable to chance. Under the null hypothesis, we would expect 4 of the 84 effect estimates displayed in Table 3 to be statistically significant at the p = 0.05 level. […] However, the associations with congenital cataracts are biologically plausible, particularly given stronger associations during the relevant developmental window of lens development, and associations were consistent across exposure metrics, making chance a less likely explanation for these findings.”
Willis
You are right. There is no Bonferroni correction. The authors seem to think that performing multiple logistic regression will take care of their ‘confounders’, an additional indication the authors think anything that turns up in their univariate analysis must automatically be meaningful.
Cataract incidence is, however, associated with three different temperature indices in their data. It would be worth applying the correction and then checking whether these associations still stand.
Where is the check for association between cataracts and urban heat islands? One may even swallow that heat islands don’t affect gridded anomalies in well-constructed climate datasets, but heat islands definitely affect the individual people living in them! Heat islands are definite confounders, accepting the authors’ own findings.
Did the authors correct for Rubella immunization status? I don’t see it.
The incidence of cataracts may well be affected by confounders the authors list (for e.g., alcohol consumption). But importantly, cataracts occur in association with other congenital birth defects, as part of syndromes, a good proportion of which may be undiagnosed and therefore unreported. The authors don’t appear to correct for these.
The authors perform a sub-analysis looking for associations between congenital defects, but only among pre-term deliveries. A reasonable guess is that the authors do this simply because they are able to: pre-term deliveries with congenital defects have such data recorded. But cataracts can be congenital and be associated with other defects, and yet not be diagnosed at birth.
The study design is possibly not the best for answering questions of the kind the authors raise. Their controls are people who were exposed to the same temperatures and gave birth to babies with no defects. How are they controls?! You’ve excluded the very effect you are trying to study from the controls. A better design would have been to randomly select individuals who gave birth during non-heat wave periods, irrespective of occurrence of birth defects.
These people are no more scientists than some people in the banking sector are wealth creators; where wealth = something of value and not just a symbol.
Actually, this is nothing like as bad as it looks at first blush. They got some hits, then looked closer, recognized the potential problem chance correlation, and said that their result looked interesting but would need confirmation.
In correlation tests of heat against 28 birth defects, they found statistical significance at the 5% level for three: congenital cataracts, renal agenesis/hypoplasia, and also a reduced occurrence of anophthalmia/microphthalmia. I’d expect the latter to be rare, and so have pretty poor sample size. But for cataracts, figure 2 in the paper shows the association as statistically significant during weeks 4, 6, and 7. They state weeks 4–7 to be the time that the developing lens is most susceptible (as shown, e.g., by data from mothers with Rubella during pregnancy**). That’s a bit less random than just getting a hit for statistical significance somewhere within weeks 3–8. Not startling, just noteworthy. They ran correlations against a few different criteria for hot weather, i.e., max, min, and mean temperature, and got hits for cataracts with all of these. Since these ought to be fairly well correlated, it’s hard to tell how much that means.
I’ve heard of researchers taking a group of variables, running 100 cross-correlations, getting five hits at the 5% level, and presenting those, with a straight face, as though that were meaningful. These authors were aware of, and addressed, the problem of getting statistical significance by chance:
And, they say explicitly that these preliminary results need confirmation, with their last sentence stating:
**From radiation studies with mice and rats, they can actually pick a day to irradiate the pregnant animal, and get different birth defects depending on which organs are most susceptible that day. Couldn’t find a good web ref, but it’s in here:
http://books.google.com/books/about/Primer_of_medical_radiobiology.html?id=XylrAAAAMAAJ
Oh, yeah. Kasuha, please note that I have treated the three temperature conditions (max, min, and mean) as one, because they are well correlated.
If we include each of them as a separate temperature condition, however, we no longer have 28 possibilities. We have three times that many, 84 possibilities … and with N = 84, we are very, very likely to get false positives.
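For what it’s worth, plugging N = 84 into the formula from the head post bears that out (a quick sketch; the function name is mine):

```python
def familywise_fp(p, n):
    """Chance of at least one false positive among n independent
    tests, each run at significance level p."""
    return 1 - (1 - p) ** n

for n in (28, 84):
    print(n, round(familywise_fp(0.05, n), 3))
# n=28 gives about 0.762; n=84 gives about 0.987
```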
w.
As is typical, Willis provides excellent analysis. Still, I fully expect to read this story in the Sunday papers. Snip, snip…cut and paste. Big headlines.
Newspapers are in a death spiral and desperately need headlines that “break through”. The mindless readers (50% don’t get past the first paragraph) scan and digest. The information agrees with all the other flawed reporting, therefore they are comfortable with the story line, i.e., it is symmetrical with their level one thinking.
Pierre-Normand says:
December 20, 2012 at 2:46 am
Thanks, Pierre. Yes, indeed they did comment on that … but unfortunately, they didn’t seem to understand what it meant. They still claimed that their results are significant.
But what their own calculation means is that their results are not statistically significant. I don’t care if the outcome is “biologically plausible”. Not significant is not significant, and it should not be written up.
At least on my planet, results that are not statistically significant are not worthy of a paper. Yet these results have been hyped around the planet by the media, which clearly thinks they are significant and is reporting them as settled fact.
Now, why in the world would the media think that the results are significant? Here is the claim from their abstract (my emphasis):
Perhaps you could comment on the ethics of writing up a paper to present results that the authors know for a fact are not statistically significant, Pierre, and despite that, claiming in the abstract that the results are significant.
Because in my world, that is agenda-driven deception, and it has no place in science.
All the best,
w.