Margins of Error

James D. Agresti | President | Just Facts

Ever lost a bet? From the lottery or sporting events to casinos or friendly wagers, you may have risked and lost some money because you hoped to win big.

But let me ask you this: How big would the payout have to be and how good would the odds need to be to gamble with your life or the lives of your loved ones?

In this lesson from Just Facts Academy about Margins of Error, we’ll show you how people do that without even realizing it. And more importantly, we’ll give you the tools you need to keep from falling into this trap.

Ready? C’mon, what have you got to lose?



People often use data from studies, tests, and surveys to make life-or-death decisions, such as what medicines they should take, what kinds of foods they should eat, and what activities they should embrace or avoid.

The problem is that such data isn’t always as concrete as the media and certain scholars make it out to be.

Look at it this way. There are four layers to this “margin of error” cake. Let’s start with the simplest one, like this headline from the Los Angeles Times, which declares, “California sea levels to rise 5-plus feet this century, study says.”[1]

That sounds pretty scary, but the study has margins of error, and it actually predicts a sea-level rise of 17 to 66 inches.[2] In the body of the article, the reporter walks back the headline a little, but he fails to provide even a hint that the “5-plus feet” is the upper bound of an estimate that extends all the way down to roughly a quarter of that figure.[3]

Studies often have margins of error, or bounds of uncertainty, so the moment you hear someone summarize a study with a single figure, dig deeper. This is the same principle taught in Just Facts Academy’s lesson on Primary Sources: Don’t rely on secondary sources because they often reflect someone’s interpretation of the facts—instead of the actual facts.

Also, don’t assume that the authors of the primary sources will report the vital margins of error near the top of their studies. In the famed Bangladesh face mask study, for example, the authors lay down 4,000 words before they disclose a range of uncertainty that undercuts their primary finding.[4] [5] [6]

Here are a few more tips to help you critically examine margins of error.

Surveys often present their results like this:

11.5% ± 0.3

It’s quite simple. The first number is the nominal or best estimate, technically called the “point estimate.” The second number is the margin of error.

In the case of this survey,[7] it means that the best estimate of the U.S. poverty rate is 11.5%, but the actual figure may be as low as 11.2% or as high as 11.8%.
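
If it helps to see the arithmetic spelled out, here is a minimal Python sketch using the poverty figures above. It does nothing more than subtract and add the margin of error:

```python
# A minimal sketch of the arithmetic behind "11.5% +/- 0.3", using the
# poverty figures quoted above.

point_estimate = 11.5   # best estimate of the U.S. poverty rate (percent)
margin_of_error = 0.3   # reported margin of error (percentage points)

low = point_estimate - margin_of_error
high = point_estimate + margin_of_error

print(f"Best estimate: {point_estimate}%")
print(f"With the margin of error: {low:.1f}% to {high:.1f}%")  # 11.2% to 11.8%
```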

Scholarly publications often use a less intuitive convention and present their results like this:

4.70; 95% CI, 1.77–12.52

Now, don’t let this barrage of digits intimidate you. They’re actually easy to understand once you crack the code.

The first number is the best estimate. In the case of this study,[8] it means that bisexual men are roughly 4.7 times “more likely to report severe psychological distress” than heterosexual men.

The last two numbers are the outer bounds of the study’s results after the margins of error are included. They mean that bisexual men are about 1.8 to 12.5 times more likely to report distress than heterosexual men.[9]
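
Here is a short sketch of how to unpack the “estimate; 95% CI, low–high” convention, using the numbers from this study. The “times more likely” wording follows the shorthand used above for the study’s odds ratios:

```python
# Unpacking "4.70; 95% CI, 1.77-12.52" from the study discussed above.
# The first number is the point estimate; the CI gives the outer bounds
# once the sampling margin of error is included.

odds_ratio = 4.70               # point estimate (best estimate)
ci_low, ci_high = 1.77, 12.52   # lower and upper bounds of the 95% CI

print(f"Best estimate: about {odds_ratio:.1f} times more likely")
print(f"Outer bounds:  about {ci_low:.1f} to {ci_high:.1f} times more likely")
print(f"The upper bound is {ci_high / ci_low:.0f} times the lower bound")  # ~7x spread
```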

That’s a really broad range, especially when compared to the single figure of 4.7. Do you see why margins of error are so essential?

Now, here’s something a lot of people don’t know because journalists rarely explain it or don’t understand it: Reported margins of error and ranges of uncertainty typically account for just one type of error, known as sampling error.[10] [11] [12] [13] [14] [15] This is the uncertainty that comes purely from measuring a sample rather than the entire population. Generally speaking, the larger the sample, the smaller the margin of sampling error.[16]
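
To see why sample size drives the margin of sampling error, here is a rough sketch that uses the standard textbook formula for a proportion at 95% confidence. That formula is an assumption made for illustration; real surveys often use more elaborate methods, but the pattern is the same:

```python
import math

def sampling_moe(p: float, n: int, z: float = 1.96) -> float:
    """Margin of sampling error for a proportion p from a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

p = 0.5  # worst-case proportion, which yields the largest margin of error
for n in (100, 1_000, 4_000, 10_000):
    print(f"n = {n:>6,}: about ±{sampling_moe(p, n) * 100:.1f} percentage points")

# Roughly ±9.8 points at n=100, ±3.1 at n=1,000, ±1.5 at n=4,000, ±1.0 at n=10,000.
# Quadrupling the sample only cuts the sampling error in half.
```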

It’s super important to be aware of this, because there are often other layers of uncertainty that aren’t reflected in the margin of sampling error.[17] [18] [19] [20] Figures like 1.77 to 12.52 sound very specific and solid, but that can be an illusion. If you don’t understand this, you can easily be misled into believing that the results of a study are ironclad when they are not.

This brings us to the “95% CI.” What does that mean?

It stands for “95% confidence interval,”[21] and contrary to what your statistics teacher may have told you,[22] it generally means that there’s a 95% chance the upper and lower bounds of the study contain the real figure.[23] That means there’s a 5% chance they don’t.

How’s that for gambling? Would you step outside your home today if you knew there was a 1 in 100 chance you wouldn’t make it back alive? Well, even the outer bounds of most study results are less certain than that.

You see, time, money, and circumstances often limit the sizes of studies, tests, and surveys.[24] So even if their methodologies are sound, reality may lie outside the bounds of the results due to mere chance.[25]
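
One way to picture what that 95% figure is claiming is a quick simulation: draw many random samples from a population whose true rate we set ourselves, build a 95% interval from each sample, and count how often the interval misses the truth. The true rate, sample size, and number of trials below are illustrative assumptions:

```python
import math
import random

random.seed(1)

true_rate = 0.115   # pretend the real population rate is 11.5%
n = 1_000           # size of each simulated survey
trials = 5_000
misses = 0

for _ in range(trials):
    hits = sum(random.random() < true_rate for _ in range(n))
    p_hat = hits / n
    moe = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)   # 95% sampling margin of error
    if not (p_hat - moe <= true_rate <= p_hat + moe):
        misses += 1

print(f"Intervals that missed the true rate: {misses / trials:.1%}")
# Typically prints a figure near 5%: even with flawless sampling, the truth
# falls outside the reported bounds about 1 time in 20.
```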

On top of this, some studies measure multiple types of outcomes while failing to account for the fact that each attempt to measure a separate outcome increases the likelihood of getting a seemingly solid result due to pure chance.[26] Look at it this way: if you roll a pair of dice 12 times, you’re far more likely to roll a 2 than if you roll them once.
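
For the curious, a few lines of Python put exact numbers on the dice analogy and on the multiple-outcomes problem. The 5% false-positive rate per test is the conventional threshold, assumed here for illustration:

```python
def at_least_one(p: float, attempts: int) -> float:
    """Probability of at least one hit in the given number of independent attempts."""
    return 1 - (1 - p) ** attempts

p_snake_eyes = 1 / 36      # chance of rolling a 2 with a pair of dice
p_fluke = 0.05             # chance one test looks "significant" purely by chance

print(f"Roll the dice once:     {at_least_one(p_snake_eyes, 1):.1%} chance of a 2")
print(f"Roll the dice 12 times: {at_least_one(p_snake_eyes, 12):.1%} chance of a 2")
print(f"Measure 1 outcome:      {at_least_one(p_fluke, 1):.1%} chance of a fluke result")
print(f"Measure 12 outcomes:    {at_least_one(p_fluke, 12):.1%} chance of at least one fluke")
```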

Even worse, there are scholars who roll those dice behind the scenes by calculating different outcomes until they find one that provides a result they want. And that’s the only one they’ll tell you about.[27] [28]

Now, let’s take a step back and look at the layers of the cake:

  • First, you have the point estimate.
  • Then, you have the outer bounds, which commonly account for the margin of sampling error but no other sources of uncertainty.
  • Then, you have the confidence interval percentage, or the probability that the outer bounds contain the real figure.

We’ll get to the base layer in a moment, but now is a good time to talk about a concept called “statistical significance,” because we’ve cut through enough cake to understand it.

Study results are typically labeled “statistically significant” if the range of results, after accounting for the margin of sampling error with 95% confidence, is entirely positive or entirely negative.[29] [30] [31] [32]

For example, if a medical study finds a treatment is 10% to 30% effective with 95% confidence, this is considered to be a statistically significant outcome. That’s a shorthand way of saying the result probably isn’t due to sampling error.[33]

And if a study finds that a treatment is –10% to 30% effective with 95% confidence, such a result is considered to be “statistically insignificant” because it crosses the line of zero effect.[34] [35] [36] [37] This could mean that the treatment has a positive effect, or no effect, or a negative effect.[38] [39]
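
Here is a sketch of the labeling rule just described: an interval that sits entirely above zero, or entirely below it, earns the “statistically significant” label, while one that crosses zero does not:

```python
def is_statistically_significant(low: float, high: float) -> bool:
    """True if a 95% interval excludes zero effect entirely."""
    return (low > 0 and high > 0) or (low < 0 and high < 0)

print(is_statistically_significant(10, 30))    # True:  10% to 30% effective
print(is_statistically_significant(-10, 30))   # False: could be harmful, useless, or helpful
```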

One way to sort this out is to look at the size of the study sample. If it’s relatively large, and the results are statistically insignificant, that’s a pretty good indication the effect is trivial.[40] [41] [42] [43]

Hundreds of scholars have called for ending the convention of labeling results as “statistically significant” or “insignificant.” This is because it can lead people to jump to false conclusions.[44] Nonetheless, it’s a common practice,[45] [46] [47] [48] so here are some tips to avoid such risky leaps:

  • First, don’t mistake statistical significance for real-world importance. A study’s results can be statistically significant but also tiny or irrelevant.[49] [50]
  • Second, don’t assume that a statistically insignificant result means there’s no difference or no effect.[51] Sometimes studies are underpowered, which means their samples are too small to detect statistically significant results.[52] In other words, there’s a major distinction between saying that a study “found no statistically significant effect” and saying “there’s no effect.”[53]
  • Third and most importantly, don’t fall into the trap of believing that a study is reliable just because the results are statistically significant.[54] That’s the final layer to the cake, and it’s where the riskiest gambling occurs.

Here’s what I mean.

The study on sea level rise we discussed—well, it’s based on a computer model,[55] a type of study that is notoriously unreliable.[56] [57] [58] [59] [60] [61]

And the study about psychological distress and sexuality—it’s an observational study,[62] which can rarely determine cause and effect, even though scholars falsely imply or explicitly claim that they do.[63] [64] [65] [66] [67]

Then there are all kinds of survey-related errors exposed by Just Facts’ lesson on Deconstructing Polls & Surveys.

Bottom line—the “margins of error” reported by journalists and scholars rarely account for the many other sources of error.

Gone are the days when you could blindly trust a study just because it is publicized by your favorite news source, appears in a peer-reviewed journal, was written by a PhD, or is endorsed by a government agency or professional association.

Incompetence and dishonesty are simply far too rampant for you to outsource major life decisions without critical analysis.

So don’t gamble your life on “experts” who offer solid bets that “you can’t lose.” Instead, keep it locked to Just Facts Academy, so you can learn how to research like a genius.

Just Facts is a research and educational institute dedicated to publishing facts about public policies and teaching research skills.

Endnotes


[1] Article: “California Sea Levels to Rise 5-Plus Feet This Century, Study Says.” By Tony Barboza. Los Angeles Times, June 24, 2012. http://articles.latimes.com/2012/jun/24/local/la-me-adv-sea-level-20120625

Sea levels along the California coast are expected to rise up to 1 foot in 20 years, 2 feet by 2050 and as much as 5 1/2 feet by the end of the century, climbing slightly more than the global average and increasing the risk of flooding and storm damage, a new study says. …

Coastal California could see serious damage from storms within a few decades, especially in low-lying areas of Southern California and the Bay Area. San Francisco International Airport, for instance, could flood if the sea rises a little more than a foot, a mark expected to be reached in the next few decades. Erosion could cause coastal cliffs to retreat more than 100 feet by 2100, according to the report.

[2] Paper: “Sea-Level Rise for the Coasts of California, Oregon, and Washington: Past, Present, and Future.” By the Committee on Sea Level Rise in California, Oregon, and Washington, National Research Council. National Academies Press, 2012. http://www.nap.edu/catalog.php?record_id=13389

Pages 4–6:

For the California coast south of Cape Mendocino, the committee projects that sea level will rise 4–30 cm [2–12 inches] by 2030 relative to 2000, 12–61 cm [5–24 inches] by 2050, and 42–167 cm [17–66 inches] by 2100.

[3] Article: “California Sea Levels to Rise 5-Plus Feet This Century, Study Says.” By Tony Barboza. Los Angeles Times, June 24, 2012. http://articles.latimes.com/2012/jun/24/local/la-me-adv-sea-level-20120625

Sea levels along the California coast are expected to rise up to 1 foot in 20 years, 2 feet by 2050 and as much as 5 1/2 feet by the end of the century, climbing slightly more than the global average and increasing the risk of flooding and storm damage, a new study says. …

Coastal California could see serious damage from storms within a few decades, especially in low-lying areas of Southern California and the Bay Area. San Francisco International Airport, for instance, could flood if the sea rises a little more than a foot, a mark expected to be reached in the next few decades. Erosion could cause coastal cliffs to retreat more than 100 feet by 2100, according to the report.

[4] Article: “Famed Bangladesh Mask Study Excluded Crucial Data.” By James D. Agresti. Just Facts, April 8, 2022. https://www.justfactsdaily.com/famed-bangladesh-mask-study-excluded-crucial-data

Beyond excluding the death data, the authors engaged in other actions that reflect poorly on their integrity. One of the worst is touting their findings with far more certainty than warranted by the actual evidence. For example, some of the authors wrote a New York Times op-ed declaring that “masks work,” a claim undercut by the following facts from their own study: …

• Their study’s “primary outcome,” a positive blood test for Covid-19 antibodies, found that less than 1% of the participants caught C-19, including 0.68% in villages where people were pressured to wear masks, and 0.76% in villages that were not. This is a total difference of 0.08 percentage points in a study of more than 300,000 people.

• Their paper lays down 4,000 words before it reveals the sampling margins of error in the results above, which show with 95% confidence that … cloth masks reduced the risk of catching symptomatic C-19 by as much as 23% or increased the risk by as much as 8%.

• “Not statistically significant” is the common term used to describe study results that aren’t totally positive or totally negative throughout the full margin of error, like the results above. Yet, the authors skip this fact in their op-ed and bury it in their paper, writing at the end of an unrelated paragraph that it showed “no statistically significant effect for cloth masks.”

NOTE: The next two footnotes document the primary sources.

[5] Paper: “Impact of Community Masking on COVID-19: A Cluster-Randomized Trial in Bangladesh.” By Jason Abaluck and others. Science, December 2, 2021. https://www.science.org/doi/10.1126/science.abi9069

Page 3:

We find clear evidence that surgical masks lead to a relative reduction in symptomatic seroprevalence of 11.1% (adjusted prevalence ratio = 0.89 [0.78, 1.00]; control prevalence = 0.81%; treatment prevalence = 0.72%). Although the point estimates for cloth masks suggests that they reduce risk, the confidence limits include both an effect size similar to surgical masks and no effect at all (adjusted prevalence ratio = 0.94 [0.78, 1.10]; control = 0.67%; treatment = 0.61%).

NOTE: The quote above is buried 4,000 words into the paper. Moreover, the authors misleadingly describe these results. The outer bound of “1.00” for surgical masks actually means no effect at all, but the authors fail to use this term when describing that outcome. Instead, they use the term “no effect at all” to describe the outer bound of “1.10” for cloth masks when this actually means a 10% increase in the risk of catching Covid-19.

Page 4:

We find clear evidence that the intervention reduced symptoms: We estimate a reduction of 11.6% (adjusted prevalence ratio = 0.88 [0.83, 0.93]; control = 8.60%; treatment = 7.63%). Additionally, when we look separately by cloth and surgical masks, we find that the intervention led to a reduction in COVID-19–like symptoms under either mask type (p = 0.000 for surgical; p = 0.066 for cloth), but the effect size in surgical mask villages was 30 to 80% larger depending on the specification. In table S9, we run the same specifications using the smaller sample used in our symptomatic seroprevalence regression (i.e., those who consented to give blood). In this sample, we continue to find an effect overall and an effect for surgical masks but see no statistically significant effect for cloth masks.

[6] Commentary: “We Did the Research: Masks Work, and You Should Choose a High Quality Mask if Possible.” By Jason Abaluck, Laura H. Kwong, and Stephen P. Luby. New York Times, September 26, 2021. https://www.nytimes.com/2021/09/26/opinion/do-masks-work-for-covid-prevention.html

“The bottom line is masks work, and higher quality masks most likely work better at preventing Covid-19.”

[7] Report: “Poverty in the United States: 2022.” By Emily A. Shrider and John Creamer. U.S. Census Bureau, September 2023. https://www.census.gov/content/dam/Census/library/publications/2023/demo/p60-280.pdf

Pages 20–21:

Table A-1. People in Poverty by Selected Characteristics: 2021 and 2022

2022 … Below poverty … Percent [=] 11.5 … Margin of error1 (±) [=] 0.3 …

1 A margin of error (MOE) is a measure of an estimate’s variability. The larger the MOE in relation to the size of the estimate, the less reliable the estimate. This number, when added to and subtracted from the estimate, forms the 90 percent confidence interval. MOEs shown in this table are based on standard errors calculated using replicate weights.

[8] Paper: “Comparison of Health and Health Risk Factors Between Lesbian, Gay, and Bisexual Adults and Heterosexual Adults in the United States.” By Gilbert Gonzales, Julia Przedworski, and Carrie Henning-Smith. Journal of the American Medical Association, June 27, 2016. http://archinte.jamanetwork.com/article.aspx?articleid=2530417

Data from the nationally representative 2013 and 2014 National Health Interview Survey were used to compare health outcomes among lesbian (n = 525), gay (n = 624), and bisexual (n = 515) adults who were 18 years or older and their heterosexual peers (n = 67 150) using logistic regression. …

After controlling for sociodemographic characteristics … bisexual men were more likely to report severe psychological distress (OR, 4.70; 95% CI, 1.77-12.52), heavy drinking (OR, 3.15; 95% CI, 1.22-8.16), and heavy smoking (OR, 2.10; 95% CI, 1.08-4.10) than heterosexual men….

[9] Paper: “Comparison of Health and Health Risk Factors Between Lesbian, Gay, and Bisexual Adults and Heterosexual Adults in the United States.” By Gilbert Gonzales, Julia Przedworski, and Carrie Henning-Smith. Journal of the American Medical Association, June 27, 2016. http://archinte.jamanetwork.com/article.aspx?articleid=2530417

Data from the nationally representative 2013 and 2014 National Health Interview Survey were used to compare health outcomes among lesbian (n = 525), gay (n = 624), and bisexual (n = 515) adults who were 18 years or older and their heterosexual peers (n = 67 150) using logistic regression. …

After controlling for sociodemographic characteristics … bisexual men were more likely to report severe psychological distress (OR, 4.70; 95% CI, 1.77-12.52), heavy drinking (OR, 3.15; 95% CI, 1.22-8.16), and heavy smoking (OR, 2.10; 95% CI, 1.08-4.10) than heterosexual men….

[10] Article: “The Myth of Margin of Error.” By Jeffrey Henning. Researchscape, October 13, 2017. https://researchscape.com/blog/the-myth-of-margin-of-error

The margin of sampling error is widely reported in public opinion surveys because it is the only error that can be easily calculated. …

In fact, many researchers will just “do the math” to calculate sampling error, ignoring the fact that the assumptions behind the calculation aren’t being met.

[11] Article: “Iowa Poll: Kamala Harris Leapfrogs Donald Trump to Take Lead Near Election Day. Here’s How.” By Brianne Pfannenstiel. Des Moines Register, November 2, 2024. Updated November 7, 2024. https://www.desmoinesregister.com/story/news/politics/iowa-poll/2024/11/02/iowa-poll-kamala-harris-leads-donald-trump-2024-presidential-race/75354033007/

A new Des Moines Register/Mediacom Iowa Poll shows Vice President Harris leading former President Trump 47% to 44% among likely voters just days before a high-stakes election that appears deadlocked in key battleground states. …

The poll of 808 likely Iowa voters, which include those who have already voted as well as those who say they definitely plan to vote, was conducted by Selzer & Co. from Oct. 28-31. It has a margin of error of plus or minus 3.4 percentage points. …

Questions based on the sample of 808 Iowa likely voters have a maximum margin of error of plus or minus 3.4 percentage points. This means that if this survey were repeated using the same questions and the same methodology, 19 times out of 20, the findings would not vary from the true population value by more than plus or minus 3.4 percentage points.

NOTES:

  • Trump won Iowa by 13.2 percentage points, receiving 55.7% of the vote as compared to 42.5% for Harris.
  • The “maximum margin of error” reported in this article was only the sampling margin of error, as documented in the footnote above.

[12] Post: “Significant Marginal Effects but C.I.S for Predicted Margins Overlapping.” By Dr. Clyde Schechter (Albert Einstein College of Medicine). Statalist, October 10, 2017. https://www.statalist.org/forums/forum/general-stata-discussion/general/1290476-significant-marginal-effects-but-c-i-s-for-predicted-margins-overlapping

If one of the goals is to assess the predicted margins, then they should be presented with confidence intervals because every estimate should always be given with an estimate of the associated uncertainty. (The confidence interval represents a bare minimum estimate of the uncertainty of any estimate in that it accounts only for sampling error, but it is better than nothing.)

[13] Paper: “Handling Missing Within-Study Correlations in the Evaluation of Surrogate Endpoints.” By Willem Collier and others. Statistics in Medicine, September 3, 2023. https://pmc.ncbi.nlm.nih.gov/articles/PMC10704210/

To reduce bias in measures of the performance of the surrogate, the statistical model must account for the sampling error in each trial’s estimated treatment effects and their potential correlation.

A weighted least squares (WLS) approach is also frequently used…. The WLS method accounts only for sampling error of estimated effects on the clinical endpoint.

[14] Paper: “Measuring Coverage in MNCH: Total Survey Error and the Interpretation of Intervention Coverage Estimates from Household Surveys.” PLoS Medicine. May 7, 2013. https://pmc.ncbi.nlm.nih.gov/articles/PMC3646211/

Nationally representative household surveys are increasingly relied upon to measure maternal, newborn, and child health (MNCH) intervention coverage at the population level in low- and middle-income countries. Surveys are the best tool we have for this purpose and are central to national and global decision making. However, all survey point estimates have a certain level of error (total survey error) comprising sampling and non-sampling error, both of which must be considered when interpreting survey results for decision making. … Sampling error is usually thought of as the precision of a point estimate and is represented by 95% confidence intervals, which are measurable. … By contrast, the direction and magnitude of non-sampling error is almost always unmeasurable, and therefore unknown.

[15] Report: “2023 Crime in the United States.” Federal Bureau of Investigation, September 2024. https://www.justfacts.com/document/crime_united_states_2023_fbi.pdf

Page 39 (of the PDF):

BJS [Bureau of Justice Statistics] derives the NCVS [National Crime Victimization Survey] estimates from interviewing a sample. The estimates are subject to a margin of error. This error is known and is reflected in the standard error of the estimate.

NOTE: As documented in the footnote above, the “margin of error” in this survey only accounts for the sampling margin of error.

[16] Book: Statistics for K–8 Educators. By Robert Rosenfeld. Routledge, 2013.

Page 92:

In general, larger random samples will produce smaller margins of error. However, in the real world of research where a study takes time and costs money, at a certain point you just can’t afford to increase the sample size. Your study will take too long or you may decide the increase in precision isn’t worth the expense. For instance, if you increase the sample size from 1,000 to 4,000 the margin of error will drop from about 3% to about 2%, but you might quadruple the cost of your survey.

[17] Paper: “Measuring Coverage in MNCH: Total Survey Error and the Interpretation of Intervention Coverage Estimates from Household Surveys.” PLoS Medicine. May 7, 2013. https://pmc.ncbi.nlm.nih.gov/articles/PMC3646211/

Sampling error is usually thought of as the precision of a point estimate and is represented by 95% confidence intervals, which are measurable. … By contrast, the direction and magnitude of non-sampling error is almost always unmeasurable, and therefore unknown.

[18] Post: “Significant Marginal Effects but C.I.S for Predicted Margins Overlapping.” By Dr. Clyde Schechter (Albert Einstein College of Medicine). Statalist, October 10, 2017. https://www.statalist.org/forums/forum/general-stata-discussion/general/1290476-significant-marginal-effects-but-c-i-s-for-predicted-margins-overlapping

If one of the goals is to assess the predicted margins, then they should be presented with confidence intervals because every estimate should always be given with an estimate of the associated uncertainty. (The confidence interval represents a bare minimum estimate of the uncertainty of any estimate in that it accounts only for sampling error, but it is better than nothing.)

[19] Report: “How Crime in the United States Is Measured.” Congressional Research Service, January 3, 2008. https://crsreports.congress.gov/product/pdf/RL/RL34309

Pages 26–27:

Because the NCVS [National Crime Victimization Survey] is a sample survey, it is subject to both sampling and non-sampling error, meaning that the estimated victimization rate might not accurately reflect the true victimization rate. Whenever samples are used to represent entire populations, there could be a discrepancy between the sample estimate and the true value of what the sample is trying to estimate. …

The NCVS is also subject to non-sampling error. The methodology employed by the NCVS attempts to reduce the effects of non-sampling error as much as possible, but an unquantified amount remains.242

[20] Report: “Estimating the Incidence of Rape and Sexual Assault.” Edited by Candace Kruttschnitt, William D. Kalsbeek, and Carol C. House. National Academy of Sciences, National Research Council, 2014. https://nap.nationalacademies.org/catalog/18605/estimating-the-incidence-of-rape-and-sexual-assault

Page 4:

All surveys are subject to errors, and the NCVS [National Crime Victimization Survey] is no exception. An assessment of the errors and potential errors in a survey is important to understanding the overall quality of the estimates from that survey and to initiate improvements. Total survey error is a concept that involves a holistic view of all potential errors in a survey program, including both sampling error and various forms of nonsampling error.

[21] Report: “How Crime in the United States Is Measured.” Congressional Research Service, January 3, 2008. https://crsreports.congress.gov/product/pdf/RL/RL34309

Page 26:

Because the NCVS [National Crime Victimization Survey] is a sample survey, it is subject to both sampling and non-sampling error, meaning that the estimated victimization rate might not accurately reflect the true victimization rate. Whenever samples are used to represent entire populations, there could be a discrepancy between the sample estimate and the true value of what the sample is trying to estimate. The NCVS accounts for sampling error by calculating confidence intervals for estimated rates of victimization.238 For example, in 2000, the estimated violent crime victimization rate was 27.9 victimizations per 100,000 people aged 12 and older.239 The calculated 95% confidence interval240 for the estimated violent crime victimization rate was 25.85 to 29.95 victimizations per 100,000 people aged 12 and older.241

[22] Paper: “The Correct Interpretation of Confidence Intervals.” By Sze Huey Tan and Say Beng Tan. Proceedings of Singapore Healthcare, 2010. https://journals.sagepub.com/doi/pdf/10.1177/201010581001900316

Page 277:

A common misunderstanding about CIs is that for say a 95% CI (A to B), there is a 95% probability that the true population mean lies between A and B. This is an incorrect interpretation of 95% CI because the true population mean is a fixed unknown value that is either inside or outside the CI with 100% certainty. As an example, let us assume that we know that the true population mean systolic blood pressure and it is 120mmHg. A study conducted gave us a mean systolic blood pressure of 105mmHg with a 95% CI of (95.5 to 118.9 mmHg). Knowing that the true population mean is 120mmHg it would be incorrect to say that there is a 95% probability that the true population mean lies in the 95% CI of (95.5 to 118.9mmHg) because we are certain that the 95% CI calculated did not contain the true population mean. A 95% CI simply means that if the study is conducted multiple times (multiple sampling from the same population) with corresponding 95% CI for the mean constructed, we expect 95% of these CIs to contain the true population mean

[23] Article: “What Does a Confidence Interval Mean?” By Allen B. Downey (Ph.D.), 2023. https://allendowney.github.io/DataQnA/confidence.html

Here’s a question from the Reddit statistics forum (with an edit for clarity):

Why does a confidence interval not tell you that 90% of the time, [the true value of the population parameter] will be in the interval, or something along those lines?

I understand that the interpretation of confidence intervals is that with repeated samples from the population, 90% of the time the interval would contain the true value of whatever it is you’re estimating. What I don’t understand is why this method doesn’t really tell you anything about what that parameter value is.

This is, to put it mildly, a common source of confusion. And here is one of the responses:

From a frequentist perspective, the true value of the parameter is fixed. Thus, once you have calculated your confidence interval, one if two things are true: either the true parameter value is inside the interval, or it is outside it. So the probability that the interval contains the true value is either 0 or 1, but you can never know which.

This response is the conventional answer to this question—it is what you find in most textbooks and what is taught in most classes. And, in my opinion, it is wrong. To explain why, I’ll start with a story.

Suppose Frank and Betsy visit a factory where 90% of the widgets are good and 10% are defective. Frank chooses a part at random and asks Betsy, “What is the probability that this part is good?”

Betsy says, “If 90% of the parts are good, and you choose one at random, the probability is 90% that it is good.”

“Wrong!” says Frank. “Since the part has already been manufactured, one of two things must be true: either it is good or it is defective. So the probability is either 100% or 0%, but we don’t know which.”

Frank’s argument is based on a strict interpretation of frequentism, which is a particular philosophy of probability. But it is not the only interpretation, and it is not a particularly good one. In fact, it suffers from several flaws. This example shows one of them—in many real-world scenarios where it would be meaningful and useful to assign a probability to a proposition, frequentism simply refuses to do so.

Fortunately, Betsy is under no obligation to adopt Frank’s interpretation of probability. She is free to adopt any of several alternatives that are consistent with her commonsense claim that a randomly-chosen part has a 90% probability of being functional. …

Suppose that Frank is a statistics teacher and Betsy is one of his students. …

Now suppose Frank asks, “What is the probability that this CI contains the actual value of μ that I chose?”

Betsy says, “We have established that 90% of the CIs generated by this process contain μ, so the probability that this CI contains is 90%.”

And of course Frank says “Wrong! Now that we have computed the CI, it is unknown whether it contains the true parameter, but it is not random. The probability that it contains μ is either 100% or 0%. We can’t say it has a 90% chance of containing μ.”

Once again, Frank is asserting a particular interpretation of probability—one that has the regrettable property of rendering probability nearly useless. Fortunately, Betsy is under no obligation to join Frank’s cult.

Under most reasonable interpretations of probability, you can say that a specific 90% CI has a 90% chance of containing the true parameter. There is no real philosophical problem with that.

[24] Book: Statistics for K–8 Educators. By Robert Rosenfeld. Routledge, 2013.

Page 92:

In general, larger random samples will produce smaller margins of error. However, in the real world of research where a study takes time and costs money, at a certain point you just can’t afford to increase the sample size. Your study will take too long or you may decide the increase in precision isn’t worth the expense. For instance, if you increase the sample size from 1,000 to 4,000 the margin of error will drop from about 3% to about 2%, but you might quadruple the cost of your survey.

[25] Report: “Drug Use, Dependence, and Abuse Among State Prisoners and Jail Inmates, 2007–2009.” By Jennifer Bronson and others. U.S. Department of Justice, Bureau of Justice Statistics, June 2017. https://bjs.ojp.gov/content/pub/pdf/dudaspji0709.pdf

Page 19:

Standard errors and tests of significance

As with any survey, the NIS [National Inmate Surveys] estimates are subject to error arising from their basis on a sample rather than a complete enumeration of the population of adult inmates in prisons and jails. …

A common way to express this sampling variability is to construct a 95% confidence interval around each survey estimate.

[26] Paper: “Multiple Inference and Gender Differences in the Effects of Early Intervention: A Reevaluation of the Abecedarian, Perry Preschool, and Early Training Projects.” By Michael L. Anderson. Journal of the American Statistical Association, December 2008. Pages 1481–1495. https://are.berkeley.edu/~mlanderson/pdf/Anderson%202008a.pdf

Page 1481:

This article focuses on the three prominent early intervention experiments: the Abecedarian Project, the Perry Preschool Program, and the Early Training Project. …

But serious statistical inference problems affect these studies. The experimental samples are very small, ranging from approximately 60 to 120. Statistical power is therefore limited, and the results of conventional tests based on asymptotic theory may be misleading. More importantly, the large number of measured outcomes raises concerns about multiple inference: Significant coefficients may emerge simply by chance, even if there are no treatment effects. This problem is well known in the theoretical literature … and the biostatistics field … but has received limited attention in the policy evaluation literature. These issues—combined with a puzzling pattern of results in which early test score gains disappear within a few years and are followed a decade later by significant effects on adult outcomes—have created serious doubts about the validity of the results….

Page 1484:

[M]ost randomized evaluations in the social sciences test many outcomes but fail to apply any type of multiple inference correction. To gauge the extent of the problem, we conducted a survey of randomized evaluation works published from 2004 to 2006 in the fields of economic or employment policy, education, criminology, political science or public opinion, and child or adolescent welfare. Using the CSA Illumina social sciences databases, we identified 44 such articles in peer-reviewed journals. …

Nevertheless, only 3 works (7%) implemented any type of multiple-inference correction. … Although multiple-inference corrections are standard (and often mandatory) in psychological research … they remain uncommon in other social sciences, perhaps because practitioners in these fields are unfamiliar with the techniques or because they have seen no evidence that they yield more robust conclusions.

Pages 1493–1494:

As a final demonstration of the value of correcting for multiple inference, we conduct a stand-alone reanalysis of the Perry Preschool Project, arguably the most influential of the three experiments. …

[A] conventional research design [i.e., one that does not account for multiple inference problems] … adds eight more significant or marginally significant outcomes: female adult arrests, female employment, male monthly income, female government transfers, female special education rates, male drug use (in the adverse direction), male employment, and female monthly income. Of these eight outcomes, two (male and female monthly income) are not included in the other two studies [Abecedarian and Early Training]. The remaining six fail to replicate in either of the other studies. …

[Previous] researchers have emphasized the subset of unadjusted significant outcomes rather than applying a statistical framework that is robust to problems of multiple inference. …

Many studies in this field test dozens of outcomes and focus on the subset of results that achieve significance.

[27] Paper: “HARKing, Cherry-Picking, P-Hacking, Fishing Expeditions, and Data Dredging and Mining as Questionable Research Practices.” Journal of Clinical Psychiatry, February 18, 2021. https://www.psychiatrist.com/jcp/harking-cherry-picking-p-hacking-fishing-expeditions-and-data-dredging-and-mining-as-questionable-research-practices/

P-hacking is a QRP [questionable research practice] wherein a researcher persistently analyzes the data, in different ways, until a statistically significant outcome is obtained; the purpose is not to test a hypothesis but to obtain a significant result. Thus, the researcher may experiment with different statistical approaches to test a hypothesis; or may include or exclude covariates; or may experiment with different cutoff values; or may split groups or combine groups; or may study different subgroups; and the analysis stops either when a significant result is obtained or when the researcher runs out of options. The researcher then reports only the approach that led to the desired result.3,8

[28] Paper: “Big Little Lies: A Compendium and Simulation of p-Hacking Strategies.” By Angelika M. Stefan and Felix D. Schönbrodt. Royal Society Open Science, February 2023. https://royalsocietypublishing.org/doi/10.1098/rsos.220346

In an academic system that promotes a ‘publish or perish’ culture, researchers are incentivized to exploit degrees of freedom in their design, analysis and reporting practices to obtain publishable outcomes [1]. In many empirical research fields, the widespread use of such questionable research practices has damaged the credibility of research results [2–5]. Ranging in the grey area between good practice and outright scientific misconduct, questionable research practices are often difficult to detect, and researchers are often not fully aware of their consequences [6–8].

One of the most prominent questionable research practices is p-hacking [4,9]. Researchers engage in p-hacking in the context of frequentist hypothesis testing, where the p-value determines the test decision. If the p-value is below a certain threshold α, it is labelled ‘significant’, and the null hypothesis can be rejected. In this paper, we define p-hacking broadly as any measure that a researcher applies to render a previously non-significant p-value significant.

p-hacking was first described by De Groot [10] as a problem of multiple testing and selective reporting. The term ‘p-hacking’ appeared shortly after the onset of the replication crisis [9,11], and the practice has since been discussed as one of the driving factors of false-positive results in the social sciences and beyond [12–14]. Essentially, p-hacking exploits the problem of multiplicity, that is, α-error accumulation due to multiple testing [15]. Specifically, the probability to make at least one false-positive test decision increases as more hypothesis tests are conducted [16,17]. When researchers engage in p-hacking, they conduct multiple hypothesis tests without correcting for the α-error accumulation, and report only significant results from the group of tests. This practice dramatically increases the percentage of false-positive results in the published literature [18].

 

[29] Article: “In Research, What Does A ‘Significant Effect’ Mean?” By Matthew Di Carlo (PhD). Albert Shanker Institute, November 1, 2011. https://www.shankerinstitute.org/blog/research-what-does-significant-effect-mean

Then there’s the term “significant.” “Significant” is of course a truncated form of “statistically significant.” Statistical significance means we can be confident that a given relationship is not zero. That is, the relationship or difference is probably not just random “noise.” A significant effect can be either positive (we can be confident it’s greater than zero) or negative (we can be confident it’s less than zero). In other words, it is “significant” insofar as it’s not nothing. The better way to think about it is “discernible.” There’s something there.

[30] Paper: “Effectiveness of Adding a Mask Recommendation to Other Public Health Measures to Prevent SARS-CoV-2 Infection in Danish Mask Wearers.” By Henning Bundgaard and others. Annals of Internal Medicine, November 18, 2020. https://www.acpjournals.org/doi/10.7326/M20-6817

“Although the difference observed was not statistically significant, the 95% CIs [confidence intervals] are compatible with a 46% reduction to a 23% increase in infection.”

[31] Report: “Drug Use, Dependence, and Abuse Among State Prisoners and Jail Inmates, 2007–2009.” By Jennifer Bronson and others. U.S. Department of Justice, Bureau of Justice Statistics, June 2017. https://bjs.ojp.gov/content/pub/pdf/dudaspji0709.pdf

Page 19:

Standard errors and tests of significance

As with any survey, the NIS [National Inmate Survey] estimates are subject to error arising from their basis on a sample rather than a complete enumeration of the population of adult inmates in prisons and jails. …

A common way to express this sampling variability is to construct a 95% confidence interval around each survey estimate.

[32] Paper: “School Vouchers and Student Outcomes: Experimental Evidence from Washington, DC.” By Patrick J. Wolf and others. Journal of Policy Analysis and Management, Spring 2013. Pages 246-270. http://onlinelibrary.wiley.com/doi/10.1002/pam.21691/abstract

Page 258: “Results are described as statistically significant or highly statistically significant if they reach the 95 percent or 99 percent confidence level, respectively.”

[33] Article: “Statistical Significance.” By Michael McDonough. Encyclopaedia Britannica. Last updated October 15, 2024. https://www.britannica.com/topic/statistical-significance

“Statistical significance implies that an observed result is not due to sampling error.”

[34] Article: “In Research, What Does A ‘Significant Effect’ Mean?” By Matthew Di Carlo (PhD). Albert Shanker Institute, November 1, 2011. https://www.shankerinstitute.org/blog/research-what-does-significant-effect-mean

Then there’s the term “significant.” “Significant” is of course a truncated form of “statistically significant.” Statistical significance means we can be confident that a given relationship is not zero. That is, the relationship or difference is probably not just random “noise.” A significant effect can be either positive (we can be confident it’s greater than zero) or negative (we can be confident it’s less than zero). In other words, it is “significant” insofar as it’s not nothing. The better way to think about it is “discernible.” There’s something there.

[35] Paper: “Relative Plasma Volume Monitoring During Hemodialysis Aids the Assessment of Dry-Weight.” By Arjun D Sinha, Robert P Light, and Rajiv Agarwal. Hypertension, December 28, 2009. https://pmc.ncbi.nlm.nih.gov/articles/PMC2819307/

“Mean changes and their 95% confidence intervals are shown. If the confidence interval crosses zero, the mean is statistically insignificant at the 5% level.”

[36] Paper: “Insignificant Effect of Arctic Amplification on the Amplitude of Midlatitude Atmospheric Waves.” By Russell Blackport and James A Screen. Science Advances, February 19, 2020. https://pmc.ncbi.nlm.nih.gov/articles/PMC7030927/

“In all cases, the spread of the modeled LWA [local wave activity] trends crosses zero, consistent with the statistically insignificant observed multidecadal trends.”

[37] Paper: “Effectiveness of Adding a Mask Recommendation to Other Public Health Measures to Prevent SARS-CoV-2 Infection in Danish Mask Wearers.” By Henning Bundgaard and others. Annals of Internal Medicine, November 18, 2020. https://www.acpjournals.org/doi/10.7326/M20-6817

“Although the difference observed was not statistically significant, the 95% CIs [confidence intervals] are compatible with a 46% reduction to a 23% increase in infection.”

[38] Paper: “Effectiveness of Adding a Mask Recommendation to Other Public Health Measures to Prevent SARS-CoV-2 Infection in Danish Mask Wearers.” By Henning Bundgaard and others. Annals of Internal Medicine, November 18, 2020. https://www.acpjournals.org/doi/10.7326/M20-6817

“Although the difference observed was not statistically significant, the 95% CIs [confidence intervals] are compatible with a 46% reduction to a 23% increase in infection.”

[39] Paper: “A Review of High Impact Journals Found That Misinterpretation of Non-Statistically Significant Results From Randomized Trials Was Common.” By Karla Hemming, Iqra Javid, and Monica Taljaard. Journal of Clinical Epidemiology, May 2022. https://www.sciencedirect.com/science/article/pii/S089543562200021X

The first and most problematic issue is when inconclusive trials are interpreted as providing definitive evidence that the treatment under evaluation is ineffective [10]. This is referred to as conflating no evidence of a difference with evidence of no difference (i.e., conflating absence of evidence with evidence of absence) [1]. …

Almost all abstracts of RCTs [randomized controlled trials] published in high impact journals with non-statistically significant primary outcomes appropriately report treatment effects and confidence intervals, yet most make definitive conclusions about active treatments being no different to the comparator treatment, despite this being prima facia [at first sight] inconsistent with a non-statistically significant primary outcome result. … In addition, a large number of studies unhelpfully provide no informative interpretation: in the overall conclusion they simply state that the result is non-statistically significant, despite having reported confidence intervals in the results section. … Clear statements that the study finding is inconclusive (i.e., when the confidence interval provides support for both benefit and harm) in reports of RCTs in high impact journals are rare. Despite high profile campaigns in 2016 to put a stop to this poor practice [38], our review demonstrates that the practice of misinterpretation is still highly prevalent. …

Thus, it might be possible that some studies which reported an overall interpretation of no difference between the two treatment arms were correct in this interpretation: some of these associated confidence intervals might well have excluded clinically important differences, although this was not transparent in the abstract [21].

[40] Article: “The Most Objective Evidence Shows No Indication That Covid Vaccines Save More Lives Than They Take. By James D. Agresti. Just Facts, March 2, 2022. https://www.justfactsdaily.com/most-objective-evidence-covid-vaccines-lives

In this case, the “intervention” is FDA-approved Covid vaccines, and the “outcome” is death. That vital data was gathered in RCTs involving 72,663 adults and older children for the Moderna and Pfizer vaccines. However, the FDA presented these results in a place and manner likely to be overlooked, and no major media outlet has covered them.

The results reveal that 70 people died during the Moderna and Pfizer trials, including 37 who received Covid vaccines and 33 who did not. Combined with the fact that half of the study participants were given vaccinations and the other half were given placebos, these crucial results provide no indication that the vaccines save more lives than they take.

Accounting for sampling margins of error—as is common for medical journals and uncommon for the media—the results demonstrate with 95% confidence that:

• neither of the vaccines decreased or increased the absolute risk of death by any more than 0.08% over the course of the trials.

• the vaccines could prevent up to two deaths or cause up to three deaths per year among every 1,000 people.

[41] Book: Multiple Regression: A Primer. By Paul D. Allison. Pine Forge Press, 1998.

Chapter 3: “What Can Go Wrong With Multiple Regression?” https://us.sagepub.com/sites/default/files/upm-binaries/2726_allis03.pdf

Pages 57–58:

Sample size has a profound effect on tests of statistical significance. With a sample of 60 people, a correlation has to be at least .25 (in magnitude) to be significantly different from zero (at the .05 level). With a sample of 10,000 people, any correlation larger than .02 will be statistically significant. The reason is simple: There’s very little information in a small sample, so estimates of correlations are very unreliable. If we get a correlation of .20, there may still be a good chance that the true correlation is zero. …

Statisticians often describe small samples as having low power to test hypotheses. There is another, entirely different problem with small samples that is frequently confused with the issue of power. Most of the test statistics that researchers use (such as t tests, F tests, and chi-square tests) are only approximations. These approximations are usually quite good when the sample is large but may deteriorate markedly when the sample is small. That means that p values calculated for small samples may be only rough approximations of the true p values. If the calculated p value is .02, the true value might be something like .08. …

That brings us to the inevitable question: What’s a big sample and what’s a small sample? As you may have guessed, there’s no clear-cut dividing line. Almost anyone would consider a sample less than 60 to be small, and virtually everyone would agree that a sample of 1,000 or more is large. In between, it depends on a lot of factors that are difficult to quantify, at least in practice.

[42] Article: “Regulatory Scientists Are Quiet About EUA, Kids Vax, Paxlovid and Boosters.” By Dr. Vinay Prasad. https://vinayprasadmdmph.substack.com/p/regulatory-scientists-are-quiet-about

Many scientists made a career fighting for better regulatory standards. Strangely, when it comes to the regulatory policy around COVID-19, they are dead quiet. …

Regulatory experts have told us for year[s] that if outcomes are generally favorable, you need a very large randomized control trial to show a benefit. …

… Boosting 20-year-olds should not come under the auspices of an EUA [emergency use authorization]. You should do a very large randomized trial to show it has a benefit. And if you can’t run the trial because the sample size is too large that tells you something about how marginal the effect size is.

[43] Article: “FDA Violated Own Safety and Efficacy Standards in Approving Covid-19 Vaccines For Children.” By James D. Agresti. Just Facts, July 14, 2022. https://www.justfactsdaily.com/covid-19-vaccines-children-fda-standards-violated

That doesn’t mean the vaccine doesn’t work, but there is no way to be sure. This is because the study was underpowered, a medical term for clinical trials that don’t enroll enough participants to detect important effects. Beyond severe Covid and hospitalizations for it, the Pfizer and Moderna trials were also too underpowered to measure:

• overall hospitalizations, which are far more informative than hospitalizations for Covid because they also measure the side effects of the vaccines.

• all-cause mortality, which is the only objective way to be certain the vaccines save more lives than they take.

To determine the last of those measures with 95% confidence would require a trial with more than half a billion children for a full year. And that assumes the vaccine works flawlessly by preventing all Covid deaths and causing no deaths from side effects. This astronomically large number is needed because deaths from Covid-19 are extremely rare among children, amounting to about one out of every 500,000 children in the first year of pandemic. In fact, children are about 36 times more likely to die of accidents than Covid-19.

Microscopically smaller than an adequate study, the Moderna vaccine trials for children aged 6 months to 5 years included a total of 6,388 children with a median blinded follow-up time of 68–71 days after the second dose. The Pfizer trial was similarly sized.

Comparing the data above, the trials that were conducted would need to be about 400,000 times larger/longer to objectively determine if the vaccines save more toddlers and preschoolers than they kill.

[44] Commentary: “Scientists Rise Up Against Statistical Significance.” By Valentin Amrhein, Sander Greenland, and Blake McShane. Nature, March 20, 2019. https://www.nature.com/articles/d41586-019-00857-9

In 2016, the American Statistical Association released a statement in The American Statistician warning against the misuse of statistical significance and P values. The issue also included many commentaries on the subject. This month, a special issue in the same journal attempts to push these reforms further. It presents more than 40 papers on ‘Statistical inference in the 21st century: a world beyond P < 0.05’. The editors introduce the collection with the caution “don’t say ‘statistically significant’”3. Another article4 with dozens of signatories also calls on authors and journal editors to disavow those terms.

We agree, and call for the entire concept of statistical significance to be abandoned.

We are far from alone. When we invited others to read a draft of this comment and sign their names if they concurred with our message, 250 did so within the first 24 hours. A week later, we had more than 800 signatories—all checked for an academic affiliation or other indication of present or past work in a field that depends on statistical modelling….

[45] Article: “Statistical Significance.” By Michael McDonough. Encyclopaedia Britannica. Last updated October 15, 2024. https://www.britannica.com/topic/statistical-significance

Since its conception in the 18th century, statistical significance has become the gold standard for establishing the validity of a result. Statistical significance does not imply the size, importance, or practicality of an outcome; it simply indicates that the outcome’s difference from a baseline is not due to chance. …

A growing number of researchers have voiced concerns over the misinterpretation of, and overreliance on, statistical significance. Often, analysis ends once an observation has been deemed to be statistically significant, and the observation is treated as evidence of an effect.

[46] Paper: “A Review of High Impact Journals Found That Misinterpretation of Non-Statistically Significant Results From Randomized Trials Was Common.” By Karla Hemming, Iqra Javid, and Monica Taljaard. Journal of Clinical Epidemiology, May 2022. https://www.sciencedirect.com/science/article/pii/S089543562200021X

The first and most problematic issue is when inconclusive trials are interpreted as providing definitive evidence that the treatment under evaluation is ineffective [10]. This is referred to as conflating no evidence of a difference with evidence of no difference (i.e., conflating absence of evidence with evidence of absence) [1]. …

Almost all abstracts of RCTs [randomized controlled trials] published in high impact journals with non-statistically significant primary outcomes appropriately report treatment effects and confidence intervals, yet most make definitive conclusions about active treatments being no different to the comparator treatment, despite this being prima facia [at first sight] inconsistent with a non-statistically significant primary outcome result. … In addition, a large number of studies unhelpfully provide no informative interpretation: in the overall conclusion they simply state that the result is non-statistically significant, despite having reported confidence intervals in the results section. … Clear statements that the study finding is inconclusive (i.e., when the confidence interval provides support for both benefit and harm) in reports of RCTs in high impact journals are rare. Despite high profile campaigns in 2016 to put a stop to this poor practice [38], our review demonstrates that the practice of misinterpretation is still highly prevalent. …

Thus, it might be possible that some studies which reported an overall interpretation of no difference between the two treatment arms were correct in this interpretation: some of these associated confidence intervals might well have excluded clinically important differences, although this was not transparent in the abstract [21].

[47] Textbook: Statistics: Concepts and Controversies (6th edition). By David S. Moore and William I. Notz. W. H. Freeman and Company, 2006.

Page 42: “It is usual to report the margin of error for 95% confidence. If a news report gives a margin of error but leaves out the confidence level, it’s pretty safe to assume 95% confidence.”

[48] Book: Statistics for K–8 Educators. By Robert Rosenfeld. Routledge, 2013.

Page 91:

Why 95%? Why not some other percentage? This value gives a level of confidence that has been found convenient and practical for summarizing survey results. There is nothing inherently special about it. If you are willing to change from 95% to some other level of confidence, and consequently change the chances that your poll results are off from the truth, you will therefore change the resulting margin of error. At present, 95% is just the level that is commonly used in a great variety of polls and research projects.
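
NOTE: The trade-off described above is easy to compute. The following sketch uses hypothetical poll numbers (not from any source cited here) and the standard normal approximation to show how the margin of error for a sample proportion widens as the confidence level rises.

```python
# Margin of error for a sample proportion at several confidence levels.
# Hypothetical example: 52% support in a poll of 1,000 respondents.
from statistics import NormalDist
from math import sqrt

p = 0.52      # observed sample proportion (hypothetical)
n = 1000      # sample size (hypothetical)

for conf in (0.90, 0.95, 0.99):
    z = NormalDist().inv_cdf(0.5 + conf / 2)   # critical z-value for this confidence level
    moe = z * sqrt(p * (1 - p) / n)            # margin of sampling error
    print(f"{conf:.0%} confidence: {p:.1%} +/- {moe:.1%}")

# Typical output:
# 90% confidence: 52.0% +/- 2.6%
# 95% confidence: 52.0% +/- 3.1%
# 99% confidence: 52.0% +/- 4.1%
```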

[49] Article: “Statistical Significance.” By Michael McDonough. Encyclopaedia Britannica. Last updated October 15, 2024. https://www.britannica.com/topic/statistical-significance

A growing number of researchers have voiced concerns over the misinterpretation of, and overreliance on, statistical significance. Often, analysis ends once an observation has been deemed to be statistically significant, and the observation is treated as evidence of an effect. This tendency is especially problematic given that statistical significance is not equal to clinical significance, a measure of effect size and practical importance. In an experiment, a statistically significant result simply indicates that a difference exists between two groups. This difference might be incredibly small, but, without further testing, its practical impact is unknown.

[50] Paper: “Beyond Statistical Significance: Clinical Interpretation of Rehabilitation Research Literature.” By Phil Page. International Journal of Sports Physical Therapy, October 9, 2014. https://pmc.ncbi.nlm.nih.gov/articles/PMC4197528/

While most research focus on statistical significance, clinicians and clinical researchers should focus on clinically significant changes. A study outcome can be statistically significant, but not be clinically significant, and vice‐versa. Unfortunately, clinical significance is not well defined or understood, and many research consumers mistakenly relate statistically significant outcomes with clinical relevance. Clinically relevant changes in outcomes are identified (sometimes interchangeably) by several similar terms including “minimal clinically important differences (MCID)”, “clinically meaningful differences (CMD)”, and “minimally important changes (MIC)”.

In general, these terms all refer to the smallest change in an outcome score that is considered “important” or “worthwhile” by the practitioner or the patient8 and/or would result in a change in patient management9,10. Changes in outcomes exceeding these minimal values are considered clinically relevant. It is important to consider that both harmful changes and beneficial changes may be outcomes of treatment; therefore, the term “clinically‐important changes” should be used to identify both minimal and beneficial differences, but also to recognize harmful changes.
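
NOTE: The distinction between statistical and practical significance can be made concrete with a toy calculation. The sketch below (all numbers hypothetical) shows that a fixed, trivially small difference between two group means becomes “statistically significant” once the sample is large enough, even though the difference itself never changes.

```python
# Statistical vs. practical significance: a tiny difference becomes
# "statistically significant" once the sample is large enough.
# All numbers are hypothetical.
from statistics import NormalDist
from math import sqrt

diff = 0.1   # difference between group means (e.g., 0.1 point on a 100-point scale)
sd = 10.0    # standard deviation within each group

for n in (100, 10_000, 1_000_000):         # per-group sample sizes
    se = sd * sqrt(2 / n)                   # standard error of the difference
    z = diff / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
    print(f"n={n:>9,}  z={z:5.2f}  p={p:.4f}")

# The effect (0.1 on a scale with SD 10) never changes, but the p-value
# crosses the 0.05 threshold purely because n grows.
```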

[51] Commentary: “Scientists Rise Up Against Statistical Significance.” By Valentin Amrhein, Sander Greenland, and Blake McShane. Nature, March 20, 2019. https://www.nature.com/articles/d41586-019-00857-9

How do statistics so often lead scientists to deny differences that those not educated in statistics can plainly see? For several generations, researchers have been warned that a statistically non-significant result does not ‘prove’ the null hypothesis (the hypothesis that there is no difference between groups or no effect of a treatment on some measured outcome)1. …

These and similar errors are widespread. Surveys of hundreds of articles have found that statistically non-significant results are interpreted as indicating ‘no difference’ or ‘no effect’ in around half …

… Neither should we conclude that two studies conflict because one had a statistically significant result and the other did not. These errors waste research efforts and misinform policy decisions.

For example, consider a series of analyses of unintended effects of anti-inflammatory drugs2. Because their results were statistically non-significant, one set of researchers concluded that exposure to the drugs was “not associated” with new-onset atrial fibrillation (the most common disturbance to heart rhythm) and that the results stood in contrast to those from an earlier study with a statistically significant outcome.

Now, let’s look at the actual data. The researchers describing their statistically non-significant results found a risk ratio of 1.2 (that is, a 20% greater risk in exposed patients relative to unexposed ones). They also found a 95% confidence interval that spanned everything from a trifling risk decrease of 3% to a considerable risk increase of 48% (P = 0.091; our calculation). The researchers from the earlier, statistically significant, study found the exact same risk ratio of 1.2. That study was simply more precise, with an interval spanning from 9% to 33% greater risk (P = 0.0003; our calculation).

It is ludicrous to conclude that the statistically non-significant results showed “no association,” when the interval estimate included serious risk increases; it is equally absurd to claim these results were in contrast with the earlier results showing an identical observed effect.
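
NOTE: Intervals of this kind are computed on the log scale. The sketch below uses hypothetical event counts (the commentary reports only the ratios and intervals, not the raw data) to show the standard calculation and why an interval running from just below 1 to well above 1 is inconclusive rather than evidence of “no association.”

```python
# 95% confidence interval for a risk ratio, computed on the log scale.
# The event counts here are hypothetical; the Nature commentary reports
# only the ratio (1.2) and the interval, not the underlying data.
from math import log, exp, sqrt

a, n1 = 120, 10_000   # events and sample size, exposed group (hypothetical)
b, n2 = 100, 10_000   # events and sample size, unexposed group (hypothetical)

rr = (a / n1) / (b / n2)                      # risk ratio
se_log_rr = sqrt(1/a - 1/n1 + 1/b - 1/n2)     # standard error of log(RR)
lo = exp(log(rr) - 1.96 * se_log_rr)
hi = exp(log(rr) + 1.96 * se_log_rr)
print(f"RR = {rr:.2f}, 95% CI {lo:.2f} to {hi:.2f}")

# An interval that runs from slightly below 1 to well above 1 is
# compatible with both "no effect" and a substantial risk increase;
# it is inconclusive, not a demonstration of "no association."
```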

[52] Entry: “underpowered clinical trial.” Segen’s Medical Dictionary, 2012. https://medical-dictionary.thefreedictionary.com/underpowered+clinical+trial

“A clinical trial that has so few patients in each arm that the results will fall short of the statistical power needed to provide valid answers.”
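
NOTE: Statistical power can be estimated before a trial is run. The sketch below (hypothetical event rates and sample sizes) uses the common normal approximation for comparing two proportions; an “underpowered” trial is one for which this number falls far below the conventional 80% target.

```python
# Approximate power of a two-arm trial comparing event proportions,
# using the normal approximation. All inputs are hypothetical.
from statistics import NormalDist
from math import sqrt

def power_two_proportions(p1, p2, n_per_arm, alpha=0.05):
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    se0 = sqrt(2 * p_bar * (1 - p_bar) / n_per_arm)         # SE under the null
    se1 = sqrt(p1*(1-p1)/n_per_arm + p2*(1-p2)/n_per_arm)   # SE under the alternative
    z = (abs(p1 - p2) - z_crit * se0) / se1
    return NormalDist().cdf(z)

# Trying to detect a drop in event rate from 2% to 1%:
for n in (500, 2_000, 10_000):
    print(f"n = {n:>6,} per arm: power ~ {power_two_proportions(0.02, 0.01, n):.0%}")

# With small arms the trial has little chance of detecting a real effect,
# so a non-significant result says almost nothing either way.
```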

[53] Paper: “A Review of High Impact Journals Found That Misinterpretation of Non-Statistically Significant Results From Randomized Trials Was Common.” By Karla Hemming, Iqra Javid, and Monica Taljaard. Journal of Clinical Epidemiology, May 2022. https://www.sciencedirect.com/science/article/pii/S089543562200021X

The first and most problematic issue is when inconclusive trials are interpreted as providing definitive evidence that the treatment under evaluation is ineffective [10]. This is referred to as conflating no evidence of a difference with evidence of no difference (i.e., conflating absence of evidence with evidence of absence) [1]. …

Almost all abstracts of RCTs [randomized controlled trials] published in high impact journals with non-statistically significant primary outcomes appropriately report treatment effects and confidence intervals, yet most make definitive conclusions about active treatments being no different to the comparator treatment, despite this being prima facia [at first sight] inconsistent with a non-statistically significant primary outcome result. … In addition, a large number of studies unhelpfully provide no informative interpretation: in the overall conclusion they simply state that the result is non-statistically significant, despite having reported confidence intervals in the results section. … Clear statements that the study finding is inconclusive (i.e., when the confidence interval provides support for both benefit and harm) in reports of RCTs in high impact journals are rare. Despite high profile campaigns in 2016 to put a stop to this poor practice [38], our review demonstrates that the practice of misinterpretation is still highly prevalent.

[54] Commentary: “Scientists Rise Up Against Statistical Significance.” By Valentin Amrhein, Sander Greenland, and Blake McShane. Nature, March 20, 2019. https://www.nature.com/articles/d41586-019-00857-9

Nor do statistically significant results ‘prove’ some other hypothesis. Such misconceptions have famously warped the literature with overstated claims and, less famously, led to claims of conflicts between studies where none exists.

[55] Paper: “Sea-Level Rise for the Coasts of California, Oregon, and Washington: Past, Present, and Future.” By the Committee on Sea Level Rise in California, Oregon, and Washington, National Research Council. National Academies Press, 2012. http://www.nap.edu/catalog.php?record_id=13389

Pages 3–4:

Projections of global sea-level rise are generally made using models of the ocean-atmosphere-climate system, extrapolations, or semi-empirical methods. Ocean-atmosphere models are based on knowledge of the physical processes that contribute to sea-level rise, and they predict the response of those processes to different scenarios of future greenhouse gas emissions. These models provide a reasonable estimate of the water density (steric) component of sea-level rise (primarily thermal expansion), but they underestimate the land ice contribution because they do not fully account for rapid changes in the behavior of ice sheets and glaciers as melting occurs (ice dynamics). The IPCC (2007) projections were made using this method, and they are likely too low, even with an added ice dynamics component. Estimates of the total land ice contribution can be made by extrapolating current observations of ice loss rates from glaciers, ice caps, and ice sheets into the future. Extrapolations of future ice melt are most reliable for time frames in which the dynamics controlling behavior are stable, in this case, up to several decades. Semi-empirical methods, exemplified by Vermeer and Rahmstorf (2009), avoid the difficulty of estimating the individual contributions to sea-level rise by simply postulating that sea level rises faster as the Earth gets warmer. This approach reproduces the sea-level rise observed in the past, but reaching the highest projections would require acceleration of glaciological processes to levels not previously observed or understood as realistic. …

Given the strengths and weaknesses of the different projection methods, as well as the resource constraints of an NRC study, the committee chose a combination of approaches for its projections. The committee projected the steric component of sea-level rise using output from global ocean models under an IPCC (2007) mid-range greenhouse gas emission scenario. The land ice component was extrapolated using the best available compilations of ice mass accumulation and loss (mass balance), which extend from 1960 to 2005 for glaciers and ice caps, and from 1992 to 2010 for the Greenland and Antarctic ice sheets. The contributions were then summed. The committee did not project the land hydrology contribution because available estimates suggested that the sum of groundwater extraction and reservoir storage is near zero, within large uncertainties.

[56] Textbook: Flood Geomorphology. By Victor R. Baker and others. Wiley, April 1998.

Page ix:

[T]rue science is concerned with understanding nature no matter what the methodology. In our view, if the wrong equations are programmed because of inadequate understanding of the system, then what the computer will produce, if believed by the analyst, will constitute the opposite of science.

[57] Paper: “The Use and Misuse of Models for Climate Policy.” By Robert S. Pindyck. Review of Environmental Economics and Policy, March 11, 2017. https://www.journals.uchicago.edu/doi/10.1093/reep/rew012

In a recent article (Pindyck 2013a), I argued that integrated assessment models (IAMs) “have crucial flaws that make them close to useless as tools for policy analysis” (page 860). In fact, I would argue that the problem goes beyond their “crucial flaws”: IAM-based analyses of climate policy create a perception of knowledge and precision that is illusory and can fool policymakers into thinking that the forecasts the models generate have some kind of scientific legitimacy. …

The argument is sometimes made that we have no choice—that without a model we will end up relying on biased opinions, guesswork, or even worse. Thus we must develop the best models possible and then use them to evaluate alternative policies. In other words, the argument is that working with even a highly imperfect model is better than having no model at all. This might be a valid argument if we were honest and up-front about the limitations of the model. But often we are not.

[58] Report: “Face Coverings in the Community and COVID-19: A Rapid Review.” Public Health England, June 26, 2020. https://www.justfacts.com/document/face_coverings_community_covid-19_public_health_england_june_2020.pdf

Page 6:

Part of the limitations of modelling studies is that they must make assumptions in cases where the evidence or data are lacking. For example, models used different parameters to define ‘effectiveness’ of masks, which ranged from an 8% (24) reduction in risk to >95% (29) reduction in risk. The nature of modelling studies also means that simulations are run in controlled environments that may not accurately reflect the behaviours that we observe in real life. Unless controlled for, parameters can be fixed that are usually variable.

Pages 7–8:

[M]odelling and laboratories studies provide only theoretical evidence…. We, therefore, cannot recommend the use of modelling studies alone as evidence to inform or change policy measures.

[59] Commentary: “Five Ways to Ensure That Models Serve Society: A Manifesto.” By Andrea Saltelli and others. Nature, June 24, 2020. https://www.nature.com/articles/d41586-020-01812-9

Now, computer modelling is in the limelight, with politicians presenting their policies as dictated by ‘science’2. Yet there is no substantial aspect of this pandemic for which any researcher can currently provide precise, reliable numbers. Known unknowns include the prevalence and fatality and reproduction rates of the virus in populations. There are few estimates of the number of asymptomatic infections, and they are highly variable. We know even less about the seasonality of infections and how immunity works, not to mention the impact of social-distancing interventions in diverse, complex societies.

Mathematical models produce highly uncertain numbers that predict future infections, hospitalizations and deaths under various scenarios. Rather than using models to inform their understanding, political rivals often brandish them to support predetermined agendas. To make sure predictions do not become adjuncts to a political cause, modellers, decision makers and citizens need to establish new social norms. Modellers must not be permitted to project more certainty than their models deserve; and politicians must not be allowed to offload accountability to models of their choosing2,3.

[60] Paper: “Risk of Bias in Model-Based Economic Evaluations: The ECOBIAS Checklist.” By Charles Christian Adarkwah and others. Expert Review of Pharmacoeconomics & Outcomes Research, November 20, 2015. https://www.researchgate.net/profile/Charles-Adarkwah-2/publication/284274465_Risk_of_bias_in_model-based_economic_evaluations_the_ECOBIAS_checklist/links/56544ebb08aeafc2aabbb745/Risk-of-bias-in-model-based-economic-evaluations-the-ECOBIAS-checklist.pdf

Page 1:

Economic evaluations are becoming increasingly important in providing policymakers with information for reimbursement decisions. However, in many cases, there is a significant difference between theoretical study results and real-life observations. This can be due to confounding factors or many other variables, which could be significantly affected by bias. …

There are basically two analytical frameworks used to conduct economic evaluations: model-based and trial-based. In a model-based economic evaluation, data from a wide range of sources [e.g., randomized-controlled trials (RCTs)], meta-analyses, observational studies) are combined using a mathematical model to represent the complexity of a healthcare process.

Page 6:

This study identified several additional biases related to model-based economic evaluation and showed that the impact of these biases could be massive, changing the outcomes from being highly cost-effective to not being cost-effective at all.

[61] Paper: “Economic Evaluations in Fracture Research an Introduction with Examples of Foot Fractures.” By Noortje Anna Clasina van den Boom and others. Injury, March 2022. https://www.sciencedirect.com/science/article/pii/S0020138322000146

The lack of reliable data in the field of economic evaluation fractures could be explained by the lack of reliable literature to base the models on. Since model based studies are the most common design in this field of research, this problem is significant.

[62] Paper: “Comparison of Health and Health Risk Factors Between Lesbian, Gay, and Bisexual Adults and Heterosexual Adults in the United States.” By Gilbert Gonzales, Julia Przedworski, and Carrie Henning-Smith. Journal of the American Medical Association, June 27, 2016. http://archinte.jamanetwork.com/article.aspx?articleid=2530417

Finally, the NHIS [National Health Interview Survey] is a cross-sectional survey and cannot definitively establish the causal directions of the observed associations because cross-sectional studies are prone to omitted variable bias. Missing and unmeasured variables—such as exposure to discrimination or nondisclosure of sexual orientation to family, friends, and health care professionals—may provide alternative explanations for the association between sexual orientation and health outcomes.

NOTE: See the next footnote, where the lead author of this study makes a causal inference about the study.

[63] Article: “Survey Finds Excess Health Problems in Lesbians, Gays, Bisexuals.” By Andrew M. Seaman. Reuters, June 28, 2016. https://ca.news.yahoo.com/survey-finds-excess-health-problems-lesbians-gays-bisexuals-224741845.html

Gilbert Gonzales of the Vanderbilt University School of Medicine in Nashville and colleagues found that compared to heterosexual women, lesbians were 91 percent more likely to report poor or fair health. Lesbians were 51 percent more likely, and bisexual women were more than twice as likely, to report multiple chronic conditions, compared to straight women. …

Gonzales told Reuters Health that the health disparities are likely due to the stress of being a minority, which is likely exacerbated among bisexual people, who may not be accepted by lesbian, gay, bisexual and transgender communities.

[64] Paper: “Association Is Not Causation: Treatment Effects Cannot Be Estimated From Observational Data in Heart Failure.” By Christopher J Rush and others. European Heart Journal, October 2018. https://academic.oup.com/eurheartj/article/39/37/3417/5063542

This comprehensive comparison of studies of non-randomized data with the findings of RCTs [randomized controlled trials] in HF [heart failure] shows that it is not possible to make reliable therapeutic inferences from observational associations.

 

[65] Textbook: Principles and Practice of Clinical Research. By John I. Gallin and Frederick P. Ognibene. Academic Press, 2012.

Page 226: “While consistency in the findings of a large number of observational studies can lead to the belief that the associations are causal, this belief is a fallacy.”

[66] Book: Introductory Econometrics: Using Monte Carlo Simulation with Microsoft Excel. By Humberto Barreto and Frank M. Howland. Cambridge University Press, 2006.

Page 491:

Omitted variable bias is a crucial topic because almost every study in econometrics is an observational study as opposed to a controlled experiment. Very often, economists would like to be able to interpret the comparisons they make as if they were the outcomes of controlled experiments. In a properly conducted controlled experiment, the only systematic difference between groups results from the treatment under investigation; all other variation stems from chance. In an observational study, because the participants self-select into groups, it is always possible that varying average outcomes between groups result from systematic difference between groups other than the treatment. We can attempt to control for these systematic differences by explicitly incorporating variables in a regression. Unfortunately, if not all of those differences have been controlled for in the analysis, we are vulnerable to the devastating effects of omitted variable bias.

[67] Book: Regression With Social Data: Modeling Continuous and Limited Response Variables. By Alfred DeMaris. John Wiley & Sons, 2004.

Page 9:

Regression modeling of nonexperimental data for the purpose of making causal inferences is ubiquitous in the social sciences. Sample regression coefficients are typically thought of as estimates of the causal impacts of explanatory variables on the outcome. Even though researchers may not acknowledge this explicitly, their use of such language as impact or effect to describe a coefficient value often suggest a causal interpretation. This practice is fraught with controversy….

Page 12:

Friedman … is especially critical of drawing causal inferences from observational data, since all that can be “discovered,” regardless of the statistical candlepower used, is association. Causation has to be assumed into the structure from the beginning. Or, as Friedman … says: “If you want to pull a causal rabbit out of the hat, you have to put the rabbit into the hat.” In my view, this point is well taken; but it does not preclude using regression for causal inference. What it means, instead, is that prior knowledge of the causal status of one’s regressors is a prerequisite for endowing regression coefficients with a causal interpretation, as acknowledged by Pearl 1998.

Page 13:

In sum, causal modeling via regression, using nonexperimental data, can be a useful enterprise provided we bear in mind that several strong assumptions are required to sustain it. First, regardless of the sophistication of our methods, statistical techniques only allow us to examine associations among variables.

44 Comments
Zig Zag Wanderer
January 24, 2025 11:06 pm

I’m 64.34% sure that this is valid. With a 97% confidence interval, of course (standard in Climate Scientology).

Reply to  Zig Zag Wanderer
January 24, 2025 11:49 pm

Not very impressed.

Depends on the statistic and what the error band represents. Ninety, 95 or 99% bands, standard errors, standard deviations; whether values are differences or confidence intervals; from another value or from zero, or anomalies from some other value …

Food for thought though …

Cheers,

Dr Bill Johnston
http://www.bomwatch.com.au

Robert Cutler
Reply to  Bill Johnston
January 25, 2025 7:52 am

The confidence interval is important, but what the author failed to mention is the basis for the percentage calculations in certain types of experiments. For example if there were two deaths in the control group, and one in the test group, this would be hailed in the headlines as a 50% reduction in mortality. What’s often left out, or minimized, is the number of people in the trial.

Let’s say the trial included 100,000 people in each group. Then your chance of dying was reduced from 0.002% to 0.001%.

Now if you thought your chance of dying was reduced by 50% by taking some new wonder drug, would you still take it if you knew that your original risk of dying was only one in 50,000?

I found this type of statistical reporting to be rampant in medical studies relating to face mask and vaccine efficacy.
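
To put numbers on the relative-versus-absolute distinction, here is a minimal sketch using the hypothetical trial described in the comment above:

```python
# Relative vs. absolute risk reduction, using the hypothetical trial above:
# 2 deaths in a 100,000-person control group, 1 death in the treated group.
deaths_control, deaths_treated, n = 2, 1, 100_000

risk_control = deaths_control / n            # 0.002%
risk_treated = deaths_treated / n            # 0.001%

rrr = 1 - risk_treated / risk_control        # relative risk reduction
arr = risk_control - risk_treated            # absolute risk reduction
nnt = 1 / arr                                # number needed to treat

print(f"Relative risk reduction: {rrr:.0%}")       # 50%
print(f"Absolute risk reduction: {arr:.3%}")       # 0.001%
print(f"Number needed to treat:  {nnt:,.0f}")      # 100,000
```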

MarkW
Reply to  Robert Cutler
January 25, 2025 9:40 am

Beyond that is the fact that the death counts themselves are not the same from group to group.
If you took the population of an entire state, divided them into groups of 100,000, and watched them for an entire year, the number of people who die, from all causes, would not be the same for each of these groups.

You could have 5 control groups, and the number of deaths in these control groups could range from 0 to 5. Because of this, it’s possible that the drop from 2 to 1 in that study was just random chance.
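
A quick simulation (purely illustrative, assuming every person has the same 2-in-100,000 chance of dying) shows how much such counts vary by chance alone:

```python
# With rare events, counts vary a lot by chance alone.
# Simulate several identical control groups of 100,000 people with a
# 2-in-100,000 annual death rate and compare the counts.
import random

rate, n, groups = 2 / 100_000, 100_000, 20

counts = [sum(1 for _ in range(n) if random.random() < rate) for _ in range(groups)]
print(counts)
# e.g. [1, 3, 2, 0, 2, 4, ...] -- exact values change from run to run.
# A "drop" from 2 to 1 is well within what identical groups produce by luck.
```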

Editor
Reply to  Bill Johnston
January 25, 2025 2:36 pm

Bill ==> This is obviously a re-print from the author’s own site. Not meant for a more educated crowd, but maybe useful to Ma and Pa Jones – who both work down at the plant.

Some important points though for all to remember when looking at results reported in the press.

Reply to  Kip Hansen
January 25, 2025 4:50 pm

Dear Kip,

I think the post is quite confusing: “times more likely” is a factorial difference, thus if it is meaningful, the CIs are factorially large. Compare that to a sample mean, or as the Author contends, “a point estimate” (which I would refer to as a sample).
 
A sample, say a single temperature measurement, has an instrument error associated with it, which is half the interval range – rounds to 0.3 degC; or 0.5 degF (which is roughly equivalent in interval length: 200 half-degC met-thermometer intervals between freezing and boiling, vs (212 – 32 =) 180 one-degF intervals for a degF met-thermometer; I have used both and OK there is a slight difference).
 
It is not clear from his examples, whether he is interested in the spread of values about the mean (Standard Deviation), or the variability (precision) between samples contributing to a mean – SEM (OK, one can be calculated from the other, but they express different attributes of the same data).

A Key difference is that the SEM declines as the number of samples increase, but the SD does not tend to increase (or decrease) as the number of samples increase (see for example https://statisticsbyjim.com/basics/difference-standard-deviation-vs-standard-error/). As it is not unusual for people to report SEM (because it is a smaller number), rather than SD, it is important to recognise the difference.
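
For readers following the SD-versus-SEM distinction, a small numerical sketch with simulated data (not real temperatures) shows the behaviour described above:

```python
# Standard deviation (spread of the data) vs. standard error of the mean
# (precision of the mean). Simulated data only.
import random
from statistics import stdev
from math import sqrt

random.seed(0)
for n in (10, 100, 1000, 10_000):
    sample = [random.gauss(20.0, 2.0) for _ in range(n)]   # mean 20, SD 2
    sd = stdev(sample)
    sem = sd / sqrt(n)
    print(f"n={n:>6}: SD ~ {sd:4.2f}   SEM ~ {sem:5.3f}")

# The SD stays near 2 no matter how many values are drawn;
# the SEM shrinks roughly as 1/sqrt(n).
```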
 
For Confidence Intervals (CIs), which are commonly associated with regression analysis, there is an equally large difference between confidence intervals for the line (CI), and CIs of a prediction (PI). It is important to know which is which, because sometimes people use the former (which is smaller), rather than the latter when predicting a new (unknown) value, or something which is outside the data bandwidth. Without going to the reference, I’m not sure if the sea level study used CIs or PIs for example.   
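
On the confidence-interval-versus-prediction-interval point, here is a sketch with simulated data; it assumes the statsmodels package, which reports both intervals for an ordinary least-squares fit:

```python
# Confidence interval for the fitted line vs. prediction interval for a
# new observation. Simulated data; requires numpy and statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, x.size)   # noisy linear data

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

new = sm.add_constant(np.array([5.0, 12.0]))      # 12.0 lies outside the data range
frame = fit.get_prediction(new).summary_frame(alpha=0.05)
print(frame[["mean", "mean_ci_lower", "mean_ci_upper",
             "obs_ci_lower", "obs_ci_upper"]])

# The "mean_ci" columns bound the fitted line; the wider "obs_ci" columns
# bound where a single new observation is likely to fall.
```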
 
The concept of statistical significance can also be confusing because the lower the p-level, the more significant the effect. The cutoff of p = 0.05 is often taken as the threshold for ascribing “significance”; anything lower is considered more significant (p = 0.01 … 0.001 for example). Most people these days give actual p-levels.
 
The Author says “don’t mistake statistical significance for real-world importance”, which is why people are encouraged by journals to report effect size, together with means and p-levels.
 
I don’t understand the medical study example, and can’t be bothered going to the references.
 
I am also no statistician!
 
All the best,
Dr Bill Johnston
http://www.bomwatch.com.au
  

Reply to  Bill Johnston
January 25, 2025 8:18 pm

A Key difference is that the SEM declines as the number of samples increase, but the SD does not tend to increase …

The first part of your statement is true. However, one is only justified in using the Standard Error of the Mean for data that have the property of stationarity, which means that for time-series, the mean and SD don’t change with time. If the data are not stationary then the drift in the mean can cause the SD to decrease, stay the same, or increase, depending on the slope of the regression line. One can use transformations of the raw data to normalize it and to remove a trend, but doing so confounds the interpretation of the residual data. These are details that most climatologists don’t bother with.
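
A small illustration of the stationarity point, using a simulated series rather than real temperature data:

```python
# A trend inflates the apparent standard deviation of a time series.
# Simulated data only.
import random
from statistics import stdev

random.seed(0)
n = 1000
noise = [random.gauss(0, 1.0) for _ in range(n)]
trended = [0.01 * t + e for t, e in enumerate(noise)]   # drifting mean

print(f"SD of stationary noise: {stdev(noise):.2f}")    # ~1.0
print(f"SD of trended series:   {stdev(trended):.2f}")  # roughly 3x larger

# Summary statistics such as the SD (and the SEM built from it) assume a
# stable mean; with drift present they mix trend and scatter together.
```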

Reply to  Clyde Spencer
January 25, 2025 8:53 pm

Thanks Clyde,

My comments were of a general nature … as they say.

Yes it does become more complex when dealing with time series, where there are also potential problems of autocorrelation, inhomogeneties, outlier values and badly-behaved residuals.

Another very common error is to compare time series using linear correlation, where p-levels are grossly inflated by underlying trend, and for monthly data, cycles. Assumptions underlying tests are too frequently ignored by Excel users in particular.

Cheers,

Bill

Reply to  Clyde Spencer
January 26, 2025 10:50 am

Clyde,

I would be remiss if I didn’t also point out that comparing time series to determine causality is totally unscientific.

Bob B.
Reply to  Zig Zag Wanderer
January 25, 2025 3:36 am

I’m not sure if it’s valid or not but I am feeling lucky today.

Richard Greene
Reply to  Zig Zag Wanderer
January 25, 2025 4:07 am

The IPCC is 103% sure with a 105% confidence level. 97% is for losers.

January 25, 2025 2:13 am

When studies show vaccines as safe, or protective against “side effects,” they have huge margins of error; when damage is found from a vaccine, the margin of error is sometimes narrow. And sometimes it is large, and these less precise studies are cited to “show no proven harm from the vaccine” – when the title, from a “precautionary principle” standpoint (*), should be “study fails to show vaccine safety.”

Notably, studies showing fewer MS cases among those getting the hep B vaccine had ridiculously large margins of error, so much so that they are meaningless.

Also, a problem is that a study showing “no significant” link between A and B is used to dismiss the link when it actually can reinforce (very weakly) the idea of a strong link, when other studies are significant.

(*) yeah, you may not like that “precautionary” basis, but THEY made their beds, THEY have to lie in them; they don’t get to drop the “precautionary principle” when they get Big Pharma funds.

Richard Greene
January 25, 2025 3:58 am

This article is too long and a tedious read. I made it 1/3 through before I gave up. I was expecting a transition to climate, not bisexual men!

The article misses the most important points. The author seems detached from common sense reality. The editor who chose this gets a Three Stooges head bop and eye poke.

The bottom line is applying statistics to incompetent data, and/or data compiled by biased authors, is a meaningless exercise. Mathematical Mass-turbation.

The most important aspects of a study:

(1) Does it pass your BS test based on experience and intuition?

(2) Are the data appropriate for the conclusion?

(3) How much contrary data are being ignored?

(4) Are the data accurate?

(5) Do the authors reject the current consensus without explaining why, or not even trying to refute the current consensus?

(6) Are the study authors likely to be biased?

(7) Has the study been replicated?

(8) Do other studies you trust come to the same or similar conclusions?

(9) Are you only reading studies that confirm what you already believed?

There is always a publication bias: Studies with exciting results get published and read. Inconclusive results are often ignored. Boring results are often ignored.

Then there is the replication failure rate.

Finally, there is my Study / Article Rule of Thumb:

Half of what you read is BS. The tough job is figuring out which half. BS studies have statistics too.

Mark Twain famously said, “It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so,” meaning the real problem is not ignorance, but confidently believing things that are actually not true. 

Common forms of bias: I have two distant, leftist friends. I try to avoid them. Both are wealthy and think they know it all on every subject.

If you disagree with Steve on any subject, he says: “You have book smarts, but I have street smarts”, meaning he’s right and I’m wrong.

If you disagree with Dorothy on any subject, she claims you have confirmation bias, never mind that she never heard or read a conservative news source in her life.

These two people are multi-millionaires, think they know everything and would never answer a question with “No one knows”.

People are not 100% logical like imaginary Vulcans in Star Trek.

Reply to  Richard Greene
January 25, 2025 5:07 am

Much of what is discussed here deals with studies where something either occurs or it doesn’t. The example at the very start, where he discusses a 90% good, 10% bad quality assessment, is not a measurement example. The outcome is good or bad, i.e., binary.

Measurements don’t work that way. That is why uncertainty is not treated as good or bad. That is what the old error paradigm used to do. By “removing” all error, one could guarantee that the “true value” had been established. The new paradigm of using uncertainty emphasizes the fact that all measurements are estimates.

Uncertainty is used to assess the probability distribution created by a number of measurements of the same thing. The thing that most people miss is that every measurement is an estimate. Therefore, the probability distribution itself is “fuzzy”. It is one reason that resolution is so important: it establishes a boundary condition on the information that you know. ISO (International Organization for Standardization) requires standard uncertainty intervals be expanded by at least a factor of 2 so that the “fuzziness” can be taken into account.

Food for thought. CAGW adherents insist that anomalies are the only way to assess whether the global temperature is increasing or decreasing. That just isn’t the case. An average of absolute temperatures will also tell you if the average temperature is increasing or decreasing. It won’t tell you where, but anomalies don’t either.

The real reason is a form of data dredging. A variance of small numbers will be a small number. A variance of large numbers will be a large number. By drawing a graph using small numbers, any change looks large. What is the relative value however?

0.05°/1° = 5%

0.05°/70° = 0.07%

In other words, the apparent change looks vastly different on a graph. This is a form of finding a statistical method that gives you the desired effect, just as the article states.

Richard Greene
Reply to  Jim Gorman
January 25, 2025 8:04 am

Temperature anomalies and absolute temperatures tell us the same thing. But anomalies look scarier on a chart while absolute is close to a flat line.


The margin of error for the global average temperature statistic can be claimed to be +/- 0.1 degrees C, but that is total BS. There were few measurements before 1900 with little coverage of the Southern Hemisphere. The margin of error could not possibly be known.

While a specific instrument may have a margin of error rating, how often are weather station instruments calibrated to be sure they still meet specs?

And what about the changing environments around the land weather stations?

Garbage or biased data
= garbage conclusions

Garbage or biased data
+ statistical analyses
= garbage conclusions

Reply to  Richard Greene
January 26, 2025 5:45 am

“Temperature anomalies and absolute temperatures tell us the same thing.”

Actually they don’t, not if the issue being investigated is the “climate”. An anomaly of +2C is far different for climate if the change is from -1C to +1C than if it is from 20C to 22C.

Anomalies of temperature can’t tell you *anything* about climate. In fact, temperature can’t do so either! I still don’t understand why climate science clings to temperature as the proper metric for climate when it isn’t the proper metric!

My go to example is the difference in climate between Las Vegas and Miami. Both have similar temperatures but vastly different climate. So how can temperature be used as a metric to compare two different locations?

Mr.
Reply to  Jim Gorman
January 25, 2025 10:42 am

As far as I can tell, drawing from my early stint as a factory production auditor, climate “data” have only 3 issues that render them unfit for use –

PROBITY
PROVENANCE
PRESENTATION

Apart from that, I think reading the Babylon Bee is more informative.

hdhoese
Reply to  Richard Greene
January 25, 2025 6:53 am

Yep, whole stat course, but important point mentioned at least twice about burying the caveat(s) seems to be too common nowadays. I remember the phrase “best available science” now sometimes like “all models are wrong, but….”

Jeff Alberts
Reply to  Richard Greene
January 25, 2025 7:08 am

“This article is too long and a tedious read.”

Hmm. You tell us all the time that you read a lot of scientific papers. Are those exciting and titillating? Please show us one.

Richard Greene
Reply to  Jeff Alberts
January 25, 2025 8:19 am

I have read about 200 scientific papers on CO2 enrichment and plant growth. About one a month for 20 years. Tedious reading until I found good one page summaries of such studies at a good website: Their archive is still available:

CO2 Science

I did a lot of tedious reading for a BS degree: Thermodynamics, Mechanics, Physics, and Differential Equations. No tolerance any more. Now my most tedious scientific (anatomy) reading is Big Hooters Illustrated

Editor
Reply to  Jeff Alberts
January 25, 2025 2:38 pm

Jeff ==> The main article, sans endnotes, is only 1600 words, a 5 or 6 minute read.

Scarecrow Repair
Reply to  Richard Greene
January 25, 2025 8:45 am

And you complain the article is too long and drifts off topic???

fah
January 25, 2025 7:21 am

One of my favorite quotes, attributed to Ernest Rutherford, is “If your experiment needs statistics, you ought to have done a better experiment.” The gist of the statement is that it is much preferred to study something that is well defined and has a well defined set of initial conditions and outcomes, than trying to untangle a bunch of miscellaneous iota about an ill defined set of outcomes.

A second favorite quote is Feynman’s description of an ideal experiment, from his lectures in the section titled Quantum Behavior, “…. define an “ideal experiment” as one in which there are no uncertain external influences, i.e., no jiggling or other things going on that we cannot take into account. We would be quite precise if we said: “An ideal experiment is one in which all of the initial and final conditions of the experiment are completely specified.” What we will call “an event” is, in general, just a specific set of initial and final conditions.”

A third relevant example is an often mentioned hypothetical about asking many people how deep the Nile is (or any other thing for which they don’t actually have personal empirical evidence). The argument will go that if you ask enough people the mean will approach some number and the error (either standard deviation or standard error of the mean) will get small and even more by the “law of large numbers” the distribution will approach normal. So one could wind up with an estimate and confidence intervals with “statistics” backing them up. But Rutherford will still say you ought to have done a better experiment. The thing you are studying makes a difference.

Reply to  fah
January 25, 2025 7:57 am

The thing you are studying makes a difference.

Great quotes! Your interpretations are right on point also.

Richard Greene
Reply to  fah
January 25, 2025 8:22 am

That WAS A GOOD COMMENT

Reply to  fah
January 25, 2025 8:33 pm

A third relevant example is an often mentioned hypothetical about asking many people how deep the Nile is (or any other thing for which they don’t actually have personal empirical evidence).

It is my opinion that the claim is false. If you ask a lot of people to estimate things they have familiarity with, I can accept that the average might approach the correct value. However, if you ask people a question that none of them even have a vague idea of the order of magnitude [such as the depth of the clouds on Jupiter, the weight of the sun, or God’s phone number] you would do as well to pick a number between zero and infinity.

Reply to  Clyde Spencer
January 26, 2025 5:49 am

The problem is that the Nile has a varying depth as it moves across the terrain and geography. The question should be further restricted to “how deep is the Nile at this latitude and longitude?”. In other words it’s like your post concerning stationarity. The depth of the Nile is not stationary in space just like temperature is not stationary in time.

Kevin Kilty
January 25, 2025 8:09 am

How’s that for gambling? Would you step outside your home today if you knew there was a 1 in 100 chance you wouldn’t make it back alive?

The confidence interval should be set according to the consequences of making a Type II error. So, it goes without saying, an error that could cost a person their life should be far smaller than one in a hundred.

The power system reliability is often reported as aiming toward one day of outage in ten years. What is this in terms of a confidence interval? Well, the simplest interpretation is that it is about a day in 3650 days of unplanned outage, or maybe it is truly the accumulation of 24 hours worth of outage in 87,600 hours of operation. You can see that it isn’t possible to set an exact C.I. but it isn’t much less than one part in four thousand which is to say something like 99.975% availability. The actual performance falls a bit short of this goal, but not much. The risks associated with power outage grow with length of the outage, and the conditions under which it occurs.

The even more difficult statistic to explain is the recurrence interval estimate — i.e. the one in one-hundred year event.

ferdberple
January 25, 2025 9:13 am

Experts are reasonably good at predicting the past but fail horribly when predicting the future.

Statistics is nothing more than simple curve fitting. It can tell you a 7 is more likely than a 4 when throwing dice. But it cannot tell you what will actually happen.

Reply to  ferdberple
January 26, 2025 5:55 am

It’s why they are STATISTICAL DESCRIPTORS and are *not* actual measurements of anything physical. They “describe” what you know but can’t tell you what you don’t know (i.e. the future). Too many conflate statistical descriptors with probabilities. The phrase “the past average temperature is x” can’t tell you if the temperature tomorrow will be “x”. It can only tell you what the average temperature has been in the past. It’s why people bet on the long shot at the track; sometimes the long shot wins!

Sparta Nova 4
Reply to  ferdberple
January 28, 2025 6:59 am

Hindcasting is curve-fitting the models to the recorded data (pure or adulterated).

ferdberple
January 25, 2025 9:50 am

Death Valley and Miami: Same Climate?

Death Valley, California, and Miami, Florida, share the same average annual temperature—25°C (77°F). Yet, their climates couldn’t feel more different.

By defining climate as the 30-year average temperature, climate science lumps these extremes together. This simplification ignores variability, rendering natural climate change meaningless under such a narrow metric.

Reply to  ferdberple
January 26, 2025 5:58 am

100%. Ask yourself why climate science stubbornly clings to temperature as a metric for climate. Climate science has had access to the data needed to use enthalpy (a far better metric for climate) for over 40 years. But they for some reason keep on pushing a metric developed 500 years ago. Why?

Erik Magnuson
January 25, 2025 10:11 am

One pitfall with statistics is dealing with systematic error, which could be caused by improper sampling (e.g. Iowa poll), problems with instrument calibration or other failures to get a truly random sample.

Then there is the news media only reporting the extreme best or worst case result depending on what best fits their narrative.

I don’t think peer review ever resulted in all science papers being trustworthy – the real test for the value of a science paper is the test of time.

ferdberple
January 25, 2025 10:20 am

The Difference Between Variability of Average and Average of Variability

The variance of the average is smaller than the average of the variance because averaging smooths out fluctuations. When you average data, short-term changes are hidden, reducing the overall variability. In contrast, the average of variances shows the full spread of individual data points without smoothing, giving a more accurate measure of variation.

This is crucial in climate science. Using daily and seasonal averages hides the significant fluctuations that occur within those periods, making the climate seem more stable than it actually is. By smoothing out extremes, averages mislead us into underestimating natural variability, obscuring the true dynamics of our climate. This has led to an incorrect belief that Natural Variability is low.

Thus the use of daily and seasonal averages in the form of anomalies has resulted in an underestimate of natural variability. This has led to an overestimate of the role of CO2 in climate change.
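
A quick simulation with synthetic “hourly temperature” data illustrates the purely mathematical part of this point:

```python
# Variance of averages vs. average of variances, with synthetic data.
import random
from statistics import mean, pvariance

random.seed(0)
days = 365
# 24 hourly "temperatures" per day with a large diurnal swing (synthetic).
daily = [[15 + 10 * random.uniform(-1, 1) for _ in range(24)] for _ in range(days)]

daily_means = [mean(day) for day in daily]
var_of_means = pvariance(daily_means)                  # variance of the daily averages
mean_of_vars = mean(pvariance(day) for day in daily)   # average within-day variance

print(f"Variance of daily means:      {var_of_means:6.2f}")
print(f"Mean of within-day variances: {mean_of_vars:6.2f}")

# Averaging first suppresses the within-day swing, so the variance of the
# averages is far smaller than the average of the variances.
```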

BillR
Reply to  ferdberple
January 25, 2025 5:13 pm

In agreement with your comment: often not appreciated is that the average of a non-linear function and the non-linear function of an average are entirely different things.

There exists in flow measurement something called “square root error”. Flow measurements yielding differential pressure follow a square law due to Bernoulli’s principle. The pressure differentials are often small, riding on top of large static system pressure, and can be noisy, so are ripe for low pass filtering to improve readability. But filtering a delta-p signal prior to taking the square root can result in significant measurement error.
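
A sketch of that square-root effect with synthetic numbers (not real flow data), assuming flow is proportional to the square root of the differential pressure:

```python
# "Square-root error": averaging (low-pass filtering) a noisy differential-
# pressure signal before taking the square root biases the flow estimate.
# Synthetic numbers only.
import random
from math import sqrt
from statistics import mean

random.seed(0)
true_dp = 4.0                                   # true differential pressure
samples = [true_dp + random.gauss(0, 2.0) for _ in range(100_000)]
samples = [max(s, 0.0) for s in samples]        # clip negative readings

flow_from_avg_dp = sqrt(mean(samples))          # filter first, then take the square root
avg_of_flows = mean(sqrt(s) for s in samples)   # take the square root of each sample, then average

print(f"sqrt(mean dp):    {flow_from_avg_dp:.3f}")
print(f"mean of sqrt(dp): {avg_of_flows:.3f}")
# The two disagree because sqrt() is non-linear: the order of averaging
# and transforming matters.
```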

Reply to  ferdberple
January 26, 2025 6:02 am

Even worse is combining variables with different variances with no weighting to develop an apples-to-apples data set. E.g. combining southern hemisphere temps with northern hemisphere temps when cold temps have a different variance than warm temps.

And then climate science tries to hide this by using the “but we are using anomalies” meme (a corollary of the “all measurement uncertainty is random, Gaussian, and cancels” meme) when the truth is that the anomalies inherit the variances of the absolute temps!

January 25, 2025 12:48 pm

I’ve seen some charts and graphs that include error bars, usually in gray.
Maybe some Cli-Sci charts and graphs don’t include error bars because, if they did, the whole thing would be gray? 😎

Bob
January 25, 2025 1:33 pm

This is powerful stuff but not easy to wrap your head around.

steve_showmethedata
January 25, 2025 5:01 pm

This is the subtitle of a recent peer-reviewed journal paper of mine: ‘Are Severely Under-Powered Studies Worth the Effort?’ (DOI: 10.9734/CJAST/2022/v41i333946). The paper concentrated on one study but also provided some novel stats methods for BACI (before-after-control-impact) designs and a count response variable.

For that study, and unfortunately for many studies across most science disciplines, the answer is NO. That is part of the cause of the ‘reproducibility crisis’ in science.

Loren Wilson
January 25, 2025 5:28 pm

We commonly call the uncertainty the margin of error, but it really is the range of probable uncertainty. Since we don’t know the true answer, there is no error, only uncertainty in the point estimate.

sherro01
January 25, 2025 6:06 pm

Statisticians have devised elegant ways to deal with errors in groups of numbers, but less progress has been made with quantifying the uncertainty of the ideas behind the numbers.

Donald Rumsfeld, US politician, said on February 12, 2002 –

“Reports that say that something hasn’t happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns—the ones we don’t know we don’t know. And if one looks throughout the history of our country and other free countries, it is the latter category that tends to be the difficult ones.”

People point to statistical correlations between atmospheric CO2 and global temperatures since 1950 and claim that they are numerically related with small errors. What has not yet been done is a scientific experiment that supports the claim, the idea. You can calculate error terms of the numbers used, but there is still uncertainty as to whether there is causation of one on the other. We have not yet dealt adequately with all of the unknowns. In this case, we do not even know the sign.

People continue to confuse error and uncertainty. They are different animals.
Geoff S

Reply to  sherro01
January 26, 2025 6:08 am

If the measurement data has uncertainty, e.g. stated value +/- uncertainty, then that uncertainty has to be carried over into the sample data. The “standard error of the mean” is basically the standard deviation of the sample means. Climate science calculates the SEM using only the stated values and ignores the uncertainty. But the sample means should be given as “stated value +/- uncertainty” when combined into a data set used for determining the standard deviation of the sample means. In other words the SEM has its own uncertainty that climate science never bothers with.

It’s all part of climate science’s meme of “all measurement uncertainty is random, Gaussian, and cancels”. Yet the assumption of randomness and distribution shape is never justified. It’s just to be taken on faith!

January 26, 2025 6:13 am

“What it means, instead, is that prior knowledge of the causal status of one’s regressors is a prerequisite for endowing regression coefficients with a causal interpretation, as acknowledged by Pearl 1998.”

Climate science needs to take this to heart! Even creating a regression analysis of CO2 vs temperature has a built-in assumption that CO2 is somehow causally related to temperature. That assumption needs to be justified FIRST, before doing the regression analysis, not using the regression analysis to justify causation.
