Guest Essay by Kip Hansen
This essay is second in a series of essays about Averages — their use and misuse. My interest is in the logical and scientific errors, the informational errors, that can result from what I have playfully coined “The Laws of Averages”.
Averages
Both the word and the concept “average” are subject to a great deal of confusion and misunderstanding in the general public, and both have seen an overwhelming amount of “loose usage” even in scientific circles, not excluding peer-reviewed journal articles and scientific press releases. For that reason, I gave a refresher on Averages in Part 1 of this series. If your maths or science background is near the great American average, I suggest you take a quick look at the primer in Part 1 before reading here.
A Beam of Darkness Into the Light
The purpose of presenting different views of any data set — any collection of information or measurements about a thing, a class of things, or a physical phenomenon — is to allow us to see that information from different intellectual and scientific angles — to give us better insight into the subject of our studies, hopefully leading to a better understanding.
Modern statistical [software] packages allow even high school students to perform sophisticated statistical tests of data sets and to manipulate and view the data in myriad ways. In a broad general sense, the availability of these software packages now allows students and researchers to make [often unfounded] claims for their data by using statistical methods to arrive at numerical results, all without understanding either the methods or the true significance or meaning of the results. I learned this by judging High School Science Fairs and later reading the claims made in many peer-reviewed journals. One currently hot controversy is the prevalence of using “P-values” to prove that trivial results are somehow significant, because “that’s what P-values less than 0.05 do”. At the High School Science Fair, students were including ANOVA test results about their data; none of them could explain what ANOVA was or how it applied to their experiments.
Modern graphics tools allow all sorts of graphical methods of displaying numbers and their relationships. The US Census Bureau has a whole section of visualizations and visualization tools. An online commercial service, Plotly, can create a very impressive array of visualizations of your data in seconds. They have a level of free service that has been more than adequate for almost all of my uses [and a truly incredible collection of possibilities for businesses and professionals at a rate of about a dollar a day]. RAWGraphs has a similar free service.
The complex computer programs used to create metrics like Global Average Land and Sea Temperature or Global Average Sea Level are believed by their creators and promoters to actually produce a single-number answer, an average, accurate to hundredths or thousandths of a degree or fractional millimeters. Or, if not actual quantitatively accurate values, at least accurate anomalies or valid trends are claimed. Opinions vary wildly on the value, validity, accuracy and precision of these global averages.
Averages are just one of a vast array of different ways to look at the values in a data set. As I have shown in the primer on averages, there are three primary types of averages — Mean, Median, and Mode — as well as a number of more exotic types.
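For those who want to see this concretely, here is a minimal sketch in Python (the incomes are invented for illustration) showing how the three primary averages can tell three different stories about the very same data set:

```python
from statistics import mean, median, mode

# Nine made-up household incomes, in thousands of dollars.
# One very high earner drags the mean upward.
incomes = [22, 25, 25, 30, 35, 40, 45, 60, 400]

print(mean(incomes))    # 75.77... -- pulled far above most households by the outlier
print(median(incomes))  # 35 -- the middle value when sorted
print(mode(incomes))    # 25 -- the most frequently occurring value
```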
In Part 1 of this series, I explained the pitfalls of averaging heterogeneous, incommensurable objects or data about objects. Such attempts end up with Fruit Salad, an average of Apples-and-Oranges: illogical or unscientific results, with meanings that are elusive, imaginary, or so narrow as not to be very useful. Such averages are often imbued by their creators with a significance, a meaning, that they do not have.
As the purpose of looking at data in different ways, such as looking at a Mean, a Median, or a Mode of a numerical data set, is to lead to a better understanding, it is important to understand what actually happens when numerical results are averaged, in what ways averages lead to improved understanding, and in what ways they lead to reduced understanding.
A Simple Example:
Let’s consider the heights of the boys in Mrs. Larsen’s hypothetical 6th Grade class at an all-boys school. We want to know their heights in order to place a horizontal chin-up bar between two strong upright beams for them to exercise on (or as mild constructive punishment: “Jonny, ten chin-ups, if you please!”). The boys should be able to reach it easily by jumping up a bit, so that when hanging by their hands their feet don’t touch the floor.
The Nurse’s Office supplies the heights of the boys, which are averaged to get the arithmetical mean of 65 inches.
Using the generally accepted body part ratios we do quick math to approximate the needed bar height in inches:
Height/2.3 = Arm length (shoulder to fingertips)
65/2.3 = 28 (approximate arm length)
65 + 28 = 93 inches = 7.75 feet or 236 cm
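For the tinkerers, the same arithmetic as a few lines of Python (the 2.3 divisor is just the rule-of-thumb ratio used above):

```python
mean_height = 65                       # inches, the class mean from the Nurse's Office
arm_length = round(mean_height / 2.3)  # ~28 inches, shoulder-to-fingertip rule of thumb
bar_height = mean_height + arm_length  # 93 inches

print(bar_height, bar_height / 12, round(bar_height * 2.54))  # 93, 7.75 (feet), 236 (cm)
```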
Our calculated bar height fits nicely in a classroom with 8.5 foot ceilings, so we are good. Or are we? Do we have enough information from our calculation of the Mean Height?
Let’s check by looking at a bar graph of all the heights of all the boys:

This visualization, like our calculated average, gives us another way to look at the information, the data on the heights of the boys in the class. Because the boys range from just five feet tall (60 inches) all the way to almost six feet (71 inches), we will not be able to make one bar height that is ideal for all. However, we see now that 82% of the boys are within 3 inches either way of the Mean Height, so our calculated bar height will do fine for them. The 3 shortest boys may need a little step to stand on to reach the bar, and the 5 tallest boys may have to bend their legs a bit to do chin-ups. So we are good to go.
But when they tried the same approach in Mr. Jones’ class, they had a problem.
There are 66 boys in this class and their Average Height (mean) is also 65 inches, but the heights had a different distribution:

Mr. Jones’ class has a different ethnic mix, which results in an uneven distribution, much less centered around the mean. Using the same Mean +/- 3 inches (light blue) used in our previous example, we capture only 60% of the boys instead of 82%. In Mr. Jones’ class, 26 of the 66 boys would not find the horizontal bar set at 93 inches convenient. For this class, the solution was a variable-height bar with two settings: one for the boys 60-65 inches tall (32 boys), one for the boys 66-72 inches tall (34 boys).
For Mr. Jones’ class, the average height, the Mean Height, did not serve to illuminate the information about the boys’ heights well enough to give us a better understanding. We needed a closer look at the information to see our way through to the better solution. The variable-height bar works well for Mrs. Larsen’s class as well, with the lower setting good for 25 boys and the higher setting good for 21 boys.
Combining the data from both classes gives us this chart:

This little example is meant to illustrate that while averages, like our Mean Height, serve well in some circumstances, they do not do so in others.
In Mr. Jones’ class, the larger number of shorter boys was obscured, hidden, covered-up, averaged-out by relying on the Mean Height to inform us of the best solutions for the horizontal chin-up bar.
It is worth noting that Mrs. Larsen’s class, shown in the first bar chart above, has a distribution of heights that more closely mirrors what is called a Normal Distribution, a graph of which looks like this:

Most of the values create a hump in the middle, falling off more or less evenly in both directions. Averages are good estimates for data sets that look like this, if one is careful to use a range on either side of the Mean. Means are not so good for data sets like Mr. Jones’ class, or for the combination of the two classes. Note that the Arithmetical Mean is exactly the same for all three data sets of boys’ heights, the two classes and the combined set, but the distributions are quite different and lead to different conclusions.
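The point is easy to demonstrate. Below is a small Python sketch using invented height lists as stand-ins for the two classes (the actual class data exists only in the charts above): both lists have a mean of exactly 65 inches, yet the Mean +/- 3 inches band captures very different shares of the boys:

```python
from statistics import mean

# Invented stand-ins for the two classes; both lists average exactly 65 inches.
larsen = [62, 63, 64, 64, 65, 65, 65, 66, 66, 67, 68]     # humped in the middle
jones = [60, 60, 61, 61, 62, 62, 68, 68, 69, 69, 70, 70]  # two clusters, few near 65

for name, heights in [("Larsen", larsen), ("Jones", jones)]:
    m = mean(heights)
    share = sum(abs(h - m) <= 3 for h in heights) / len(heights)
    print(name, m, f"{share:.0%} within 3 inches of the mean")
```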
US Median Household Income
A very common measure of economic well-being in the United States is the US Census Bureau’s annual US Median Household Income.
First note that it is given as a MEDIAN, which means that there should be as many families above this income level as below it. Here is the chart that the political party currently in power, regardless of whether it is the Democrats or the Republicans, with the Oval Office (US President) and both houses of Congress in their pocket, will trot out:

That’s the Good News! graph. Median Family Income on a nice steady rise through the years, we’re all singing along with the Fab Four “I’ve got to admit it’s getting better, A little better all the time…”
This next graph is the Not So Good News graph:

The time axis is shortened to 1985 to 2015, but we see that families have not been gaining much, if at all, in Real Dollars, adjusted for inflation, since about 1998.
And then there is the Reality graph:

Despite the Good News! appeal of the first graph, and the so-so news of the second, if we dig below the surface, looking at more than just the single-number Median Household Income by year, we see a different story, one obscured by both the Good News and the Not So Good News. This graph shows MEAN Household Income for each of the five income quintiles, plus the Top 5%, so the numbers are a bit different, and they tell a different story.
Breaking the population into five parts (quintiles) gives the five brightly colored lines. The bottom-earning 60% of families, the green, brown, and red lines, have made virtually no improvement in real dollars since 1967. The second-highest quintile, the middle/upper-middle classes in purple, has seen a moderate increase. Only the top 20% of families (blue line) have made solid, steady improvement, and when we break out the Top 5%, the dashed black line, we see that not only do they earn the lion’s share of the dollars, but they have benefited from the lion’s share of the percentage gains.
Where are the benefits felt?

Above is what the national average, the US Median Household Income metric, tells us. Looking a bit closer we see:

Besides some surprises, like Minnesota and North Dakota, it is what we might suspect. The NE US: NY, Massachusetts, Connecticut, NJ, Maryland, Virginia, Delaware, all coming in at the highest levels, along with California and Washington. Utah has always had the more affluent Latter-Day Saints and, along with Wyoming and Colorado, has become a retirement destination for the wealthy. The states whose abbreviations are circled have state averages very near the national median.
Let’s zoom in:

The darker green counties have the highest Median Household Incomes. It is easy to see San Francisco/Silicon Valley in the west and the Washington DC-to-NYC- to-Boston megapolis in the east.
This map answered my big question: How does North Dakota have such a high Median Income? Answer: It is one area, circled and marked “?”, centered on Williams County, with Williston as the main city. The area has fewer than 10,000 families. And “Williston sits atop the Bakken formation, which by the end of 2012 was predicted to be producing more oil than any other site in the United States”; it is the site of America’s latest oil boom.
Where is the big money? Mostly in the big cities:

And where is it not? All those light yellow counties are areas in which many to most of the families live at or below the federal Poverty Line for families of four.

An overlay of US Indian reservations reveals that they are, in the west particularly, in the lowest and second lowest income brackets. (An interest of mine, as my father and his 10 brothers and sisters were born on the Pine Ridge in southwestern South Dakota, the red oval.) One finds much of the old south in the lowest bracket (light yellow), and the deserts of New Mexico and West Texas and the hills of West Virginia and Kentucky.
One more graphic:

What does this tell us?
It tells us that looking at the National Median Household Income as a single number, especially in dollars unadjusted for inflation, presents a picture that obscures, hides, whitewashes over the inequalities and disparities that are the important facts of this metric. The single National Median Household Income number tells us only that one very narrow bit of information; it does not tell us how American families are doing income-wise. It does not inform us of the economic well-being of American families; rather, it hides the true state of affairs.
Thus, I say that the publicly offered Average Household Income, rather than shedding light on the economic well-being of American families, effectively shines a Beam of Darkness that hides the really significant data about the income of America’s households. If we allow ourselves to be blinded by the Beam of Darkness that this sort of truth-hiding average represents, then we are failing in our duty as critical thinkers.
Does this all mean that averages are bad?
No, of course not. They are just one way of looking at a batch of numerical data. They are not, however, always the best way. In fact, unless the data one is considering is very nearly normally distributed and changes are caused by known and understood mechanisms, averages of all kinds more often lead us astray and obscure the data we should really be looking at. Averages are a lazy man’s shortcut and seldom lead to a better understanding.
The major logical and cognitive fault is allowing one’s understanding to be swayed, one’s mind to be made up, by looking at just this one very narrow view of the data — one absolutely must recognize that the view offered by any type of average is hiding and obscuring all the other information available, and may not be truly representative of the overall, big picture.
Many better methods of data analysis exist, like the simple bar charts used in the schoolboys’ example above. For simple numerical data sets, charts and graphs, if used to reveal (instead of hide) information, are often appropriate.
Like averages, visualizations of data sets can be used for good or ill — the propaganda uses of data visualizations, which now include PowerPoints and videos, are legion.
Beware of those wielding averages like clubs or truncheons to form public opinion.
And climate?
The very definition of climate is that it is an average — “the weather conditions prevailing in an area in general or over a long period.” There is no single “climate metric” — no single metric that tells us what “climate” is doing.
By this definition, pulled at random from the internet via Google, there is no Earth Climate; climate is always “the weather conditions prevailing in an area in general or over a long period of time”. The Earth is not a climatic area or climate region; the Earth has climate regions but is not one itself.
As discussed in Part 1 — the objects in sets to be averaged must be homogeneous and not so heterogeneous as to be incommensurable. Thus, when discussing the climate of a four-season region, generalities are made about the seasons to represent the climatic conditions in that region during the summer, winter, spring and fall, separately. A single average daytime temperature is not a useful piece of information to summertime tourists if the average is taken for the whole year including the winter days — such an average temperature is foolishness from a pragmatic point of view.
Is it also foolishness from a Climate Science point of view? This topic will be covered in Part 3 of this series. I’ll read your comments below — let me know what you think.
Bottom Line:
It is not enough to correctly mathematically calculate the average of a data set.
It is not enough to be able to defend the methods your Team uses to calculate the [more-often-abused-than-not] Global Averages of data sets.
Even if these averages are of homogeneous data and objects, and are physically and logically correct, they return a single number that can incorrectly be assumed to be a summary or fair representation of the whole set.
Averages, in any and all cases, by their very nature, give only a very narrow view of the information in a data set — and if accepted as representational of the whole, will act as a Beam of Darkness, hiding and obscuring the bulk of the information; thus, instead of leading us to a better understanding, they can act to reduce our understanding of the subject under study.
Averages are good tools but, like hammers or saws, must be used correctly to produce beneficial and useful results. The misuse of averages reduces rather than betters understanding.
# # # # #
Author’s Comment Policy:
I am always eager to read your ideas and opinions, and to answer your questions about the subject of the essay, which in this case is Averages, their uses and misuses.
As regular visitors know, I do not respond to Climate Warrior comments from either side of the Great Climate Divide — feel free to leave your mandatory talking points but do not expect a response from me.
I am not an economist, nor a national policy buff, nor interested in US Two-Party-Politics squabbles. Please keep your comments to the question of the uses of averages rather than the details of the topics used as examples. I actually have had experience building exercise equipment for a Youth Camp.
I am interested in examples of the misuse of averages, the proper use of averages, and I expect that many of you will have varying opinions regarding the use of averages in Climate Science.
# # # # #
Kip,
I have one small quibble with an otherwise excellent article. You said, “…believed by their creators and promoters to actually produce a single-number answer, an average, ACCURATE to hundredths or thousandths of a degree or fractional millimeters.” I believe that “precise” would be a better choice of words than “accurate.” The issue of the accuracy of the average is another question entirely, more related to the sampling and interpolation protocols.
Clyde Spencer ==> I believe the claims are both to accuracy, with unsupportable tiny error bars, and to precision, to physically-impossible precision — example: Global Mean Sea Level change precise to tenths of a millimeter.
“… to physically-impossible precision — example: Global Mean Sea Level change precise to tenths of a millimeter”
IIRC, there’s a convention problem with things like this. Let’s suppose that sea level rise at a given set of tidal gauges or satellite is accurate to 0.6mm. Ignoring the fact that one is measuring Apparent Sea Level change whereas the other reports Eustatic Sea Level change, pragmatically one can either report to the nearest mm and lose accuracy or report to the nearest 0.1mm and imply more accuracy than exists. Rigorously one would have to specify error limits (one sigma? 3 sigma? something else?). That’s messy and you’d just end up defending your error estimates.
I think there’s more to the analysis than that, but I don’t recall the details. Anyway, the convention is to report enough precision not to lose accuracy. Seems to me probably the least bad resolution?
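A trivial sketch of that trade-off in Python, using a made-up “true” value that we only know here because we invented it:

```python
true_rise = 2.47  # mm -- a made-up "true" value, known only because we invented it

print(round(true_rise))     # 2 -- nearest mm: honest precision, but we lose ~0.5 mm
print(round(true_rise, 1))  # 2.5 -- nearest 0.1 mm: implies accuracy we may not have
```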
Don ==> I am thinking more specifically of the satellite-derived MSL (mean sea level). If one checks the expected error bars on the satellites, each newer version improves; the latest one was expected to be able to read sea level to within +/- 2 or 3 mm. I have these figures somewhere, gathered in preparation for an essay on the topic, but that’s about right. So from an instrument that takes multiple readings at multiple locations at multiple times, to a bar measured no closer than 2 or 3 mm, they derive, somehow, a Global Mean to tenths of a mm?
Kip: “the latest one was expected to be able to read sea level to within +/- 2 or 3 mm” A couple of cm, not mm, I believe. There’s apparently that much uncertainty in all satellite orbits; I think due to the fact that drag is neither constant nor entirely predictable. (Don’t push me [too] hard on in-track vs transverse error; I don’t know the answers.) Add to that other uncertainties. E.g., if you’re trying to measure radar returns to mm accuracy, you probably have to worry about every detail of the signal path, like where the antenna is with respect to the satellite center of mass. Then there’s variable ionospheric delay to worry about. And waves on the ocean surface. And air pressure. And … You get the idea.
OTOH, they are able to make a lot of measurements, 20 per second if memory serves. They are using the same instrumentation for all measurements. And, conceptually at least, satellite passes repeat over “exactly” the same spot every few hundred passes (254 if the internet isn’t lying to me), which means you can get millions of was/is-now pairings at the same “spot” at 10, 20, 30 … day intervals that may or may not be free of instrument drift (or drifting at known rates, which is probably just as good if you can compute corrections).
You’re proposing to analyze the errors in the values reported from that setup?
Seems challenging.
Don K,
There is an old philosophical observation that one can never step into the same stream twice. Similarly, a satellite can never pass over the same spot twice because of waves and tides.
“There is an old philosophical observation that one can never step into the same stream twice” Clyde: Ignoring the poetry/philosophy, there’s a good pragmatic reason to worry about being very close to exactly the “same” latitude and longitude when comparing observations. It turns out the sea surface is rather far from being “level”. If I haven’t misplaced a decimal point, the worst case is in the Eastern Indian Ocean, where the slope of the sea surface might approach one part in 10,000. That is to say, if you’re worried about mm accuracy, you need to be within 10 m of the same place.
Tides? Oh yeah. Tides. Thanks for bringing that up. They need to be corrected for. We know tides with reasonable precision, I think. Surely within a few cm? But we’re worried about mm or microns (0.1 mm = 100 microns, right?). 100 microns = 0.01 cm.
Kip. It occurs to me that I’ve failed to make the point I’m concerned about explicitly. What I’m trying to say is that the physical systems involved in measuring sea level change at the sub-mm level from satellites are extraordinarily complex even by “climate science” standards. While trying to work out the error budget would be a great intellectual exercise, I’m not sure that it is doable. Certainly I can’t do it. The number of people who can may well be zero.
Comparable to those problems, with even greater uncertainties and inaccuracies, is trying to “measure” the mass differences caused by assumed ice changes over Antarctica and Greenland. The same satellite pair chasing each other around the globe overhead is assumed to be properly affected by the different masses below: thousands of meters of not-constant-density ice, varying rock densities under that, and the two unequal heights and depths of a never-measured, invisible rock base miles beneath the ice!
But what the GRACE satellites presume is that they can determine the CHANGES in ice mass (changes in ice depth) from year to year by assuming the rock depths below are moving vertically in the way the scientists assume the rock moves.
Kip, Dr. Lubchenco, former NOAA administrator, once assured me that the TOPEX satellite provided “instantaneous” data concerning sea level changes.
Nevertheless, according to Impact of Altimeter Data Processing on Sea Level Studies, the satellite has suffered instrumentation drift that requires correction using surface tidal stations as reference. Also, in the time it takes to scan an area of the ocean, waters have moved from place to place, leaving the measurements subject to the same tidal variations as the terrestrial measurements.
“The high temperature today was 104. That is 23 degrees warmer than the normal of 81.” You can hear that on TV any day. First of all, there is no such thing as a Normal temperature. Who can say what is normal? Nobody. The number they are calling normal is in fact an average of the high temperatures on that date for the 30-year period ending at the beginning of the current decade, as calculated and published by the National Weather Service. It is a short-term average, which is somewhat meaningless from a climatic point of view. Such is life.
While we’re on this subject, can anyone identify what useful information is conveyed by averaging the results of different computer models? Let’s say I know almost nothing about hockey (a completely true assumption) but that I’m given $100 to bet on an upcoming hockey game. Knowing nothing about hockey, I look online and see that “X” sports analyst is predicting team 1 to win by 2 goals, “Y” sports analyst is predicting team 1 to win by 1 goal, and “Z” sports analyst is predicting Team 2 to win by 1 goal.
The average of the three predictions is team 1 by 2/3 of a goal, and I see that the spread is 1 goal, so I bet on team 2 to beat the spread. But, as far as I can deduce, the information conveyed by the average relates to the predictions of the game, not the actual results of the game, which is a unique event not amenable to a mathematically probabilistic analysis. What I’m really measuring is the expected value of a next sequential prediction by another analyst. I’m not measuring anything related to the actual event, i.e. the singular game to be played in the future.
That I resort to taking such an average highlights not my expertise in hockey (never that), nor even my mathematical aptitude, since I’ve come up with completely useless information. What it illustrates is my ignorance; because I lack the ability to look at the teams and their past performance and make my own judgment, I have to resort to a simple average of predictions by people I presume to be experts.
So given all that, what am I to conclude about the “expertise” of the IPCC when, faced with widely differing scenarios presented by different models developed by different groups of people with different assumptions about the inner workings of the climate, the IPCC can’t just pick the best and instead just averages the results?
Kurt ==> Averaging the chaotic outputs of climate models was discussed in my essay Chaos and Models.
I had read that a while back, and I think we’re talking about different things. If I have a single model of a nonlinear dynamic system, and its output produces chaotic results that vary significantly from one set of initial conditions to another, it might be argued that the average results of the runs of that particular model represent some likely outcome on the assumption that all possible initial conditions are equally probable (although I’m instinctively skeptical of even that).
But once you start dealing with different models, with different assumptions used in each model (or tuned to different observations) then I can’t see any analytical reason to average the results of those disparate models together, beyond “Hey, look at the pretty graph that we’ve simplified for all you silly rubes.”
Kurt ==> “…although I’m instinctively skeptical of even that” In that, you would be very justified. Averaging chaotic results is a fool’s errand; it does not produce anything approaching a valid prediction. Any “ensemble” of model runs only reveals the boundary conditions (assuming enough runs) for that model, not for the physical system being modeled.
Given that, averaging the results of different model ensembles no more produces valid predictions than averaging the results from 100 gypsy fortune tellers.
Kurt
I’m smart enough to think I understand your analysis, but not smart enough to know if it is correct. However, it does seem to me that it would apply equally well to the stock market, to most economic analyses, and to a wide variety of other things. Are you trying to destroy civilization as we know it?
I’m not that grandiose. I have a hard enough time trying to destroy the weeds that keep infesting my lawn, so I just try to smite the little stuff.
Not sure what you mean by the “stock market” or “economic analyses.” The DJIA is an average share price of a specified basket of stocks, presumably indicative of the market as a whole, so the question is whether the selection of what’s in the basket is a sufficiently representative sample of everything, like polling people and trying to get a demographically representative sample.
My issues relate to the narrow question of what you are sampling. If you assume that the DJIA is indeed representative of the entire universe of stocks on the NYSE, then you should be able to take another representative sample of different stocks and come up with the same average number. If you assume that your polling methodology is accurate, a new poll should produce similar results even if the people change. The average relates to the common feature of your samples.
If I sample results from a single model, I can at least get my mind around the idea that the results show the expected behavior of the theoretical climate system that the model represents, which can be compared to reality as a kind of benchmark for whether my model accurately simulates the climate’s behavior. But if I start averaging together the output of different models, each model representing different theories of how the climate system works, that average doesn’t represent any useful metric. I’m sampling from among different, mutually exclusive theories about how something might work.
Let’s assume for example that the “average” behavior of the models happens to line up very well with the way the climate actually behaves in the years subsequent to the model runs, but none of the individual model averages do. If individually, none of the models got it right, then I can’t have confidence in the set of assumptions made by any individual one of the models. If I can’t have confidence in the ability of any individual model to accurately simulate the climate, what is the point of sampling them to begin with?
The polling example may provide the best analogy. Real Clear Politics averages multiple polls, each with different sampling and weighting methodologies, but the polls often have widely disparate results that can’t all be true. Averaging them together shouldn’t give you any better information about the opinions of the electorate at any given point in time. Possibly, if all of them are moving in the same direction over time, you can infer that, regardless of which poll most accurately samples the electorate as a whole, one candidate is gaining steam and the other is not. But the average tells you nothing, and if one poll contradicts the trend of all the others, how do you know that that one poll isn’t the one doing the correct sampling, without exercising independent expertise as to why it should be treated as an outlier?
Kurt — I was thinking of stock market analyses, not the DJIA/S&P/Nikkei per se.
But you are aware that the DJIA is a moving target? Companies are added and subtracted. I think that the only remaining member of the original 1896 DJIA is GE. Most (all?) of the rest are not only no longer in the index, they are mostly defunct, although some still exist after a fashion, e.g. US Rubber ended up as a small part of Michelin.
“But you are aware that the DJIA is a moving target? Companies are added and subtracted.”
I’m not sure that’s necessarily a problem. Two different opinion polls taken in consecutive weeks will sample different people, but that doesn’t mean that each poll is not representative of the entire population at the time it was taken.
Kurt,
It seems to me that with an ensemble, one can logically only have one ‘best’ result (barring duplicates). Thus, averaging the best result with the inferior results will give one something in between the best and worst. Is that useful? Probably not as useful as the best result. By not determining which are the good and poor results, no insight is gained on what contributes to the quality or utility of the different models.
But that’s true with calling a coin toss heads or tails, as well. Only one result is actually going to happen, but that doesn’t mean that the “average” of the group of all possible outcomes isn’t useful.
For simplicity, assume that health insurance only insures against lung cancer. Insurers know that a smoker will either get lung cancer, or not. They assess the probability of a particular insured getting cancer in the policy term (usually pretty low, even if you are a smoker) and multiply it by the expected cost if cancer is contracted, to arrive at their “expected” cost. They add a “premium” on top of that for their profit, and bill the insured. The “expected cost” in this example will never happen. If a person gets cancer, the costs to the insurer will vastly exceed the “expected cost.” If the person does not get cancer, the costs are zero.
If the insurer makes this calculation for a very large number of people, accurately assessing the risks of each person and charging them the appropriate amount for their policy, then the insurer can probably pre-calculate how much profit it will make for the whole group, but for each insured, the proper, “expected” result does not correspond with the actual outcome.
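In code form, with invented placeholder numbers rather than real actuarial figures, the calculation is just:

```python
p_cancer = 0.002          # assumed probability the insured gets cancer during the term
cost_if_cancer = 250_000  # assumed cost to the insurer if cancer is contracted

expected_cost = p_cancer * cost_if_cancer  # 500.0 -- an amount no single policy ever costs
premium = expected_cost * 1.2              # expected cost plus a 20% profit margin

print(expected_cost, premium)  # 500.0 600.0
```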
Now Kip, above, says that averaging chaotic results retains no useful information, and my gut instinct agrees with that, though I think it’s more a question of how you assign a respective probability to each of the sets of initial conditions. I think the modelers treat them all as equally probable, but that has to be an assumption made of convenience (i.e., it’s all they can do) rather than a reasoned assumption.
Kurt ==> Must remember not to confuse random results with truly chaotic results (chaotic in the sense of Chaos Theory).
A fair coin toss has a nearly perfect 50-50 probability ratio, not because of averaging, but because of probability. The results of any actual series of coin tosses are semi-chaotic, as imperceptible initial-condition effects, air currents, landing-surface imperfections, etc., may skew results chaotically; this effect will be very small, but it is there nonetheless.
Not that it matters, but “An interest of mine, as my father and his 10 brothers and sisters where born on the Pine Ridge in southwestern South Dakota, the red oval.” “where” surely should be “were”.
Don ==> Absolutely correct — I’ll fix it — danged auto-correct spellchecker! (and lazy-eyed copy editor).
Note on the Normal Curve/Gaussian distribution. Statisticians love the Normal Distribution because of its nifty mathematical properties. But AFAICS hardly anything in the real world, other than (usually) the Standard Error of the Mean, actually distributes normally. If I Recall Correctly, the poster child for the normal distribution, the Intelligence Quotient curve, actually required a substantial adjustment at one point to make the numbers coming out of testing more closely match the theoretical distribution.
Here is a real world Gaussian example:
https://m.youtube.com/watch?v=1WAhTdWErrU
Hmmmm. You ever actually measured a batch of resistors? Me either. But I wouldn’t assume that the results will actually be Gaussian. May well depend on the manufacturing process — e.g. tested and trimmed before encapsulation vs build then test and sort. See discussion at https://electronics.stackexchange.com/questions/157788/resistors-binning-and-weird-distributions.
Here, so far this June, we have had 12 days colder than normal and 7 days warmer, with the month’s average temp so far 0.2 C warmer than normal.
Two general comments:
1. I think it’s OK and often useful to add apples to oranges. You just need to remember that the units of the result will be “fruit”. Not hard in that case. But sometimes the distinctions are a lot more subtle.
2. Not all useful numbers have a sound physical basis. Example — Sunspot numbers. If I understand them correctly, they are a rather complicated index, not a count. That’s why we never see sunspot numbers less than 10. Nonetheless, they do seem to correlate to solar activity and are said to be useful. What about statistical operations on sunspot numbers. Are those operations well behaved? Meaningful?
Sunspot numbers are an asymptotic proxy for solar activity, i.e., the sunspot number has a lower limit of zero while other measures of solar activity are decidedly not zero. Using stars other than our own, Astrostatistical Analysis in Solar and Stellar Physics concludes, in part, “We find that incorporating multiple proxies reveals important features of the solar cycle that are missed when the model is fit using only the sunspot numbers.”
It all gives me brain-ache, yet still it’s a ‘brain-worm’ I cannot shake off.
There’s something pretty epic in it though; an ‘instinct’ or ‘intuition’ tells me so, and it’s how people interact with these ‘averages’.
What’s got me going is the flood discussion.
Let’s say a river, somewhere, anywhere, has an average flow rate or (more easily measured by people) an average depth.
That average could be decades or centuries long.
Let’s look at the River Mild: it’s been 5 feet deep under the bridge at Boringville for the last 250 years. Fine.
But because we’re doing averages, that means, by definition, that unusual things happen.
So, after some cold wet weather on River Mild’s catchment, the sun comes out, temperatures soar and thunderstorms break out. Perfectly feasible as the T storms are fed by all the recent wet weather.
And suddenly, on a Tuesday, 3 hours after midnight, the River Mild rises by 20 feet and its flow rate goes up 30-fold. Because of the T storms.
Because lots of folks thought it was really a River Mild, they built houses next to it. They used ‘The Average’.
But for 6 or 7 hours it turned into the River Wild.
It cut a vast swathe of muddy devastation through Boringville, destroying homes, gardens, fields, and a great many of the people themselves.
Then for the next 250+ years it returned to being River Mild.
That single one-off event damaged a great number of people, but when Ivory Tower Dwellers, at public expense, come to work out The Average, that (flash) flood completely vanishes.
So the people rebuild their houses and the High Street at Boringville and then what happens?
10 years later another freak flood arrives. It’s a similar thing to playing the National Lottery.
(Not unlike Carlisle, on the River Eden. In Cumbria.) (The flood, not the lottery)
So where are your averages then? What good did they do you in either Carlisle or Boringville? Was there *really* any point in working them out? Are they not just even more faerie counting?
Did they lull people into a false sense of security or, as is equally valid for averages, into a false sense of doom and disaster? As per climate science right now.
How do you/me/anyone connect those numbers with The Real World and Real People?
Maybe we could start by opening the doors and windows of a few Ivory Towers and kick the residents out.
I think I’ve worked it out here and now. There are too many Moshers and the like.
Over to you Donald……..
Nassim Nicholas Taleb writes entertaining and possibly insightful books about extreme events and our inability to think clearly about them — The Black Swan, Fooled by Randomness. Possibly worth reading if you haven’t encountered them.
Peta,
Your story illustrates why skewness and variance need to be provided along with a mean. Also, providing the mode and median is informative. Focusing on the mean alone is either purposely deceptive or an illustration of statistical ignorance.
Kip,
I really don’t see the point of this loquacious dumbing down of already elementary statistics. Especially if inaccurate – the mean is an average, but the mode and median are not. They are all measures of central tendency.
I really can’t see the point of your second example. Of course the average, or any other summary statistic, like median, can’t tell you all about the dataset. That is why the various authorities present all that other information that you quote – all subset averages. I think the Dow is a useful average to think about. It is very widely quoted and used. It won’t tell you whether copper stocks are booming, or taxis have collapsed. Everyone knows that it won’t, but they still find it useful.
Your first example does again illustrate the use of averaging as an estimate of population mean, though you don’t seem to see it that way. It probably isn’t an average being sought, but a bottom quintile, or some other summary statistic, but the principle is the same. What you want is an estimate of the average of the population of boys for whom the bar is intended. The data for Mrs Larsen’s class is the sample you have available, at least initially. And you do need that summary statistic. The boys may vary, but the bar can only have one height.
Sampling is a science, and you need to get it right. As you say, it might happen that her class is not representative. That is not a problem with average calculation; it is a problem of sampling. And there are remedies. Maybe Mrs Larsen’s class had an unusually large number of Hispanics, and maybe they tend to be shorter (I don’t know that that is true). So you re-weight according to what you know about the population. It is that knowledge that you need to design the bar, plus someone who actually knows about statistics.
Nick writes: “Of course the average, or any other summary statistic, like median, can’t tell you all about the dataset.”
Or indeed what a changing average actually means.
“Or indeed what a changing average actually means.”
Yes, you have to figure it out. Like when the Dow drops 2000 pts. It’s not obvious why, but plenty of people are going to be curious. It matters to them.
Sorry, we only model to the market level. We don’t go down to the stock level, and when we try, it’s not a great result. But I’m sure knowing what the market in general is doing will give you enough information to help you choose your portfolio.
Nick Stokes ==> I’m sorry; in the first part of this series I posted images of the common everyday definitions of averages, like this one:
You may argue with it all you like, and decide that “statisticians’ special language” trumps everyday English and that what you learned in Stats classes nullifies what everyone else learns in their K-12 mathematics classes.
I am not a statistician and I do not write for statisticians but for the readers here.
I write essays here with the intention of helping readers gain a better understanding of the topics they see in their everyday lives and exposing some of the abuses of Science that appear in their news outlets.
Read the other readers’ comments here; try to understand that not everyone has a university degree that included deep statistical theory and practice. In fact, as you know all too well, very few people anywhere understand statistics beyond high school levels, including most scientists.
So — elementary as it may be, essays like this help people become smarter — to read smarter — to listen smarter.
I’m not trying to answer this question, just trying to clarify it. My English is not so good, so I beg your pardon in advance.
If we think of “Local Climate” (LC) as a class, could we say that “Global Climate” (GC), being a superclass of LC, is an LC class itself? Let’s define the members of the LC class: temperature, air pressure, etcetera. Can we find such members in the GC class?
If not, then GC is not a Climate class but a name for the average of Climate class instances!
In other words: an engine is a system whose parts are well connected and exchange energy, but can we build an engine of engines?
Mariano ==> I like the engine analogy — not perfect, but I like it. All the parts of an Engine are Engine Parts — but the Engine itself, though made of the same materials, is not itself an Engine Part.
Kip ==> Thanks for your comment. I don’t say that Climate is like an engine (I’m not qualified to say that); what I’m saying is that we must know whether Climate can be seen as a motor or as a society. A society is not only the sum of its individuals but has its own laws that modify the society itself.
I can’t see such a distinction in the Climate debate. As I said, I can’t tackle such a complex field with my bad English, but I would like to read someone saying something about closed or open systems, for example.
Kip,
Analogies don’t have to be perfect to be useful. Indeed, if they were “perfect” they would be equivalent to the original statement and would lose the utility of viewing the problem from a slightly different viewpoint. The optimal analogy is similar enough to the original statement that no one can claim that it is unrelated or a non sequitur, but different enough that it can break down prejudices or biases that are interfering with someone seeing the essence of the problem.
Hmm. Nice one
In fact, this misuse of statistics is part of a wider, almost philosophical problem that lies behind nearly all of the problems the not-so-modern mind seems to have in dealing with the complexity of life as it really is, rather than with the idealised and simplified pictures that are all we seem capable of.
Science itself is just such an idealised and simplified picture. And there will always be a compromise between ‘idealised and simplified to the point of error’ and ‘so complex we can’t compute it anyway’.
Unfortunately my message, that in climate science these two areas overlap massively, is unwelcome to alarmists and skeptics alike.
Everyone wants to seek out the One True Cause of climate change.
The message that they never will, is not desired.
Kip
An excellent post and a good reminder of the limitations of using averages for analysis.
I do have a quibble, but it may add to your points. You have used Household and Family incomes interchangeably. The Census Bureau has very precise definitions of both terms; there are about 125 million Households and about 80 million Families in their reports. The two terms will give you different data, much like mean versus median and nominal versus real. Even the word “income” can mean different things, market versus aggregate.
To reinforce your point, when looking at the data for income over a 50- or 60-year period, there are changes in demographics, etc., that can alter the meaning of any comparisons across decades. For instance, the growth in real income of a family with two earners has been greater than that of single households. Why? Because there is a positive correlation between marriage rates and education, and thus income. Also, there are many more Households with a single individual today than 50 years ago. The proportion of Single Mom families has grown substantially, and the disparity in income between that unit and the two-earner families has grown with it. The real story sometimes is down in the weeds, and each piece has to be dissected to understand what is really going on.
cerescokid ==> I raised this issue some time ago in my essay What Are They Really Counting?
Especially when they are offering averages of huge amounts of data (public surveys are the same problem — what exactly did they ask?).
All in all, the Median Income example serves to show how averaging disparate information often hides what we really need to know.
In our “sound bite era” one has to dig dig dig to get at the real meat of a story.
If you really want to have fun, try having a discussion with an ‘average’ person about percentiles. I once tried (and failed) to convince a young Physicians Assistant that I could not be in the 100th percentile.
Really, isn’t the 100th percentile the value equal to the maximum value in a distribution?
If we were talking about intelligence, he would have to be smarter than himself to be in the 100th percentile.
Good observations, but I wish you could have been as detailed in illustrating the problem with “climate” data as you were able to illustrate with income data.
One simple example I sometimes use to illustrate the uselessness of “average temperature” is to talk about weather patterns in my region (central Arkansas). Weather sources will routinely report that the “average” temperature for a particular day is such and such a number. But in reality I would expect it to be bi-modal, especially in the warmer months. We are in a region where the weather is determined by fronts “passing through” the state. On either side of the front the temperature will vary significantly. So one day the high is, say, 90, and the next day it is 80. “Statistics” will say that the “average” temperature “for this time of year” is 85, when in fact it is almost always higher or lower. “Mode” is a better measure of central tendency, and would show the pattern to be bi-modal. Someone upthread gave the example of Oregon’s “average rainfall.” Again, meaningless, since the regional climate is so different on either side of the Cascades. Climate is at best a regional concept, and even then simple averages can be misleading, because even within a well-defined climate region averages will vary over time as weather patterns (and long-term changes in weather patterns, aka “climate”) are dynamic.
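The bi-modal point is easy to sketch numerically; the daily highs below are invented stand-ins for warm-sector and post-frontal days, not real Arkansas data (`multimode` needs Python 3.8 or later):

```python
from statistics import mean, multimode

# Invented daily highs for a front-dominated stretch: warm-sector days near 90,
# post-frontal days near 80.
highs = [90, 91, 89, 90, 80, 79, 81, 80, 90, 89, 80, 81, 90, 80]

print(mean(highs))       # 85 -- the reported "average"
print(multimode(highs))  # [90, 80] -- the two regimes actually observed
print(85 in highs)       # False -- no day actually had the "average" high
```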
I’m no expert on climate “models” but do they even try to reflect the dynamics of weather/climate change or are they essentially static “models” for discrete periods of time?
blcjr ==> If I use too many climate examples (even any, really) the result is often just a bunch of knee-jerk reactions from the Climate Warriors all spouting Mandatory Climate [Consensus or Skeptic] Talking Points.
Many people get distracted from the main point — in this case how averages can hide and obscure information — and focus only on defending their favored position on climate science.
We have had the recent example of a drought in California coupled with floods in the south of England. Average global rainfall may not have changed a whit, but the climate effects on human comfort were rather severe.
One of the more exotic types is the geometric mean. Here’s an example I picked up reading articles by Isaac Asimov in “Fantasy and Science Fiction” magazine. He was comparing the sizes of humans, blue whales, and mice. Are we bigger or smaller than the average mammal?
Here are their average masses:
10^5 kg blue whale
70 kg human
3*10^-2 kg mouse
An arithmetic mean would give just about 1/3 of 10^5 kg; most mammals are WAY below average in mean mass. A more reasonable comparison is the geometric mean: (10^5 kg * 70 kg * 3*10^-2 kg)^(1/3), which gives 59.44 kg as the geometric mean. Humans are bigger compared to mice than blue whales are compared to humans.
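Checking that in Python (the masses are as given above; `geometric_mean` is in the standard library from Python 3.8):

```python
from statistics import geometric_mean, mean

masses = [1e5, 70, 3e-2]  # kg: blue whale, human, mouse, as given above

print(mean(masses))            # ~33357 kg -- nearly every mammal is "below average"
print(geometric_mean(masses))  # ~59.44 kg -- close to human scale

print(70 / 3e-2)  # ~2333 -- human-to-mouse mass ratio
print(1e5 / 70)   # ~1429 -- whale-to-human mass ratio
```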
Alan McIntire ==> As a teen, I had a collection of early SciFi mags, over 500 of them, all read and re-read.
When looking hard at renewable energy, it occurred to me that in programming a computer, in calculating the cost and benefit of an aircraft design, and indeed when looking at putting windmills on a grid, the income is derived from the average performance of the engineered solution, but the cost is dominated by worst-case provisions.
Perhaps that is a subject worthy of an essay.
(Anyway, its impact on renewable energy is massive: the worst case of renewable energy is that it generates nothing, and the cost of covering for that exceeds all the value in the renewable solution.)
Leo ==> I’d love to read it. For wind, it seems there are a lot of breakdowns that are very expensive to repair, so expensive that in many cases, the windmill is simply left out of service — not to mention the well-publicized catastrophic failures. We see almost nothing about solar failures, though they must happen — wiring close to the ground or on roof-tops, different repair scenarios. Does anyone know when one out of 20 solar panels fails on their roof? Lots of interesting questions.
IIRC (and it is a somewhat vague R!) domestic panels are wired in series, so any one failure is seen as a complete failure. Hence it is so important that no panel is in shadow, e.g. from a chimney or a TV antenna.
Whether larger installations are wired in series/parallel groups, or actively connected so that they appear to be, I don’t know.
Paul Krugman walks into the bar, Cheers …
Per capita bar patron income goes up. Income inequality among bar patrons gets worse. Median net worth is higher. The total amount of taxes paid by bar patrons increases. The likelihood a randomly selected patron is also a contributor to the New York Times opinion columns rises astronomically. The chance that a randomly selected patron voted Republican in the last presidential election decreases, slightly. The average “carbon footprint” among bar patrons rises measurably. The amount of hot air expended in conversations among Cliff, Norm, Frasier, and now Paul contributes to global warming by a comical amount …
One of the things that I think is important in any discussion of averages is what is called the “Flaw of Averages.” Boiled down simply: while an average may describe a population as a whole, all too often no single individual data point will ever fit that average profile, particularly if more than a single descriptor is used.
See this article: https://www.thestar.com/news/insight/2016/01/16/when-us-air-force-discovered-the-flaw-of-averages.html
So, the question is simply this: the average can be determined, but does it really say anything at all if it is used in an effort to apply that average to what is actually occurring?
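Here is a hedged Python sketch of that flaw, with invented measurements: even though every single dimension has a well-behaved mean, almost nobody is near average on all dimensions at once:

```python
import random

random.seed(1)

# Invented data: 1,000 "pilots", each with 5 independent body measurements
# drawn from a normal distribution (mean 100, standard deviation 10).
people = [[random.gauss(100, 10) for _ in range(5)] for _ in range(1000)]
means = [sum(p[d] for p in people) / len(people) for d in range(5)]

def fits_average(person, tol=0.05):
    """True only if EVERY measurement is within +/-5% of that measurement's mean."""
    return all(abs(x - m) <= tol * m for x, m in zip(person, means))

share = sum(fits_average(p) for p in people) / len(people)
print(f"{share:.1%}")  # typically around 1% -- almost nobody fits the "average" profile
```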
Chaotic dynamical systems tend to produce distributions that are the “opposite” of normal distributions, that is, they are heavily weighted at the tails. Averages are least applicable to such systems.
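A hedged illustration of that tail-heaviness, using the logistic map at r = 4 (a textbook chaotic system chosen here just as an example, not one named in the comment):

```python
# Iterate the logistic map x -> 4x(1 - x) and count where the iterates land.
# Its long-run distribution piles up near 0 and 1 rather than in the middle.
x, n = 0.123456, 100_000
tails = middle = 0
for _ in range(n):
    x = 4 * x * (1 - x)
    if x < 0.1 or x > 0.9:
        tails += 1
    elif 0.45 < x < 0.55:
        middle += 1

print(tails / n)   # ~0.41 -- the outer fifth of the interval gets ~41% of the visits
print(middle / n)  # ~0.06 -- the middle tenth gets only ~6%
```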
The other day I was listening to some blather about the Big Bang being preceded by (even though there is no time yet ???) a singularity that was, among other properties, infinitely hot. But temperature is a measure of average kinetic energy. How can a singularity have an average in the form of temperature? Worse, kinetic energy implies velocity, and velocity implies space over time. I thought those had not “unfurled” yet.
Allow me to inject a curious, yet distantly related fact, into a most enlightening discussion:
Jan Kareš from the Czech Republic did 232 pull-ups on the 19th of June in 2010, establishing what is believed to be the world record for this exercise. My max pull ups ever was 15, some multiple decades ago.
Has anyone determined whether Thursdays in Maharishi Vedic City, Iowa have gotten warmer over the past ten or so years? Then we have to ask, “Warmer how?” … right? — “warmer” at what particular time of day?, … are we talking rainy days?, … cloudy days?, … where exactly: under a tree (WHICH tree)?, … ten feet above the ground?, … ten and a HALF feet?, … who measured it?, … using what sort of instrument?, … was this person patient enough to use the measuring instrument competently?, … was he or she sober?, … etc.
… seems to be somewhat elusive.
Good article, as far as it goes. However, there are two economic factors that also need to be considered:
1. Number of people per household has been shrinking. So, income per person is growing faster than income per household.
2. The quintiles are not made up of the same households, from year to year. Families frequently move to different quintiles over time. I personally have been in all five income quintiles at various points in time.
David ==> You realize that the essay is not about incomes?
Nonetheless, that Median FAMILY income at the lowest levels does not improve over such a long time period, in real dollars, means the poor are getting poorer, even if many, like us, escape to higher economic status.
Also an illustration of the nirvana fallacy, aka “making the perfect the enemy of the good”.
3 inches is about 5% of the height of the boys. Maybe a better analogy would have been taking a trend of 1/10th of an inch per year and using it to justify a model that says it should be built 10 inches higher for future generations (even though the change in the past 20 years has been less than predicted).