Guest essay by Clyde Spencer
The point of this article is that one should not ascribe more accuracy and precision to available global temperature data than is warranted, after examination of the limitations of the data set(s). One regularly sees news stories claiming that the recent year/month was the (first, or second, etc.) warmest in recorded history. This claim is reinforced with a stated temperature difference or anomaly that is some hundredths of a degree warmer than some reference, such as the previous year(s). I’d like to draw the reader’s attention to the following quote from Taylor (1982):
“The most important point about our two experts’ measurements is this: like most scientific measurements, they would both have been useless, if they had not included reliable statements of their uncertainties.”
Before going any further, it is important that the reader understand the difference between accuracy and precision. Accuracy is how close a measurement (or series of repeated measurements) is to the actual value, and precision is the resolution with which the measurement can be stated. Another way of looking at it is provided by the following graphic:
The illustration implies that repeatability, or decreased variance, is a part of precision. It is, but more importantly, it is the ability to record, with greater certainty, where a measurement is located on the continuum of a measurement scale. Low accuracy is commonly the result of systematic errors; however, very low precision, which can result from random errors or inappropriate instrumentation, can contribute to individual measurements having low accuracy.
For the sake of the following discussion, I’ll ignore issues with weather station siting problems potentially corrupting representative temperatures and introducing bias. However, see this link for a review of problems. Similarly, I’ll ignore the issue of sampling protocol, which has been a major criticism of historical ocean pH measurements, but is no less of a problem for temperature measurements. Fundamentally, temperatures are spatially-biased to over-represent industrialized, urban areas in the mid-latitudes, yet claims are made for the entire globe.
There are two major issues with regard to the trustworthiness of current and historical temperature data. One is the accuracy of recorded temperatures over the useable temperature range, as described in Table 4.1 at the following link:
Section 4.1.3 at the above link states:
“4.1.3 General Instruments. The WMO suggests ordinary thermometers be able to measure with high certainty in the range of -20°F to 115°F, with maximum error less than 0.4°F…”
In general, modern temperature-measuring devices are required to provide a temperature accurate to about ±1.0° F (0.56° C) at their reference temperature, and not to be in error by more than ±2.0° F (1.1° C) over their operational range. Table 4.2 requires that the resolution (precision) be 0.1° F (0.06° C) with an accuracy of 0.4° F (0.2° C).
The US has one of the best weather monitoring programs in the world. However, the accuracy and precision should be viewed in the context of how global averages and historical temperatures are calculated from records, particularly those with less accuracy and precision. It is extremely difficult to assess the accuracy of historical temperature records; the original instruments are rarely available to check for calibration.
The second issue is the precision with which temperatures are recorded, and the resulting number of significant figures retained when calculations are performed, such as when deriving averages and anomalies. This is the most important part of this critique.
If a temperature is recorded to the nearest tenth (0.1) of a degree, the convention is that it has been rounded or estimated. That is, a temperature reported as 98.6° F could have been as low as 98.55° F or as high as just under 98.65° F.
The general rule of thumb for addition/subtraction is that the sum should retain no more significant figures to the right of the decimal point than the least precise measurement has. When multiplying/dividing numbers, the conservative rule of thumb is that the product should retain, at most, one more significant figure than the multiplicand with the fewest significant figures. The rule usually followed, however, is to retain only as many significant figures as the least precise multiplicand has. [For an expanded explanation of the rules of significant figures and mathematical operations with them, go to this Purdue site.]
Unlike a case with exact integers, a reduction in the number of significant figures in even one of the measurements in a series increases uncertainty in an average. Intuitively, one should anticipate that degrading the precision of one or more measurements in a set should degrade the precision of the result of mathematical operations. As an example, assume that one wants the arithmetic mean of the numbers 50., 40.0, and 30.0, where the trailing zeros are the last significant figure. The sum of the three numbers is 120., with three significant figures. Dividing by the integer 3 (exact) yields 40.0, with an uncertainty in the next position of ±0.05 implied.
Now, what if we take into account the implicit uncertainty of all the measurements? For example, consider that, in the previously examined set, all the measurements have an implied uncertainty. The sum of 50. ±0.5 + 40.0 ±0.05 + 30.0 ±0.05 becomes 120. ±0.6. While not highly probable, it is possible that all of the errors could have the same sign. That means, the average could be as small as 39.80 (119.4/3), or as large as 40.20 (120.6/3). That is, 40.00 ±0.20; this number should be rounded down to 40.0 ±0.2. Comparing these results, with what was obtained previously, it can be seen that there is an increase in the uncertainty. The potential difference between the bounds of the mean value may increase as more data are averaged.
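The worst-case arithmetic above can be checked with a short Python sketch. The function names are illustrative only, and the sketch assumes that each reading carries an implied uncertainty of half a unit in its last recorded decimal place and that all errors share the same sign.

```python
from decimal import Decimal

def implied_half_width(text):
    """Half a unit in the last recorded place: '50.' -> 0.5, '40.0' -> 0.05."""
    decimals = len(text.split(".")[1]) if "." in text else 0
    return Decimal(5) / Decimal(10) ** (decimals + 1)

def worst_case_mean(readings):
    """Return (mean, half_width), assuming every rounding error has the same sign."""
    values = [Decimal(r) for r in readings]
    half_widths = [implied_half_width(r) for r in readings]
    n = len(values)
    return sum(values) / n, sum(half_widths) / n

# Readings are kept as strings so that trailing zeros (the recorded precision) survive.
mean, half_width = worst_case_mean(["50.", "40.0", "30.0"])
print(mean - half_width, mean, mean + half_width)
```

Keeping the readings as text is a deliberate choice here: a floating-point 40.0 and 40 are indistinguishable, so the recorded precision would be lost before the calculation starts.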
It is generally well known, especially amongst surveyors, that the uncertainty of the average of multiple measurements varies inversely with the square root of the number of readings that are taken. Averaging tends to remove the random error in rounding when measuring a fixed value. However, the caveats here are that the measurements have to be taken with the same instrument, on the same fixed parameter, such as an angle turned with a transit. Furthermore, Smirnoff (1961) cautions, “… at a low order of precision no increase in accuracy will result from repeated measurements.” He expands on this with the remark, “…the prerequisite condition for improving the accuracy is that measurements must be of such an order of precision that there will be some variations in recorded values.” The implication here is that there is a limit to how much the precision can be increased. Thus, while the definition of the Standard Error of the Mean is the Standard Deviation of samples divided by the square root of the number of samples, the process cannot be repeated indefinitely to obtain any precision desired! [1]
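Smirnoff’s condition can be illustrated with a small Python sketch. Everything in it is hypothetical: a fixed true value of 20.3, an instrument read to the nearest whole unit, and evenly spaced offsets standing in for random errors so that the result is reproducible.

```python
import statistics

def spread_offsets(n, half_width):
    """n evenly spaced offsets in (-half_width, +half_width); a stand-in for random error."""
    return [((k + 0.5) / n * 2 - 1) * half_width for k in range(n)]

def record(true_value, offsets, resolution):
    """Simulated readings: true value plus offset, rounded to the instrument resolution."""
    return [round((true_value + d) / resolution) * resolution for d in offsets]

def sem(readings):
    """Standard Error of the Mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(readings) / len(readings) ** 0.5

# Errors much smaller than the resolution: every reading is recorded as 20.
quiet = record(20.3, spread_offsets(100, 0.1), 1.0)
# Errors as large as the resolution: recorded values vary between 19, 20 and 21.
noisy = record(20.3, spread_offsets(100, 1.0), 1.0)

print(statistics.mean(quiet), sem(quiet))
print(statistics.mean(noisy), sem(noisy))
```

In the first case the recorded values never vary, so the average stays at 20 however many readings are taken, and the computed standard error of zero says nothing about the 0.3 that was lost to rounding. Only in the second case, where the recorded values vary, does averaging recover the fixed value.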
While multiple observers may eliminate systematic error resulting from observer bias, the other requirements are less forgiving. Different instruments will have different accuracies and may introduce greater imprecision in averaged values.
Similarly, measuring different angles tells one nothing about the accuracy or precision of a particular angle of interest. Thus, measuring multiple temperatures, over a series of hours or days, tells one nothing about the uncertainty in temperature, at a given location, at a particular time, and can do nothing to eliminate rounding errors. A physical object has intrinsic properties such as density or specific heat. However, temperatures are ephemeral and one cannot return and measure the temperature again at some later time. Fundamentally, one only has one chance to determine the precise temperature at a site, at a particular time.
The NOAA Automated Surface Observing System (ASOS) has an unconventional way of handling ambient temperature data. The User’s Guide says the following in section 3.1.2:
“Once each minute the ACU calculates the 5-minute average ambient temperature and dew point temperature from the 1-minute average observations… These 5-minute averages are rounded to the nearest degree Fahrenheit, converted to the nearest 0.1 degree Celsius, and reported once each minute as the 5-minute average ambient and dew point temperatures…”
This automated procedure is performed with temperature sensors specified to have an RMS error of 0.9° F (0.5° C), a maximum error of ±1.8° F (±1.0° C), and a resolution of 0.1° F (0.06° C) in the most likely temperature ranges encountered in the continental USA. [See Table 1 in the User’s Guide.] One (1. ±0.5) degree Fahrenheit is equivalent to 0.6 ±0.3 degrees Celsius. Reporting the rounded Celsius temperature, as specified above in the quote, implies a precision of 0.1° C when only 0.6 ±0.3° C is justified, thus implying a precision 3 to 9 times greater than is justified. In any event, even using modern temperature data that are commonly available, reporting temperature anomalies with two or more significant figures to the right of the decimal point is not warranted!
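A minimal Python sketch of the rounding sequence quoted from the User’s Guide follows. The function names and the 71.4° F input are illustrative, and the handling of exact half-degree ties (Python’s round-half-to-even) is an assumption, since the quoted passage does not specify it.

```python
def f_to_c(temp_f):
    """Convert a Fahrenheit temperature to Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0

def asos_report(avg_f):
    """Round a 5-minute average to the nearest whole deg F, then convert to the nearest 0.1 deg C."""
    whole_f = round(avg_f)  # tie-breaking rule is an assumption, not taken from the guide
    return whole_f, round(f_to_c(whole_f), 1)

def celsius_interval(whole_f):
    """Celsius range of all averages that round to the same whole deg F."""
    return f_to_c(whole_f - 0.5), f_to_c(whole_f + 0.5)

whole_f, reported_c = asos_report(71.4)     # hypothetical 5-minute average
low_c, high_c = celsius_interval(whole_f)
print(whole_f, reported_c)                  # 71 21.7
print(round((high_c - low_c) / 2, 2))       # 0.28
```

The reported 21.7° C conventionally implies ±0.05° C, whereas the whole-degree Fahrenheit rounding alone leaves a half-width of about 0.28° C, before any sensor error is considered.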
Where these issues become particularly important is when temperature data from different sources, which use different instrumentation with varying accuracy and precision, are used to consolidate or aggregate all available global temperatures. Also, it becomes an issue in comparing historical data with modern data, and particularly in computing anomalies. A significant problem with historical data is that, typically, temperatures were only measured to the nearest degree (as with modern ASOS temperatures!). Hence, the historical data have low precision (and unknown accuracy), and the rule given above for subtraction comes into play when calculating what are called temperature anomalies. That is, data are averaged to determine a so-called temperature baseline, typically for a 30-year period. That baseline is subtracted from modern data to define an anomaly. A way around the subtraction issue is to calculate the best historical average available, and then define it as having as many significant figures as modern data. Then, there is no requirement to truncate or round modern data. One can then legitimately say what the modern anomalies are with respect to the defined baseline, although it will not be obvious if the difference is statistically significant. Unfortunately, one is just deluding oneself in thinking that one can say anything about how modern temperature readings compare to historical temperatures when the variations are to the right of the decimal point!
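The effect of the subtraction rule on an anomaly can be sketched in Python. The function name and the temperatures are hypothetical; the sketch simply rounds the difference to the coarser of the two recorded decimal places.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def subtract_with_sig_figs(modern, baseline):
    """Difference of two readings, rounded to the coarser recorded decimal place."""
    a, b = Decimal(modern), Decimal(baseline)
    # The exponent is the power of ten of the last recorded digit (-1 for tenths, 0 for units).
    coarsest = max(a.as_tuple().exponent, b.as_tuple().exponent)
    unit = Decimal((0, (1,), coarsest))  # 1 in the coarsest recorded place
    return (a - b).quantize(unit, rounding=ROUND_HALF_EVEN)

# Baseline known only to the nearest degree: the anomaly is limited to whole degrees.
print(subtract_with_sig_figs("58.7", "58"))    # 1
# Baseline defined to the nearest tenth: the modern precision is retained.
print(subtract_with_sig_figs("58.7", "58.0"))  # 0.7
```

The second call corresponds to defining the baseline as having as many significant figures as the modern data; it changes what may be written down, not what is actually known about the historical temperatures.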
Indicative of the problem is that data published by NASA show the same implied precision (±0.005° C) for the late-1800s as for modern anomaly data. The character of the data table, with entries of 1 to 3 digits with no decimal points, suggests that attention to significant figures received little consideration. Even more egregious is the representation of precision of ±0.0005° C for anomalies in a Wikipedia article wherein NASA is attributed as the source.
Ideally, one should have a continuous record of temperatures throughout a 24-hour period and integrate the area under the temperature/time graph to obtain a true, average daily temperature. However, one rarely has that kind of temperature record, especially for older data. Thus, we have to do the best we can with the data that we have, which is often a diurnal range. Taking a daily high and low temperature, and averaging them separately, gives one insight on how station temperatures change over time. Evidence indicates that the high and low temperatures are not changing in parallel over the last 100 years; until recently, the low temperatures were increasing faster than the highs. That means, even for long-term, well-maintained weather stations, we don’t have a true average of temperatures over time. At best, we have an average of the daily high and low temperatures. Averaging them creates an artifact that loses information.
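The difference between an integrated daily mean and the average of the daily high and low can be sketched in Python. The hourly readings below are synthetic and hypothetical (a flat 50° F night with a 70° F mid-day peak), chosen only to make the day asymmetric.

```python
import math

def trapezoid_mean(times_h, temps):
    """Time-weighted mean: area under the temperature/time curve divided by elapsed time."""
    area = sum((t1 - t0) * (y0 + y1) / 2.0
               for t0, t1, y0, y1 in zip(times_h, times_h[1:], temps, temps[1:]))
    return area / (times_h[-1] - times_h[0])

def midrange(temps):
    """The conventional 'daily mean': halfway between the daily high and low."""
    return (max(temps) + min(temps)) / 2.0

# Hypothetical hourly readings (deg F) from midnight to midnight.
hours = list(range(25))
temps = [50.0 + 20.0 * max(0.0, math.sin(math.pi * (h - 6) / 12.0)) ** 2 for h in hours]

print(round(trapezoid_mean(hours, temps), 2))  # 55.0, the integrated daily mean
print(midrange(temps))                         # 60.0, the average of high and low
```

For this synthetic day the two figures differ by 5° F, even though both are computed from the same readings. The size of the gap depends entirely on the shape of the diurnal curve, which a high/low record does not preserve.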
When one computes an average for purposes of scientific analysis, conventionally, it is presented with a standard deviation, a measure of variability of the individual samples of the average. I have not seen any published standard deviations associated with annual global-temperature averages. However, utilizing Tchebysheff’s Theorem and the Empirical Rule (Mendenhall, 1975), we can come up with a conservative estimate of the standard deviation for global averages. That is, the range in global temperatures should span approximately four standard deviations on either side of the mean (Range ≈ ±4s, or about 8s in total). For Summer desert temperatures reaching about 130° F and Winter Antarctic temperatures reaching -120° F, that gives Earth an annual range in temperature of at least 250° F; thus, an estimated standard deviation of about 31° F! Because deserts and the polar regions are so poorly monitored, it is likely that the range (and thus the standard deviation) is larger than my assumptions. One should intuitively suspect that since few of the global measurements are close to the average, the standard deviation for the average is high! Yet, global annual anomalies are commonly reported with significant figures to the right of the decimal point. Averaging the annual high temperatures separately from the annual lows would considerably reduce the estimated standard deviation, but it still would not justify the precision that is reported commonly. This estimated standard deviation is probably telling us more about the frequency distribution of temperatures than the precision with which the mean is known. It says that probably a little more than 2/3rds of the recorded surface temperatures are between -26. and +36.° F. Because the midpoint of this range is 5° F, and the generally accepted mean global temperature is about 59° F, it suggests that there is a long tail on the distribution, pulling the midpoint of the range to a lower temperature than the mean.
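The arithmetic behind this estimate is short enough to sketch in Python. The function name is illustrative, and the divisor is left as a parameter because it is an assumption of the estimate: a value of 8 (that is, ±4s) reproduces the figure of about 31° F used here.

```python
def range_rule_sd(high, low, sds_spanned):
    """Estimate a standard deviation as range / (number of standard deviations the range spans)."""
    return (high - low) / sds_spanned

high_f, low_f = 130.0, -120.0            # extremes quoted in the essay
sd = range_rule_sd(high_f, low_f, 8.0)   # +/-4s, i.e. 8 standard deviations in total
mid = (high_f + low_f) / 2.0             # midpoint of the assumed range
print(sd, mid - sd, mid + sd)            # 31.25 -26.25 36.25
```

A smaller divisor would give a proportionally larger estimate, so the result is only as good as the assumed extremes and the assumed number of standard deviations they span.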
In summary, there are numerous data handling practices, which climatologists generally ignore, that seriously compromise the veracity of the claims of record average-temperatures, and are reflective of poor science. The statistical significance of temperature differences with 3 or even 2 significant figures to the right of the decimal point is highly questionable. One is not justified in using the approach of calculating the Standard Error of the Mean to improve precision, by removing random errors, because there is no fixed, single value that random errors cluster about. The global average is a hypothetical construct that doesn’t exist in Nature. Instead, temperatures are changing, creating variable, systematic-like errors. Real scientists are concerned about the magnitude and origin of the inevitable errors in their measurements.
Mendenhall, William, (1975), Introduction to probability and statistics, 4th ed.; Duxbury Press, North Scituate, MA, p. 41
Smirnoff, Michael V., (1961), Measurements for engineering and other surveys; Prentice Hall, Englewood Cliffs, NJ, p.181
Taylor, John R., (1982), An introduction to error analysis – the study of uncertainties in physical measurements; University Science Books, Mill Valley, CA, p.6
[1] Note: One cannot take a single measurement, add it to itself a hundred times, and then divide by 100 to claim an order of magnitude increase in precision. Similarly, if one has redundant measurements that don’t provide additional information regarding accuracy or dispersion, because of poor precision, then one isn’t justified in averaging them and claiming more precision. Imagine that one is tasked with measuring an object whose true length is 1.0001 meters, and all that one has is a meter stick. No amount of measuring and re-measuring with the meter stick is going to resolve that 1/10th of a millimeter.