Guest post by Lance Wallace
Last week (Aug 30), Anthony Watts posted my analysis of the errors in estimating true mean temperatures due to the use of the (Tmin+Tmax)/2 approach widely used in thousands of temperature measuring stations worldwide: http://wattsupwiththat.com/2012/08/30/errors-in-estimating-temperatures-using-the-average-of-tmax-and-tmin-analysis-of-the-uscrn-temperature-stations/ . The errors were determined using the 125 stations in NOAA’s recently-established US Climate Reference Network (USCRN) of very high-quality temperature measuring stations. Some highlights of the findings were:
A majority of the sites had biases that were consistent throughout the years and across all seasons of the year.
The 10-90% range was about -0.5 C to +0.5 C. (Negative values indicate underestimates of the true temperature due to using the Tminmax approach.)
Two parameters, latitude and relative humidity, were fairly powerful influences on the direction and magnitude of the bias, explaining about 30% of the observed variance in the monthly averages. Geographic influences were also strong, with coastal sites typically overestimating true temperature and continental sites underestimating it.
A better approach than the Tminmax method may be to use observations at fixed hours, which would eliminate the problem of the time of observation of the temperature extremes. One common algorithm is to use measurements at 6 AM, noon, 6 PM, and midnight. We will describe this method as 6121824. A second approach, used in Germany for many years, was to take measurements at 7 AM, 2 PM, and 9 PM (71421), or in some cases to give double weight to the 9 PM measurement (7142121). (h/t to Michael Limburg for the information on the German algorithm.)
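For readers who want to experiment with their own hourly data, here is a minimal sketch of the four estimators applied to one day of observations. The hour indexing (0 = local midnight) and the use of hour 0 as the "midnight" reading are my assumptions for illustration, not the official USCRN or German conventions.

```python
# Minimal sketch of the four daily-mean estimators, assuming `hourly` is a
# sequence of 24 hourly mean temperatures with index 0 = local midnight.
# Hour conventions here are illustrative assumptions only.

def t_minmax(hourly):
    """Traditional estimate: average of the daily extremes."""
    return (max(hourly) + min(hourly)) / 2.0

def t_6121824(hourly):
    """Four fixed observations: 6 AM, noon, 6 PM, midnight."""
    return (hourly[6] + hourly[12] + hourly[18] + hourly[0]) / 4.0

def t_71421(hourly):
    """German three-observation scheme: 7 AM, 2 PM, 9 PM."""
    return (hourly[7] + hourly[14] + hourly[21]) / 3.0

def t_7142121(hourly):
    """German scheme with the 9 PM reading given double weight."""
    return (hourly[7] + hourly[14] + 2.0 * hourly[21]) / 4.0

def t_true(hourly):
    """Reference 'true' daily mean: average of all 24 hourly means."""
    return sum(hourly) / len(hourly)
```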
How do these methods compare to the Tminmax method? Do they lower the error? Would latitude and RH and geographic conditions continue to be predictors of their errors, or would other parameters be important? In this Part II of this study, we attempt to answer these questions, using again the USCRN as a high-quality test-bed.
In Part I, two datasets from the NOAA site ftp://ftp.ncdc.noaa.gov/pub/data/uscrn/products/ were employed—the daily and monthly datasets, with about 360,000 station-days and 12,000 station-months, respectively. For our purposes here, we also need the hourly dataset, with about 8.2 million records. This was obtained (again with help from the NOAA database manager Scott Embler) on Sept. 4, 2012. These three datasets are all available from me at lwallace73@gmail.com.
The hourly dataset provides the maximum, minimum, and mean temperature for each hour. Also recorded are precipitation (mm), solar radiation flux (W/m2), and RH (%). Since the RH measurements were added several years after the start of the network, only about a third of the hours (2.8 million), days (120,000) and months (3600) have RH values.
A first look confirms that 3 or 4 measurements per day are better than two (Figure 1). The entire range of the 6121824 method almost fits into the interquartile range of the Tminmax method (-0.2 to +0.2 C).
Figure 1. Errors in using four algorithms to estimate true mean temperature. Values are monthly averages across all months of service for 125 stations in the USCRN.
A measure of the monthly error is provided by the distribution of the absolute errors (Table 1). The Tminmax method is clearly inferior by this measure, having about 3 times the absolute error of the 6121824 method. The two German methods are intermediate at close to 0.2 C.
Table 1. Distribution of absolute errors for 4 algorithms.
| Method | Valid N | Mean Abserror | Std.Dev. | 25%ile | Median | 75%ile | Maximum |
|---|---|---|---|---|---|---|---|
| ABSMINMAX | 11109 | 0.32 | 0.27 | 0.10 | 0.20 | 0.50 | 1.9 |
| ABS6121824 | 11333 | 0.11 | 0.10 | 0.04 | 0.08 | 0.15 | 1.3 |
| ABS71421 | 11333 | 0.19 | 0.17 | 0.07 | 0.15 | 0.26 | 1.3 |
| ABS7142121 | 11333 | 0.20 | 0.17 | 0.08 | 0.16 | 0.28 | 1.3 |
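A rough sketch of how the Table 1 summary can be reproduced from the monthly dataset follows; the DataFrame and its signed-error column names are placeholders of mine, not the actual headers in the NOAA files.

```python
# Sketch of reproducing the Table 1 summary with pandas. `monthly` has one row
# per station-month; the error column names below are placeholders.
import pandas as pd

ERR_COLS = ["err_minmax", "err_6121824", "err_71421", "err_7142121"]

def abs_error_summary(monthly: pd.DataFrame) -> pd.DataFrame:
    rows = {}
    for col in ERR_COLS:
        abserr = monthly[col].abs().dropna()
        rows[col] = {
            "Valid N": abserr.count(),
            "Mean Abserror": abserr.mean(),
            "Std.Dev.": abserr.std(),
            "25%ile": abserr.quantile(0.25),
            "Median": abserr.median(),
            "75%ile": abserr.quantile(0.75),
            "Maximum": abserr.max(),
        }
    return pd.DataFrame(rows).T.round(2)
```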
We can compare methods across years or across seasons for any given site. The error for a given method was often about the same across all four seasons, although the bias across methods could be quite large (Figure 2). Errors across years were even more stable, but again with large biases across the methods (Figures 3 & 4).
Figure 2. Errors (C) by season at Durham NC. DeltaT is the error from the Tminmax method.
Figure 3. Errors (C) by year at Gadsden AL.
Figure 4. Errors (C) at Newton GA.
In Part I, I provided a map of the error from the Tminmax method. That map (updated to include 4 new Alaskan stations and an additional month of August 2012) is reproduced here as Figure 5. The strong geographic effect is immediately apparent, with the overestimates (blue) located along the Pacific Coast and in the Deep South, while underestimates (red) are in the higher and drier western half of the continent as well as along the very northernmost tier of states from Maine to Washington.
Figure 5. DeltaT at 121 USCRN stations. Colors are quartiles. Red: -0.67 to -0.20 C. Gold: -0.20 to -0.02 C. Green: -0.02 to +0.21 C. Blue: +0.21 to +1.35 C.
The next three Figures (Figures 6-8) map the three algorithms discussed in this post: the 4-point 6121824 algorithm as in the ISH network and the 3-point algorithms used in Germany (71421 and 7142121). The 4-point algorithm (Figure 6) does not have the well-demarcated geographic clusters of the Tminmax method. There is a cluster of overestimates (blue) in the farmland of the Middle West from North Dakota to Texas. Just to the west of them, however, there is a set of strong underestimates (red) from Montana through Colorado to New Mexico.
Figure 6. DeltaT 6121824 at 125 USCRN stations. Colors are quartiles. Red: -0.24 to -0.07 C. Gold: -0.07 to -0.02 C. Green: -0.02 to +0.02 C. Blue: +0.02 to +0.25 C.
The 3-point method 71421 (Figure 7) shows something of a latitude-longitude dependence, with the strongest overestimates (blue) mostly in the North and West. This algorithm is rather heavily biased toward positive errors, so that even the red dots include some overestimates along with strong underestimates.
Figure 7. DeltaT 71421 at 125 USCRN stations. Colors are quartiles. Red: -0.21 to +0.08 C. Gold: +0.08 to +0.13 C. Green: +0.13 to +0.20 C. Blue: +0.20 to +0.45 C.
The errors in method 7142121 with the doubled 9 PM measurement (Figure 8) have a cluster of strong underestimates (red) in the Deep South and the Atlantic Coast from Florida to the Carolinas. Here the green dots are the best estimates (between -0.04 and +0.03) but they are spread throughout most of the country with the exception of the Deep South.
Figure 8. DeltaT 7142121 at 125 USCRN stations. Colors are quartiles. Red: -0.41 to -0.17 C. Gold: -0.17 to -0.04 C. Green: -0.04 to +0.03 C. Blue: +0.03 to +0.43 C.
As in Part I, a multiple regression was performed to detect what measured parameters might have an effect on the error associated with a given method. There are 6 available parameters: latitude, longitude, elevation, precipitation, solar radiation, and RH. Since some of these may be collinear, it is important to determine whether they are sufficiently related to cause errors in the multiple regression. The best way to do this is probably the test devised in Belsley, Kuh, and Welsch (1980). Their test has been incorporated in the SAS PROC REG/COLLIN. Not knowing SAS, nor having access to someone who does, I tried factor analysis, as implemented in Statistica v11 (Table 2). Two variables with heavy loadings on Factor 1 were solar radiation and RH (with opposite signs). Factor 2 was dominated by latitude and longitude. Since the earlier regressions showed that RH was generally stronger than solar radiation, and latitude stronger than longitude, the two weaker variables were left out of some regressions to see if the sign and magnitude of the other parameters would change markedly. However, little change was noticed. Therefore the multiple regressions presented here include all 6 variables.
Table 2. Factor analysis of 6 explanatory variables.
| Variable | Factor 1 | Factor 2 |
|---|---|---|
| LONGITUDE | 0.11 | 0.86 |
| LATITUDE | 0.29 | -0.78 |
| ELEVATION | -0.58 | -0.39 |
| PRECIP | 0.50 | 0.10 |
| SOLRAD | -0.73 | 0.30 |
| RHMEAN | 0.86 | 0.09 |
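For those without Statistica, an analogous two-factor extraction can be run in Python. This is only a sketch (no rotation is applied, so the loadings will not match Table 2 exactly), and the DataFrame name and columns are my placeholders.

```python
# Sketch of a two-factor extraction as a stand-in for the Statistica run.
# `stationmonths` and its column names are placeholders.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import FactorAnalysis

PREDICTORS = ["LONGITUDE", "LATITUDE", "ELEVATION", "PRECIP", "SOLRAD", "RHMEAN"]

def two_factor_loadings(stationmonths: pd.DataFrame) -> pd.DataFrame:
    X = StandardScaler().fit_transform(stationmonths[PREDICTORS].dropna())
    fa = FactorAnalysis(n_components=2, random_state=0).fit(X)
    # Rows are variables, columns are the two extracted factors.
    return pd.DataFrame(fa.components_.T, index=PREDICTORS,
                        columns=["Factor 1", "Factor 2"]).round(2)
```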
Following are the multiple regressions on the errors due to the four different methods (Tables 3-6). Table 3 is a slightly modified (addition of stations in Alaska and Hawaii plus one additional month) version of the corresponding table for the Tminmax errors in Part I. As in Part I, the updated regression shows about equal effects of latitude and RH, accounting for nearly all of the 29% R2 value. The maps in Part I and Figure 5 above showed the powerful effect of the coastal stations (overestimates) and the Western Continental stations (underestimates).
The six measured parameters had far less effect on the method using four equally-spaced hourly measurements (Table 4). In this case, solar radiation had the strongest effect, with an increase in sunlight leading to larger underestimates. However, the R2 was very small, at about 6%.
The strongest effect on the 71421 method was latitude, and it was in the opposite direction from the effect noted for the Tminmax method (Table 5). Overall, however, the R2 was similarly low, at about 7%.
The method that double-counted the 9 PM measurement was similar in one respect to the Tminmax results, with the two main parameters being RH and latitude, both close to equal in explanatory power (t values of +18 and -18.6) (Table 6). However, the signs of each were in the opposite direction from the Tminmax results. The R2 value of 17% was quite a bit higher than for the other two methods using specified hours, but less than for Tminmax.
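The regressions behind Tables 3-6 can be approximated with ordinary least squares along the following lines; again, the column names are placeholders of mine, and the dependent variable is swapped for whichever method's error is being examined.

```python
# Sketch of the multiple regressions behind Tables 3-6 using statsmodels OLS.
# Column names are placeholders, not the headers of the NOAA files.
import statsmodels.api as sm

PREDICTORS = ["LONGITUDE", "LATITUDE", "ELEVATION", "PRECIP", "SOLRAD", "RHMEAN"]

def error_regression(stationmonths, error_col="err_minmax"):
    data = stationmonths[[error_col] + PREDICTORS].dropna()
    X = sm.add_constant(data[PREDICTORS])
    fit = sm.OLS(data[error_col], X).fit()
    return fit.summary()  # coefficients, t values, and the R-squared values discussed above
```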
Discussion
A clear finding from this analysis is that the multipoint methods are better than the Tminmax method at estimating the true temperature. In fact, a nice result is that the 2-point method (Tminmax) had an absolute average error of about 0.3 C, the 3-point method error was around 0.2 C, and the 4-point method brought the mean absolute error down to 0.1 C. However, this is averaged across all 125 sites and 11,000 months, so errors can be quite a bit larger for individual sites, as shown in some of the figures above.
Although one could guess, based on the multiple regression results, that higher-latitude sites using the Tminmax method would be more likely to underestimate the true temperature, and coastal sites to overestimate it, the R2 was small enough (29%) that only a ground-truth investigation could be relied on to determine the precise sign and magnitude of the error. It might also be argued that even determining the size of the error at the present time would not tell us what the error was historically. However, the great stability across the years shown by these sites suggests that a proper measurement today could in fact predict past performance for many stations that had stable locations and measurement methods.
With respect to the 4-point method, a second network, the Integrated Surface Hourly (ISH) network, uses this approach: ftp://ftp.ncdc.noaa.gov/pub/data/inventories/ISH-HISTORY.TXT. This network apparently has some thousands of stations, although I am not sure how many are of the same high quality as the USCRN stations. Based on these findings, one would expect the errors in this network to be considerably smaller than the errors at stations using the Tminmax method. However, the multiple regressions here give little indication of the direction and magnitude of the error at any individual station. Therefore, at this network as well as at other stations, a proper series of measurements over several years would be needed to estimate the magnitude and direction of the error at a given station. If the basic finding here, that such errors are highly repeatable over the years, applies to many or most stations, then such an approach could go far toward indicating the actual temperature field of the world, even at much earlier times when only a limited set of measurements (subject to errors of the magnitude and direction found here) was available.
Conclusions
None of the temperature measurement algorithms were without error. The traditional Tminmax method was the worst, with a mean absolute error of about 0.3 C. The 3-point German methods (71421 and 7142121) had a mean absolute error of about 0.2 C, and the 4-point (6121824) method a mean absolute error of about 0.1 C. The Tminmax method is strongly affected by latitude and RH, whereas the other methods are less affected by these variables.
All methods were very stable from year to year for most sites. There was somewhat more variation by season, but a majority of methods had the same sign (i.e. consistently over- or under-estimated the true mean temperature) for all four seasons and for all years.
For a given site, it was difficult to predict which of the three fixed-time methods might over- or under-estimate the true mean temperature. Even the Tminmax method performed better than all the others for some sites.
The use of the USCRN network to study these methods was advantageous in offering one of the highest-quality networks available. However, it is of course limited to the US, with a limited latitude and longitude range. Of interest would be to extend this analysis to a more globally representative group of stations. For example, might it be true that stations at polar and tropical latitudes would confirm the latitude dependence found here, and perhaps even show higher underestimates? Would coastal sites around the world continue to over-estimate true mean temperatures? How would poor-quality sites, such as those affected by urban heat island (UHI) or other effects, depend on these parameters compared to high-quality sites? If large areas around the globe were found to be over- or under-estimating true mean temperatures due to the algorithm employed, how might it affect global climate models (GCMs), which may be tuned to slightly wrong historical temperature fields?
Table 1 too large, running into “facebook” section, can’t be read.
kadaka (KD Knoebel) says:
September 12, 2012 at 10:42 pm
“Table 1 too large, running into “facebook” section, can’t be read.”
The obscured numbers are :
75th percentile: 0.50, 0.15, 0.26, 0.28.
maximum: 1.9, 1.3, 1.3, 1.3.
So all methods resulted in at least one station having an absolute mean error >1 C.
I am a bit confused. Surely with the advent of temp. recorders temperatures are now measured every second or so and a mean is calculated for the day, also indicating what was the max and what was the min?
Yes, but historically observers would go out once a day and record just the Tmax and Tmin values for the past 24 hours. So most of our historical global record for the past 100 years is subject to the sort of errors discussed in the post. Even contemporary stations (NOAA-ISH) may use the four-observations-per-day method. I suspect stations in less developed areas may not have access to electronic recorders. Someone more knowledgeable than I might comment on what fraction of stations globally still use a small number of observations to estimate mean daily temperature. And I seem to recall a reference to a meteorological organization (WMO?) continuing to endorse the Tmax/Tmin approach in order to maintain historical continuity. Perhaps a reader can confirm or refute that impression.
What is the relationship between temperature and energy in a fluid? Does it have any real meaning?
I’m not sure why you think accurately computing the mean temperature for the day is important. This number really doesn’t tell us anything useful. Surely what matters is whether temperature has changed. Replicating the measurements taken in the past from today’s more complete data would seem a simpler approach.
I wonder what the influence of time zones is. In Part I (Fig. 9), you show the temperature as a function of (local) time. If you want to estimate the true mean (~ area under graph) for the sinusoidal shape, it makes sense that you have a number of “best” times that add up to a good estimate. However, if two nearby stations are on different sides of a time zone boundary, they will have horizontally shifted graphs, and they will use different “best” times. I wonder what happens if you correct for longitude and then sample at 3 or 4 points. I think that strip of reds and blues close to each other (through the center of the USA) in Figs. 6-8 might suddenly turn out to be less different from each other.
Frank de Jong says:
September 13, 2012 at 12:14 am
“I wonder what the influence of time zones is.”
An important point that I had not considered. It should be possible to consider this as an additional parameter among those that might affect the error.
By correcting for longitude, I mean sampling each station at “true local time”, i.e. a continuous time related to its longitude. One station close to a time zone boundary would then sample its “6 AM” point at, say, 5:35 AM, whilst the one on the other side would sample at 6:25 AM.
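For concreteness, the shift Frank describes works out to four minutes per degree of longitude away from the time-zone meridian; here is a minimal sketch, ignoring the equation of time (the example station is hypothetical).

```python
# Local mean solar time vs. zone clock time: 4 minutes per degree of longitude
# from the zone meridian (zone meridians are multiples of 15 degrees).
def solar_time_offset_minutes(longitude_deg, zone_meridian_deg):
    """Minutes to add to zone clock time to obtain local mean solar time
    (east longitudes positive)."""
    return 4.0 * (longitude_deg - zone_meridian_deg)

# Example: a station at 97.5 W in the US Central zone (meridian 90 W) has an
# offset of -30 minutes, so its "6 AM" solar-time sample falls at 6:30 AM on
# the zone clock.
print(solar_time_offset_minutes(-97.5, -90.0))  # -30.0
```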
HenryP: “You’re a bit confused”, for me, is an understatement. If I understand this (and I probably don’t), it’s not the thermometers that get it wrong; it’s the method of calculating the mean that is the problem, and which mean temperature figure you are using for comparison. If I want to know how much the global temperature is actually increasing or decreasing and I use the 1961-1990 mean I will get one answer, and if I use 1951-1980 I get a different answer. Australia has just had a bumper snow season, best in ten years, and in Melbourne for the last 2 years it hasn’t stopped raining. Our dams were below 30% full and now they’re at almost 80% and still rising. If this is Global Warming, bring it on.
Ian H says:
September 13, 2012 at 12:03 am
“I’m not sure why you think accurately computing the mean temperature for the day is important.”
I think it’s always important to get the best estimates you can of anything you are studying. In particular, we are trying to understand the climate, which is driven by parameters such as temperature, among others. An erroneous measurement of one parameter will have ramifications on our calculations. Global climate models are matched against historical measurements of temperature. If those are erroneous, the models will be wrongly tuned. Of course, it may be that these errors are small enough that it will not matter. But it might. Why not try to find the best estimates for everything going into your models?
Richard111 says:
September 12, 2012 at 11:33 pm
“What is the relationship between temperature and energy in a fluid? Does it have any real meaning?”
A deep question, much discussed over the years in its relationship to climate. I think energy is the important parameter here. Almost the entire energy flux consists of radiation from the sun and radiation into space from the earth. I don’t think we can measure earth’s radiation very well, so we don’t know whether the flux is balanced or tipped one way or the other. From that point of view, the temperature is a secondary quantity. In fact a “global temperature” has about as much meaning as an average telephone number. A value of 15 C, as an average between 30 C at the tropics and 0 C at the poles, would have very different energy implications from the same value of 15 C as an average between 15 C at the equator and 15 C at the poles. However, temperature does enter into physics and weather and climate calculations, as in the perfect gas law and of course the Stefan-Boltzmann law, so in that respect it seems important to measure it as accurately as possible.
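As a rough illustration of that point about averages, a back-of-the-envelope Stefan-Boltzmann calculation (unit emissivity assumed) shows that two temperature fields with the same mean radiate differently because flux scales as T^4.

```python
# 30 C / 0 C split vs. uniform 15 C: same mean temperature, different flux.
SIGMA = 5.670e-8  # Stefan-Boltzmann constant, W m^-2 K^-4

def flux(temp_c):
    return SIGMA * (temp_c + 273.15) ** 4

uniform = flux(15.0)                    # ~391 W/m^2
split = 0.5 * (flux(30.0) + flux(0.0))  # ~397 W/m^2
print(round(uniform, 1), round(split, 1))
```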
So the solution is to bring the whole world up to USCRN quality, sorted.
While looking into the Time of Observation (TOBs) issue, I found this apparently-forgotten NCDC directory:
ftp://ftp.ncdc.noaa.gov/pub/data/ushcn/daily/
It has a curious README file about the USHCN (save it before it is disappeared):
ftp://ftp.ncdc.noaa.gov/pub/data/ushcn/daily/README
A real head-scratcher is in “NCDC QA Checks and Adjustments”:
Am I reading that right? The min-max measurements are for the previous 24 hours. Let’s say an observer takes readings in the early morning. They know the maximum must have occurred during the previous day, before midnight.
But if the observer tries to record that maximum on the day it actually happened, the day of occurrence, as part of “quality control” that maximum is transferred to the day when the readings were taken in the morning, the day of observation, which is not the day when the maximum happened.
That sounds bad enough, as the adjustment creates error. The quality checker knows that measurement belongs to a certain day, but assigns it to the following day?
But how is this done in practice? “Nearby” stations, perhaps hundreds of miles away, have a maximum of about a certain number recorded for this day, this station has that measurement on the previous day, but since it’s impossible for “nearby” stations to have different maximums on different days, they all have to have matching maximums on matching days, it’s obvious the number is on the wrong day so it gets moved?
And there’s still the issue of being sure which actual day the maximum actually occurred on. If a researcher wants the maximum of a certain day, like one where there was a notable tornado, and that region has a lot of morning observers recording maximums on day of observation instead of day of occurrence, what reading will he find when checking the daily records?
Plus I’ve known there to be freakish weather, when a different front is moving in or something similar, when the minimum was during the day with the maximum occurring during the night. How do temperature records built from 24 hour min/max readings show that?
Am I understanding what it says about that quality check right?
The previous quality check also doesn’t sound that great.
Okay, who else besides me has seen weather swings where the maximum of one day can be less than the minimum of the previous day, usually in the spring and fall, or the minimum can exceed the previous day’s maximum? NCDC seems to believe this is impossible.
An article I wrote using Australian data shows how using (Tmin+Tmax)/2, compared to fixed-time temperature measurements, over-estimates the amount of warming over the last 60 years by 43%.
http://www.bishop-hill.net/blog/2011/11/4/australian-temperatures.html
One might cynically conclude that the reason the WMO, etc persist with the min/max method is it shows far more warming than other more accurate methods.
Excellent research for the next 50 years. Thereafter, modern measurement systems should take over.
@richard, “What is the relationship between temperature and energy in a fluid? Does it have any real meaning?” The relationship is in the specific heat capacity of the fluid, the intrinsic and most real of these phenomena. Temperature is a measure of thermal energy flow.
Perhaps if you look to the meaning of field in math and physics the meanings may become clear.
Why not let Nature do the averaging? Put your thermometer a few feet underground and you’ll get a nice smooth temperature that rises and falls with the seasons. This would only go wrong in areas with lots of geothermal activity or radioactive minerals, but those are easy to spot.
An unknown (to me) with the local time values in USCRN is daylight saving time. From the USCRN description of “local time” I would assume that it represents true local time, with daylight saving adjustment in the summer. Would this affect any fixed time-of-day readings?
It should be visible from the hourly data, I just haven’t had time to see if the changeover days have 23 or 25 hourly readings rather than a fixed 24.
Another thought I’ve had in this area is to calculate a daily average based upon sunrise to sunrise.
As a database guy I always think in those terms, and I did a few back-of-the-envelope calculations regarding temperature records. If one took a reading every fifteen minutes you’d have 96 records/day for a site. That would give you a nice sample of the day to produce a temperature curve — if the low of 50 only occurs for one sample and then the curve rapidly climbs and stays around for much of the day, was the average temperature really 61? Surely knowing the shape of the temperature curve would be useful information. Statistical analyses can’t possibly reproduce a measurement like that.
So, if one had 1500 weather stations producing 96 records/day you would get 144,000 records/day for the entire system. You don’t need but a station ID, timestamp and a temperature measurement in a record, so you’ve got an integer, timestamp, and float for a record (at its most basic). Even allowing for large storage types you’re only talking about 80 bytes/record, or about 11 MB/day of storage. A year comes to about 4 GB. A cheap hard drive these days has a terabyte on it, so one of those could store about 250 years worth of data, give or take.
Wouldn’t that be a nice dataset to look at?
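The arithmetic checks out; a quick back-of-the-envelope version, using the figures assumed above:

```python
# Storage estimate using JamesS's assumed figures (1500 stations, readings
# every 15 minutes, ~80 bytes per record).
stations, records_per_station_day, bytes_per_record = 1500, 96, 80

records_per_day = stations * records_per_station_day      # 144,000
mb_per_day = records_per_day * bytes_per_record / 1e6     # ~11.5 MB
gb_per_year = mb_per_day * 365 / 1e3                      # ~4.2 GB
years_per_terabyte = 1e12 / (gb_per_year * 1e9)           # ~240 years
```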
kadaka wrote
“Checks were implemented to ensure that maximum temperatures were never less than minimum temperatures on the day of occurrence, the preceding day, and the following day.”
Okay, who else besides me has seen weather swings where the maximum of one day can be less than the minimum of the previous day, usually in the spring and fall, or the minimum can exceed the previous day’s maximum?
———————
Since I have all the USCRN data in a SQL database, I can write a query across that data set and see if that situation ever occurred in the last 8-10 years it covers (in the US).
I have already checked that the min is always <= max for a day for every day with valid data, but hadn't thought about the needed check against nearby days.
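A sketch of such a cross-day check in pandas (the DataFrame and column names are hypothetical, not the USCRN headers):

```python
# Flag days whose maximum is below the previous day's minimum, or whose
# minimum exceeds the previous day's maximum, for one station's daily data.
import pandas as pd

def flag_cross_day_inversions(daily: pd.DataFrame) -> pd.DataFrame:
    d = daily.sort_values("date").reset_index(drop=True)
    prev = d[["tmax", "tmin"]].shift(1)
    bad = (d["tmax"] < prev["tmin"]) | (d["tmin"] > prev["tmax"])
    return d[bad]
```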
Since we have a lot of existing data on the Tminmax basis, a useful extension of the research would be to compare a trend derived from Tminmax data with trends derived from the other methods.
It’s interesting how many people seem to miss this.
Understanding the measurement capability and errors in the system is very important, particularly when the issues being debated are of the same magnitude. If we’re ever to have good data that can tell us what is actually happening, these are the kinds of things that need determining. This is very good work. My compliments to Mr. Wallace.
Gerry Parker
JamesS,
“Wouldn’t that be a nice dataset to look at?”
In my experience too, more data is always better, particularly in a noisy environment. Some will say Shannon’s Theorem tells us we only need so many samples to find the min/max signal, but that is without regard to higher frequency system noise (intermittent jet exhaust) or, as you say, what about higher frequency components that distort the signal into non-sinusoidal shapes. Is there something to be learned from that?
Gerry Parker
Be gentle with me if I am being stupid ( this is not my area of expertise) but……
T max/min are the result of energy flow into and out of the climate system.
The primary source of that energy is the sun, and the maximum amount of energy from it at any one time (at the equator) is E·sin(a), where a is the angle of the sun above the horizon and E is the solar emission.
The amount absorbed by the earth is moderated by such things as clouds, so the actual amount absorbed may be less than max.
Only knowing what Tmin/max are does not tell us the total amount of radiation received, because we have no intermediate points on the curve that would allow us to calculate it; the curve could be parabolic, hyperbolic, or a straight line. Only if we record a time series can we get any meaningful result.
So it seems to me that using only Tmin/max to obtain a mean global temp is an exercise in futility.
Sorry…that should be ‘Only if we record a daily time series’…..
Frank de Jong says:
September 13, 2012 at 12:14 am
“I wonder what the influence of time zones is.”
=============
And daylight savings time – which is not universally applied in all time zones. Figure 6 for example. It looks like someone forgot that farmers don’t bother with daylight savings time.