By Andy May
While studying the NOAA USHCN (United States Historical Climatology Network) data I noticed the recent differences between the raw and final average annual temperatures were anomalous. The plots in this post are computed from the USHCN monthly averages. The most recent version of the data can be downloaded here. The data shown in this post was downloaded in October 2020 and was complete through September 2020.
There are two ways to compute the difference and they give different answers. One way is to subtract the raw from the final temperature month-by-month, ignoring missing values, then average the differences by year. Computed this way from the USHCN monthly values, a difference is taken only when both the raw and final average temperatures exist for a given month and station. Using this method, the numerous “estimated” final temperatures are ignored, because there is no matching raw temperature. This plot is shown in Figure 1.
Figure 1. Plot of USHCN final temperatures minus raw data. The difference between final and raw monthly average temperatures is computed when both exist for a specific month. The differences are then averaged. Data used is from NOAA.
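The paired calculation behind Figure 1 can be sketched in a few lines. This is a minimal illustration in Python, not the R code used for this post. The function name, the dictionary layout keyed by (station, year, month), the use of None for a missing value, and the sample temperatures are all assumptions made for the example.

```python
from collections import defaultdict


def paired_difference_by_year(raw, final):
    """Average final-minus-raw by year, using only months where both exist.

    raw and final map (station, year, month) -> temperature, or None when
    the value is missing. This layout is hypothetical, not the USHCN format.
    """
    diffs = defaultdict(list)
    for key, final_value in final.items():
        raw_value = raw.get(key)
        if raw_value is None or final_value is None:
            continue  # no matching pair, so this month is skipped
        _station, year, _month = key
        diffs[year].append(final_value - raw_value)
    return {year: sum(d) / len(d) for year, d in sorted(diffs.items())}


# Invented values: station B has no raw reading, only an estimated final one.
raw = {
    ("A", 2019, 1): 10.0,
    ("A", 2019, 2): 12.0,
    ("B", 2019, 1): None,
}
final = {
    ("A", 2019, 1): 10.5,
    ("A", 2019, 2): 12.5,
    ("B", 2019, 1): 19.0,
}

print(paired_difference_by_year(raw, final))  # {2019: 0.5}
```

The estimated value for station B never enters the result, because there is no raw value to pair it with.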
In Figure 1 we can see two things. First, the number of raw data stations drops quickly from 2005 to 2019. As we can see in Figure 2 this is not a problem for the final temperatures. How is this so? It turns out that as the active weather stations disappear from the network, the final temperature for them is estimated from neighboring stations. The estimates are made from nearby active stations using the NOAA pairwise homogenization algorithm or “PHA” (Menne & Williams, 2009a). The USHCN is a high-quality subset of the larger COOP set of stations. The PHA estimates are not made with just the USHCN high-quality stations; the algorithm utilizes the full COOP set of stations (Menne, Williams, & Vose, 2009).
Figure 2. Final temperatures from 1900 to 2019. Notice 1218 values are present from 1917 to 2019. As shown in Figure 1, these are not all measurements; a significant number of the values are estimated. Data used is from NOAA.
So, what happens if we simply average the values for each month in both the raw dataset and the final dataset, ignoring nulls, then subtract the raw yearly average from the final yearly average? This is done in Figure 3. We realize that the raw values represent fewer stations and that the final values contain many estimated values. The number of estimated final values increases rapidly from 2005 to 2019.
Figure 3. USHCN final-raw temperatures computed year-by-year, regardless of the number of stations in each dataset. The sharp rise in the temperature difference occurs from 2015 to 2019.
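The year-by-year calculation behind Figure 3 can be sketched the same way. Again this is a Python illustration under stated assumptions, not the R code used for this post. The function names, the (station, year, month) keys, None for missing values, and the sample temperatures are hypothetical. The sketch pools every non-missing monthly value in a year into one average; the original programs may order the averaging differently.

```python
def yearly_mean(values):
    """Mean of all non-missing monthly values in each year."""
    sums = {}
    for (_station, year, _month), temperature in values.items():
        if temperature is None:
            continue  # nulls are ignored
        total, count = sums.get(year, (0.0, 0))
        sums[year] = (total + temperature, count + 1)
    return {year: total / count for year, (total, count) in sums.items()}


def difference_of_yearly_means(raw, final):
    """Final yearly mean minus raw yearly mean, for years present in both."""
    raw_mean = yearly_mean(raw)
    final_mean = yearly_mean(final)
    return {
        year: final_mean[year] - raw_mean[year]
        for year in sorted(final_mean)
        if year in raw_mean
    }


# Invented values: station B has no raw reading, only an estimated final one.
raw = {
    ("A", 2019, 1): 10.0,
    ("A", 2019, 2): 12.0,
    ("B", 2019, 1): None,
}
final = {
    ("A", 2019, 1): 10.5,
    ("A", 2019, 2): 12.5,
    ("B", 2019, 1): 19.0,
}

print(difference_of_yearly_means(raw, final))  # {2019: 3.0}
```

In this toy case the final value at station A is only 0.5 degrees above the raw value in each month, yet the difference of the yearly means is 3.0 degrees. The estimated value for station B enters the final average but has no counterpart in the raw average, so the two averages describe different sets of stations.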
The above plots use all raw data and all final data in the USHCN datasets. Information about the data is available on the NOAA web site. In addition, John Goetz has written about the data and the missing values in some detail here.
The USHCN weather stations are a subset of the larger NOAA Cooperative Observer Program weather stations, the “COOP” mentioned above. USHCN stations are the stations with longer records and better data (Menne, Williams, & Vose, 2009). All the weather station measurements are quality checked, and if problems are found a flag is added to the measurement. To make the plots shown here, the flags were ignored, and all values were plotted and used in the calculations. Some plots made with this data, by NOAA and by others, are made this way; others reject some or all of the flagged data. Little data exists before 1900, so we chose to begin our plots at that date. There are fewer than 200 stations in 1890. All the weather stations in the USHCN network are plotted in Figure 4. Those with more than 50 missing monthly averages between January 2010 and the end of 2019 are noted with red boxes around the symbol.
Figure 4. All USHCN weather stations. Those with missing raw data monthly averages have red boxes around them. Data source: NOAA.
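The rule used to box stations in Figure 4 (more than 50 missing raw monthly averages between January 2010 and the end of 2019) can be sketched as follows. This is a Python illustration with hypothetical function names and an invented data layout, not the R code used for this post. A month counts as missing here if the station has no entry for it or the entry is None.

```python
def count_missing_months(raw, station, start_year=2010, end_year=2019):
    """Count months in the window with no raw value for the station."""
    return sum(
        1
        for year in range(start_year, end_year + 1)
        for month in range(1, 13)
        if raw.get((station, year, month)) is None
    )


def stations_with_many_missing(raw, stations, threshold=50):
    """Stations with strictly more than `threshold` missing months."""
    return [s for s in stations if count_missing_months(raw, s) > threshold]


# Invented records covering the 120 months from 2010 through 2019.
raw = {}
for year in range(2010, 2020):
    for month in range(1, 13):
        raw[("A", year, month)] = 15.0      # complete record
        if year <= 2014:
            raw[("B", year, month)] = 15.0  # stops after 2014: 60 missing
        if year <= 2014 or (year == 2015 and month <= 10):
            raw[("C", year, month)] = 15.0  # exactly 50 missing

print(stations_with_many_missing(raw, ["A", "B", "C"]))  # ['B']
```

Station C is not flagged because the rule is "more than 50", and it is missing exactly 50 months.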
The plots above show that the overall effect of the estimated, or “infilled,” final monthly average temperatures is a rapid recent rise in average temperature, as is clearly seen in Figure 3. In Figure 3 the overall monthly averages from the final weather station values, which include the estimated (“infilled”) values, are averaged and then compared to the average of the real measurements, the raw data. This is not a station-by-station comparison. The station-by-station comparison is shown in Figure 1. In Figure 1 the monthly differences are computed only if a station has both a raw measurement and a final value. The values from 2010 to 2019 still look strange, but not as strange as in Figure 3.
Clearly the rapid drop-off of stations during this time, which averages more than 20 stations per year, is playing a role in the strange difference between Figures 1 and 3. But the extreme jump seen from 2015 to 2019 in Figure 3 is mostly in the estimated values in the final dataset. We might think the 2016 El Niño played a role in this anomaly, but it continues to 2019. The El Niño effect reversed in 2017 in the U.S., as seen in Figure 2. Besides, this anomaly is not in temperature; it is a difference between the final and raw temperature values in the USHCN dataset.
Figure 4 makes it clear that the dropped stations (boxed in red) are widely scattered. The areal coverage over the lower 48 states is similar in 2010 and 2019, except perhaps in Oklahoma; I am not sure what happened there. But in the final dataset, values were estimated for all the terminated weather stations, and those estimated values apparently caused the jump shown in Figure 3.
I don’t have an opinion about how the year-by-year Final-Raw anomaly in Figure 3 happened, only that it looks very strange. Reader opinions and additional information are welcome.
Some final points. I used R to read and process the data, although I used Excel to make a lot of the graphs. The USHCN data is complete and reasonably well documented for the most part, but hard to read and get into a usable form. For those who want to check what I’ve done and make sure these plots were made correctly, I’ve collected my R programs in a zip file that you can download and use to check my work.
I plan to do more with the USHCN data and its companion GHCN (Global Historical Climatology Network) dataset. I’ll publish more posts on them as issues come up.
One point of confusion in the data is unrelated to this post. NOAA calls their time-of-observation corrected data “tob.” It stands for time-of-observation bias and accounts for minimum and maximum temperatures taken at different times at different stations. All the tob data supplied on their ftp site has 13 monthly values. I’ve read the papers and the documentation but cannot figure out why there are 13 monthly values for tob, but only 12 monthly values for all the other datasets. I emailed them to ask but have not received an answer to date. Does anyone know? If so, please put the answer in the comments.
Download the R code used to read the USHCN monthly raw and final data and compute the data plotted in this post here.
You can purchase my latest book, Politics and Climate Change: A History, here. The content in this post is not from the book.
The bibliography can be downloaded here.