Comparing GHCN V1 and V3

Much Ado About Very Little

Guest post by Zeke Hausfather and Steve Mosher

E.M. Smith claims (see his full post here: Summary Report on v1 vs v3 GHCN) to have found numerous differences between GHCN version 1 and version 3, differences that, in his words, constitute “a degree of shift of the input data of roughly the same order of scale as the reputed Global Warming”. His analysis is flawed, however: the raw data in GHCN v1 and v3 are nearly identical, and trends in the globally gridded raw data for both are effectively the same as those found in the published NCDC and GISTemp land records.


Figure 1: Comparison of station-months of data over time between GHCN v1 and GHCN v3.

First, a little background on the Global Historical Climatology Network (GHCN). GHCN was created in the late 1980s after a large effort by the World Meteorological Organization (WMO) to collect all available temperature data from member countries. Many of these were in the form of logbooks or other non-digital records (this being the 1980s), and many man-hours were required to process them into a digital form.

Meanwhile, the WMO set up a process to automate the submission of data going forward, setting up a network of around 1,200 geographically distributed stations that would provide monthly updates via CLIMAT reports. Periodically NCDC undertakes efforts to collect more historical monthly data not submitted via CLIMAT reports, and more recently has set up a daily product with automated updates from tens of thousands of stations (GHCN-Daily). This structure of GHCN as a periodically updated retroactive compilation with a subset of automatically reporting stations has in the past led to some confusion over “station die-offs”.

GHCN has gone through three major iterations. V1 was released in 1992 and included around 6,000 stations, with only mean temperatures available and no adjustments or homogenization. Version 2 was released in 1997 and added a number of new stations, minimum and maximum temperatures, and manually homogenized data. V3 was released last year and added many new stations (both in the distant past and post-1992, where Version 2 showed a sharp drop-off in available records), and switched the homogenization process to the Menne and Williams Pairwise Homogenization Algorithm (PHA) previously used in USHCN. Figure 1, above, shows the number of station records available for each month in GHCN v1 and v3.

We can perform a number of tests to see if GHCN v1 and v3 differ. The simplest is to compare the observations in both data files for the same stations. This is somewhat complicated by the fact that station identity numbers changed between v1 and v3, and we have been unable to locate a translation table between the two. We can, however, match stations between the two sets using their latitude and longitude coordinates. This gives us 1,267,763 station-months of data whose stations match between the two sets with a precision of two decimal places.
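To illustrate the matching step, here is a minimal Python sketch (the actual analysis was done in Stata; the column names and toy inventories below are illustrative, not the real GHCN inventory format):

```python
# Minimal sketch of matching GHCN v1 and v3 stations on lat/lon rounded to two
# decimal places. Column names and the toy inventories are illustrative only.
import pandas as pd

def add_keys(inv: pd.DataFrame) -> pd.DataFrame:
    """Add join keys: latitude and longitude rounded to two decimals."""
    return inv.assign(lat_key=inv["lat"].round(2), lon_key=inv["lon"].round(2))

def match_by_coords(inv_v1: pd.DataFrame, inv_v3: pd.DataFrame) -> pd.DataFrame:
    return add_keys(inv_v1).merge(add_keys(inv_v3),
                                  on=["lat_key", "lon_key"],
                                  suffixes=("_v1", "_v3"))

# Toy example: the station IDs differ between versions, but the coordinates match.
v1 = pd.DataFrame({"id": [101], "lat": [48.57], "lon": [-123.53]})
v3 = pd.DataFrame({"id": [50101], "lat": [48.571], "lon": [-123.532]})
print(match_by_coords(v1, v3))   # one matched pair despite the different IDs
```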

When we calculate the difference between the two sets and plot the distribution, we get Figure 2, below:


Figure 2: Difference between GHCN v1 and GHCN v3 records matched by station lat/lon.

The vast majority of observations are identical between GHCN v1 and v3. If we exclude identical observations and just look at the distribution of non-zero differences, we get Figure 3:


Figure 3: Difference between GHCN v1 and GHCN v3 records matched by station lat/lon, excluding cases of zero difference.

This shows that while the raw data in GHCN v1 and v3 are not identical (at least via this method of station matching), there is little bias in the mean. Differences between the two might be explained by the resolution of duplicate measurements at the same location (called imods in GHCN version 2), by updates to the data from various national MET offices, or by refinements in station lat/lon over time.

Another way to test if GHCN v1 and GHCN v3 differ is to convert the data of each into anomalies (with baseline years of 1960-1989 chosen to maximize overlap in the common anomaly period), assign each to a 5 by 5 lat/lon grid cell, average anomalies in each grid cell, and create a land-area weighted global temperature estimate. This is similar to the method that NCDC uses in their reconstruction.
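As a rough illustration of that procedure, here is a hedged Python sketch (again, the real analysis was done in Stata; the column names are assumptions for the example, and the actual GHCN files would need their own parsing):

```python
# Rough sketch of the gridding described above: anomalies on a 1960-1989
# baseline, averaged into 5x5 degree cells, then combined with cos(latitude)
# area weights. Column names are assumptions, not the real GHCN file layout.
import numpy as np
import pandas as pd

def gridded_global_mean(df: pd.DataFrame) -> pd.Series:
    """df columns: station_id, year, month, lat, lon, temp (monthly means, deg C)."""
    # 1. Anomalies: subtract each station's 1960-1989 mean for that calendar month.
    clim = (df[df["year"].between(1960, 1989)]
            .groupby(["station_id", "month"])["temp"].mean()
            .rename("clim"))
    df = df.join(clim, on=["station_id", "month"])
    df["anom"] = df["temp"] - df["clim"]
    df = df.dropna(subset=["anom"])          # drop stations with no baseline data

    # 2. Assign each observation to a 5x5 degree grid cell (cell centre stored).
    df["lat_cell"] = np.floor(df["lat"] / 5) * 5 + 2.5
    df["lon_cell"] = np.floor(df["lon"] / 5) * 5 + 2.5

    # 3. Average anomalies within each cell, then area-weight cells by cos(lat).
    cells = df.groupby(["year", "lat_cell", "lon_cell"])["anom"].mean().reset_index()
    cells["w"] = np.cos(np.deg2rad(cells["lat_cell"]))
    return cells.groupby("year").apply(lambda g: np.average(g["anom"], weights=g["w"]))
```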


Figure 4: Comparison of GHCN v1 and GHCN v3 spatially gridded anomalies. Note that GHCN v1 ends in 1990 because that is the last year of available data.

When we do this for both GHCN v1 and GHCN v3 raw data, we get the figure above. While we would expect some differences simply because GHCN v3 includes a number of stations not included in GHCN v1, the similarities are pretty remarkable. On the century scale the trends in the two are nearly identical. This differs significantly from the picture painted by E.M. Smith; indeed, instead of the shift in input data being equivalent to 50% of the trend, as he suggests, we find that the difference in trend amounts to a mere 1.5%.

Now, astute skeptics might agree with us that the raw data files are, if not identical, overwhelmingly similar, but point out that there is one difference we did not address: GHCN v1 had only raw data with no adjustments, while GHCN v3 has both adjusted and raw versions. Perhaps the warming that E.M. Smith attributed to changes in input data is in fact due to changes in adjustment methods?

This is not the case, as GHCN v3 adjustments have little impact on the global-scale trend vis-à-vis the raw data. We can see this in Figure 5 below, where both GHCN v1 and GHCN v3 are compared to published NCDC and GISTemp land records:


Figure 5: Comparison of GHCN v1 and GHCN v3 spatially gridded anomalies with NCDC and GISTemp published land reconstructions.

If we look at the trends over the 1880-1990 period, we find that both GHCN v1 and GHCN v3 are quite similar, and lie between the trends shown in GISTemp and NCDC records.

1880-1990 trends:

GHCN v1 raw: 0.04845 C (0.03661 to 0.06024)
GHCN v3 raw: 0.04919 C (0.03737 to 0.06100)
NCDC adjusted: 0.05394 C (0.04418 to 0.06370)
GISTemp adjusted: 0.04676 C (0.03620 to 0.05731)
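For readers who want to compute comparable trend figures from a gridded annual series, here is a minimal Python sketch of an ordinary least-squares trend with its 95% confidence interval (the post's own numbers were produced in Stata; this is only an illustrative equivalent):

```python
# Illustrative OLS trend with a 95% confidence interval for an annual anomaly
# series; the inputs are whatever yearly series you build from the gridded data.
import numpy as np
import statsmodels.api as sm

def trend_with_ci(years, anomalies):
    """Return the OLS slope and its 95% confidence interval (per year)."""
    X = sm.add_constant(np.asarray(years, dtype=float))
    fit = sm.OLS(np.asarray(anomalies, dtype=float), X).fit()
    lower, upper = fit.conf_int()[1]   # second row corresponds to the slope term
    return fit.params[1], (lower, upper)
```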

This analysis should make it abundantly clear that the change in raw input data (if any) between GHCN version 1 and GHCN version 3 had little to no effect on global temperature trends. The exact cause of Smith’s mistaken conclusion is unknown; however, a review of his code does indicate a few areas that seem problematic. They are:

1. An apparent reliance on station IDs to match stations. Station IDs can differ between versions of GHCN.

2. Use of First Differences. Smith uses the first differences method, but he has made idiosyncratic changes to it, especially in cases where there are temporal lacunae in the data. The method, which was formerly used by NCDC, has known issues and biases, detailed by Jeff Id. Smith’s implementation, and his handling of gaps in the data in particular, is unproven and may be the cause.

3. It’s unclear from the code which version of GHCN v3 Smith used.

STATA code and data used in creating the figures in this post can be found here: https://www.dropbox.com/sh/b9rz83cu7ds9lq8/IKUGoHk5qc

Playing around with it is strongly encouraged for those interested.

Comments
June 23, 2012 7:08 am

steven mosher says:
…….
Hi Steven
Thanks for your comment on the other thread. I noticed the Santa Fe BEST (page 10) shows a similar spectral response, but I am not certain whether using 5-yr smoothing is a good idea.

June 23, 2012 7:13 am

Richard Fowler,
Here, perhaps this can help somewhat. This is EM’s description of what he does.
“2) Missing data handling. For classical First Differences, if there are missing data, you just re-zero and reset. As there are a lot of missing data in some places, that gives some very strange results. I just assume that if you have January for some years in a row, and leave out a couple, then get some more, that somewhere in between you passed through the space between the two. So if you had a series that was +1/10, +1/10, -1/10, missing, missing, +2/10; I just hold onto the last anomaly and wait while skipping missing data. When I hit the +2, I just account for all the change in that year. So you would have 1/10, 2/10, 1/10, 3/10 as the series. This, IMHO, more accurately reflects the reality of the temperatures recorded. That last year WAS 2/10 higher than the prior value, so throwing it away and using a zero is just wrong. In this way this code is much more robust to data dropouts and also more accurate.”
So, my concern is this.
1. We know from Jeff Id’s fine work (Jeff is the skeptic who tore Steig’s work to shreds) that first differences is a flawed method.
2. EM departs from this method and “invents” his own approach.
That approach is untested (have a look at the synthetic tests that Jeff Id did on first differences).
If you ask me why EM gets odd results, quite logically I can only point to two possibilities:
data or method. Assuming he used GHCN v1 raw and GHCN v3 raw, that logically leaves method as the reason. I look at his method and I see that he uses a method that has been discredited by leading skeptics and that he has made untested changes to the method (while trying to say it’s “peer reviewed”). I kinda shrug and suggest that maybe there is an issue there. For me, I use the better methods as suggested by Jeff and Roman. I used to think first differences was the best. I was wrong. One thing I have always appreciated here and at Climate Audit is people’s insistence that we use the best methods.

June 23, 2012 7:17 am

Sure thing vuk.
You can expect some updates to that Santa Fe chart in the coming months. I suspect folks who do spectral analysis will be very interested.

June 23, 2012 7:22 am

Steven Mosher,
“Interesting that you think Tobs only applies to the US. It doesn’t.”
If you say this, it is because you have a case in mind. Do you have a reference?
#############
yes. do more reading and post your results.
“With regard to station moves, I prefer the BEST methodology.”
It has the disadvantage of not allowing one to assess the magnitude of the adjustments.
###############
Of course you can assess the magnitude. It’s a switch in the code that you can turn off or turn on.
“although in practice we know that explicit adjustments give the same result.”
Yes, explicitly or implicitly, all global temperature curves are homogenized.
####
The wonderful thing about having new data is that you can actually test a method by withholding data. You can test a method’s skill with and without the withheld data.

Pamela Gray
June 23, 2012 7:39 am

Question. Are the stations between the two data versions still the exact same stations, or have there been station dropouts or upgrades (and in some cases deteriorating stations) as time went by? Keeping due diligence over the study plots (tracking how they have changed with careful observations), and having a gold-standard control group kept in pristine condition to compare them with, is vital before homogenization methods can be developed. I don’t think that has been done to any extent. Therefore the raw and homogenized results are probably worthless. The homogenization methods are a shot in the dark.

June 23, 2012 7:40 am

(Moderator — feel free to snip, but I think this is relevant)
I tried to compare BEST to Environment Canada data for one station near where I live.
I realized the data in BEST was crappy.
http://sunshinehours.wordpress.com/2012/03/13/auditing-the-latest-best-and-ec-data-for-malahat/
“To start with I am looking at one station. In BEST it is StationID 7973 – “MALAHAT, BC”. In EC it is station MALAHAT which is Station_No 1014820.
I am comparing the BEST SV (Single Valued) data to the BEST QC (Quality Controlled) data.
The first minor problem is that the EC data has records from the 1920s and 1930s that BEST does not have (that I have found). That’s no big deal. The next problem is that out of 166 Month/Year records, not one of them matched exactly. BEST SV and QC data is to 3 decimal points while EC is to 1.
For example. Jan 1992 has QC = 5.677, as does SV, while EC = 5.8. Close. But not an exact match.
However, the real problem is that there are 5 records that have been discarded between SV and QC. Two out of the five make no sense at all, and one is iffy.
Where it says “No Row” it means BEST has discarded the record completely between SV and QC.
1991 is iffy. EC has it as 4.5, SV has 3.841. Close, but not that close.
1993 makes no sense at all.
2002 is fine. That’s a huge error. But where the heck did BEST get the -13.79 number in the first place?
2003 is fine. But again, where the heck did BEST get the -4.45 number in the first place?
Finally, 2005 makes no sense at all. There is little difference between -1.1 and -1.148. Certainly most records are that different.
And those are just the discarded records!
There are another 48 records with a difference of .1C or greater, and here are the greater than .2C ones.”

DocMartyn
June 23, 2012 7:41 am

steven mosher, I have, as always, a simple request.
In your figure 3 you have all the stations that have been mathematically warmed and cooled during the revision, and also have their location.
If it is not too much trouble could you color-code them and plonk them on a whole Earth map and let us see if all the warm ones are clustered around the areas where we are observing ‘unprecedented’ warming?
Call me a cynical old fool, but a bit of warming here and a bit of cooling there and pretty soon you can be talking Catastrophic Anthropogenic Global Warming.

June 23, 2012 7:45 am

pouncer.
“fit for purpose”
From my standpoint the temperature record has nothing to do with knowing that radiative physics is true. We know that from engineering tests. If you add more GHGs to the atmosphere you change the effective radiating level of the atmosphere, and that over time will result in the surface cooling less rapidly.
So it depends upon what purposes you are talking about. People are also confused about what the temperature record really is, what it can really tell us, and how it is used. Chief amongst those confusions is the idea that we claim to know it to within tenths or hundredths.
Let me see if I can make this simple. Suppose I do a reconstruction and I say that the
“average” global anomaly in 1929 is .245 C. What does that mean?
It means this: if you find a hidden data record running from 1929 to the present and calculate its anomaly, the best estimate of its 1929 anomaly will be .245C. That is, this estimate will minimize the error in your estimate. We collect all the data that we do have and we create a field. That field actually is an estimate that minimizes the error, such that if you show up with new data my estimate will be closer to the new data than any other estimate.
The global average isn’t really a measure of a physical thing. You can compare it to itself and see how it’s changing; it’s only a diagnostic, an index. Of course in PR land it gets twisted into something else.
So fit for purpose? You can use it for many purposes. You could use it to tell a GCM modeler that he got something wrong. You could use it to calibrate a reconstruction and have a rough guess at past temperatures. You could use it to do crude estimates of sensitivity. It’s fit for many purposes. I wouldn’t use it to plan a picnic.
Let me give you a concrete example. I’ve come across some new data from the 19th century. Hundreds of stations never seen before. Millions of new data points: preserved photos of written records on microfiche. What method would you use to predict the temperatures I hold in my hand? Well, the best method you have is to form some kind of average of all the data that you hold and use that to estimate the data that I hold in my hands.
The best averaging method is not first differences. It’s not the common anomaly method. It’s not the reference station method. The best method is going to be something like Jeff Id’s method, or Nick Stokes’ method, or the Berkeley Earth method. When they estimate .245678987654C
that estimate will be closer to the data that I hold than any other method. The precision of that guess has nothing to do with the quality of the data, it has to do with minimizing the error of prediction given the data.
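A toy numerical illustration of that withholding idea (entirely made-up numbers, not real station data):

```python
# Toy illustration of the claim above: withhold each "station" in turn and
# predict it from the average of the rest; the field average beats a naive
# alternative (predicting zero). All numbers here are made up.
import numpy as np

rng = np.random.default_rng(0)
field = 0.245 + rng.normal(0.0, 0.5, size=200)   # 200 noisy station anomalies for one year

mse_mean = np.mean([(x - np.delete(field, i).mean()) ** 2 for i, x in enumerate(field)])
mse_zero = np.mean(field ** 2)                   # predict 0 for every withheld station
print(mse_mean, mse_zero)                        # the field average has the smaller error
```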

June 23, 2012 7:57 am

Hi Doc,
One reason why Zeke provided the data and code in the public drop box is to allow people with questions to answer the questions for themselves. By releasing the data and the code we effectively give people the power to prove their points.
When I fought to get Hansen to release the code, and when I sent an FOIA request to Jones, it wasn’t to get them to answer my questions. It was to get their tools so I could ask the questions I wanted to ask.

Rob Dawg
June 23, 2012 8:05 am

• “We can perform a number of tests to see if GHCN v1 and 3 differ. The simplest one is to compare the observations in both data files for the same stations. This is somewhat complicated by the fact that station identity numbers have changed since v1 and v3, and we have been unable to locate translation between the two. ”
——
Simply a stunning admission. What judge in any court would, upon hearing the prosecutors admit to this, not throw the case out of court?

June 23, 2012 8:25 am

Mosher: “Well, the best method you have is to form some kind of average of all the data that you hold and use that to estimate the data that I hold in my hands.”
I wouldn’t use all data, I would use data for a reasonably sized region and compare the data and see what the differences are.
For example, if all of your old data was 1C or 2C warmer than NOAA’s adjusted old data, then the odds are your data is right.

Michael R
June 23, 2012 8:30 am

I certainly do not have the kind of expertise to take sides in the argument. One thing that does give me pause, however, is that I have seen the type of analysis done in Figure 3 previously, also used as evidence of no bias in adjustments; as was pointed out then, that kind of graph isn’t supportive of anything.
Should most or all of the adjustments on the right side of the bell curve happen to fall in more recent times, and the opposite be true for earlier times, then you artificially create a warming trend, and indeed could make a huge warming trend, while still showing equal cooling and warming adjustments.
I cannot comment on the rest of it, but I would caution against the use of the argument in the above paragraph, as the last time I saw it used was in an attempt to mislead people, which automatically makes me untrusting of the following argument. That may very well be unfortunate rather than unwarranted.

June 23, 2012 8:31 am

sunshinehours, the Env Canada data has quality control flags that you should apply before doing any comparison. If you download my package CHCN (be sure to get the latest version), that might help you some.
Let’s start with Environment Canada for Jan 1992. You really didn’t tell people everything, now did you? Here is the actual data from Environment Canada:
Time Tmax Tmean Tmin
1992-01 8.60 E 5.80 E 2.60 E
As you note, Environment Canada has 5.8 and BEST has 5.67.
BEST does not use Environment Canada as a source.
If you use my package, however, you can look at Environment Canada data.
When you do, here is what you find: you failed to report the quality flags. See that letter E that follows the Tmax, Tmin, and Tmean?
That flag means the value in their database is ESTIMATED.
Now do the following calculation: (8.6+2.6)/2, i.e. (Tmax + Tmin)/2, and see what you come up with.
Is it 5.8? Nope. Looks like 5.6 to me. Without looking at BEST data (I’m tired of correcting your mistakes) I would bet that the source for the BEST data is daily data. Environment Canada also has daily data for that site; it has daily data from
1920-01-01/2012-05-16
Looking at the Environment Canada Excel file for that site and examining the quality flags, you should note that a very, very large percentage of the figures you rely on have a quality flag of “E” or “I”. E stands for estimated; I stands for incomplete.
Folks can verify that by looking at the csv file for the station.
Station Name MALAHAT
Province BRITISH COLUMBIA
Latitude 48.57
Longitude -123.53
Elevation 365.80
Climate Identifier 1014820
WMO Identifier 71774
TC Identifier WKH
Legend
[Empty] No Data Available
M Missing
E Estimated
B More Than One Occurrence and Estimated
I The value displayed is based on incomplete data
S More than One Occurrence
T Trace
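For anyone who wants to repeat that check, here is a rough Python sketch of the idea: flag values marked “E” or “I” and recompute the monthly mean as (Tmax + Tmin)/2. The column names are illustrative, not the exact headers of the EC CSV export.

```python
# Rough sketch of the check described above: flag Environment Canada monthly
# values whose quality flag is "E" (estimated) or "I" (incomplete), and
# recompute the mean as (Tmax + Tmin) / 2. Column names are illustrative.
import pandas as pd

def check_ec_monthly(df: pd.DataFrame) -> pd.DataFrame:
    """df columns: date, tmax, tmax_flag, tmin, tmin_flag, tmean (illustrative)."""
    out = df.copy()
    out["suspect"] = out["tmax_flag"].isin(["E", "I"]) | out["tmin_flag"].isin(["E", "I"])
    out["tmean_calc"] = (out["tmax"] + out["tmin"]) / 2
    return out

row = pd.DataFrame({"date": ["1992-01"], "tmax": [8.6], "tmax_flag": ["E"],
                    "tmin": [2.6], "tmin_flag": ["E"], "tmean": [5.8]})
print(check_ec_monthly(row))   # tmean_calc is about 5.6, both flags "E", so suspect = True
```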

June 23, 2012 8:37 am

1. “I’m using “raw” in the sense that most of you do.”
If it’s not philosophically “raw,” then “raw” is not a very descriptive term for what it is. That’s a big problem.
2. “There are scads of errors like this”
Sounds like the data may not be very meaningful if there continues to be “scads” of errors in it.
Andrew

Pamela Gray
June 23, 2012 8:38 am

Steven, a significant portion of your time as a scientist should be spent explaining what you have said. To refuse to do so by telling others to figure it out for themselves seems a bit juvenile and overly dressed in “ex-spurt” clothes. If questions come your way, kindly doing your best to answer them seems the better approach, especially with such a mixed audience of arm-chair enthusiasts such as myself and truly learned contributors. I for one appreciate your post. Don’t spoil it.

Ripper
June 23, 2012 8:55 am

The first minor problem is that the EC data has records from the 1920s and 1930s that BEST does not have (that I have found).
=======================================================
Same thing here in western Australia.
Instead of looking at the records used, it is time to look at the records that are not used.
It appears to me that the records with warm years in Western Australia in the 1920s-40s are just not used.
If someone can point me to where GHCN or CRU uses the long records from e.g. Geraldton town, which starts in 1880
http://members.westnet.com.au/rippersc/gerojones1999line.jpg
or Kalgoorlie post office that starts in 1896
http://members.westnet.com.au/rippersc/kaljones1999line.jpg
I would be most grateful.
Or Carnarvon post office that starts in 1885… etc.
It appears to me that in the SH Phil Jones used very little data that was available from 1900 to 1940-50 odd, but selected different stations in each grid cell for 1895-1899.
e.g. instead of using those years from the above mentioned stations, he filled the grid cell with 1897-1899 (despite that record going to 1980) from Hamelin pool.

Pamela Gray
June 23, 2012 9:03 am

Station drop out has always intrigued me. Here’s why. Station dropout may have done so in non-random fasion. Therefore the raw data collected from stations over time may have non-random bias in them before they were homogenized, which may have caused the raw data to be even more biased.
How would one determine this? One way would be comparing ENSO-driven temperature trend patterns from analogue years and dated station dropout plots. I would start with the US since ENSO patterns are fairly well studied geographically. Overlaying a map of dated station dropout on temperature and precipitation pattern maps under El Nino, La Nina, and neutral ENSO conditions through the decades may be quite revealing and possibly demonstrate explained variance being related to non-random station dropout interacting with ENSO oscillations and patterns. If so, homogenization may have only served to accentuate the bias in the raw data itself.

climatebeagle
June 23, 2012 9:21 am

Rob Dawg says: June 23, 2012 at 8:05 am
“Simply a stunning admission.”
Exactly, that jumped out at me as well; how on earth did anyone think that was a good idea?
Thanks for posting it. It would be good if both sides could continue the dialog to resolution, whatever the outcome may be (e.g. both have valid points, or acceptance that one (or both) analyses are wrong/weak/strong/right/…).

Pamela Gray
June 23, 2012 9:25 am

By the way, local low-temperature records in Oregon are falling right and left, as are precipitation records. Why? Not because of global cooling but because of the ENSO oceanic and atmospheric conditions we currently have in and over the great pond to the West of us that uniquely affect our weather patterns. These oceanic and atmospheric conditions have been in place, more or less, for several years now, resulting in several predictable changes in flora and fauna response and broken records related to lows over the same number of years in Oregon.
What is interesting and connected to my earlier comment, is that these same conditions result in year in and year out heat waves and drought elsewhere. Which leads me to restate: Station dropout patterns over time need further study in relation to ENSO conditions and the fairly well established geographic temperature trends tied to ENSO conditions. I think the raw data may be biased.

dp
June 23, 2012 9:25 am

Start over and show where Smith’s error is. All you’ve shown is you have arrived at a different result with different methods. No surprise. I could do a third and be different again. It would not show that Smith or you are right (or wrong).

Andrew Greenfield
June 23, 2012 9:27 am

There is NO significant warming. When will they learn?
http://www.drroyspencer.com/wp-content/uploads/UAH_LT_1979_thru_May_2012.png
There is no hotspot in the TLT

E.M.Smith
Editor
June 23, 2012 10:00 am

I haven’t had time to look through the comments, but to address a couple of points quickly:
The version of v3 which I used is:
ghcn.tavg.v3.1.0.20120511.qcu.dat
The version used ought not to have much effect. If it does, then there are far more variations in the data than IMHO could be reasonably justified.
The “attack” on station matching is just silly. I went out of my way, especially in comments, to make clear that the “shift” was not a result of individual data items, but a consequence of the assemblage of records. It isn’t a particular record that has changes, it is WHICH records are collected together. As I don’t do “station matching”, there isn’t any consequence from station IDs changing. The “country code” portion changes dramatically, but the WMO part less so. I do make assemblages of records by “country code”, assuring that the correct mapping of countries is done, as a different approach to getting geographically smaller units to compare. That map is at:
http://chiefio.wordpress.com/2012/04/28/ghcn-country-codes-v1-v2-v3/
In large part, this article asserts ‘error’ in doing something that I specifically asserted I did not do: match stations or compare them data item by data item. If asked, I would freely respond that most individual records from individual stations will have highly similar, if not identical, data for any given year / month. (There are some variations, but they are small.) It ignores what I stated was the main point: that the collection of thermometer records used changes the results in a significant way.
In essence, “point 1” is irrelevant as it asserts a potential “error” in something I specifically do not do: station matching.
The chart showing an equal distribution to each side of zero for non-identical records ignores the time distribution. (I’m not making an assertion about what the result of a station-by-station comparison in the time domain would be, just pointing out that it is missing.) For the assemblage of data, the time distribution of change matters. The deep past shows more warm shifting while the middle time shows more cold, then swapping back to warmth more recently. The data in aggregate have a cooler time near the periods used as the “baseline” in codes like GIStemp and by Hadley, but are warmer recently (and in the deep past). It would be beneficial to make that “by item” comparison 3-D with a time axis. (No, I’m not asserting what the result would be. As it is a “by item” comparison, it says nothing about the impact of the data set in total).
Point 2 has some validity in that I do handle data drop out differently from the classical way. I simply hold onto the last value and wait for a new data item to make the comparison. So, for example, if Jan 1990 has data, but Jan 1989 does not, while Jan 1988 does again, the classical method would give 0 0 0 as the anomaly series. (Since it would ‘reset’ on the dropout and each ‘new set’ starts with a zero). That is, IMHO, fairly useless especially in data with a large number of data dropouts (as are common in many of the stations). Lets say for this hypothetical station the temps are 10.1 missing 10.0 (and on into 9.8, 10.4 etc.). I would replace the 10.1 with a zero (as FD has ‘zero difference’ in the first value found) replace the second value with zero (as it is missing, there is ‘no difference’ found) and then the third value be compared and a 1/10 C difference found. It says, in essence, these two Januaries have changed by -0.1 C so regardless of what happened in the middle year, the slope is 0.1 C over 2 years. A value of -0.1 C is entered for the 3rd value. So which is more accurate? To say that NO change happened over 3 years that start with 10.0 and end with 10.1, or say that a change of 0.1 happened? I think that on the face of it the ‘bridging’ of dropouts is clearly more accurate; though it would benefit from more formal testing.
For individual stations, this takes what would otherwise become a series of “resets” (and thus, artificial “splice artifacts” when they are re-integrated into a trend) and instead has one set of anomalies that starts with the most recent known value, ends with the last known value, and any dropout is bridged with an interpolated slope. It will clip out bogus excursions caused by resets on missing data (as the new end point data gets effectively ignored in classical FD) and will preserve what is a known existing change between known data. Using the data above, classical FD would find 0, 0, 0, -0.2 for an overall change of 0.2 cooler moving left to right over 4 years. By bridging the gap instead of taking a reset, my method gets 0, 0, -0.1, -0.2 and finds an overall change of -0.3 going from 10.1 to 9.8 (which sure looks a whole lot more correct to me…)
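For concreteness, here is a minimal Python sketch of the two gap-handling behaviours as described in the preceding paragraphs (an illustration of the description only, not the actual analysis code):

```python
# Minimal sketch of the two gap-handling behaviours described above. "Classical"
# first differences reset after a missing value; the bridging variant carries
# the last valid value across the gap. Illustration only, not the actual code.
import math

def first_differences(temps, bridge_gaps):
    """Per-step difference series; a missing value contributes a zero difference."""
    diffs, last = [], None
    for t in temps:
        if math.isnan(t):
            diffs.append(0.0)
            if not bridge_gaps:
                last = None          # classical behaviour: reset after the gap
            continue
        diffs.append(0.0 if last is None else t - last)
        last = t
    return diffs

series = [10.1, float("nan"), 10.0, 9.8]     # the example used above
classic = first_differences(series, bridge_gaps=False)
bridged = first_differences(series, bridge_gaps=True)
print([round(d, 3) for d in classic], round(sum(classic), 3))  # [0.0, 0.0, 0.0, -0.2] -0.2
print([round(d, 3) for d in bridged], round(sum(bridged), 3))  # [0.0, 0.0, -0.1, -0.2] -0.3
```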
But if you want to hang your hat on that as “causal”, go right ahead. It won’t be very productive, but a thorough examination of the effects on larger assemblies of data would be beneficial. So while I’d assert that point 2 is, well, pointless; being a new minor variation on the theme it could use more proving up.
Per point 3, I’ve given you the version above. Point now null. (And frankly, if you wish to assert that particular versions of GHCN v3 vary that much, we’ve got bigger problems than I identified with version changes.)
In the end, this ‘critique’ has one point with some validity (but it isn’t enumerated). That is the “grid / box” comparison of the data. However, what it ignores is the use of the Reference Station Method in codes such as GIStemp (and I believe also in NCDC et al. as they reference each others’ methods and techniques – though a pointer to their published codes would allow me to check that. If their software is published…) So, you compare the data ONLY in each grid cell and say “They match pretty closely”, and what I am saying is “The climate codes may homogenize and spread a data item via the Reference Station Method over 1200 km in any one step, and may do it serially so up to 3600 km of influence can be had. This matters.”
Basically, my code finds the potential for the assemblage to have spurious influence via the RSM over long distances and says “Maybe this matters.”, while your ‘test’ tosses out that issue up front and says ‘if we use a constraint not in the climate codes the data are close’. No, I don’t know exactly how much the data smearing via the RSM actually skews the end product (nor does anyone else, near as I can tell) but the perpetually rosy red Arctic in GIStemp output where there is only ice and no thermometers does not lend comfort…
So as a first cut look at the critique, what I see is a set of tests looking at the things I particularly did not do (station matching) and then more tests looking at things with very tight and artificial constraints (not using the RSM as the climate codes do) that doesn’t find the specific issue I think is most important as it excludes the potential up front. Then a couple of spurious points ( such as version of v3) that are kind of obviously of low relevance.
My assertion all along has been that it is the interaction between records and within records that causes the changes in the data to “leak through” the climate codes. That RSM and homogenizing can take new or changed records and spread their influence over thousands of kilometers. That data dropouts and the infilling of those dropouts from other records can change the results. That it is the effective “splicing” of those records via the codes (in things like the grid / box averages) that creates an incorrect warming signal. By removing that homogenizing, the RSM, and the infilling, you find that they have no impact. What a surprise… Then, effectively, I’m criticized because my code doesn’t do that.
Well this behaviour is by design. I want to see what the impact of the assemblage is when ‘smeared together’ since that is what the climate codes do. Yes, I use a different method of doing the blending (selection on ranges of country code and / or WMO number) than ‘grid box’ of geographic dimensions area weighted. This, too, is by design. It lets me flexibly look at specific types of effect. (So, for example, I compared inside Australia with New Zealand. I can also compare parts of New Zealand to each other, or the assembly of Australia AND New Zealand to the rest of the Pacific – or most any other mix desired). Since one of the goals is to explore how the spreading of influence of record changes in an assemblage (via things like RSM and homogenizing) can cause changes in the outcome, this is a very beneficial thing.
This critique just ignores that whole problem and says that if we ignore the data influence of items outside their small box there is no influence outside their small box.
Oddly, the one place where I think there is a very valid criticism of the FD method is completely ignored. Perhaps because it is generic to all FD and not directed at me? Who knows… FD is very sensitive to the very first data item found. IF, for example, our series above started with 11.5, then 10.1, missing, 10.0, 9.8 etc. We get an additional -1.4 of offset to the whole history. That offset stays in place until the next year arrives. If it comes in at 10.1, then that whole offset goes away. The very first data item has very large importance. I depend on that being ‘averaged out’ over the total of thermometers, but it is always possible the data set ends in an unusual year. (And, btw, both v3 and v1 ought to show that same unusual year…) But any ‘new data’ that showed up in 1990 for v3 that was missing in v1 when it was closed out will have an excessive impact on the comparison.
At some point I’m planning to create a way to “fix that” but haven’t settled on a method. The ‘baseline’ method is used in this critique (and in the climate codes) and it has attractions. However, when looking for potential problems or errors in a computer code, it is beneficial to make the comparison to a ‘different method’ as that highlights potential problems. That makes me a bit reluctant to use a benchmark range method. Doing a new and self created method is open to attacks (such as done here on the bridging of gaps technique) even if unwarranted. One way I’ve considered is to just take the mean of the last 10% of data items and use it as the ‘starting value’. This essentially makes each series benchmarked via the first 10% of valid data mean. But introducing that kind of complexity in a novel form may bring more heat than light to the discussion.
In the end, I find this critique rather mindlessly like so many things in ‘climate science’. Of the form “If we narrow the comparisons enough we find things in tiny boxes are the same; so the big box must be the same too, even if we do different things in it.” Makes for nice hot comments threads, but doesn’t inform much.
I’m now going to have morning tea and come back to read the comments. I would suggest, however, that words like “error” and “flawed” when doing comparisons of different things in different ways are mostly just pejorative insults, not analysis. I don’t assert that the ‘grid / box’ method is in ‘error’ or ‘flawed’ because it will miss the impact of smearing data from an assembly of records over long distances. It is just limited in what it can do. Similarly it is not an ‘error’ or ‘flawed’ to use a variable time window in FD. (That is, in effect, what I do. Standard FD just accepts the time window in the data frequency, such as monthly, I just say ‘the window may be variable length, skip over dropouts’.) It is just going to find slightly different things. And, IMHO, more accurate things. It is more about knowing what you are trying to find, and which tool is likely to find it. As I’m looking for “assemblage of data impacts”, using a variable “box size” to do various aggregations (via CC/WMO patterns) lets me do that; while using a fixed geographic grid box without RSM and homogenizing will never find it. That isn’t an error or flaw in non-homogenized non-RMS small grid boxes, it is just a limitation of the tool. That the tool is used wrongly (if your goal is to find those effects) would be the error.
So, in short, the analysis here says I didn’t find what I wasn’t looking for, and they don’t find what I did look for because they don’t look for it.

June 23, 2012 10:05 am

Steve and Zeke, you’ve gone above-and-beyond here, and it’s much appreciated. I realize that you don’t owe us anything else, and I think it’s a good idea to push people to delve into the data for themselves.
At the same time, I have quite a few data sets on my laptop where I could bang out a new graph in a minute or two that would take you quite some time to figure out my R code, what data files I’ve given you and what you have to obtain from elsewhere, etc. In like manner, I’m wading through your Stata (a truly ugly language, like SAS, SPSS, gretl, and that whole command-oriented genre of statistical programs), trying to figure out what you’ve done, what files you’ve supplied, what files you have not, etc. I then suspect I’ll have to make some enormous data downloads, and finally, once I’ve reproduced your results in R I’ll be able to see how your Figure 3 histogram changes over time and location.
I think it’s a very intelligent observation that an apparent canceling of adjustments over the entire period might actually obscure a trend of adjustments over time, or over latitude, etc. You don’t have to pursue that line of reasoning to all of its boundaries, but I would guess that you could fairly quickly do a lattice-style histogram by, say, decades to settle the issue to a first approximation.
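As a starting point for that decade-by-decade check, a rough Python sketch might look like the following (assuming a data frame of matched differences with illustrative "year" and "diff" columns):

```python
# Sketch of the decade-by-decade breakdown suggested above: histogram the
# non-zero v1-minus-v3 differences separately by decade, to see whether the
# roughly symmetric distribution in Figure 3 hides a drift over time.
# The "year" and "diff" column names are illustrative.
import matplotlib.pyplot as plt
import pandas as pd

def decadal_histograms(df: pd.DataFrame):
    nonzero = df[df["diff"] != 0].copy()
    nonzero["decade"] = (nonzero["year"] // 10) * 10
    decades = sorted(nonzero["decade"].unique())
    fig, axes = plt.subplots(len(decades), 1, sharex=True, squeeze=False,
                             figsize=(6, 2 * len(decades)))
    for ax, dec in zip(axes.ravel(), decades):
        ax.hist(nonzero.loc[nonzero["decade"] == dec, "diff"], bins=50)
        ax.set_title(f"{dec}s")
    fig.tight_layout()
    return fig
```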
Thanks again!

phi
June 23, 2012 10:05 am

Steven Mosher,
Ok, that’s right. Tobs applies only to the US: you have nothing.
“Of course you can assess the magnitude. Its a switch in the code you can turn off
or turn on.”
No, you can’t disable this with BEST. What you say does not make sense.