Comparing GHCN V1 and V3

Much Ado About Very Little

Guest post by Zeke Hausfather and Steve Mosher

E.M. Smith has claimed (see full post here: Summary Report on v1 vs v3 GHCN ) to find numerous differences between GHCN version 1 and version 3, differences that, in his words, constitute “a degree of shift of the input data of roughly the same order of scale as the reputed Global Warming”. His analysis is flawed, however, as the raw data in GHCN v1 and v3 are nearly identical, and trends in the globally gridded raw data for both are effectively the same as those found in the published NCDC and GISTemp land records.


Figure 1: Comparison of station-months of data over time between GHCN v1 and GHCN v3.

First, a little background on the Global Historical Climatology Network (GHCN). GHCN was created in the late 1980s after a large effort by the World Meteorological Organization (WMO) to collect all available temperature data from member countries. Many of these were in the form of logbooks or other non-digital records (this being the 1980s), and many man-hours were required to process them into a digital form.

Meanwhile, the WMO set up a process to automate the submission of data going forward, setting up a network of around 1,200 geographically distributed stations that would provide monthly updates via CLIMAT reports. Periodically NCDC undertakes efforts to collect more historical monthly data not submitted via CLIMAT reports, and more recently has set up a daily product with automated updates from tens of thousands of stations (GHCN-Daily). This structure of GHCN as a periodically updated retroactive compilation with a subset of automatically reporting stations has in the past led to some confusion over “station die-offs”.

GHCN has gone through three major iterations. V1 was released in 1992 and included around 6,000 stations with only mean temperatures available and no adjustments or homogenization. Version 2 was released in 1997 and added a number of new stations, minimum and maximum temperatures, and manually homogenized data. V3 was released last year and added many new stations (both in the distant past and post-1992, where Version 2 showed a sharp drop-off in available records), and switched the homogenization process to the Menne and Williams Pairwise Homogenization Algorithm (PHA) previously used in USHCN. Figure 1, above, shows the number of station records available for each month in GHCN v1 and v3.

We can perform a number of tests to see if GHCN v1 and v3 differ. The simplest is to compare the observations in both data files for the same stations. This is somewhat complicated by the fact that station identity numbers changed between v1 and v3, and we have been unable to locate a translation table between the two. We can, however, match stations between the two sets using their latitude and longitude coordinates. Matching coordinates to a precision of two decimal places gives us 1,267,763 station-months of data common to both sets.
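As an illustration of this matching step (the analysis behind the figures was done in STATA, linked at the end of the post; the file and column names below are assumptions), the idea in Python is roughly:

```python
import pandas as pd

# Hypothetical file and column names; the post's actual analysis was done in STATA.
v1 = pd.read_csv("ghcn_v1_monthly.csv")   # columns: lat, lon, year, month, temp
v3 = pd.read_csv("ghcn_v3_monthly.csv")   # columns: lat, lon, year, month, temp

# Station IDs changed between versions, so match stations on coordinates
# rounded to two decimal places instead.
for df in (v1, v3):
    df["lat2"] = df["lat"].round(2)
    df["lon2"] = df["lon"].round(2)

# Join the two versions on rounded coordinates plus year and month,
# then difference every matched station-month.
matched = v1.merge(v3, on=["lat2", "lon2", "year", "month"], suffixes=("_v1", "_v3"))
matched["diff"] = matched["temp_v3"] - matched["temp_v1"]

print(len(matched), "matched station-months")
print(matched["diff"].describe())   # most differences should be exactly zero
```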

When we calculate the difference between the two sets and plot the distribution, we get Figure 2, below:


Figure 2: Difference between GHCN v1 and GHCN v3 records matched by station lat/lon.

The vast majority of observations are identical between GHCN v1 and v3. If we exclude identical observations and just look at the distribution of non-zero differences, we get Figure 3:


Figure 3: Difference between GHCN v1 and GHCN v3 records matched by station lat/lon, excluding cases of zero difference.

This shows that while the raw data in GHCN v1 and v3 is not identical (at least via this method of station matching), there is little bias in the mean. Differences between the two might be explained by the resolution of duplicate measurements in the same location (called imods in GHCN version 2), by updates to the data from various national MET offices, or by refinements in station lat/lon over time.

Another way to test whether GHCN v1 and GHCN v3 differ is to convert the data of each into anomalies (with baseline years of 1960-1989, chosen to maximize overlap in the common anomaly period), assign each station to a 5 by 5 degree lat/lon grid cell, average the anomalies in each grid cell, and create a land-area-weighted global temperature estimate. This is similar to the method that NCDC uses in their reconstruction.
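A minimal sketch of that recipe, in Python rather than the STATA actually used, with assumed column names:

```python
import numpy as np
import pandas as pd

def gridded_land_anomaly(df):
    """df: one row per station-month with columns
    station, lat, lon, year, month, temp (degrees C). Returns one value per year."""
    # 1. Anomalies relative to each station's own 1960-1989 monthly means.
    base = df[(df["year"] >= 1960) & (df["year"] <= 1989)]
    clim = base.groupby(["station", "month"])["temp"].mean().rename("clim")
    df = df.join(clim, on=["station", "month"])
    df["anom"] = df["temp"] - df["clim"]
    df = df.dropna(subset=["anom"])        # stations with no baseline data drop out

    # 2. Assign each station to a 5 x 5 degree lat/lon grid cell (cell centers).
    df["latc"] = np.floor(df["lat"] / 5.0) * 5.0 + 2.5
    df["lonc"] = np.floor(df["lon"] / 5.0) * 5.0 + 2.5

    # 3. Average anomalies within each cell, then take a cos(latitude)-weighted
    #    mean of the cells as a crude stand-in for land-area weighting.
    cells = df.groupby(["year", "latc", "lonc"])["anom"].mean().reset_index()
    cells["w"] = np.cos(np.radians(cells["latc"]))
    return cells.groupby("year").apply(lambda g: np.average(g["anom"], weights=g["w"]))
```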


Figure 4: Comparison of GHCN v1 and GHCN v3 spatially gridded anomalies. Note that GHCN v1 ends in 1990 because that is the last year of available data.

When we do this for both GHCN v1 and GHCN v3 raw data, we get the figure above. While we would expect some differences simply because GHCN v3 includes a number of stations not included in GHCN v1, the similarities are pretty remarkable. Over the century scale the trends in the two are nearly identical. This differs significantly from the picture painted by E.M. Smith; indeed, instead of the shift in input data being equivalent to 50% of the trend, as he suggests, we see that differences amount to a mere 1.5% difference in trend.

Now, astute skeptics might agree with us that the raw data files are, if not identical, overwhelmingly similar, but point out that there is one difference we did not address: GHCN v1 had only raw data with no adjustments, while GHCN v3 has both adjusted and raw versions. Perhaps the warming that E.M. Smith attributed to changes in input data might in fact be due to changes in adjustment method?

This is not the case, as GHCN v3 adjustments have little impact on the global-scale trend vis-à-vis the raw data. We can see this in Figure 5 below, where both GHCN v1 and GHCN v3 are compared to published NCDC and GISTemp land records:


Figure 5: Comparison of GHCN v1 and GHCN v3 spatially gridded anomalies with NCDC and GISTemp published land reconstructions.

If we look at the trends over the 1880-1990 period, we find that both GHCN v1 and GHCN v3 are quite similar, and lie between the trends shown in GISTemp and NCDC records.

1880-1990 trends

GHCN v1 raw: 0.04845 C (0.03661 to 0.06024)

GHCN v3 raw: 0.04919 C (0.03737 to 0.06100)

NCDC adjusted: 0.05394 C (0.04418 to 0.06370)

GISTemp adjusted: 0.04676 C (0.03620 to 0.05731)
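A minimal sketch of how such a trend and interval can be computed from an annual anomaly series (illustrative only; the post does not state the exact regression settings, nor whether the numbers above are expressed per decade):

```python
import numpy as np
from scipy import stats

def trend_with_ci(years, anomalies, per=10.0):
    """Naive OLS trend with an approximate 95% confidence interval.
    per=10.0 expresses the slope per decade; whether the figures above use that
    unit is an assumption. Autocorrelation is ignored here."""
    res = stats.linregress(np.asarray(years, float), np.asarray(anomalies, float))
    slope = res.slope * per
    half = 1.96 * res.stderr * per
    return slope, slope - half, slope + half

# Example on synthetic data spanning 1880-1990:
rng = np.random.default_rng(0)
yrs = np.arange(1880, 1991)
fake = 0.005 * (yrs - 1880) + rng.normal(0, 0.2, yrs.size)
print(trend_with_ci(yrs, fake))
```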

This analysis should make it abundantly clear that the change in raw input data (if any) between GHCN version 1 and GHCN version 3 had little to no effect on global temperature trends. The exact cause of Smith’s mistaken conclusion is unknown; however, a review of his code does indicate a few areas that seem problematic. They are:

1. An apparent reliance on station IDs to match stations. Station IDs can differ between versions of GHCN.

2. Use of First Differences. Smith uses first differences; however, he has made idiosyncratic changes to the method, especially in cases where there are temporal lacunae in the data. The method, which NCDC formerly used, has known issues and biases, as detailed by Jeff Id (a minimal sketch of the standard method appears after this list). Smith’s implementation and his method of handling gaps in the data is unproven and may be the cause.

3. It’s unclear from the code which version of GHCN V3 that Smith used.
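For context on point 2, a minimal sketch of the classical First Difference bookkeeping with the standard reset-at-gaps rule (illustrative only; this is neither Smith's code nor the code used for this post):

```python
import math

def first_differences(series):
    """Cumulative first-difference anomalies for one station's values for a
    given calendar month. A missing value (NaN) resets the running offset,
    which is the standard gap handling that Smith departs from."""
    anoms, last, offset = [], None, 0.0
    for value in series:
        if value is None or (isinstance(value, float) and math.isnan(value)):
            anoms.append(float("nan"))
            last = None              # reset: the next valid value starts a new segment
            continue
        if last is None:
            offset = 0.0             # first value of a segment gets a zero anomaly
        else:
            offset += value - last   # accumulate year-to-year changes
        anoms.append(offset)
        last = value
    return anoms

# A gap splits the record into segments that each restart at zero:
print([round(a, 2) for a in first_differences([10.1, float("nan"), 10.0, 9.8])])
# -> [0.0, nan, 0.0, -0.2]
```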

STATA code and data used in creating the figures in this post can be found here: https://www.dropbox.com/sh/b9rz83cu7ds9lq8/IKUGoHk5qc

Playing around with it is strongly encouraged for those interested.


From the last thread:
E.M.Smith says:
June 22, 2012 at 1:19 am
@Nick Stokes:
I use ghcn v3 unadjusted.

One actually needs the name of the dataset, and the code that downloads and reads it in.

“effectively the same” is not good enough. Again, global warming is founded upon 1/10ths of a degree. It is not founded upon large amounts of whole integers—i.e., it’s barely perceptible, especially to the untrained eye.
But more than that, using strictly “anomalies” isn’t good enough either because “global warming” scientists can be tricky with anomalies.
Have a look for yourself at these two videos. You’ll see there’s lots of play room available in actual temperature when looking only at anomalies of two, or more, data sets:
How ClimateGate scientists do the anomaly trick, PART 1

How ClimateGate scientists do the anomaly trick, PART 2

phi

A central feature of these comparisons is the adjustments, and the choices made about whether or not to merge segments from the same station into a single series.
National offices generally choose to form the longest possible series and homogenize them. I believe GHCN preserves the segmentation. This means that reconstructions based on the GHCN data work partly on the principle adopted by BEST: segments are homogenized de facto at the stage where everything is averaged (in the cases presented here, within cells). The actual adjustments can only be quantified if the station series have first been merged according to the methodology of the NMSs. The magnitude of the actual adjustments is remarkably stable, at about 0.5 °C for the twentieth century.

Richard T. Fowler

“Smith’s implementation and his method of handling gaps in the data is unproven and may be the cause. ”
“3. It’s unclear from the code which version of GHCN V3 that Smith used. ”
These two statements appear to contradict each other. If the code is available, how can Smith’s “implementation and his method of handling gaps in the data” be unproven?
Zeke or Steve, would you care to elaborate? Thank you.
RTF

Nice curve ball Steve. What is it in the data dance world; three strikes and you’re out?
The frequency bar charts look to heavily favor positive anomalies in both charts. It looks like warmed-up temps well outnumber cooler mods. Any chance the cooler mods are before 1970, while the positive adjustments tend towards the end of the 20th century and the beginning of the 21st? Of course, you are avoiding showing the changes by year.
The spatially gridded anomaly line comparison charts are nice, but why did you have to force the data through a grid blender first?

“…When we do this for both GHCN v1 and GHCN v3 raw data, we get the figure above. While we would expect some differences simply because GHCN v3 includes a number of stations not included in GHCN v1,…”

As I understand your gridded database, you are knowingly comparing apples to oranges, and then you follow that little twist of illogic with:

“the similarities are pretty remarkable”

I must say, that last little tidbit just might be the truest thing you’ve posted. And you are brazen enough to say

“…3. It’s unclear from the code which version of GHCN V3 that Smith used…”

You’re out!

mfo

Saturday morning. What a time to post this response to EM :o(
The First Difference Method, in comparison with others, was written about by Hu McCulloch in 2010 at Climate Audit, in response to an essay about calculating global temperature by Zeke Hausfather and Steven Mosher at WUWT.
http://climateaudit.org/2010/08/19/the-first-difference-method/
http://wattsupwiththat.com/2010/07/13/calculating-global-temperature/

Paul in Sweden

E.M. Smith, Zeke Hausfather and Steve Mosher & all of you other highly talented individuals with your own fine web sites that grind this data up – we all know who you are :),
There is a lot of work and a great deal of expenditure of time and finances going on refining the major global temperature databases for the purpose of establishing a global mean temperature trend. I imagine the same amount of resources could be dedicated towards refining various global databases regarding precipitation, wind speed, polar ice extent, sea level, barometric pressure or UserID passwd for the purpose of establishing a global mean average trend.
How do we justify the financial and resource allocations dedicated to generating and refining these global means?
I cannot fathom a practical purpose for planetary mean averages unless we are in the field of astronomy. Here on earth global mean averages for specific metrics regarding climate have no practical value (unless we are solely trying to begin to validate databases).
Regional data by zone for the purpose of agricultural and civic planning are all that I see as valuable. Errors distributed throughout entire global databases in an even manner give me little solace.

david_in_ct

So since you have all the data, why don’t you do exactly what Smith did and see if you get the same plots, instead of producing a different analysis? His main point is that the past was cooled relative to the present. Why not take all the station differences that you found and bin them by year, then plot a running sum of the average of the differences year by year? If he is correct, that graph will be a U shape. If the graph is flat, as it should be, then maybe he/you can find the differences in the data/code that each of you has used.
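For anyone who wants to run that check, a minimal sketch, assuming the matched v3-minus-v1 differences behind Figure 2 have been saved with year and difference columns (hypothetical file name):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical input: the matched v3-minus-v1 differences from Figure 2,
# one row per station-month, with 'year' and 'diff' columns.
matched = pd.read_csv("matched_differences.csv")

yearly = matched.groupby("year")["diff"].mean()     # mean difference per year
yearly.cumsum().plot(title="Running sum of mean v3 - v1 difference by year")
plt.ylabel("Cumulative difference (C)")
plt.show()
```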

Did this Australian comment from Blair Trewin of the BoM become incorporated in any international data set? Consequences?
> Up until 1994 CLIMAT mean temperatures for Australia used (Tx+Tn)/2. In
> 1994, apparently as part of a shift to generating CLIMAT messages
> automatically from what was then the new database (previously they were
> calculated on-station), a change was made to calculating as the mean of
> all available three-hourly observations (apparently without regard to
> data completeness, which made for some interesting results in a couple
> of months when one station wasn’t staffed overnight).
>
> What was supposed to happen (once we noticed this problem in 2003 or
> thereabouts) was that we were going to revert to (tx+Tn)/2, for
> historical consistency, and resend values from the 1994-2003 period. I
> have, however, discovered that the reversion never happened.
>
> In a 2004 paper I found that using the mean of all three-hourly
> observations rather than (Tx+Tn)/2 produced a bias of approximately
> -0.15 C in mean temperatures averaged over Australia (at individual
> stations the bias is quite station-specific, being a function of the
> position of stations (and local sunrise/sunset times) within their time
> zone.

Louis Hooffstetter

Informative post – thanks.
I’ve often wondered how and why temperatures are adjusted in the first place, and whether or not the adjustments are scientifically valid. If this has been adequately discussed somewhere, can someone direct me to it? If not, Steve, is this something you might consider posting here at WUWT?

wayne

In Figure 3: http://wattsupwiththat.files.wordpress.com/2012/06/clip_image006.png

“This shows that while the raw data in GHCN v1 and v3 is not identical (at least via this method of station matching), there is little bias in the mean. Differences between the two might be explained by the resolution of duplicate measurements in the same location (called imods in GHCN version 2), by updates to the data from various national MET offices, or by refinements in station lat/lon over time.”

Zeke, that is not a correct statement above, “there is little bias”. I performed a separation of the bars right of zero from the bars on the left of zero and did an exact pixel count of each of the two portions.
To the right of zero (warmer) there are 9,222 pixels contained within the bars, and on the left of zero (cooler) there are 6,834 pixels of area within. That makes the warm-side adjustments 135% of those to the cooler side. Now I do not count that as “basically the same” or “insignificant”. Do you? Really?
It seems your analysis has a warm bias itself, ignoring the actual data presented. The warm side *has* been skewed, as E.M. was pointing out. The overlying bias is always a skew to warmer temperatures, always; I have yet in three years to see one to the contrary, and that is why so many deem this junk science, or, to use a softer term, cargo cult science.

“National offices generally choose to form the longest possible series and homogenize them. I believe GHCN preserves the segmentation. This means that the reconstructions performed based on the GHCN data run slightly on the principle adopted by BEST.”
The Berkeley Earth Method does not preserve segmentations, quite the opposite. It segments time series into smaller components.

phi

Steven Mosher,
“The Berkeley Earth Method does not preserve segmentations, quite the opposite. It segments time series into smaller components,”
The opposite of BEST is the NMS methodology, which aggregates segments before homogenizing. What you did with the GHCN series is between these two extremes. In fact, you’re closer to BEST, because the segmentation present in GHCN generally corresponds to station moves, and it is these particular discontinuities which are biased.

“Richard T. Fowler says:
June 23, 2012 at 2:24 am (Edit)
“Smith’s implementation and his method of handling gaps in the data is unproven and may be the cause. ”
“3. It’s unclear from the code which version of GHCN V3 that Smith used. ”
These two statements appear to contradict each other. If the code is available, how can Smith’s “implementation and his method of handling gaps in the data” be unproven?
Zeke or Steve, would you care to elaborate? Thank you.”
Sure. In EM’s post on his method he describes his method of handling gaps in the record in words. His description is not very clear, but it is clear that he doesn’t follow the standard approach used in FDM, which is to reset the offset to 0. And in his post it wasn’t clear what exact file he downloads. For example, if you read turnkey code by McIntyre you can actually see which file is downloaded because there is an explicit download.file() command. In what I could find of Smith’s it wasn’t clear.

Louis Hooffstetter says:
June 23, 2012 at 5:35 am (Edit)
Informative post – thanks.
I’ve often wondered how and why temperatures are adjusted in the first place, and whether or not the adjustments are scientifically valid. If this has been adequately discussed somewhere, can someone direct me to it? If not, Steve, is this something you might consider posting here at WUWT?
#################
Sure. Back in 2007 I started as a skeptic of adjustments. After plowing through piles of raw and adjusted data and the code to do adjustments, I conclude:
A. Raw data has errors in it.
B. These errors are evident to anyone who takes the time to look.
C. These errors have known causes and can be corrected or accounted for.
The most important adjustment is TOBS. We dedicated a thread to it on Climate Audit.
Tobs is the single largest adjustment made to most records. It happens to be a warming adjustment.

atheok
My guess is that you did not look at EM Smith’s code.
http://chiefio.wordpress.com/2012/06/08/ghcn-v1-vs-v3-some-code/
When you look through that Fortran and the scripts... well, perhaps you can help me and figure out which file he downloaded. Look for a reference in that code that shows he downloaded this file:
ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v3/ghcnm.tavg.latest.qcu.tar.gz
Basically, if somebody asks me why someone comes to wrong conclusions, it could be the wrong data or the wrong method. Basic forensics work.
Wrong data can be:
1. wrong file
2. read in wrong
3. formatted wrong
Wrong method can be a lot of things. So, basically, I suggest starting at step zero when trying to figure these things out. Perhaps your Fortran is better than mine and you can find the line in that code that shows what file he downloads. It’s a simple check.

“A. Raw data has errors in it”
Steven Mosher, elsewhere over the years you have claimed there is no raw data.
So 2 questions:
1. Is there raw data or not?
2. How did you come to determine there were errors in it? Data is normally just data. The error occurs in the way it’s handled. Care to explain?
Andrew

“So since you have all the data, why don’t you do exactly what Smith did and see if you get the same plots, instead of producing a different analysis? His main point is that the past was cooled relative to the present. Why not take all the station differences that you found and bin them by year, then plot a running sum of the average of the differences year by year? If he is correct, that graph will be a U shape. If the graph is flat, as it should be, then maybe he/you can find the differences in the data/code that each of you has used.”
Zeke has provided the code he used to do this analysis. So, you are free to go do that. If you don’t like that code, you can go use the R packages that I maintain. Everything can be freely downloaded from the CRAN repository. The package is called RghcnV3.
My preference is to avoid GHCN V3 altogether and work with raw daily data. You get the same answers that we posted here for monthly data and avoid all the confusion and controversy surrounding GHCN V1, V2 and V3. That dataset has 26,000 stations (actually 80K when you start).

phi

Steven Mosher,
“A. Raw data has errors in it.
B. These errors are evident to anyone who takes the time to look.
C. These errors have known causes and can be corrected or accounted for.”
Corrected errors are discontinuities. The main discontinuities that cause bias are those related to station moves. They should not be regarded as errors but as corrections of increasing perturbations since the 1920s.
“The most important adjustment is TOBS. We dedicated a thread to it on Climate audit.”
Only valid for the US.

mfo
Yes, you will find in the past that I used to be a HUGE FAN of the first difference method.
Read through that Climate Audit post. Skeptic Jeff Id convinced believers Hu and Steve
that First Differences was fatally flawed. EM did not get the memo.
That is how things work. I was convinced that First Differences would solve all our problems.
I was wrong. Jeff Id made a great case and everybody with any statistical sense moved on to methods exactly like those created by Roman M and Jeff Id. That list includes: Tamino, Nick Stokes and Berkeley Earth. See Hu’s final comment:
“Update 8/29 Just for the record, as noted below at http://climateaudit.org/2010/08/19/the-first-difference-method/#comment-240064, Jeff Id has convinced me that while FDM solves one problem, it just creates other problems, and hence is not the way to go.
Instead, one should use RomanM’s “Plan B” — see
http://statpad.wordpress.com/2010/02/19/combining-stations-plan-b/, http://climateaudit.org/2010/08/19/the-first-difference-method/#comment-240129 , with appropriate covariance weighting — see http://climateaudit.org/2010/08/26/kriging-on-a-geoid/ .”

Phi.
Interesting that you think Tobs only applies to the US. It doesn’t.
With regard to station moves, I prefer the BEST methodology, although in practice we know that explicit adjustments give the same result.

Andrew
So 2 questions:
1. Is there raw data or not?
2. How did you come to determine there were errors in it? Data is normally just data. The error occurs in the way it’s handled. Care to explain?
Andrew
###############
Philosophically there is no raw data. Practically, what we have is what you could call
“first report”. So, I’m using “raw” in the sense that most of you do.
2. How do you determine that there are errors in the data? Good question.
Here are some examples: Tmin is reported as being greater than Tmax, Tmax is reported as being less than Tmin, temperatures of +15000C being reported, temperatures of -200C
being reported. There are scads of errors like this: data items being repeated over and over again. In a recent case where I was looking at heat wave data we found one station reporting freezing temperatures. When people die in July in the Midwest and a station’s “raw data” says that it is sub zero, I have a choice: believe the doctor who said they died of heat stroke or believe the raw data of a temperature station. Hmm. Tougher examples are subtle changes like
a) station moves
b) instrument changes
c) time of observation change
d) and, toughest of all, gradual changes over time to the environment
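A minimal sketch of the kind of gross-error screening described above; the column names and thresholds are assumptions, not NCDC's actual QC rules, and the subtler cases in the list need pairwise or metadata-based methods:

```python
import pandas as pd

def flag_gross_errors(df):
    """Flag obvious raw-data errors of the kind described above.
    Assumed columns: station, year, month, tmax, tmin (degrees C); the
    thresholds are illustrative only."""
    bad = pd.Series(False, index=df.index)
    bad |= df["tmax"] < df["tmin"]                     # max reported below min
    bad |= (df["tmax"] > 60.0) | (df["tmin"] < -90.0)  # physically implausible values
    # The same value repeated for a year straight at one station is also suspect.
    repeated = (df.sort_values(["station", "year", "month"])
                  .groupby("station")["tmax"]
                  .transform(lambda s: s.rolling(12).std() == 0))
    bad |= repeated
    return bad

# Usage: suspects = df[flag_gross_errors(df)]
```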

pouncer

Hi Steve,
Does this analysis address the point of “fitness for purpose”? The purpose of all such historic reviews, as I understand it, is to approximate the changes in black-body model atmospheric temperatures for use in a (changing) radiation budget. The “simple physics” is simple. Measuring the data is more complicated.
Chiefio claims differences over time are of comparable size (a) between versions of the data set, (b) as “splice” and other artifacts of measuring methods, (c) deliberate adjustments intended to compensate for the data artifacts, and (d) actual physical measures.
If the real difference over a century is under two degrees and the variations from versions, data artifacts, and adjustments distort measurement of that difference, how can that difference be claimed to decimal-point accuracy? (Precision, I grant, from the large number of measurements. But Chiefio’s point that the various sources of noise are NOT random and therefore can NOT be assumed to cancel is, as far as I can tell, not explicitly addressed.) If the intended purpose does require that level of accuracy and if the measurement does not provide it, can the data set be said to be useful for that purpose? (Usefulness for many other purposes, including those for which it was originally gathered, doesn’t seem to me to be germane.)
I see your analysis as a claim that the differences make little difference. I agree. But we are talking about very little differences in the whole picture.

phi

Steven Mosher,
“Interesting that you think Tobs only applies to the US. It doesn’t.”
If you say this, it is because you have a case in mind. Do you have a reference?
“With regard to station moves, I prefer the BEST methodology.”
It has the disadvantage of not allowing one to assess the magnitude of the adjustments.
“although in practice we know that explicit adjustments give the same result.”
Yes, explicitly or implicitly all global temperatures curves are homogenized.

steven mosher says:
…….
Hi Steven
Thanks for the comment on the other thread. I noticed the Santa Fe BEST (page 10) shows a similar spectral response, but I am not certain if using 5yr smoothing is a good idea.

Richard Fowler,
Here, perhaps this can help somewhat. This is EM’s description of what he does.
“2) Missing data handling. For classical First Differences, if there are missing data, you just re-zero and reset. As there are a lot of missing data in some places, that gives some very strange results. I just assume that if you have January for some years in a row, and leave out a couple, then get some more, that somewhere in between you passed through the space between the two. So if you had a series that was +1/10, +1/10, -1/10, missing, missing, +2/10; I just hold onto the last anomaly and wait while skipping missing data. When I hit the +2, I just account for all the change in that year. So you would have 1/10, 2/10, 1/10, 3/10 as the series. This, IMHO, more accurately reflects the reality of the temperatures recorded. That last year WAS 2/10 higher than the prior value, so throwing it away and using a zero is just wrong. In this way this code is much more robust to data dropouts and also more accurate.”
So, my concern is this.
1. We know from Jeff Id’s fine work (Jeff is the skeptic who tore Steig’s work to shreds) that First Differences is a flawed method.
2. EM departs from this method and “invents” his own approach.
That approach is untested (have a look at the synthetic tests that Jeff Id did on first differences).
If you ask me why EM gets odd results, quite logically I can only point to two possibilities:
data or method. Assuming he used GHCN v1 raw and GHCN v3 raw, that logically leaves method as the reason. I look at his method and I see that he uses a method that has been discredited by leading skeptics and that he has made untested changes to the method (while trying to say it’s “peer reviewed”). I kinda shrug and suggest that maybe there is an issue there. For me, I use the better methods as suggested by Jeff and Roman. I used to think First Differences was the best. I was wrong. One thing I have always appreciated here and at Climate Audit is people’s insistence that we use the best methods.

Sure thing vuk.
You can expect some updates to that Santa Fe chart in the coming months. I suspect folks who do spectral analysis will be very interested.

Steven Mosher,
“Interesting that you think Tobs only applies to the US. It doesn’t.”
If you say this, it is because you have a case in mind. Do you have a reference?
#############
Yes. Do more reading and post your results.
“With regard to station moves, I prefer the BEST methodology.”
It has the disadvantage of not allowing one to assess the magnitude of the adjustments.
###############
Of course you can assess the magnitude. It’s a switch in the code you can turn off or turn on.
“although in practice we know that explicit adjustments give the same result.”
Yes, explicitly or implicitly all global temperatures curves are homogenized.
####
The wonderful thing about having new data is that you can actually test a method
by withholding data. You can test the ability with and without.

Pamela Gray

Question: Are the stations in the two data versions still the exact same stations, or has there been station dropout or upgrades (and in some cases deteriorating stations) as time went by? Keeping due diligence over the study plots (tracking how they have changed with careful observations), and having a gold-standard control group kept in pristine condition to compare them with, is vital before homogenization methods can be developed. I don’t think that has been done to any extent. Therefore the raw and homogenized results are probably worthless. The homogenization methods are a shot in the dark.

(Moderator — feel free to snip, but I think this is relevant)
I tried to compare BEST to Environment Canada data for one station near where I live.
I realized the data in BEST was crappy.
http://sunshinehours.wordpress.com/2012/03/13/auditing-the-latest-best-and-ec-data-for-malahat/
“To start with I am looking at one station. In BEST it is StationID 7973 – “MALAHAT, BC”. In EC it is station MALAHAT which is Station_No 1014820.
I am comparing the BEST SV (Single Valued) data to the BEST QC (Quality Controlled) data.
The first minor problem is that the EC data has records from the 1920s and 1930s that BEST does not have (that I have found). That’s no big deal. The next problem is that out of 166 Month/Year records, not one of them matched exactly. BEST SV and QC data is to 3 decimal points while EC is to 1.
For example. Jan 1992 has QC = 5.677, as does SV, while EC = 5.8. Close. But not an exact match.
However, the real problem is that there are 5 records that have been discarded between SV and QC. Two out of the five make no sense at all, and one is iffy.
Where it says “No Row” it means BEST has discarded the record completely between SV and QC.
1991 is iffy. EC has it as 4.5, SV has 3.841. Close, but not that close.
1993 makes no sense at all.
2002 is fine. That’s a huge error. But where the heck did BEST get the -13.79 number in the first place?
2003 is fine. But again, where the heck did BEST get the -4.45 number in the first place?
Finally, 2005 makes no sense at all. There is little difference between -1.1 and -1.148. Certainly most records are that different.
And those are just the discarded records!
There are another 48 records with a difference of .1C or greater, and here are the greater than .2C ones.”

DocMartyn

steven mosher, I have, as always, a simple request.
In your figure 3 you have all the stations that have been mathematically warmed and cooled during the revision, and also have their location.
If it is not too much trouble could you color-code them and plonk them on a whole Earth map and let us see if all the warm ones are clustered around the areas where we are observing ‘unprecedented’ warming?
Call me a cynical old-fool, but a bit of warming here and a bit of cooling there and pretty soon you can be talking Catastrophic Anthropogenic Global Warming.

pouncer.
“fit for purpose”
From my standpoint the temperature record has nothing to do with knowing that radiative physics is true. We know that from engineering tests. If you add more GHGs to the atmosphere you change the effective radiating level of the atmosphere, and that over time will result in the surface cooling less rapidly.
So it depends upon what purposes you are talking about. People are also confused about what the temperature record really is, what it can really tell us, and how it is used. Chief amongst those confusions is the idea that we claim to know it to within tenths or hundredths.
Let me see if I can make this simple. Suppose I do a reconstruction and I say that the
“average” global anomaly in 1929 is .245 C. What does that mean?
It means this: It means that if you find hidden data record from 1929 to the present and calculate its anomaly, the best estimate of its 1929 anomaly will be .245C. That is, this estimate will minimize the error in your estimate. We collect all the data that we do have
and we create a field. That field actually is an estimate that minimizes the error, such that if you show up with new data my estimate will be closer to the new data than any other estimate.
The global average isn’t really a measure of a physical thing. You can compare it to itself and see how it’s changing; it’s only a diagnostic, an index. Of course in PR land it gets twisted into something else.
So fit for purpose? You can use it for many purposes. You could use it to tell a GCM modeler that he got something wrong. You could use it to calibrate a reconstruction and have a rough guess at past temperatures. You could use it to do crude estimates of sensitivity. It’s fit for many purposes. I wouldn’t use it to plan a picnic.
Let me give you a concrete example. I’ve come across some new data from the 19th century:
hundreds of stations never seen before, millions of new data points preserved as photos of written records on microfiche. What method would you use to predict the temperatures I hold in my hand? Well, the best method you have is to form some kind of average of all the data that you hold and use that to estimate the data that I hold in my hands.
The best averaging method is not First Differences. It’s not the common anomaly method. It’s not the reference station method. The best method is going to be something like Jeff Id’s method, or Nick Stokes’ method, or the Berkeley Earth method. When they estimate .245678987654C
that estimate will be closer to the data that I hold than any other method. The precision of that guess has nothing to do with the quality of the data, it has to do with minimizing the error of prediction given the data.

Hi Doc,
One reason why Zeke provided the data and code in the public Dropbox is to allow people with questions to answer the questions for themselves. By releasing the data and the code we effectively give people the power to prove their points.
When I fought to get Hansen to release the code, and when I sent an FOIA request to Jones, it wasn’t to get them to answer my questions. It was to get their tools so I could ask the questions I wanted to ask.

Rob Dawg

“We can perform a number of tests to see if GHCN v1 and v3 differ. The simplest is to compare the observations in both data files for the same stations. This is somewhat complicated by the fact that station identity numbers changed between v1 and v3, and we have been unable to locate a translation table between the two.”
——
Simply a stunning admission. What judge in any court would, upon hearing the prosecutors admit to this, not throw the case out of court?

Mosher: “Well, the best method you have is to form some kind of average of all the data that you hold and use that to estimate the data that I hold in my hands.”
I wouldn’t use all data, I would use data for a reasonably sized region and compare the data and see what the differences are.
For example, if all of your old data was 1C or 2C warmer than NOAA’s adjusted old data, then the odds are your data is right.

Michael R

I certainly do not have the kind of expertise to take sides in the argument. One thing that does give me pause, however, is that I have seen the type of analysis done in Figure 3 before, also used as evidence of no bias in adjustments; as was pointed out then, that kind of graph isn’t supportive of anything.
Should most or all of the adjustments on the right side of the bell curve happen to fall in more recent times, and the opposite be true for earlier times, then you artificially create a warming trend, and indeed could make a huge warming trend while still showing equal cooling and warming adjustments.
I cannot comment on the rest of it, but I would caution against the argument used in the above paragraph, as the last time I saw it used was in an attempt to mislead people, which automatically makes me distrustful of the argument that follows; that may very well be unfortunate rather than unwarranted.

sunshinehours, the Env Canada data has quality control flags that you should apply before doing any comparison. If you download my package CHCN (be sure to get the latest version) that might help you some.
Let’s start with Environment Canada for Jan 1992. You really didn’t tell people everything, now did you? Here is the actual data from Environment Canada:
Time Tmax Tmean Tmin
1992-01 8.60 E 5.80 E 2.60 E
As you note, Environment Canada has 5.8 and BEST has 5.67.
BEST does not use Environment Canada as a source.
If you use my package however you can look at environment canada data.
When you do, here is what you find: you failed to report the quality flags. See that letter E that follows the Tmax, Tmin and Tmean? That flag means the value in their database is ESTIMATED.
Now do the following calculation: (8.6+2.6)/2, or (Tmax + Tmin)/2, and see what you come up with. Is it 5.8? Nope. Looks like 5.6 to me. Without looking at BEST data (I’m tired of correcting your mistakes) I would bet that the source for the BEST data is daily data. Environment Canada also has daily data for that site, running from
1920-01-01/2012-05-16
Looking at the Environment Canada Excel file for that site and examining the quality flags, you should note that a very, very large percentage of the figures you rely on have a quality flag of
“E” or “I”
E stands for estimated
I stands for incomplete
Folks can verify that by looking at the csv file for the station.
Station Name MALAHAT
Province BRITISH COLUMBIA
Latitude 48.57
Longitude -123.53
Elevation 365.80
Climate Identifier 1014820
WMO Identifier 71774
TC Identifier WKH
Legend
[Empty] No Data Available
M Missing
E Estimated
B More Than One Occurrence and Estimated
I The value displayed is based on incomplete data
S More than One Occurrence
T Trace

1. “I’m using “raw” in the sense that most of you do.”
If it’s not philosophically “raw,” then “raw” is not a very descriptive term for what it is. That’s a big problem.
2. “There are scads of errors like this”
Sounds like the data may not be very meaningful if there continues to be “scads” of errors in it.
Andrew

Pamela Gray

Steven, a significant portion of your time as a scientist should be spent explaining what you have said. To refuse to do so by telling others to figure it out for themselves seems a bit juvenile and overly dressed in “ex-spurt” clothes. If questions come your way, kindly doing your best to answer them seems the better approach, especially with such a mixed audience of arm-chair enthusiasts such as myself and truly learned contributors. I for one appreciate your post. Don’t spoil it.

Ripper

The first minor problem is that the EC data has records from the 1920s and 1930s that BEST does not have (that I have found).
=======================================================
Same thing here in Western Australia.
Instead of looking at the records used, it is time to look at the records that are not used.
It appears to me that the records with warm years in Western Australia in the 1920-40s are just not used.
If someone can point me to where GHCN or CRU uses the long records from e.g. Geraldton town that starts in 1880
http://members.westnet.com.au/rippersc/gerojones1999line.jpg
or Kalgoorlie post office that starts in 1896
http://members.westnet.com.au/rippersc/kaljones1999line.jpg
I would be most grateful.
Or Carnarvon post office that starts in 1885… etc.
It appears to me that in the SH Phil Jones used very little data that was available from 1900 to 1940-50 odd, but selected different stations in each grid cell for 1895-1899.
e.g. instead of using those years from the above-mentioned stations, he filled the grid cell with 1897-1899 (despite that record going to 1980) from Hamelin Pool.

Pamela Gray

Station dropout has always intrigued me. Here’s why: stations may have dropped out in a non-random fashion. Therefore the raw data collected from stations over time may have non-random bias in them before they were homogenized, which may have caused the raw data to be even more biased.
How would one determine this? One way would be comparing ENSO-driven temperature trend patterns from analogue years and dated station dropout plots. I would start with the US since ENSO patterns are fairly well studied geographically. Overlaying a map of dated station dropout on temperature and precipitation pattern maps under El Nino, La Nina, and neutral ENSO conditions through the decades may be quite revealing and possibly demonstrate explained variance being related to non-random station dropout interacting with ENSO oscillations and patterns. If so, homogenization may have only served to accentuate the bias in the raw data itself.

climatebeagle

Rob Dawg says: June 23, 2012 at 8:05 am
“Simply a stunning admission.”
Exactly, that jumped out at me as well, how on earth did anyone think that was a good idea.
Thanks for posting it. It would be good if both sides could continue the dialog to resolution, whatever the outcome may be (e.g. both have valid points, or acceptance that one (or both) analysis is wrong/weak/strong/right/…).

Pamela Gray

By the way, local low-temperature records in Oregon are falling right and left, as are precipitation records. Why? Not because of global cooling but because of the ENSO oceanic and atmospheric conditions we currently have in and over the great pond to the West of us that uniquely affect our weather patterns. These oceanic and atmospheric conditions have been in place, more or less, for several years now, resulting in several predictable changes in flora and fauna response and broken records related to lows over the same number of years in Oregon.
What is interesting and connected to my earlier comment, is that these same conditions result in year in and year out heat waves and drought elsewhere. Which leads me to restate: Station dropout patterns over time need further study in relation to ENSO conditions and the fairly well established geographic temperature trends tied to ENSO conditions. I think the raw data may be biased.

dp

Start over and show where Smith’s error is. All you’ve shown is you have arrived at a different result with different methods. No surprise. I could do a third and be different again. It would not show that Smith or you are right (or wrong).

Andrew Greenfield

There is NO significant warming. When will they learn?
http://www.drroyspencer.com/wp-content/uploads/UAH_LT_1979_thru_May_2012.png
There is no hotspot in the TLT

I haven’t had time to look through the comments, but to address a couple of points quickly:
The version of v3 which I used is:
ghcn.tavg.v3.1.0.20120511.qcu.dat
The version used ought not to have much effect. If it does, then there are far more variations in the data than IMHO could be reasonably justified.
The “attack” on station matching is just silly. I went out of my way, especially in comments, to make clear that the “shift” was not a result of individual data items, but a consequence of the assemblage of records. It isn’t a particular record that has changes; it is WHICH records are collected together. As I don’t do “station matching”, there isn’t any consequence from station IDs changing. The “country code” portion changes dramatically, but the WMO part less so. I do make assemblages of records by “country code”, assuring that the correct mapping of countries is done, as a different approach to getting smaller geographical units to compare. That map is at:
http://chiefio.wordpress.com/2012/04/28/ghcn-country-codes-v1-v2-v3/
In large part, this article asserts “error” in doing something that I specifically asserted I did not do: match stations or compare them data item by data item. If asked, I would freely respond that most individual records from individual stations will have highly similar, if not identical, data for any given year / month. (There are some variations, but they are small.) It ignores what I stated was the main point: that the collection of thermometer records used changes the results in a significant way.
In essence, “point 1” is irrelevant as it asserts a potential “error” in something I specifically do not do. Station match.
The chart showing an equal distribution to each side of zero for non-identical records ignores the time distribution. (I’m not making an assertion about what the result from a station by station comparison on the time domain, just pointing out that it is missing). For the assemblage of data, the time distribution of change matters. The deep past shows more warm shifting while the middle time shows more cold, then swapping back to warmth more recently. The data in aggregate have a cooler time near the periods used as the “baseline” in codes like GIStemp and by Hadley, but is warmer recently (and in the deep past). It would be beneficial to make that “by item” comparison 3 D with a time axis. (No, I’m not asserting what the result would be. As it is a “by item” comparison, it says nothing about the impact of the data set in total).
Point 2 has some validity in that I do handle data drop out differently from the classical way. I simply hold onto the last value and wait for a new data item to make the comparison. So, for example, if Jan 1990 has data, but Jan 1989 does not, while Jan 1988 does again, the classical method would give 0 0 0 as the anomaly series. (Since it would ‘reset’ on the dropout and each ‘new set’ starts with a zero). That is, IMHO, fairly useless especially in data with a large number of data dropouts (as are common in many of the stations). Lets say for this hypothetical station the temps are 10.1 missing 10.0 (and on into 9.8, 10.4 etc.). I would replace the 10.1 with a zero (as FD has ‘zero difference’ in the first value found) replace the second value with zero (as it is missing, there is ‘no difference’ found) and then the third value be compared and a 1/10 C difference found. It says, in essence, these two Januaries have changed by -0.1 C so regardless of what happened in the middle year, the slope is 0.1 C over 2 years. A value of -0.1 C is entered for the 3rd value. So which is more accurate? To say that NO change happened over 3 years that start with 10.0 and end with 10.1, or say that a change of 0.1 happened? I think that on the face of it the ‘bridging’ of dropouts is clearly more accurate; though it would benefit from more formal testing.
For individual stations, this takes what would otherwise become a series of “resets” (and thus, artificial “splice artifacts” when they are re-integrated into a trend) and instead has one set of anomalies that starts with the most recent known value, ends with the last known value, and any dropout is bridged with an interpolated slope. It will clip out bogus excursions caused by resets on missing data (as the new end point data gets effectively ignored in classical FD) and will preserve what is a known existing change between known data. Using the data above, classical FD would find 0, 0, 0, -0.2 for an overall change of 0.2 cooler moving left to right over 4 years. By bridging the gap instead of taking a reset, my method gets 0, 0, -0.1, -0.2 and finds an overall change of -0.3 going from 10.1 to 9.8 (which sure looks a whole lot more correct to me…)
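A minimal sketch of the gap-bridging rule as described in the paragraph above, in Python rather than the Fortran actually used (illustrative only):

```python
import math

def bridged_first_differences(series):
    """Year-to-year differences with the gap-bridging rule described above:
    hold the last known value across missing years and book the whole change
    in the first year that reports data again."""
    diffs, last = [], None
    for value in series:
        if value is None or (isinstance(value, float) and math.isnan(value)):
            diffs.append(0.0)                      # no change booked during the gap
            continue
        diffs.append(0.0 if last is None else value - last)
        last = value
    return diffs

# The example from this comment: 10.1, missing, 10.0, 9.8
print([round(d, 2) for d in bridged_first_differences([10.1, float("nan"), 10.0, 9.8])])
# -> [0.0, 0.0, -0.1, -0.2]; the overall change sums to -0.3
```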
But if you want to hang your hat on that as “causal”, go right ahead. It won’t be very productive, but a thorough examination of the effects on larger assemblies of data would be beneficial. So while I’d assert that point 2 is, well, pointless; being a new minor variation on the theme it could use more proving up.
Per point 3, I’ve given you the version above. Point now null. (And frankly, if you wish to assert that particular versions of GHCN v3 vary that much, we’ve got bigger problems than I identified with version changes.)
In the end, this ‘critique’ has one point with some validity (but it isn’t enumerated). That is the “grid / box” comparison of the data. However, what it ignores is the use of the Reference Station Method in codes such as GIStemp (and I believe also in NCDC et al. as they reference each other’s methods and techniques – though a pointer to their published codes would allow me to check that. If their software is published…) So, you compare the data ONLY in each grid cell and say “They Match pretty closely”, and what I am saying is “The Climate Codes may homogenize and spread a data item via The Reference Station Method over 1200 km in any one step, and may do it serially so up to 3600 km of influence can be had. This matters.”
Basically, my code finds the potential for the assemblage to have spurious influence via the RSM over long distances and says “Maybe this matters.” while your ‘test’ tosses out that issue up front and says ‘if we use a constraint not in the climate codes the data are close’. No, I don’t know exactly how much the data smearing via the RSM actually skews the end product (nor does anyone else, near as I can tell) but the perpetually rosy red Arctic in GIStemp output where there is only ice and no thermometers does not lend comfort…
So as a first cut look at the critique, what I see is a set of tests looking at the things I particularly did not do (station matching) and then more tests looking at things with very tight and artificial constraints (not using the RSM as the climate codes do) that doesn’t find the specific issue I think is most important as it excludes the potential up front. Then a couple of spurious points ( such as version of v3) that are kind of obviously of low relevance.
My assertion all along has been that it is the interaction between records and within records that causes the changes in the data to “leak through” the climate codes. That RSM and homogenizing can take new or changed records and spread their influence over thousands of kilometers. That data dropouts and the infilling of those dropouts from other records can change the results. That it is the effective “splicing” of those records via the codes (in things like the grid / box averages) that creates an incorrect warming signal. By removing that homogenizing, the RSM, and the infilling, you find that they have no impact. What a surprise… Then, effectively, I’m criticized because my code doesn’t do that.
Well this behaviour is by design. I want to see what the impact of the assemblage is when ‘smeared together’ since that is what the climate codes do. Yes, I use a different method of doing the blending (selection on ranges of country code and / or WMO number) than ‘grid box’ of geographic dimensions area weighted. This, too, is by design. It lets me flexibly look at specific types of effect. (So, for example, I compared inside Australia with New Zealand. I can also compare parts of New Zealand to each other, or the assembly of Australia AND New Zealand to the rest of the Pacific – or most any other mix desired). Since one of the goals is to explore how the spreading of influence of record changes in an assemblage (via things like RSM and homogenizing) can cause changes in the outcome, this is a very beneficial thing.
This critique just ignores that whole problem and says that if we ignore the data influence of items outside their small box there is no influence outside their small box.
Oddly, the one place where I think there is a very valid criticism of the FD method is completely ignored. Perhaps because it is generic to all FD and not directed at me? Who knows… FD is very sensitive to the very first data item found. IF, for example, our series above started with 11.5, then 10.1, missing, 10.0, 9.8 etc. We get an additional -1.4 of offset to the whole history. That offset stays in place until the next year arrives. If it comes in at 10.1, then that whole offset goes away. The very first data item has very large importance. I depend on that being ‘averaged out’ over the total of thermometers, but it is always possible the data set ends in an unusual year. (And, btw, both v3 and v1 ought to show that same unusual year…) But any ‘new data’ that showed up in 1990 for v3 that was missing in v1 when it was closed out will have an excessive impact on the comparison.
At some point I’m planning to create a way to “fix that” but haven’t settled on a method. The ‘baseline’ method is used in this critique (and in the climate codes) and it has attractions. However, when looking for potential problems or errors in a computer code, it is beneficial to make the comparison to a ‘different method’ as that highlights potential problems. That makes me a bit reluctant to use a benchmark range method. Doing a new and self created method is open to attacks (such as done here on the bridging of gaps technique) even if unwarranted. One way I’ve considered is to just take the mean of the last 10% of data items and use it as the ‘starting value’. This essentially makes each series benchmarked via the first 10% of valid data mean. But introducing that kind of complexity in a novel form may bring more heat than light to the discussion.
In the end, I find this critique rather mindlessly like so many things in ‘climate science’. Of the form “If we narrow the comparisons enough we find things in tiny boxes are the same; so the big box must be the same too, even if we do different things in it.” Makes for nice hot comments threads, but doesn’t inform much.
I’m now going to have morning tea and come back to read the comments. I would suggest, however, that words like “error” and “flawed” when doing comparisons of different things in different ways are mostly just pejorative insults, not analysis. I don’t assert that the ‘grid / box’ method is in ‘error’ or ‘flawed’ because it will miss the impact of smearing data from an assembly of records over long distances. It is just limited in what it can do. Similarly it is not an ‘error’ or ‘flawed’ to use a variable time window in FD. (That is, in effect, what I do. Standard FD just accepts the time window in the data frequency, such as monthly, I just say ‘the window may be variable length, skip over dropouts’.) It is just going to find slightly different things. And, IMHO, more accurate things. It is more about knowing what you are trying to find, and which tool is likely to find it. As I’m looking for “assemblage of data impacts”, using a variable “box size” to do various aggregations (via CC/WMO patterns) lets me do that; while using a fixed geographic grid box without RSM and homogenizing will never find it. That isn’t an error or flaw in non-homogenized non-RMS small grid boxes, it is just a limitation of the tool. That the tool is used wrongly (if your goal is to find those effects) would be the error.
So, in short, the analysis here says I didn’t find what I wasn’t looking for, and they don’t find what I did look for because they don’t look for it.

Steve and Zeke, you’ve gone above-and-beyond here, and it’s much appreciated. I realize that you don’t owe us anything else, and I think it’s a good idea to push people to delve into the data for themselves.
At the same time, I have quite a few data sets on my laptop where I could bang out a new graph in a minute or two that would take you quite some time to figure out my R code, what data files I’ve given you and what you have to obtain from elsewhere, etc. In like manner, I’m wading through your Stata (a truly ugly language, like SAS, SPSS, gretl, and that whole command-oriented genre of statistical programs), trying to figure out what you’ve done, what files you’ve supplied, what files you have not, etc. I then suspect I’ll have to make some enormous data downloads, and finally, once I’ve reproduced your results in R I’ll be able to see how your Figure 3 histogram changes over time and location.
I think it’s a very intelligent observation that an apparent canceling of adjustments over the entire period might actually obscure a trend of adjustments over time, or over latitude, etc. You don’t have to pursue that line of reasoning to all of its boundaries, but I would guess that you could fairly quickly do a lattice-style histogram by, say, decades to settle the issue to a first approximation.
Thanks again!

phi

Steven Mosher,
Ok, that’s right. Tobs applies only to the US: you have nothing.
“Of course you can assess the magnitude. It’s a switch in the code you can turn off or turn on.”
No, you can’t disable this with BEST. What you say does not make sense.