Errors in GHCN metadata inventories show stations off by as much as 300 kilometers
Guest post by Steven Mosher
In the debate over the accuracy of the global temperature record, nothing is more evident than the errors in the location data for stations in the GHCN inventory. That inventory is the primary source for all the temperature series.
One question is “do these mistakes make a difference?” If one believes, as I do, that the record is largely correct, then it’s obvious that these mistakes cannot make a huge difference. If one believes, as some do, that the record is flawed, then it’s obvious that these mistakes could be part of the problem. Up until now, that is where the two sides of the debate stand.
Believers are convinced that the small mistakes cannot make a difference, and disbelievers hold that these mistakes could in fact contribute to the bias in the record. Before I get to the question of whether or not these mistakes make a difference, I need to establish the mistakes, show how some of them originate, correct them where I can, and then do some simple evaluations of their impact. This is not a simple process. Throughout this process I think we can say two things that are unassailable:
1. The mistakes are real.
2. We simply don’t know if they make a difference. Some believe they cannot (but they haven’t demonstrated that) and some believe they will (but they haven’t demonstrated that). The demonstration of either position requires real work. Up to now no one has done this work.
This matters primarily because, to settle the matter of UHI, stations must be categorized as urban or rural. That entails collecting some information about the character of the station, say its population or the characteristics of the land surface. So, location matters. Consider Nightlights, which Hansen2010 uses to categorize stations into urban and rural. That determination is made by looking up the value of a pixel in an image. If it is bright, the site is urban. If it’s dark (for example, because the station is mis-located in the ocean), the site is rural.
In the GHCN metadata the station may be reported at location xyz.xyN yzx.yxE. In reality it can be many miles from this location. That means the nightlights lookup, or ANY georeferenced data (impervious surfaces, gridded population, land cover), may be wrong. One of my readers alerted me to a project to correct the data. That project can be found here. That resource led to other resources, including a two-year-long project to correct the data for all weather stations. It’s a huge repository. That led to the WMO documents, one of the putative sources for GHCN. This source also has errors. Luckily, back in 2009 the WMO asked all member nations to report more accurate data. That process has yet to be completed, and when it is done we should have data that is reported down to the arc second. Until then we are stuck trying to reconcile various sources.
The first problem to solve is the loss-of-precision problem. The WMO has reports that are down to the arc minute. It’s clear that when GHCN uses this data and transforms it into decimal degrees, they round and truncate. These truncations, on occasion, will move a station. I’ve documented that by examining the original WMO documents and the GHCN documents. In other cases it is hard to see the exact error in GHCN, but the coordinates clearly don’t track with WMO. Here are the WMO coordinates for WMO 60355, followed by the GHCN coordinates:
WMO: 60355 SKIKDA 36 53N 06 54E [36.8833333, 6.9000]
GHCN: 10160355000 SKIKDA 36.93 6.95
GHCN places the station in the ocean. WMO places it on land as seen above.
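To make the conversion concrete, here is a minimal Python sketch of turning WMO degrees and minutes into decimal degrees and comparing the result with the GHCN entry for SKIKDA. The function name `dm_to_decimal` is made up for illustration; it is not taken from any GHCN or WMO tooling.

```python
def dm_to_decimal(degrees, minutes, hemisphere):
    """Convert degrees and minutes plus a hemisphere letter to signed decimal degrees."""
    value = degrees + minutes / 60.0
    # South and West are negative by the usual convention
    return -value if hemisphere in ("S", "W") else value

# WMO entry for 60355 SKIKDA: 36 53N 06 54E
wmo_lat = dm_to_decimal(36, 53, "N")
wmo_lon = dm_to_decimal(6, 54, "E")

# GHCN entry for the same station, as quoted above
ghcn_lat, ghcn_lon = 36.93, 6.95

# What simple rounding of the WMO position to two decimals would give
rounded_lat, rounded_lon = round(wmo_lat, 2), round(wmo_lon, 2)

print("WMO decimal:   ", wmo_lat, wmo_lon)
print("WMO rounded:   ", rounded_lat, rounded_lon)
print("GHCN reported: ", ghcn_lat, ghcn_lon)
print("GHCN minus WMO:", ghcn_lat - wmo_lat, ghcn_lon - wmo_lon)
```

Rounding the WMO latitude to two decimals gives 36.88, not the 36.93 that GHCN reports, so this is one of the cases where the GHCN value does not follow from simple rounding of the WMO value.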
To start correcting these locations I began working through the various sources. In this post I will start the work by correcting the GHCN inventory using WMO information as the basis. I am aware, of course, that WMO may have its own issues. The task is complicated by the lack of any GHCN documents showing how they used WMO documents. As a first step, I compared the GHCN inventory with the WMO inventory and looked at those records where GHCN and WMO have the same station number and station name. That is difficult in itself because of the way GHCN truncates names to fit a data field. It’s also complicated by the issue of respelling, multiple names for each site, and the issue of GHCN Imod flags and WMO station index sub numbers.
Here is what we find. If we start with the 7200 stations in the GHCN inventory and use the WMO identifier to look up the same stations in the WMO official inventory we get roughly 2500 matches. Here are the matching rules I used.
1. The WMO number must be the same.
2. The GHCN name must match the WMO name (or alternate names match).
3. The GHCN ID must not have any Imod variants (no multiple stations per WMO number).
4. The WMO station must not have any sub index variants (107 WMO numbers have subindexes).
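The four rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code actually used for the post: the record layouts (`id`, `name`, `names`, `subindex`) are hypothetical, and the name comparison here is a simple prefix test, whereas the real matching evidently also handled respellings and alternate names. The only structural fact taken from the post is the GHCN ID layout: three digits of country code, five digits of WMO number, three digits of Imod.

```python
from collections import Counter

def normalize(name):
    """Uppercase and drop punctuation and spaces so truncated names compare loosely."""
    return "".join(ch for ch in name.upper() if ch.isalnum())

def names_match(ghcn_name, wmo_names):
    """Rule 2: the GHCN name equals, or is a truncated prefix of, any WMO name."""
    g = normalize(ghcn_name)
    return bool(g) and any(normalize(w).startswith(g) for w in wmo_names)

def match_stations(ghcn, wmo):
    """ghcn: list of dicts with 'id' (11-character GHCN ID) and 'name'.
    wmo: dict mapping a 5-digit WMO number to a list of dicts with
    'names' (list of str) and 'subindex' (int). Both layouts are hypothetical."""
    # Count how many GHCN records share each WMO number
    wmo_counts = Counter(rec["id"][3:8] for rec in ghcn)
    matches = []
    for rec in ghcn:
        wmo_no, imod = rec["id"][3:8], rec["id"][8:11]
        # Rule 3: skip any station that has Imod variants
        if imod != "000" or wmo_counts[wmo_no] > 1:
            continue
        # Rule 1: the WMO number must exist in the WMO list
        entries = wmo.get(wmo_no, [])
        # Rule 4: skip WMO numbers that carry sub index variants
        if len(entries) != 1 or entries[0]["subindex"] != 0:
            continue
        if names_match(rec["name"], entries[0]["names"]):
            matches.append((rec["id"], wmo_no))
    return matches

ghcn_sample = [
    {"id": "63401001000", "name": "JAN MAYEN"},
    {"id": "63401008001", "name": "SVALBARD LUFT"},  # made-up Imod, to show rule 3
]
wmo_sample = {
    "01001": [{"names": ["JAN MAYEN"], "subindex": 0}],
    "01008": [{"names": ["SVALBARD AP"], "subindex": 0}],
}
matched = match_stations(ghcn_sample, wmo_sample)
print(matched)
```

Being strict at this stage is a deliberate choice: the rules keep only stations that are unambiguous in both inventories, and the ambiguous ones are deferred to a later pass.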
That’s a bit hard to explain, but in short I try to match the stations that are unique in GHCN with those that are unique in the WMO records. Here is what a sample record looks like. WMO positions are translated from degrees and minutes to decimal degrees and the full precision is retained. You can check that against GHCN rounding. As we saw in previous posts, slight movements in stations can move them from bright to dark and from dark to bright pixels.
GHCN ID      GHCN name      GHCN lat  GHCN lon  WMO no.  WMO name        WMO lat   WMO lon
63401001000  JAN MAYEN      70.93     -8.67     1001     JAN MAYEN       70.93333  -8.666667
63401008000  SVALBARD LUFT  78.25     15.47     1008     SVALBARD AP     78.25000  15.466667
63401025000  TROMO/SKATTO   69.50     19.00     1025     TROMSO/LANGNES  69.68333  18.916667
63401028000  BJORNOYA       74.52     19.02     1028     BJORNOYA        74.51667  19.016667
63401049000  ALTA LUFTHAVN  69.98     23.37     1049     ALTA LUFTHAVN   69.98333  23.366667
You also see some of the name matching difficulties where the two records have the same WMO and slightly different names. If we collate all differences on lat and lon in matching stations we get the following:
And when we check the worst record we find the following
WMO: 60581 HASSI-MESSAOUD 31.66667 6.15
GHCN: 10160581000 HASSI-MESSOUD 31.7 2.9
GHCN has the station at longitude [smm] 2.9. According to GHCN the station is an airport:
The location in the WMO file
And the difference is roughly 300km. WMO is more correct than GHCN; GHCN is off by about 300km.
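The distance can be checked with a great-circle calculation. The sketch below uses the standard haversine formula with a mean Earth radius of 6371 km (an assumed constant); the coordinates are the WMO and GHCN values quoted above for Hassi-Messaoud.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometers between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# WMO position versus GHCN position for station 60581
distance = haversine_km(31.66667, 6.15, 31.7, 2.9)
print(f"{distance:.0f} km")
```

Almost all of the offset comes from the longitude difference of 3.25 degrees, which at this latitude works out to a little over 300 km.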
An old picture of the approach (weather station is to the left)

Now, why does this matter? GISS uses GHCN inventories to get Nightlights values. Nightlights uses the location information to determine if the pixel is dark (rural) or bright (urban).
NASA thinks this site is dark. They think it is pitch dark. Of course, they are looking 300km away from the real site. From the inventory used in H2010:
10160581000 HASSI-MESSOUD 31.70 2.90 398 630R HOT DESERT A 0





Well if mistakes don’t matter; and don’t affect the results; then we could stop taking the data all together; since the data doesn’t matter; and we could simply make up the results; and save a ton of money.
And in any case; why are we wasting so much money to gather data; when the data doesn’t really affect anything anyhow.
“Now, why does this matter. Giss uses GHCN inventories to get Nightlights. Nightlights uses the location information to determine if the pixel is dark (rural) or bright (urban).”
Steven,
This issue will be of greater importance than just citing mistakes. GHCN uses the erroneously determined “rural” locations to adjust the “urban” locations for data gap infill within cells. This homogenizing may be additive to the original error. So if you need further additional peaceful activities, follow the data infilling.
It seems that every time temp data is examined in detail, by country or by individual station, errors are found. How can one have any confidence in a supposed rise of 1C (or less) over a hundred years with equipment that often has errors larger than this, with unrevealed adjustments, UHI, in-filling, extrapolation, recording mistakes, etc.?
Every long standing unadjusted record for the US that has been posted seems to show very small trends if any.
When you are lying or eating fish you have to be caring…
They are NOT LYING, etymologically META-DATA means (Greek) BEYOND-DATA.
Steven Mosher says:
“It is therefore logical to assume that if a station is randomly mis-located, then there is a far greater chance that it will be wrongly categorised as a rural station when it should be urban than as an urban station when it should be rural.
@@@@@@@@
You might think that. I’d rather prove it. one way or the other.
but proof is hard. conjecture is easy.”
I have looked into the Swedish GHCN stations (19). Only 2 have zero nightlights (Jokkmokk and Films kyrkby). These two are also the most displaced stations in the set (about 20 and 30 kilometers), moving Jokkmokk from a minor airport to the middle of a lake and Films kyrkby from a village to the middle of a large uninhabited forest. Films kyrkby, by the way, also has a spurious name (“Kreuzburg”) and a completely fictitious altitude (620 meters ASL instead of 50 meters).
The other stations’ position errors vary from a few hundred meters up to about 10 kilometers, with an average of 1-2 kilometers. The other metadata (airport/non airport, town population, vegetation type, altitude) are also in error for about half the stations.
While these errors are not enough to affect the large scale climate, in my opinion it is quite useless to try and correlate the GHCN data with any kind of geodata at a higher resolution than about 0.1-0.2 degrees (10-20 km).
Steven Mosher says:
October 31, 2010 at 11:49 pm
“You might think that. I’d rather prove it. one way or the other.
but proof is hard. conjecture is easy.”
_____________________________________________________________
Thanks for your response Steven, and the best of luck to you with your investigations.
I still can’t help feeling that everything is bass-ackwards, however.
It is the proponents of CAGW, not you, who are proposing that the world should make major changes to its economy and the way that energy is generated, largely on the evidence provided by this database.
Surely the emphasis should be on THEM to PROVE that it is as accurate as it is possible to make it…….
Wayne:
“I recently reread the paper Contiguous US Temperature Trends Using NCDC Raw and Adjusted Data for One-Per-State Rural and Urban Station Sets by Dr. Edward R. Long. In that paper, Dr. Long shows that when a least squares linear approximation is applied to a climate station data set, the rural stations seem to have a significantly more shallow slope than urban stations. In particular, the rural stations have a slope of .13C/century and .79C/century for urban stations.”
I read that piece and was not very impressed with the methodology, from the data selection to the rural criteria to the math. I’ll go into detail if you like, but really have other things to do.
“It seems to me that all you have to do is do a least squares linear approximation to all station data sets and look for ones that have a shallow slope. Then examine the meta data for these sites to figure out if they are rural or not. The reverse can be done for urban sites.”
Thought about that. Did something similar. But the hazard is confirmation bias.
“Given that most of North America (actually most of the planet) is rural, plotting the rural data should be sufficient to identify any global warming trend. The raw data rural slope reported in the paper above is quite shallow (.13C/century.) It would be interesting to see if a more comprehensive set of rural stations yield a similar slope. If global warming is as pervasive as many in the scientific community claim, even selecting rural stations with low slopes should give a higher aggregate slope than .13C/century.”
Actually that is something I’ll aim at. But you won’t find the .13C/century slope you expect. Aint gunna happen. First problem is using raw data. The raw data has errors and it all needs to be put on the same footing (things like time of observation).
best you can hope for is 10% adjustment to the current numbers.
Gordon Ford says:
November 1, 2010 at 8:03 am
A good first cut would be to check the station location on Google Earth or similar. This would catch rural stations in the middle of a Walmart parking lot or urban stations adjacent to a farm house.
What is apparent is that global temperature records have unresolved QA/QC problems and until they are resolved any conclusions drawn from the data should be filed under fiction.
%%%%%%%%%
I’ve posted google tours of all the locations so people can do this. 7280 stations. Not a one man job.
David Jones says:
November 1, 2010 at 6:19 am
You’re not even trying Mosher, I found one that was several thousand kilometres out.
$$$$$$$
you have to be careful: there are some where GHCN uses the historical data and WMO only publishes current data. In any case, since v3 is in beta we can hopefully get them to fix the issue. Other people (climate science types) are working this issue, so hopefully the problem will get fixed.
early results say it doesn’t make a difference. I stress early. Since R is interactive I can often just take a quick look to see if the prelim work gives me any kind of indication.
Indication is this: the rural/urban count doesn’t change much. early indication.
Murray,
the march of thermometers makes no difference. That’s been shown repeatedly.
In the coming months I suspect it will be shown again with a huge database of work.
Pamela
“So we stick to our beliefs and refuse to budge out of fear of being wrong.”
yup. but that goes for doubters as well.
I, too, have some issues with Dr. Long’s methodology; that is why I’d like somebody else to take a crack at plotting rural data with more robust station selection criteria (something like the category 1 & 2 stations identified in surfacestations.org).
I do not understand the last sentence.
Actually, I’d just like to see the rural curve. I accept that time of observation adjustments are appropriate. I’m more skeptical of the code that attempts to do data infilling. Ultimately, I’m much more interested in the slope of the curve from 1900-1930 vs the slope from 1960-2000+; the overall slope is much less interesting, since it appears to be essentially a curve fit of a 1-1/2 cycle oscillation.
tty.
Thanks, checking all this stuff by hand is hard work. Let’s look at one of yours:
64502142000 JOKKMOKK 66.63 19.65 264 313R -9HIFOLA-9x-9WOODED TUNDRA A 0
WMO=02142
Imod = 000
and there are no other stations with that WMO. Sometimes GHCN will use the SAME WMO but indicate a different location using the IMOD flag, which is why I eliminated those cases from my FIRST pass through the data.
Now let’s see what WMO says:
02151 0 JOKKMOKK FPL 66 29N 20 10E 275
See the problem? Actually WMO has no entry for 02142
Now the GHCN lat indicates a station at 66.63, or about 66 38N in the WMO degrees-and-minutes notation.
WMO has
02141 0 TJAKAAPE 66 18N 19 12E 582
02161 0 NATTAVAARA 66 45N 20 55E
Now, that is not the end of the searching.
There is yet another master list that solves this mystery:
SWE SE Sweden – Jokkmokk AFB SEaaESNJ ESNJ 2142 m ESNJ A ICA09 2142 21420 66.6333 19.65 C 264 264 Europe/Stockholm
Is that confusing? Well, it tells me that GHCN gets the data for this from ICA09 documents, not from WMO.
Simply, GHCN appears to get data from WMO and other sources. So, to audit them properly I have to figure out where they got the data from. My master list tells me exactly where the data comes from and the quality of the information. That should allow me to correct GHCN, add location precision, and account for data that is not precise. But it’s a huge mountain of checking. Automating the process is a nightmare.
tty:
“While these errors are not enough to affect the large scale climate, in my opinion it is quite useless to try and correlate the GHCN data with any kind of geodata at a higher resolution than about 0.1-0.2 degrees (10-20 km).”
There are many stations where there are no lights for 60km in any direction.
So my approach will be this.
1. Correct GHCN as best I can given the other documents I have. Especially the rounding errors.
2. Characterize the distribution of the errors (95% within 5km… FOR EXAMPLE)
3. Reclassify the stations using a bounding box approach. Not just the pixel, but surrounding pixels as well. All the code to do that is done and tested. The key is this
A. characterizing the average error in station location
B. looking at the pixel location error (1-2km)
C. adjusting my bounding box accordingly.
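Step 2 can be illustrated with the five sample rows quoted in the post. This is only a sketch: five stations say nothing about the full inventory, the distance uses a small-offset (equirectangular) approximation rather than whatever was actually used, and the helper names are made up for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant

def offset_km(lat1, lon1, lat2, lon2):
    """Approximate distance for small offsets (equirectangular approximation)."""
    mean_lat = math.radians((lat1 + lat2) / 2.0)
    dx = math.radians(lon2 - lon1) * math.cos(mean_lat)
    dy = math.radians(lat2 - lat1)
    return EARTH_RADIUS_KM * math.hypot(dx, dy)

def fraction_within(offsets_km, limit_km):
    """Share of stations whose location offset is at most limit_km."""
    return sum(1 for d in offsets_km if d <= limit_km) / len(offsets_km)

# (GHCN lat, GHCN lon, WMO lat, WMO lon) for the five sample rows in the post
pairs = [
    (70.93, -8.67, 70.93333, -8.666667),  # JAN MAYEN
    (78.25, 15.47, 78.25000, 15.466667),  # SVALBARD
    (69.50, 19.00, 69.68333, 18.916667),  # TROMSO
    (74.52, 19.02, 74.51667, 19.016667),  # BJORNOYA
    (69.98, 23.37, 69.98333, 23.366667),  # ALTA
]

offsets = sorted(offset_km(*p) for p in pairs)
share_5km = fraction_within(offsets, 5.0)

print([round(d, 2) for d in offsets])
print("share within 5 km:", share_5km)
```

In this tiny sample four of the five stations sit well under 1 km from the WMO position, and the Tromso record is off by roughly 20 km, so the share within 5 km is 0.8.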
SO, in H2010 a station is rural if the pixel at its location is dark.
In my approach I will correct the stations as much as feasible and look at all the surrounding pixels: dark for 10km around the “location” of the station.
Make sense?
Then I can screen them further with population data going back to 1850 for every 10km*10km grid.
That’s the plan.
1. correct the stations
2. characterize the error
3. screen using that error knowledge
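A bounding-box screen along these lines might look like the sketch below. It is not the tested code referred to above: the grid is a plain list of lists standing in for a nightlights image, the pixel size and error values are illustrative parameters, and both function names are made up.

```python
import math

def box_half_width(search_km, station_error_km, pixel_error_km, pixel_km):
    """Number of pixels to look out from the centre pixel.
    The search radius is padded by the station and pixel location errors."""
    return math.ceil((search_km + station_error_km + pixel_error_km) / pixel_km)

def is_dark_box(grid, row, col, half_width, dark_max=0):
    """True if every pixel in the box around (row, col) is at or below dark_max.
    The box is clipped at the edges of the grid."""
    r0, r1 = max(0, row - half_width), min(len(grid) - 1, row + half_width)
    c0, c1 = max(0, col - half_width), min(len(grid[0]) - 1, col + half_width)
    return all(
        grid[r][c] <= dark_max
        for r in range(r0, r1 + 1)
        for c in range(c0, c1 + 1)
    )

# Toy 5x5 image: all dark except one lit pixel in a corner
toy = [[0] * 5 for _ in range(5)]
toy[0][4] = 40

print(is_dark_box(toy, 2, 2, 0))  # single-pixel test, as in H2010
print(is_dark_box(toy, 2, 2, 2))  # wider box catches the lit pixel
print(box_half_width(10, 5, 2, 1.0))
```

The single-pixel test calls the centre of the toy image dark, while the wider box does not, which is the whole point of screening on the neighbourhood rather than on one pixel.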
George E. Smith says:
November 1, 2010 at 9:55 am
Well if mistakes don’t matter; and don’t affect the results; then we could stop taking the data all together; since the data doesn’t matter; and we could simply make up the results; and save a ton of money.
$$$$$$$$$$
if I have 100 dollars in the bank and I write 5 checks
25.45
23.19
25.43
19.01
20.56.
And I do my accounting by rounding up numbers always, I can figure that
26+24+26+20+21 will tell me that I am overdrawn. Now, 117.0 is the wrong answer. But to my question “am I overdrawn?” this “mistake” makes no difference. The mistake doesn’t make me a millionaire. I’m still broke. So, actually, some mistakes make no difference TO the question being asked and the person asking the question.
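The arithmetic is easy to check. A short Python sketch, using `Decimal` so the cents add up exactly:

```python
import math
from decimal import Decimal

balance = Decimal("100")
checks = [Decimal("25.45"), Decimal("23.19"), Decimal("25.43"),
          Decimal("19.01"), Decimal("20.56")]

exact_total = sum(checks)                          # true amount written
rounded_total = sum(math.ceil(c) for c in checks)  # every check rounded up

print(exact_total, exact_total > balance)
print(rounded_total, rounded_total > balance)
```

The exact total is 113.64 and the rounded-up total is 117. The two numbers differ, but both exceed the 100 dollar balance, so the answer to “am I overdrawn?” is the same either way.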
Tomasz Kornaszewski says:
I thought it might be the GP/DME as well but didn’t have the patience to look through all the online tools, so I just gave some links to some of the stuff I found.
The lack of quality control in the databases impugns the credibility and competence of the people who are in charge of them. If they can’t get the easy stuff right and can’t build systems properly, why should we expect that their secret adjustment processes are sound?
Nightlights? I once flew into Riga, Latvia, in the middle of the night and the city was almost totally dark — fortunately they turned on the runway lights just before we landed! Before we make sweeping assumptions we should find out some “ground truth” as the geologists say. Now we know why there are complaints that “they don’t teach geography any more”.
Steven Mosher says:
October 31, 2010 at 11:38 pm
” unravelling the history of all the changes is tough work”
Actually it is conjecture, and not data, but inference disguised as data or “metadata”.
Mosh,
Good work. It’s good to see you picking up what Peter O’Neill started. I did note it on your blog but too up to my eyes these days…
While that may explain problems with determining the reliability of stations, or trying to find the record from any particular station, it doesn’t explain why the actual locations are wrong. These are scientists. Their currency is data. Not taking care of the data is akin to bankers mislaying the money (or more accurately, their financial transactions).
The important fact is not whether the globe is warming. The important fact is that those who propose to turn our society upside down on the basis of supposed warming show a striking lack of interest in collecting and examining the data needed to determine whether it is warming or not.
richard verney says: “…In my opinion, we should only be looking at sea temperatures and satellite collected temperatures, or sea temperatures and wholly unadjusted rural data sets. All other data sets should be disregarded. Climate is mainly driven by the sea (which covers approx 70% of the Earth and the volume of which is a giant storage reservoir). Thus sea temperature data is the most important single issue….”
The oceans have a thermal mass over 1000 times greater than the atmosphere. Trying to tease a small warming (or cooling) signal out of atmospheric data is like trying to measure changes in the weight of a bull by how hard he snorts.
Verity Jones says:
November 1, 2010 at 4:32 pm
Mosh,
Good work. It’s good to see you picking up what Peter O’Neill started. I did note it on your blog but too up to my eyes these days…
$$$$$$$$$
Peter does fantastic original work. Ron Broberg also did some great work that highlighted the stations-in-the-ocean problem. Sorting this out is a big job.