WUWT Video – Zeke Hausfather explains the new BEST surface data set at AGU 2013

I mentioned the poster earlier here. Now I have the video interview and the poster in high detail. See below.

Here is the poster:

BEST_Eposter2013

And in PDF form here, where you can read everything in great detail:

AGU 2013 Poster ZH

 

122 Comments
RichardLH
December 20, 2013 1:35 pm

Stephen; My fault really. I was quoting from a fully merged database of ALL of the Global data sets, TMAX, TMIN and TAVG. The TAVG alone values are much lower, as are values only using the smaller number of data sets that just cover the USA. I suspect that given the right query criteria we are talking the same numbers.
I have extracted all of the “data_characterization.txt” and “site_detail.txt” from the various zip files into an in memory data set now, I just need to tease out all of the relevant fields so that the query criteria are easier to manage and quote.
I think the biggest take home here is that the number of long term data sets (say >150 years) of any real quality (i.e. 95% data coverage or better) is very, very low. This poses the question of how valid it is to infill the rest of the Globe/USA at those longer time frames and thus derive an accurate long term temperature field. Surely the temperature field reconstruction must get less precise as the total number of reference points drops. The question is at what number does it become just a ‘guess’ rather than a ‘fact’?
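The selection criteria in that paragraph (a record longer than 150 years with at least 95% of its months present) can be sketched as a small filter. This is only an illustration: the `StationRecord` fields and the `long_high_quality` helper are hypothetical names, not the layout of the BEST files.

```python
from dataclasses import dataclass


@dataclass
class StationRecord:
    # Hypothetical summary of one record; not the BEST file layout.
    station_id: int
    first_year: int
    last_year: int
    months_reported: int  # number of months that actually hold a value


def span_years(rec):
    """Length of the record in calendar years, first and last year inclusive."""
    return rec.last_year - rec.first_year + 1


def coverage(rec):
    """Fraction of the months in the span that hold a value."""
    return rec.months_reported / (12 * span_years(rec))


def long_high_quality(records, min_years=150, min_coverage=0.95):
    """Keep records longer than min_years with coverage of min_coverage or better."""
    return [
        r for r in records
        if span_years(r) > min_years and coverage(r) >= min_coverage
    ]
```

Lowering `min_years` in steps and counting what survives would give the "how many reference points at each time frame" curve the question is really about.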
Now to tackle the “data.txt” files to get a wider picture. I think I will start with the longest records and work downwards.
As a comment from the data archivist that lurks inside me, referencing 39 LATEST.zip files without versioning or any other way of distinguishing between them is unlikely to be considered ‘best practice’ in a computing sense! Fortunately a simple URL parsing procedure produces a more logical set of local zip files (though still without versions 🙁 ).
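A URL parsing procedure of that kind might look like the sketch below, which folds the last couple of directory names into the local file name so that many files all called LATEST.zip stay distinguishable. The example URL and the `local_zip_name` helper are made up for illustration; the real download paths may be laid out differently.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit, unquote


def local_zip_name(url, keep_parts=2):
    """Build a distinct local file name from the trailing directories of a URL."""
    path = PurePosixPath(unquote(urlsplit(url).path))
    parts = [p for p in path.parts if p != "/"]
    parents, filename = parts[:-1], parts[-1]
    # Keep only the last keep_parts directory names as a prefix.
    prefix = parents[-keep_parts:] if keep_parts else []
    safe = [p.replace(" ", "_") for p in prefix]
    return "_".join(safe + [filename])


# Hypothetical URL, used only to show the shape of the result.
name = local_zip_name("http://example.org/downloads/TAVG/GHCN-Monthly/LATEST.zip")
```

Adding the download date to the name would be one cheap way to get the missing versioning.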

Jeff Id
December 20, 2013 1:41 pm

Stephen Rasey
I believe there is a problem in the way BEST calculates the CI. I have brought it to their attention multiple times without a single comment back, not even a “we don’t agree”, but the problem is real.
The Jackknife calculation ‘damages’ the dataset and looks for shifts in the resulting CI. It’s a neat technique but the rescaling BEST does with the algorithm minimizes the outliers and violates the basic assumption of the jackknife principle. The resulting CI becomes a factor of the probability function (or distribution shape if you prefer). I was curious if they had addressed the problem yet.

Jeff Id
December 20, 2013 1:49 pm

Stephen Rasey
I will reword this a little better:
I believe there is a problem in the way BEST calculates the CI. I have brought it to their attention multiple times without a single comment back, not even a “we don’t agree”, but the problem is real.
The Jackknife calculation ‘damages’ the dataset and looks for shifts in the resulting CI. It’s a neat technique but the rescaling BEST does with the algorithm minimizes the outliers and violates the basic assumption of the jackknife concept. The resulting CI is altered in unpredictable ways by the shape of the probability distribution of the temp series comprising the data. It probably isn’t a big deal but we don’t know because it certainly wasn’t accurately defined in the original paper and the authors have not addressed it as yet.
I was curious if they had addressed the problem yet.
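For readers unfamiliar with the technique, here is a minimal textbook delete-one jackknife in Python. It is not BEST's code and makes no claim about how their version is implemented; `jackknife_se` and its `estimator` parameter are names chosen for this sketch. The point it illustrates is that the spread of the leave-one-out replicates is what sets the uncertainty, so anything inside the estimator that reweights or damps outliers after each deletion changes that spread.

```python
import math


def jackknife_se(values, estimator=None):
    """Delete-one jackknife standard error of estimator(values)."""
    values = list(values)
    if estimator is None:
        # Default estimator: the plain arithmetic mean.
        estimator = lambda xs: sum(xs) / len(xs)
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    # Recompute the estimate with each observation removed in turn.
    replicates = [estimator(values[:i] + values[i + 1:]) for i in range(n)]
    centre = sum(replicates) / n
    # Standard jackknife variance: (n - 1) / n times the sum of squared deviations.
    return math.sqrt((n - 1) / n * sum((r - centre) ** 2 for r in replicates))
```

For the plain mean this reduces to the usual sample standard deviation divided by the square root of n. Passing an `estimator` that downweights outliers is a quick way to see how much the resulting interval moves.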

December 20, 2013 3:37 pm

The BEST description of “outliers” seems to correspond to a description of best data. Making BEST the worst.

December 20, 2013 5:28 pm

Stephen Rasey,
Here is the count of stations used each month in the Berkeley Earth dataset: http://berkeleyearth.lbl.gov/auto/Regional/TAVG/Figures/global-land-TAVG-Counts.pdf
It increases pretty monotonically; no real dropoffs apart from the last few months (where not all stations have reported yet). The recent year has the most stations of any year (~11,000) reporting.
Over the entire record there are ~40,000 unique stations.

December 20, 2013 5:32 pm

Also, you can get copies of the merged quality-controlled dataset here (which should make counting stations pretty easy): http://berkeleyearth.org/data

December 20, 2013 5:59 pm

@RichardLH at 1:35 pm
Have you found the files containing breakpoints, yet?
When BEST inserts breakpoints 2-5 years apart, it is hard to believe anything that follows.
I looked at DENVER STAPLETON AIRPORT. The raw data runs from 1873 to 2011. Berkeley inserts break points at:
1941, 1947, 1968, 1980, 1982, 1984, 1986, 1994, 1996, 1999. End of record in 2011.
Contrast these tidbits of history from Wikipedia:
Created in 1919,
Opened in 1929 as Denver Municipal, Name changed after an expansion in 1944.
Runway 17/35 and a new terminal building opened in 1964
Concourse D in 1972.
New North South runways in the 1980s. Concourse E in 1988.
Closed in 1995 and operations moved to the new DENVER INTERNATIONAL AIRPORT about 15 miles further ENE on the open prairie, then the 2nd largest Airport in Colorado…. Stapleton had more gates and was closer to downtown. Government money at work!
Now turned into an industrial park.
Now, what station was contributing data to the record from 1995 to 2011?
Berkeley’s breakpoints seem to be inversely correlated with airport changes. Go figure. I grew up there from 1960 to 1981 and I can tell you there is a lot of urban encroachment (UHI) over Stapleton’s history.
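The segment lengths implied by those breakpoints are easy to tabulate. The short Python sketch below uses only the years quoted in this comment and treats each breakpoint as a whole year, which is an assumption (the actual breaks presumably fall on specific months); `segment_lengths` is just a helper name for this example.

```python
def segment_lengths(start, breakpoints, end):
    """Years between consecutive edges: record start, each breakpoint, record end."""
    edges = [start] + sorted(breakpoints) + [end]
    return [later - earlier for earlier, later in zip(edges, edges[1:])]


# Years as quoted above for DENVER STAPLETON AIRPORT.
stapleton_breaks = [1941, 1947, 1968, 1980, 1982, 1984, 1986, 1994, 1996, 1999]
lengths = segment_lengths(1873, stapleton_breaks, 2011)

# Segments of three years or less.
short_segments = [n for n in lengths if n <= 3]
```

That gives eleven segments, five of them three years or shorter, all falling between 1980 and 1999.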

December 20, 2013 6:48 pm

@Zeke Hausfather 5:28 pm, 5:32pm
Thanks, Zeke.
Not only is the number of stations important, but the length of the usable records is important. I am not the only one who would like to see a distribution of station lengths between breakpoints at points in time or across the whole dataset.
Just above, I recounted some personal and documented changes to Stapleton (Denver, CO, USA). It is just one station (at least in theory), yet the station record has temperatures from before the station existed, after the station closed, and ten breakpoints, some as quick as 2 years in a record officially 130+ years long when the station itself probably existed only from after 1919 to 1995. That seems like an excessive number of breakpoints, especially when they don’t correlate well with documented airport expansion.
So, can you give a more directed link to the breakpoint related tables instead of somewhere off page?

December 20, 2013 6:53 pm

@Zeke at 5:32 pm
“The recent year has the most stations of any year (~11,000) reporting.”
It looks about 18,000. It passes 10,000 at 1950.

timetochooseagain
December 20, 2013 7:47 pm

I have something of a major bone to pick with only presenting trend maps since 1979. Much of the American Southeast has seen a long term cooling trend, warming in the last 30 years notwithstanding. Presenting the data the way you do enhances the impression that everywhere warms and cools together; this is not the case.

Jeff Id
December 20, 2013 7:57 pm

I have no idea why my comment has been completely ignored. It seems to be the only legitimate problem with the series.
I feel like Oliver now. Is it rude?

RichardLH
December 21, 2013 2:12 am

As it turns out the 40,000 station result is probably the answer to an improperly framed question.
A better and possibly more relevant question to understanding the underlying accuracy of the BEST sampling methodology might be
“How many of the 1 degree latitude and longitude grid cells have measured as opposed to estimated data in them and what is their Global and temporal distribution?”
This is because this is a merge from separate data sets and a simple query on the combined result shows that what BEST calls ‘Station ID’ is not a unique identifier of a place but of a record from a published source.
This means that querying the database for unique ‘Station ID’ overrepresents the number of actual inputs to the later method steps.
To be fair they have never claimed otherwise but it can easily lead to a misunderstanding of the figures given.
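Counting occupied grid cells rather than Station IDs can be sketched as follows. This is a hedged illustration in Python, not the query actually used: the `(station_id, lat, lon)` tuples and the `cell_of` and `occupied_cells` helpers are assumptions made for the example.

```python
import math


def cell_of(lat, lon):
    """Map a position to its 1-degree cell as (row, col), row 0 at the South Pole."""
    # Clamp so that latitude exactly +90 falls in the top row.
    row = min(math.floor(lat) + 90, 179)
    # Wrap so that longitude +180 lands in the same column as -180.
    col = (math.floor(lon) + 180) % 360
    return row, col


def occupied_cells(records):
    """Set of cells holding at least one record, however many IDs share a cell."""
    return {cell_of(lat, lon) for _station_id, lat, lon in records}


# Two records from different sources at nearly the same place, plus one elsewhere.
records = [
    (1, 39.77, -104.87),
    (2, 39.76, -104.88),
    (3, -33.90, 151.20),
]
cells = occupied_cells(records)
```

Here three Station IDs collapse to two cells, which is the over-representation described above.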

RichardLH
December 21, 2013 2:53 am

The answer to the first part of my question (Global distribution of grid cells with coverage) at first pass is
GHCN Monthly version 3 TAVG: 7429 detail lines
Global Summary of the Day TAVG – Monthly: 24612 detail lines
GSN Monthly Summaries from NOAA TAVG: 1060 detail lines
Hadley Centre _ CRU TAVG: 5261 detail lines
Monthly Climatic Data of the World TAVG: 2919 detail lines
Scientific Committee on Antarctic Research TAVG: 256 detail lines
US Cooperative Summary of the Day TAVG – Monthly: 3451 detail lines
US Cooperative Summary of the Month TAVG: 13034 detail lines
US First Order Summary of the Day TAVG – Monthly: 766 detail lines
US Historical Climatology Network – Monthly TAVG: 1367 detail lines
World Monthly Surface Station Climatology TAVG: 4795 detail lines
World Weather Records TAVG: 2009 detail lines
1 degree grid cells with measured coverage = 9,495 of 129,600

RichardLH
December 21, 2013 3:16 am

My apologies, that should read
1 degree grid cells with measured coverage = 9,495 of 38,880
to fairly reflect that this is Land only coverage. The previous post included Ocean cells which are not part of the BEST study of course.

December 21, 2013 11:50 pm

@RichardLH at 2:12 am, 3:16 am
“As it turns out the 40,000 station result is probably the answer to an improperly framed question.”
An interesting point. However, I think it has been clear that our question has always been, “How many stations and how long are the temperature records?” Counting StationID has always been a quick and cheap first cut at an upper bound.
You do ask a very good question of how many 1×1 grid cells have at least one StationID. 9,495 cells out of 38,880 is a bit pessimistic though since most high latitude cells will be unrepresented, but are much narrower than tropical cells. Still, it is a good first pass estimate.
So next, do the census in bands of 15 degrees latitude.
And then, how many cells have at least 30 years of data?
There are a lot of good questions that can be asked.
@2:53 am
A very handy crib sheet. This ought to be on BEST’s data source page.
BTW, have you loaded these files into a relational database? If so, what flavor?
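The point about high latitude cells being much narrower, and the suggested census in 15 degree bands, can be sketched like this. It is a rough Python illustration: cells are identified by the integer latitude and longitude of their south-west corner, the helper names are invented for the example, and no land mask is applied, so the area figure is a fraction of the whole sphere rather than of land.

```python
import math


def cell_area_fraction(lat_floor):
    """Fraction of the sphere covered by the 1x1 degree cell starting at lat_floor."""
    south = math.radians(lat_floor)
    north = math.radians(lat_floor + 1)
    # A latitude band's area is proportional to the difference of sines;
    # one cell is 1/360 of its band.
    return (math.sin(north) - math.sin(south)) / 2.0 / 360.0


def band_census(cells, band_width=15):
    """Count occupied cells per latitude band, keyed by the band's southern edge."""
    counts = {}
    for lat_floor, _lon_floor in cells:
        band = (lat_floor // band_width) * band_width
        counts[band] = counts.get(band, 0) + 1
    return counts


def covered_area_fraction(cells):
    """Area-weighted share of the sphere covered by the occupied cells."""
    return sum(cell_area_fraction(lat) for lat, _lon in set(cells))
```

A cell on the equator covers over a hundred times the area of the cell touching the pole, so a raw cell count and an area-weighted count can tell quite different stories.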

RichardLH
December 22, 2013 2:19 am

Stephen Rasey says:
December 21, 2013 at 11:50 pm
” However, I think it has been clear that our question has always been, How many stations and how long are the temperature records.”
Indeed, but I think it does need to be clear in order not to fool yourself, even accidentally. If all of those records are in half the surface area then the claims made need to be less precise or at least more qualified.
” 9,495 cells out of 38,880 is a bit pessimistic ”
and, of course wrong! Should be 180 * 360 * 0.29 = 18,792 though you could argue about the exact figure and what you count as a land cell. As to pessimistic, they are claiming that this is representative of the Global temperature figure by extrapolation of the data into cells with no measurements, so a handle on how much of that is done is, I think, valid and relevant information.
These are cells with ANY information at all in them with no regard to quality or length to get to the 9,495 figure. The truly useable cells will be a lot lower, especially when record lengths and any gaps are taken into account. I am trying to work out what would be the best way of describing how the percentage/quality coverage varies with time.
“BTW, have you loaded these files into a relational database? If so, what flavor?”
An in memory C# set of tables keyed on Station ID. Bit of a mess right now as I have not fully parsed out the records, just kept them as a line of text and parsed on retrieval. More work still to do.
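The "keep the line of text and parse on retrieval" arrangement might look something like the sketch below. It is written in Python rather than the C# actually used, and the three-column line format (station ID, date, value) and the `%` comment prefix are assumptions for illustration, not the real data.txt layout.

```python
from collections import defaultdict


class RawTable:
    """In-memory table keyed on Station ID that stores raw lines and parses lazily."""

    def __init__(self):
        self._lines = defaultdict(list)

    def load(self, lines):
        """Index each data line by its first field without parsing the rest."""
        for line in lines:
            line = line.strip()
            if not line or line.startswith("%"):
                continue  # skip blanks and (assumed) comment lines
            station_id = int(line.split(None, 1)[0])
            self._lines[station_id].append(line)

    def rows(self, station_id):
        """Parse the stored lines for one station only when they are asked for."""
        for line in self._lines.get(station_id, []):
            fields = line.split()
            yield int(fields[0]), float(fields[1]), float(fields[2])

    def station_count(self):
        return len(self._lines)
```

Loading stays cheap because only the key is read up front; the cost of full parsing is paid per query.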

December 22, 2013 8:10 am

@RichardLH
Keep at it. My brevity this weekend is not from lack of interest, it is lack of internet.

RichardLH
December 22, 2013 12:41 pm


Well this is probably my last day looking at this. Xmas calls :-).
Database input now nearly complete. So far we have (cumulative counts):
The validation fails are slightly surprising! The Lat/Long fails are mainly from transposed columns. I assume that this does not leak through to the calculation stages.
Import database: GCOS Monthly Summaries from DWD TAVG
Data records 96950
Data character records 1153
Data flag records 1
Flag records 96898
Site comp details records 1153
****** Validation fail ********* ID: 42 ZHGONGSHAN LatUncertainty: 1.85000
****** Validation fail ********* ID: 86 RIO DE JANEIRO (GALEAO AE LatUncertainty:22.82500
****** Validation fail ********* ID: 92 ARICA (CHACALLUTA AERO) LatUncertainty:18.35833
****** Validation fail ********* ID: 93 CHARANA LatUncertainty:17.58333
****** Validation fail ********* ID: 95 LUBANGO (SA DA BANDEIRA) LatUncertainty:14.93333
****** Validation fail ********* ID: 115 CHACHAPOYAS LatUncertainty: 6.20833
****** Validation fail ********* ID: 85284 PUERTO CASADO LatUncertainty:22.28333
****** Validation fail ********* ID: 85285 WAGGA WAGGA AIRPORT LatUncertainty:35.16667
****** Validation fail ********* ID: 85286 MORUYA HEADS PILOT STATIO LatUncertainty:35.91667
****** Validation fail ********* ID: 85287 PUNTA ARENAS (CARLOS IBAN LatUncertainty:53.00833
****** Validation fail ********* ID: 85288 CUNDERDIN AIRFIELD LatUncertainty:31.62500
****** Validation fail ********* ID: 85289 UNIV. WISC. #8931 (MARILY LatUncertainty:79.95833
Site Details records 1153
Site FlagDefs records 8
Site Flag records 1153
Site Summary records 1153
Source FlagDefs records 1
Source records 96893
Station change records 394
Import database: GHCN Daily TAVG – Monthly
Data records 5261798
Data character records 16222
Data flag records 22
Flag records 5261694
Site comp details records 16222
Site Details records 16222
Site FlagDefs records 9
Site Flag records 1673
Site Summary records 16222
Source FlagDefs records 9
Source records 5261684
Station change records 394
Import database: GHCN Monthly version 2 TAVG
Data records 12071139
Data character records 23502
Data flag records 22
Flag records 12070983
Site comp details records 23502
****** Validation fail ********* ID: 44543 AMUNDSEN-SCOT LatUncertainty: 5.00000
****** Validation fail ********* ID: 51722 SHIP N LatUncertainty: 5.00000
Site Details records 23502
Site FlagDefs records 9
Site Flag records 1698
Site Summary records 23502
Source FlagDefs records 10
Source records 12070968
Station change records 394
Import database: GHCN Monthly version 3 TAVG
Data records 17234198
Data character records 30782
Data flag records 28
Flag records 17233990
Site comp details records 30782
****** Validation fail ********* ID: 51823 AMUNDSEN-SCOT LatUncertainty: 5.00000
****** Validation fail ********* ID: 59001 SHIP N LatUncertainty: 5.00000
Site Details records 30782
Site FlagDefs records 9
Site Flag records 1707
Site Summary records 30782
Source FlagDefs records 31
Source records 17233970
Station change records 394
Import database: Global Summary of the Day TAVG – Monthly
Data records 21680106
Data character records 55245
Data flag records 36
Flag records 21679846
Site comp details records 55245
****** Validation fail ********* ID: 6281 MOBILE UA STN ATLANT LatUncertainty: 5.00000
****** Validation fail ********* ID: 131830 AMUNDSEN-SCOTT LatUncertainty: 5.00000
****** Validation fail ********* ID: 131831 AMUNDSEN-SCOTT LatUncertainty: 5.00000
****** Validation fail ********* ID: 131832 CLEAN AIR LatUncertainty: 5.00000
Site Details records 55245
Site FlagDefs records 9
Site Flag records 3381
Site Summary records 55245
Source FlagDefs records 32
Source records 21679821
Station change records 394
Import database: GSN Monthly Summaries from NOAA TAVG
Data records 22374430
Data character records 56156
Data flag records 36
Flag records 22374118
Site comp details records 56156
Site Details records 56156
Site FlagDefs records 9
Site Flag records 4016
Site Summary records 56156
Source FlagDefs records 33
Source records 22374088
Station change records 394
Import database: Hadley Centre _ CRU TAVG
Data records 26491184
Data character records 61268
Data flag records 36
Flag records 26490820
Site comp details records 61268
Site Details records 61268
Site FlagDefs records 9
Site Flag records 4156
Site Summary records 61268
Source FlagDefs records 34
Source records 26490785
Station change records 394
Import database: Monthly Climatic Data of the World TAVG
Data records 26890092
Data character records 64038
Data flag records 36
Flag records 26889676
Site comp details records 64038
Site Details records 64038
Site FlagDefs records 9
Site Flag records 6926
Site Summary records 64038
Source FlagDefs records 35
Source records 26889636
Station change records 394
Import database: Scientific Committee on Antarctic Research TAVG
Data records 26923756
Data character records 64145
Data flag records 36
Flag records 26923288
Site comp details records 64145
****** Validation fail ********* ID: 81315 Amundsen Scott LatUncertainty: 5.00000
****** Validation fail ********* ID: 81316 Clean Air LatUncertainty: 5.00000
****** Validation fail ********* ID: 81325 Byrd LatUncertainty: 5.00000
Site Details records 64145
Site FlagDefs records 9
Site Flag records 7033
Site Summary records 64145
Source FlagDefs records 36
Source records 26923243
Station change records 394
Import database: US Cooperative Summary of the Day TAVG – Monthly
Data records 27051117
Data character records 67443
Data flag records 42
Flag records 27050597
Site comp details records 67443
Site Details records 67443
Site FlagDefs records 9
Site Flag records 8072
Site Summary records 67443
Source FlagDefs records 38
Source records 27050547
Station change records 394
Import database: US Cooperative Summary of the Month TAVG
Data records 32156536
Data character records 80328
Data flag records 43
Flag records 32155964
Site comp details records 80328
Site Details records 80328
Site FlagDefs records 9
Site Flag records 9842
Site Summary records 80328
Source FlagDefs records 39
Source records 32155909
Station change records 394
Import database: US First Order Summary of the Day TAVG – Monthly
Data records 32298284
Data character records 80945
Data flag records 49
Flag records 32297660
Site comp details records 80945
Site Details records 80945
Site FlagDefs records 9
Site Flag records 9847
Site Summary records 80945
Source FlagDefs records 44
Source records 32297600
Station change records 394
Import database: US Historical Climatology Network – Monthly TAVG
Data records 33824769
Data character records 82163
Data flag records 50
Flag records 33824093
Site comp details records 82163
Site Details records 82163
Site FlagDefs records 11
Site Flag records 11065
Site Summary records 82163
Source FlagDefs records 45
Source records 33824028
Station change records 394
Import database: World Monthly Surface Station Climatology TAVG
Data records 35332211
Data character records 86790
Data flag records 50
Flag records 35331483
Site comp details records 86790
****** Validation fail ********* ID: 60235 AWS: BYRD (8903) LatUncertainty: 5.00000
****** Validation fail ********* ID: 61620 SHIP STATION N OCEAN WEATHER S LatUncertainty: 5.00000
Site Details records 86790
Site FlagDefs records 12
Site Flag records 14856
Site Summary records 86790
Source FlagDefs records 46
Source records 35331413
Station change records 925
Import database: World Weather Records TAVG
Data records 35538706
Data character records 88650
Data flag records 50
Flag records 35537926
****** Validation fail ********* ID: 15507 DABO-SINGKEP WMOID :9.600000e+035
****** Validation fail ********* ID: 15519 SIMPANGTIGA-PEKANBARU WMOID :960000000
****** Validation fail ********* ID: 86671 KIJANG TANJUNG PINANG WMOID :9.600000e+009
****** Validation fail ********* ID: 86675 TAREMPA WMOID :9.600000e+036
Site comp details records 88650
****** Validation fail ********* ID: 15507 DABO-SINGKEP WMOID :9.600000e+035
****** Validation fail ********* ID: 15516 PADANGKEMILING BENGKULU LatUncertainty: 3.76667
****** Validation fail ********* ID: 15519 SIMPANGTIGA-PEKANBARU WMOID :960000000
****** Validation fail ********* ID: 86671 KIJANG TANJUNG PINANG WMOID :9.600000e+009
****** Validation fail ********* ID: 86675 TAREMPA WMOID :9.600000e+036
Site Details records 88650
Site FlagDefs records 12
Site Flag records 14924
Site Summary records 88650
Source FlagDefs records 47
Source records 35537851
Station change records 925
1 degree Lat/Long grid cells with any coverage : 9660
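Validation checks of the sort that produce the failures listed above can be sketched as below. This is a guess at the general shape only: the field names, the `validate_site` helper and the 1 degree uncertainty threshold are all assumptions, not the rules actually applied in the import. The WMO check relies on WMO station index numbers being five digits.

```python
def validate_site(site, max_uncertainty=1.0):
    """Return a list of problems found in one site record (empty if none)."""
    problems = []
    if not -90.0 <= site["lat"] <= 90.0:
        problems.append("lat out of range")
    if not -180.0 <= site["lon"] <= 180.0:
        problems.append("lon out of range")
    # A position uncertainty of tens of degrees suggests transposed columns.
    for key in ("lat_uncertainty", "lon_uncertainty"):
        if not 0.0 <= site[key] <= max_uncertainty:
            problems.append(f"{key} implausible")
    # WMO station index numbers have five digits.
    wmo = site.get("wmo_id")
    if wmo is not None and not (float(wmo).is_integer() and 0 <= wmo <= 99999):
        problems.append("wmo_id implausible")
    return problems
```

A record whose latitude uncertainty reads 22.825, for example, would be flagged on that field alone, which fits the transposed-column explanation.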

Shub Niggurath
December 27, 2013 4:14 am

Zeke says in the beginning of the video that stations show patterns that are not ‘thermodynamically’ plausible.
This is data peeking, unless explicitly shown otherwise.

December 27, 2013 11:23 pm

“Data Peeking?” Is this how you meant it? To keep sampling until you get something significant and then stop.
http://hardsci.wordpress.com/2012/11/08/data-peeking-is-always-wrong-except-when-you-do-it-right/
