Why Reanalysis "Data" Isn't

Guest Post by Willis Eschenbach

There is a new paper out by Xu and Powell, “Uncertainty of the stratospheric/tropospheric temperature trends in 1979–2008: multiple satellite MSU, radiosonde, and reanalysis datasets” (PDF, hereinafter XP2011). It shows the large differences between the satellite, balloon (radiosonde), and reanalysis temperatures for the troposphere and the stratosphere. The paper is well worth a read, and is not paywalled. Figure 1, from their paper, shows their tropospheric temperature trends by latitudinal band from each of the sources.

Figure 1. From XP2011 Fig. 3a. Original caption says: Inter-comparison of tropospheric temperature (TCH2) trends (K decade⁻¹) for the MSU (RSS, UAH, STAR), Radiosonde (RATPAC, HADAT2, IUK, RAOBCORE, RICH) and Reanalysis (JRA25, MERRA, NCEP-CFSR, NCEP-NCAR, NCEP-DOE) products for the period of 1979–2008. (a) Trend changes with latitude for each individual dataset;

In Figure 1, the three groups are divided by color. The satellite observations are in blue. The balloon-borne observations are in green. And the climate reanalysis model results are in orange. Now, bear in mind that these various results are all purporting to be measuring the same thing—which way and how much the temperature of the lower troposphere is trending. The paper closes with the following statement (emphasis mine):

In general, greater consistency is needed between the various data sets before a climate trend can be established in any region that would provide the reliability expected of a trusted authoritative source.

I can only heartily agree with that. However, there are a few conclusions that we can draw in the interim.

First, although these are all plotted together as though they were equals, only two of the groups represent observational data. The results shown in orange are all computer model outputs. Unfortunately, these model outputs are usually referred to as “reanalysis data”. They are not data. They are the output of a special kind of computer climate model. This kind of climate model attempts to match its output to the known data points at a given instant (temperatures, pressures, winds, etc.). It is fed a stream of historical data, including satellite MSU and other data as well as station reports from around the world. It then gives its best estimate of what is happening where we have no data, in between the stations and the observation times.

Given that the five different reanalysis products were all fed on a very similar diet of temperatures and pressures and the like, I had expected them to be much, much closer together. Instead, they are all over the map. So my first conclusion is that the outputs of reanalysis models are not only not data; as a group, they are also not accurate. They don’t even agree with each other. To see what the rest of the data shows, I have removed the reanalysis model outputs in Figure 2.

Figure 2. Same as in Figure 1, but with the computer reanalysis model results removed, leaving satellite (blue) and balloon-borne (green) observations.

The agreement between the balloon datasets is not as good as that between the satellite datasets, as might be expected from the difference in coverage between the satellite data (basically global) and the balloon data (limited to scattered station locations).

Once the computer model results are removed, we find much better agreement between the actual observations. Figure 3 shows the correlation between the various datasets:

Figure 3. Correlations between the various observations (Satellite and Balloon) and computer model (Reanalysis) data. Red indicates the lowest correlation, blue the highest. The bottom row shows the correlation of each dataset with the average of all datasets. HadAT is somewhat affected by incomplete coverage (only to 50°S; see Figure 2), as is RAOBCORE to a lesser degree (coverage to 70°S).

Numerically, this supports the overall conclusion of Figure 1, which is that as a group the reanalysis model results do not agree well with each other. This certainly does not give confidence in the idea of blindly treating such model output as “data”.

Finally, Figure 4 shows the three satellite records, along with the MERRA reanalysis model output.

Figure 4. Same as in Figure 1, but with the balloon and computer reanalysis model results removed, leaving satellite (blue) and one reanalysis model (violet).

In general the three satellite records are in good agreement. The STAR and RSS datasets are extremely similar, somewhat disturbingly so, in fact. Their correlation is 1.00. It makes me wonder whether they share large portions of their underlying analysis mathematics. If so, one might hope that they would resolve whatever small differences remain between them.

I have read, but cannot now lay my hands upon, a document which said that the RSS team use climate model output as input to a part of their calculation of the temperature. In contrast, the UAH team do not use a climate model for that aspect of their analysis, but do a more direct calculation. (I’m sure someone will be able to verify or falsify that.) [UPDATE: Stephen Singer points to the document here, which supports my memory. The RSS team uses the output of the CCSM3 climate model as input to their analysis.] If so, that could explain the similarity between MERRA and the RSS/STAR pair. On the other hand, the causation may run the other way: the reanalysis model may be overweighting the RSS/STAR input, because remember that satellite data, perhaps the RSS data, is used as input to the reanalysis models.

This leads to the interesting situation where the output of the CCSM3 is used as input to the RSS temperature estimate. Then the RSS temperature estimate is used as input to a reanalysis climate model … recursion, anyone?

Finally, this points to the difficulty in resolving the question of tropical tropospheric amplification. I have written about this question here. The various datasets give different answers regarding how much amplification exists in the tropics.

CONCLUSIONS? No strong ones. Reanalysis models are not ready for prime time. There is still a lot of variation in the different measurements of the global tropospheric temperature. This is sadly typical of the problems with a number of the other observational datasets. In this case, it affects the measurement of tropical tropospheric amplification. Further funding is required …

Regards to all,

w.

DATA:

The data from Figure 1 are given below in comma-separated format:

Latitude,STAR,UAH,RSS,RATPAC,HADAT,IUK,RAOBCORE,RICH,JRA25,MERRA,NCEP-CFSR,NCEP-DOE,NCEP-NCAR
-80,-0.104,-0.244,-0.134,0.085,,0.023,,0.023,-0.243,-0.154,0.028,0.294,0.304
-70,-0.074,-0.086,-0.094,0.09,,-0.035,0.071,-0.034,-0.218,-0.115,-0.045,0.147,0.148
-60,-0.055,-0.142,-0.074,0.09,,-0.088,0.1,-0.148,-0.285,-0.051,-0.094,0.059,0.104
-50,0.005,-0.069,-0.043,-0.006,0.138,0.022,0.081,0.01,-0.232,0.032,0.029,0.03,0.114
-40,0.07,-0.076,0.026,-0.01,0.118,-0.107,0.08,0.074,-0.081,0.115,0.116,-0.005,0.077
-30,0.143,0.082,0.087,0.114,0.123,0.122,0.127,0.126,0.047,0.178,0.22,0.047,0.108
-20,0.182,0.08,0.13,0.12,0.085,0.087,0.143,0.125,0.116,0.213,0.289,0.071,0.097
-10,0.199,0.056,0.153,0.114,-0.02,0.082,0.116,0.098,0.069,0.226,0.313,-0.003,0.053
0,0.195,0.038,0.154,0.089,0.038,0.028,0.136,0.089,0.063,0.284,0.324,-0.007,0.061
10,0.179,0.034,0.144,0.09,0.064,0.192,0.162,0.137,0.087,0.273,0.328,0.027,0.065
20,0.21,0.093,0.166,0.09,0.18,0.16,0.194,0.207,0.115,0.245,0.307,0.115,0.114
30,0.23,0.133,0.162,0.247,0.239,0.137,0.238,0.291,0.152,0.257,0.307,0.154,0.153
40,0.238,0.164,0.161,0.237,0.213,0.189,0.246,0.3,0.153,0.244,0.268,0.161,0.194
50,0.241,0.125,0.161,0.24,0.314,0.213,0.247,0.283,0.166,0.236,0.238,0.161,0.201
60,0.299,0.167,0.222,0.283,0.289,0.207,0.335,0.324,0.224,0.288,0.266,0.202,0.239
70,0.317,0.177,0.245,0.288,0.289,0.237,0.427,0.393,0.254,0.304,0.269,0.232,0.254
80,0.357,0.276,0.301,0.278,0.438,0.384,0.501,0.323,0.226,0.328,0.326,0.235,0.26
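
For anyone who wants to check the Figure 3 correlations, they can be recomputed directly from the table above. Here is a minimal sketch in Python, assuming pandas is installed and the table has been saved to a file named trends.csv (my name for it, not anything from the paper):

```python
import pandas as pd

# Load the latitudinal trend table given above (assumed saved as "trends.csv").
df = pd.read_csv("trends.csv")
datasets = df.drop(columns="Latitude")

# Pearson correlation of every dataset against every other, across latitude bands.
# Missing values (e.g. HADAT south of 50S) are excluded pairwise by default.
corr = datasets.corr(method="pearson")
print(corr.round(2))

# Correlation of each dataset with the all-dataset average
# (the bottom row of Figure 3).
print(datasets.corrwith(datasets.mean(axis=1)).round(2))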
COMMENTS:
barry
November 8, 2011 2:52 pm

Gary Pearse here
Precision to tenths of a degree is strengthened by having many measurements. The power of large groups of numbers makes this possible. Averaging large groups of numbers gives a statistically more precise value than having just one, or averaging only a few. Tamino had a post on it once…
Ah, here it is, from web archives.

it turns out to be a fundamental property of statistics that the average of a large number of estimates is more precise than any single estimate. The more data go into the average, the more precise is the average — even though the source data are all imprecise.

http://web.archive.org/web/20080402030712/tamino.wordpress.com/2007/07/05/the-power-of-large-numbers/
Check it out, it’s a pretty interesting post. Or read the wiki entry on the Law of Large Numbers

In probability theory, the law of large numbers (LLN) is a theorem that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed.

[Formatting fixed -w.]
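
A minimal simulation of the effect barry quotes: many imprecise estimates of one quantity, whose average becomes more precise as the count grows. The true value and noise level below are arbitrary choices for illustration only:

```python
import numpy as np

rng = np.random.default_rng(42)
true_value = 20.0   # the quantity being estimated (arbitrary)
noise_sd = 1.0      # each single estimate is imprecise: standard deviation 1.0

for n in (10, 1_000, 100_000):
    estimates = true_value + rng.normal(0.0, noise_sd, size=n)
    # The error of the average shrinks roughly like noise_sd / sqrt(n).
    print(f"n = {n:>6}: average = {estimates.mean():.4f} "
          f"(expected error ~ {noise_sd / np.sqrt(n):.4f})")
```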

November 8, 2011 4:43 pm

ferd berple says:
November 7, 2011 at 6:50 am
Isn’t the real question why the northern hemisphere is getting warmer and the southern hemisphere is getting colder if CO2 is well mixed? …This is a huge mystery completely unaccounted for in current theories of climate change… Why isn’t climate science all over this?
I’m surprised nobody else has answered you with this:
Svensmark.
With generally increasing solar radiation batting away cosmic-ray-induced clouds over the study period (1979–2008), the landmasses warm a lot, and the sea temperature rises and falls less owing to its larger thermal inertia. But the icecaps COOL, because ice has an even higher albedo than clouds.
BTW I’m sure I’ve seen a damn good piece of analysis of 90S – 90N profiles, by Erl Happ, that reminds me of the clearly asymmetric profile here. Fascinating material about the Roaring Forties IIRC.

barry
November 8, 2011 5:41 pm

Thanks for fixing the formatting Willis.

First, note that despite having thousands and thousands of observations, the average only improved on the accuracy of individual measurements by an order of magnitude. Grant didn’t discuss that at all.

Trends are even more robust than data points, yes? I was responding to the doubt that trends to a tenth of a degree could be validated mathematically. Isn’t this the order of magnitude you are just now saying we have improved accuracy to? I see no disagreement here, if so.

Compare and contrast that with a number of men measuring different things …

Thousands of instruments (thermometers) measuring one thing (air temperature)? And several sensors on satellites reading the same thing (spectral radiance)? I do not know what you mean by ‘a number of men’ measuring different things. If reanalysis, then the base data is the thousands/millions of measurements of one thing by many instruments.
Tamino’s example is also of many people measuring the same thing with their imperfect instruments (eyes). With enough data, the noise is reduced and precise estimates can be gleaned. But in the end this isn’t based on anecdote. The Law of Large Numbers is a statistical phenomenon used in many fields. Error bars come from doubts about measurement bias, coverage, etc., not from doubts about the power of averaging.

November 8, 2011 6:19 pm

Philip Bradley says: November 7, 2011 at 11:59 am
Aerosols do explain it… Long but interesting paper on aerosols, http://

I think the neatest answer to aerosols was provided by Warren Meyer. You can find it in the comments around slide 62 in my Climate Science presentation

barry
November 8, 2011 6:28 pm

Lucy,
Temperature trends in all data sets show warming in the Southern Hemisphere at the surface/lower troposphere.
Svensmark’s cloud/cosmic ray theory predicts that climate shifts should be opposite in sign between the North and South poles, not between the hemispheres. The jury is out on whether the South Pole has been cooling, flat or warming for the last 30 years or so (the UAH trend is -0.05C/decade). But it is clear from satellite measurements and ground-based sunspot counts that the sun, and therefore GCR trends, have not changed much over that period, so the GCR/cloud theory is unlikely to be substantiated by reference to the study above.
Solar output shows little trend for the last 60 years. Svensmark’s theory does not seem to be corroborated by the surface temperature records, which show a statistically significant positive trend for that period.

barry
November 8, 2011 8:24 pm

We could have ten million people measure the bar with a tape measure, we won’t get any more accurate than that.

Not according to statistical probability. The larger the sample, the more the mean of the sample converges on the ‘true’ value. There is no ceiling beyond which adding more samples makes no difference to the convergence. I find this not only to be a mathematical result but also quite intuitive. It just makes sense.

November 8, 2011 8:51 pm

it turns out to be a fundamental property of statistics that the average of a large number of estimates is more precise than any single estimate. The more data go into the average, the more precise is the average — even though the source data are all imprecise.
Large numbers do indeed improve precision, but they have no effect on accuracy.
Unfortunately Tamino and others fail to understand this distinction.
From Wikipedia:
The accuracy of a measurement system is the degree of closeness of measurements of a quantity to that quantity’s actual (true) value. The precision of a measurement system, also called reproducibility or repeatability, is the degree to which repeated measurements under unchanged conditions show the same results.
Thanks Lucy for the link.
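
The precision/accuracy distinction in this comment is easy to demonstrate numerically: give every measurement the same fixed bias, and no amount of averaging removes it. A sketch with arbitrary bias and noise values, chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 20.0
bias = 0.5        # systematic error shared by every measurement (arbitrary)
noise_sd = 1.0    # random error, different for each measurement

measurements = true_value + bias + rng.normal(0.0, noise_sd, size=1_000_000)

# Precision improves with sample size: the mean is very stable.
print(f"mean of a million measurements: {measurements.mean():.4f}")
# Accuracy does not: the mean converges on true_value + bias, not true_value.
print(f"offset from the true value:     {measurements.mean() - true_value:.4f}")
```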

barry
November 8, 2011 9:29 pm

That’s a bit of a red herring, Philip. The accuracy of individual measurements is a different metric and doesn’t apply to large-sample averaging per se. In temperature measurement systems, the error you’re introducing here comes from systematic biases, not from averaging. I was careful to distinguish these in a previous post. Accuracy, in the context you mean, doesn’t bear on the query I was responding to.
Studies like the one Willis has brought here try to quantify systematic biases by comparing data packages. This is quite different to the concept of convergence from averaging. Let’s keep them distinct lest we muddy the waters.

barry
November 8, 2011 10:12 pm

Willis,
The mathematical point is that the larger the sample size, the greater the convergence to the true value. Individual measurements will vary, but the mean will converge on the true value, and the convergence keeps improving as the number of measurements grows toward infinity. There are numerous formulae, simple and more complex, establishing this statistical fact. Some are shown here:
http://en.wikipedia.org/wiki/Law_of_large_numbers
You have stated that there is a limit to convergence, but offered no mathematical proof. Can you defend this mathematically? I would be intrigued to see how you do it.
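
For reference, the convergence barry is citing is usually written as the standard error of the mean, which shrinks without limit as the number of measurements grows, though, as a later comment notes, only the random part of the error behaves this way:

```latex
% Standard error of the mean of n independent measurements,
% each with random-error standard deviation \sigma:
\mathrm{SE}(\bar{x}_n) = \frac{\sigma}{\sqrt{n}} \longrightarrow 0
\quad \text{as } n \to \infty
```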

Agile Aspect
November 8, 2011 11:46 pm

It would be nice to see an image of the “anomaly” where the mean value used to calculate the “anomaly” is the arithmetic mean of only the balloon and satellite data.
At first glance, it appears the mean used to calculate the so-called “anomaly” in Figure 1 may be polluting the results in Figure 2.

November 9, 2011 12:28 am

The mathematical point is that the larger the sample size the greater convergence to the true value. Individual measurements will vary, but the mean will converge on the true value
It doesn’t converge on the true value. It converges on the measured value.
The deeper point here is the unstated assumption that measurement errors are random. Some are, some aren’t. Large sample size does nothing to correct non-random errors.

barry
November 9, 2011 5:15 am

Willis,
the argument you are making is known as reductio ad absurdum. I concede the logic. But in the context of my reply to Gary, it rests on a false premise. My response to the original statement,

Looking at the graphs, we are talking about something around a 0.2°C range between sets. Pamela Gray, November 7, 2011 at 6:43 am, has said it best:
“…let’s round to the nearest whole degree and call anything less noise.”

I didn’t see one other posting that took this on.

was to explain why averaging makes tenths of a degree a valid value, which you have agreed with.

barry
November 9, 2011 5:33 am

Philip,

The deeper point here is the unstated assumption that measurement errors are random.

No one is making this assumption. Non-random errors are tangential to Gary’s query.

barry
November 9, 2011 2:35 pm

Just occurred to me that the power of averaging can be seen in the trends from the paper. Global surface trends are built from many more measurements than zonal (latitudinal) trends, or than trends higher in the atmosphere. We spend so much time picking at the disagreement between data sets that perhaps we overlook the quite amazing agreement between them. In the case of satellite TLT and surface trends, the differences are less than a tenth of a degree C. For statistically significant periods, say 30 years, the differences amount to no more than 5 hundredths of a degree C. That’s a pretty remarkable result when two completely different phenomena are being measured and different methods are employed for all the data sets.

barry
November 9, 2011 7:35 pm

The difference between the trends of the various datasets is, as you say, a tenth of a degree per decade … which is a whole degree per century.

I did not say that the difference in trends was a tenth of a degree per decade. I said, as you quoted me, “the differences amount to no more than 5 hundredths of a degree C,” which is half a degree C per century.
But I was conservative in my rounding – I know that the difference between the means is actually smaller. The decadal linear trend for 4 temperature sets (2 satellite and 2 surface) between Jan 1979 and Dec 2010 is less than 3 hundredths of a degree C different from each other (source).
That amounts to a difference of less than 0.3C a century. And UAH is not calibrated to surface temperature, as Roy Spencer consistently points out. That is remarkable agreement, considering the many different issues troubling both surface and satellite data sets and the relative shortness of the time period (less than a third of a century).

barry
November 9, 2011 7:49 pm

And the deeper you look, the more you can qualify the results. Satellite temperature records are influenced more strongly by El Niño/La Niña patterns than surface records. UAH has steadily converged with the other records, having been an outlier. While it is de rigueur in some parts to diss the official records, time keeps validating their robustness. Even Fall et al corroborated mean trends for the US, and this was a regional study, where you would expect greater variance with different methods if the average temperature record were very flawed. BEST is just another example of corroborating evidence.
The evidence against tends to be anecdotal and highly selective (a handful of stations, or non-random selections), but when the numbers are crunched for large, random samples, people always seem surprised, outraged even, that the results tend to confirm the official records. Not perfectly, of course, but within the error bounds. Eventually one has to recognize this. That doesn’t mean there isn’t plenty of uncertainty to discuss or work to be done, but I think avoiding or downplaying the convergence of these results while critiquing the temp records takes the discussion beyond skepticism, which should balance evidence neutrally rather than directing doubt all in one direction.