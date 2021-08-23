Climate data

The Interpretation of Interpolation

Guest Post by Willis Eschenbach

Over in the comments on a post about a totally different subject, there’s a debate about interpolation for areas where you have no data. Let me give a few examples, with names left off.

Kriging is nothing more than a spatially weighted averaging process. Interpolated data will therefore show lower variance than the observations.

The idea that interpolation could be better than observation is absurd. You only know things that you measure.

I’m not saying that interpolation is better than observation. I’m saying interpolation using locality based approach is better than one that uses a global approach. Do you disagree?

I disagree, generally interpolation in the context of global temperature does not make things better. For surface datasets I have always preferred HadCRUT4 over others because it’s not interpolated.

Once you interpolate you are analysing a hybrid of data+model, not data. What you are analysing then takes on characteristics of the model as much as the data. Bad.

How do you estimate the value of empty grid cells without doing some kind of interpolation?

YOU DON’T! You tell the people what you *know*. You don’t make up what you don’t know and try to pass it off as the truth.

If you only know the temp for 85% of the globe then just say “our metric for 85% of the earth is such and such. We don’t have good data for the other 15% and can only guess at its metric value.”.

If you don’t have the measurements, then you cannot assume anything about the missing data. If you do, then you’re making things up.

Hmmm … folks who know me know that I prefer experiment to theory. So I thought I’d see if I could fill in empty data and get a better answer than leaving the empty data untouched. Here’s my experiment. I start with the CERES estimate of the average temperature 2000 – 2020.

Figure 1. CERES surface temperature average, 2000-2020

Note that the average temperature of the globe is 15.2°C, the land is 8.7°C, and the ocean is 17.7°C. Note also that the Andes mountains, on the left side of upper South America, are visibly much cooler than the rest of the South American land.

Next, I punch out a chunk of the data. Figure 2 shows that result.

Figure 2. CERES surface temperature average with removed data, 2000-2020

Note that average global temperatures are now cooler with the missing data, with the globe at 14.6°C versus 15.2°C for the full data, a significant error of about 0.6°C. Land and sea temperatures are too low as well, by 1.3°C and 0.4°C respectively.
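
The arithmetic behind that drop can be sketched in a few lines of Python. This is only an illustration on a made-up temperature field, not the CERES data, and the names (`temp`, `weights`, `hole`) and the position of the rectangle are assumptions chosen for the example. Cell areas on a regular latitude/longitude grid shrink with the cosine of latitude, so the averages are weighted accordingly.

```python
import numpy as np

# Synthetic 1-degree temperature field (an assumption, NOT the CERES data):
# warm tropics, cold poles, plus a small east-west variation.
lat_deg, lon_deg = np.meshgrid(np.arange(-89.5, 90.0, 1.0),
                               np.arange(0.5, 360.0, 1.0), indexing="ij")
lat = np.deg2rad(lat_deg)
lon = np.deg2rad(lon_deg)
temp = 45.0 * np.cos(lat) ** 2 - 15.0 + 2.0 * np.cos(lat) * np.sin(lon)

# On a regular lat/lon grid, cell area is proportional to cos(latitude).
weights = np.cos(lat)

# Hypothetical rectangular hole in the tropics (20S-20N, 200E-280E).
hole = (np.abs(lat_deg) <= 20) & (lon_deg >= 200) & (lon_deg <= 280)

# Area-weighted mean of everything, then of the cells left after the punch-out.
full_mean = np.average(temp, weights=weights)
masked_mean = np.average(temp[~hole], weights=weights[~hole])

print(f"full: {full_mean:.2f} C   with hole: {masked_mean:.2f} C")
```

Because every cell in the removed rectangle is warmer than the global mean, averaging only what is left has to give a cooler number.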

Next, I use a mathematical analysis to fill up the hole. Here’s that result:

Figure 3. CERES surface temperature average with patched data, 2000-2020

Note that the errors for land temperature, sea temperature, and global temperature have all gotten smaller. In particular, the land error has gone from 1.4°C to 0.1°C. The estimate for the ocean is warm in some areas, as can be seen in Figure 3. However, the global average ocean temperature is still better than just leaving the data out (0.1°C error rather than 0.4°C error).

My point here is simple. There are often times when you can use knowledge about the overall parameters of the system to improve the situation when you are missing data.
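
As one illustration of the idea (and not the method used for Figure 3, which is left unspecified), here is a sketch that fills each missing cell with the average of the known cells at the same latitude. The temperature field is synthetic, the rectangle is hypothetical, and the function name `zonal_fill` is made up for this example.

```python
import numpy as np

# Synthetic 1-degree field (an assumption, not CERES data).
lat_deg, lon_deg = np.meshgrid(np.arange(-89.5, 90.0, 1.0),
                               np.arange(0.5, 360.0, 1.0), indexing="ij")
lat = np.deg2rad(lat_deg)
temp = 45.0 * np.cos(lat) ** 2 - 15.0 + 2.0 * np.cos(lat) * np.sin(np.deg2rad(lon_deg))
weights = np.cos(lat)  # cell area is proportional to cos(latitude)

# Hypothetical hole: 20S-20N, 200E-280E.
hole = (np.abs(lat_deg) <= 20) & (lon_deg >= 200) & (lon_deg <= 280)


def zonal_fill(field, missing):
    """Fill missing cells with the mean of the known cells at the same latitude."""
    known = np.where(missing, np.nan, field)
    row_mean = np.nanmean(known, axis=1, keepdims=True)  # one value per latitude row
    return np.where(missing, row_mean, field)


truth = np.average(temp, weights=weights)
omit_err = np.average(temp[~hole], weights=weights[~hole]) - truth
fill_err = np.average(zonal_fill(temp, hole), weights=weights) - truth

print(f"error when the hole is left out: {omit_err:+.2f} C")
print(f"error after the zonal infill:    {fill_err:+.2f} C")
```

The infill uses one piece of knowledge about the system, namely that temperature depends strongly on latitude. It still misses the east-west variation inside the hole, so the patched answer is closer to the full-data answer but not exact.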

And how did I create the patch to fill in the missing data?

Well … I think I’ll leave that unspecified at this time, to be revealed later. Although I’m sure that the readers of WUWT will suss it out soon enough …

My best wishes to all,

w.

PS—To avoid the misunderstandings that are the bane of the intarwebs, PLEASE quote the exact words that you are discussing.

Curious George
August 23, 2021 10:15 am

This is a misuse of the term “interpolation”. An old adage says: Interpolate at will. Extrapolate at your own peril.

commieBob
Reply to  Curious George
August 23, 2021 10:44 am

If an area is lacking temperature data, it is apparently unpopulated or something like that. Anyway it is somehow different than the areas that have temperature data. In that light, interpolating is probably not warranted.

Lance Flake
Reply to  Curious George
August 23, 2021 11:14 am

Extrapolation is assuming a trend continues past the data in a graph. Interpolation is filling in missing data between surrounding data points. This isn’t a misuse of the term.

Bernie1815
August 23, 2021 10:16 am

How is the land 10C cooler than the ocean?

TonyL
Reply to  Bernie1815
August 23, 2021 10:54 am

Just a guess:
What is warm:
The tropics. Here the ocean far outweighs the land in area. Result – Warm ocean.

What is cold:
Antarctica, 100% land. There is no ocean cold like it. Note that the central area of the continent has an altitude of up to 12,000+ ft. This makes it really cold. The Arctic ocean is cold but not that cold. Result – Cold land.

John Tillman
Reply to  TonyL
August 23, 2021 11:04 am

Also, SSTs are at sea level. Much of the land is high, as with Antarctica.

Russell Klier
August 23, 2021 10:29 am

W, I’m a layman guessing how to fill in the hole. I would find geologically similar features with known numbers and just color them in.

Right-Handed Shark
Reply to  Russell Klier
August 23, 2021 11:17 am

Mickey Mann just called, you got the job!

Robert of Texas
August 23, 2021 10:33 am

You know better than this… Punching out a rectangle and performing the “experiment” one time does not prove anything – it just gives one false confidence that the method works.

The “experiment” should take out random chunks of data that better represent the real case, and be rerun numerous times using different amounts of missing data (as well as random locations and sizes). You should find cases where interpolation works, where it makes no difference, and where it fails to provide a good result. I would imagine that as the ratio of known data to unknown data is reduced the interpolation goes wildly wrong:

Take the boundary conditions: You measure 100% of the area, therefore interpolation accounts for 0% of the data, and your result is as good as it gets. Next, you do not know 100% of the area, so interpolation is 100% guesswork (starting with educated guesses?) and your result is almost certainly wrong (but there is some tiny percent chance it is right). Now think of what happens as you add in 10% actually measured data and rerun – the result gets better each time you add more data in.

Interpolation works if there is enough surrounding data and transitions are smooth. If that is not the case, you are likely going to make the result less certain.

Now add in the process of homogenization and you have a recipe for really screwed up results. Homogenization assumes transitions should be smooth – not lumpy, so it hides things like the UHI effect. Then interpolate and bingo, you (the generic you) just produced extra warming over a larger area. Congratulations – you are ready now to become a data mangler for climate science.

ResourceGuy
August 23, 2021 10:38 am

I hear the Sierra Nevada upland is pleasant this time of year.

Rud Istvan
August 23, 2021 10:46 am

Interpolation works in this case because the CERES data is ‘homogeneous’ in the sense that it is all ‘the same CERES’. Interpolation (aka infilling) doesn’t work well on something like land station meteorological records because they are inherently inhomogeneous. And using anomalies to make their trends comparable does not remove the underlying inhomogeneity.

So the to-and-fro comment arguments cited at the outset lack important context, making them both true or not true depending…

Joel O'Bryan
August 23, 2021 10:49 am

I love the oft used term “synthetic data.”
I love it because it tells me something about the users that they probably don’t realize, in the same way that the use of “carbon pollution” informs me of the user’s analytic skills.

Note: I realize WE did NOT use that term here. But that is what his infilling method did make. I see that term used a lot, though, by Dark Art practitioners of CliSci.

Joel O'Bryan
August 23, 2021 10:54 am

The obvious problem that WE creates with his “punch out” is that he chose a big chunk of the equatorial, solar-heated region.
If he did that extraction on Antarctica (or the Greenland interior), where we really do have very few spatial measurements, the opposite would occur: the average globe would be dramatically warmer, the SH (NH) would warm even more, and the NH (SH) would be unaffected.

Rob_Dawg
August 23, 2021 11:09 am

I don’t mind interpolation as long as the methodology is transparent. Better than the discordance of blank spots in either a chart or map. That said, an interpolation is an END PRODUCT. You cannot use it for further analysis.

bdgwx
August 23, 2021 11:11 am

Thanks for doing the data denial experiment.

For those who are curious, that debate is centered on the HadCRUTv4 vs HadCRUTv5 methods. v4 ignores empty grid cells in its global averaging procedure. v5 uses Gaussian process regression, which is similar to the kriging procedure that Cowtan & Way and Berkeley Earth use, or the locally weighted linear regression approach by Nick Stokes.

Hausfather has some commentary on this as well.

ThinkingScientist
August 23, 2021 11:18 am

Willis,

Well, quite a few of those comments about kriging are mine.

I would suggest also bringing up the comments I made about OK (ordinary kriging), SK (simple kriging), stationarity and declustering as well. They are relevant.

Your example is quite trivial and relatively easy to fix. You are also working in absolute temperatures with that data, I think, whereas HadCRUT4 is anomalies. It makes a difference and affects the stationarity assumption.

For perspective I have tried to add examples of images of the grids of surface Obs (HadCRUT4), starting with Jan 1850. The interpolation challenge is much more problematic in the early part. But it should be noted that in the later time slices (e.g. 1950 and 2000) the issue is that there is no data in the polar regions. This means we are talking about extrapolation, not interpolation.

Jan1850.jpg
ThinkingScientist
Reply to  ThinkingScientist
August 23, 2021 11:19 am

This is observations for 1900:

Jan1900.jpg
ThinkingScientist
Reply to  ThinkingScientist
August 23, 2021 11:19 am

For 1950:

Jan1950.jpg
ThinkingScientist
Reply to  ThinkingScientist
August 23, 2021 11:19 am

And for 2000:

Jan2000.jpg
ThinkingScientist
Reply to  ThinkingScientist
August 23, 2021 11:21 am

Finally, this is the percent of temporal coverage for each grid cell for the period 1850 – 2017. The high percent temporal coverage is basically showing shipping lanes in the sea dating back to Victorian times (an animation of the time evolution is quite interesting).

TimeCoveragePerCent.jpg
Captain climate
August 23, 2021 11:22 am

You’re not adding information. You’re presuming that the best linear unbiased estimator is the midpoint between two locations, probably adjusting for altitude.

BCBill
August 23, 2021 11:23 am

Now repeat for twenty different locations and determine the average effect and sd and then the discussion can begin.
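
A repeated version of the test is easy to sketch. The code below is an illustration only: it uses a synthetic temperature field rather than CERES or station data, a same-latitude ("zonal") infill that is an assumed stand-in for whatever method the post used, and hypothetical helper names (`random_hole`, `zonal_fill`). It punches out a rectangle at twenty random locations and reports the mean and sd of the error in the global average, with and without the infill.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic field (an assumption, not observations) on a 1-degree grid.
lat_deg, lon_deg = np.meshgrid(np.arange(-89.5, 90.0, 1.0),
                               np.arange(0.5, 360.0, 1.0), indexing="ij")
lat = np.deg2rad(lat_deg)
temp = 45.0 * np.cos(lat) ** 2 - 15.0 + 2.0 * np.cos(lat) * np.sin(np.deg2rad(lon_deg))
weights = np.cos(lat)  # cell area is proportional to cos(latitude)
truth = np.average(temp, weights=weights)


def random_hole(height=40, width=80):
    """Mask a height x width block of cells at a random spot (wraps in longitude)."""
    n_lat, n_lon = temp.shape
    top = rng.integers(0, n_lat - height + 1)
    cols = (rng.integers(0, n_lon) + np.arange(width)) % n_lon
    mask = np.zeros(temp.shape, dtype=bool)
    mask[top:top + height, cols] = True
    return mask


def zonal_fill(field, missing):
    """Fill missing cells with the mean of the known cells at the same latitude."""
    known = np.where(missing, np.nan, field)
    return np.where(missing, np.nanmean(known, axis=1, keepdims=True), field)


omit_errs, fill_errs = [], []
for _ in range(20):
    hole = random_hole()
    omit_errs.append(np.average(temp[~hole], weights=weights[~hole]) - truth)
    fill_errs.append(np.average(zonal_fill(temp, hole), weights=weights) - truth)

omit_errs, fill_errs = np.array(omit_errs), np.array(fill_errs)
print(f"omit: mean {omit_errs.mean():+.3f}, sd {omit_errs.std(ddof=1):.3f}")
print(f"fill: mean {fill_errs.mean():+.3f}, sd {fill_errs.std(ddof=1):.3f}")
```

On a field this smooth the infill should come out ahead. That is a property of the toy field, and the more interesting question is how the comparison changes with lumpier data, larger holes, or holes at the edge of the coverage.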

ASTONERII
August 23, 2021 11:25 am

Lowering a fake error by making up data where none exists is not a better result than using known numbers. It just gives a false feeling of accuracy where none exists.

You could have lowered the error bounds similarly by putting 100 degrees C for that entire cell and it would not have made any difference to error range.

Nick Stokes
August 23, 2021 11:32 am

“How do you estimate the value of empty grid cells without doing some kind of interpolation?”

How do you estimate the value of “full” grid cells without doing some kind of interpolation? In any continuum measurement, you only ever have a finite number of sample locations. Any extension beyond those points is some type of interpolation. Grid boundaries are your own construct, so “empty grid cells” are your own creation. They don’t change the main problem.

What is going on in Willis’ calculation is that when the chunk is removed and the average taken, the effect is that the chunk is treated as if it is an average part of the world. That would be cooler, so the average drops. It was a bad estimate; we know the region is warm. When Willis interpolates, he does so from nearby (warm) values. That is a better estimate than just global. The average is much closer to using actual data.

This point of using the best available estimate is poorly understood, as is the fact that “omitting” unknown cells is equivalent to assigning them the average value. HADCRUT used to make this error, and Cowtan and Way (2013) showed that kriging could overcome it. Actually just about any rational interpolation scheme will do that. HADCRUT 5 gets it right, although they still offer a wrong version for those who like that sort of thing.

More here
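
The equivalence between “omitting” cells and assigning them the average can be checked numerically. The sketch below uses random made-up cell values and areas (all names and numbers are assumptions for illustration): the weighted mean of the known cells alone equals the weighted mean of the full grid once every missing cell is set to that known-cell mean.

```python
import numpy as np

rng = np.random.default_rng(1)

values = rng.normal(15.0, 10.0, size=1000)   # hypothetical cell temperatures
weights = rng.uniform(0.1, 1.0, size=1000)   # hypothetical cell areas
missing = rng.random(1000) < 0.15            # roughly 15% of cells unobserved

# Option 1: leave the unobserved cells out of the average.
mean_known = np.average(values[~missing], weights=weights[~missing])

# Option 2: assign every unobserved cell the known-cell mean, then average all cells.
filled = np.where(missing, mean_known, values)
mean_filled = np.average(filled, weights=weights)

print(np.isclose(mean_known, mean_filled))
```

The two numbers agree because adding cells whose value already equals the mean cannot move the mean, whatever their areas.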

