Comparing climate models – Part One

Guest Essay by Geoffrey H Sherrington


This short note was inspired by Viscount Monckton writing on 13th June on WUWT about the leveling of global temperatures since 1997 or so, and the increasing mismatch with a number of climate models.

This graph by Drs Christy & Spencer of UAH is referenced in Anthony’s article ‘The Ultimate “Sceptical Science” Cherry Pick’ of 10th June and will be referred to as ‘the spaghetti graph’.


In the Viscount Monckton article there was a significant contribution by rgbatduke, now elevated to a post titled ‘The “ensemble” of models is completely meaningless, statistically’, whose comments can be read with reference to this spaghetti graph (with thanks to the authors of it).

Basically rgbatduke argued that model methods to create spectra for chemical elements such as carbon had limitations; that there was a history of improvement of models; the average of such successive models was meaningless; they did not succeed without some computational judgment; and even then, they were not as good as the measured result. The same comments should be applied to the various climate model comparisons shown in the spaghetti graph, particularly the meaningless average. Climate models were stated to be far more complex than atomic spectral calculations.

Here is another model graph, one from my files.



The numerical raw data are here, with some early stage statistics derived from the raw data. I make no claims about the number of significant figures carried, the distributions of data, etc. The purpose of this essay is to invite your comments.

The graph has time on the X-axis. For this exercise, the interval does not matter except to note that data are equally spaced in time. The Y-axis has a dimensionless model score and is integral. It can increment by 0, 2 or 4 units at a time (like ‘no change = 0 units’, ‘some change = 2 units’, ‘much change = 4 units’). The dark dots are the arithmetic average of each time slice.

You will see that the first part only of the graph is shown. The object of the exercise is to use the information content of the shown data, to calculate projections of each of the series out to 23 time spans. The correct answer, as derived from experiment, is known. It is with the WUWT team.

If, as you work, you seek more information, then please ask. No reasonable requests refused.

If, as you work, you feel you know the source of the data, please don’t tell the others.

Finally, what is the purpose of all of this? The answer is to try to emulate a simple climate model projection. I do not know if all climate models follow the same routines to get from start to finish, whether they calculate year by year, similar to this example, or if the whole lot goes into a series of matrices that are solved one after another.

However, consider that for this exercise you have 18 models that have each yielded 20 years of data. Let us all see how well you can project to end of year 23. In a week I’ll post the full graph and full set of figures, then I’ll make some comments on your methods of solving this problem.

As will, I hope, a few others.

Repeating: The exercise is to project the data to the end of time period 23.

That’s 3 more slots.


Mike McMillan

Do you have that data as a .csv or .txt file?

David in Cal

I would like to enter the contest, but I don’t understand the data. You say there are 18 different series and you’re looking for a projection for each at time 23. That’s 18 projections in all.
— What’s the relationship among the 18 series?
— Why is the average meaningful?
— Is there any reason to believe that each of the series will grow at comparable rates?
— Would it be reasonable to guess that each series grows sort-of linearly? That is, the series that have grown faster so far will continue to grow faster?
I think I see one point. The total has grown exactly 2 units per time period. It’s tempting to predict that it will continue to do so. However, if the series have nothing to do with each other, then the steady rate of growth of their average is likely just a coincidence.


I don’t see much point in this exercise. Without knowing anything about the “physics” involved in this model it’s hard to expect anything but that the evolution will continue at approximately the same constant rate as in the recorded period. That could of course work, unless the real model comes with some kind of effect which kicks in no sooner than step 21. And that’s what I actually expect to happen; otherwise there would be no point in making this article.
I also don’t see any relation between this exercise and the Spencer/Christy graph and rgbatduke’s post.

Dodgy Geezer

Surely the ‘average of the models’ has some meaning?
It’s not a real physical meaning, of course, corresponding to any physical phenomenon. But it measures something. It measures what the different modelers think they can get away with in their models.
The main problem here, and the obvious stupidity, which is shared with the concept of running ‘consensus surveys’, is that you’re measuring a moving target. Back in the 1990s, a consensus study would probably show a valid 80-90% of papers supporting, and the model predictions would all be steep upward graphs. By now, the model graphs’ gradients have come down a lot.
So a moving average of those gradients would show a trend – a trend from high-pitched alarmism down towards a more justifiable exaggeration….


At first glance I would need 8 possible outcomes for the next 3 time steps of each of the 18 series. Besides giving an 8 value range for time interval 23 for each of the 18 series, this has no predictive skill. Averages or trends have no predictive ability on data generated in this manner.
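The count of eight in the comment above can be reproduced by listing every path a single series can take over three steps when each step adds either 0 or 4. This is only a minimal sketch; the function name `enumerate_outcomes` is made up for illustration and is not part of the essay. It shows that the eight paths collapse to four distinct total gains, and that allowing the 2-unit step as well gives 27 paths.

```python
from collections import Counter
from itertools import product


def enumerate_outcomes(steps=3, allowed=(0, 4)):
    """Return every increment path and a tally of the resulting total gains."""
    paths = list(product(allowed, repeat=steps))
    gains = Counter(sum(path) for path in paths)
    return paths, gains


paths, gains = enumerate_outcomes()
print(len(paths))              # number of distinct paths
print(sorted(gains.items()))   # total gain -> number of paths reaching it
```

Counting paths per outcome like this is also the basis of the "naive path-number weighting" mentioned further down the thread.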


“It can increment by 0, 2 or 4 units at a time”

Every number in your table is a multiple of 4, which kind of contradicts the above.


and there are no steps of 8, so this appears to be a simple binary walk multiplied by 4. Why bother with the scaling?

Bloke down the pub

Mike McMillan says:
June 20, 2013 at 12:30 am
‘Good-bye, and thanks for all the fish.’

Bloke down the pub

“The Y-axis has a dimensionless model score and is integral. It can increment by 0, 2 or 4 units at a time (like ‘no change= 0’, ‘some change =2 units’, much change = 4 units.’)”
I don’t see any increments of 2 in the data. Am I being blind or is that an oddity?


Why not use median and quartiles?

Geoff Sherrington

Dodgy Geezer says: June 20, 2013 at 1:13 am “Surely the ‘average of the models’ has some meaning?”
Well, it’s not a straight line.
steveta_uk says: June 20, 2013 at 1:33 am “It can increment by 0, 2 or 4 units at a time. Every number in your table is a multiple of 4, which kind of contradicts the above.”
Not necessarily. The 2 score might not have appeared yet. Please see below.
David in Cal says: June 20, 2013 at 12:54 am “What’s the relationship among the 18 series?”
Each point can increment positively by 0, 2, or 4. No negative values.
It is obvious that some series grow faster than others. Looking ahead, one might assume that the probability of a change in growth rate for a series is low. But the average grows in rather constant steps. Note that each new point in a series is pegged to the last. That is a constraint on prediction; it makes it likely that the numbers do not suddenly diverge or converge. Using these three guides, can we try to predict the 18 values in the next time slot? Then the next and the next?
These are real data taken from an actual experiment. They are not generated by any algorithm. They are a measure of an energy, if that helps, but it is a ranked measure. I have other similar tables, some in which the 2 growth option happens more often. I am leaving space and conditions open in case a point can be illustrated by using another set.

Geoff Sherrington

Data as .csv, comma delimited.

X Anomaly

Here is what I did:
I subtracted column 1 from column 20, divided by 20, then multiplied by 3, to establish a rough approximation of the rate for each series (Column 23 =(((U2-B2)/20)*3)+U2). I then filled in columns 21 and 22 with data greater than or equal to Column 20, but not greater than Column 23, using only increments of 0 or 4. Since I generated some odd numbers in Column 23, these are rounded to the nearest even number, whilst trying to maintain increments of 4.
The average for Column 23 is 43 (rounded to 44).
I believe my method is a sufficient approximation of the data and was easy, and I look forward to comparing in part 2.
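The spreadsheet formula quoted above translates directly into a few lines of code. The sketch below is only an illustration: the function name `project_to_23` and the two example series are hypothetical, not taken from the essay's data file.

```python
def project_to_23(first, last, periods=20, extra_steps=3):
    """Spreadsheet formula ((U2 - B2) / 20) * 3 + U2 from the comment above."""
    rate = (last - first) / periods      # rough growth per time step
    return last + rate * extra_steps     # carry that rate forward 3 steps


# Hypothetical (first, last) pairs for illustration only; not the essay's data.
examples = {"fast": (4, 60), "slow": (0, 8)}
for name, (first, last) in examples.items():
    print(name, round(project_to_23(first, last), 1))
```

The result is generally not a multiple of 2 or 4, which is why the commenter then rounds and fills in columns 21 and 22 by hand.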

Geoff Sherrington

X Anomaly
Thank you. What I am seeking is a score for each series at the end, rather than an average. Or, failing that, a final ranking of the order of the series from highest to lowest. Geoff.

Mike McMillan

Geoff Sherrington says: June 20, 2013 at 3:19 am
Data as .csv, comma delimited.
Thank you sir.
Just like on the SAT, only not.

David L. Hagen

Do I understand correctly that the black line is the data, and the others are from models trying to emulate the data?
F. Fred Singer notes published variations by an order of magnitude between runs of the same GCM. He finds that about 400 model-years of runs are required to reduce much of the chaotic variation, e.g. 20 runs for a 20 year model (40 runs for 10 years, 10 runs for 40 years, etc.). However, most IPCC models show only one run or are averaged over a few runs, 5 at the most.
Are we to assume that each line is an individual run?
Or are they averaged over multiple runs?
See Overcoming Chaotic Behavior of Climate Models

X Anomaly

Column 23

Geoff Sherrington

Thank you for the several last posts.
I think that you might need to review the .csv download as I have been having problems with back compatibility. Sorry for the inconvenience. Geoff.

Geoff Sherrington

What I see is individual values rising to some then nearly continuous figure. Actual height differs for each trace and the data is poorly quantised around those lines.
Close to the underlying physical behaviour?


Series   Value   BE      ML
Number   t=20    t=23    t=23
1        56      64.2    64
2        60      68.8    68
3        60      68.8    68
4        60      69.5    70
5        52      59.6    60
6        48      55.6    56
7        44      50.3    50
8        48      55.6    56
9        40      46.3    46
10       40      45.7    46
11       44      50.3    50
12       36      41.7    42
13       28      31.8    32
14       20      22.5    22
15       20      23.2    24
16       12      13.9    14
17       8       9.3     10
18       8       9.3     10
Value at t=20 is from the supplied table
BE at t=23 is the best estimate at time step t=23
ML is the most likely value at time step t=23

David L.

If it was possible to guess at a model with faulty and/or poor understanding of the underlying mechanisms and then project that into the future with any degree of accuracy, these guys wouldn’t be fooling around with climate models but with stock market models.
The forces behind the climate are clearly not understood. Therefore no model can possibly have any accuracy in extrapolation. These models aren’t even good at interpolating the real observed data.

“And my posts have shown and will continue to show that the climate models, using those model means, show no skill at hindcasting. If they show no skill at hindcasting, there’s no reason to believe their projections of future climate.”
“No skill” is a technical term. Each model has skill.
The simple fact is you can measure the skill of every model. You can measure the skill of the ensemble.
It matters little whether the ensemble mean has statistical meaning because it has practical skill, a skill that is better than any individual model. The reason why the ensemble has more skill is simple. The ensemble of models reduces weather noise and structural uncertainty.

David in Cal

It’s dangerous to project a trend when you don’t know what the numbers represent. E.g., guess the next number in the series: 46, 62, 73, 80, 47, 48, 53, 42…
These figures are the price of Bank of America stock at the beginnings of the years 2001 – 2008. The next figure in the series turned out to be 13.

It looks to me that you are manually fitting the “data” to a Fourier-type harmonic series with no negative signs (no negative feedbacks). You have 360 “data points”. You use 184 degrees of freedom in calculating the averages and standard deviations. One should not expect to get any meaningful results because the model is wrong to begin with and there are not enough “data” to establish any statistical significance. You would be better able to get “predictability” by statistically fitting the “real” data to a Fourier-type harmonic series with the possibility of both positive and negative signs, using only the harmonics that have statistical significance.

David vun Kannon

@David L Hagen – Thanks for the link to Fred Singer’s paper. His 400 model year limit is interesting. I compared with
(top of p 491, col 2)

. If all ensemble members are considered, then nearly 2,800 yr on average were simulated per CMIP3 model, but the total years varied substantially from one model to another (500–8,400 yr with a median of 2,200 yr). The total amount of CMIP3 model data archived was about 36 TB. In contrast for CMIP5, the long-term and near-term core experiments alone call for, at minimum, ~2,300 yr, approximately matching the number of years in CMIP3.

Based on the above, it would seem that all CMIP3 model submissions met Singer’s limit, and that all CMIP5 model submissions far exceed it.


Steven Mosher says:
June 20, 2013 at 8:07 am
“The ensemble of models reduces weather noise and structural uncertainty. ”
Steven, I detest assertions like this without evidence. IMHO it is substantially untrue of the ensemble of climate models being discussed. The main purpose of such a statement would seem to be to deflect attention from what causes GCMs to vary and the weaknesses that such attention would bring to light. It’s a form of arguing from authority (using the models as an authority) that as a statistician I’d think you’d shy away from.

Lance Wallace

Attached is an Excel file estimating values for steps 21-23 in series a1-a18. The slopes of each series were calculated using the first 20 values. These slopes ranged from 3.1 down to less than 1. Following instructions that only steps of 0, 2, or 4 are allowed, the slopes were rounded to the nearest of these three values. So the slope of 3.1 was rounded to 4, 2.9 was rounded to 2, and anything <1 was rounded to zero. These values were then added to each series for the next 3 steps.
Of course, other approaches using the slopes could have been selected. For example, instead of rounding each slope first and then using that value to add in steps 21-23, each series could have been extended the three steps using the calculated slope. This would have resulted in slightly uneven additions. For example, the 2.9 slope was rounded to 2 and resulted in adding a total of 6 units over the 3 new time steps. But if the slope had been applied before rounding, it would have led to a total of 8 units over the 3 new time steps.
The original series looked a lot like a cumulative series. The Dropbox file includes a first difference dataset. This dataset consists of all 0s and 4s, as noted by other commenters. The sum across each time step was 36 units (i.e., 9 "quanta" parceled out to the 18 series) except for times 11, 12, and 13, when there were only 24 units available. Possibly there was a clue there but I didn't see anything obvious to guide me.
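The difference between the two rounding orders described in this comment is easy to check numerically. The sketch below is an illustration only: the function names are hypothetical, and rounding the extended total to a multiple of 2 is an assumption about how "apply the slope, then round" would be done.

```python
ALLOWED_STEPS = (0, 2, 4)


def nearest_allowed_step(slope):
    """Snap a fitted slope to the closest permitted increment (ties go low)."""
    return min(ALLOWED_STEPS, key=lambda step: abs(step - slope))


def round_then_extend(slope, steps=3):
    """Round the slope first, then add it for each new time step."""
    return nearest_allowed_step(slope) * steps


def extend_then_round(slope, steps=3, unit=2):
    """Extend with the raw slope, then round the total gain to a multiple of unit."""
    return round(slope * steps / unit) * unit


for slope in (3.1, 2.9, 0.9):
    print(slope, round_then_extend(slope), extend_then_round(slope))
```

For the 2.9 slope this gives 6 units when rounding first and 8 units when extending first, matching the figures in the comment.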

Lance Wallace

Sorry, didn’t check my work. Here is the corrected output.
The graph in the Excel file includes steps 21-23.
The choice of rounding the slope first and then adding the rounded values to steps 21-23 resulted in having only 32 units to distribute among the 18 models instead of the 36 available in most of the time steps. Had I calculated the exact extensions of the slope through steps 21-23 and THEN rounded, I would have predicted somewhat different levels for the series and would have had a different number of units to distribute.

” . . .the reason why the ensemble has more skill is simple.”
Maybe I don’t understand the technical term “skill”, but it appears that the ensemble average has _less_ skill than the individual models that are below the average and closer to the actual measured temperature observations.

Doug Proctor

Impossible to see: is there any model run that takes us from where we are today to the Scenario A end at 2100, with a starting point of today’s observation?


Here is my guess.
21 22 23
1 56 60 64
2 64 68 72
3 64 68 72
4 64 68 72
5 52 56 56
6 48 52 56
7 48 52 52
8 48 48 52
9 44 48 52
10 44 44 48
11 44 44 44
12 36 40 40
13 32 32 32
14 24 24 24
15 20 20 20
16 12 16 16
17 8 8 12
18 8 8 8


A couple of questions about this “puzzle” / “test”:
In the preface of this article you described the Y-axis as the “Dimensionless Model Score”; later you wrote (June 20, 2013 at 2:55 am), “These are real data taken from an actual experiment.” Which is it: “normalized” model output or experimental data… I’m confused!!
Since this problem/test appears very ill-posed (and thereby frustrating). . . I’ll simply presume that the latter is correct; i.e., these are the results from 18 experiments. So now the questions are:
(1) Were all 18 experiments carried out independently?
(2) Is the underlying “physical” state (at step 1) influenced by the state at steps <= 0?
(3) Are " the 18 experiments" truly replicates. . . in that they share nominally equivalent process/design variables and other initial conditions?

son of mulder

This is impossible, here is my solution.

Geoff Sherrington

DanMet’al says: “Were all 18 experiments carried out independently” etc.
Good questions, thank you. For the intent of this article, the series are independent. In reality, there are pairs of series involved but that does not affect the outcome that will be discussed early next week.
All series are zero at time =0.
Yes, to a good enough approximation, all share nominally equivalent process/design variables and other initial conditions.
Those who have entered score might help me by putting the series number next to the 23 score, to save me some colour matching.
Arnost says: June 20, 2013 at 4:42 am “Here’s my guess.”
Arnost, can you please describe briefly why you chose series 14 and not series 15 for the final 2 point increment? Thanks Geoff.


For Geoff Sherrington:
Thanks for your response; but a couple more questions/comments.
(1) Were the 18 experiments run simultaneously or in succession (i.e., is t = n the same clock time for each run or not)?
(2) If experiments were run in succession, were the experimental results listed in run order (1-18)?
These questions are motivated by the observation that the first 10 time steps exhibit 9 state transitions among the 18 runs, time steps 11-13 show 6 transitions, and then steps 14-20 revert to 9 transitions. Yet the frequency/probability of transition drops significantly as the run number increases. What confuses me is: if all experiments are identical (except for the “pair of series” you mentioned), my engineering mind wants to reject the likelihood of the correlation between “Y” and run order as shown in your spreadsheet. Also, the 9-6-9 column totals for state transitions make me question the assertion that there is no algorithm involved in the construction of these 18 series. . . the pattern seems too good to be real!
No complaint though, . . . just figuring that if I ask enough questions and stir the pot, maybe you’ll let enough slip, so that even I can get to the bottom of this puzzle!!
Thanks for the fun.


Sherro says: June 20, 2013 at 5:37pm
Arnost, can you please describe briefly why you chose series 14 and not series 15 for the final 2 point increment? Thanks Geoff.
As you may have guessed I had a very complex numerical solution to the problem …
For what it’s worth: I found series 15 a bit of a dog whilst series 14 was far more powerful [excess kurtosis ].
Logically this had to follow as series 14 had correlation coefficient of 1 with series 12 in the 23rd iteration [once you grabbed the (highly skewed distribution) tail of that beast figuratively speaking]


“Energy” you say, hmm …

21 22 23
1 60 64 68
2 64 64 68
3 60 64 68
4 60 64 68
5 52 56 56
6 52 52 56
7 48 52 52
8 52 56 60
9 40 44 46
10 40 44 44
11 44 44 44
12 36 38 42
13 32 32 32
14 20 22 22
15 20 20 20
16 12 16 16
17 8 12 12
18 8 8 10


… or paint drying 🙂

-- 21 22 23
01 60 64 66
02 64 66 70
03 60 64 68
04 60 64 68
05 52 56 56
06 48 52 56
07 44 48 50
08 48 52 56
09 40 44 44
10 40 44 44
11 46 46 46
12 36 38 42
13 32 32 32
14 20 24 24
15 20 22 22
16 12 16 16
17 8 10 10
18 8 8 10


Series   Best estimate   Iterative
1        64.40           68
2        69.00           72
3        69.00           72
4        69.00           72
5        59.80           58
6        55.20           54
7        50.60           50
8        55.20           54
9        46.00           46
10       46.00           46
11       50.60           50
12       41.40           42
13       32.20           34
14       23.00           22
15       23.00           22
16       13.80           14
17       9.20            8
18       9.20            8
Column 2 = Best estimate
Column 3 = Iterative, using steps 0 or 2 or 4
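The commenter does not say how the best-estimate column was derived, but the numbers are consistent with a straight line through the origin, that is, the t=20 value multiplied by 23/20. The sketch below checks that reading; the t=20 values are the ones listed by an earlier commenter, and the function name `scale_from_origin` is hypothetical.

```python
# Values at t = 20 as listed by an earlier commenter, series 1 to 18.
T20 = [56, 60, 60, 60, 52, 48, 44, 48, 40, 40, 44, 36, 28, 20, 20, 12, 8, 8]

# "Best estimate" column from the table above.
BEST_ESTIMATE = [64.4, 69.0, 69.0, 69.0, 59.8, 55.2, 50.6, 55.2, 46.0,
                 46.0, 50.6, 41.4, 32.2, 23.0, 23.0, 13.8, 9.2, 9.2]


def scale_from_origin(value, t_now=20, t_target=23):
    """Extend a straight line through (0, 0) and (t_now, value) to t_target."""
    return value * t_target / t_now


scaled = [round(scale_from_origin(v), 1) for v in T20]
matches = all(abs(s - b) < 1e-9 for s, b in zip(scaled, BEST_ESTIMATE))
print(matches)
```

This works as a first guess because all series start at zero at time 0, but it assumes each series keeps its average rate to date.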

Geoff Sherrington

Answers: Each column is covered by the same time period.
The experiments are listed by final ranking. There’s a clue for you.
Good luck Geoff.
Answer on Monday, I hope.


Here are some potentially salient facts I’ve noticed about the data set presented which would have bearing on the projections.
First is as follows: while at first glance the data looks like a progression of coin flips, (as DanMet’al pointed out above) for all time steps except 10-11, 11-12, and 12-13, each time step consisted of exactly nine models incrementing by four and nine models remaining flat. For the three time steps mentioned, each time step consisted of six models incrementing by four and the remaining twelve staying flat. This is too consistent to be random chance, indicating that there is a strong anti-correlation between certain models and certain others.
Next, the anti-correlations are not pair-wise; there is one model which is still zero at time step 10, but no model which has reached 40 at that time step.
Geoff mentioned the word “energy”, which means that this growth behavior may be indicative of some sort of conservation law: that is, the total growth of the overall system of models must equal some specified input: 36 for each of the “large growth” timesteps, 24 for each of the “small growth” time steps.
Likewise, Geoff indicated that for individual models a mid-range growth value of two was possible, although not observed in his data set, indicating that it is very low probability – it is probably safe to assume that, unless the overall system has changed character after time step 20, we are unlikely to see an increment of 2 in the next three time steps.
The problem here is that the system exhibits at least two growth states – that exhibited between time steps 10-13, and that exhibited for all other time steps – and insufficient data to estimate a periodicity for that state change. There was a ten time-step period between the beginning of the model and the onset of the first state change, which naively indicates that we shouldn’t expect the onset of the next state change until at least time step 24, but there is no reason to believe that this state change has a simple periodic behavior.
Basically, as of this point, the problem is under-determined. (Of course, that’s always true; extrapolation from any series of data is impossible unless we have some outside indication as to what those data are supposed to represent and how to construct the appropriate basis elements for forming the series. The set of integer-valued functions on the space of integers is infinite.)
Given the rules you laid out (including knowledge of the final ordering), I could come up with a reasonably small parameter space of allowed final values of the system as a whole based on the continuation of the “36-energy” state or the onset of the “24-energy” state at either the 21st, 22nd, or 23rd time step. I could even estimate the likelihood of each given element within the parameter space by counting the number of paths to reach that element in comparison to the total number of allowed paths. I could do further weighting by estimating correlations between specific models, and three-model correlations, and four-model correlations, and so forth and so on – or correlations based on time-shifted “wiggle matching,” or combinations thereof, or several other possible techniques. I don’t see any particular reason to think that these further weightings are meaningful, though – cursory inspection of the data set seems to indicate that, other than the overall increment law, any further correlations between models are likely to be spurious and thus the best weighting is the naive path-number weighting.
For that matter, despite having taken the time to write up how I would go about producing an estimate, it seems like too much effort to actually do so for what amounts to a toy problem. Geoff’s point about the difficulty of forecasting is well-made, and I have explained here how I would go about the process. That seems more important than actually doing it.
As a final note, given that the system has two states, what reason is there to believe that it does not have more than two states? This goes back to the “under-determined” issue – in the absence of evidence we are forced to assume evidence of absence. Based on the utility as regards “proving a point,” however, I strongly suspect that we are in fact looking at a three state system with a third state which onsets at time-step 21. My reasoning for this is purely psychological – if I were setting up a toy problem to demonstrate the pitfalls of forecasting, that is exactly what I would do – but it reinforces my decision not to attempt to provide a forecast myself.


So, I lied. I was intrigued enough to write a quick and dirty Monte Carlo sim based on my analysis above. Here are the results (each list is the approximate expectation value – computed by averaging 1000 valid runs in the Monte Carlo sim – and is given in model order, 1-18, at time step 23).
For the case where the system remains in the high energy state through time step 23:
66.152, 65.42 , 64.572, 63.224, 58.584, 54.732, 52.26, 51.192, 48.468, 47.188, 46.328, 41.636, 34.128, 27.66, 24.688, 18.644, 14.824, 12.3
For onset of low energy at time-step 22:
65.54, 64.796, 64.068, 62.832, 57.744, 54., 51.544, 50.508, 47.704, 46.524, 45.828, 41.092, 33.568, 26.692, 23.94, 17.952, 14.096, 11.572
Onset at 21:
65.016, 64.188, 63.392, 62.16, 57.144, 53.152, 50.82, 49.796, 47.128, 45.96, 45.216, 40.372, 32.728, 25.808, 23.348, 17.324, 13.376, 11.072
Onset at 20:
64.42, 63.452, 62.66 , 61.424, 56.44, 52.516, 50.044, 49.18 , 46.476, 45.208, 44.624, 39.912, 32.024, 25.248, 22.596, 16.596, 12.676, 10.504
Averaging over the four cases, treating each as equally probable:
65.282, 64.464, 63.673, 62.41, 57.478, 53.6, 51.167, 50.169, 47.444, 46.22, 45.499, 40.753, 33.112, 26.352, 23.643, 17.629, 13.743, 11.362
Estimating a “most likely” scenario by rounding each value in the average to the nearest multiple of 4:
64, 64, 64, 64, 56, 52, 52, 52, 48, 48, 44, 40, 32, 28, 24, 16, 12, 12
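The last step of this comment, rounding each averaged value to the nearest multiple of 4, can be reproduced as follows. This is a minimal sketch of that rounding step only, not of the Monte Carlo simulation itself; the name `snap_to_multiple` is made up for illustration.

```python
# Averaged expectation values from the comment above, series 1 to 18.
AVERAGED = [65.282, 64.464, 63.673, 62.41, 57.478, 53.6, 51.167, 50.169,
            47.444, 46.22, 45.499, 40.753, 33.112, 26.352, 23.643, 17.629,
            13.743, 11.362]


def snap_to_multiple(value, unit=4):
    """Round to the nearest multiple of unit."""
    return round(value / unit) * unit


most_likely = [snap_to_multiple(v) for v in AVERAGED]
print(most_likely)
```

Rounding each series independently does not by itself guarantee that the total still matches a 36- or 24-unit budget per time step.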

Tim Clark

# 9, #9, #9……


So, a quick check via “hindcasting” shows that my simple model-of-models has insufficient dispersion. Starting from zero, at time step 20 I’m seeing an expected value of ~54.5 for the fastest growing model and an expected value of 21.7 for the slowest. This is compared with a max of 60 and a min of 8 for the test data.
This indicates that choosing models to increment at random, subject to the constraints I mentioned above, is not sufficient to approximately reproduce the real behavior, indicating that there is some sort of memory behavior or other weighting system driving the fastest growing lines to consistently grow more quickly than the slowest lines.
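A hindcast of this kind can be sketched in a few lines. The commenter's own code is not shown, so what follows is an assumption about the simplest version: at every time step a fixed number of series (9, or 6 in steps 11 to 13) is picked uniformly at random and each picked series gains 4 units. All names here are hypothetical.

```python
import random

SERIES = 18
QUANTUM = 4
# Quanta handed out per time step: 9 in most steps, 6 in steps 11 to 13.
SCHEDULE = [6 if step in (11, 12, 13) else 9 for step in range(1, 21)]


def hindcast_once(rng):
    """Grow 18 series from zero, picking the winners uniformly at random."""
    values = [0] * SERIES
    for quanta in SCHEDULE:
        for i in rng.sample(range(SERIES), quanta):
            values[i] += QUANTUM
    return values


def mean_envelope(runs=1000, seed=0):
    """Average the largest and smallest series value at t = 20 over many runs."""
    rng = random.Random(seed)
    top = bottom = 0.0
    for _ in range(runs):
        values = hindcast_once(rng)
        top += max(values)
        bottom += min(values)
    return top / runs, bottom / runs


print(mean_envelope())
```

Comparing the printed envelope with the test data's maximum of 60 and minimum of 8 is the dispersion check described in the comment.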


So, given that the simple model failed, new model time!
We’ll call this one the “if I’m bigger I eat more” model, which is actually a one-parameter family of models where the probability for a specific line to increment is determined by the availability of food (AKA, the total energy provided, in blocks of 4) and the size of that model in comparison to the total size of the system. The parameter is the “base” or initial size, because otherwise the probabilities are ill-defined at time step zero (and, after time step zero, any line with a size of zero will never eat).
So, schematically, the likelihood that a model increments is ((size + a)/(total size + 18*a))*(food supply), where a is our single parameter. To make sure that all of the food is consumed at each time step, we’ll work our way from largest to smallest, adjusting probabilities based on remaining food and the total size of the portion of the system that hasn’t had a chance to eat yet.
Actually, I’m not going to bother to code this model up, because I can already see where this is headed: I can control dispersion based on the parameter a (large a makes this model indistinguishable from the naive model; small a means only the big guys ever get to eat), but then I’d start testing for better fits at each time step, or testing the dispersion in more detail than just looking at the envelope, and for each such discrepancy I found I’d need to add a parameter to fix it. With enough parameters, I could make the fit perfect – all without ever considering whether or not my initial guess for the “big eats” type of model was correct.
In other words, I’d be doing Climate Modelling! Yay!
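For readers who want to experiment with the "bigger eats more" rule anyway, here is one possible reading of it. This is a sketch under assumptions, not the commenter's code (none was written): probabilities are capped at 1, a series is forced to eat when the remaining food equals the number of series still waiting, and `a` must be positive. The names `feed_step` and `run` are hypothetical.

```python
import random

QUANTUM = 4


def feed_step(sizes, food, a, rng):
    """Hand out `food` quanta for one time step, largest series first."""
    order = sorted(range(len(sizes)), key=lambda i: sizes[i], reverse=True)
    remaining_total = sum(sizes)
    new_sizes = list(sizes)
    for position, i in enumerate(order):
        left = len(order) - position          # series that have not eaten yet
        if food >= left:
            eats = True                       # every remaining series must eat
        else:
            share = (sizes[i] + a) / (remaining_total + left * a)
            eats = rng.random() < min(1.0, share * food)
        if eats:
            new_sizes[i] += QUANTUM
            food -= 1
        remaining_total -= sizes[i]
    return new_sizes


def run(steps=20, food=9, a=1.0, series=18, seed=0):
    """Grow all series from zero with a constant food supply per step."""
    rng = random.Random(seed)
    sizes = [0] * series
    for _ in range(steps):
        sizes = feed_step(sizes, food, a, rng)
    return sizes


print(sorted(run(), reverse=True))
```

Varying `a` changes the spread between the largest and smallest series, which is exactly the tuning knob the comment warns about.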


I was unable to identify an approach that might lead to a unique solution to this problem. So for what it’s worth . . . not much really . . . I merely examined the experimental result matrix, looking for patterns, calculated some simple statistics, invoked the “hints” as I understood them, and then flat out guessed the step 21-23 results. My final result is not unique and beyond this I highly suspect it is wrong. . . because there is too much here I realize I just don’t know (e.g. what governs the occurrence of a +2 step).
1: 68
2: 68
3: 68
4: 68
5: 60
6: 56
7: 56
8: 56
9: 48
10: 48
11: 44
12: 40
13: 32
14: 24
15: 20
16: 16
17: 12
18: 8