An online and open exercise in stylometry/textometry: Crowdsourcing the Gleick "Climate Strategy Memo" authorship

Tonight, a prescient prediction made on WUWT shortly after Gleick posted his confession has come true in the form of DeSmog blog making yet another outrageous and unsupported claim in an effort to save their reputation and that of Dr. Peter Gleick as you can read here: Evaluation shows “Faked” Heartland Climate Strategy Memo is Authentic

In a desperate attempt at self-vindication, the paid propagandists at DeSmog blog have become their own “verification bureau” for a document they have no way to properly verify. The source (Heartland) says it isn’t verified (and a fake) but that’s not good enough for the Smoggers and is a threat to them, so they spin it and hope the weak-minded regurgitators retweet it and blog it unquestioned. They didn’t even bother to get an independent opinion. It seems to be just climate news porn for the weak-minded Suzuki followers upon which their blog is founded. As one WUWT commenter (Copner) put it – “triple face palm”.

Laughably, the Penn State sabbaticalized Dr. Mike Mann accepted it uncritically.

Twitter / @DeSmogBlog: Evaluation shows “Faked” H …

Evaluation shows “Faked” Heartland Climate Strategy Memo is Authentic bit.ly/y0Z7cL  – Retweeted by Michael E. Mann

Tonight in comments, Russ R. drew attention to a prediction he posted two days ago:

I just read Desmog’s most recent argument claiming that the confidential strategy document is “authentic”. I can’t resist reposting this prediction from 2 days ago:

Russ R. says:

February 20, 2012 at 8:49 pm

Predictions:

1. Desmog and other alarmist outfits will rush to support Gleick, accepting his story uncritically, and offering up plausible defenses, contorting the evidence and timeline to explain how things could have transpired. They will also continue to act as if the strategy document were authentic. They will portray him simultaneously as a hero (David standing up to Goliath), and a victim (an innocent whistleblower being harassed by evil deniers and their lawyers).

2. It will become apparent that Gleick was in contact with Desmog prior to sending them the document cache. They knew he was the source, and they probably knew that he falsified the strategy document. They also likely received the documents ahead of the other 14 recipients, which is the only way they could have had a blog post up with all the documents AND a summary hyping up their talking points within hours of receiving them.

3. This will take months, or possibly years to fully resolve.

Russ R. is spot on, except maybe for number 3, and that’s where you WUWT readers and crowdsourcing come in. Welcome to the science of stylometry / textometry.

Since DeSmog blog (which is run by a Public Relations firm backed by the David Suzuki foundation) has no scruples about calling WUWT, Heartland, and skeptics in general “anti-science”, let’s use science to show how they are wrong. Of course the hilarious thing about that is that these guys are just a bunch of PR hacks, and there isn’t a scientist among them. As Megan McArdle points out, you don’t have to be a scientist to figure out the “Climate Strategy” document is a fake; common sense will do just fine. She writes in her third story on the issue: The Most Surprising Heartland Fact: Not the Leaks, but the Leaker

… a few more questions about Gleick’s story:  How did his correspondent manage to send him a memo which was so neatly corroborated by the documents he managed to phish from Heartland?
How did he know that the board package he phished would contain the documents he wanted?  Did he just get lucky?

If Gleick obtained the other documents for the purposes of corroborating the memo, why didn’t he notice that there were substantial errors, such as saying the Kochs had donated $200,000 in 2011, when in fact that was Heartland’s target for their donation for 2012?  This seems like a very strange error for a senior Heartland staffer to make.  Didn’t it strike Gleick as suspicious?  Didn’t any of the other math errors?

So, let’s use science to show the world what the common sense geniuses at DeSmog haven’t been able to do themselves. Of course I could do this analysis myself, and post my results, but the usual suspects would just say the usual things like “denier, anti-science, not qualified, not a linguist, not verified,” etc. Basically as PR hacks, they’ll say anything they can dream up and throw it at us to see if it sticks. But if we have multiple people take on the task, well then, their arguments won’t have much weight (not that they do now). Besides, it will be fun and we’ll all learn something.

Full disclosure: I don’t know how this experiment will turn out. I haven’t run it completely myself. I’ve only familiarized myself enough with the software and science of stylometry / textometry to write about it. I’ll leave the actual experiment to the readers of WUWT (and we know there are people on both sides of the aisle that read WUWT every day).

Thankfully, the open-source software community provides us with a cross-platform open source tool to do this. It is called JGAAP (Java Graphical Authorship Attribution Program). It was developed for the express purpose of examining unsigned manuscripts to determine a likely author attribution. Think of it like fingerprinting via word, phrase, and punctuation usage.

From the website main page and FAQs:

JGAAP is a Java-based, modular, program for textual analysis, text categorization, and authorship attribution i.e. stylometry / textometry. JGAAP is intended to tackle two different problems, firstly to allow people unfamiliar with machine learning and quantitative analysis the ability to use cutting edge techniques on their text based stylometry / textometry problems, and secondly to act as a framework for testing and comparing the effectiveness of different analytic techniques’ performance on text analysis quickly and easily.

What is JGAAP?

JGAAP is a software package designed to allow research and development into best practices in stylometric authorship attribution.

Okay, what is “stylometric authorship attribution”?

It’s a buzzword to describe the process of analyzing a document’s writing style with an eye to determining who wrote it. As an easy and accessible example, we’d expect Professor Albus Dumbledore to use bigger words and longer sentences than Ronald Weasley. As it happens (this is where the R&D comes in), word and sentence lengths tend not to be very accurate or reliable ways of doing this kind of analysis. So we’re looking for what other types of analysis we can do that would be more accurate and more reliable.

Why would I care?

Well, maybe you’re a scholar and you found an unsigned manuscript in a dusty library that you think might be a previously unknown Shakespeare sonnet. Or maybe you’re an investigative reporter and Deep Throat sent you a document by email that you need to validate. Or maybe you’re a defense attorney and you need to prove that your client didn’t write the threatening ransom note.

Sounds like the perfect tool for the job. And, best of all, it is FREE.
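To make the “fingerprinting” idea concrete, here is a minimal Python sketch of one such comparison: build a character n-gram frequency profile for each text, then rank the known samples by their Canberra distance to the unknown one (nearest first). This is only an illustration of the general technique, not JGAAP’s own code; the function names (char_ngrams, canberra, rank_candidates) and the whitespace/case normalization are assumptions made for the example, so the scores will not match what JGAAP reports.

```python
from collections import Counter


def char_ngrams(text, n=2):
    """Return the relative frequency of each character n-gram in text.

    Lowercasing and collapsing whitespace is a choice made for this
    sketch; it is not necessarily what any particular tool does.
    """
    text = " ".join(text.lower().split())
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()} if total else {}


def canberra(p, q):
    """Canberra distance between two sparse, non-negative profiles."""
    dist = 0.0
    for g in set(p) | set(q):
        a, b = p.get(g, 0.0), q.get(g, 0.0)
        # g occurs in at least one profile, so a + b is never zero here.
        dist += abs(a - b) / (a + b)
    return dist


def rank_candidates(unknown, known, n=2):
    """Rank known samples by distance to the unknown text.

    known is a list of (author_label, text) pairs. The result is a list
    of (distance, author_label) tuples, smallest distance first.
    """
    target = char_ngrams(unknown, n)
    return sorted((canberra(target, char_ngrams(text, n)), author)
                  for author, text in known)


if __name__ == "__main__":
    # Hypothetical toy samples; real tests need much longer texts.
    samples = [
        ("Author A", "We think, frankly, that the plan is, at best, unwise."),
        ("Author B", "The plan is unwise and we oppose it without reservation."),
    ]
    mystery = "Frankly, the idea is, at best, untested, and, we think, unwise."
    for distance, author in rank_candidates(mystery, samples):
        print(f"{author}: {distance:.3f}")
```

A small distance only says that two texts share surface habits. As with the real tool, the ranking depends heavily on the choice of n, on how much text each candidate has, and on whether the samples are the same kind of writing, so it is worth trying several settings on texts whose authors you already know before trusting any result.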

So here’s the experiment and how you can participate.

1. Download and install the JGAAP software. Pretty easy; it works on Mac/PC/Linux.

If your computer does not already have Java installed, download the appropriate version of the Java Runtime Environment from Sun Microsystems. JGAAP should work with any version of Java at least as recent as version 6. If you are using a Mac, you may need to use the Software Update command built into your computer instead.

You can download the JGAAP software here. The jar will be named jgaap-5.2.0.jar; once it has finished downloading, simply double-click on it to launch JGAAP. I recommend copying it to a folder and launching it from there.

2. Read the tutorial here. Pay attention to the workflow process and steps required to “train” the software. Full documentation is here. Demos are here

3. Run some simple tests using some known documents to get familiar with the software. For example, you might run tests using some posts from WUWT (saved as text files) from different authors, and then put in one that you know who authored as a test, and see if it can be identified. Or run some tests from authors of newspaper articles from your local newspaper.

4. Download the Heartland files from Desmog Blog’s original post here. Do it fast, because this experiment is the one thing that may actually cause them to take them offline. Save them in a folder all together. Use the “properties” section of the PDF viewer to determine authorship. I suggest appending the author names (like J.Bast) to the end of the filename to help you keep things straight during analysis.

5. Run tests on the files with known authors based on what you learned in step 3.

6. Run tests of known Heartland authors (and maybe even throw in some non-heartland authors) against the “fake” document 2012 Climate Strategy.pdf 

You might also visit this thread on Lucia’s and get some of the documents Mosher used to compare visually to tag Gleick as the likely leaker/faker. Perhaps Mosher can provide a list of files he used. If he does, I’ll add them. Other Gleick authored documents can be found around the Internet and at the Pacific Institute. I won’t dictate any particular strategy, I’ll leave it up to our readers to devise their own tests for exclusion/inclusion.

7. Report your findings here in comments. Make screencaps of the results and use tinypic.com or photobucket (or any image drop web service) to leave the images in comments as URLs. Document your procedure so that others can test/replicate it.

8. I’ll then make a new post (probably this weekend) reporting the results of the experiment from readers.

As a final note, I welcome comments now in the early stages for any suggestions that may make the experiment better. The FBI and other law enforcement agencies investigating this have far better tools I’m told, but this experiment might provide some interesting results in advance of their findings.

233 Comments
DocMartyn
February 23, 2012 7:35 pm

When anything is printed on a laser printer it suffers the George Costanza problem; shrinkage.
The printer works by using reflected light to ionize a selenium coated rod, the rod picks up magnetic ink, which is deposited on paper, the paper is placed in a strong magnetic field to remove the charge, finally, the ink is heated to melt it into the paper.
Different laser printers have different inks, and different heat treatments, and different amounts of shrinkage, However, cheap papers tend to shrink the most and also have much more asymmetry in the product, the edges shrink more than the center of the paper.
If the document was printed on high quality paper, shrinkage is minimal and justified lines in the center of the paper will have the same linearity and width as those on the top and bottom.
Lower quality papers are much more likely to have asymmetry about the center of the paper, so the bottom and top lines will have a curved form.
Paper has to be kept at the correct humidity to cut down on curling, the more humid the atmosphere and cheaper the paper, the more curl will be observed.
Does the text have smilees and frownies?

Mark T
February 23, 2012 8:27 pm

P. Solar: you are supposed to spell out only single digit numbers. Two or more are inserted as digits. Spelling fifteen would be the only anomaly.
Mark

February 23, 2012 8:29 pm

@robin
“I wonder what http://ljzigerell.wordpress.com/2012/02/18/profiling-the-heartland-memo-author/ would find with Gleick’s writings compared to the Heartland speech he used.”
It appears that Gleick also uses the Oxford comma as well (check out his mea culpa speech). But then again he uses commas all over the place. Apparently, he, really, likes, commas.

eyesonu
February 23, 2012 8:34 pm

The number of commenters on this thread, as well as others, who have pitched in with their knowledge to decipher fakegate as well as the many other topics discussed on WUWT is astounding. It is truly an ‘army of ones’. I doubt that defeat will ever be an option. Now that is a team you can believe in!

a dood
February 23, 2012 9:23 pm

No problem getting the software installed on my mac. So it looks like basically I need to load in text from several authors to compare the “anonymous” document against. Hmmmm I bet it could identify a Willis essay from a mile away. Will keep you posted.

a dood
February 23, 2012 9:30 pm

“Peter Kovachev says:
February 23, 2012 at 10:59 am
A dood, beg to respectfully differ with your smudge hypothesis. If you were to look again, you’ll note that the anomaly is unnaturally regular; rectangular, straight lined and with an apparent corner. I still place my wager on a shadow caused by a white tape or a label. On second look, it may actually be a shadow caused by the upper corner and edge of the document page, although the abrupt, un-tapered ending of the bottom bit of the line on the left hand side would militate against that.”
Agreed. I was just thinking that whenever I had the same mark on all the pages of a scan it would often be something on the scanner glass.

sunspot
February 23, 2012 9:53 pm

ha……………………
check this out http://scienceblogs.com/gregladen/2012/02/is_the_heartland_strategy_memo.php
pathetic !!!!

peetee
February 23, 2012 10:22 pm

Hey now! Using Mr. Watts recommended software => most likely author of strategy document: Heartland’s Joe Bast (with bonus consideration: Mosher sent the document to Gleick)
The most likely author of the Heartland Institute climate strategy memo? => http://www.shawnotto.com/neorenaissance/blog20120223.html

REPLY:
Yawn, predictable. It’s HuffPo, what did you expect? And he just throws out the numbers there, not knowing how to interpret them, and jumps to the conclusion he wants, like you do, providing scads of caveats. The scores he published suggest otherwise. The authors of the program are helping WUWT do an analysis, we’ll wait for their input on how to properly configure the program. Which we’ll share. -Anthony

February 23, 2012 10:50 pm

I do not have the capability of using the recommended software and doing a textual analysis. Nevertheless I think it very worthwhile to simply study the purported memo to see what one can see. In agreement with the message left by P. Solar I have the following observations which may or may not prove useful.
First – the memo says “His effort will focus on providing curriculum that shows that the topic of climate change is controversial and uncertain – two key points that are effective at dissuading teachers from teaching science.”
Second – the memo says “This influential audience has usually been reliably anti-climate and it is important to keep opposing voices out.”
Now ponder these quotes a minute or so. Ask yourself, have you ever ever ever heard such blatant self-incrimination as “dissuading teachers from teaching science”, “reliably anti-climate” and “important to keep opposing voices out”? That is way beyond even “hide the decline”. Any such phrases, if authenticated, would comprise direct evidence of fraud. But even low level fraudsters are sufficiently careful to never put such self-incrimination into writing. It just does not add up that an official document, however secret, would be produced which contained proof of abysmally guilty intent, not to mention guilty action.
I am reminded of an event cited in Richard Frank’s book GUADALCANAL where some Marines heard a noise in the distant brush and hollered out “who goes there?” After a few seconds came an answer “we are American Marines returning to report on the evening’s activities.”
Sometimes words are so blatantly phony that they give themselves utterly away.
Press on and best of luck.

Rick Bradford
February 23, 2012 10:57 pm

I’ve been playing around with JGAAP, and can see why it’s definitely beta software.
If you get your combination of analysis methods and event drivers wrong, it just fails to do the computation.
When I was using Training and Test sets (in the approved style), it correctly identified Gleick’s writing on matters of Sentence Length and ‘Words as Events’, whatever that is supposed to mean.
It’s an interesting topic, so I might go away and write my own analysis tool — as it stands JGAAP is too opaque right now.

Draig Du
February 23, 2012 11:13 pm

A thread brimming with the requested screen shots….the usual overwhelming evidence.

Shevva
February 24, 2012 12:58 am

@Draig Du says:
February 23, 2012 at 11:13 pm
Let me help you there.
*Picks up rock*
Are you OK getting back under there or do you need help?

Draig Du
February 24, 2012 1:34 am

[snip . . OT . . kbmod]

P. Solar
February 24, 2012 2:16 am

James Hill picks up on “important to keep opposing voices out”?
That is one I spotted and forgot to comment on. This is such a blatant parallel to the Machiavellian gatekeeping revealed in climategate that it seems to be an obvious plant of “proof that the other side are just as bad, so don’t criticise us any more”.
This document reeks of forgery from top to bottom. The fact that the last para is a bit more obviously so probably means the author started by trying to weave in the material he had purloined from H.I. and by the end was just letting it flow.

Mac
February 24, 2012 2:27 am

The prominence of Gleick(isms) and Bast(isms) indicates that Gleick wrote the strategy document based on summaries of the stolen documents which were probably written in large parts by Bast. That is a clincher.

P. Solar
February 24, 2012 3:01 am

The memo refers to “the Anonymous Donor” rather than “our”. This suggests it is written by an outsider to the organisation.
The memo states their climate work is of special interest to this donor. This seems to be an obsession of Gleick, given his insistence on this being published as a condition of him speaking at the dinner.

David L
February 24, 2012 3:07 am

Mark T on February 23, 2012 at 8:27 pm said:
P. Solar: you are supposed to spell out only single digit numbers. Two or more are inserted as digits. Spelling fifteen would be the only anomaly.
Mark
—————
I was not taught that rule in college technical writing class. I was taught that you keep consistent. If you’re spelling numbers then you spell them all. If not, then they’re all numeric. Not saying this is the absolute truth, just that there are variations to the rule, just like I was taught to use the Oxford comma but not using it is also okay.

February 24, 2012 4:27 am

@kim2000: I do not buy it. Compare the size of the left “margin” in the two documents.
docA: http://img813.imageshack.us/img813/5586/startdocpg2.jpg
docB: http://pacinst.org/reports/success_stories/new_ag_water_success_stories.pdf
The horizontal distance between text and pacinst logo-box in docA, is much smaller than the
distance between text and pixel-clutter in docB. They obviously do not match.
Here is an overlay where I’ve matched the font size and first line of text on page 2 from both documents: – feel free to try yourself.
@watts: Could you please share your aspect-ratio calculation? I see only two edges to measure from, and I am curious how you measure the aspect ratio from those.
To me the distance between the text and the pixel clutter is exactly where I would expect the paper margin to be. The margin looks just like the standard margins in word processors. That is also the parsimonious explanation. The idea that somebody should have used their institute template with a header, and then later removed it when creating a forgery is just unbelievable. Who would do such an idiotic thing? So @watts, @Kovachev, @wilson, and @Engelbeen: Talk about confirmation bias!

Rogelio
February 24, 2012 4:55 am

Agree with Mac. ANY semblance to Gleick would probably mean that he wrote it.

Jeb
February 24, 2012 5:07 am

Oh Anthony. I have a feeling this idea could come back and bite you [snip].
I rather hope not though. The alternative would be far more interesting.

February 24, 2012 5:07 am

Mosher is smarter than this program. I’ll wait for his input.

Jeb
February 24, 2012 5:40 am

A potential problem with this is that it might be detecting the style of the writing more so than the author.
Think about it this way. The “fake” memo is written in the style of an official document. If the only examples of Bast’s writings you are putting into the program are the other official documents, then naturally it is going to peg him as the author when compared to Gleick’s writings, which are more journalistic in tone.
Has anyone found any reports written by Gleick when he was head of the scientific integrity panel?

Sean
February 24, 2012 6:15 am

Unknown Document:
Expanded Climate Communications section of fake memo (the obviously adlibbed section of the fake memo)
Joe Bast documents:
http://news.heartland.org/newspaper-article/2011/09/02/are-we-doomed
http://news.heartland.org/newspaper-article/2011/08/10/heartland-replies-science
http://news.heartland.org/newspaper-article/2011/07/28/heartland-replies-nature
http://heartland.org/editorial/2011/01/31/writer-owes-schmitt-readers-apology
Peter Gleick Documents:
E-mail to Barry, E-mail to Tamsin
Expanded climate communications.doc C:\Users\Sean\Documents\Expanded climate communications.doc
Canonicizers: none
Analyzed by Nearest Neighbor Driver with metric Canberra Distance using Character 2Grams as events
1. Expanded Climate Communications 0.0
2. Peter Gleick 276.85855690431157
3. Joe Bast 341.0642160873912
4. Peter Gleick 342.9695238767721
5. Joe Bast 387.49597772052016
6. Joe Bast 398.2288354385788
7. Joe Bast 595.1870344738793
Expanded climate communications.doc C:\Users\Sean\Documents\Expanded climate communications.doc
Canonicizers: none
Analyzed by Nearest Neighbor Driver with metric Canberra Distance using Character 4Grams as events
1. Expanded Climate Communications 0.0
2. Peter Gleick 1697.8838678571715
3. Peter Gleick 2212.55268946424
4. Joe Bast 2498.548796452392
5. Joe Bast 2746.177655844846
6. Joe Bast 2904.130159851305
7. Joe Bast 4659.464731336592
Expanded climate communications.doc C:\Users\Sean\Documents\Expanded climate communications.doc
Canonicizers: none
Analyzed by Nearest Neighbor Driver with metric Canberra Distance using Word 2Grams as events
1. Expanded Climate Communications 0.0
2. Peter Gleick 428.2207207207207
3. Peter Gleick 542.1495016611295
4. Joe Bast 625.1890034364261
5. Joe Bast 720.8237103084341
6. Joe Bast 816.7965029955717
7. Joe Bast 1496.2399713441844
Expanded climate communications.doc C:\Users\Sean\Documents\Expanded climate communications.doc
Canonicizers: none
Analyzed by Nearest Neighbor Driver with metric Canberra Distance using Word 4Grams as events
1. Expanded Climate Communications 0.0
2. Peter Gleick 439.0
3. Peter Gleick 553.0
4. Joe Bast 641.0
5. Joe Bast 753.0
6. Joe Bast 876.0
7. Joe Bast 1627.0
Expanded climate communications.doc C:\Users\Sean\Documents\Expanded climate communications.doc
Canonicizers: none
Analyzed by Nearest Neighbor Driver with metric Canberra Distance using Word stems as events
1. Expanded Climate Communications 0.0
2. Peter Gleick 245.90269295532852
3. Peter Gleick 329.25374822603555
4. Joe Bast 374.28204336681557
5. Joe Bast 422.0761532754034
6. Joe Bast 433.8851473207132
7. Joe Bast 718.2862711836141
Conclusion: Gleick much more likely than Bast. However, I need more understanding of the settings.

Francisco
February 24, 2012 6:24 am

“This influential audience has usually been reliably anti-climate and it is important to keep opposing voices out.”
“Reliably anti-climate” sounds like a clear giveaway slip. The word is always used in a derogatory way, and that’s how Gleick uses it in his own writings, as when he calls the hockey stick a “bugaboo of the anti-climate change crowd.”
Heartland describing its own audience as “reliably anti-climate” would be similar to describing them as “reliably denialist”. Or, from the opposite perspective, it would be as if a warmist/alarmist organization referred to its own audience as “reliably alarmist”.

February 24, 2012 6:31 am

Sean,
I’m too busy to look them all up right now, but other posters like Robin [@12:34 above] have come to the opposite conclusion. It appears that the program is a cherrypicker’s delight. So I’ll wait for Mosher’s input. He was the one who ID’d Gleick, forcing Gleick to admit to criminal activity in his noble cause corruption. I trust Mosher’s reasoning much more than any GIGO program in beta testing.