An online and open exercise in stylometry/textometry: Crowdsourcing the Gleick "Climate Strategy Memo" authorship

Tonight, a prescient prediction made on WUWT shortly after Gleick posted his confession has come true in the form of DeSmog blog making yet another outrageous and unsupported claim in an effort to save their reputation and that of Dr. Peter Gleick as you can read here: Evaluation shows “Faked” Heartland Climate Strategy Memo is Authentic

In a desperate attempt at self-vindication, the paid propagandists at DeSmog blog have become their own “verification bureau” for a document they have no way to properly verify. The source (Heartland) says it isn’t verified (and is a fake) but that’s not good enough for the Smoggers and is a threat to them, so they spin it and hope the weak-minded regurgitators retweet it and blog it unquestioned. They didn’t even bother to get an independent opinion. It seems to be just climate news porn for the weak-minded Suzuki followers upon which their blog is founded. As one WUWT commenter (Copner) put it – “triple face palm”.

Laughably, the Penn State sabbaticalized Dr. Mike Mann accepted it uncritically.

Twitter / @DeSmogBlog: Evaluation shows “Faked” H …

Evaluation shows “Faked” Heartland Climate Strategy Memo is Authentic bit.ly/y0Z7cL  – Retweeted by Michael E. Mann

Tonight in comments, Russ R. brought attention to his comment with prediction from two days ago:

I just read Desmog’s most recent argument claiming that the confidential strategy document is “authentic”. I can’t resist reposting this prediction from 2 days ago:

Russ R. says:

February 20, 2012 at 8:49 pm

Predictions:

1. Desmog and other alarmist outfits will rush to support Gleick, accepting his story uncritically, and offering up plausible defenses, contorting the evidence and timeline to explain how things could have transpired. They will also continue to act as if the strategy document were authentic. They will portray him simultaneously as a hero (David standing up to Goliath), and a victim (an innocent whistleblower being harassed by evil deniers and their lawyers).

2. It will become apparent that Gleick was in contact with Desmog prior to sending them the document cache. They knew he was the source, and they probably knew that he falsified the strategy document. They also likely received the documents ahead of the other 14 recipients, which is the only way they could have had a blog post up with all the documents AND a summary hyping up their talking points within hours of receiving them.

3. This will take months, or possibly years to fully resolve.

Russ R. is spot on, except maybe for number 3, and that’s where you WUWT readers and crowdsourcing come in. Welcome to the science of stylometry / textometry.

Since DeSmog blog (which is run by a Public Relations firm backed by the David Suzuki Foundation) has no scruples about calling WUWT, Heartland, and skeptics in general “anti-science”, let’s use science to show how they are wrong. Of course the hilarious thing about that is that these guys are just a bunch of PR hacks, and there isn’t a scientist among them. As Megan McArdle points out, you don’t have to be a scientist to figure out the “Climate Strategy” document is a fake; common sense will do just fine. She writes in her third story on the issue: The Most Surprising Heartland Fact: Not the Leaks, but the Leaker

… a few more questions about Gleick’s story:  How did his correspondent manage to send him a memo which was so neatly corroborated by the documents he managed to phish from Heartland?
How did he know that the board package he phished would contain the documents he wanted?  Did he just get lucky?

If Gleick obtained the other documents for the purposes of corroborating the memo, why didn’t he notice that there were substantial errors, such as saying the Kochs had donated $200,000 in 2011, when in fact that was Heartland’s target for their donation for 2012?  This seems like a very strange error for a senior Heartland staffer to make.  Didn’t it strike Gleick as suspicious?  Didn’t any of the other math errors?

So, let’s use science to show the world what the common-sense geniuses at DeSmog haven’t been able to do themselves. Of course I could do this analysis myself, and post my results, but the usual suspects would just say the usual things like “denier, anti-science, not qualified, not a linguist, not verified,” etc. Basically as PR hacks, they’ll say anything they can dream up and throw it at us to see if it sticks. But if we have multiple people take on the task, well then, their arguments won’t have much weight (not that they do now). Besides, it will be fun and we’ll all learn something.

Full disclosure: I don’t know how this experiment will turn out. I haven’t run it completely myself. I’ve only familiarized myself enough with the software and science of stylometry / textometry to write about it. I’ll leave the actual experiment to the readers of WUWT (and we know there are people on both sides of the aisle that read WUWT every day).

Thankfully, the open-source software community provides us with a cross-platform open source tool to do this. It is called JGAAP (Java Graphical Authorship Attribution Program). It was developed for the express purpose of examining unsigned manuscripts to determine a likely author attribution. Think of it like fingerprinting via word, phrase, and punctuation usage.

From the website main page and FAQs:

JGAAP is a Java-based, modular, program for textual analysis, text categorization, and authorship attribution i.e. stylometry / textometry. JGAAP is intended to tackle two different problems, firstly to allow people unfamiliar with machine learning and quantitative analysis the ability to use cutting edge techniques on their text based stylometry / textometry problems, and secondly to act as a framework for testing and comparing the effectiveness of different analytic techniques’ performance on text analysis quickly and easily.

What is JGAAP?

JGAAP is a software package designed to allow research and development into best practices in stylometric authorship attribution.

Okay, what is “stylometric authorship attribution”?

It’s a buzzword to describe the process of analyzing a document’s writing style with an eye to determining who wrote it. As an easy and accessible example, we’d expect Professor Albus Dumbledore to use bigger words and longer sentences than Ronald Weasley. As it happens (this is where the R&D comes in), word and sentence lengths tend not to be very accurate or reliable ways of doing this kind of analysis. So we’re looking for what other types of analysis we can do that would be more accurate and more reliable.
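To make the word-length and sentence-length idea concrete, here is a minimal Python sketch. It is not part of JGAAP; the function name `length_features` and the two sample strings are made up purely for illustration, and the sentence splitter is deliberately crude.

```python
import re

def length_features(text):
    """Return (mean word length, mean sentence length in words) for a text."""
    # Split after ., ! or ? when followed by whitespace. Crude, but enough for a sketch.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z']+", text)
    if not words or not sentences:
        return 0.0, 0.0
    mean_word_len = sum(len(w) for w in words) / len(words)
    mean_sentence_len = len(words) / len(sentences)
    return mean_word_len, mean_sentence_len

# Two invented samples: one long-winded, one terse.
sample_a = "The committee deliberated extensively before reaching its considered conclusion."
sample_b = "We met. We talked. We left."
print(length_features(sample_a))
print(length_features(sample_b))
```

As the FAQ says, these two numbers alone tend not to be reliable for attribution; they are only the simplest possible example of turning writing style into something measurable.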

Why would I care?

Well, maybe you’re a scholar and you found an unsigned manuscript in a dusty library that you think might be a previously unknown Shakespeare sonnet. Or maybe you’re an investigative reporter and Deep Throat sent you a document by email that you need to validate. Or maybe you’re a defense attorney and you need to prove that your client didn’t write the threatening ransom note.

Sounds like the perfect tool for the job. And, best of all, it is FREE.

So here’s the experiment and how you can participate.

1. Download and install the JGAAP software. It is pretty easy and works on Mac/PC/Linux.

If your computer does not already have Java installed, download the appropriate version of the Java Runtime Environment from Sun Microsystems. JGAAP should work with any version of Java at least as recent as version 6. If you are using a Mac, you may need to use the Software Update command built into your computer instead.

You can download the JGAAP software here. The jar will be named jgaap-5.2.0.jar. Once it has finished downloading, simply double-click on it to launch JGAAP. I recommend copying it to a folder and launching it from there.

2. Read the tutorial here. Pay attention to the workflow process and steps required to “train” the software. Full documentation is here. Demos are here.

3. Run some simple tests using some known documents to get familiar with the software. For example, you might run tests using some posts from WUWT (saved as text files) from different authors, and then put in one that you know who authored as a test, and see if it can be identified. Or run some tests from authors of newspaper articles from your local newspaper.

4. Download the Heartland files from Desmog Blog’s original post here. Do it fast, because this experiment is the one thing that may actually cause them to take them offline. Save them in a folder all together. Use the “properties” section of the PDF viewer to determine authorship. I suggest appending the author names (like J.Bast) to the end of the filename to help you keep things straight during analysis.

5. Run tests on the files with known authors based on what you learned in step 3.

6. Run tests of known Heartland authors (and maybe even throw in some non-Heartland authors) against the “fake” document 2012 Climate Strategy.pdf.

You might also visit this thread on Lucia’s and get some of the documents Mosher used to compare visually to tag Gleick as the likely leaker/faker. Perhaps Mosher can provide a list of files he used. If he does, I’ll add them. Other Gleick authored documents can be found around the Internet and at the Pacific Institute. I won’t dictate any particular strategy, I’ll leave it up to our readers to devise their own tests for exclusion/inclusion.

7. Report your findings here in comments. Make screencaps of the results and use tinypic.com or photobucket (or any image drop web service) to leave the images in comments as URLs. Document your procedure so that others can test/replicate it.

8. I’ll then make a new post (probably this weekend) reporting the results of the experiment from readers.
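For readers who want a feel for what the train-then-test steps above are doing, here is a rough Python sketch of one common stylometric approach: build a character trigram profile for each known author and rank the authors by cosine similarity to the unknown text. This is an assumption-laden illustration, not a description of how JGAAP works internally (JGAAP offers many event sets and analysis methods); the function names and the toy texts are hypothetical.

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Count overlapping character n-grams after lowercasing and whitespace cleanup."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count profiles (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_authors(known, unknown, n=3):
    """known: dict of author -> list of texts. Returns (author, score), best match first."""
    target = char_ngrams(unknown, n)
    scores = {}
    for author, texts in known.items():
        profile = Counter()
        for t in texts:
            profile.update(char_ngrams(t, n))
        scores[author] = cosine(profile, target)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy example with invented texts; real tests need far longer samples per author.
known = {
    "Author A": ["the cat sat on the mat", "the cat ate the rat"],
    "Author B": ["zebras quickly jump", "quick jumpy zebras"],
}
for author, score in rank_authors(known, "the rat sat on the cat"):
    print(author, round(score, 3))
```

Note that a ranking like this always names a "closest" author, even if the true author is not among the candidates, which is why step 6 suggests throwing in some non-Heartland authors as well.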

As a final note, I welcome comments now in the early stages for any suggestions that may make the experiment better. The FBI and other law enforcement agencies investigating this have far better tools I’m told, but this experiment might provide some interesting results in advance of their findings.

P. Solar
February 23, 2012 7:54 am

“Who elected Connolley to decide what is “content free”? This isn’t Wikipedia, where Connolley can make decisions. He is only another commenter here. And as we can see, he is usually wrong.”
I have no love for Connolley, who is an eco-bigot who has done much damage to the usefulness of Wikipedia. However, once again I have to say he’s right.
This effort at crowdsourcing seems to have a very poor response rate. Lots of chaff about PDFs, but if no one is analysing them, comment is immaterial.
I think the idea is interesting but I have some climate science I’d rather do instead of investing time in what Gleick is going to have to come clean on at some stage.
He looks like a kind and decent guy; I don’t think he will stand up well to the prospect of doing serious time for forgery.

Eric Gisin
February 23, 2012 8:05 am

Is desmog located in Canada or the US? It matters if criminal investigation starts.
I wonder what Suzuki’s links to desmog are. He blogs at HuffyPost, and assumes the fake memo is also genuine (written after Gleick confessed).
David Suzuki: Denying Climate Change is Worse Than Stealing – http://www.huffingtonpost.ca/david-suzuki/documents-strike-at-heart_b_1292343.html

William M. Connolley
February 23, 2012 8:09 am

Up to 98 comments. Lots of people spouting off, but no-one has actually done any work.

February 23, 2012 8:14 am

Rude: No dice, Gleick implies he got the memo separately, from an anonymous source, by snail mail. Very convenient, no?
Sherlock Mosher has already pointed out that this is highly unlikely due to the fact that there are no crease lines from folding in the fake document….

msjake
February 23, 2012 8:15 am

Nope, all that stuff I did took no work at all.
Thanks Mr. Connilley.

February 23, 2012 8:20 am

One suggestion on the crowdsourcing — I am not familiar with the tool you use, but analyses like these are subject to both false negatives and false positives. If the tool seems to point to Gleick as a likely author, we are not worried about false negatives. But there is still the problem of false positives. The tool needs to be run against other climate authors to see how they score as authors. If I were a defender of Gleick, and your tool came up with a “positive” match, the first thing I would try in order to defend my guy is to run the tool against other authors and see if I can get other positives. If they are able to get positives against, say, yourself or even Madonna as authors, then it is going to undermine your results. So I would recommend that you understand the false positive rate of your tool.
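One way to put a number on that false positive rate is a hold-one-out check: take each document whose author is known and is not the suspect, hide its label, and see how often the method blames the suspect anyway. The Python below is only a sketch of that bookkeeping. Both function names are hypothetical, and `overlap_attribute` is a deliberately naive stand-in for whatever attribution method (JGAAP or otherwise) is actually being evaluated.

```python
def false_positive_rate(docs, attribute, suspect):
    """docs: list of (author, text). attribute(known_docs, text) -> predicted author.

    Returns the fraction of non-suspect documents wrongly attributed to the suspect.
    """
    trials = hits = 0
    for i, (author, text) in enumerate(docs):
        if author == suspect:
            continue  # only documents NOT by the suspect can yield false positives
        known = docs[:i] + docs[i + 1:]  # hold the test document out of training
        trials += 1
        if attribute(known, text) == suspect:
            hits += 1
    return hits / trials if trials else 0.0

def overlap_attribute(known, text):
    """Toy stand-in: pick the author of the known document sharing the most distinct words."""
    words = set(text.lower().split())
    best_author, best_score = None, -1
    for author, other in known:
        score = len(words & set(other.lower().split()))
        if score > best_score:
            best_author, best_score = author, score
    return best_author
```

A high rate on documents that the suspect certainly did not write would mean a "match" on the disputed memo carries little weight.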

Morph
February 23, 2012 8:21 am

You’re obviously not busy then William 😉

Aaron
February 23, 2012 8:34 am

Mr Connolley seems to imagine that in the time between late last night and this morning that many of us should have found the time to perform this experiment already. I guess we should have forgone sleep. After all, if we can find the time to spend 30 seconds writing a blog posting, we should have been able to perform this experiment by now.
This cannot possibly be done as quickly as Mr. Connolley seems to think it should be done. Something like this effort will likely require free time over a weekend or something like that. Counting the number of posts which occur before the first results are in is embarrassingly childish in nature.

A Lovell
February 23, 2012 8:40 am

The money Lefebvre supplies to DeSmogBlog is not dirty. After all, he is well known for laundering it.

Jake
February 23, 2012 8:56 am

I would even go further to suggest that people who are doing this checking, should hold onto their results until they have done several runs using different options within the software.
Then when you DO publish, be prepared to discuss BOTH negative and positive results.
Afterall, I believe that is how “science” is supposed to be done.

Warren in Minnesota
February 23, 2012 8:58 am

The document and the envelope: Was it sent by US mail? I would guess that Peter Gleick would say that he threw it away and it is gone. But if it were anonymously mailed, my common sense would want to know from where the letter came. Maybe Peter Gleick has similar common sense and knows the place of mailing and even kept the envelope.

February 23, 2012 9:02 am

Sonicfrog, (February 23, 2012 at 7:52 am), I agree with you that the use of the terms “anti-climate” and “influential” you highlighted may not make a strong case for forensically identifying the author, but the use of the former has raised many eyebrows. I cannot imagine a skeptic calling his position “anti-climate.” It would be akin to, for example, an antisemite today referring to himself as an antisemite, when the common and favoured terms are “critic of Israel” or “anti-Zionist.” My guess is that the doc has been put together by someone who may be otherwise a bright person, but an utter doofus when it comes to IT-based forgery, which our digitized environment has made harder.

CW
February 23, 2012 9:03 am

Michael J says:
February 23, 2012 at 12:39 am
I have a new theory about the fake document.
I suspect that it was sent to him by a colleague or, more likely, an opponent for the specific purpose of yanking his chain.
They hoped to get a laugh as Dr Gleick’s anger and hatred blinded him to the document’s obvious faults. However even the provocateur(s) could not have anticipated Dr Gleick’s actions.
================
This gave me a good laugh. As did imagining the panic that would have gripped the prankster when Gleick told him about it.

February 23, 2012 9:06 am

William M. Connolley says February 23, 2012 at 8:09 am
Up to 98 comments. Lots of people spouting off, but no-one has actually done any work.

That, is what is called “a drive-by”; it was also contentless as well as being witlessly done.
Somehow one would expect more from the big kahuna-combination of Climate + Wiki …
Say, Big Kahuna, how do you know what is taking place behind the scenes? Have you no imagination, no ability to think beyond your own 6′ x 8′ cubicle?
.

Scrutineer
February 23, 2012 9:06 am

Has Andrew Revkin or any other journalist made an effort to directly contact Gleick and ask him if he fabricated (or had any involvement in the fabrication of) the “Climate Strategy” memo?

Mac
February 23, 2012 9:08 am

William M Connolley has form in faking information. Perhaps he could give us pointers on how he does it? It seems a pity to waste all that experience when it is close at hand.

February 23, 2012 9:16 am

“Up to 98 comments. Lots of people spouting off, but no-one has actually done any work.” –William M. Connolley (February 23, 2012 at 8:09 am)
And right indeed you are, Mr Connolley, we’re all a pile of lazy loafs here, it would seem. Shame on us. All the more reason for you to take up my challenge to you, to roll up your sleeves and show us how to do a man’s job properly. An interesting experiment to see how devoted you are to the muse of science, and o, what a delightful screamer it would be if it were to be you who conclusively nailed the identity of the fraudster! I bet you still have the mojo for that.
Still, in defense of us lazy bones, some of us like me are lucky to manage a PC for email and to even tart up our simpletonian postings with html, and all here appear to be cursed with day jobs which limit us to quick missives throughout the workday. Perhaps if the folks who keep you in supply of your daily pint and bangers-‘n-beans would throw some green paper stuff our way….

a dood
February 23, 2012 9:27 am

One thing to note about the scanned PDF: The ‘readable’ text was created by the EPSON’s OCR software as invisible overlay. This is pretty common – it makes the scan’s text searchable. But the OCR software will generate errors in spacing and spelling that are NOT in the original text. For a cleaner analysis the text should be cleaned up to actually match the original before running the JGAAP software….. Anyway, that’s just a thought. I’ll try and remember to try this out later!
Also, I keep seeing people bringing up the ‘yellow dot tracking’ – it doesn’t look like the original scan was made at a resolution where those would be detectable. They’re just low res black and white bitmaps.
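For anyone attempting the OCR cleanup suggested above, a first mechanical pass might look like the Python sketch below. The function name `normalize_ocr_text` is hypothetical, and the rules are assumptions about typical OCR debris (ligatures, words hyphenated across line breaks, stray spacing), not a description of what the EPSON software actually produces. It cannot repair misrecognized letters; those still need to be checked by eye against the scanned image.

```python
import re
import unicodedata

def normalize_ocr_text(raw):
    """Apply a few conservative, mechanical fixes to OCR-extracted text."""
    # NFKC folds compatibility characters, e.g. the "fi" ligature, into plain letters.
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words hyphenated across a line break: "docu-\nment" -> "document".
    # Caveat: this also merges genuine hyphenated compounds that were split at a line end.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse any run of whitespace (including newlines) into a single space.
    text = re.sub(r"\s+", " ", text)
    # Remove stray spaces that OCR inserted before punctuation.
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    return text.strip()

print(normalize_ocr_text("The  docu-\nment ,  \ufb01nal ."))
```

Whatever cleanup is applied to the unknown memo should be applied identically to every comparison document, otherwise the cleanup itself becomes a stylistic difference.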

Luther Wu
February 23, 2012 9:28 am

William M. Connolley says:
February 23, 2012 at 8:09 am
Up to 98 comments. Lots of people spouting off, but no-one has actually done any work.
_________________________
By your own standard, you are guilty of doing absolutely nothing but spouting off.
Not that it matters, your Wiki antics have lent you about as much credibility as the statement: “Guaranteed by Enron”.

Duke C.
February 23, 2012 9:31 am

I think there might have been a header of some sort that was cropped out (sloppily, I might add) on the 2012 Climate Strategy.pdf:
http://img813.imageshack.us/img813/5586/startdocpg2.jpg
There is also a slight trace of the same artifact on page 1.
After comparing page 2 with a boatload of scanned docs on my drive, I cannot find a similar type of artifact. If page 2 was simply misaligned when it was scanned, the vertical line would have extended the entire length of the page.
Since it occurs in the earliest instance of the memo, it would rule out any ex post facto alterations by DeSmog. It doesn’t rule out any involvement they might have had in the original creation, though.
That being said, feel free to shoot it down with a better theory. 🙂

February 23, 2012 9:34 am

Thought I’d put another comment on for Connelly to count. OCD types just can’t help themselves. I’ll get to the “work” when MY work allows it.

Shevva
February 23, 2012 9:42 am

OK Bill lets try one more time.
‘Besides, it will be fun and we’ll all learn something.’
I did apply for a climate grant but as I couldn’t promise that the answer would be CAGW they turned me down.

RomanM
February 23, 2012 9:43 am

Mr. Cannoli, I suggest that before you shoot your mouth off yet one more time, you do the following to teach everyone how it is done (if you aren’t too busy guarding your stash of misinformation on Wiki) :
Download the program.
Learn from scratch how it works. Determine the various choices to be made in the large variety of analysis methods available within the program – there are many. Of course, your thorough knowledge of statistical techniques including the principles of cluster analysis will come in handy here.
Gather together sufficient samples of text to work on. The program requires not only those created by the author under investigation, but also others with which comparisons need to be made. The results order the authors according to the likelihood of ownership of the unknown text so external comparisons are important.
Do the analyses (plural, not singular), for properly evaluating the results.
Come back within 24 hours with the answers.


Although the discussion here has been helpful, what would be even more useful is an assembled collection of texts written by Gleick as well as several other persons, possibly including the De SmugBlog posters, who should not be dismissed as possible contributors. The program can supposedly read Word and pdf files, but mine seemed to give error messages when encountering docx files. Text files seem to run reasonably well.

Tom
February 23, 2012 9:52 am

William seems to be setting the bar at the standard set by Gavin, who worked right through the Superbowl to beat McIntyre to a GHCN error that McIntyre discovered.

Ian Hoder
February 23, 2012 9:57 am

I’ve downloaded the program and tried to process 3 times but just keep on getting an error of “Experiment failed to complete”. Perhaps someone else could actually try to give it a go? It takes about 5 minutes to download and run. All you need to do is add in the fake document to the “Unknown Author” section, then add in documents from known authors including Gleick.
The updated user guide is here. http://evllabs.com/jgaap/5.2/JGAAP_User_Guide.pdf
I honestly didn’t know what all the different types of analysis meant so I just clicked “All” on every tab and then hit “Process” on the last tab. Maybe that was my mistake.
