Study demonstrates a pattern in 'how scientists lie about their data'

Stanford researchers uncover patterns in how scientists lie about their data

When scientists falsify data, they try to cover it up by writing differently in their published works. A pair of Stanford researchers have devised a way of identifying these written clues.

[Image: a white-coated doctor with his hands behind his back, one hand with fingers crossed in a gesture indicating he's lying. Credit: Andrey Popov/Shutterstock]

Stanford communication scholars have devised an ‘obfuscation index’ that can help catch falsified scientific research before it is published.

Even the best poker players have “tells” that give away when they’re bluffing with a weak hand. Scientists who commit fraud have similar, but even more subtle, tells, and a pair of Stanford researchers have cracked the writing patterns of scientists who attempt to pass along falsified data.

The work, published in the Journal of Language and Social Psychology, could eventually help scientists identify falsified research before it is published.

There is a fair amount of research dedicated to understanding the ways liars lie. Studies have shown that liars tend to use more negative emotion terms and fewer first-person pronouns. Fraudulent financial reports typically display higher levels of linguistic obfuscation – phrasing meant to distract from or conceal the fake data – than accurate reports.

To see if similar patterns exist in scientific academia, Jeff Hancock, a professor of communication at Stanford, and graduate student David Markowitz searched the archives of PubMed, a database of life sciences journals, from 1973 to 2013 for retracted papers. They identified 253, primarily from biomedical journals, that were retracted for documented fraud and compared the writing in these to unretracted papers from the same journals and publication years, and covering the same topics.

They then rated each paper with a customized “obfuscation index,” which measured the degree to which the authors attempted to mask their false results. The index is a summary score of causal terms, abstract language, jargon, positive emotion terms and a standardized ease-of-reading score.
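For illustration only, a minimal sketch of how such a summary score might be computed is given below. The word lists, the readability proxy and the equal weighting are assumptions made for the example; they are not the lexicons or scoring procedure used in the Stanford study.

```python
# Illustrative sketch only: the word lists, readability proxy and weighting
# below are assumptions for demonstration, not the Stanford study's method.
import re

CAUSAL_TERMS = {"because", "therefore", "thus", "hence", "consequently"}
ABSTRACT_TERMS = {"concept", "framework", "aspect", "process", "phenomenon"}
JARGON = {"paradigm", "modality", "synergistic", "heterogeneity", "upregulation"}
POSITIVE_EMOTION = {"novel", "robust", "remarkable", "successful", "important"}

def tokenize(text):
    """Lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def rate_per_100(words, lexicon):
    """Occurrences of lexicon words per 100 words of text."""
    return 100.0 * sum(w in lexicon for w in words) / max(len(words), 1)

def reading_difficulty(text, words):
    """Crude readability proxy: mean word length times mean sentence length,
    scaled down; higher means harder to read."""
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    mean_word_len = sum(len(w) for w in words) / max(len(words), 1)
    return mean_word_len * (len(words) / sentences) / 100.0

def obfuscation_index(text):
    """Causal terms, abstract terms, jargon and reading difficulty raise the
    score; positive emotion terms lower it, since fraudulent authors were
    found to use fewer of them."""
    words = tokenize(text)
    return (rate_per_100(words, CAUSAL_TERMS)
            + rate_per_100(words, ABSTRACT_TERMS)
            + rate_per_100(words, JARGON)
            + reading_difficulty(text, words)
            - rate_per_100(words, POSITIVE_EMOTION))

if __name__ == "__main__":
    sample = ("The synergistic paradigm was upregulated; hence the observed "
              "heterogeneity of the process was consistent with the framework.")
    print(round(obfuscation_index(sample), 2))
```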

“We believe the underlying idea behind obfuscation is to muddle the truth,” said Markowitz, the lead author on the paper. “Scientists faking data know that they are committing a misconduct and do not want to get caught. Therefore, one strategy to evade this may be to obscure parts of the paper. We suggest that language can be one of many variables to differentiate between fraudulent and genuine science.”

The results showed that the fraudulent papers scored significantly higher on the obfuscation index than the unretracted control papers. For example, fraudulent papers contained approximately 1.5 percent more jargon than unretracted papers.

“Fraudulent papers had about 60 more jargon-like words per paper compared to unretracted papers,” Markowitz said. “This is a non-trivial amount.”

The researchers say that scientists might commit data fraud for a variety of reasons. Previous research points to a “publish or perish” mentality that may motivate researchers to manipulate their findings or fake studies altogether. The change the researchers found in the writing, however, is directly related to the authors’ goal of covering up lies through the manipulation of language. For instance, a fraudulent author may use fewer positive emotion terms to curb praise for the data, for fear of triggering inquiry.

In the future, a computerized system based on this work might be able to flag a submitted paper so that editors could give it a more critical review before publication, depending on the journal’s threshold for obfuscated language. But the authors warn that this approach isn’t currently feasible given the false-positive rate.

“Science fraud is of increasing concern in academia, and automatic tools for identifying fraud might be useful,” Hancock said. “But much more research is needed before considering this kind of approach. Obviously, there is a very high error rate that would need to be improved, but also science is based on trust, and introducing a ‘fraud detection’ tool into the publication process might undermine that trust.”

###

Mark from the Midwest
November 25, 2015 6:52 am

I’ve done a fair amount of text analysis and I don’t really buy into the authors’ conclusions, particularly because they’re talking about a modest difference (1.5 percent) on a newly created scale. In many documents the reason for jargon and vague terms is that the author is poorly informed about the subject matter, rather than an intent to deceive.
Of course, either way it’s bad: you’re either stupid or a liar. My lawyers always tell me to choose the former; after all, calling someone a liar can land you in court.

Bruce Cobb
November 25, 2015 7:06 am

Wait. Scientists? Lie? I just can’t fathom that.
It would be like priests molesting children.
Oh wait.

MarkW
November 25, 2015 7:20 am

Perhaps they could just count weasel words like “could”, “might”, “possibly”?

Matthew W
Reply to  MarkW
November 25, 2015 9:32 am

And the word “model(s)”

Dave N
Reply to  Matthew W
November 26, 2015 5:12 am

I have no objection to scientists using the word “models”; it’s the context they’re used in that I object to, e.g. the amount of confidence scientists place in them, particularly after they suffer epic fails.
Using “might,” “could,” etc., and then having no objection to policy being based on their work, is an even more epic failure.
Using the “d” word is an instant loss of all credibility; it’s a total cop-out from defending their work the way true scientists do.

Tucci78
Reply to  Dave N
November 26, 2015 7:53 am

Dave N writes:

I have no objection to scientists using the word “models”; it’s the context they’re used in that I object to, e.g. the amount of confidence scientists place in them, particularly after they suffer epic fails.

The confusion here comes of the popular (legacy, lamestream, leftard, luser “root weevil”) media having conflated the overblown, overpriced, incompetently and mendaciously programmed global climate computer simulations with the term “models” so that hoi polloi have slipped a cog in their understanding of the word’s use in the sciences. Let me pull from physicist Jeff Glassman’s wonderfully robust (ouch!) and useful 2007 essay “Conjecture, Hypothesis, Theory, Law: The Basis of Rational Argument”:

Science is all about models of the real world, whether natural (basic science) or manmade (applied science, or technology). These models are not discovered in nature, for nature has no numbers, no coordinate systems, no parameters, no equations, no logic, no predictions, neither linearity nor non-linearity, nor many of the other attributes of science. Models are man’s creations, written in the languages of science: natural language, logic, and mathematics. They are built upon the structure of a specified factual domain.

jorgekafkazar
Reply to  MarkW
November 25, 2015 11:32 am

Something like that. Sixty extra weasel words is a lot. But, as has been pointed out before, weasel words are the stock in trade of ALL scientists, good and bad.

RH
November 25, 2015 7:34 am

The study simply quantifies what most objective people do naturally using their built-in BS detector. I would bet that most skeptically minded people, like the majority of WUWT readers, could identify the fraudulent papers nearly as well as whatever algorithm was used in the study.

November 25, 2015 7:39 am

I don’t believe that more than a handful of “scientists” whose work is funded by the government, in one of the many ways they do that, are trying to do honest science. In fact, I think the funding schemes in place today guarantee abuse and lying. The worst liars may be those in medical research and in climate research: but they are not the only criminals. Not by a long shot.

knr
November 25, 2015 8:33 am

‘Science fraud is of increasing concern in academia’ – to be fair, that is only the case where such fraud is considered a bad thing rather than normal practice, as in climate ‘science’, so this approach may not be useful.
After all, if lying is the normal approach, then it is the lack of lying that suggests there is something wrong with your work.

TonyL
November 25, 2015 8:36 am

As others noted, a 1.5% difference is hardly Earth-shaking. Also, with their high false-positive rate, it is a sure thing they did not achieve a Wee p-value, the Holy Grail of social science research.
Others see what is there; I perceive that which is missing.
The authors failed to account for Tech-Speak.
Presumably, the authors consider jargon to be at least questionable, while tech-speak is always above reproach.
They should fractionate the lingo into tech-speak and true jargon, then parameterize the data sets. Then they can do an N-factorial experimental analysis on properly parameterized and vectored data. This approach is a sure bet to yield the much-coveted Wee p-value.
Note that in the social sciences, one takes the data first, and only then chooses the experimental design which will produce the desired result. The fact that this is exactly opposite to the way we do things in the physical sciences is only a coincidence.

November 25, 2015 9:03 am

Given the current state of science, funded by bureaucracies with built-in incentives to produce the policy support being paid for, are those really false positives?
A serious dig here: the high rate of positives indicates either a faulty detection system or a far deeper mess.
Right now, especially in the soft (pseudo) sciences, I would not be surprised if most of the published material is rubbish and known to be rubbish by the people writing it.

Paul Westhaver
November 25, 2015 9:25 am

Everybody Lies.
I believe it is because that little voice in their heads, their “best friend”, convinces them that they need something. Usually it comes down to self worth and projected image of the self. Why would a scientist be immune to normal human failings?

JohnKnight
Reply to  Paul Westhaver
November 25, 2015 1:47 pm

I swear I didn’t do it . . my ego did ; )

Paul Westhaver
Reply to  JohnKnight
November 25, 2015 1:59 pm

JohnKnight,
HA… yeah… isn’t that the truth. If you ever get the chance, take in a movie night and watch “Revolver” by Guy Ritchie. It is excellent. Not a chick flick… and not really a gangster movie either. It gives an interesting perspective on the bombastic power-projector types.

Terry
November 25, 2015 9:30 am

Somebody needs to write an app that parses studies and rates them using this methodology.
Would be interesting to take these alarmist studies and grade them on an ongoing basis.

RWturner
November 25, 2015 10:10 am

And sometimes the fraud is clearly spelled out…
… ship data are systematically warmer than the buoy data (15–17). … the bias correction involved calculating the average difference between collocated buoy and ship SSTs. The average difference globally was −0.12°C, a correction that is applied to the buoy SSTs at every grid cell in ERSST version 4. … buoy data have been proven to be more accurate and reliable than ship data, with better-known instrument characteristics and automated sampling

jorgekafkazar
Reply to  RWturner
November 25, 2015 11:35 am

Ouch.

Crispin in Waterloo but really in Beijing
November 25, 2015 10:35 am

I attended a webinar about 24 hours ago during which I asked a question. The answer was exactly as described in the article: vague, slightly negative about the quality, confounding issues that made exactness problematic and generally eschewing investigation because it was going to involve a lot of work and yield inconclusive answers for a novice.
I know for a fact there was no underlying data collected during the period in question! So the correct answer was, ‘We didn’t consider that and we don’t know.’ As the ‘don’t know’ was key to the whole purpose of spending the million$, it was ‘necessary’ to ‘explain it’ in a way that did not invite a close look at the data, just as the article above explains.
Very helpful. Thanks.

David S
November 25, 2015 10:50 am

As a spin doctor for a biotech company, I am acutely aware that when making announcements in relation to scientific data, the key is whether the writer of that article believes that what they are writing is true. In other words, are they aware that the information being written is false? Therefore I would suggest that the key to avoiding the tells is for scientists to convey all information to a third party who believes (because of their own ignorance) that the information is true, and get them to write about it.

NZ Willy
November 25, 2015 10:58 am

What is needed is to check the “obfuscation index” for the same researcher in his retracted work compared with his genuine work. Otherwise you’re just comparing people who cheat with those who don’t, in which case the differences across personalities subsume the “obfuscation index”. The index should be quantified, and doing so would lead to interesting advances in social analysis.

Alba
November 25, 2015 11:23 am
jorgekafkazar
November 25, 2015 11:28 am

I suspect that link is about a NOAA official telling the congressional inquiry on climate change that he was going to order his scientists not to produce science that meets anyone’s standards but his own. But I could be wrong.

Brian H
Reply to  jorgekafkazar
December 4, 2015 4:06 am

she

Pat Paulsen
November 25, 2015 11:29 am

Maybe they count inactive verbs like “could,” “might,” “ought to,” etc. That’s one of the quickest ways I spot false claims in media articles: I note how many couldawouldashoulda’s there are.
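A rough sketch of that kind of count is below; the hedge-word list is an illustrative assumption, not a validated lexicon from the study.

```python
# Quick sketch of a "couldawouldashoulda" counter; the hedge-word list is an
# illustrative assumption, not a validated lexicon.
import re
from collections import Counter

HEDGES = {"could", "might", "may", "should", "would", "possibly", "perhaps"}

def hedge_counts(text):
    """Count hedge words and report them per 100 words of text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w in HEDGES)
    per_100 = 100.0 * sum(counts.values()) / max(len(words), 1)
    return counts, per_100

counts, per_100 = hedge_counts("Warming could possibly accelerate and might "
                               "exceed projections, which may be conservative.")
print(counts, round(per_100, 1))
```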

jorgekafkazar
November 25, 2015 11:42 am

I think it’s time for us to revisit the internationally respected rubberducky.org and their fabulous “Chomsky Bot:”
http://rubberducky.org/cgi-bin/chomsky.pl

Barry
November 25, 2015 11:57 am

Luckily web blogs are not held to such ethical standards!

Reply to  Barry
November 25, 2015 6:14 pm

… and fortunately WUWT is one of the best of them. Barry

Louis
November 25, 2015 12:04 pm

“The researchers say that scientists might commit data fraud for a variety of reasons.”
That’s not possible. Advocates of climate change have assured us repeatedly that all scientists can be trusted without question — as long as they don’t receive funding from dirty fossil-fuel sources. Surely those are the only scientists who commit fraud, and so it’s the only test needed to detect it. /Sarc

Russell H
Reply to  Louis
November 25, 2015 12:36 pm
knr
Reply to  Louis
November 26, 2015 1:04 am

To be fair, if they do ‘receive funding from dirty fossil-fuel sources’ but produce ‘research’ that supports CAGW, then, just as normal water becomes holy water through a few magic words, the dirty money becomes ‘clean’ again.

RD
November 25, 2015 1:20 pm

Narrative trumps facts. Crapweasels et al.

TCE
November 25, 2015 1:32 pm

But .. But
It’s Official!
Global average temperatures in 2015 are likely to be the warmest on record, according to the World Meteorological Organisation (WMO).
Data until the end of October showed this year’s temperatures running “well above” any previous 12 month period.
The researchers say the five year period from 2011 to 2015 was also the warmest on record.
The rise, they state, was due to a combination of a strong El Nino and human-induced global warming.
The WMO said their preliminary estimate, based on data from January to October, showed that the global average surface temperature for 2015 was 0.73 degrees C above the 1961-1990 average.

November 25, 2015 1:38 pm

This reminded me of an old comment. I looked for it but couldn’t find it.
Someone had put up a link to a climate research paper generator that used random phrases suggested by the readers.
Parts of some of the “papers” were pretty humorous.

Logoswrench
November 25, 2015 1:46 pm

I don’t know; this article contained a lot of negative emotion and few first-person pronouns. Lol.

November 25, 2015 2:36 pm

“Skill,” “skilful,” and “skilfully” come to mind–used 14 times by the authors of MBH97 to describe their unprecedented statistical prowess. –AGF