The Vast Majority of Raw Data From Old Scientific Studies May Now Be Missing

From the people who know how to save and care for things of importance comes this essay from The Smithsonian:

One of the foundations of the scientific method is the reproducibility of results. In a lab anywhere around the world, a researcher should be able to study the same subject as another scientist and reproduce the same data, or analyze the same data and notice the same patterns.

This is why the findings of a study published today in Current Biology are so concerning. When a group of researchers tried to email the authors of 516 biological studies published between 1991 and 2011 and ask for the raw data, they were dismayed to find that more than 90 percent of the oldest data (from papers written more than 20 years ago) were inaccessible. In total, even including papers published as recently as 2011, they were only able to track down the data for 23 percent.

“Everybody kind of knows that if you ask a researcher for data from old studies, they’ll hem and haw, because they don’t know where it is,” says Timothy Vines, a zoologist at the University of British Columbia, who led the effort. “But there really hadn’t ever been systematic estimates of how quickly the data held by authors actually disappears.”

To make their estimate, his group chose a type of data that’s been relatively consistent over time—anatomical measurements of plants and animals—and dug up between 25 and 40 papers for each odd year during the period that used this sort of data, to see if they could hunt down the raw numbers.

A surprising number of their inquiries were halted at the very first step: for 25 percent of the studies, active email addresses couldn’t be found, with defunct addresses listed on the paper itself and web searches not turning up any current ones. For another 38 percent of studies, their queries led to no response. Another 7 percent of the data sets were lost or inaccessible.

“Some of the time, for instance, it was saved on three-and-a-half inch floppy disks, so no one could access it, because they no longer had the proper drives,” Vines says. Because the basic idea of keeping data is so that it can be used by others in future research, this sort of obsolescence essentially renders the data useless.

These might seem like mundane obstacles, but scientists are just like the rest of us—they change email addresses, they get new computers with different drives, they lose their file backups—so these trends reflect serious, systemic problems in science.

===============================================================

The paper:

The Availability of Research Data Declines Rapidly with Article Age


Highlights

• We examined the availability of data from 516 studies between 2 and 22 years old

• The odds of a data set being reported as extant fell by 17% per year

• Broken e-mails and obsolete storage devices were the main obstacles to data sharing

• Policies mandating data archiving at publication are clearly needed


Summary

Policies ensuring that research data are available on public archives are increasingly being implemented at the government [1], funding agency [2, 3 and 4], and journal [5 and 6] level. These policies are predicated on the idea that authors are poor stewards of their data, particularly over the long term [7], and indeed many studies have found that authors are often unable or unwilling to share their data [8, 9, 10 and 11]. However, there are no systematic estimates of how the availability of research data changes with time since publication. We therefore requested data sets from a relatively homogenous set of 516 articles published between 2 and 22 years ago, and found that availability of the data was strongly affected by article age. For papers where the authors gave the status of their data, the odds of a data set being extant fell by 17% per year. In addition, the odds that we could find a working e-mail address for the first, last, or corresponding author fell by 7% per year. Our results reinforce the notion that, in the long term, research data cannot be reliably preserved by individual researchers, and further demonstrate the urgent need for policies mandating data sharing via public archives.


Results

We investigated how research data availability changes with article age. To avoid potential confounding effects of data type and different research community practices, we focused on recovering data from articles containing morphological data from plants or animals that made use of a discriminant function analysis (DFA). Our final data set consisted of 516 articles published between 1991 and 2011. We found at least one apparently working e-mail for 385 papers (74%), either in the article itself or by searching online. We received 101 data sets (19%) and were told that another 20 (4%) were still in use and could not be shared, such that a total of 121 data sets (23%) were confirmed as extant. Table 1 provides a breakdown of the data by year.

We used logistic regression to formally investigate the relationships between the age of the paper and (1) the probability that at least one e-mail appeared to work (i.e., did not generate an error message), (2) the conditional probability of a response given that at least one e-mail appeared to work, (3) the conditional probability of getting a response that indicated the status of the data (data lost, data exist but unwilling to share, or data shared) given that a response was received, and, finally, (4) the conditional probability that the data were extant (either “shared” or “exists but unwilling to share”) given that an informative response was received.

There was a negative relationship between the age of the paper and the probability of finding at least one apparently working e-mail either in the paper or by searching online (odds ratio [OR] = 0.93 [0.90–0.96, 95% confidence interval (CI)], p < 0.00001). The odds ratio suggests that for every year since publication, the odds of finding at least one apparently working e-mail decreased by 7%.
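As a rough illustration of what a per-year odds ratio of 0.93 implies, the decline can be projected forward on the probability scale. This is a sketch, not a figure from the paper: the 90% baseline probability for a freshly published paper is a hypothetical assumption chosen for the example.

```python
def email_prob(p0, odds_ratio, years):
    """Project a baseline probability forward under a constant
    per-year odds ratio, as in a logistic regression on article age."""
    odds = p0 / (1 - p0) * odds_ratio ** years
    return odds / (1 + odds)

# Hypothetical baseline: 90% chance of at least one working e-mail
# at publication. The paper's estimated odds ratio is 0.93 per year
# (a 7% decline in the odds each year).
p20 = email_prob(0.90, 0.93, 20)
# After 20 years the projected probability is roughly 68%.
```

Note that a constant odds ratio compounds multiplicatively on the odds, so the probability falls slowly at first and faster as the odds approach 1:1.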

See more discussion and graphs here:

http://www.sciencedirect.com/science/article/pii/S0960982213014000


144 Comments
Bryan A
December 20, 2013 10:16 pm

Now where did I leave that Bernoulli Cartridge with my copy of Volkswriter? I know my Overunity Design plans were stored somewhere. Perhaps it is still in my Bernoulli Box. Come to think of it, my unified theory is probably on the same disk. DRATS

magicjava
December 20, 2013 10:24 pm

anna v says:
December 20, 2013 at 9:50 pm
We have a different definition of experiment. I am a physicist, and experiment means that one sets up a new experimental setup and gets new data.
——————————-
In the case of satellites, how would one set up a new experiment to verify satellite readings? With another satellite? Few can do that and even when it’s done *that* data and code cannot be fully verified either.
In the case of urban heat islands, how does one set up an experiment to verify or refute a claim that was never made? It can be easily demonstrated that urban heat islands raise temperatures. That does not mean some other factor rightfully required the final temperature to be adjusted up even higher.
In the case of computer models of the life of the solar system, how does one set up an experiment that can only be performed on a computer? And if your computer model, or 100 computer models, get different results than someone else’s, what does that demonstrate? That all your models are wrong? You cannot refute the core assumptions of a model without knowing what they are.
Experiments mean nothing if they cannot be verified.

Jim G
December 20, 2013 10:30 pm

Without the data….
the study could not be reproduced,
but more importantly, it can’t be falsified either.

jorgekafkazar
December 20, 2013 10:51 pm

You silly people thought this was about Science? Pish-tush! It was all about Publication! Academic publish or perish. Once you’ve published, you’re done. Science had nothing to do with it. Archiving? That’s something they do on another planet.
Fifty years ago, I worked in aerospace R&D. When a co-worker was laid off, we heard there was money left in her project for archiving, but no way to access the funds. She’d worked near me for several years, doing studies on [can’t tell you]. On her last day, she put all her notebooks in her lab bench drawer. I’m fairly certain (since others were also laid off) that her bench was eventually taken to field storage and left there, notebooks and all, in the humid Santa Monica air.

OLD DATA
December 20, 2013 10:52 pm

Why do Atheists put up Christmas lights?
Why do they leave them up permanently?
Today’s epiphany progressive: I should have used Cliffs Notes for the classics. They’d worked wonderfully for both years of calculus, physics, chemistry, mechanics of materials, et cetera.
Why did I waste so much time?

Bemused
December 20, 2013 11:33 pm

They couldn’t find email addresses for biology research papers older than 20 years?!!! OMG, I’m going to have to report that to some kind of scientific probity society!!!

Alan Robertson
December 20, 2013 11:36 pm

Not a problem. Just think of the studies that must be re- done. Grants, grants, grants.
Where did I leave those MathCAD floppies…

Mindert Eiting
December 20, 2013 11:51 pm

This problem is really urgent in historical research, which depends on our archives. These contain huge amounts of written sources, but almost all information from the past decades is digital. Some archives try to keep old computers with old drives and programs because they cannot copy gigabytes of information with each innovation. The whole thing is hopeless. It means that historians will call the period behind us the Dark Decades.

Bemused
December 20, 2013 11:54 pm

[snip -pointless, stupid, and insulting rant – mod]

Greg
December 20, 2013 11:58 pm

Claivus says: Stone is the only answer. Poor data density but great shelf life.
You’re wrong. I tried backing up my software as a binary record marked on granite once. My shelves didn’t last 5 minutes.

Greg
December 21, 2013 12:02 am

anna v says:
December 20, 2013 at 9:50 pm
We have a different definition of experiment. I am a physicist, and experiment means that one sets up a new experimental setup and gets new data.
===
Then your comments are irrelevant to this discussion, which I think is what others have been trying to say.

ROM
December 21, 2013 12:18 am

From the poor old taxpayer’s angle, the failure to ensure comprehensive storage of all data relevant to the immense numbers of science articles published over the last couple of decades is proving to be just another example of how the public’s trust, and the immense, almost no-strings-attached financial largesse showered on science of every type in ever increasing quantities over the last few decades, have become nothing more than another deep, black, financial, integrity and accountability free rat hole of steadily decreasing value to society.
It is becoming apparent that a very large percentage of scientists are now grabbing every dollar of the public’s money they can, while blithely and arrogantly assuming that they do not have to meet any standards of integrity, accountability or responsibility to society in return.
It can only end in tears for much of science unless they get their house in order as there is an increasing sentiment that maybe our society doesn’t need the numbers of scientists we currently and quite lavishly support.
The increasing public perception of science, driven primarily by the bad image climate science is developing, is that most so-called scientists are in it today for the money and prestige rather than a deep passion for science.
To quote one of my close relatives, who got a degree at a well known university here in Australia: “In science we pay ninety-nine dickwits to get the hundredth guy or gal who can really make a difference.”
Maybe we as a society only need to pay nine dickwits to still get that tenth guy or gal who can really make a difference.

ROM
December 21, 2013 12:51 am

We are told that even science that does not appear to have any useful application at all is valuable because it Adds to the Sum of Human Knowledge.
Yeah! Right! It Adds to the Sum of Human Knowledge as long as the data format is still around, or the discs are not lost, thrown into the rubbish bin, or gone moldy, a period that now seems to be down to only half a decade or so.
So in short, society, through no fault of anybody except the scientists involved, has completely done its dough when it backed that bit of supposed research and those scientists.
The only real beneficiaries are the scientists involved.
The rest of us have done our dough big time.
An excellent reason not to back those scientists or that research again until society and the tax payer can be categorically assured that ALL the relevant data tied to that research will be around permanently in a format that posterity into the far future can still view and sort through and check and verify.
Society only has limited resources to spread around amongst its various important sectors, and if full accountability is not assumed by the recipients of society’s largesse, they should not be surprised if they find themselves out looking for a job as a street sweeper.

Mike McMIllan
December 21, 2013 12:53 am

Gerry says: December 20, 2013 at 6:10 pm
Should I be proud or embarrassed to say that I would have no problem accessing data on 3 1/2 (or 5 1/4) inch disks?

Ditto. My 60 Mb tapes are toast, though, because the rubber pinch wheel on the drive turned to bubble gum.
I have my high school PSAT scores on a Hollerith card, but I found out there’s a bunch of different coding schemes for punch cards.

Larry Ledwick
December 21, 2013 12:57 am

This is a problem which has been well known among librarians and archivists for decades. There is major concern that all the digital data from the 1970’s through the early 2000’s will soon be lost to history. As mentioned above, there are multiple lines of attack that destroy the data: physical loss (I forgot where I put it, or it got thrown out while I was on vacation, etc.); degradation of the media itself, such as the gradual loss of magnetic domains on the tape; loss of the mechanical devices necessary to read the media; and loss of the supporting software and data formats, so you cannot make sense of the raw bits even if you can read them off the media.
The library of Congress has storage rooms full of old media recovery equipment from beta max tape readers to vinyl record turn tables so they can read data storage resources they acquire.
In the late 1990’s I worked in a tape library for a large data processing company here in Colorado; they had 500,000 3480 tape cartridges and racks and racks of both large and small reel tapes. Many customers had archival tape reels stored there (keep-forever tapes) which, if you looked closely, were probably useless, as they had tape cinches deep inside the reel where shrinkage of the tape had caused segments to be folded over and permanently creased.
We ran into data recovery problems with these tapes weekly. We had two brands of reel-to-reel tape drives, IBM and Storage Tek. Sometimes the Storage Tek drives would refuse to even load the reels, but often the IBM drives would load and read the tapes. This was partly because of differences in the operational methods each drive used to read bad spots on the tape, but also simple mechanical issues like slight differences in tape head alignment. A tape unreadable on one drive could “usually” be read if moved to a different, identical drive, if you could get it to load.
I also have boxes of 3.5 inch disks sitting under my desk as I write this post. A year or so ago I went through those boxes (several hundred disks), loaded them one by one in my desktop computer, and wrote the files out to a hard drive to refresh the data, then wrote the data to CDs. I am currently planning on picking up an M-Disc compatible drive because I have several terabytes of photos that I need to back up on a permanent medium. I also have hundreds of silver halide slides and film negatives that I have slowly been trying to digitize. Funny, the film is a far better archival storage medium than modern digital systems.
The fact is almost no one takes data preservation seriously. We cry and moan about all the film lost to history as the early celluloid films crumbled to dust in the vaults of the movie studios, but the exact same thing is happening to both personal and commercial digital data as we debate this problem. Right now it looks like M-Disc is the only truly archival means of storing digital data, other than optically on silver halide microfilm.
We may find that 50 – 100 years from now paper books will suddenly become enormously valuable, as they will be the only surviving records from our era. I have cheap paperback books on my bookshelves that I bought 40 years ago that are as good as the day I bought them, except for a little yellowing at the edge of the pages, whereas cassette tapes and early floppy disks only half that age are essentially all gone.
The best option right now for important data is to get it into archives like the wayback machine and their attempts to archive all manner of information including printed media via their scanning projects.
http://en.wikipedia.org/wiki/Internet_Archive

Sceptical lefty
December 21, 2013 1:07 am

jorgekafkazar
“You silly people thought this was about Science? Pish-tush! It was all about Publication! Academic publish or perish. Once you’ve published, you’re done. Science had nothing to do with it. Archiving? That’s something they do on another planet.”
The above comment is nasty, but pretty well nails it. With the imperative to publish, quantity will inevitably trump quality. It is unreasonable to expect the authors of low-quality papers to leave data (evidence) lying around indefinitely as it increases the likelihood of their eventual exposure.
Genuinely high-quality scientific papers rarely ‘die’ because they are too widely copied. (Nobody mention Nikola Tesla.) If the academic emphasis ever shifts to quality it will be accompanied by a corresponding decrease in the quantity of papers published — and confirmed sightings of flying pigs.

Mindert Eiting
December 21, 2013 1:09 am

Duster said ‘At the very worst, the data may need to be re-entered, but that is cheap compared to losing it completely’.
It reminds me of research I did many years ago. I needed someone else’s database, and yes, all the data was still available on punch tape. The lab owned a bizarre engine that could translate the code into normal print. It was quite defective, meaning that after the job the room was filled with spaghetti, but I had the data. Next, I had to re-enter everything on punch cards. That took many weeks, and at the end I owned some boxes. If you dropped a box, you had an information disaster. Finally, the data arrived on the university computer, and after many years I got the message that I had to copy it because they were cleaning the drives. Because the subject had become totally obsolete, I did not respond, which meant the end of a database. Perhaps science should live with the fact that we cannot keep all that data, and we should retain only the most important, astronomic observations for example.

norah4you
December 21, 2013 1:11 am

In 1995 I was asked to give correct readings for Sweden to Tema Vatten, Linköping University. How and why aren’t important here. Tema Vatten’s scientist answered: it’s easier to estimate the readings before 1990 than to type them into a computer …….
More records are preserved in unexpected places, at archaeological and historical institutions and university libraries. I know of one other place where almost all correct data can be found from the 1890’s on….. guess I’d better keep that information to myself for the time being.

Txomin
December 21, 2013 1:15 am

Yep. Which is the reason why the publication of any manuscript should require submission of the data. Unfortunately, standards are so low that not even a comprehensible methodology is required when reviewers find their opinions validated.

Robert of Ottawa
December 21, 2013 2:05 am

So, in the modern computer age, we have a data half-life, which is what?
Interestingly, this was not a problem before computers, because all data had to be written. The past was obviously more permanent than previously thought.
This should lead to the establishment of sound data procedures, such as making the data publicly available in computer form, etc.
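To sketch an answer to the half-life question above: the paper estimates that the odds of a data set being extant fall 17% per year. Treating that decline as the quantity that halves gives a back-of-the-envelope figure; this is an illustration derived from the paper’s odds ratio, not a number the authors report.

```python
import math

# Odds of a data set being extant fall 17% per year (paper's estimate),
# i.e. the odds are multiplied by 0.83 each year.
annual_factor = 1 - 0.17

# Years until the odds have halved: solve 0.83**t == 0.5 for t.
half_life = math.log(0.5) / math.log(annual_factor)
# Works out to roughly 3.7 years.
```

When availability is already low, odds and probability are nearly proportional, so a halving of the odds roughly halves the chance of recovering the data.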

Jack Simmons
December 21, 2013 2:32 am

michaelwiseguy says:
December 20, 2013 at 7:30 pm

In other news, here’s a great interview;
COP19: Marc Morano, Executive Editor/Chief Correspondent, Climate Depot

michaelwiseguy,
Thank you very much for the wonderful link. Very good interview.

December 21, 2013 3:00 am

Information loss is a problem which has plagued mankind for all of human history.

“As for knowledge, it will pass away.” (1 Cor 13:8)

Claivus wrote, “Stone is the only answer. Poor data density but great shelf life.”
Or use M-Discs, which are probably the next best thing to “written in stone.” M-Discs are new technology, for inexpensive thousand-year data storage. The guys at Millenniata are heroes. I hope they get very rich.
Data loss is a severe problem in climatology. The NSIDC falsely claims that “the satellite record [of sea-ice extent] only dates back to 1979,” which was the end of a particularly cold period, characterized by above-normal Arctic sea ice. But, actually, Nimbus-5, Nimbus-6, and Seasat-1 all made sea ice measurements via passive microwave radiometry prior to 1979. Unfortunately, NASA has lost the Nimbus-6 and Seasat-1 data.
We still have good quality Nimbus-5 ESMR (passive microwave) measurement data from December 11, 1972 through May 16, 1977. Nimbus-5’s ESMR instrument continued to operate in a degraded mode through March 1983, but the 1977-1983 data doesn’t seem to be available on-line; perhaps it has been lost, too.
The early Nimbus satellite measurements showed that 1979 was probably near the peak for Arctic sea ice, a fact which was reflected in graphs in the IPCC’s First and Second Assessment Reports (in 1990 and 1995, respectively), but omitted in later Assessment Reports.
The other thing that can be done to preserve data is to get it onto a web site on the Internet, and archived by services like TheWaybackMachine, WebCite, AwesomeHighlighter, and CiteBite. But even that doesn’t guarantee that the knowledge won’t pass away. AwesomeHighlighter is now gone, along with all its archived data.
It would help if scientists in universities and research institutions didn’t use robots.txt exclusion rules to prevent their web pages from being archived, and deliberately delete and hide their data, like Jones, Mann, Briffa, etc.
The loss of early data is obviously very bad for science, but it can be convenient for propaganda. Starting the sea ice graphs at the 1979 peak maximizes the appearance of subsequent decline, to support the CAGW narrative. The loss of so much of the earlier data makes it easier to perpetrate that deception.

Jack Simmons
December 21, 2013 3:06 am

_Jim says:
December 20, 2013 at 7:54 pm

Too cool not to post directly, daveburton!

Jim,
Yes, very cool!
Just thinking out loud here: the same basic technique could be used to convert cuneiform tablets or even hieroglyphics. After the image is transferred to digital form, translate it.
Why don’t archaeologists simply carry around an app on their phone and snap the photos?
Somebody beat me to it: https://play.google.com/store/apps/details?id=com.eyelid.Nexus.Hieroglyphs&hl=en

Ronald
December 21, 2013 3:18 am

The only data is the raw data, so if the raw data goes missing, that’s great news for the AGWers, because nobody can tell if they are lying because the evidence is gone. So stop yapping and go to work. Look for the raw data and make it work while you still can.

norah4you
Reply to  Ronald
December 21, 2013 11:53 am

Well, that would hold if it were true that no raw data still existed, but that’s not true in a world where Swedes exist…. I know of an archive where all the essential original daily newspaper reports from around the world from the 1890’s on have been saved one way or another (not on computers or servers, even if some are digitized as well)….. If we Swedes hadn’t had a master of administration and bureaucracy back in history (Axel Oxenstierna), we wouldn’t have so many archived papers of every kind there is….
Then in another archive there are copies of raw data for temperatures in the Northern Hemisphere as far back as the early 1800’s, together with dissertations.