The Vast Majority of Raw Data From Old Scientific Studies May Now Be Missing

From the people who know how to save and care for things of importance comes this essay from The Smithsonian:

One of the foundations of the scientific method is the reproducibility of results. In a lab anywhere around the world, a researcher should be able to study the same subject as another scientist and reproduce the same data, or analyze the same data and notice the same patterns.

This is why the findings of a study published today in Current Biology are so concerning. When a group of researchers tried to email the authors of 516 biological studies published between 1991 and 2011 and ask for the raw data, they were dismayed to find that more than 90 percent of the oldest data (from papers written more than 20 years ago) were inaccessible. In total, even including papers published as recently as 2011, they were only able to track down the data for 23 percent.

“Everybody kind of knows that if you ask a researcher for data from old studies, they’ll hem and haw, because they don’t know where it is,” says Timothy Vines, a zoologist at the University of British Columbia, who led the effort. “But there really hadn’t ever been systematic estimates of how quickly the data held by authors actually disappears.”

To make their estimate, his group chose a type of data that’s been relatively consistent over time—anatomical measurements of plants and animals—and dug up between 25 and 40 papers for each odd year during the period that used this sort of data, to see if they could hunt down the raw numbers.

A surprising number of their inquiries were halted at the very first step: for 25 percent of the studies, active email addresses couldn’t be found, with defunct addresses listed on the paper itself and web searches not turning up any current ones. For another 38 percent of studies, their queries led to no response. Another 7 percent of the data sets were lost or inaccessible.

“Some of the time, for instance, it was saved on three-and-a-half inch floppy disks, so no one could access it, because they no longer had the proper drives,” Vines says. Because the basic idea of keeping data is so that it can be used by others in future research, this sort of obsolescence essentially renders the data useless.

These might seem like mundane obstacles, but scientists are just like the rest of us—they change email addresses, they get new computers with different drives, they lose their file backups—so these trends reflect serious, systemic problems in science.

===============================================================

The paper:

The Availability of Research Data Declines Rapidly with Article Age


Highlights

• We examined the availability of data from 516 studies between 2 and 22 years old

• The odds of a data set being reported as extant fell by 17% per year

• Broken e-mails and obsolete storage devices were the main obstacles to data sharing

• Policies mandating data archiving at publication are clearly needed


Summary

Policies ensuring that research data are available on public archives are increasingly being implemented at the government [1], funding agency [2, 3 and 4], and journal [5 and 6] level. These policies are predicated on the idea that authors are poor stewards of their data, particularly over the long term [7], and indeed many studies have found that authors are often unable or unwilling to share their data [8, 9, 10 and 11]. However, there are no systematic estimates of how the availability of research data changes with time since publication. We therefore requested data sets from a relatively homogenous set of 516 articles published between 2 and 22 years ago, and found that availability of the data was strongly affected by article age. For papers where the authors gave the status of their data, the odds of a data set being extant fell by 17% per year. In addition, the odds that we could find a working e-mail address for the first, last, or corresponding author fell by 7% per year. Our results reinforce the notion that, in the long term, research data cannot be reliably preserved by individual researchers, and further demonstrate the urgent need for policies mandating data sharing via public archives.


Results

We investigated how research data availability changes with article age. To avoid potential confounding effects of data type and different research community practices, we focused on recovering data from articles containing morphological data from plants or animals that made use of a discriminant function analysis (DFA). Our final data set consisted of 516 articles published between 1991 and 2011. We found at least one apparently working e-mail for 385 papers (74%), either in the article itself or by searching online. We received 101 data sets (19%) and were told that another 20 (4%) were still in use and could not be shared, such that a total of 121 data sets (23%) were confirmed as extant. Table 1 provides a breakdown of the data by year.
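
As a quick sanity check of the arithmetic in that paragraph, the counts quoted above reproduce the quoted percentages. The short Python sketch below uses only the numbers given in the text (the per-year breakdown is in Table 1 of the paper) and is illustrative, not part of the authors' analysis.

# Sanity check of the proportions quoted above, using only the counts
# given in the text (the paper's Table 1 has the per-year breakdown).
total_articles   = 516
working_email    = 385   # at least one apparently working e-mail address
data_shared      = 101   # data sets actually received
still_in_use     = 20    # reported extant but not shareable
confirmed_extant = data_shared + still_in_use

for label, n in [("working e-mail", working_email),
                 ("data shared", data_shared),
                 ("confirmed extant", confirmed_extant)]:
    print(f"{label}: {n}/{total_articles} = {n / total_articles:.1%}")
# Prints roughly 74.6%, 19.6% and 23.4%, consistent with the rounded
# figures of 74%, 19% and 23% quoted above.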

We used logistic regression to formally investigate the relationships between the age of the paper and (1) the probability that at least one e-mail appeared to work (i.e., did not generate an error message), (2) the conditional probability of a response given that at least one e-mail appeared to work, (3) the conditional probability of getting a response that indicated the status of the data (data lost, data exist but unwilling to share, or data shared) given that a response was received, and, finally, (4) the conditional probability that the data were extant (either “shared” or “exists but unwilling to share”) given that an informative response was received.

There was a negative relationship between the age of the paper and the probability of finding at least one apparently working e-mail either in the paper or by searching online (odds ratio [OR] = 0.93 [0.90–0.96, 95% confidence interval (CI)], p < 0.00001). The odds ratio suggests that for every year since publication, the odds of finding at least one apparently working e-mail decreased by 7%.
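
For readers who want to see how an odds ratio of 0.93 becomes "the odds decreased by 7% per year", here is a minimal Python sketch. The data are simulated for illustration (the slope is seeded with the paper's estimate), and statsmodels is simply one common tool for fitting a logistic regression; this is not the authors' code or data.

# Minimal sketch: logistic regression of "working e-mail found" (1/0) on
# article age, and how the fitted slope is read as an odds ratio per year.
# The data below are SIMULATED for illustration, not the study's records.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
age = rng.integers(2, 23, size=516)           # article ages of 2-22 years
assumed_or = 0.93                             # take the paper's estimate as given
logit_p = 1.5 + np.log(assumed_or) * age      # intercept chosen arbitrarily
p = 1.0 / (1.0 + np.exp(-logit_p))
email_found = rng.binomial(1, p)              # 1 = working e-mail found

fit = sm.Logit(email_found, sm.add_constant(age)).fit(disp=False)
or_per_year = np.exp(fit.params[1])
print(f"odds ratio per year ≈ {or_per_year:.2f}")           # close to 0.93
print(f"change in odds per year ≈ {or_per_year - 1:+.0%}")  # about -7%

Because the model is linear on the log-odds scale, the per-year odds ratio compounds multiplicatively, so over 20 years the odds of finding a working address shrink to roughly 0.93^20 ≈ 0.23 of their starting value.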

See more discussion and graphs here:

http://www.sciencedirect.com/science/article/pii/S0960982213014000


144 Comments
noaaprogrammer
December 20, 2013 7:12 pm

“To avoid potential confounding effects of data type and different research community practices, we focused on recovering data from articles containing morphological data from plants or animals…” Yes, they knew better than to choose the area of climatology!
(I still have my boxes of FORTRAN II and FORTRAN IV programs and data on punched cards from the 1960s. Unfortunately I have several hundred 5 1/4 floppies that are waiting for an external drive-to-USB solution.)

Dougmanxx
December 20, 2013 7:28 pm

You can purchase an external USB 3.5 or 5.25 in. drive from Newegg for $14.99. Working in IT for a large public University, I suspect this is less a technical problem and more a “human” one. As usual, it’s the people, and not the technology, causing the problem.

December 20, 2013 7:30 pm

I bet Google and the NSA could access the data. I was told, “On the Internet, Things Never Go Away Completely”. Was I lied to? Is that not a true statement?
http://link.springer.com/chapter/10.1007/978-0-387-79026-8_3
In other news, here’s a great interview;
COP19: Marc Morano, Executive Editor/Chief Correspondent, Climate Depot

December 20, 2013 7:32 pm

Magnetic media degrades with time. So even if you can find a drive, you might have a hard time reading the diskette or tape.
I have drives that can read 3.5″, 5.25″, and even soft-sectored 8″ diskettes. Unfortunately, most of the 8″ diskettes and many of the others are so degraded that I can’t read the data from them.
The big problem with the 8″ diskettes (except for the 3M brand!) is that the oxide flakes off when the drive’s heads rub the media. If anyone has a solution to that problem, please contact me!
For 3.5″ diskettes, even if the data seems to be unreadable, it still might read correctly with an LS-120 “SuperDisk” drive. I have an old machine with one of those drives, and it is absolutely amazingly good at recovering the data from old diskettes.
For future data archiving, the right solution is probably M-Discs (though you’ll need an LG brand DVD burner to write ’em).

tz
December 20, 2013 7:33 pm

Google “USB Floppy Drive”, under $20. Newegg or Amazon at the top, or go to your local tech store.

temp
December 20, 2013 7:40 pm

I bet climate science would score a 97% on this topic.

December 20, 2013 7:41 pm

noaaprogrammer, the data on your old Hollerith punch cards is probably recoverable by optically scanning them, but it won’t be very easy.
With a bit of effort, I could read paper tape, but punched cards are harder.

December 20, 2013 7:43 pm

Dougmanxx;
I suspect this is less a technical problem and more a “human” one.
>>>>>>>>>>>>>
Exactly. For anyone in the data management profession, there’s nothing novel or surprising in this study. The technology and the processes to protect data for the long term have been known for decades. The IT department merely needs the mandate and the funding to make it happen along with publication of the indexes and process for retrieval.

December 20, 2013 7:46 pm

… saw this ‘magnetic medium’ issue in the 70’s when I was working at the FWSH&TC (Ft Wayne State Hospital and Training Center) where I worked while going to school; we received a boatload of 1″ Ampex video mag tapes that had been stored ‘outside’ in non-temp/environment controlled atmosphere … at that time the magnetic medium was separating from the polyester tape base …
.

Neo
December 20, 2013 7:53 pm

This is one topic where even the private sector isn’t fully immune. I’ve seen some companies attempt to bury research results from internal efforts, not out of stupidity or malice, but because the tax code makes it painful if the effort isn’t a complete write-off.

December 20, 2013 7:54 pm

Too cool not to post directly, daveburton!

GlynnMhor
December 20, 2013 7:59 pm

I still have my first PC, with both 3.5″ and 5.25″ floppy drives.
On rare occasions we need to use it to pull up old survey data stored on those formats.

December 20, 2013 8:05 pm

KevinK says December 20, 2013 at 7:06 pm

In the “old days” we used to make a final “released” drawing of our engineering designs with ink on “vellum” (originally dried sheep stomachs, but later polyester), with several copies created and numbered. One copy went into the “vault” …

I was going to say, whatever happened to “Drawing Control”? The master vellum copies were ‘microfiched’ (at TI) for subsequent human access in ‘card libraries’ at the various company campuses … the 35mm (or so) microfiche film was placed into a ‘cutout’ on an IBM 80-column ‘card’ which had the drawing number encoded in the first half of the card. Doing this allowed the clerk, using a nearby IBM ‘card sorter’, to re-sort the cards after they had been pulled by engineering and production personnel during the course of the work day …
.

shano
December 20, 2013 8:13 pm

If you wanted to justify why you couldn’t satisfy an FOIA request it would be useful to have a study like this one. You could say, “I’d be happy to give you my tree ring data but it seems the magnetic medium was accidentally demagnetized so you’re out of luck. Please stop persecuting me about it.”

December 20, 2013 8:17 pm

…. the 35mm (or so) microfiche film was placed into a ‘cutout’ on an IBM 80-column ‘card’ which had the drawing number encoded in the first half of the card. …..
________________________________________________________
These are called aperture cards. I worked on a project in 1985 to digitize all of the U.S. government (Army, USAF, Navy) Aperture cards. It was a massive contract to build the systems and deploy them (7 systems) and then a much larger task to run the cards.
The Aperture cards for the B1 Bomber numbered over 5 million. North American Rockwell estimated that it was going to cost $238 million to digitize that data. We did it for $32 million.
Data migration is one of the two biggest issues in the preservation world. We are working with the National Archives and the Library of Congress and there is never enough money to get everything done.
Just our project to capture 1960’s era raw data, digitize it, and deliver it to the planetary data system from the five Lunar Orbiters is generating almost 40 terabytes of data.

anna v
December 20, 2013 8:42 pm

One of the foundations of the scientific method is the reproducibility of results.
Yes, true. That is why experiments are not one-offs, and many are repeated many times, improving the result. The Millikan oil drop experiment, where one measures the charge of the electron, is an example of this, and of how measurements change through time. This was a lab experiment during my studies of physics back in 1960.
In a lab anywhere around the world, a researcher should be able to study the same subject as another scientist and reproduce the same data, or analyze the same data and notice the same patterns.
I think the last statement is an overstatement: “analyze the same data and notice the same patterns” overshoots the mark for the scientific method by hundreds of years. The way the scientific community made sure of the stability of knowledge was through publications which another researcher could study in order to repeat the experiment. Nobody required the same data to be chewed over again and again. It is the computer age that made the data available for chewing over and over, which, in my opinion, is the wrong path to take. If there are doubts about data, experiments should be repeated, not checked like homework problems.
The reason is that the complexity of any decent experiment is large, so the probability of errors entering during data gathering is also large, as humans are fallible. Chewing over the same data again and again may only show up these errors, which would explain discrepancies between experiments, or not, because similar blind spots could exist in the new analysis. It would not advance knowledge, particularly if this habit of rechecking made experiments one-off, on the thinking that rechecking the same data is like a new experiment and makes it safe.

mbur
December 20, 2013 8:44 pm

From essay:
“These might seem like mundane obstacles, but scientists are just like the rest of us—they change email addresses, they get new computers with different drives, they lose their file backups—so these trends reflect serious, systemic problems in science.”
Mundane obstacles? What, like tying your shoes?
This kind of data storage is part of ‘science’, isn’t it? Some are proud of it, no?
I don’t know if it’s just me, but I seem to notice an administrative point of view in recent
articles, studies, and news items.
From article summary:
“Policies ensuring that research data are available on public archives are increasingly being implemented at the government [1], funding agency [2, 3 and 4], and journal [5 and 6] level”
Thanks for the interesting posts,articles and comments

TRM
December 20, 2013 8:55 pm

I smell a project for Google! The Google Scientific Archives. Stored in multiple data centres on multiple continents. Online all the time. It’s better than trying to read VMS tapes that haven’t been retentioned in a decade.

magicjava
December 20, 2013 8:56 pm

anna v says:
December 20, 2013 at 8:42 pm
If there are doubts about data, experiments should be repeated, not checked like homework problems.
—————-
You cannot repeat the experiments without the data.
Example 1:
Satellite data needs to be adjusted due to noise in one of the sensors. The adjustments are written in computer code and stored nowhere else. No one can independently repeat the results of the adjusted satellite readings, including scientists working on the project. And the data cannot be obtained via a FOIA request because computer code is not considered to be documentation.
Example 2:
Government climate scientists adjust temps of cities *up* rather than down (as would normally be expected for an urban heat island). They give no reason why this adjustment was done this way. You cannot independently verify their reasoning when no reason was given.
Example 3:
Astronomer claims to have reproduced the orbits of all the objects in the solar system, from ancient times when it was little more than gas to modern day. Without the computer code and data it is impossible to know what assumptions were made to reach such a conclusion.
These are all real-life examples.

RoHa
December 20, 2013 8:58 pm

One medium that is particularly stable and needs no supporting technology is ink on acid-free paper. We used to have whole buildings just full of bundles of data in this medium. I forget what those places were called. “Librioteks” or something like that.

Greg
December 20, 2013 8:58 pm

3.5″ floppy drives are still available, but if you try to access a 20-year-old floppy you may well be disappointed. Bit rot. The recording medium is not stable over that time scale.
Even without the topical fungus someone else referred to, I have backups of software from 30 years ago kept in clean, dry conditions, and I’d estimate less than half are fully readable. Even commercially produced originals, and that predates the collapse in the quality of floppy disks that killed the medium.
17% per year! That’s a half-life of just 4 years. That’s serious.
Data storage requires maintenance. Libraries were traditionally created to perform this function for paper records. It seems that, like much else in our disposable age, data is now a throwaway commodity too.
But modern science is disposable too. Study results are made to order for political or commercial needs. The naive idea of objective, investigative science is long dead. Scientific “reports” are bought to serve a short-term objective and are then no longer needed.
Welcome to Kleenex science. One wipe and flush.
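
For reference, the arithmetic behind that “half-life of just 4 years” remark is a one-liner; the sketch below applies it to the odds reported in the paper, not to the raw proportion of surviving data sets.

# If the odds of a data set being extant fall by 17% per year, they are
# multiplied by 0.83 each year, so the odds halve after about:
import math
print(math.log(0.5) / math.log(1 - 0.17))   # ≈ 3.7 years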

Bob Diaz
December 20, 2013 9:22 pm

This is a real problem with all data; it requires a lot of work to keep porting your older data to new media.

anna v
December 20, 2013 9:50 pm

magicjava says:
December 20, 2013 at 8:56 pm
You cannot repeat the experiments without the data.
We have a different definition of experiment. I am a physicist, and experiment means that one builds a new experimental setup and gets new data.
What you call experiment I call analysis. Historically, scientific knowledge advanced by experiments, and even by multiple observations in astronomy, not by reanalyzing the same data.
I agree that it would be good, now that the facilities exist, to keep the data from one-off observations, but that is a new way of doing science and should only be used for checking discrepancies by reanalysis, not treated as if it were a new experiment or observation. Certainly the data should not be written in stone for the generations. If the next generation wants to examine something, it should redo the experiments, not regurgitate old measurements.

Duster
December 20, 2013 9:58 pm

RoHa says:
December 20, 2013 at 8:58 pm
One medium that is particularly stable and needs no supporting technology is ink on acid-free paper. We used to have whole buildings just full of bundles of data in this medium. I forget what those places were called. “Librioteks” or something like that.

The problem is information density. Paper is a really useful adjunct for relatively limited amounts of data. But, paper is really bulky for the amount of information you can store on it. I’ve dealt with projects where my staff really whined when I insisted on hard copies of everything. At the very worst, the data may need to be re-entered, but that is cheap compared to losing it completely.

Claivus
December 20, 2013 10:10 pm

Stone is the only answer. Poor data density but great shelf life. Or perhaps engraved platinum.