The Vast Majority of Raw Data From Old Scientific Studies May Now Be Missing

From the people who know how to save and care for things of importance comes this essay from The Smithsonian:

One of the foundations of the scientific method is the reproducibility of results. In a lab anywhere around the world, a researcher should be able to study the same subject as another scientist and reproduce the same data, or analyze the same data and notice the same patterns.

This is why the findings of a study published today in Current Biology are so concerning. When a group of researchers tried to email the authors of 516 biological studies published between 1991 and 2011 and ask for the raw data, they were dismayed to find that more than 90 percent of the oldest data (from papers written more than 20 years ago) were inaccessible. In total, even including papers published as recently as 2011, they were only able to track down the data for 23 percent.

“Everybody kind of knows that if you ask a researcher for data from old studies, they’ll hem and haw, because they don’t know where it is,” says Timothy Vines, a zoologist at the University of British Columbia, who led the effort. “But there really hadn’t ever been systematic estimates of how quickly the data held by authors actually disappears.”

To make their estimate, his group chose a type of data that’s been relatively consistent over time—anatomical measurements of plants and animals—and dug up between 25 and 40 papers that used this sort of data for each odd year during the period, to see if they could hunt down the raw numbers.

A surprising number of their inquiries were halted at the very first step: for 25 percent of the studies, active email addresses couldn’t be found, with defunct addresses listed on the paper itself and web searches not turning up any current ones. For another 38 percent of studies, their queries led to no response. Another 7 percent of the data sets were lost or inaccessible.

“Some of the time, for instance, it was saved on three-and-a-half inch floppy disks, so no one could access it, because they no longer had the proper drives,” Vines says. Because the basic idea of keeping data is so that it can be used by others in future research, this sort of obsolescence essentially renders the data useless.

These might seem like mundane obstacles, but scientists are just like the rest of us—they change email addresses, they get new computers with different drives, they lose their file backups—so these trends reflect serious, systemic problems in science.

===============================================================

The paper:

The Availability of Research Data Declines Rapidly with Article Age


Highlights

• We examined the availability of data from 516 studies between 2 and 22 years old

• The odds of a data set being reported as extant fell by 17% per year

• Broken e-mails and obsolete storage devices were the main obstacles to data sharing

• Policies mandating data archiving at publication are clearly needed


Summary

Policies ensuring that research data are available on public archives are increasingly being implemented at the government [1], funding agency [2, 3 and 4], and journal [5 and 6] level. These policies are predicated on the idea that authors are poor stewards of their data, particularly over the long term [7], and indeed many studies have found that authors are often unable or unwilling to share their data [8, 9, 10 and 11]. However, there are no systematic estimates of how the availability of research data changes with time since publication. We therefore requested data sets from a relatively homogenous set of 516 articles published between 2 and 22 years ago, and found that availability of the data was strongly affected by article age. For papers where the authors gave the status of their data, the odds of a data set being extant fell by 17% per year. In addition, the odds that we could find a working e-mail address for the first, last, or corresponding author fell by 7% per year. Our results reinforce the notion that, in the long term, research data cannot be reliably preserved by individual researchers, and further demonstrate the urgent need for policies mandating data sharing via public archives.
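
To make “the odds fell by 17% per year” concrete (a worked example, not taken from the paper): an annual odds ratio of 0.83 compounds multiplicatively, so a data set with even odds of being extant at publication (1:1, a 50% probability) would after ten years have odds of 0.83^10 ≈ 0.155, i.e. a probability of 0.155 / 1.155 ≈ 13%. Note that the decline applies to the odds, p/(1 − p), not directly to the probability itself.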


Results

We investigated how research data availability changes with article age. To avoid potential confounding effects of data type and different research community practices, we focused on recovering data from articles containing morphological data from plants or animals that made use of a discriminant function analysis (DFA). Our final data set consisted of 516 articles published between 1991 and 2011. We found at least one apparently working e-mail for 385 papers (74%), either in the article itself or by searching online. We received 101 data sets (19%) and were told that another 20 (4%) were still in use and could not be shared, such that a total of 121 data sets (23%) were confirmed as extant. Table 1 provides a breakdown of the data by year.

We used logistic regression to formally investigate the relationships between the age of the paper and (1) the probability that at least one e-mail appeared to work (i.e., did not generate an error message), (2) the conditional probability of a response given that at least one e-mail appeared to work, (3) the conditional probability of getting a response that indicated the status of the data (data lost, data exist but unwilling to share, or data shared) given that a response was received, and, finally, (4) the conditional probability that the data were extant (either “shared” or “exists but unwilling to share”) given that an informative response was received.

There was a negative relationship between the age of the paper and the probability of finding at least one apparently working e-mail either in the paper or by searching online (odds ratio [OR] = 0.93 [0.90–0.96, 95% confidence interval (CI)], p < 0.00001). The odds ratio suggests that for every year since publication, the odds of finding at least one apparently working e-mail decreased by 7%.
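
For readers who want to see the shape of the analysis, here is a minimal sketch of outcome (1) as a logistic regression. The data below are synthetic, generated to mimic the reported 7%-per-year decline; the cohort sizes, the baseline odds of 6:1, and the use of Python with statsmodels are all assumptions for illustration, not the authors’ code or data.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Synthetic cohort: 30 papers at each age 2, 4, ..., 22 years,
# with the odds of a working e-mail decaying by 7% per year (OR = 0.93).
ages = np.repeat(np.arange(2, 24, 2), 30).astype(float)
odds = 6.0 * 0.93 ** ages            # 6.0 = assumed odds at age zero
p = odds / (1.0 + odds)
email_found = rng.binomial(1, p)     # 1 = at least one e-mail worked

# Logistic regression of e-mail success on article age; exponentiating
# the age coefficient gives the per-year odds ratio.
fit = sm.Logit(email_found, sm.add_constant(ages)).fit(disp=False)
print("odds ratio per year:", np.exp(fit.params[1]))
# Recovers roughly 0.93, i.e. a 7% decline in the odds per year.

The same construction with OR = 0.83 corresponds to the reported 17%-per-year decline in the odds of a data set being extant.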

See more discussion and graphs here:

http://www.sciencedirect.com/science/article/pii/S0960982213014000

Chris B

Can a request be put in to just lose the bad data, like temperature measurements that don’t fully embrace the Climate Change Cause?
Doh, it’s been done.
/sarc

Pat Frank

Our problem with older data is that it’s stored in media that are no longer supported, such as TK50 and TK70 tapes formatted in VAX VMS. It’s an expensive and time-consuming proposition to recover those files, even if you have a hard-copy data log, which we do.
I’d suspect most of the recovery problem is there. The data aren’t lost, they’re just very poorly accessible.

DocMartyn

Most data is stored on I-drives, and when you leave your job, every 3-5 years, it goes bye-bye with your account shortly afterward.
You generally keep your raw data for five years or so, so that you can produce it in the event of a query from a grant body or journal. Lab books go into storage or the skip, depending on the institution.
I have the data for all my publications back to 2004. All my data from 95-99 is on disks whose drives are not made any more, but the drive is in the same drawer at my old department.
The data from 2000-2004 was purged from the I-drive at my previous institute 12 months after I left, as it is for all former staff.
The IT people also deliberately reformat drives so that there are no IP documents, viruses or Trojans in reused computers.

Mac the Knife

I’m experiencing this myself. I have boxes with 3.5 inch diskettes and even 5.25 inch ‘floppies’ with stored data from 25+ years of engineering work. I don’t have the drive (or software) necessary to read them.
Perhaps there is a viable service business to be explored here??!

magicjava

Pat Frank says
I’d suspect most of the recovery problem is there. The data aren’t lost, they’re just very poorly accessible
_______________________________
Having spent some time trying to reproduce scientific results with data available to the public, I’d hazard a guess that you’re wrong and the data and source code are intentionally kept from the public.
Does anyone know if the data that went into the study discussed by this article is publicly available?

Gerry

Should I be proud or embarrassed to say that I would have no problem accessing data on 3 1/2 (or 5 1/4) inch disks?

I think libraries should have a lot of the old data. Soon the libraries (or their data content) will be burned down by the elites in government.

magicjava

P.s. And “kept from the public” includes kept from other scientists too, even scientists working on the project.
And sorry for any typos, I’m on a tiny little Nook tablet right now.

John Haddock

Two thoughts:
1) If all the data is truly valuable, then we need a National Scientific Digital Library (NSDL) that hosts digital copies of any published papers and supporting data along with relevant facts about the authors, etc, etc. Given the scale of supercomputers these days, the cost per paper would be minimal. The responsibility for sending the data to the NSDL should lie with the authors.
2) If a paper loses its supporting data it should be considered obsolete or of no value. It should not be cited in subsequent studies.
It’s unrealistic to think that individuals, or even institutions, will protect historical data.

magicjava

P.p.s. For those wondering how you could keep data from scientists working on the project, the answer is to put the data into computer code and nowhere else. Once that’s done, the other scientists never see it, and even a FOIA request can’t get it.

On a related matter, nearly all of the IPCC documentation of any historical interest prior to the 3rd assessment is at risk of being lost. The only significant holdings of these documents are in the private print archives of participants, most of whom are now in their late 70s or 80s. Given the importance that many of us attach to this episode in the history of science, the loss of these documents would be as surprising as it would be tragic. I believe they hold the key to understanding how this singular scare corrupted the institutions of public science. So far I have failed to gain any real support for their collection and preservation.

“Obsolete storage devices.”
Ya think?!
How many of us could lay our hands on a punch card reader, 9-track mag tape drive, DECtape, or 5.25 inch floppy? Heck, moving from IBM 370 to VAX you were going to lose a lot; going from Mac to Windows 3.1 or Win-95 you lost almost everything. For a brief time I worked on the Landmark Mark-1 seismic interpretation workstation. Its big removable media was a 12-inch optical disk cassette, WORM, in eight 100 megabyte sectors. $800 per cassette.

Let’s not forget 8mm and 16mm film.
I moved our family films to video tape about 10 years ago.
Now I have to do it again, from 8mm digital video to DVD or HD. I have to find an 8mm digital cassette camera now; I lost the one we had.
Sisyphus must have been an archivist.

Mac the Knife says:
December 20, 2013 at 6:09 pm
I’m experiencing this myself. I have boxes with 3.5 inch diskettes and even 5.25 inch ‘floppies’ with stored data from 25+ years of engineering work. I don’t have the drive (or software) necessary to read them.
Perhaps there is a viable service business to be explored here??!
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
There are service bureaus available that can take old backup tapes/floppies/hard drives – from everything from an IBM 360 or DEC PDP-11/70 (many still running after 40 years – you can still buy them online) to tapes and floppies from Atari and Commodore 64s. I even had some old 12 inch floppies from AES word processors (which we actually used for engineering calculations under CP/M). I have 40 years of engineering work on everything from tape to floppy to external hard drives tucked away in the basement.
The data does degrade over time, but much is recoverable, and I have actually copied data to different media to avoid degradation. Although, now that we have a 10 year limit of liability in Canada for engineering, I recently tossed all my paper, as I have been retired for more than 10 years. The media files don’t take a lot of space but they may go soon.
Data conversion from one form to another isn’t terribly difficult. In the late 70s we converted VAX Intergraph files to work with Trash-80s, Victor computers from England, and later the first IBM PCs. Kind of fun. Course now I can barely start a computer … sort of.

markx

Mac the Knife says: December 20, 2013 at 6:09 pm
I’m experiencing this myself. I have boxes with 3.5 inch diskettes and even 5.25 inch ‘floppies’ with stored data from 25+ years of engineering work. I don’t have the drive (or software) necessary to read them.
In the tropics we found that improperly stored (i.e. no air con) floppy disks became overgrown with fungus within a few years. As well as clogging up the drives, the fungus damages the surface (i.e., dismantling and cleaning did not help).

magicjava

Wayne Delbeke says:
There are service bureaus available that can take old back up tapes/floppies, hard drives
——————————-
While what you’re saying is true, it takes funding to carry out. I’ll bet dollars to donuts the science teams will never even make a request for such funding.
Go back and re-read the article. Notice that only 23% of the data overall was available, even including papers published as recently as *2011*. That has nothing to do with technology.
In modern science, reproducibility is not a goal, it is something to be avoided at all costs.

Set a deadline: in order to do new, taxpayer-funded research, all of your old taxpayer-funded data must be saved to the “people’s” cloud. Sort of like a taxpayer amnesty program for science data. Another stick could be loss of awards, or retraction of honoraria.

RoHa

20 Years?
I’ve still got high school essays from 50 years ago.

john robertson

For climatology this is a design feature.
Small wonder Phil Jones claimed “context”, as losing/destroying original data now appears to be a tradition.
Surely if taxpayer dollars fund the research, the same agencies must be responsible for storing the completed research, which includes the raw data.
Otherwise what benefit does the taxpayer accrue from funding scientific research?
This makes the case for non-funding even more coherent.

Philip Peake

I have some 1/2″ mag tapes that it is probably possible to read somewhere … then a couple of DECtapes, which maybe some museum could help with … my 10″ floppies would be harder, and the machine code and assembler source on them is for a machine that probably hasn’t been made for 30 years … as for my punched cards and paper tape … well, I *used* to be able to read the paper tape by eye.
There is a real issue here. Not just for research results, but for civilization itself if information can’t be preserved!

High Treason

“Losing” data is a dead giveaway that there is something to hide. The “Inconvenient Truth”, perhaps. What will happen is that their claims will simply be assumed as gospel truth, supporting theories that have not a shred of truth in them: 100% baseless lies from start to finish.

magicjava

One last comment…
Even these numbers do not tell the full story. Saying that 23% of the data is available does not imply that 23% of the studies are reproducible.
A science team can make 95% of its data available and the study still cannot be reproduced without the missing 5%.

KevinK

“Our problem with older data is that it’s stored in media that are no longer supported, such as TK50 and TK70 tapes formatted in VAX VMS.”
In the “old days” we used to make a final “released” drawing of our engineering designs with ink on “vellum” (originally dried sheep stomachs, but later polyester), with several copies created and numbered. One copy went into the “vault”, a nice secure fire resistant room. I have on occasion retrieved a drawing 1 or 2 decades later from the vault for reuse. A little harder to do with a long list of numbers, but things like microfilm (metallic silver in gelatin) are also very stable for many many decades. And the only support needed is a magnifying lens (still available as version 1.0).
At one time companies used to offer “long term” storage by renting space in old (dry) mines to store your microfilm.
If you want your data to be stable it can be done, of course if you would like to “forget” your predictions a nice magnetic tape or hard drive will make your data disappear without much effort.
Cheers, Kevin.

John F. Hultquist

Nothing said above should surprise those who have worked on projects.
Isn’t it funny, though, that keeping data about important topics is difficult, while if someone drunk or naked gets her/his photo on the web it is there forever (or at least a long time)?

NZ Willy

I’ve got a 3.5″ A-drive on my primary computer. It cost only $10 so why not?

noaaprogrammer

“To avoid potential confounding effects of data type and different research community practices, we focused on recovering data from articles containing morphological data from plants or animals…” Yes, they knew better than to choose the area of climatology!
(I still have my boxes of FORTRAN II and FORTRAN IV programs and data on punched cards from the 1960s. Unfortunately I have several hundred 5 1/4 floppies that are waiting for an external drive-to-USB solution.)

Dougmanxx

You can purchase an external USB 3.5 in. or 5.25 in. drive from Newegg for $14.99. Working in IT for a large public university, I suspect this is less a technical problem and more a “human” one. As usual, it’s the people, not the technology, causing the problem.

I bet Google and the NSA could access the data. I was told, “On the Internet, Things Never Go Away Completely”. Was I lied to? Is that not a true statement?
http://link.springer.com/chapter/10.1007/978-0-387-79026-8_3
In other news, here’s a great interview:
COP19: Marc Morano, Executive Editor/Chief Correspondent, Climate Depot

Magnetic media degrades with time. So even if you can find a drive, you might have a hard time reading the diskette or tape.
I have drives that can read 3.5″, 5.25″, and even soft-sectored 8″ diskettes. Unfortunately, most of the 8″ diskettes and many of the others are so degraded that I can’t read the data from them.
The big problem with the 8″ diskettes (except for the 3M brand!) is that the oxide flakes off when the drive’s heads rub the media. If anyone has a solution to that problem, please contact me!
For 3.5″ diskettes, even if the data seems to be unreadable, it still might read correctly with an LS-120 “SuperDisk” drive. I have an old machine with one of those drives, and it is absolutely amazingly good at recovering the data from old diskettes.
For future data archiving, the right solution is probably M-Discs (though you’ll need an LG brand DVD burner to write ’em).

tz

Google “USB Floppy Drive”, under $20. Newegg or Amazon at the top, or go to your local tech store.

temp

I bet climate science would score a 97% on this topic.

noaaprogrammer, the data on your old Hollerith punch cards is probably recoverable by optically scanning them, but it won’t be very easy.
With a bit of effort, I could read paper tape, but punched cards are harder.

davidmhoffer

Dougmanxx;
I suspect this is less a technical problem and more a “human” one.
>>>>>>>>>>>>>
Exactly. For anyone in the data management profession, there’s nothing novel or surprising in this study. The technology and the processes to protect data for the long term have been known for decades. The IT department merely needs the mandate and the funding to make it happen along with publication of the indexes and process for retrieval.

… saw this ‘magnetic medium’ issue in the ’70s at the FWSH&TC (Ft Wayne State Hospital and Training Center), where I worked while going to school; we received a boatload of 1″ Ampex video mag tapes that had been stored ‘outside’ in a non-temperature/environment-controlled atmosphere … at that time the magnetic medium was separating from the polyester tape base …
.

Neo

This is one topic where even the private sector isn’t fully immune. I’ve seen some companies attempt to bury research results from internal efforts, not out of stupidity or malice, but because the tax code makes it painful if the effort isn’t a complete write-off.

Too cool not to post directly, daveburton!

GlynnMhor

I still have my first PC, with both 3.5″ and 5.25″ floppy drives.
On rare occasions we need to use it to pull up old survey data stored on those formats.

KevinK says December 20, 2013 at 7:06 pm

In the “old days” we used to make a final “released” drawing of our engineering designs with ink on “vellum” (originally dried sheep stomachs, but later polyester), with several copies created and numbered. One copy went into the “vault” …

I was going to say, whatever happened to “Drawing Control”? The master vellum copies were ‘microfiched’ (at TI) for subsequent human access in ‘card libraries’ at the various company campuses … the 35mm (or so) microfiche film was placed into a ‘cutout’ on an IBM 80-column ‘card’ which had the drawing number encoded in the first half of the card. Doing this allowed the clerk, using a nearby IBM ‘card sorter’, to re-sort the cards after they had been pulled by engineering and production personnel during the course of the work day …
.

shano

If you wanted to justify why you couldn’t satisfy an FOIA request, it would be useful to have a study like this one. You could say, “I’d be happy to give you my tree ring data, but it seems the magnetic medium was accidentally demagnetized, so you’re out of luck. Please stop persecuting me about it.”

…. the 35mm (or so) microfiche film was placed into a ‘cutout’ on an IBM 80-column ‘card’ which had the drawing number encoded in the first half of the card. …..
________________________________________________________
These are called aperture cards. I worked on a project in 1985 to digitize all of the U.S. government (Army, USAF, Navy) aperture cards. It was a massive contract to build the systems and deploy them (7 systems), and then a much larger task to run the cards.
The aperture cards for the B-1 bomber numbered over 5 million. North American Rockwell estimated that it was going to cost $238 million to digitize that data. We did it for $32 million.
Data migration is one of the two biggest issues in the preservation world. We are working with the National Archives and the Library of Congress, and there is never enough money to get everything done.
Just our project to capture 1960s-era raw data, digitize it, and deliver it to the Planetary Data System from the five Lunar Orbiters is generating almost 40 terabytes of data.

anna v

One of the foundations of the scientific method is the reproducibility of results.
Yes, true. That is why experiments are not one-off, and many are repeated many times, improving the result. The Millikan oil drop experiment, where one measures the charge of the electron, is an example of this, and of how measurements change through time. This was a lab experiment during my studies of physics back in 1960.
In a lab anywhere around the world, a researcher should be able to study the same subject as another scientist and reproduce the same data, or analyze the same data and notice the same patterns.
I think the last statement, “analyze the same data and notice the same patterns,” overshoots the mark for the scientific method by hundreds of years. The way the scientific community made sure of the stability of knowledge was through publications which another researcher could study in order to repeat the experiment. Nobody required the same data to be chewed over again and again. It is the computer age that made the data available for chewing over and over, which in my opinion is the wrong path to take. If there are doubts about data, experiments should be repeated, not checked like homework problems.
The reason is that the complexity of any decent experiment is large, and the probability of errors entering during data gathering is also large, as humans are fallible. Chewing over the same data may only show up these errors, which would explain discrepancies between experiments, or not, because similar blind spots could exist in the new analysis. It would not advance knowledge, particularly if this habit of rechecking made experiments one-off, on the thinking that rechecking the same data is like a new experiment and makes it safe.

mbur

From essay:
“These might seem like mundane obstacles, but scientists are just like the rest of us—they change email addresses, they get new computers with different drives, they lose their file backups—so these trends reflect serious, systemic problems in science.”
Mundane obstacles? What, like tying your shoes?
This kind of data storage is part of ‘science’, isn’t it? Some are proud of it, no?
IDK if it’s just me, but I seem to notice an administrative point of view in recent articles, studies, news items.
From article summary:
“Policies ensuring that research data are available on public archives are increasingly being implemented at the government [1], funding agency [2, 3 and 4], and journal [5 and 6] level”
Thanks for the interesting posts, articles and comments.

TRM

I smell a project for Google! The Google Scientific Archives: stored in multiple data centres on multiple continents, online all the time. It’s better than trying to read VMS tapes that haven’t been retensioned in a decade.

magicjava

anna v says:
December 20, 2013 at 8:42 pm
If there are doubts about data, experiments should be repeated, not checked like homework problems.
—————-
You cannot repeat the experiments without the data.
Example 1:
Satellite data needs to be adjusted due to noise in one of the sensors. The adjustments are written in computer code and stored nowhere else. No one can independently repeat the results of the adjusted satellite readings, including scientists working on the project. And the data cannot be obtained via a FOIA request, because computer code is not considered to be documentation.
Example 2:
Government climate scientists adjust temps of cities *up* rather than down (as would normally be expected for an urban heat island). They give no reason why this adjustment was done this way. You cannot independently verify their reasoning when no reason was given.
Example 3:
Astronomer claims to have reproduced the orbits of all the objects in the solar system, from ancient times when it was little more than gas to modern day. Without the computer code and data it is impossible to know what assumptions were made to reach such a conclusion.
These are all real-life examples.

RoHa

One medium that is particularly stable and needs no supporting technology is ink on acid-free paper. We used to have whole buildings just full of bundles of data in this medium. I forget what those places were called. “Librioteks” or something like that.

Greg

3.5″ floppy drives are still available, but if you try to access a 20 year old floppy you may well be disappointed. Bit rot. The recording medium is not stable over that time scale.
Even without the tropical fungus someone else referred to, I have backups of software from 30 years ago kept in clean, dry conditions, and I’d estimate less than half are fully readable. Even commercially produced originals, and that predates the collapse in the quality of floppy disks that killed the medium.
17% per year! That’s a half-life of just 4 years. That’s serious.
Data storage requires maintenance. Libraries were traditionally created to perform this function for paper records. It seems that, like much in our disposable age, data is now a throwaway commodity too.
But modern science is disposable too. Study results are made to order for political or commercial needs. The naive idea of objective, investigative science is long dead. Scientific “reports” are bought to serve a short-term objective and are then no longer needed.
Welcome to Kleenex science. One wipe and flush.
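
A quick check on Greg’s half-life figure: a 17% annual decline multiplies the odds by 0.83 each year, so the time for the odds to halve is ln(0.5) / ln(0.83) ≈ 3.7 years, which rounds to his “just 4 years.”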

Bob Diaz

This is a real problem with all data: it requires a lot of work to keep porting your older data to new media.

anna v

magicjava says:
December 20, 2013 at 8:56 pm
You cannot repeat the experiments without the data.
We have different definitions of experiment. I am a physicist, and experiment means that one sets up a new experimental apparatus and gets new data.
What you call experiment I call analysis. Historically, scientific knowledge advanced by experiments, and even multiple observations for astronomy, not by reanalyzing the same data.
I agree that it would be good, now that the facilities exist, to keep the data from one-off observations, but it will be a new way of doing science and should only be used for checking discrepancies by reanalysis, not as if it were a new experiment/observation. Certainly the data should not be written in stone for the generations. If the next generation wants to examine something, it should redo the experiments, not regurgitate old measurements.

Duster

RoHa says:
December 20, 2013 at 8:58 pm
One medium that is particularly stable and needs no supporting technology is ink on acid-free paper. We used to have whole buildings just full of bundles of data in this medium. I forget what those places were called. “Librioteks” or something like that.

The problem is information density. Paper is a really useful adjunct for relatively limited amounts of data. But paper is really bulky for the amount of information you can store on it. I’ve dealt with projects where my staff really whined when I insisted on hard copies of everything. At the very worst, the data may need to be re-entered, but that is cheap compared to losing it completely.

Claivus

Stone is the only answer. Poor data density but great shelf life. Or perhaps engraved platinum.