The Vast Majority of Raw Data From Old Scientific Studies May Now Be Missing

From the people who know how to save and care for things of importance comes this essay from The Smithsonian:

One of the foundations of the scientific method is the reproducibility of results. In a lab anywhere around the world, a researcher should be able to study the same subject as another scientist and reproduce the same data, or analyze the same data and notice the same patterns.

This is why the findings of a study published today in Current Biology are so concerning. When a group of researchers tried to email the authors of 516 biological studies published between 1991 and 2011 and ask for the raw data, they were dismayed to find that more than 90 percent of the oldest data (from papers written more than 20 years ago) were inaccessible. In total, even including papers published as recently as 2011, they were only able to track down the data for 23 percent.

“Everybody kind of knows that if you ask a researcher for data from old studies, they’ll hem and haw, because they don’t know where it is,” says Timothy Vines, a zoologist at the University of British Columbia, who led the effort. “But there really hadn’t ever been systematic estimates of how quickly the data held by authors actually disappears.”

To make their estimate, his group chose a type of data that’s been relatively consistent over time—anatomical measurements of plants and animals—and dug up between 25 and 40 papers for each odd year during the period that used this sort of data, to see if they could hunt down the raw numbers.

A surprising number of their inquiries were halted at the very first step: for 25 percent of the studies, active email addresses couldn’t be found, with defunct addresses listed on the paper itself and web searches not turning up any current ones. For another 38 percent of studies, their queries led to no response. Another 7 percent of the data sets were lost or inaccessible.

“Some of the time, for instance, it was saved on three-and-a-half inch floppy disks, so no one could access it, because they no longer had the proper drives,” Vines says. Because the whole point of keeping data is so that others can use it in future research, this sort of obsolescence essentially renders the data useless.

These might seem like mundane obstacles, but scientists are just like the rest of us—they change email addresses, they get new computers with different drives, they lose their file backups—so these trends reflect serious, systemic problems in science.

===============================================================

The paper:

The Availability of Research Data Declines Rapidly with Article Age


Highlights

• We examined the availability of data from 516 studies between 2 and 22 years old

• The odds of a data set being reported as extant fell by 17% per year

• Broken e-mails and obsolete storage devices were the main obstacles to data sharing

• Policies mandating data archiving at publication are clearly needed


Summary

Policies ensuring that research data are available on public archives are increasingly being implemented at the government [1], funding agency [2, 3 and 4], and journal [5 and 6] level. These policies are predicated on the idea that authors are poor stewards of their data, particularly over the long term [7], and indeed many studies have found that authors are often unable or unwilling to share their data [8, 9, 10 and 11]. However, there are no systematic estimates of how the availability of research data changes with time since publication. We therefore requested data sets from a relatively homogeneous set of 516 articles published between 2 and 22 years ago, and found that availability of the data was strongly affected by article age. For papers where the authors gave the status of their data, the odds of a data set being extant fell by 17% per year. In addition, the odds that we could find a working e-mail address for the first, last, or corresponding author fell by 7% per year. Our results reinforce the notion that, in the long term, research data cannot be reliably preserved by individual researchers, and further demonstrate the urgent need for policies mandating data sharing via public archives.
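For readers less used to odds ratios, the “17% per year” figure in the highlights is just the standard logistic-regression conversion from a fitted coefficient to a per-year change in odds. A minimal sketch of the arithmetic (standard algebra, not taken from the paper’s methods section):

```latex
% Standard conversion from a logistic-regression coefficient \beta
% to a per-year percent change in odds (not specific to this paper).
\[
  \mathrm{OR} = e^{\beta}, \qquad
  \text{percent change per year} = (\mathrm{OR} - 1) \times 100\%.
\]
% With OR = 0.83, the odds of a data set being extant fall by 17% per
% year; compounded over the 20-year span of the sample, the odds shrink
% to roughly 2% of their starting value:
\[
  \mathrm{OR}^{20} = 0.83^{20} \approx 0.024.
\]
```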


Results

We investigated how research data availability changes with article age. To avoid potential confounding effects of data type and different research community practices, we focused on recovering data from articles containing morphological data from plants or animals that made use of a discriminant function analysis (DFA). Our final data set consisted of 516 articles published between 1991 and 2011. We found at least one apparently working e-mail for 385 papers (74%), either in the article itself or by searching online. We received 101 data sets (19%) and were told that another 20 (4%) were still in use and could not be shared, such that a total of 121 data sets (23%) were confirmed as extant. Table 1 provides a breakdown of the data by year.

We used logistic regression to formally investigate the relationships between the age of the paper and (1) the probability that at least one e-mail appeared to work (i.e., did not generate an error message), (2) the conditional probability of a response given that at least one e-mail appeared to work, (3) the conditional probability of getting a response that indicated the status of the data (data lost, data exist but unwilling to share, or data shared) given that a response was received, and, finally, (4) the conditional probability that the data were extant (either “shared” or “exists but unwilling to share”) given that an informative response was received.
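As a concrete illustration of this setup, here is a minimal sketch of fitting one such logistic regression in Python with statsmodels. The paper does not say what software or variable names were used; the toy data frame, the column names, and the choice of statsmodels are all my own assumptions.

```python
# Minimal sketch (not the authors' code): logistic regression of one
# binary outcome, e.g. "at least one e-mail appeared to work", on
# article age. Toy data and column names are illustrative assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-paper records: age of the article in years, and
# whether at least one listed e-mail address appeared to work (1/0).
papers = pd.DataFrame({
    "age":         [2, 2, 6, 6, 10, 10, 14, 14, 18, 18, 22, 22],
    "email_works": [1, 1, 1, 0,  1,  1,  0,  1,  0,  1,  0,  0],
})

model = smf.logit("email_works ~ age", data=papers).fit(disp=False)

# exp(beta) is the odds ratio per year of article age; a value of 0.93
# would mean the odds of a working e-mail fall by about 7% per year.
print(f"odds ratio per year: {np.exp(model.params['age']):.2f}")
```

The three conditional probabilities would be handled the same way, with each model fitted only on the subset of papers that cleared the previous step.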

There was a negative relationship between the age of the paper and the probability of finding at least one apparently working e-mail either in the paper or by searching online (odds ratio [OR] = 0.93 [0.90–0.96, 95% confidence interval (CI)], p < 0.00001). The odds ratio suggests that for every year since publication, the odds of finding at least one apparently working e-mail decreased by 7%.
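To see what a 7% annual decline in odds means over the two decades spanned by the sample, here is a small worked example (my own illustration; the 3:1 starting odds are a made-up assumption, since the paper reports odds ratios rather than baselines):

```python
# Illustrative only: compound a constant per-year odds ratio over
# article age and convert the resulting odds back to a probability.

def odds_after_years(start_odds: float, odds_ratio: float, years: int) -> float:
    """Odds after compounding a constant per-year odds ratio."""
    return start_odds * odds_ratio ** years

def odds_to_probability(odds: float) -> float:
    return odds / (1.0 + odds)

# Assume (hypothetically) 3:1 odds, i.e. a 75% chance, of finding a
# working e-mail for a brand-new paper, decaying at the reported
# OR = 0.93 per year of article age.
for years in (0, 5, 10, 20):
    odds = odds_after_years(3.0, 0.93, years)
    print(f"{years:2d} yr: odds {odds:.2f}, probability {odds_to_probability(odds):.0%}")
```

Under those assumptions, the probability of finding a working address drops from 75% for a new paper to roughly 41% at 20 years.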

See more discussion and graphs here:

http://www.sciencedirect.com/science/article/pii/S0960982213014000


144 Comments
Chris B
December 20, 2013 5:12 pm

Can a request be put in to just lose the bad data, like temperature measurements that don’t fully embrace the Climate Change Cause?
Doh, it’s been done.
/sarc

December 20, 2013 5:42 pm

Our problem with older data is that it’s stored in media that are no longer supported, such as TK50 and TK70 tapes formatted in VAX VMS. It’s an expensive and time-consuming proposition to recover those files, even if you have a hard-copy data log, which we do.
I’d suspect most of the recovery problem is there. The data aren’t lost, they’re just very poorly accessible.

DocMartyn
December 20, 2013 5:48 pm

Most data is stored on I-drives, and when you leave your job (every 3-5 years) it goes bye-bye with your account shortly afterward.
You generally keep your raw data for five years or so, so that you can produce it in the event of a query from a grant body or journal. Lab books go into storage or the skip, depending on institution.
I have the data for all my publications back to 2004. All my data from 95-99 is on drives that are not made any more, but the drive is in the same drawer at my old department.
The data from 2000-2004 was purged from the I-drive at my previous Institute 12 months after I left, as it is for all former staff.
The IT people also deliberately reformat drives so that there are no IP documents, viruses or Trojans on reused computers.

Mac the Knife
December 20, 2013 6:09 pm

I’m experiencing this myself. I have boxes with 3.5 inch diskettes and even 5 inch ‘floppies’ with stored data from 25+ years of engineering work. I don’t have a drive (or software) necessary to read them.
Perhaps there is a viable service business to be explored here??!

magicjava
December 20, 2013 6:10 pm

Pat Frank says
I’d suspect most of the recovery problem is there. The data aren’t lost, they’re just very poorly accessible
_______________________________
Having spent some time trying to reproduce scientific results with data available to the public, I’d hazard a guess you’re wrong and the data and source code are intentionally kept from the public.
Does anyone know if the data that went into the study discussed by this article is publicly available?

Gerry
December 20, 2013 6:10 pm

Should I be proud or embarrassed to say that I would have no problem accessing data on 3 1/2 (or 5 1/4) inch drives?

December 20, 2013 6:13 pm

I think libraries should have a lot of the old data. Soon the libraries (or their data content) will be burned down by the elites in government.

magicjava
December 20, 2013 6:17 pm

P.s. and “kept from the public” includes kept from other scientists too, even scientists working on the project.
And sorry for any typos, I’m on a tiny little nook tablet right now.

John Haddock
December 20, 2013 6:18 pm

Two thoughts:
1) If all the data is truly valuable, then we need a National Scientific Digital Library (NSDL) that hosts digital copies of any published papers and supporting data, along with relevant facts about the authors, etc. Given the scale of supercomputers these days, the cost per paper would be minimal. The responsibility for sending the data to the NSDL should lie with the authors.
2) If a paper loses its supporting data it should be considered obsolete or of no value. It should not be cited in subsequent studies.
It’s unrealistic to think that individuals, or even institutions, will protect historical data.

magicjava
December 20, 2013 6:22 pm

P.p.s. for those wondering how you could keep data from scientists working on the project, the answer is put the data into computer code and nowhere else. Once that’s done, the other scientists never see it and even a FOIA request can’t get it.

December 20, 2013 6:23 pm

On a related matter, nearly all of the IPCC documentation of any historical interest prior to the 3rd assessment is at risk of being lost. The only significant holdings of these documents are in the private print archives of participants — most of whom are now in their late 70s or 80s. Given the importance that many of us assign to this episode in the history of science, the loss of these documents would be as surprising as it would be tragic. I believe they hold the key to understanding how this singular scare corrupted the institutions of public science. So far I have failed to gain any real support for their collection and preservation.

December 20, 2013 6:24 pm

“Obsolete storage devices.”
Ya think?!
How many of us could lay our hands on a punch card reader, 9-track mag tape drive, DECtape, or 5.25 inch floppy? Heck, moving from IBM 370 to VAX, you are going to lose a lot. Going from Mac to Windows 3.1 or Win-95 you lost almost everything. For a brief time I worked on the Landmark Mark-1 seismic interpretation station. Its big removable media was a 12-inch optical disk cassette, WORM in eight 100 megabyte sectors. $800 per cassette.

December 20, 2013 6:31 pm

Let’s not forget 8mm and 16mm film.
I moved our family films to video tape about 10 years ago.
Now I have to do it again, from 8mm digital video to DVD or HD. I have to find an 8mm digital cassette camera now — lost the one we had.
Sisyphus must have been an archivist.

December 20, 2013 6:40 pm

Mac the Knife says:
December 20, 2013 at 6:09 pm
I’m experiencing this myself. I have boxes with 3.5 inch diskettes and even 5 inch ‘floppies’ with stored data from 25+ years of engineering work. I don’t have a drive (or software) necessary to read them.
Perhaps there is a viable service business to be explored here??!
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
There are service bureaus available that can take old backup tapes/floppies and hard drives from everything from an IBM 360 or DEC PDP 11/70 (many still running after 40 years; you can still buy them online) to tapes and floppies from Atari and Commodore 64s. I even had some old 12 inch floppies from AES word processors (that we actually used for engineering calculations under CP/M). I have 40 years of engineering work on everything from tape to floppy to external hard drives tucked away in the basement. The data does degrade over time, but much is recoverable, and I have actually copied data to different media to avoid degradation. Now that we have a 10 year limit of liability in Canada for engineering, I recently tossed all my paper, as I have now been retired for more than 10 years. The media files don’t take a lot of space but they may go soon. Data conversion from one form to another isn’t terribly difficult. In the late 70’s we converted VAX Intergraph files to work with Trash 80’s, Victor computers from England, and later the first IBM PCs. Kind of fun. Course now I can barely start a computer … sort of.

markx
December 20, 2013 6:45 pm

Mac the Knife says: December 20, 2013 at 6:09 pm
I’m experiencing this myself. I have boxes with 3.5 inch diskettes and even 5 inch ‘floppies’ with stored data from 25+ years of engineering work. I don’t have a drive (or software) necessary to read them.
In the tropics we found that improperly stored (i.e., no air conditioning) floppy disks became overgrown with fungus within a few years. As well as clogging up the drives, the fungus damages the surface (i.e., dismantling and cleaning did not help).

magicjava
December 20, 2013 6:51 pm

Wayne Delbeke says:
There are service bureaus available that can take old back up tapes/floppies, hard drives
——————————-
While what you’re saying is true, it takes funding to carry out. I’ll bet dollars to donuts the science teams will never even make a request for such funding.
Go back and re-read the article. Notice that only 23% of the data from *2011* was available. That has nothing to do with technology.
In modern science, reproducibility is not a goal, it is something to be avoided at all costs.

December 20, 2013 6:53 pm

Set a deadline: in order to do new, taxpayer-funded research, all of your old taxpayer-funded data must be saved to the “peoples” cloud. Sort of like a taxpayer amnesty program for science data. Another stick could be loss of awards, or retraction of honoraria.

RoHa
December 20, 2013 6:56 pm

20 Years?
I’ve still got high school essays from 50 years ago.

john robertson
December 20, 2013 6:58 pm

For climatology this is a design feature.
Small wonder Phil Jones claimed “context”, as losing/destroying original data now appears to be a tradition.
Surely if taxpayer dollars fund the research, the same agencies must be responsible for storing the completed research, which includes the raw data.
Otherwise what benefit does the taxpayer accrue from funding scientific research?
This makes the case for non funding even more coherent.

Philip Peake
December 20, 2013 7:02 pm

I have some 1/2″ mag tapes that it is probably possible to read somewhere … then a couple of DECtapes, which maybe some museum could help with … my 10″ floppies would be harder, and the machine code and assembler source on them is for a machine that probably hasn’t been made for 30 years … as for my punched cards and paper tape … although I *used* to be able to read the paper tape by eye.
There is a real issue here. Not just for research results, but for civilization itself if information can’t be preserved!

High Treason
December 20, 2013 7:03 pm

“Losing” data is a dead giveaway that there is something to hide. The “Inconvenient Truth”, perhaps. What will happen is that their claims will simply be assumed as gospel truth, supporting theories that have not a shred of truth in them: 100% baseless lies from start to finish.

magicjava
December 20, 2013 7:05 pm

One last comment…
Even these numbers do not tell the full story. Saying that 23% of the data is available does not imply that 23% of the studies are reproducible.
A science team can make 95% of its data available and the study still cannot be reproduced without the missing 5%.

KevinK
December 20, 2013 7:06 pm

“Our problem with older data is that it’s stored in media that are no longer supported, such as TK50 and TK70 tapes formatted in VAX VMS.”
In the “old days” we used to make a final “released” drawing of our engineering designs with ink on “vellum” (originally dried sheep stomachs, but later polyester), with several copies created and numbered. One copy went into the “vault”, a nice secure fire resistant room. I have on occasion retrieved a drawing 1 or 2 decades later from the vault for reuse. A little harder to do with a long list of numbers, but things like microfilm (metallic silver in gelatin) are also very stable for many many decades. And the only support needed is a magnifying lens (still available as version 1.0).
At one time companies used to offer “long term” storage by renting space in old (dry) mines to store your microfilm.
If you want your data to be stable, it can be done. Of course, if you would like to “forget” your predictions, a nice magnetic tape or hard drive will make your data disappear without much effort.
Cheers, Kevin.

John F. Hultquist
December 20, 2013 7:06 pm

Nothing said above should surprise those who have worked on projects.
Isn’t it funny, though, that keeping data about important topics is difficult, while if someone drunk or naked gets her/his photo on the web it is there forever (or at least a long time)?

NZ Willy
December 20, 2013 7:11 pm

I’ve got a 3.5″ A-drive on my primary computer. It cost only $10 so why not?
