The Vast Majority of Raw Data From Old Scientific Studies May Now Be Missing

From the people who know how to save and care for things of importance comes this essay from The Smithsonian:

One of the foundations of the scientific method is the reproducibility of results. In a lab anywhere around the world, a researcher should be able to study the same subject as another scientist and reproduce the same data, or analyze the same data and notice the same patterns.

This is why the findings of a study published today in Current Biology are so concerning. When a group of researchers tried to email the authors of 516 biological studies published between 1991 and 2011 to ask for the raw data, they were dismayed to find that more than 90 percent of the oldest data (from papers written more than 20 years ago) were inaccessible. In total, even including papers published as recently as 2011, they were only able to track down the data for 23 percent.

“Everybody kind of knows that if you ask a researcher for data from old studies, they’ll hem and haw, because they don’t know where it is,” says Timothy Vines, a zoologist at the University of British Columbia, who led the effort. “But there really hadn’t ever been systematic estimates of how quickly the data held by authors actually disappears.”

To make their estimate, his group chose a type of data that’s been relatively consistent over time—anatomical measurements of plants and animals—and dug up between 25 and 40 papers that used this sort of data for each odd year during the period, to see if they could hunt down the raw numbers.

A surprising number of their inquiries were halted at the very first step: for 25 percent of the studies, active email addresses couldn’t be found, with defunct addresses listed on the paper itself and web searches not turning up any current ones. For another 38 percent of studies, their queries led to no response. Another 7 percent of the data sets were lost or inaccessible.

“Some of the time, for instance, it was saved on three-and-a-half inch floppy disks, so no one could access it, because they no longer had the proper drives,” Vines says. Because the basic idea of keeping data is so that it can be used by others in future research, this sort of obsolescence essentially renders the data useless.

These might seem like mundane obstacles, but scientists are just like the rest of us—they change email addresses, they get new computers with different drives, they lose their file backups—so these trends reflect serious, systemic problems in science.

===============================================================

The paper:

The Availability of Research Data Declines Rapidly with Article Age


Highlights

• We examined the availability of data from 516 studies between 2 and 22 years old

• The odds of a data set being reported as extant fell by 17% per year

• Broken e-mails and obsolete storage devices were the main obstacles to data sharing

• Policies mandating data archiving at publication are clearly needed


Summary

Policies ensuring that research data are available on public archives are increasingly being implemented at the government [1], funding agency [2, 3 and 4], and journal [5 and 6] level. These policies are predicated on the idea that authors are poor stewards of their data, particularly over the long term [7], and indeed many studies have found that authors are often unable or unwilling to share their data [8, 9, 10 and 11]. However, there are no systematic estimates of how the availability of research data changes with time since publication. We therefore requested data sets from a relatively homogenous set of 516 articles published between 2 and 22 years ago, and found that availability of the data was strongly affected by article age. For papers where the authors gave the status of their data, the odds of a data set being extant fell by 17% per year. In addition, the odds that we could find a working e-mail address for the first, last, or corresponding author fell by 7% per year. Our results reinforce the notion that, in the long term, research data cannot be reliably preserved by individual researchers, and further demonstrate the urgent need for policies mandating data sharing via public archives.


Results

We investigated how research data availability changes with article age. To avoid potential confounding effects of data type and different research community practices, we focused on recovering data from articles containing morphological data from plants or animals that made use of a discriminant function analysis (DFA). Our final data set consisted of 516 articles published between 1991 and 2011. We found at least one apparently working e-mail for 385 papers (74%), either in the article itself or by searching online. We received 101 data sets (19%) and were told that another 20 (4%) were still in use and could not be shared, such that a total of 121 data sets (23%) were confirmed as extant. Table 1 provides a breakdown of the data by year.

We used logistic regression to formally investigate the relationships between the age of the paper and (1) the probability that at least one e-mail appeared to work (i.e., did not generate an error message), (2) the conditional probability of a response given that at least one e-mail appeared to work, (3) the conditional probability of getting a response that indicated the status of the data (data lost, data exist but unwilling to share, or data shared) given that a response was received, and, finally, (4) the conditional probability that the data were extant (either “shared” or “exists but unwilling to share”) given that an informative response was received.
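
For readers who want to see the shape of such a model, here is a minimal sketch in Python of the first of those regressions (working e-mail versus article age), assuming the pandas and statsmodels libraries. The data frame below is invented purely for illustration; it is not the paper’s data set:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Invented toy data, one row per paper: 'age' is years since
    # publication, 'email_ok' is 1 if at least one e-mail worked.
    papers = pd.DataFrame({
        "age":      [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22] * 4,
        "email_ok": [1, 1, 1, 1, 0,  1,  0,  1,  0,  0,  0] * 4,
    })

    fit = smf.logit("email_ok ~ age", data=papers).fit(disp=0)
    print(f"per-year odds ratio: {np.exp(fit.params['age']):.2f}")

Exponentiating the fitted age coefficient turns the log-odds slope into the per-year odds ratio quoted in the paper.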

There was a negative relationship between the age of the paper and the probability of finding at least one apparently working e-mail either in the paper or by searching online (odds ratio [OR] = 0.93 [0.90–0.96, 95% confidence interval (CI)], p < 0.00001). The odds ratio suggests that for every year since publication, the odds of finding at least one apparently working e-mail decreased by 7%.
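
To get a feel for how a constant per-year odds ratio compounds over a paper’s lifetime, here is a back-of-the-envelope Python sketch. The odds ratios (0.93 for e-mails, 0.83 for extant data) are the paper’s estimates; the baseline odds of 4.0 (an 80% starting probability) are an assumption chosen only for illustration:

    def probability_after(years, yearly_odds_ratio, baseline_odds=4.0):
        # Compound the per-year odds ratio, then convert odds to a probability.
        odds = baseline_odds * yearly_odds_ratio ** years
        return odds / (1 + odds)

    for years in (2, 10, 22):
        print(f"{years:2d} yr: P(working e-mail) ~ {probability_after(years, 0.93):.2f}, "
              f"P(data extant) ~ {probability_after(years, 0.83):.2f}")

Under these illustrative assumptions, the probability that a data set is extant falls from roughly 0.73 at 2 years to about 0.06 at 22 years, the same order of decline the study observed.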

See more discussion and graphs here:

http://www.sciencedirect.com/science/article/pii/S0960982213014000

Lynn Clark
December 21, 2013 3:44 am

Somewhat [un]related:
Those of us who are currently alive may be the last generations whose descendants will be able to see us in old photographs. When I take my last breath, the thousands of digital photos I have stored on my strong-password-protected computer will probably never be seen again by anyone. This year I finally got around to transferring all the VHS home movies of my family, and the old 8mm home movies of my Dad’s family to digital files. Uploaded them all to youtube (unlisted so only family members can view them) and gave the URLs to every relative I thought would be interested in them. I couldn’t think of a better way to try to preserve them for posterity. I’ve commented to various people before that 100 years from now it is highly unlikely that there will be anyone alive who remembers any of us unless we manage to do something extraordinarily memorable (good or bad).

December 21, 2013 3:56 am

Well, this is a very relevant issue.
I surely do not have raw data from the experiments I did in the early ’90s, when we used to record on a paper polygraph with no digital capabilities at all.
As others have mentioned, keeping all the raw data from an entire life spent in research is a very costly undertaking. Perhaps funding bodies should consider sponsoring this kind of application, which may turn out to be even more scientifically relevant than a sound and trendy new grant application with up-to-date technology.

December 21, 2013 4:00 am

For those of you who are having trouble recovering data from 3.5″ floppy diskettes, I strongly recommend two things:
1. Open the write-protect tab/window on each 3.5″ diskette before inserting it into any disk drive. (For 5.25″ diskettes, cover the write-protect notch.) This is very important: diskette drives position their heads by “dead reckoning,” and when two different drives write to a diskette, the new data usually doesn’t line up exactly with the old, causing hard-to-read mixtures of old and new data. Microsoft Windows writes to the disks (to update the “last accessed” date/time) whenever you read from them, so write-protecting the diskettes before you try to read them prevents Windows from destroying the readability of your fragile data.
2. Buy a used LS-120 “SuperDisk” drive (under US$30 on eBay), install it in (or plug the USB version into) an old Windows XP computer, and use it to rescue the data from your diskettes. Those LS-120 drives work much better than regular diskette drives when reading from old, degraded media.
Note that most LS-120 drives (including mine) use IDE (PATA) cable connections rather than diskette cable connections. There are also USB LS-120 drives, but I’ve never used one. There are laptop drives as well, but it could be challenging to get one of those working unless you have the right model of (old!) laptop computer.
You’ll need an old computer. I don’t think that Windows versions after Windows XP include LS-120 drivers (I’m sure 64-bit versions of Windows don’t), and new computers don’t have IDE interfaces unless you add an adapter.

December 21, 2013 4:15 am

A small part of high defence procurement costs is due to data archiving.
My old company had procedures where, on a regular basis, all archived data was copied from surface to surface. Irrespective of the type of surface medium, tape or disk, the data would be copied.
This ensured that the design and support data could genuinely survive the contracted period of 20 years.
The customer could then procure a new contract or have the data destroyed.
In the engineering world, to completely lose a whole dataset would be professional suicide.
It seems in science, it’s a badge of respect amongst some!

Allen63
December 21, 2013 4:35 am

I think much “scientific” data is bad data in the sense of not completely supporting the “conclusions” for one reason or another. But, given “publish or perish”, the conclusions are published — and the “data” not.
Given the modern “cloud”, virtually “all scientific data” could be uploaded there and probably kept for “all time” — if fellow scientists really wanted their data available to the “public”. In a more honorable world, someone might seriously propose that solution on a National or World Wide basis — and others agree to it.

December 21, 2013 4:52 am

Steve Richards says December 21, 2013 at 4:15 am
A small part of high defence procurement costs is due to data archiving.
My old company had procedures where, on a regular basis, all archived data was copied from surface to surface. Irrespective of the type of surface medium, tape or disk, the data would be copied.

One of the big advantages of working in the forward-looking and organized environment provided for ongoing semiconductor production and research, as well as (defense) ‘projects’, at a company such as TI was access to resources like the IBM “Tape Librarian” facilities maintained by the CIC (Corporate Information Center) folks …
Nowadays, robotically implemented mechanisms ‘fetch and replace’ tape volumes in cartridge form when named datasets are requested for read or write … ‘updating’ or refreshing of the data is done on a regular timed basis to new tapes in a multi-tape (older-to-newer) ‘set’, allowing access to some number “n” back in the series of tapes in the event of any issue which might arise.
Unaccessed, ‘dormant’ (for some period of time) datasets normally residing on DASD (direct access storage devices – 3350, 3380, etc. ‘hard disks’) were also backed out to the ‘tape library’, freeing up that valuable and limited resource as well. I can recall several ‘jobs’ (on the IBM 370 mainframe) in the ’80s that required extra time to complete (or start, actually!) because an infrequently used data file had been taken off the HDs and put into the long-term (and cheaper) ‘tape’ storage system.
http://en.wikipedia.org/wiki/Tape_library
Quickie showing a modern tape library being installed, plus a sample of its operation:
an automated tape library (IBM TS3500) … capable of storing ~27.4 petabytes (PB) of uncompressed data. The library is composed of 16 frames plus 2 service bay cabinets and can contain up to 18,257 tapes moved by two robots.

dearieme
December 21, 2013 5:06 am

I binned my stuff when I retired: there was neither room nor point in taking it home, and nobody at work would have been interested in my punched cards, mag tapes, floppies, zip drives and so on. Or even paper records. It also saved me from having to distinguish data that were mine to use freely and data I had been given under confidentiality agreements, or had assigned to grant-givers. I suppose my collaborators still have raw data from the last decade or so.
Mind you, I did once have a Philistine Head of Department who had our librarian destroy all reports older than ten years on the grounds that nobody could possibly be interested any more. Inevitably this was discovered by somebody asking for an old report of his own and being told by the librarian what had happened. So there is not necessarily any advantage in handing stuff over on retirement anyway.
The key lessons are (i) Don’t appoint arseholes as Heads of Department, and (ii) If data aren’t archived promptly, loss is very likely.
So who’s to pay for the cost of the original archiving (probably minor) and for maintaining the archive (probably major)?

Robin Hewitt
December 21, 2013 5:14 am

I use a “One Year Rule” to control bin access when tidying up; surely the same kind of arrangement could be used here. If your data have sat around for, say, 5 years and nobody has shown a lick of interest, you can be fairly confident that the world has forgotten your efforts. If you get a request from someone talking about data preservation, send them 2 GB of random Wiki on a cheap memory stick. If they complain, try to retrieve the real thing.

Editor
December 21, 2013 5:53 am

I have spent literally days and days scanning old photographic negatives, and prints where I couldn’t find the negatives, and have catalogued them by year and season (current is 2013 Winter). The earliest I have are my grandparents’ photos from 1902 (season unknown!). I have digitised VHS videos of my children’s parties, school plays, etc. To lose it all would be heartbreaking, so it is automatically backed up on two network HDDs in the house and manually to another one in Spain. I have tried cloud backup, but with 1.2 TB of data it takes ages to upload and almost as long to download, even with fibre broadband.
Like many people have said in previous postings, I think some climate data was deliberately “lost” to prevent uncomfortable questions being asked. Again, as others have said above, if the raw data is no longer available, that should make the study invalid. I appreciate that storage methods and formats have changed, but the bottom line is that all computer data is binary, sequences of zeroes and ones, so moving from one medium to another is only a problem if the person doing the moving does not feel it is important enough to move.

Luke Warmist
December 21, 2013 5:53 am

Data Retrieval
Something that so far hasn’t been addressed needs mentioning here.
Back in the late ’80s I worked on complex non-planar surfaces, and one of the things we had to do was generate surface normals for a predetermined distance and the approach vectors to them. This was done on a VAX/VMS system.
Some years later, when we migrated to desktop systems, it was discovered that the end points of the surface normals and the approach vectors changed when processed on the new systems.
Much handwringing and gnashing of teeth took place until it was finally determined that the VAX was using sum of least squares, and the desktops were using the much newer cubic squares technique, to arrive at what were supposed to be the same points in space.
Simply having the data may be only part of the solution, especially where complex calculations are concerned. Processor math has changed over time as well.
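
The VAX-versus-desktop discrepancy described above can’t be reproduced without the original code, but the general point, that nominally identical arithmetic can give different answers depending on how it is carried out, is easy to illustrate in Python:

    # Floating-point addition is not associative: the same three
    # numbers summed in a different grouping give different doubles.
    a = (0.1 + 0.2) + 0.3   # grouped left to right
    b = 0.1 + (0.2 + 0.3)   # grouped right to left
    print(a, b, a == b)     # 0.6000000000000001 0.6 False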

geography lady
December 21, 2013 5:59 am

Almost 40 years ago I worked on outdoor environmental exposures to asbestos. There were not very many samples taken outdoors, much less indoor/occupational exposures. The analytical techniques were all over the map, and there were many, many disagreements as to the health effects of exposure. Asbestos was the hot-button issue of the mid-1970s and early 1980s. Most of the exposures we were concerned with had to do with the type of rock and how it was used (road surfaces). We established a sampling protocol with analytical techniques that were cutting edge at the time (not used by anyone else) and are now standard protocol. There were 3 years of sampling of different exposures. At the time I worked for a governmental agency, which, like everyone else, I did not trust to keep the information for others to look at the data or analyze the samples.
I kept the samples, the raw lab data, and the end results until 4 years ago, when I moved my home (which is where I kept the information and stuff). I threw out everything, since there would be no more interest in this data. BTW, under EPA law much of the asbestos data, jobs, etc. has to be kept for up to 30 years or more. So I have a whole bunch (a closet full) of asbestos projects that will only go when I die, and then no one will care. 😉

Earl Smith
December 21, 2013 6:24 am

One of the suggestions was requiring a government data archive.
Let me give you an example of a government data archive in action.
Harris County, Texas (Houston) had a massive problem with old land records and the like; it was all on paper and costing a fortune to house. So, being efficient, they decided to transfer it all to microfilm and then burn the originals.
Lots of money made in the contract and it was expertly carried out by a friend of the Powers That Be. And in due course the paper went up in flames.
And then the discovery: for cost-control purposes, the contractor had stored over 100 years of data on cheap microfilm with an expected life of 4 years. Nothing to be done but weep; the project was a 5-year effort, and the earliest data had already decayed.
Trust not in Princes!

Pippen Kool
December 21, 2013 6:26 am

“they were dismayed to find that more than 90 percent of the oldest data (from papers written more than 20 years ago) were inaccessible.”
This data-storage worry is a little overdone. First, explain what meaningful finding has been “destroyed” by the fact that the data is no longer available. Is Madame Curie’s radium not radiumating anymore? Are the fly mutants that Nüsslein-Volhard and Eric Wieschaus identified no longer pointing to important genes in human disease? Is science falling apart because we can’t read our 3.5-inch floppies?
Any work that is important is usually repeated in a slightly different way. For example, most people believe Mann’s old ’98 paper not because they have looked at his data but because the study has been repeated (not exactly, but close enough) many times by now; Marcott’s paper (’13) might be the latest one.

beng
December 21, 2013 6:30 am

Not surprising, given the current attitudes. As Mosher has said, raw data is crap.

Editor
December 21, 2013 6:49 am

Stephen Rasey says:
December 20, 2013 at 6:24 pm

“Obsolete storage devices.”
Ya think?!
How many of us could lay our hands on a punch card reader, 9-track mag tape drive, DecTape, 5.25 inch floppy?

Philip Peake says:
December 20, 2013 at 7:02 pm

I have some 1/2″ mag tapes that it is probably possible to read somewhere … then a couple of DEC-tapes,

Yay DECtapes! I have one on my desk that I’ve been meaning to take into work for a while. I don’t have a DECtape drive, but I bet the tape is still readable. At least the oxide hasn’t fallen off the tape yet.
I should be able to read a 5.25″ floppy, but only if my CP/M system still boots.
http://www.obsoletemedia.org/dectape/

December 21, 2013 6:53 am

This is done on PURPOSE for the most part.
1) There is no science in their studies so they lose the data.
2) The Piled on High and Deep ‘researchers’ are not researching anything with meaning.
3) If they ‘lose’ the data, they can always reissue their ‘study’ for grant requests and tax money.
4) They are incompetent and corrupt – see Mann made hockey sticks for more info.

Terry Warner
December 21, 2013 7:00 am

As a non scientist there are a number of things which strike me about this debate:
1. In days of old only a small segment of a small population generated data. It was expensive, valued, and people were motivated to preserve it (libraries). In 2013 data is cheap to generate by a hugely increased population, most of whom have the ability (education and facilities).
2. With geometric growth, data (scientific and other) cannot all realistically be archived and kept indefinitely accessible. The costs (hardware, software, porting, file maintenance, storage, etc.) will become increasingly unaffordable. We therefore have to accept that most data will have a short shelf life.
3. The scientific community needs to respond to this by doing a number of things:
(a) set minimum standards for data retention supporting scientific papers – minimum of 5 years??
(b) identify the key data sets which need to be preserved for longer – 20, 50, 100 years??.
(c) funding for this to be from the scientific community budgets – probably unpopular but otherwise the taxpayer will be dragged into funding all kinds of data retention claims (film, news, sport, photos, where would it stop??)
(d) improve peer review process to ensure (amongst possibly other things) that only those papers which have clear data policies in place, and whose narrative properly describes changes to base data sets and assumptions are approved for publication/acceptance.

cba
December 21, 2013 7:08 am

Several things come to mind. First and foremost is what happened to the rest of you folks who, like me, used 8″ floppy disks. The next thing that came to mind is that this article is about data collected for biology, and as Rutherford(?) is quoted (paraphrased): “Physics is science, all else is stamp collecting.” Then there is that infamous paper from a couple of years back that estimated 90% of scientific papers are proven wrong or turn out to be severely flawed within 5 to 7 years of publication. Finally, I’ve not yet found out whether that infamous paper was in the 90% or the 10% – or whatever the numbers really turn out to be.
It is clearly a problem that important information is getting lost in our so far rapidly advancing technological society, including scientific research. It has been a problem for quite some time that the repeatability part of the scientific method has taken a back seat for the vast majority of experimental results.
The creation of the patent office was one of the first attempts at preventing this sort of problem – at least with technology. I’m sure computer repositories can deal with the much greater volume of information that now exists – at least until an EMP takes out the technology to the point where we can no longer access the information necessary to reproduce the damaged technology.

PaulH
December 21, 2013 7:31 am

Mag tape isn’t dead yet!
“Magnetic tape to the rescue”
http://www.economist.com/news/technology-quarterly/21590758-information-storage-60-year-old-technology-offers-solution-modern
“WHEN physicists switch on the Large Hadron Collider (LHC), between three and six gigabytes of data spew out of it every second. That is, admittedly, an extreme example. But the flow of data from smaller sources than CERN, the European particle-research organisation outside Geneva that runs the LHC, is also growing inexorably. At the moment it is doubling every two years. These data need to be stored. The need for mass storage is reviving a technology which, only a few years ago, seemed destined for the scrapheap: magnetic tape.”
It’s interesting to see that mag tape still provides a viable method of mass storage. But ultimately, someone in academia/research has to mandate the actual procedure of moving “cold data” to tape storage (or other) instead of simply letting the data vanish.

Pamela Gray
December 21, 2013 7:36 am

Maybe it isn’t a bad thing for humans to have to re-discover themselves and their world. Maybe we are not genetically designed to constantly improve from generation to generation. Maybe we are a pendulum species, like many others, waxing and waning in an oscillation of re-discovery and hibernation.

Steamboat Jon
December 21, 2013 7:49 am

Records management long ago became its own IT specialty, and for my employer it is a growing part of the IT budget. There is constant review of holdings (physical and electronic), with increasing costs associated with storage space (including physical security and environmental control), maintaining legacy equipment and software (to access old storage media), and ongoing media conversion (microfilm or paper to electronic format, and moving data from old media to new). Add FOIA, congressional, or litigation-related requests, and it’s easy to see that archive management is a growth area within IT (especially within the federal government).

December 21, 2013 8:02 am

Nearly a hundred comments and no one has suggested why raw data is so scarce?
“Hide the refine!”

pat
December 21, 2013 8:26 am

Given the reversals of so many scientific conclusions, particularly in science involving health, the environment, and climate, I suggest the loss is intentional.

Pablo an ex Pat
December 21, 2013 8:28 am

Perhaps the missing raw data is not missing at all?
Perhaps it’s being stored in the deep ocean, from where it will emerge at some future date and time?
That’s my theory, anyway.

mbur
December 21, 2013 8:35 am

IMO, all of this could give rise to a new logical fallacy: the
“Dog Ate My Homework” fallacy.
An example from a teacher’s POV:
“I have lost the assignment, class, so now all of you fail, but I will still give you a passing grade depending on your social status.”
Or it could be like the appeal-to-tradition fallacy (http://www.logicalfallacies.info/relevance/appeals/appeal-to-tradition/),
with or without the actual data.
A paraphrased quote from the preface of “The Golden Bough” (circa 1922):
“…the Khazars in S. Russia, where kings were liable to be put to death either on the expiry of a set term or whenever some public calamity, such as drought, dearth, or defeat in war, seemed to indicate a failure of their natural powers.” (italics mine) Some cultures back then gave their kings everything and then took it all away.
Nowadays, we’re supposed to elect a new king.
How’s that for conjecture? http://en.wikipedia.org/wiki/Conjecture
Thanks and have a good day.