The Vast Majority of Raw Data From Old Scientific Studies May Now Be Missing

From the people who know how to save and care for things of importance comes this essay from The Smithsonian:

One of the foundations of the scientific method is the reproducibility of results. In a lab anywhere around the world, a researcher should be able to study the same subject as another scientist and reproduce the same data, or analyze the same data and notice the same patterns.

This is why the findings of a study published today in Current Biology are so concerning. When a group of researchers tried to email the authors of 516 biological studies published between 1991 and 2011 and ask for the raw data, they were dismayed to find that more than 90 percent of the oldest data (from papers written more than 20 years ago) were inaccessible. In total, even including papers published as recently as 2011, they were only able to track down the data for 23 percent.

“Everybody kind of knows that if you ask a researcher for data from old studies, they’ll hem and haw, because they don’t know where it is,” says Timothy Vines, a zoologist at the University of British Columbia, who led the effort. “But there really hadn’t ever been systematic estimates of how quickly the data held by authors actually disappears.”

To make their estimate, his group chose a type of data that’s been relatively consistent over time—anatomical measurements of plants and animals—and dug up between 25 and 40 papers for each odd year during the period that used this sort of data, to see if they could hunt down the raw numbers.

A surprising amount of their inquiries were halted at the very first step: for 25 percent of the studies, active email addresses couldn’t be found, with defunct addresses listed on the paper itself and web searches not turning up any current ones. For another 38 percent of studies, their queries led to no response. Another 7 percent of the data sets were lost or inaccessible.

“Some of the time, for instance, it was saved on three-and-a-half inch floppy disks, so no one could access it, because they no longer had the proper drives,” Vines says. Because the basic idea of keeping data is so that it can be used by others in future research, this sort of obsolescence essentially renders the data useless.

These might seem like mundane obstacles, but scientists are just like the rest of us—they change email addresses, they get new computers with different drives, they lose their file backups—so these trends reflect serious, systemic problems in science.

===============================================================

The paper:

The Availability of Research Data Declines Rapidly with Article Age


Highlights

• We examined the availability of data from 516 studies between 2 and 22 years old

• The odds of a data set being reported as extant fell by 17% per year

• Broken e-mails and obsolete storage devices were the main obstacles to data sharing

• Policies mandating data archiving at publication are clearly needed


Summary

Policies ensuring that research data are available on public archives are increasingly being implemented at the government [1], funding agency [2, 3 and 4], and journal [5 and 6] level. These policies are predicated on the idea that authors are poor stewards of their data, particularly over the long term [7], and indeed many studies have found that authors are often unable or unwilling to share their data [8, 9, 10 and 11]. However, there are no systematic estimates of how the availability of research data changes with time since publication. We therefore requested data sets from a relatively homogenous set of 516 articles published between 2 and 22 years ago, and found that availability of the data was strongly affected by article age. For papers where the authors gave the status of their data, the odds of a data set being extant fell by 17% per year. In addition, the odds that we could find a working e-mail address for the first, last, or corresponding author fell by 7% per year. Our results reinforce the notion that, in the long term, research data cannot be reliably preserved by individual researchers, and further demonstrate the urgent need for policies mandating data sharing via public archives.
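To get a feel for what a 17% yearly fall in odds means over two decades, here is a rough Python sketch. It assumes the decline compounds at a constant rate, which is how a per-year odds ratio from a logistic regression is usually read. The function names and the starting odds of 9 to 1 are illustrative assumptions, not figures from the paper.

```python
def odds_after(years, baseline_odds, yearly_odds_ratio=0.83):
    """Odds after `years`, assuming a constant per-year odds ratio.

    0.83 corresponds to the reported 17% yearly fall in the odds.
    """
    return baseline_odds * yearly_odds_ratio ** years


def odds_to_probability(odds):
    """Convert odds (p / (1 - p)) back to a probability."""
    return odds / (1.0 + odds)


# Hypothetical baseline (an assumption, not from the paper):
# odds of 9 to 1, i.e. a 90% chance that the data are extant at publication.
baseline = 9.0
for age in (2, 10, 20):
    p = odds_to_probability(odds_after(age, baseline))
    print(f"{age:>2} years: probability of extant data ~ {p:.2f}")
```

Note that a 17% fall in odds is not the same as a 17% fall in probability; the drop in probability depends on where the baseline sits.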


Results

We investigated how research data availability changes with article age. To avoid potential confounding effects of data type and different research community practices, we focused on recovering data from articles containing morphological data from plants or animals that made use of a discriminant function analysis (DFA). Our final data set consisted of 516 articles published between 1991 and 2011. We found at least one apparently working e-mail for 385 papers (74%), either in the article itself or by searching online. We received 101 data sets (19%) and were told that another 20 (4%) were still in use and could not be shared, such that a total of 121 data sets (23%) were confirmed as extant. Table 1 provides a breakdown of the data by year.

We used logistic regression to formally investigate the relationships between the age of the paper and (1) the probability that at least one e-mail appeared to work (i.e., did not generate an error message), (2) the conditional probability of a response given that at least one e-mail appeared to work, (3) the conditional probability of getting a response that indicated the status of the data (data lost, data exist but unwilling to share, or data shared) given that a response was received, and, finally, (4) the conditional probability that the data were extant (either “shared” or “exists but unwilling to share”) given that an informative response was received.
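The four stages above form a chain of conditional probabilities, so the overall chance of confirming that a data set is extant is their product. As a small sketch using only the counts reported above (516 articles, 385 with an apparently working e-mail, 121 data sets confirmed extant), the combined effect of stages 2 to 4 can be backed out by division. The variable names are illustrative, and the per-stage counts for stages 2 and 3 are not given in this excerpt, so they are not split out.

```python
TOTAL_ARTICLES = 516   # articles in the final data set
WORKING_EMAIL = 385    # at least one apparently working e-mail
DATA_EXTANT = 121      # confirmed extant: 101 shared + 20 still in use

p_email = WORKING_EMAIL / TOTAL_ARTICLES
p_extant_overall = DATA_EXTANT / TOTAL_ARTICLES

# Chain rule:
#   P(extant) = P(email) * P(response | email)
#               * P(status | response) * P(extant | status)
# so the product of the last three stages is P(extant) / P(email).
p_later_stages = p_extant_overall / p_email

print(f"P(working e-mail)             = {p_email:.3f}")
print(f"P(confirmed extant | e-mail)  = {p_later_stages:.3f}")
print(f"P(confirmed extant), overall  = {p_extant_overall:.3f}")
```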

There was a negative relationship between the age of the paper and the probability of finding at least one apparently working e-mail either in the paper or by searching online (odds ratio [OR] = 0.93 [0.90–0.96, 95% confidence interval (CI)], p < 0.00001). The odds ratio suggests that for every year since publication, the odds of finding at least one apparently working e-mail decreased by 7%.
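The step from an odds ratio of 0.93 to "decreased by 7%" is just (OR - 1) × 100, and the same arithmetic applies to the ends of the confidence interval. A minimal Python sketch, with a helper name chosen here purely for illustration:

```python
def percent_change_in_odds(odds_ratio):
    """Percent change in the odds per year implied by a per-year odds ratio."""
    return (odds_ratio - 1.0) * 100.0


# Values reported in the paragraph above.
estimate, ci_low, ci_high = 0.93, 0.90, 0.96

print(f"point estimate: {percent_change_in_odds(estimate):+.0f}% per year")
print(
    f"95% CI: {percent_change_in_odds(ci_low):+.0f}% "
    f"to {percent_change_in_odds(ci_high):+.0f}% per year"
)
```

Read this way, the interval says the yearly fall in the odds of finding a working e-mail is plausibly anywhere between about 4% and 10%.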

See more discussion and graphs here:

http://www.sciencedirect.com/science/article/pii/S0960982213014000


144 Comments
Tobias Smit
December 21, 2013 9:32 pm

As I “print” this comment I feel ???? rising up. As Larry said “somewhere some one must be archiving”. I agree.
Is that why the NSA is building this massive project in an environment that is both isolated from people and weather today and self-sustainable? And as far as being concerned about where all the data is going: every time I hear of another scientific project being a “success” and “providing scientists” with enough data to keep them “BUSY FOR YEARS TO COME”, I think, great guys … but … WHO is paying for that?

Mark
December 22, 2013 3:57 am

daveburton says:

noaaprogrammer, the data on your old Hollerith punch cards is probably recoverable by optically scanning them, but it won’t be very easy.
With a bit of effort, I could read paper tape, but punched cards are harder.

In the case of punched cards and paper tape the data density is very low and it is likely to take quite a bit of damage before things are irrecoverable.
Also cards can be “interpreted” where the data is also printed along the top. Which means you have the same data represented in two different ways.

Mark
December 22, 2013 4:11 am

anna v says:
The reason lies that the complexity in any decent experiment is large, the probability of errors entering in the gathering is also large, as humans are fallible. Chewing over and over the same data may only show up these errors, which would explain discrepancies between experiments, or not, because similar blind spots could exist in the new analysis.
It may depend on who is doing the analysis. Someone from a different group or background may spot “obvious” errors because they don’t have the same “blind spots” in their thinking and reasoning. Thus the attitude that only “climate scientists” are qualified to hold opinions of “climate science” is potentially a big problem.

mbur
December 22, 2013 6:19 am

please allow me to add a proper citation for the “archive”. The quote from my earlier comment.
*from “The Golden Bough: A Study in Magic and Religion” by Sir James George Frazer f.r.s., f.b.a. Hon.D.C.L., Oxford; Hon.LITT.D., Cambridge and Durham; Hon.LL.D., Glasgow; Doctor Honoris Causa of the Universities of Paris and Strasbourg
I volume, abridged edition
copyright 1922 by the macmillan company
copyright 1950 by barclays bank ltd.
and i consider my use of the quote as a reviewer who wishes to quote a brief passage in connection with a review for inclusion in a magazine or newspaper
my connection from the posting to the quote was in the light of data being lost and unelected officials that control the archives.(?)—-because this quote is from an old book that might get lost in the new digital age of data.
I wish to thank the author /publisher for their work in producing the book.
I also would like to thank WUWT for publishing/posting my comment/review.
I apologize for any mis-use of quotes and my cryptic style of commenting/reviewing

December 22, 2013 7:53 am

mbur says December 22, 2013 at 6:19 am

I also would like to thank WUWT for publishing/posting my comment/review.
I apologize for any mis-use of quotes and my cryptic style of commenting/reviewing

re: In bold above.
Yes, and, it does require more than the normal or average amount of effort to read and mentally parse. On first glance, if you pardon the blunt appraisal, it almost looks like, well, gibberish (yes, we have a few posters who post at that caliber).
Let me be kind: The visual style and presentation needs some work …
Apologies if English is not your native language, and also, my own personal view, I would rather you post in any form you can rather than not post; better to have your viewpoint, if substantive, than not.
.

mbur
December 22, 2013 8:33 am

@_Jim: Thank you for your reply and your kind words. Sometimes i have a flare up 😉 and blurt something out. Recently i have refrained from commenting as often as i would like to, due to the fact that i read it again and it doesn’t make as much sense as i first thought.
I apologize to any who think that it is “gibberish”. I have been exploring ways to improve and your reply is appreciated.
Maybe i am just having a “Watts Up With That” moment, or others are having it.

December 22, 2013 8:33 am

mbur says December 22, 2013 at 6:19 am
please allow me to add a proper citation for the “archive”. The quote from my earlier comment.
*from “The Golden Bough A Study in Magic and Religion” by Sir James George Frazer

Incidentally, the above cited volume seems to be viewable here:
1922 abridged edition – http://ebooks.adelaide.edu.au/f/frazer/james/golden/
1894 ed Vol. I (of 2) – https://archive.org/stream/goldenboughstudy01fraz#page/n9/mode/2up
1900 ed Vol. I (of 3) – https://archive.org/stream/goldenboughstudy01frazuoft#page/n11/mode/2up
Google search for all volumes on Archive.org
.
.
The referenced quote from the Preface reads:

… of the Khazars in Southern Russia, where the kings were liable to be put to death either on the expiry of a set term or whenever some public calamity, such as drought, dearth, or defeat in war, seemed to indicate a failure of their natural powers.

.

December 22, 2013 8:47 am

Ernst-Georg Beck reconstructed the data from tens of thousands of CO2 measurements, analysed the experimental techniques and rated the quality of the data, based on descriptions of the methods of measurement and ambient conditions.
Methinks there’ll be a “science gap” from about the 1970’s until real soon now, I hope. Future researchers may come to the conclusion that there was no scientific activity at all for a generation, because there are almost no surviving data or rigorous experimental or analytical documentation.
Ask some of the “climate scientists” to replicate their own “experiments” with the catastrophic models that they used just five years ago. It has nought to do with bit-rot.

December 22, 2013 8:47 am

Larry Ledwick says December 21, 2013 at 12:23 pm

Do you think Lockheed could produce the original air tunnel test data for the SR-71?

There is also the aspect, Larry, that the design and design verification (testing) would use contemporary methods involving (dare I say it?) modelling on computer equipment unavailable ‘in the day’ … also, the design would make use of CAD software/hardware unavailable in the day as well. The coupling of CAD and numerically controlled metal forming/cutting/turning equipment and the availability of composite materials might (in all likelihood) result in a shorter design cycle than the original.
.

mbur
December 22, 2013 8:50 am

Thank you for a complete linked citation for my selected quote. I do know that almost everything is ‘archived’. My cryptic comments alluded to this: those that control the archives are really the ones in control of the whole thing.

December 22, 2013 9:24 am

Thanks guys for mentioning M-Disc.
I hadn’t been aware of its capabilities but as it turns out, the optical drive that I bought about 2 years ago for my current desktop system will etch M-Disc. Looks like I’ll be ordering some M-Disc media after Christmas.
BTW: Most of my old VHS tapes are still OK after 3 years with no special storage facilities; just a bit of common sense. The DV tapes I have from 1998 are still readable, with a few, correctable errors. Old hard drives that haven’t been powered up for 5+ years, last written before the turn of the century, also come up good. fsck (readonly) has no complaints. The major problem that I have with old cassette tapes (other than entropic losses of quality), is in the glue binding the leaders to the spools; mostly on pre-1980’s tapes.
And my university notes from the 1970’s are still as incomprehensible as they were the day after they were written.
Have a Merry Christmas.

December 22, 2013 1:05 pm

Gail Combs says:
December 21, 2013 at 3:37 pm
This is the reason I am in favor of the old paper lab notebooks and 35 millimeter film for photography. (And microfiche) This loss of data is not only happening in science but throughout entire lives.
Think about it. No paper letters between friends and families, no diaries or permanent photos from the present generation. With e-books much literature may only be published in electronic form in the near future. Future historians will consider this a “Lost Era”. Heck, they don’t even want to teach kids how to write cursive or how to take hand written notes in class!
There will be no real permanent records for this generation. Orwell would love it. /sarc

=============================================================================
We’re required by the EPA to keep our lab records for 10 years. At present, our “official” records are paper but there is an option to go digital.
In my locker I have records from almost 25 years ago on 5 1/4 floppies. I also have the proprietary DOS program that made them, also on the old floppies.
I’ve tried to access a backup of the info using other programs but to no avail.
At present we enter our paper data into a WindowsXP based proprietary program to generate our reports.
If we told the EPA we were “officially” going digital, could we be fined for not being able to access the digital data because Microsoft is forcing people to go to Windows8 or if the company that supplied our program goes under?
I’m just a peedon where I work but there are also legal issues to be considered with only digital storage.

Larry Ledwick
December 22, 2013 1:45 pm

If you really want to keep those records on 5.25 floppies and be able to recover them if needed in the future, you need to open those files on a computer that has a 5.25 drive and the appropriate software, and then save the documents out to a hard drive so they can be written out to cdrom or dvd, at least in a universally accepted document format. For images, save them as jpg, tiff or png; those formats are recognized by just about all browsers and document image programs. For text data, use RTF (rich text), txt (ascii text), html or odt (the open document format used in open office).
The early 5.25 floppies lose data over time as the small areas which are magnetized to store the info slowly spread (sort of like a spreading stain in a rug). They eventually begin to blend into adjacent data and become unreadable. That is assuming that the magnetic oxide still is adhered to the disk itself. They also have a tendency to flake off oxide, just like old magnetic tapes do.
The only sure way to keep that data is to periodically “refresh it” by writing it back out to newer media.
Like I mentioned above I have a lot of late 1980’s early 1990’s vintage 3.5 inch floppies which are still readable. But some of the files do sometimes take a couple of attempts to read them. They have been stored in low relative humidity at room temperature for 20+ years.
I just pulled 4 random floppies from that box and opened random files on the disks with no problems or read errors. But they could be unreadable next week; there are no guarantees on a floppy that old.
If they can only be opened in that proprietary program then your only option might be to use print screen, or Ctrl+A and Ctrl+C to highlight and copy the data, and paste it into a more modern document format. You might lose formatting but with both a screen grab of the display and the raw text data, you could reconstruct the page display in a modern word processor document with proper formatting. Big hassle I know, I spent about 4 hours yesterday evening converting some of my old WP7 files to RTF and ODT documents.
It is a long tedious process but if the data is important to you, it is a cost of doing business to protect the data. If you are dependent on an xp compatible system to run that software you might need to keep an old xp box someplace disconnected from all networks where you can read the data and then export it to flash drives, a USB hard drive or cd/dvd storage to be transferred to a more modern computer system.
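The "copy it off and refresh it periodically" routine described above can be partly automated once the old media are readable at all. What follows is only a minimal Python sketch under stated assumptions: the function names (`sha256_of`, `migrate`, `verify`) are made up for this illustration, and it only copies files and records checksums. It does not convert proprietary formats, which still has to be done by hand as described above.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate(source_dir, dest_dir):
    """Copy every file under source_dir into dest_dir.

    Returns a manifest {relative path: sha256} and raises if a copy
    does not match its original.
    """
    source_dir, dest_dir = Path(source_dir), Path(dest_dir)
    manifest = {}
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        dst = dest_dir / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also keeps timestamps where possible
        checksum = sha256_of(dst)
        if checksum != sha256_of(src):
            raise IOError(f"copy of {rel} does not match the original")
        manifest[rel.as_posix()] = checksum
    return manifest


def verify(dest_dir, manifest):
    """Return relative paths that are missing or no longer match the manifest."""
    dest_dir = Path(dest_dir)
    return [
        rel
        for rel, checksum in manifest.items()
        if not (dest_dir / rel).is_file()
        or sha256_of(dest_dir / rel) != checksum
    ]
```

A typical use would be to call `migrate()` once with the mounted floppy as the source and a directory on newer media as the destination (both paths are up to you), keep the returned manifest with the archive, and run `verify()` every so often to catch files that have gone bad before the next refresh.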

December 22, 2013 2:02 pm

Larry Ledwick says:
December 22, 2013 at 1:45 pm

=======================================================================
Thanks. Even though we couldn’t be legally required to retrieve data that old, it’s worth a shot.
The paper records should also be cared for. About five years ago someone found some of our old paper records. They went back to before we had any automation. I found the first report with my initials on it. A year or two ago someone decided to toss out the old records. *Sigh*
Aside from the personal disappointment, there were clues in those records as to how to run the place if our SCADA went down.
(“SCADA” is Supervisory Control and Data Acquisition or, in other words, automation.)

December 22, 2013 2:07 pm

The solution is a bit mean but not particularly difficult. If the data is not available to replicate, this is “trust us” science. “Trust us” science needs a rating scale, running from cases where the data from the study itself is not available, to cases where the immediate citations, or citations 2nd, 3rd, or however many degrees up the line, no longer have data. Grant making bodies should use the scores to determine future funding.
Once journals are rated on how unreliable they are as far as insisting on data archiving/maintaining the ability to replicate and real money is on the line in the form of grant eligibility, the problem will correct itself and basic experiments without available data will get redone, perhaps with interesting variations in results.

tancred
December 23, 2013 5:09 pm

If there is no longer data, it’s no longer science.

December 23, 2013 5:41 pm

I have just read that there was no pause because not enough data was collected in the Arctic. Surely new data for the pause years cannot be slotted in? How can this work?

LarryD
December 24, 2013 4:44 pm

Access to the raw data and re-analysis is valuable to confirm the original analysis isn’t screwed up (e.g., Mann’s hockey stick). Ideally, this sort of thing should be done soon after publication (if not before, but I doubt academic peer review will ever be that good). After the analysis is verified, then you try to reproduce the experiment.
