The Vast Majority of Raw Data From Old Scientific Studies May Now Be Missing

From the people who know how to save and care for things of importance comes this essay from The Smithsonian:

One of the foundations of the scientific method is the reproducibility of results. In a lab anywhere around the world, a researcher should be able to study the same subject as another scientist and reproduce the same data, or analyze the same data and notice the same patterns.

This is why the findings of a study published today in Current Biology are so concerning. When a group of researchers tried to email the authors of 516 biological studies published between 1991 and 2011 and ask for the raw data, they were dismayed to find that more than 90 percent of the oldest data (from papers written more than 20 years ago) were inaccessible. In total, even including papers published as recently as 2011, they were only able to track down the data for 23 percent.

“Everybody kind of knows that if you ask a researcher for data from old studies, they’ll hem and haw, because they don’t know where it is,” says Timothy Vines, a zoologist at the University of British Columbia, who led the effort. “But there really hadn’t ever been systematic estimates of how quickly the data held by authors actually disappears.”

To make their estimate, his group chose a type of data that’s been relatively consistent over time—anatomical measurements of plants and animals—and dug up between 25 and 40 papers for each odd year during the period that used this sort of data, to see if they could hunt down the raw numbers.

A surprising number of their inquiries were halted at the very first step: for 25 percent of the studies, active email addresses couldn't be found, with defunct addresses listed on the paper itself and web searches not turning up any current ones. For another 38 percent of studies, their queries led to no response. Another 7 percent of the data sets were lost or inaccessible.

“Some of the time, for instance, it was saved on three-and-a-half inch floppy disks, so no one could access it, because they no longer had the proper drives,” Vines says. Because the basic idea of keeping data is so that it can be used by others in future research, this sort of obsolescence essentially renders the data useless.

These might seem like mundane obstacles, but scientists are just like the rest of us—they change email addresses, they get new computers with different drives, they lose their file backups—so these trends reflect serious, systemic problems in science.

===============================================================

The paper:

The Availability of Research Data Declines Rapidly with Article Age


Highlights

• We examined the availability of data from 516 studies between 2 and 22 years old

• The odds of a data set being reported as extant fell by 17% per year

• Broken e-mails and obsolete storage devices were the main obstacles to data sharing

• Policies mandating data archiving at publication are clearly needed


Summary

Policies ensuring that research data are available on public archives are increasingly being implemented at the government [1], funding agency [2, 3 and 4], and journal [5 and 6] level. These policies are predicated on the idea that authors are poor stewards of their data, particularly over the long term [7], and indeed many studies have found that authors are often unable or unwilling to share their data [8, 9, 10 and 11]. However, there are no systematic estimates of how the availability of research data changes with time since publication. We therefore requested data sets from a relatively homogenous set of 516 articles published between 2 and 22 years ago, and found that availability of the data was strongly affected by article age. For papers where the authors gave the status of their data, the odds of a data set being extant fell by 17% per year. In addition, the odds that we could find a working e-mail address for the first, last, or corresponding author fell by 7% per year. Our results reinforce the notion that, in the long term, research data cannot be reliably preserved by individual researchers, and further demonstrate the urgent need for policies mandating data sharing via public archives.
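To put the headline number in perspective (our back-of-envelope arithmetic, not a figure from the paper): an odds ratio of 0.83 per year compounds multiplicatively, so after $t$ years the odds of a data set still being extant are

$$\mathrm{odds}(t) = \mathrm{odds}(0) \times 0.83^{\,t}, \qquad 0.83^{20} \approx 0.02,$$

i.e., for a 20-year-old paper the odds are roughly 2% of what they were at publication, broadly consistent in spirit with the Smithsonian piece's finding that more than 90 percent of the oldest data were inaccessible.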


Results

We investigated how research data availability changes with article age. To avoid potential confounding effects of data type and different research community practices, we focused on recovering data from articles containing morphological data from plants or animals that made use of a discriminant function analysis (DFA). Our final data set consisted of 516 articles published between 1991 and 2011. We found at least one apparently working e-mail for 385 papers (74%), either in the article itself or by searching online. We received 101 data sets (19%) and were told that another 20 (4%) were still in use and could not be shared, such that a total of 121 data sets (23%) were confirmed as extant. Table 1 provides a breakdown of the data by year.

We used logistic regression to formally investigate the relationships between the age of the paper and (1) the probability that at least one e-mail appeared to work (i.e., did not generate an error message), (2) the conditional probability of a response given that at least one e-mail appeared to work, (3) the conditional probability of getting a response that indicated the status of the data (data lost, data exist but unwilling to share, or data shared) given that a response was received, and, finally, (4) the conditional probability that the data were extant (either “shared” or “exists but unwilling to share”) given that an informative response was received.

There was a negative relationship between the age of the paper and the probability of finding at least one apparently working e-mail either in the paper or by searching online (odds ratio [OR] = 0.93 [0.90–0.96, 95% confidence interval (CI)], p < 0.00001). The odds ratio suggests that for every year since publication, the odds of finding at least one apparently working e-mail decreased by 7%.
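For readers who want to see how a per-year odds ratio like 0.93 falls out of a logistic regression, here is a minimal, self-contained sketch in Python. The toy data, variable names, and the use of statsmodels are all our own assumptions for illustration; this is not the paper's code or data.

```python
# Sketch: recover a per-year odds ratio from a logistic regression fit.
# Toy data standing in for the study's (article age, e-mail found) pairs.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
age = rng.integers(2, 23, size=500)        # article age in years (2-22)
assumed_or = 0.93                          # assumed true odds ratio per year
logit = 1.5 + np.log(assumed_or) * age     # linear predictor on log-odds scale
email_found = rng.binomial(1, 1 / (1 + np.exp(-logit)))  # 1 = working e-mail

X = sm.add_constant(age.astype(float))
fit = sm.Logit(email_found, X).fit(disp=0)
print(np.exp(fit.params[1]))               # ~0.93: fitted odds ratio per year
```

Exponentiating the fitted slope turns the additive change in log-odds per year into the multiplicative odds ratio the paper reports.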

See more discussion and graphs here:

http://www.sciencedirect.com/science/article/pii/S0960982213014000

Pamela Gray
December 21, 2013 8:40 am

Case law usually drives record management. But case law is ahead of our ability to store mountains of data in a small space. Non-profits and governmental agencies are notoriously disconnected and technologically mismatched to the extent that record keeping is localized and housed in a huge variety of ways. Some of it is paper, some of it is on disks of various configurations, and some of it has been shoved to the back of some storage room, uncatalogued and slowly melting into oblivion. No one’s fault. It is simply the current state of the collective mess.
So what does that mean? It means that we will have to rediscover through new research what was discovered and forgotten decades ago. If there is one lesson to be learned when we open the door on the raw data archive mess, it is this: do not let politicians pass any legislation based on this current state of affairs. If we do, the onus is on us, not them, and Ike will have been proven right.

Doug Huffman
December 21, 2013 9:03 am

Another argument for Open Access is the wider distribution of data across locations, formats and media. Unfortunately this horse is long gone from the barn. Is DRM/paywall akin to a flaming barn?

mbur
December 21, 2013 9:07 am

I apologize for the spelling/syntax errors in my comments.
Now I must pay a sin-tax penance, and can't comment for an undefined period of time.
dang it 😉

TImothy Sorenson
December 21, 2013 9:08 am

Let X be an elliptic curve, of the form: $y^2 = x^3 +17$. Consider the set of integer solutions…
150 years in the future a mathematician will know what ‘data’ I am working with!
“Let X be a CMIP model initialized with UAH and GISS data from 1970-1990…” — we can’t replicate that next year!
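As it happens, the elliptic-curve half of that comment really is reproducible from the one line of text alone, which is exactly the commenter's point. A throwaway brute-force search (our own sketch, with an arbitrarily chosen bound) recovers the integer points:

```python
# Brute-force search for integer points on y^2 = x^3 + 17.
import math

points = []
for x in range(-2, 10001):                 # x^3 + 17 < 0 for x <= -3
    rhs = x**3 + 17
    y = math.isqrt(rhs)
    if y * y == rhs:
        points.append((x, y))              # (x, -y) is also a solution when y > 0
print(points)
# [(-2, 3), (-1, 4), (2, 5), (4, 9), (8, 23), (43, 282), (52, 375), (5234, 378661)]
```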

December 21, 2013 9:13 am

Why does this make me think of the fire at the Library of Alexandria? In some cases the names of the Greek and other inventors/philosophers are only remembered today because some of their scrolls survived.
Any system devised to store data will eventually find itself on obsolete media or in obsolete formats, so it should be redundant.
For starters, the NSA has a huge storage facility that could be put to a better use.

Rod Everson
December 21, 2013 9:43 am

First, I suspect that 99+% of “data” collected for all studies, both published and unpublished, is of relatively little value and will never, ever, be accessed again, even by the original authors. Spending public money to archive all of it will therefore be a waste of our tax dollars.
Instead, now that anyone can “publish” even if only via a personal website, authors should have the sense to strip all data of personal identities and publish the supporting data along with their results. And why can’t a journal insist that upon publication of a journal article, all supporting data be uploaded to their server and made available instantly to anyone interested? The cost of doing so would be trivial, for the data need not be archived beyond a few years. If anyone had an interest in the raw data, they could download it to their own server/computer and keep it as long as they wanted. The 99% that is simply useless junk will never be accessed and will disappear into the ether in time. The 1% that interests people will be preserved in multiple places. If it’s not, so what? That’s what’s apparently happening now anyway, and will likely continue to happen.
Journals should make it an absolute requirement that raw data be published online at the same time any article is published. Authors of unpublished articles who publish them on their own websites instead should do the same if they expect anyone to pay attention to their studies. People who actually discover something should be more than willing to share their data if they are scientists. After all, they’re supposedly trying to make a point, a point indicated by the data itself.
I do realize that authors will have competitive reasons for not sharing all of the data they’ve generated. However, any data relied upon in a published paper should be made available. And, as we’ve seen, those who withhold their data, at least in the climate “science” area, are not always behaving like real scientists seeking to advance the state of knowledge.
Incidentally, the only records I can find of data I’ve collected for personal reasons that are over 20 years old are all recorded on paper, not a hard drive or a floppy. And I expect most of it to be tossed when I’m no longer around.

Jimbo
December 21, 2013 9:54 am

This is nothing. What about those climate scientists who absolutely refuse to hand over the data NOW!

One of the foundations of the scientific method is the reproducibility of results.

I agree. But what is this?

Dr. Phil Jones (CRU)
I should warn you that some data we have we are not supposed to pass on to others. We can pass on the gridded data – which we do. Even if WMO agrees, I will still not pass on the data. We have 25 or so years invested in the work. Why should I make the data available to you, when your aim is to try and find something wrong with it. There is IPR to consider.
http://climateaudit.org/2005/10/15/we-have-25-years-invested-in-this-work/

This is the world of Climastrology in action. This is not yet a science, I see.

Larry Ledwick
December 21, 2013 10:00 am

It has been alluded to in prior posts, but one of the biggest problems is changing hardware and standards. I do photography and generate about 0.5 terabyte of images a year. Having worked in the IT field for many years, I saw this problem a long time ago and have considered how to preserve and protect my images so that, like the glass negatives of photographers of old, they might outlive me. The biggest hurdle: how do I know what means of recovery will exist 10, 30, 50, or 100 years from now?
Let's assume a small-budget generator of data gathers some information. It matters little whether it is an obscure author's plays (Shakespeare) or a budding amateur photographer's negatives (Ansel Adams); each generates data in some original form, which gets stored away in the storage media of the day. Both of the above examples use two of the best currently available archival means. Fade-proof ink on archival paper is good for 100-200 years, assuming it is kept dry and free of bugs and mold. Likewise, silver halide film negatives on glass or polyester stock last over 100 years (not so much if on celluloid base stock).
In modern digital data you have progressive changes in the storage media itself (IBM punch cards, punched paper tape, 8″ and 10″ reel tapes, the 3480 family of tape cartridges, 4 mm DAT tape, DLT tapes, LTO tapes, and probably a dozen other media I have never seen or heard of).
Not only do the data formats on the media change, but so does the availability of the software needed to read the data (not to mention forgotten passwords for protected files). In the late 1980s and early 1990s, the de facto standard word-processing system used in government was Corel WordPerfect on the early PCs. I have a large number of old documents I wrote as a State Planner on early PCs using WordPerfect. OpenOffice used to let you open those documents, but recent editions no longer do; the internal code to read those files has been dropped. I recently had to dig through a stack of old CDs to find an old copy of WordPerfect 8 so I could open one of those files. How many folks out there do you think can open any of those documents, written just 30 years ago?
Then you have the physical hardware infrastructure changes. Just suppose our intrepid data generator diligently put all his or her data on a hard drive in some relatively universal document format like RTF, HTML, XLS, or TXT, which almost all systems can read today, and locked that disk drive away in a safety deposit box. Thirty, forty, or fifty years later, someone pulls out that disk drive and tries to read the data off it. Assuming the lubricant in the drive's bearings has not turned to varnish and the platter still spins, does he have a system and the needed adapter cables to allow the drive to even be plugged in and powered up? Is it an IDE, PATA, SATA, USB 2, or USB 3 interface? Can he find the drivers to read that format on modern equipment? Will his new "Garagedoor 28" computer, running a 128-bit operating system, still be backward compatible with now-ubiquitous document and image formats? Will it still use simple 0/1 binary, or will we have moved on to a four-state code or some quantum system that no longer uses binary data representation at all? Will 2010-vintage PDF files still be readable? Will anyone know what a JPG or PNG image file is?
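On the narrow question of whether anyone will know what a JPG or PNG is: common formats at least announce themselves in their first few bytes, so a future reader with any programmable machine could triage unknown files. A toy sniffer to illustrate the idea (our own sketch, not a preservation tool):

```python
# Toy file-format sniffer: identify a file by its leading "magic" bytes.
MAGIC = {
    b"\xff\xd8\xff": "JPEG image",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"%PDF": "PDF document",
    b"{\\rtf": "RTF document",
}

def sniff(path):
    with open(path, "rb") as f:
        head = f.read(16)                  # longest magic here is 8 bytes
    for magic, name in MAGIC.items():
        if head.startswith(magic):
            return name
    return "unknown format"

print(sniff("mystery_file"))               # placeholder file name
```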
The biggest hurdles right now are at the hardware/software level: simple little things like not having an obsolete adapter cable interface. Remember the old keyboards and mice before almost everything moved to USB? You had the old PS/2 connectors, and prior to that the older DE-9 RS-232 serial mouse connectors. Right now most everything is moving to some variant of the USB interface, but even there you have three or four different connector types and sizes. Who will have a junk box of old USB cables 50 years from now, even if you have a working disk drive?
One data preservation approach is to store not just the physical media but also the primary interface hardware (cables, sockets, adapters, drivers, etc.) along with the primary storage device. Otherwise your only option is to migrate the data every 5-10 years to a more "modern" storage medium, and hope that you don't pick a "newer" media system that is suddenly obsoleted by a lawsuit, by the bankruptcy of the parent company, or by some poorly informed administrator who has all your old "archival junk" tossed out so they can use the space for a new break room.
This is a very big problem! Right now the most reliable form of preservation of rare old reports is on the personal hard drives of thousands of topic-specific web surfers. We have seen it several times right here on WUWT: someone posts that they could not find a certain old document, and some other user posts an obscure link to a site that has a local copy of the original, or a personal copy they captured years ago on their own system.
Anthony needs to consider this issue seriously for his study materials on station quality and see that some archival version of his study and raw data is placed in as many reliable repositories as possible! I suspect in 100 years all the data that supported the CAGW hysteria will be long gone and only the diligent efforts of a few skeptics might survive for historians to review.
In fact this blog is a very important historical archive of CAGW and how it matured and decayed as the hype gave way to reality. I sincerely hope all the early blog data is well kept and preserved somewhere.
I would happily send in some donation money to help Anthony archive his blog data!

December 21, 2013 10:40 am

Not a new problem. When I was in high school, the electronics shop teacher made some money on the side fixing up magnetic wire recorders and using them to transcribe old recordings in the Smithsonian collection to (then) modern magnetic tape. There was a lot of rural American folk music recorded during the Depression that was at risk of being lost forever. The Germans developed magnetic tape for audio recording during WWII; prior to that, there were several competing systems that used magnetic wire.
I also attended a very interesting lecture by a 3M chemist on the archival properties of magnetic tape and basically it boils down to “store in a cool, dry place, and hope for the best”. After 10 years, there is no assurance the information can be read.
Digitizing only puts off data obsolescence for a while, as you have to continually refresh/migrate it onto new media. At some point the effort involved is no longer worth it for most content.
If you want to keep information for a very long time with very high assurance it can be read again, the best bet is archival microfilm. If processed and stored correctly it should be recoverable for 500 years, based on extensive testing by Kodak. The only technology required to read microfilm is illumination and magnification; modern microfilm would have been readable with the technology of the 18th century. Unfortunately, the digital camera revolution has gutted the market for conventional film and I think Kodak no longer makes it, but Fuji does.

December 21, 2013 10:50 am

Sir Isaac Newton’s notes were kept. His notes on alchemy were “lost,” I believe, because later biographers didn’t want to reveal this unbecoming interest of the great man. However, in the 1930s they were found among unsorted papers at the Royal Society. They created a stir, and history panned Newton mercilessly, referring to him as a magician rather than a scientist. The linearity history eggheads at Oxford and Cambridge (Newton’s university), who are still far behind the great man in knowledge and understanding, do this kind of thing. Tear down the great. The guy was 17th century, for goodness sake, before chemistry had legs! Despite this, he successfully experimented with the production of hydrochloric acid from salt, and he crystallized antimony oxide needles (how many of you even know there is such an element as antimony, symbol Sb?). By the way, Newton was denied the chair in mathematics at King’s College, Oxford, because he had heated disagreements with King James, he of the King James’s version of the Holy Bible.
An outside the box thought: Maybe we will convert lead to gold one day. It will be expensive rejiggering the nucleus but it would be worth it to put a shine back on Newton’s image and thumb a nose at the linearity eggheads. We already have converted some elements to others after all.
http://rsnr.royalsocietypublishing.org/content/60/1/25.full
More on linearity. The same Oxford eggheads disenfranchised Herodotus as the father of history, buying into the probably jealous and much more boring (like the Oxford historians) Thucydides (whom I’ve also read), who called Herodotus a storyteller – ironically not realizing that he himself would likely not have been a historian if the course hadn’t been charted out by his benefactor. I’ve read Herodotus’s Histories and it was a superb read. I forgave him his wrong thinking about the ebb and flow of the Nile, which some believed (correctly) was caused by seasonal melting of snow.
I also have two of the three volumes of Isaac Newton’s physics and mathematics lecture notes, collected and published in the late 18th century (still looking for vol. 3). Believe you me, we are in for a veritable tsunami of lost data when the climate science house of cards finally collapses. No loss, really. But irresponsible scoundrels like these will unfortunately also do their best to destroy the raw data that has been foolishly left in their care to play with as they like. I have little doubt we have already lost long-running records that weren’t behaving according to the script, after their data had been put through the mincer. Boy, there is a mess to clean up and a major starting-over awaiting us.

DirkH
December 21, 2013 11:28 am

Pippen Kool says:
December 21, 2013 at 6:26 am
“Any work that is important is usually repeated in a slightly different way. For example, most people believe Mann’s old ’98 paper not because they have looked at his data but because the study has been repeated (not exactly, but close enough) many times by now; Marcott’s paper (’13) might be the last one.”
Well, the spike near the present in Marcott & Shakun’s data is not reliable, as Shakun has told Revkin (http://dotearth.blogs.nytimes.com/2013/03/07/scientists-find-an-abrupt-warm-jog-after-a-very-long-cooling/#more-48664); and when we ignore the spike, we see gradual cooling over the past 8,000 years.

December 21, 2013 11:55 am

This would be a great public service that a company like Google could provide for the world. They could have a data archive for scientific research. It could be part of Google Scholar.
REPLY: I was thinking the same thing, but with a backup on the Amazon cloud service. – Anthony

December 21, 2013 11:56 am

What gets me is how these papers get written in the first place if they cite previous research data. SOMEONE must be archiving their data or there would be nothing to research. New research would require new data every single time, or maybe that is part of the game. Collecting data requires money.

dearieme
December 21, 2013 12:15 pm

“Newton … had heated disagreements with King James, he of the King James’s version of the Holy Bible.” King James died in 1625; Newton was born in 1643.

Larry Ledwick
December 21, 2013 12:23 pm

crosspatch says:
December 21, 2013 at 11:56 am
What gets me is how these papers get written in the first place if they cite previous research data. SOMEONE must be archiving their data or there would be nothing to research.

That is why it is so much cheaper to just make the data up. /sarc
Actually, I would bet that much of that old data started out as photocopies of old documents made in the library, or as lists hand-transcribed (by grad students) from old printed documents. The originals are long gone now. Then that hand-transcribed or photocopied second-generation data was re-transcribed into digital format (again by a graduate student) into third-generation data, with more new typos introduced at each layer of use.
Then that data was manipulated, adjusted, tweaked, modified, and formatted, each time losing content or introducing unintentional errors (we will ignore any intentionally introduced errors).
As a result, the “original data” that the author of the study used was in reality second-, third-, or fourth-generation data from an original source that no longer exists.
Just as historians have identified about seven different versions of the Gettysburg Address, all that data has been subject to decay of content at each step in the replication process, if not intentional destruction. Even computers drop data during copy operations; that is why we usually include data verification steps like hash values and checksums, to confirm the data has not suffered errors in copying when we duplicate files.
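The verification step mentioned above is cheap to do by hand whenever data is migrated. A minimal sketch (the file names are placeholders) comparing SHA-256 digests before and after a copy:

```python
# Minimal sketch: confirm a copied file is bit-identical via SHA-256.
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            h.update(block)
    return h.hexdigest()

# "original.dat" and "copy.dat" are placeholder names.
assert sha256_of("original.dat") == sha256_of("copy.dat"), "copy corrupted"
```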
Even high-profile programs suffer this sort of data decay. Many people repeat the perhaps-apocryphal story that the U.S. could not build a new copy of the Saturn V booster because the original engineering drawings and specifications are no longer available, nor are key single-source components used in the design. The original designers and builders have all retired or died, taking with them their first-person knowledge of why and how certain things were done. Similar examples exist, such as the order to destroy all tooling and fixtures used in the production of the SR-71. If there were a need to re-manufacture one, it would be cheaper to start from scratch than to reassemble all the blueprints and hardware necessary to build a new clone. It is obvious that the same or worse happens to much less visible programs on a daily basis. Do you think Lockheed could produce the original wind tunnel test data for the SR-71?
We just burn our Library at Alexandria a bit more slowly than the Egyptians did.

jaymam
December 21, 2013 12:26 pm

The newspapers of New Zealand have now been digitised for around 100 years up to 1945, and are searchable. This is the result of my first search, for the words Auckland temperature:
http://i44.tinypic.com/2qk1jk2.jpg
Note the large amount of information, amazing for 1869 or even today.
Since NIWA seem to have lost NZ’s early temperature data, I plan to recreate the records from printed newspapers. I’ll save a JPG with the relevant date for each reading.

December 21, 2013 12:49 pm

dearieme says:
December 21, 2013 at 12:15 pm
“Newton … had heated disagreements with King James, he of the King James’s version of the Holy Bible.” King James died in 1625; Newton was born in 1643.”
Oops, you are correct. It was James the Second (James the VII for you Scots out there). Jimmy One was famous for his treatise on tobacco smoke being bad for us.
http://www.royal.gov.uk/HistoryoftheMonarchy/KingsandQueensoftheUnitedKingdom/TheStuarts/JamesII.aspx
http://www.jesus-is-lord.com/kjcounte.htm

J G
December 21, 2013 1:12 pm

Floppy drives that connect to recent computers via a USB connection are readily available for about $25.

Third Party
December 21, 2013 2:20 pm

Data? We don’t need no stinkin’ Data.

Gail Combs
December 21, 2013 3:37 pm

This is the reason I am in favor of the old paper lab notebooks and 35 millimeter film for photography (and microfiche). This loss of data is not only happening in science but throughout our entire lives.
Think about it. No paper letters between friends and families, no diaries or permanent photos from the present generation. With e-books, much literature may soon be published only in electronic form. Future historians will consider this a “Lost Era.” Heck, they don’t even want to teach kids how to write cursive or how to take handwritten notes in class!
There will be no real permanent records for this generation. Orwell would love it. /sarc

December 21, 2013 5:39 pm

So I’m thinking back to my first publication in the cancer literature, which was between 1991 and 2011, the period of study. The raw data is written in a notebook archived somewhere, I have no idea where, at Boston University School of Medicine. If I had been e-mailed for the study, I would have replied that it is archived somewhere in a notebook, but that retrieving it would take some doing. Would I have been counted as having “missing raw data?”
There really is no excuse for relying on magnetic media of uncertain lifespan as the record of scientific work. For my later lab work, I wrote all my notes on my laptop, but then printed out the pages, signed and dated them, and put them in a notebook. It is clumsy in the digital age to resort to paper, but it is the only thing that will be verifiable 50 years from now. Eventually it will all be scanned, interpreted, and searchable.

December 21, 2013 7:31 pm

There really is no excuse for relying on magnetic media of uncertain lifespan as the record of scientific work.

Given the price of storage, replication across physically separated devices would solve most of the problems. Most raw data should be publicly available in any case, seeing as how we paid for them.

December 21, 2013 8:07 pm

Given that even with the data only about 1/4 of the studies are reproducible, I don’t see the great [loss]. So much of what we think we know ain’t so.

December 21, 2013 8:07 pm

loos = loss

Jean Parisot
December 21, 2013 8:10 pm

Shouldn’t the journals be responsible for the archiving of data that supports what they have published?
What’s the archival policy of this discussion forum? The material herein and in a few associated blogs will provide an interesting historical footnote in the cold, hungry world of 2040.