
Guest Post by David M. Hoffer
Since ClimateGate 1 and 2 there has been considerable confusion about how the emails were obtained, how the FOIA requests were managed, and what was or wasn’t possible in that context. There are no simple answers to those questions.
The ClimateGate emails span a period of nearly two decades. During that time period, email systems evolved substantially in terms of technology, implementation, operational procedures, and the job descriptions of those responsible for them. Other technologies such as backup systems, archive, and supporting technologies for legal compliance also changed (as did the laws themselves). Even a simple action such as deleting an email has completely different answers over time, depending on the technology that was implemented at any given point. With so many moving targets, it is impossible to draw any conclusions with 100 percent certainty.
This article covers the basics of how email systems and their supporting infrastructure work, and how they have evolved over time. With that as a common background, we can then discuss everything from the simple questions of who could delete what (and when), to how the emails might have been obtained, and, possibly most interesting of all, some serious questions about the manner in which the FOIA requests were handled at the CRU.
EMAIL 101
There are many different email systems, and many different ways for end users to access them. The basics are common to all of them, however. Each user has a “client” that allows them to access their email. It could be an internet browser based client such as the ones used by Hotmail and Gmail, or it could be an email client that runs on your desktop computer like Outlook or Eudora. For the purposes of this discussion I am going to describe how things work from the perspective of an email client running on a desktop computer.
The email client connects to an email server (or servers in a very large implementation). To send an email to someone on a different email server, the two servers must “talk” to each other. In most cases they do so over the internet. How the clients interact with the servers, however, is part of understanding why deleting an email that you sent (or received) is not straightforward. The reason is that an email is never actually “sent” anywhere. Once you write an email it exists on the disk drive of the computer the client software is installed on. Press “send” and it goes….nowhere. It is still there, exactly as it was before you “sent” it.
A copy however, has now been sent to the email server you are connected to. That email server makes yet another copy and sends it to the email server the recipient is connected to. That email server then makes still one more copy and sends it to the email client on the recipient’s computer, which in turn writes it to the local hard drive. There are now a minimum of four copies of that one email.
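To make the mechanics concrete, here is a minimal sketch in Python of what “pressing send” actually does. The addresses and server name are invented for illustration; the point is simply that the local draft never moves, while copies propagate outward.

```python
# Minimal sketch of sending an email. Addresses and the server name are
# hypothetical; the point is that the local copy never leaves the machine.
import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "researcher@dept.example.ac.uk"   # invented addresses
msg["To"] = "colleague@other.example.edu"
msg["Subject"] = "Draft paper"
msg.set_content("Please find my comments below...")

# Copy 1: the client keeps the message on the local hard drive (a Sent folder).
with open("sent_copy.eml", "wb") as f:
    f.write(bytes(msg))

# Copy 2: a second copy is transmitted to the sender's mail server over SMTP.
# The recipient's server (copy 3) and the recipient's client (copy 4) follow.
with smtplib.SMTP("mail.dept.example.ac.uk") as smtp:
    smtp.send_message(msg)
```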
But wait, there may be more copies. When researchers first started exchanging information via email, they were commonly years ahead of the rest of the world. Most large organizations had central IT shops, but they ran financial applications for the most part, email was a curiosity at best. Many researchers were left to run their own email systems, and it wasn’t that hard to do. Solaris was the UNIX operating system in vogue in those days, and Solaris came with a pretty good email system built in called Sendmail. There were many other options too. The bottom line was that early email systems were frequently run by researchers on their own computers.
As time went on, email became more common, and it became more important. The volume of data, performance, and security were all moving beyond the skill set of anyone other than someone whose full-time job it was to run IT (Information Technology) systems. Researchers began giving up ownership of their own email systems and central IT shops took over. Email was becoming mission critical, and a lot of data was being stored in email systems, along with records of contract negotiations and other important “paper” trails. Losing email was becoming a painful matter if important information disappeared as a result. As a consequence, the systems that protected the data on email systems also began to mature and be run professionally by IT departments.
The early email systems were just a single server with local hard drives. As they grew in capacity and overall usage, plain old hard drives could no longer keep up. Storage arrays emerged which used many hard drives working together to increase both capacity and performance. Storage arrays also came with interesting features that could be leveraged to protect email systems from data loss. Two important ones were “snapshots” and “replication”.
Snapshots were simply point-in-time copies of the data. By taking a snapshot every hour or so on the storage array, the email administrator could recover from a crash by rolling back to the last available snapshot and restarting the system. Some storage arrays could handle keeping a few snapshots, others could maintain hundreds. But each snapshot was effectively a full copy of the data! Not only could a storage array store many copies of the data, consider the question of deletion. If an email was received and then deleted after a snapshot, even by the central IT department itself, the email would still exist in the last snapshot of the data, no matter what procedure was used to delete it from the email system itself.
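A toy model (not any vendor’s actual snapshot API) shows why deletion doesn’t reach into snapshots: a message deleted from the live system after a snapshot was taken is still sitting in that snapshot.

```python
# Toy illustration of point-in-time snapshots on a storage array.
import copy

mailbox = {"msg-001": "email received at 08:45"}
snapshots = []                                # hourly point-in-time copies

snapshots.append(copy.deepcopy(mailbox))      # 09:00 snapshot taken

del mailbox["msg-001"]                        # email later deleted from the live system

print("live mailbox :", mailbox)              # {} - gone
print("09:00 snapshot:", snapshots[-1])       # the email is still here
```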
What if the storage array itself crashed? Since the storage arrays could replicate their data to other storage arrays, it wasn’t uncommon to have two arrays and two email servers in a computer room so that no matter what failed, the email system could keep on running. What if the whole computer room burned down? Replication to another storage array at a completely different location is also very common, and should the main data centre burn down, the remote data centre would take over. Keep in mind as you think this through that the ability of the storage arrays to replicate data in this fashion is completely and totally independent of the email system itself.
Early email systems were, as mentioned before, most often a single server with internal hard drives. A modern “enterprise class” email system would be comprised of many servers and storage arrays more like this:
If you recall that just sending an email makes, at minimum, four copies, consider what “one” copy on a large email system actually translates to. In the figure above, there are two copies on the storage arrays in the data centre. If snapshots are being used, there may be considerably more. Plus, there is at least one more copy being replicated to a remote data center, which also may have regular snapshots of data. That’s a LOT of copies of just one email! And we haven’t even started talking about backup and archive systems yet.
Let’s return to the question of deleting email. It should be plain to see that in terms of current email technology, deleting an email just from the email system itself is not a simple task if your intention is to erase every single copy that ever existed.
As an end user, Phil Jones is simply running an email client connected to an email server run by somebody else. He has no control over what happens on the server. When he deletes an email, it is deleted from his email client (and hence the hard drive on his computer), and from his view of his emails on the email server. Technically it is possible to set up the email server to also delete the email on the server at the same time, but that is almost never done, and we’ll see why when we start discussing backup, archive, and compliance.
On the other hand, are we talking about what was most likely to happen when Phil Jones deleted an email in 2009? Or what was most likely to happen when Phil Jones deleted an e-mail in 1996? The answers would most likely be entirely different. In terms of how email systems have been run in the last ten years or so however, while it is technically possible that Phil Jones hit delete and erased all possible copies of the email that he received, this would have done nothing to all the copies on the sender’s desk top and on the sender’s email server… and backup systems. Let’s jump now into an explanation of additional systems that coexist along with the email system, and make the possibility of simply deleting an email even more remote.
Backup Systems
Just as we started with email and how it worked at first and then evolved, let’s trace how backup systems worked and evolved. There are many different approaches to backup systems, but I’ll focus here on the most common, which is to make a copy of data to a tape cartridge.
At first, backup was for “operational” purposes only. The most common method of making a backup copy of data for a server (or servers) was to copy it to tape. The idea was that if a disk drive failed, or someone deleted something inadvertently, you could restore the data from the copy on tape. This had some inherent problems. Suppose you had a program that tracked your bank account balance. But for some reason you want to know what the bank account balance was a week ago, not what it is today. If the application didn’t retain that information, just updated the “current” balance as it went, you would have only one choice, which would be to restore the data as it existed on that specific day. To do that, you’d need one tape for each day (or perhaps one set of tapes for each day in a larger environment). That starts to be a lot of tape fast. Worse, as data started to grow, it was taking longer to back it up (and the applications had to be shut down during that period) and the amount of time at night where people didn’t need their applications running kept shrinking as companies became more global.
Several approaches emerged, and I will be covering only one. The most common by far is an approach called “weekly full, daily incremental”. The name pretty much describes it. Every weekend (when the backup window is longest), a full copy of the data is made to tape. During the week, only what changed that day is copied to tape. Since changes represent a tiny fraction of the total data, they could be run in a fraction of the time a full copy could. To restore to any given day, you would first restore the last “full copy” and then add each daily “incremental” on top until you got to the day you wanted.
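As a rough sketch (data structures invented for illustration, not any real backup product), restoring to a given day means starting from the last full copy and layering each daily incremental on top:

```python
# "Weekly full, daily incremental" restore, in miniature.
def restore(full_backup, incrementals, target_day):
    """full_backup: file -> contents from the weekend full (day 0).
    incrementals: one dict per weekday, holding only the files that changed."""
    state = dict(full_backup)
    for day, changes in enumerate(incrementals, start=1):
        if day > target_day:
            break
        state.update(changes)                 # apply that day's changes
    return state

full = {"mailstore.db": "state_sunday"}
daily = [{"mailstore.db": "state_monday"},    # Monday incremental
         {"mailstore.db": "state_tuesday"},   # Tuesday incremental
         {}]                                  # Wednesday: nothing changed

print(restore(full, daily, target_day=2))     # -> {'mailstore.db': 'state_tuesday'}
```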
This worked fine for many organizations, and larger ones bought “tape libraries” which were exactly what they sound like. They would have slots for dozens, sometimes hundreds, of tape cartridges, several tape drives, and a robot arm that could change tapes for both backup processes and for restore processes. The problem was that the tape library had to be as close as possible to the servers so that data could be copied as fast as possible (performance degrades sharply with distance). The following depicts the email system we’ve already looked at, plus a tape backup system:
By making regular copies of data to tape, which was a fraction of the cost of disk storage, the IT department could have copies of the data, exactly as it existed on any given day, and going as far back as the capacity of the tape library (or libraries) would allow. Now try deleting an email from say a year ago. In addition to all the copies on disk, there are at least 52 copies in the tape library. Since we have a tape library however, it is easy to make still more copies, automatically, and most organizations do.
Disaster Recovery
What if there was a major flood, or perhaps an earthquake that destroyed both our local and remote data centers? In order to protect themselves from disaster scenarios, most IT shops adopted an “off site” policy. Once the backup was complete, they would use the copy of the data on tape to make… another copy on tape. The second set of tapes would then be sent to an “off site” facility, preferably one as far away as practical from the data centers themselves.
Consider how many copies of a given email now exist at any given time. Unlike that financial application whose “current account balance” is constantly changing, email, once received, should never change. (But it might, which is a security discussion just as lengthy as this one!) Provided the email doesn’t change, there are many copies in many places, and no end user would have the security permissions to delete all of them. In fact, in a large IT shop, it would take several people in close cooperation to delete all the copies of a single email. Don’t organizations ever delete their old data?
Data Retention
That question can only be answered by knowing what the data retention policy of the organization is. Many organizations just kept everything until the cost of constantly expanding their storage systems, tape libraries, and off-site tape housing started to become significant. Many organizations decided to retain only enough history on tape to cover themselves from a tax law perspective. If the retention policy was implemented correctly, any tapes older than a certain period of time would be removed from the tape library and discarded (or possibly re-used and overwritten). The copies in the offsite storage facility would also be retrieved to be either destroyed or re-used so that the offsite data and the onsite data matched.
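A minimal sketch of what “implemented correctly” means in practice (the dates and the seven-year window are invented): tapes past the retention window are identified and recalled for destruction or re-use, onsite and offsite alike.

```python
# Toy retention-policy sweep over a tape catalogue.
from datetime import date, timedelta

RETENTION = timedelta(days=7 * 365)           # e.g. keep seven years for tax purposes

tapes = [
    {"label": "FULL-2001-W14",      "written": date(2001, 4, 7),  "location": "onsite"},
    {"label": "FULL-2001-W14-COPY", "written": date(2001, 4, 7),  "location": "offsite"},
    {"label": "FULL-2011-W40",      "written": date(2011, 10, 1), "location": "onsite"},
]

today = date(2011, 11, 30)
for tape in tapes:
    if today - tape["written"] > RETENTION:
        print("recall and destroy or re-use:", tape["label"], "-", tape["location"])
```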
Archive
As email systems grew, the backup practices described above became problematic. They were designed for general purpose applications with ever-changing data, and how long people wanted to keep their email was often in conflict with the retention periods set for financial purposes. As the amount of data in an email system started to grow exponentially due to ever larger attachments, graphics, and volume, the expense and pressure on even an “incremental” backup window became enormous. That’s where archive started to emerge as a strategy. The storage arrays that supported large email systems were very expensive because they had to be ultra reliable as well as ultra high performance. But 99% of all emails were being read on the day they were sent… and never again. Only if something made an older email important… evidence of who said what and when from a year ago for example, would an email be accessed again after it was a few days old. So why house it on the most expensive storage the organization owned? And why back it up and make a copy of it every week for years?
Many organizations moved to an “archive”, which was simply a way of storing email on the cheapest storage available. If someone needed an email from a year ago, they would have to wait minutes or perhaps hours to get it back. Not a big issue provided it didn’t need to be done very often. Some organizations used low-performance, low-cost disk; some even went so far as to write the archive to tape. So, for example, the email you sent and received in the last 90 days might open and close in seconds, but something from two years ago might take an hour. Not only did this reduce the cost of storing email data, but it had the added benefit of removing almost all the email from the email system and moving it to the archive. Since the archive typically wasn’t backed up at all, the only data the backup system had to deal with in its weekly full, daily incremental rotation was the last 90 days. This left an email system, with the integrated backup and archive systems, looking something like this:
For most IT shops, if you ask them how many copies of a given email they have if it was sent a year ago, they can’t even answer the question. Lots.
What does that mean in terms of FOIA requests? Plenty.
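Before moving on to compliance, here is a minimal sketch of the age-based tiering just described. The 90-day threshold and the data layout are illustrative only:

```python
# Toy age-based archiving: old mail moves off expensive primary storage.
from datetime import datetime, timedelta

ARCHIVE_AFTER = timedelta(days=90)

def tier_mailbox(messages, now):
    """messages: list of (received_datetime, body). Returns (primary, archive)."""
    primary, archive = [], []
    for received, body in messages:
        (archive if now - received > ARCHIVE_AFTER else primary).append(body)
    return primary, archive

now = datetime(2009, 11, 17)
msgs = [(datetime(2009, 11, 1), "recent mail"),
        (datetime(2007, 3, 5),  "two-year-old mail")]

fast_tier, cheap_tier = tier_mailbox(msgs, now)
print("stays on the email system:", fast_tier)    # only recent mail gets backed up weekly
print("moves to the archive     :", cheap_tier)   # cheap storage, slower to retrieve
```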
Compliance
The world was rolling along quite nicely using these general techniques to protect data, and then the law got involved. Enron resulted in Sarbanes-Oxley in the United States and similar laws in other countries. FOIA legislation came into existence in most western countries. Privacy laws cropped up. Suddenly IT had a new problem, and a big one. The board of directors was suddenly asking questions about data retention. The IT department went from not being able to get a meeting with the board of directors to having the board shining a spotlight on them. Why?
Because they (the board of directors) could suddenly go to jail (and some did) because of what was in their email systems. Worse, they could even go to jail for something that was NOT in their email system. The laws in most jurisdictions took what you could delete, and what you could not delete, to a whole new level. Worse (if you were a member of the board of directors) you could be held responsible for something an employee deleted and shouldn’t have…. or didn’t delete and should have. Bingo. The board of directors is suddenly no longer interested in letting employees decide what they can and cannot delete, and when. The same applied in most cases to senior management of public institutions.
Back to the original question. Could Phil Jones have deleted his emails? When? In the early days when his emails were all in a server run by someone in his department? Probably. When the email system moved to central IT and they started backing it up regularly? No. He would only be able to permanently delete any given email provided that he had access to all the snapshot copies on all the storage arrays plus the archive plus all the backup tapes (offsite and onsite). Fat chance without the express cooperation of a lot of people in IT, and the job of those people, based on laws such as FOIA, SOX and others was to expressly prevent an end user such as Phil Jones from ever doing anything of the sort, because management had little interest in going to jail over something someone else deleted and shouldn’t have.
So…did CRU have a backup system? Did they send tapes off site? Did they have a data retention policy and what was it? Did they have an archive? If they had these things, when did they have them?
With all that in mind, now we can look at two other interesting issues:
- What are the possible ways the emails could have been obtained?
- Were the proper mechanisms to search those emails against FOIA requests followed?
Short answer: No.
In terms of how the emails could have been obtained, we’ve seen various comments from the investigation into ClimateGate 1 that they were most likely obtained from accessing an email archive. This suggests that there was at least an email archive. Without someone laying out a complete architecture drawing of the email systems, archive system, backup system, data retention policies and operational procedures, we can only guess at how the system was implemented, which options were available, and which were not. What we can conclude, however, is that at some point in time, an archive was implemented. Did it work like the description above about archives? Probably. But there are many different archive products on the market, and some IT shops refer to their backup tapes as an archive just to confuse matters more.
In addition, without knowing how the investigators came to the conclusion that the emails were obtained from the archive, we don’t have any way to assess the quality of their conclusions. I’m not accusing them of malfeasance, but the fact is without the data, we can’t determine if the conclusions are correct. Computer forensics is an “upside down” investigation in which the “evidence” invariably points to an innocent party. For example, if someone figured out what Phil Jones’ username and password were, and used them to download the entire archive, the “evidence” in the server logs would show that Phil Jones did the deed. It takes a skilled investigator to sort out what Phil Jones did (or didn’t do) from what someone using Phil Jones’ credentials did (or didn’t do). So let’s put aside what the investigators say they think happened and just take a look at some of the possibilities:
Email Administrator – anyone who had administration rights to the email system itself could have made copies of the entire email database going back as far as the oldest backup tapes retained with little effort. So…who had administration rights on the email system itself? There’s reason to believe that it was not any of the researchers, because it is clear from many of the emails themselves that they had no idea that things like archives and backup tapes existed.
Storage Administrator – In large IT shops, managing the large storage arrays that the application servers are attached to is often a job completely separate from application administration jobs such as running the email system. Since the storage administrator has direct access to the data on the storage arrays, copying the data from places such as the email system and the archive would be a matter of a few mouse clicks.
Backup Administrator – This again is often a separate job description in a large organization, but it might be rolled in with storage administration. The point being, however, that whoever had backup administration rights had everything available to copy with a few mouse clicks. Even in a scenario where no archive existed, and copying the data required restoring it from backup tapes that went back 20 years, this would have been a snap for the backup administrator. Provided that the tapes were retained for that length of time of course, the backup administrator could simply have used the backup system itself, and the robotics in the tape library, to pull every tape there was with email data on it and copy the emails to a single tape. This is a technique called a “synthetic full” and could easily run late at night when it would just look like regular backup activity to the casual observer. The backup administrator could also “restore” data to any hard drive s/he had access to… like their personal computer on their desk.
Truck Driver – yes, you read that right, the truck driver. Google keywords like “backup tapes stolen truck” and see what you get. The results are eye popping. The companies that specialize in storing tapes off site for customers send a truck around on a regular basis to pick up the weekly backup tapes. There have been incidents where entire trucks (and the tapes they were carrying) were stolen. Did anyone steal CRU’s tapes that way? Probably not. The point is however that once the tapes leave your site and are entrusted to another organization for storage, they could be copied by anyone from the truck driver to the janitor at the storage site. Assembling 20 years of email from backup tapes could be a real hassle of course. On the other hand, an offsite storage facility frequently has as part of the service it provides to clients…great big tape libraries for automating copying of tapes. Encryption of backup tapes was a direct response to incidents in which tapes with valuable (and/or embarrassing information) wound up in the wrong hands.
But encryption has only been common for a few years. That raised an interesting theoretical question. The last email release ends in 2009, and the rest of the release is, in fact, encrypted. One can only wonder, does the CRU encrypt their backup tapes, and if so, when did they start doing that?
Administrative Foul Up – One of the biggest “cyber crimes” in history occurred when a company doing seismic processing for oil companies cycled the tapes back to their customers for the next round of data, and sent old tapes to different customers. One of their customers figured it out, and started checking out the data they were being sent which was from their competitors. It wasn’t the first time it happened, and it wasn’t the last time.
Janitor – Let’s be clear, I’m not accusing anyone, just making a point. There’s an old saying about computer security. If you have physical access, then you have access. Anyone with physical access to the computer room itself, and the right technical skills, could have copied anything from anywhere.
The FOIA Requests
There are dozens of emails that provide glimpses into both how the email systems at CRU were run, and how FOIA requests were handled. Some of them raise some very interesting questions. To understand just how complex compliance law can be, here’s a brief real world story. Keep in mind as you read this that we’re talking about American law, and the CRU is subject to British law which isn’t quite the same.
In the early days of compliance law, a large financial services firm was sued by one of their clients. His claim was that he’d sent instructions via email to make changes to his investment portfolio. The changes hadn’t been made and he’d suffered large losses as a result. His problem was that he didn’t have copies of the emails he’d sent (years previous) so his legal case was predicated upon the financial firm having copies of them. To his chagrin, the financial firm had a data retention policy that required all email older than a certain date to be deleted. The financial firm figured they were scot free. Here’s where compliance law starts to get nasty.
A whistle blower revealed that the financial firm had been storing backup tapes in a closet, and had essentially forgotten about them. A quick inspection revealed that a number of the backup tapes were from the time in question. The financial services firm asked the judge for time to restore the data from the tapes, and see what was on them that might be relevant. The judge said no.
The judge entered a default judgment against the financial services firm awarding the complainant $1.3 Billion in damages. The ruling of the court was that the financial services firm was guilty by virtue of the fact that they had told the court the data had been deleted from that time period, but it hadn’t been. They had violated their own data retention policies by not deleting the data, and were guilty on that basis alone. Wake up call for the financial industry…and everyone else subject to compliance law, which includes FOIA requests.
Suddenly deleting information when you said you hadn’t was a crime. Not deleting information when you said you had, was a crime. Keeping information could wind up being used against you. Not keeping information that it turns out you were required to keep (by the tax department for example) could be used against you. No one serious about compliance could possibly take the risk of allowing end users to simply delete or keep whatever they wanted. From their own personal accounts certainly, but not from the company email server. Ever.
In that context, let’s consider just a few words from one email in which Phil Jones, discussing with David Palmer whether or not he’d supplied all the email in regard to a specific FOIA request, says “Eudora tells me…”
These few words raise some serious questions. Eudora is an email client, similar to the more familiar Outlook. So, let us ask ourselves:
Why was David Palmer relying on Phil Jones to report back all the emails he had? Compliance law in most countries would have required that David Palmer have the appropriate search done by the IT department. This would have captured any emails deleted by Phil Jones that were still retained by the CRU based on their data retention policy.
Was David Palmer aware of the proper procedure (to get the search done by IT)? If not, was he improperly trained and who was responsible for properly training him in terms of responding to FOIA requests? If he was aware… well then why was he talking to Phil Jones about it at all?
Phil Jones specifically says that “Eudora tells me” in his response to Palmer. Since Phil Jones evidently did the search from his own desktop, the only emails he could search for were ones that he had not deleted. But that doesn’t mean he found all the emails subject to the FOIA request, because email that he did delete was more than likely retained on the CRU email server according to their data retention policies. As in the case of the financial company, the CRU may well have said they didn’t have something that they did. In fact, we can surmise this to be highly likely. Multiple emails show up in which, for example, Phil Jones says he is going to delete the message right after sending it. But we now have a copy of that specific message. Did he send it and then forget to delete it? Probably not. The more likely answer is that he did delete it, not realizing that the CRU data retention policy resulted in a copy being left on the server. If the CRU responded to a FOIA request and didn’t include an email that met the FOIA parameters because they failed to search all their email instead of just the email that Phil Jones retained in his personal folder… well, in the US, there would be some prosecutors very interested in advancing their careers…
“Eudora tells me” is even more curious from another perspective. Why “Eudora”? Why didn’t he say that he’d searched all his email? Why specify that he’d used the search capabilities in Eudora? Personally, I have three email systems that I connect to, and a different email client for each one. Searching all my email and searching all my email in just one email client are two completely different things. Most interesting!
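For contrast, here is what a server-side search (the kind an IT department would run against everything the organization retained, rather than one user’s desktop client searching its own local folders) could look like. This is purely hypothetical: the host, account, mailbox, and search terms are invented, and nothing here is claimed about how CRU’s systems were actually set up.

```python
# Hypothetical server-side search of a retained/journaled mailbox over IMAP.
import imaplib

with imaplib.IMAP4_SSL("mail.example.ac.uk") as imap:      # invented host
    imap.login("records-officer", "********")              # invented account
    imap.select('"Archive/AllStaff"', readonly=True)       # invented mailbox
    # A search run here covers whatever the server retained, including mail
    # a user already deleted from their own client.
    status, hits = imap.search(None, '(SINCE "01-Jan-2005" TEXT "station data")')
    print(status, hits[0].split())
```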
While I am comfortable discussing with IT shops how to architect email systems to protect data and properly service legal requirements such as FOIA requests, I have to admit that I wouldn’t know where to start in terms of submitting one. If I did, I just might ask for all the emails they have pertaining to FOIA requests, and I’d be specific about wanting all the email, regardless of it being in the main email system, the archives, or on backup media, and all the email that ever existed, regardless of having been deleted by end users. Then for the capper, I’d ask for their data retention policy and see if they managed to meet it in their response.
Just sayin’
dmh
Everything is so goddamn complex unless you know how to do it. But when you know what to do and how to do it, it’s very easy and not at all complex. It was a bit harder some 10-15 years ago, but these days, not so much.
1. Disregard everything you think you know about Climategate 1. Too much information makes you susceptible to presumptions.
2. What then do you actually know about Climategate 2? The real actual facts, not what you think are the facts.
3. From where do most people leak? Excepting the usual orifices.
4. From where do most people hack? Excepting the woods.
5. Put the remainder into a simple visual social network analysis PowerPoint presentation if you like.
6. Try connecting each actual real fact to each person in the network using Bing and Google.
7. Probability dictates, these days, that you’re then done.
Pretty much everything that is done via the internet is stored on the internet, so it’s not at all that difficult these days to find out about stuff, the hard part is actually figuring out why people do what they do, unless they’re trying to fix something that is corrupt that is. :p
“Interesting approach. My own non-DOD-approved technique is to remove the hard drive platters and bend them into interesting shapes. In theory you might be able to recover some fragments of data, but it’s unlikely you’d be able to do anything with them.”
I know you’re talking about large arrays and hard drives, rather than personal backups, but it reminded me of my approach to making my DVD backups non-accessible before I throw them in the trash. About 5-6 seconds in the microwave works great!
re: single-overwrite
I suggest a quick search on the “Great Zero Challenge”. The original challenge was by 16 Systems, I think. It was up for several years and finally taken down, unaccepted. Several smaller groups appear to have made similar offers under the same name and also claim that no one has ever taken them up on the offer.
A good analysis of Gutmann’s original paper about the ability to read overwritten data was published by Daniel Feenberg of the National Bureau of Economic Research at http://www.nber.org/sys-admin/overwritten-data-guttman.html
One caveat (and maybe the source for the recent multiple-overwrite recommendations): When a sector goes bad, the data is still in the sector even though your OS can’t access it. You can, however, format and attempt to overwrite the device. Only some of the overwrites of the bad sector will fail. Overwrite enough times and you have a good chance of overwriting even the bad sectors.
And, yeah, the paint blob explanation is a pretty loose analogy for what’s really done but trying to concisely explain fractional additive charge probabilities and quantum interference patterns was beyond my ability. It’s possible the NSA has something that the rest of us don’t know about. I’m skeptical, though. Write sizes are approaching the theoretical limits of resolution for the media and all the research I’ve seen suggests that you needed a read head 20x more precise than was used to write to even have a shot at reading overwritten magnetic media.
This whole thing is incredible. Anti-Cause WUWT undercover Agent Unit #1 ‘FOIA’ gambled a great risk—he/she waited for the anniversary to release the next tranche of emails, Climategate II, assuming he’d be alive to do so!! I wouldn’t put it past this conglomeration (deluded into thinking no one should stand in the way of them saving the planet) to investigate FOIA and perhaps ‘intervene’. The access to the emails points to an inside job(?) (see above discussions). We know it was dumped onto a Russian server. The few who are the web gatekeepers should be able to trace the time and location of the uploads. What is incredible is that the feat was repeated since ClimateGate I. The leaks derailed the Copenhagen meeting for Sark, Merk, Brown and O’bama (IMHO). The combined secret service power of these nations seems unable to ‘deal with’ FOIA. A possible clue to his/her identity: whoever it is needs to make provision for the password release in case of a visit from the heavies. The Climate Science perversion is total. Why would ‘Russia’ want to derail the AGW steamroller? Who, without the protection of a state, could wait a year and repeat the release of these damaging emails??? GCHQ should surely be able to trace an upload of that Climategate size to a Belligerent State’s (Hi Chelski fans) obscure server. Governments don’t suffer individuals with the power to politically derail $37 Trillion tabbed tax schemes. FOIA—whoever you are: stay thirsty my friend!!
And as he posted on TallBlokes under FOIA….wouldn’t Tall Bloke have some basic information to identify something about this……….hero??.
davidmhoffer ,
Outlook against Exchange in cached mode does use an OST file, similar to a PST file.
But in non-cached mode there isn’t a file (OST or PST), though I suspect there HAS to be a trace somewhere. I was just trying to track down some unknown disk usage today on a client machine, so was hoping you might have known 🙂
a very interesting thread, again I thank you.
my username got skewed last time LOL
For a crash course in real world information security go to the CEO’s desk, or virtually any (non-IT) senior executive’s desk and open the center drawer. His username and passwords for all the systems he has access to will be taped there for him to use when he needs them – unless they’re taped to his keyboard or the front of his monitor.
De-Duplication – does it change anything?
Yes. And no. Like all technology questions…. it depends.
Let’s start with what it is. Even that isn’t straight forward.
Data has a lot of duplication within it. Let’s think about it just in terms of email to see how much duplication there could be. Suppose one sent an e-mail to 100 people, and the email had an attachment (say a large Word document that was 1 Megabyte in size). Suppose further that all of those people are on the same mail server (within your company for example). How many copies of the email and the attachment are there? Better still, let’s further suppose that someone changes a single word in the attachment, hits reply all, and sends it. Now how many copies of what are there?
Answer: It depends. The answer varies from one email system to another. Sendmail, Lotus Notes and Exchange all do things differently. So, let’s just focus on one of those. Exchange is the most common, so we’ll go with that.
Answer: It depends. What version of Exchange? Let’s say Exchange 2003, which is a couple of versions ago.
Answer: It depends. Are all the users in the same datastore? Or are they spread out amongst several data stores? OK, let’s make it easy, and say they are all in the same datastore.
Answer: There is one copy of the first email, one copy of the original attachment, 100 “pointers” to them, one copy of the “reply all” email, one copy of the attachment with one word changed, and 100 pointers to them. For easy reference, let’s call them Email1, Attachment1, Email2 and Attachment2 respectively.
If we were on Exchange 2007 however (the second oldest release) we’d get a different answer. There would be 100 copies of Email1, but only one copy of Attachment1 (plus 100 pointers). There would also be 100 copies of Email2, and one copy of Attachment2 (plus 100 pointers).
Now if we were on Exchange 2010, there would be (hang onto your hat) 100 copies of Email1, 100 copies of Attachment1, 100 copies of Email2 and 100 copies of Attachment2. No, I am not kidding! Isn’t that going backwards? Needing more storage for the exact same emails as we did with the older version?
Answer: It depends. The disk performance required by Exchange 2003 was, by comparison to Exchange 2010, ginormous. So, since the performance requirements of Exchange 2010 are so much lower, you can actually store all your email for less money than you did with Exchange 2003. The expression “your mileage may vary” applies, however. But let’s put that aside for the moment.
If you have been following along, you’ve already figured out that there really are only two unique emails and two unique attachments. On top of that, the two attachments, which are huge, are identical except for one single word. The mechanism used in Exchange 2003 to store one copy of each and 100 “pointers” was called “Single Instance Store” or SIS. In brief, SIS was a way to have 100 copies of something without actually “duplicating” it 100 times. But in Exchange 2010, it is in fact duplicated 100 times. So SIS is one form of “de-duplication”. But when people talk about “de-duplication” they are, in most cases, talking about “de-duplicated” backup. To add to the confusion, however, de-duplication actually applies to much more than backup.
Let’s discuss backup first. In the article we discussed backup to tape. Tape was designed to back up large amounts of data and restore large amounts of data. If all you want is one file though (say someone deleted their one page document) it might take as long to restore that one file as the whole backup took. If only the users would coordinate their accidental deletions and do them all at the same time…. but they don’t. Like herding cats, those end users…
So along came these very interesting specialized storage arrays that could “de-duplicate” data, and could look like a tape library to the backup system. This changed backup architectures considerably, solving a lot of problems along the way. And of course, creating some new ones. To illustrate how complex the question of de-duplication is, let’s follow the changes in backup systems based on the three different versions of Exchange already discussed.
If we were backing up Exchange 2003, we’d only have two emails, two attachments, and a bunch of pointers to back up, right? Wrong. The backup program doesn’t look at the data on the storage array. OK, that’s not true either, there are ways for the backup program to talk to the storage array directly, but let’s forget about that for a moment. In most cases the backup program would talk to the email server and ask it to send out copies of all the emails it has. Now Exchange 2003 doesn’t give a hoot about what is actually on disk, it responds with what it logically has stored. That is 200 emails and 200 attachments, and it sends each one individually. If we were backing up to tape, we would actually write 200 emails and 200 attachments to tape. Good thing we bought that de-duplication device.
The de-duplication device recognizes that it just got 100 copies of the same email, so it makes one copy on disk, with 100 pointers. It does the same with the other email, and the attachments. But wait! Those two attachments are nearly 100% identical. The de-duplication device picks up on that too, and stores only one copy of the attachment, plus the delta changes between it and the second attachment. So, even though Exchange 2003 had SIS, the de-duplication device used even less storage for backing up the data than the email system itself used. It could “de-duplicate” the two attachments. In fact, if one paragraph was repeated inside the attachment, it could very possibly de-duplicate that too.
Paradoxically, from a backup perspective, Exchange 2007 and Exchange 2010 would look exactly the same to the backup software as Exchange 2003 did. But the amount of disk space needed to back up Exchange 2010 to a de-duplication device would be (in this example) less than 1% of the disk space on the email server. (The purists would argue that this statement is incorrect, and it is. Exchange 2010 uses a number of compression techniques, so while the apparent storage on the email server is 100X, the actual storage used is less.)
When someone wants to restore data from the de-duplication device, in theory it doesn’t actually exist. There are a whole bunch of blobs of data with a whole bunch of pointers. If someone deleted that large Word document and needed it back, it could be restored, and to do so, the de-duplication device would re-assemble the original document from the various blobs on the disk. This is a process known as “re-hydrating” the data.
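A toy content-addressed store (not any vendor’s product) captures the essence: each chunk is stored once under its hash, every “copy” is just a list of pointers, and re-hydration walks the pointers to rebuild the original data. Note how an attachment differing by a single word shares almost all its chunks with the original.

```python
# Toy de-duplicating store with re-hydration.
import hashlib

chunk_store = {}                                  # hash -> chunk (stored once)

def dedup_write(data, chunk_size=8):
    pointers = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        key = hashlib.sha256(chunk).hexdigest()
        chunk_store.setdefault(key, chunk)        # store only if never seen before
        pointers.append(key)
    return pointers                               # what gets kept per "copy"

def rehydrate(pointers):
    return b"".join(chunk_store[key] for key in pointers)

attachment1 = b"quarterly report text " * 50
attachment2 = attachment1.replace(b"report", b"REPORT", 1)   # one word changed

p1, p2 = dedup_write(attachment1), dedup_write(attachment2)
assert rehydrate(p1) == attachment1 and rehydrate(p2) == attachment2
print("chunks written:", len(p1) + len(p2), "| unique chunks stored:", len(chunk_store))
```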
OK, now we can start to answer the original question. De-duplication devices changed the way tape was used. It was so much easier to backup and restore to de-duplication devices that they became the main target for backup. One de-duplication device could hold the equivalent of weeks or months worth of backups to tape, but on disk. Could the tape libraries just be abandoned? No.
Tape was still a lot cheaper per terabyte in the long run. Tape didn’t need power to keep running. Tape could be thrown into a cardboard box and shipped off site. What many IT shops did was move their daily backup activity to their de-duplication device. They could then relegate tape to a much less frequent use, sometimes as little as once per quarter or even once per month. Better still, the copies of tape could actually be made from the data on the de-duplication device. The de-duplication device would re-hydrate the data and copy it to tape, producing a tape that was identical to having made a backup directly to tape in the first place. There were a number of secondary effects of de-duplication though.
The first one was that the number of tapes one needed for tape backup went way down. No more daily incrementals and weekly fulls. Maybe four or twelve full copies on tape for the year instead of 52. Suddenly the capacity of the tape library to keep years and years of data increased by a factor of four or more. Capacity is to an IT shop like space in a garage. The amount of stuff you need to keep will always expand to fill the capacity. So, one immediate effect of de-duplication was a massive increase in tape capacity that enabled IT shops to keep more years of data than they otherwise would have, and the number of tapes required to represent one year of data was a fraction of what it otherwise would have been. More years of data to…uhm… give liberty to, and fewer tapes required to do it.
Since the introduction of de-duplication devices, de-duplication has spread throughout the backup system. Backup systems can actually de-duplicate data at a target device such as the de-duplication device we have been discussing, or it can be done in software by the backup application itself, or it can be done by an agent running on the server. But all those options introduced a new problem.
For security purposes, encryption is the best way to protect data. Encryption could be done anywhere in the data food chain. At the application level, on the storage array, in the backup tapes, and so on. The problem was that encryption and de-duplication didn’t get along. Encrypt your data, and your de-duplication devices could no longer de-duplicate it. The same was true of compression. So, a lot of IT shops that were moving toward encryption of their data on disk, had to back away from that if they wanted the benefits of de-duplication. So, de-duplication increased risk to…having data liberated… because it reduced the number of places in the food chain that data could be protected using that method.
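A toy illustration of the conflict (the XOR keystream here stands in for a real cipher, purely to keep the example self-contained): two identical plaintext copies de-duplicate perfectly, but once each copy is encrypted with its own random nonce the stored bytes no longer match, so the de-duplication device sees unrelated data.

```python
# Why encryption defeats de-duplication: identical inputs stop looking identical.
import hashlib, os

def toy_encrypt(data, key):
    nonce = os.urandom(16)                                    # fresh nonce per copy
    stream = hashlib.sha256(key + nonce).digest() * (len(data) // 32 + 1)
    return nonce + bytes(a ^ b for a, b in zip(data, stream))

copy_a_plain = b"identical email body stored twice"
copy_b_plain = b"identical email body stored twice"
print("plaintext copies identical:", copy_a_plain == copy_b_plain)   # True - dedups to one copy

key = b"backup-encryption-key"
copy_a_enc = toy_encrypt(copy_a_plain, key)
copy_b_enc = toy_encrypt(copy_b_plain, key)
print("encrypted copies identical:", copy_a_enc == copy_b_enc)       # False - nothing to de-duplicate
```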
You’d think we’d be done with this topic by now, we’re not. There is also archive. Going back to the discussion in the backup section of the article, many IT shops instituted archives to reduce storage costs and take pressure off of backup systems. De-duplication is also a feature of many archive systems. As a result, the capacity to store email in an archive platform is much higher than it is in the email system itself, often by orders of magnitude. Capacity is like a garage… oops, I already said that. Bottom line is that archives that are capable of de-duplicating data could store years of email for little cost compared to the email system itself. So, if the source of the data was in fact the archive, our intrepid data liberator would have had considerably more emails all nicely stored in one place. Better still (or worse, depending on your point of view) archive systems came with sophisticated search tools. If someone had access to those tools in the archive, isolating emails with key words and downloading them from years worth of data, would be a snap.
To be complete, I’d have to add primary storage de-duplication, de-duplication as it applies to wide area acceleration, how compression and de-duplication don’t play well together, how deterministic de-duplication changes that, how journaling and de-duplication affect one another…. and instead, I’m going to bed.
I think that’s the basics though. De-duplication would have no doubt increased the amount of data that was available to liberate, decreased the number of security options available to protect it from unauthorized download, concentrated larger amounts of data in fewer tapes, and made searching for specific email easier. Again, I don’t know that CRU used all (or any) of these techniques, or that they enabled or hindered the liberator of the ClimateGate emails. I’m just saying what the possibilities were.
G’night!
mikerossander;
“There is, in fact, a reward offered for anyone who can recover once-overwritten data from any reasonably modern media. It has gone uncollected for several years.”
Probably because the “reward” is an NSA front. Anybody who tries to claim the reward is shot. /tinfoil hat
Alex Heyworth;
Probably because the “reward” is an NSA front. Anybody who tries to claim the reward is shot. /tinfoil hat>>>
Now now now, “shot” is such an ugly term. I believe the policy is to identify the source of the irritation and delete it or encrypt it (scramble them memory cells). hey, you got one of them tinfoil hats too? cool. works like a hot zzzzzzzt. huh? who are you? where am I?
been wondering if some of these mails/data were gotten off a versioning and/or CMS server such as sharepoint.
Hey it’s JEDI SALESMAN!
http://www.cracked.com/members/david.hoffer
ROFLMAO
davidmhoffer,
Well done post.
I am left with whether the self-named “we” was a single person (an “I”) wrt the CR1 and CR2 releases or an actual “we”.
After reading your wonderful post, I still lean toward there actually being a “we”; more than one person involved. The primary reason I think so is that I cannot see a unique personality in the effort and communications by “we”.
John
The emails were obtained off a backup email server in the IT department. Ostensibly few people knew about it including the CRU team and the person handling the FOIA requests. This is a finding of the UK government’s investigation into the climategate as given in item 31 of the conclusions in the Muir-Russell Review:
http://www.cce-review.org/pdf/FINAL%20REPORT.pdf
Unadvertised archiving of employee email is quite common. Even if the official policy states otherwise there’s usually someone in IT who does it anyway.
One might wonder if the Muir-Russell conclusion 31 is an honest finding. UEA responded to a number of FOIA requests saying there was no responsive material. Phil Jones and the rest of the usual suspects cleaned out their personal email storage around 2008 which was admitted in climategate emails. If those had been the only copies then it would have been a defensible response. If any of the CRU team or the FOIA respondent himself had known about the “extensive” “long duration” email backup server then UEA would have been committing a serious fraud in denying the existence of responsive materials. So I’m left wondering if it actually was the case that it was an honest mistake or whether the Muir-Russell review is itself covering up criminal activity with the seemingly less evil conclusion of “poor communication”.
I’d say it probably was an honest mistake. Many institutions keep a permanent easily searchable record of all employee emails and few institutions advertise the fact that they do. Often it’s done in contravention of official policy with IT doing it and everyone who’s aware it’s being done just looking the other way.
Let’s get something straight. I’ve actually written email clients and POP servers. I’ve been in the business of designing computer hardware and software since the 1970s. I didn’t sell the stuff, I designed it, and I worked with the people who used what I designed. There are so many mistakes in the OP I hardly know where to begin. This one below just happened to jump out at me:
This is not how it works in thin clients. Web browsers are thin clients. If you’re accessing your email through a browser nothing is stored locally on your computer. When you press send it does indeed leave your computer, except perhaps an inadvertent copy left behind in a temporary browser cache. Other email client implementations may not be thin clients, but thin clients have been increasingly popular since the mid to late 1990s when browsers became ubiquitous and internet bandwidth costs were plummeting.
John Whitman;
I am left with whether the self-named “we” was a single person (an “I”) wrt the CR1 and CR2 releases or an actual “we”.>>>
Having zero evidence in regard to how anything was or wasn’t done, exactly how their environment was set up, etc, I am nonetheless inclined to agree. Consider simple things like a backup administrator setting the backup system to cut an extra set of tapes for off site storage, and a truck driver from the off site storage company simply taking them out of the box. There are many ways in which an individual in the right place could pull this off themselves, but two or three people working together… and you have a crazy number of possibilities.
Dave Springer;
The emails were obtained off a backup email server in the IT department. Ostensibly few people knew about it including the CRU team and the person handling the FOIA requests. This is a finding of the UK government’s investigation into the climategate as given in item 31 of the conclusions in the Muir-Russell Review:>>>
The Muir-Russell Review that went out of its way to whitewash anyone and everything it could in regard to the entire incident? Yes, that would be a firm conclusion backed up by solid evidence that we could all just accept at face value. Sorta like tree ring data and hockey sticks…. here’s the results, no, you can’t see how we got them.
Dave Springer;
This is not how it works in thin clients. Web browsers are thin clients.>>>
You are correct. If you will refer back to the article itself, you will see that I was specific about the fact that there is a difference, and that the balance of the article would be written from the perspective of an email client installed on a desktop (ie a “thick” client). As at least one of the emails from Phil Jones refers to his use of Eudora, a thick client, it made most sense to restrict the article to a discussion of email from that perspective.
Dave Springer;
There are so many mistakes in the OP I hardly know where to begin. This one below just happened to jump out at me:>>>
Your personal animosity toward me is leading you astray. You claim that I’ve made many errors, yet single out one which, a brief read of the article itself, makes it clear was not an error on my part at all. I specified the difference, gave examples of both, and then advised that the balance of the article would focus on only one of them. Do you want to contribute to the discussion? Or just try and attack what I’ve said because you don’t like me personally?
Dave Springer says:
December 1, 2011 at 7:56 am
Hey it’s JEDI SALESMAN!
http://www.cracked.com/members/david.hoffer
ROFLMAO>>>
Hey that was hilarious! Are you certain that the author and me are the same person? There’s a David Hoffer who went to jail in Saskatoon and all the people that he owed money to tried to collect by putting liens on my assets. Boy, was that fun sorting out. Then there was a David Hoffer who had a one night stand with a girl in Winnipeg, got her pregnant, and phoned my house at 3:00AM demanding that I pay child support. My wife was not amused. If you hit LinkedIn, you’ll find 25 professionals named David Hoffer. There’s a David Hoffer who is a public defender, and another one who is an investment banker. Of course, there’s actually only one of me, those are all just personas I’ve adopted to keep you guessing.
Really Dave, drop it. Your obsession with discrediting me is tiresome and of little value to the discussion.
Store-and-forward email
This, where emails you write and receive are stored on your local computer, is considered obsolete. In 1993 when I first started working at Dell Computer we were using cc:Mail which is a store-and-forward system. This is what Hoffer is describing when he says “when you press ‘send’ nothing is really sent”. I think it was finally abandoned around 1998 at Dell but it may have been later. The last I recall needing a cc:Mail client was when I was over in Taipei working on the first Inspiron laptop circa 1998. In order to get my email I had to dial up Dell from my hotel room using an analog modem on my laptop and use a fat client called cc:Remote which retrieved any emails sent me since the last connection and forwarded anything I’d written. My personal email at that time was already a thin client where telnet was the client side software. It was pretty soon after that I began using web mail and never looked back. After January 2000 when I left the corporate rat race I’ve used nothing but hotmail (general purpose) and gmail (technical-only). I understand gmail introduced something called “Gears” which allows for some measure of offline work with your email. I read about “Gears” just a moment ago as I was engaged in due diligence fact-checking about fat clients that may still be around before posting this comment. Gears would be an example of a fat client.
Anyhow, ccMail itself was purchased by Lotus in 1991, Lotus was purchased by IBM in 1995, and was officially abandoned by IBM/Lotus in 2000. I recall the struggle. There was a push by Lotus to migrate cc:Mail into Lotus Notes. Lotus Notes sucked IMO. The UI was obnoxious and bordering on unusable for the casual user. I used Notes because our test/qualification system was tied to our suppliers so, for instance, when we were qualifying a newly designed motherboard Intel, NVidia, and assorted others could all work off the same issues database hosted through Notes with appropriate compartmentalization. I never recall our email going through Notes though and I was high enough up on the food chain so I didn’t have to muck around in Notes very often. An issue serious enough for my attention usually found me as opposed to me finding the issue.
mikerossander;
A good analysis of Gutmann’s original paper about the ability to read overwritten data was published by Daniel Feenberg of the National Bureau of Economic Research at http://www.nber.org/sys-admin/overwritten-data-guttman.html>>>
I read Feenberg’s article and agree 100% with his criticisms of Gutmann’s paper. Thanks for the link! I had a bit of time to dig through my own archives and have to admit that what I have (the newest correspondence I have on the matter is almost ten years old) suggests that you are probably correct. The precision and density of current hard drive technology would have made the techniques I recall pretty much obsolete. Also, the correspondence I have turned out to be in relation to formatting hard drives and whether or not the data could be recovered afterward (I’m talking low level format here, not an O/S format). The answer was “sometimes” because while formatting would return any given “bit” to “zero”, it would still be in a range called “zero” by the heads in the drive. By reading the absolute value of the magnetic “charge” of that “bit”, it was possible to guess that a “zero” at the top of the range was originally a “one”, but that a zero at the bottom of the range was originally a zero. After coming up with a “best guess” on a bit by bit basis, the technique was then to use parity bits and other ECC information to try and verify the guesswork.
However, that was recovering data from a format operation, actual over writing of data would be many, many, times as challenging, and, as you pointed out, the precision from ten years ago was completely different from what we have today.
dmh
Dave Springer;
This, where emails you write and receive are stored on your local computer, is considered obsolete. >>>
Dave, Dave, Dave…. Outlook is by far and away the most commonly used email client there is. It can be implemented with different options, the most common of which is “store and forward”. Phil Jones specified his use of Eudora, another thick client that is a “store and forward”. It matters not one whit what is obsolete and what isn’t. What matters is what technology was being used, and how it worked, which is what I’ve focused on. If Phil Jones had made comments about using thin clients, I would have focused more on that approach.
BTW, even if you are using a thin client, the email servers themselves still operate via a store and forward mechanism.
In regard to Lotus Notes, this is far more than an email system. It is also an electronic conferencing tool, and a workflow automation tool. The manner in which you describe interacting with Lotus suggests you were using functionality over and above the email system itself. If you want to trace the heritage of Lotus Notes from that perspective, you’d have to dig into technologies such as Digital Equipment’s VAX Notes which was pretty much abandoned after the VAX Notes development team resigned en masse and went to work for…Lotus.
Thank you, David, for an excellent read! Also, thanks to commenters for your valuable analysis and, as always, Anthony & the Mods for keeping the conversation going!
I’m just a humble biologist, but I’ve known the basics about email for many years (I have friends who helped to develop these systems at the University of Illinois in the 1970’s on our PLATO network system, a fascinating story). It blows my mind that these “brilliant” climate scientists were so ignorant about even the basics of email architecture & function!!
Ha ha ha, Eudora made him do it!! What a pack of fools!! Oops, now that comment will be retained for eternity on a drive somewhere. Oh well.
@davidmhoffer
“Phil Jones specified his use of Eudora, another thick client that is a “store and forward”. It matters not one whit what is obsolete and what isn’t. ”
It also doesn’t matter what kind of client Jones used. I never claimed he was using one or the other. I just said not all email clients keep a local copy which you do not dispute. You wrote:
“Once you write an email it exists on the disk drive of the computer the client software is installed on. Press “send” and it goes….nowhere. It is still there, exactly as it was before you “sent” it.”
This you wrote under your heading EMAIL 101. Now you’re trying to say it wasn’t a general introduction to email (hence 101) but rather on Phil Jones’ Suspected Email Client Configuration. Okay. Sure. That’s it. That’s the ticket.
You REALLY need to learn to stop digging.