Climate Audit is down

While we are on the subject of hardware failure (such as the one that has hit the DMSP satellite that NSIDC and Cryosphere Today use), Climate Audit is down due to a file system or hard drive error. It happens. I’m on my way to the CoLo (90 miles away) to effect a repair. Comments may be delayed for a few hours if other moderators aren’t online.

UPDATE: 5:30PM

The Climate Audit server is in fact RAIDed; I built it that way for just such an emergency. But some corrupted data was written before one disk of the array failed. Since I could not stay at the CoLo all day, I’ve brought the CA server to my office for repairs. Hopefully the RAID rebuild goes smoothly (it takes several hours) and I’ll be able to repair the problem areas. Both hard drives were new, RAID-quality units with a 3 year warranty. One failed 1.5 years into the warranty – that’s Murphy for ya.

Wish me luck; otherwise I have to rebuild from scratch and restore from backups, which is also a chore.

Just for those who like to know about hardware, here is what Climate Audit runs on:

3.4 GHz Intel Pentium D CPU

2 GB ECC DDR2 400 RAM

RAID1 Dual Western Digital 250GB SATAII drives with 16MB cache ram

Running Linux with WordPress in LAMP config

1u Intel Server enclosure like this one:

[Images: Intel 1U server enclosure, model SR1325TP1]

Thanks to those who hit the tip jar.

UPDATE: 8:30PM

One hard drive of the RAID failed. Now before you panic, let me say I anticipated this (but like 2 years from now), and this was a RAIDed system with two drives set up to mirror. Normally when one drive fails, I can unplug the failed drive and reboot the system, and it will come up and run on the remaining one. Then I can install a new second drive and rebuild the RAID, and off we go.

I’ve done that dozens of times in my own systems. It is why I built the CA server the way I did. It is an identical server to 15 others I’m running here.

But for some reason known only to Murphy, this time when the system failed sometime last night, it appears it wrote corrupted data to the “good” drive before the full hardware failure. So at the moment the system is unbootable.

The good news is that most everything should be recoverable, but it takes time. If I can’t repair the boot sector on the good drive, then we have to rebuild two new drives from scratch, mount the one good drive, and pull files over. Though I don’t know just yet how much corruption there is and how much of it can be fixed.

The annoying thing is that these mirrored Western Digital 250GB drives had only 1.5 years on them, and were less than 10% full. They were brand new when I purchased and installed them specifically for CA. They have a 3 year warranty. They’ve been in a temperature controlled and dust controlled environment at the CoLo. For one to totally fail now is quite the surprise. I wasn’t all that worried about regular backups due to the RAID mirroring; now the RAID fails with the drive.

I was able to rebuild the RAID, but it appears that the boot sector is corrupted. This will require booting from a CD-ROM, mounting the drive, fixing the file system, and making copies that way.

Best laid plans….

I anticipate it will be Monday evening before CA is back up and running.

tallbloke
February 22, 2009 3:07 am

Anthony, if the level 1 RAID system is one of those dedicated IDE cards which pin on to the motherboard’s IDE headers and does the mirroring – beware. I had the same disc corruption problems running SuSE on an X86 publicly accessible webserver 8 years ago. Surely there is a skeptical ISP CEO out there who would like to donate some extra rack space for a dedicated external RAID box mirroring WUWT and CA?
Keep hitting the tip jar people, we need Anthony and Steve to keep these services running for us.

pkatt
February 22, 2009 3:28 am

Oh for shame! You didn’t back up. One of the most common RAID deaths is catastrophic hard drive failure. It happens a lot more than you would think, and usually it’s the stinkin’ drive 0. It is not difficult to set up a daily backup to a different machine and have it save each day separately, so that nothing is overwritten for a week. That way, if your data is pooched on the latest backup, you can step back up to 7 days to get ahead of the corruption and at least rescue the majority of your data. But also check with WD; they may have a trick or two on their site.
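A minimal sketch of the "save days separately" idea, assuming the destination directory is a mount from a different machine (an NFS share or NAS, for example). The function name `backup_by_weekday` and all paths are hypothetical and do not describe how CA is actually backed up. Each archive is named after the ISO weekday number, so a given day's file is only overwritten one week later.

```shell
#!/bin/sh
# backup_by_weekday SRC DEST   (hypothetical helper, not from the original post)
# Archives SRC into DEST/backup-<N>.tar.gz, where N is the ISO weekday (1-7).
# Use absolute paths; DEST is assumed to live on a different machine.
backup_by_weekday() {
    src=$1
    dest=$2
    day=$(date +%u)                      # 1 = Monday ... 7 = Sunday
    archive="$dest/backup-$day.tar.gz"
    mkdir -p "$dest" || return 1
    # -C makes the stored paths relative to SRC
    tar -czf "$archive" -C "$src" . || return 1
    echo "$archive"                      # print the archive path for logging
}
```

Run once a day (from cron, say), this keeps seven independent copies, so corruption that lands in the newest archive does not touch the older six.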

February 22, 2009 3:34 am

Beware hijacking of this thread by the Evolution / Creation shindig.
But it would be nice to run something along these lines under its own thread (with extremely strict moderating like Breathe Deeply Ten Times First, Ad Homs will be Shot, Fundies Go Home) because I think there too, open and creative debate, whereby each side can usefully learn from the other side, has been hijacked by fundamentalists, and it would be nice to see both fundamentalisms thoughtfully deconstructed.
Now please, back to thread topic. CA and WUWT are too precious. Can we distribute the load, backup or mirror material? How curious, just as we find NSIDC problems, it hits here too. I found this very interesting blog duplicating a couple of WUWT posts a couple of days ago. It set me thinking about possibilities. I try to pull my weight (click on my name, go to our forum) but I’m not a heavyweight.

thefordprefect
February 22, 2009 3:37 am

Jordan (16:46:22) :
Negative feedback only stabilises according to a classification which seems to be unique to Earth Sciences.
According to engineering and mathematics (disciplines that have a record of making things that work in practice), feedback can be stable or unstable, whether positive or negative.

Negative feedback that causes instability has turned into positive feedback by phase change. It is not negative feedback any more in my simple engineer’s brain!
Let’s try not to be too harsh on the journalists for being confused by the terminology.

Look, he’s the one to complain and mock the BBC for their use of negative instead of positive feedback. All he has done is inverted the error and therefore deserves all the same mocking!
Mike

mercurior
February 22, 2009 4:06 am

A true scientist must question. They have to or they wouldn’t be a scientist.
Questioning something can be a good thing. It forces people to find the facts that prove their hypothesis.
Science questions; once there is an apparent answer, it stops being science and becomes dogma (a code of beliefs accepted as authoritative, which is what they are saying AGW is). So the people running the fear of AGW are no longer scientists. They never question, whether this is due to fear of loss of job or blindness. These people are no longer scientists in the truest sense of the term.

MS in Illinois
February 22, 2009 4:30 am

Anthony,
We had some drive failures a couple of years ago. At the time, we had a mix of manufacturers, but one particular manufacturer’s drives were failing much more often than others. Did some research, and found others had similar experiences. I will not mention who it is, but we now avoid this particular manufacturer and use Seagate drives if possible.

EW
Reply to  MS in Illinois
February 23, 2009 5:06 am

Hmmm…after having various HDD’s as well, a Seagate one was the first that crashed after only 2.5 years of standard office use……

MrPete
February 22, 2009 4:53 am

Brendan H, without getting far from climate, which is the purpose of WUWT:
It’s very appropriate to discuss the prevalence of “true believers” in science.
To discuss this in evolutionary terms, we need to be a bit more precise. First, let’s leave speciation (micro-evolution) aside. That’s not a question of origins of life. And that involves simple (?!) observation.
Now, with respect to evolution of life — from non-life to life, the question of origins. Such “general” evolution is an extension of the speciation model well beyond our ability to calibrate. Very similar to what is being done with GCM’s.
Examining the record of scientific publication, you will find in recent years a very similar trend to what we are seeing in climate science. As various theories are raised, and disproven by the facts, the “true believers” have become ever-more strident in asserting that facts don’t matter, and in proposing ever more fanciful theories to support their deeply held convictions about origins.***
That sounds like people acting on a set of beliefs to me. It looks very much the same in certain arenas of climate science as well.
This is a common theme in science. The existing “consensus” has a hard time switching from what they have long believed, to a new understanding. I’m old enough to remember the switch to continental drift — particularly because one of my best friends in high school in the 1970’s had an uncle who got a lot of grief as he worked to publish evidence for plate tectonics theory.
It’s interesting how often the crackpots of yesteryear eventually are discovered to have discovered something significant.
=========
*** So far, they’ve found there were no “billlllions of years” (~50m max last I checked), there was no primordial “soup”, there are problems everywhere you look. One of the emerging “mainstream” hypotheses is a punt: life originated so quickly here, it could not be evolutionary… so it must have been brought here by intelligence from outer space. [And how did it get _there_?] If you would like to read up on a surprising, truly scientific set of falsifiable hypotheses regarding naturalistic vs supernaturalistic origins, check out http://www.amazon.com/Origins-Life-Biblical-Evolutionary-Models/dp/1576833445. I was surprised — didn’t think it was possible. Extensive mainstream journal references. These guys are not crackpots by any stretch.

Knut
February 22, 2009 5:35 am

I saw that the reference numbers of the comments changed and emailed Steve about that, suspecting database corruption. Steve replied that due to deletions of posts, the comments were renumbered. This is not practical, as people often refer to each other’s comments. That this software functions like that stands out to me as suspect.
If this is the way it really works, I believe snipping is a much better way to go.

TCO
February 22, 2009 5:53 am

I previously stated this fault of blogs versus archived literature.

DaveE
February 22, 2009 6:26 am

I’m with E.M.Smith & others.
RAID 5 is the way to go but obviously there may be problems in a 1U box.
DaveE.

harbinger
February 22, 2009 6:31 am

Mark ref Booker: Booker is right that white asbestos as used in asbestos cement products is not the dangerous version, other than the usual necessary precautions if machining it and producing dust, but that applies to dust in general. It is fibrous blue asbestos used in pipe lagging that is the danger but lawyers see dollar signs when they see the word asbestos. I wouldn’t use it instead of talc though!
There is absolutely no proof that BSE in cattle ever produced vCJD in humans and the numbers affected are so small that it could always have been there but is now monitored and recorded.
There is no proof that second hand smoke causes cancer in the general population. Individual smoking of tobacco does not directly cause cancer, it increases the risks. If it were cause and effect, everyone who smoked would get cancer and that is patently not the case.

Peter
February 22, 2009 7:14 am

harbinger:

There is no proof that second hand smoke causes cancer in the general population

In fact, there is ample evidence that second-hand smoke does not cause cancer in the general population.
The type of lung cancer prevalent among smokers (squamous and oat-cell carcinomas) exhibits a very strong dose-response to the amount of tobacco smoked, whereas the type of lung cancer prevalent among non-smokers (adenocarcinomas) exhibits a zero dose-response.
This was first established in the 1950’s by Sir Austin Bradford-Hill, whose research first established the link between smoking and cancer.

Fred Harwood
February 22, 2009 7:36 am

Thanks, Anthony. Glad to hear that it was just hardware.

George Patch
February 22, 2009 7:41 am

This is the one problem with replication… what if you replicate corruption or a mistake?
I think you had the right solution given the situation, just a bit unlucky.

Editor
February 22, 2009 7:57 am

Andrew (19:38:50) :

So much for Linux Reliability. (What criticism of Linux is not allowed here?)

Linux is software. Hardware is hardware. The RAID level for file systems, clustering issues, and other redundancies are an economic decision. Amazon has different requirements and resources than do free blogs.
If ClimateAudit were a megabuck a day business, they might have a VMS cluster and support staff. Given that it’s volunteer time and tip jar supported, an imperfect low cost solution that doesn’t offer five 9s reliability makes sense.
To answer your question, I don’t know of any criticism of Linux that isn’t allowed here, at least as long as it’s on topic. I use Linux at home and work, but there are good reasons to recommend preinstalled systems like Windows and MacOS for non-computer savvy use. I have coworkers with 30 years of computer experience who use both (and multiple Linuxes, multiple Unixes, etc.)
I do like to have hardware close to me – the sound of a failing disk bearing is more important than most preventative maintenance. We often call that provocative maintenance anyway.
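For a sense of scale on the "five 9s" figure mentioned above, the allowed downtime can be worked out directly. This is only an illustrative sketch: the function name is made up, and it assumes a 365-day year.

```shell
#!/bin/sh
# downtime_minutes_per_year AVAILABILITY_PERCENT   (hypothetical helper)
# Prints the downtime budget in minutes per 365-day year.
downtime_minutes_per_year() {
    LC_ALL=C awk -v a="$1" 'BEGIN { printf "%.2f\n", (100 - a) / 100 * 365 * 24 * 60 }'
}

downtime_minutes_per_year 99.999   # five nines  -> 5.26
downtime_minutes_per_year 99.9     # three nines -> 525.60
```

Five nines leaves a little over five minutes a year, which a 90-mile drive to the CoLo alone would exceed many times over.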

husten
February 22, 2009 8:09 am

Raid is o.k. but better to keep a frequent automatic backup. Linux has got a utility called rsync. You can use it for keeping a number of daily (or hourly) backups without using extra diskspace for identical files. (It uses Unix/Linux’s hardlinks).
Suggest you set up a twice-daily rsync to a server in a different physical location (the HD does not need to be much larger than your data). I am using this to back up home folders to a small cheap NAS box.
example, rotation through 9 backups:
# work inside the backup share
cd /mount/nas
# drop the oldest snapshot, then shift the rest up by one
rm -rf backup.9
mv backup.8 backup.9
mv backup.7 backup.8
mv backup.6 backup.7
mv backup.5 backup.6
mv backup.4 backup.5
mv backup.3 backup.4
mv backup.2 backup.3
mv backup.1 backup.2
mv backup.0 backup.1
# new snapshot; unchanged files become hardlinks into backup.1
rsync -a --delete --progress --exclude-from=/home/xxxxxx/exlude_list.txt --link-dest=../backup.1 /home/ backup.0/
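The rotation above can also be written as a loop, so the number of snapshots kept is a single parameter. This is only a sketch: `rotate_backups` is a made-up name, and the snapshot directories are assumed to follow the same backup.N naming as the example.

```shell
#!/bin/sh
# rotate_backups ROOT KEEP   (hypothetical helper)
# Removes ROOT/backup.KEEP, then renames backup.(N-1) to backup.N for
# N = KEEP down to 1, leaving the backup.0 name free for a new snapshot.
rotate_backups() {
    root=$1
    keep=$2
    rm -rf "$root/backup.$keep"
    i=$keep
    while [ "$i" -gt 0 ]; do
        prev=$((i - 1))
        # missing snapshots are skipped, so this also works on a first run
        if [ -d "$root/backup.$prev" ]; then
            mv "$root/backup.$prev" "$root/backup.$i"
        fi
        i=$prev
    done
}

# Example use, followed by the same rsync call as in the comment:
#   rotate_backups /mount/nas 9 &&
#   rsync -a --delete --link-dest=../backup.1 /home/ /mount/nas/backup.0/
```

Working from the oldest snapshot down means each target name has already been vacated before anything is moved into it.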

Mark
February 22, 2009 8:45 am

All,
I appreciate the discussion involving my original post about Booker and his credentials, i.e. asbestos and dust, second hand smoke not causing cancer, etc. This is why I love this site. I learn quite a bit about much that I know nothing about. Thanks!!

Pamela Gray
February 22, 2009 8:46 am

However, Candace Pert later discovered that white blood cells gone crazy (via mutations) are the source for smoker-related lung cancers, even in those exposed to second hand smoke. The body’s natural tendency to send white cells (and produce more of them) to the lungs as they are assaulted with smoke and other foreign matter means that there is a greater chance of these cells mutating and becoming cancerous. That also explains why lung cancer is the number one metastatic cancer. White blood cells, be they cancerous or not, live in the bloodstream. That is their home. And they roam the body through this liquid highway. It is likely that before a lung spot is found, cancerous cells, or cells becoming cancerous, have already spread. Most such cells likely die as single cells. But some stop along the way and grow where they are planted. The other thing about white cells is that they can mutate into more than one type of cancer.

AnonyMoose
February 22, 2009 9:03 am

TCO (05:53:37) :
I previously stated this fault of blogs versus archived literature.

If blogs are remotely archived, such as if they’re not blocking archive.org, then they are archived. Shall we quibble about the meaning of “literature”?

WA
February 22, 2009 10:02 am

Checking at archive.org (wayback machine):
Most recent entry is dated February 8, 2008
“Material typically becomes available here 6 months (FAQ: ‘or more’) after collection”

Not sure
February 22, 2009 10:18 am

I just started using smartmontools on my server on a friend’s recommendation. He says it gave him early warning of impending doom with some of his hard drives. Interestingly, his drives were still under warranty too. I wonder if they were Western Digitals?
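For anyone who has not used smartmontools, a minimal sketch of a health check follows. It relies on the `smartctl -H` command from that package. The helper name `smart_health_ok` and the device /dev/sda are assumptions, and the check only looks at the overall health line, not at individual attributes.

```shell
#!/bin/sh
# smart_health_ok   (hypothetical helper)
# Reads `smartctl -H` output on stdin and succeeds only if the drive
# reports a passing overall health status (ATA or SCSI wording).
smart_health_ok() {
    grep -Eq 'test result: PASSED|SMART Health Status: OK'
}

# Example use (needs root and smartmontools; /dev/sda is an assumed device):
#   smartctl -H /dev/sda | smart_health_ok ||
#       echo "WARNING: /dev/sda reports a SMART problem"
```

A passing result is no guarantee; drives can fail without any SMART warning, so this complements backups rather than replacing them.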

John G. Bell
February 22, 2009 12:22 pm

Don’t you have a file named backup_mbr in your /boot subdirectory?
/boot> ls -la backup_mbr*
-rw------- 1 root root 512 2008-01-15 15:44 backup_mbr
If so the command to restore the mbr as root if your boot device is /dev/sda is:
dd if=backup_mbr bs=446 count=1 of=/dev/sda
I have also seen:
dd if=backup_mbr bs=512 count=1 of=/dev/sda
The latter includes the partition table information and I would try the former first as it is more conservative. I could not find this in any of my books and haven’t fooled with this for years so be careful. Good luck!
REPLY: I’m sure it does…but Linux is my second language, and for such a delicate repair I’m waiting for my Linux Guru in Chief to come in to the office Monday morning since I don’t want to make any mistakes – Anthony
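For reference, a sketch of how such an MBR backup can be made and then partially restored with dd. It assumes the classic MBR layout: 446 bytes of boot code, a 64-byte partition table, and a 2-byte signature in the first 512-byte sector. The function names are made up, and it is safest to try this on a disk image file before pointing it at a real device such as /dev/sda.

```shell
#!/bin/sh
# save_mbr DEVICE OUTFILE   (hypothetical helper)
# Copies the first 512-byte sector (boot code + partition table + signature).
save_mbr() {
    dd if="$1" of="$2" bs=512 count=1 2>/dev/null
}

# restore_boot_code BACKUP DEVICE   (hypothetical helper)
# Writes back only the first 446 bytes, leaving the partition table alone.
# conv=notrunc keeps dd from truncating the target when it is an image file.
restore_boot_code() {
    dd if="$1" of="$2" bs=446 count=1 conv=notrunc 2>/dev/null
}

# Example use as root (device name is an assumption):
#   save_mbr /dev/sda /boot/backup_mbr
#   restore_boot_code /boot/backup_mbr /dev/sda
```

Restoring only the boot code is the more conservative choice, because a stale partition table written over a current one can make the partitions unreachable.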

Scott Finegan
February 22, 2009 1:41 pm

From experience, Raid 5 is not a panacea.
Drives fail.
Controllers fail and take out the drive(s).
Dedicated Raid Controllers become obsolete, if your system lives long enough. Try to locate a replacement on Saturday evening… or at all.
Backups fail because everyone assumes they are good. Test Test Test!
Backups are real good at saving previously corrupted data.
Corrupt data/programs may cause hardware failure. —> Loop
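One way to act on "Test Test Test" is to restore a backup into a scratch directory and compare it with the source. The sketch below does that with POSIX `cksum`. The function names are hypothetical, and it compares file contents and paths only, not ownership or permissions.

```shell
#!/bin/sh
# tree_checksums DIR   (hypothetical helper)
# Prints "checksum size ./path" for every regular file under DIR, sorted by path.
tree_checksums() {
    (cd "$1" && find . -type f -exec cksum {} + | sort -k 3)
}

# verify_backup SOURCE RESTORED   (hypothetical helper)
# Succeeds only if both trees hold the same files with the same contents.
# Returns 2 if either argument is not a directory.
verify_backup() {
    [ -d "$1" ] && [ -d "$2" ] || return 2
    [ "$(tree_checksums "$1")" = "$(tree_checksums "$2")" ]
}

# Example use (paths are assumptions):
#   verify_backup /home /mnt/restore-test/home || echo "backup does not match"
```

This only shows that the backup matches the source as it is now; it cannot tell whether the source itself was already corrupt when it was copied, which is why keeping several generations still matters.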

Nigel Sherratt
February 22, 2009 1:59 pm

A small typo explains a great deal about George Monibot.

KlausB
February 22, 2009 2:16 pm

Anthony,
I agree with several posts suggesting RAID 5, at least.
Sure, I’ve lived about 15 years with a Tandem NonStop CLX-R; its drives had RAID 1.
Never had problems with that ole gadget.
In the last six years, administering quite a bunch of application servers – basically like the one which spoiled your weekend – I’ve had some hard days/nights with that stuff.
Sometimes they are really freaky. Maybe it’s related to some models or some types of drives, i.e. 36.4 GB drives in some models did drive me mad, while 18.2 GB drives were never a big issue. I have an old Compaq ProLiant server as a home server in the cellar. It has been running for 8 years now with six 18.2 GB drives.
Never had problems with that, no exchange of drives.
The 36.4 and 78.x GB drives are another story. At my work, on 2 SANs equipped with them and a total capacity of 1.6 TB, about 2 times per month we have a drive which is bad or is starting to go bad.
RAID 1 is only single fault tolerant. RAID 5 is a small step better.
For applications with high reliability, I suggest two fully mirrored servers.
I feel sorry for you, and too, for SaintMc.
Hope the re-build goes smoothly.
Klaus