While we are on the subject of hardware failure (such as the one that has hit the DMSP satellite that NSIDC and Cryosphere Today use), Climate Audit is down due to a file system or hard drive error. It happens. I’m on my way to the CoLo (90 miles away) to effect a repair. Comments may be delayed for a few hours if other moderators aren’t online.
The Climate Audit server is in fact RAIDed; I built it that way for just such an emergency. But some corrupted data was written before one disk of the array failed. Since I could not stay at the CoLo all day, I’ve brought the CA server to my office for repairs. Hopefully the RAID rebuild goes smoothly (it takes several hours) and I’ll be able to repair the problem areas. Both hard drives were new, RAID-quality units with 3-year warranties. One failed 1.5 years into the warranty; that’s Murphy for ya.
Wish me luck; otherwise I have to rebuild from scratch and restore from backups, which is also a chore.
Just for those who like to know about hardware, here is what Climate Audit runs on:
3.4 GHz Intel Pentium D CPU
2 GB ECC DDR2 400 RAM
RAID 1: dual Western Digital 250GB SATA II drives with 16MB cache
Running Linux with WordPress in LAMP config
1U Intel server enclosure
Thanks to those who hit the tip jar.
One hard drive of the RAID failed. Now, before you panic, let me say I anticipated this (though more like 2 years from now), and this was a RAIDed system with two drives set up to mirror each other. Normally when one drive fails, I can unplug the failed drive and reboot, and the system will come up and run on the remaining one. Then I can install a new second drive and rebuild the RAID, and off we go.
I’ve done that dozens of times in my own systems. It is why I built the CA server the way I did. It is identical to 15 other servers I’m running here.
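For readers who run similar mirrored setups, here is a minimal sketch of how a degraded mirror can be spotted early. It assumes Linux software RAID (md) and the usual `/proc/mdstat` text layout; the post does not say whether the CA server uses software or hardware RAID, so the sample text and the `degraded_arrays` helper are purely illustrative.

```python
import re

def degraded_arrays(mdstat_text):
    """Return the names of md arrays whose status line shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The status line looks like "... [2/1] [U_]"; "_" marks a missing member.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            if "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded

# Made-up sample in the /proc/mdstat layout (assumption: software RAID 1).
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1](F)
      244195904 blocks [2/1] [U_]

md1 : active raid1 sda2[0] sdb2[1]
      1048512 blocks [2/2] [UU]

unused devices: <none>
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

A check like this only reports a missing member. It would not catch the case described here, where bad data reaches the surviving drive before the failing one drops out, because a mirror faithfully copies whatever it is told to write.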
But for some reason known only to Murphy, this time when the system failed sometime last night, it appears it wrote corrupted data to the “good” drive before the full hardware failure. So at the moment the system is unbootable.
The good news is that most everything should be recoverable, but it takes time. If I can’t repair the boot sector on the good drive, then we have to rebuild two new drives from scratch, mount the one good drive, and pull files over. I don’t know just yet how much corruption there is or how much of it can be fixed.
The annoying thing is that these mirrored Western Digital 250GB drives had only 1.5 years on them and were less than 10% full. They were brand new when I purchased and installed them specifically for CA. They have a 3-year warranty. They’ve been in a temperature-controlled and dust-controlled environment at the CoLo. For one to totally fail now is quite the surprise. I wasn’t all that worried about regular backups because of the RAID mirroring, and now the RAID has failed along with the drive.
I was able to rebuild the RAID, but it appears that the boot sector is corrupted. Fixing it will require booting from a CD-ROM, mounting the drive, repairing the file system, and making copies that way.
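The "mount the good drive and pull files over" step can be sketched roughly as below. This is an illustration under assumptions, not the procedure actually used on the CA server: the `salvage_tree` and `sha256_of` names are hypothetical, and it assumes the surviving drive is already mounted read-only somewhere. It copies every file it can read, re-reads both copies to compare checksums, and records the files that fail instead of stopping.

```python
import hashlib
import os
import shutil

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def salvage_tree(source_root, dest_root):
    """Copy every readable file from source_root to dest_root.

    Returns (copied, failed): copied is a list of relative paths,
    failed is a list of (relative path, error message) pairs.
    """
    copied, failed = [], []
    for folder, _subdirs, files in os.walk(source_root):
        for name in files:
            src = os.path.join(folder, name)
            rel = os.path.relpath(src, source_root)
            dst = os.path.join(dest_root, rel)
            try:
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.copy2(src, dst)  # copies data plus timestamps
                if sha256_of(src) != sha256_of(dst):
                    raise OSError("checksum mismatch after copy")
                copied.append(rel)
            except OSError as error:
                # Unreadable or damaged files are logged, not fatal.
                failed.append((rel, str(error)))
    return copied, failed
```

A call such as `salvage_tree("/mnt/good_drive", "/mnt/new_array")` (both paths hypothetical) would return the two lists for review. A matching checksum only shows the copy is faithful to what is on the old drive; it cannot tell whether that data was already corrupted before the copy.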
Best laid plans….
I anticipate it will be Monday evening before CA is back up and running.