Is there a Linux specialist in the house?

In the attached picture you’ll be able to see what’s happening to one of my most important servers containing some irreplaceable climate data. I’m at a loss and don’t understand what’s going on because Linux is not my specialty and my Linux expert has since moved on to other ventures. It’s very important I get this server working again, so I’m asking the WUWT community for help.

The server had been offline for a few weeks, and was properly shut down. Upon powering it back up, I got a “no bootable disk found” message. I determined the RAID (Hardware RAID1 – 2 mirrored drives) had been degraded, and it seemed one disk had failed. So I purchased two new identical HDs, cloned the good one, and rebuilt the RAID1. The RAID is administered by an on-board Adaptec RAID controller, and it reports the RAID as healthy.

What happens now is that it attempts to boot, but gets stuck in a loop on the last messages “Init ID c1, c2…etc” and repeats those error messages. I get the same partial boot and error sequence if I take out the RAID in BIOS, and try booting a single drive in straight SATA mode.

This machine was built circa 2007, and has Slackware Linux of that era installed. I don’t see a version number coming up on boot, so I can’t provide it.

Any and all help appreciated. – Anthony

 

brians356
March 22, 2017 10:09 am

This incident points up the importance of keeping an independent offline or nearline backup of “irreplaceable” data. LTO tape is one popular and inexpensive approach.

Don K
Reply to  brians356
March 22, 2017 10:25 am

It’s been a decade, maybe two since I had to worry about it, and the world was simpler back then. But if the issue is just inittab (why would inittab and only inittab be bad?) isn’t there such a thing as single user mode? I vaguely think it might be usable to edit inittab.

I’d still make copies of the disk(s) before I did anything else.

Reply to  Anthony Watts
March 22, 2017 6:05 pm

Your server shut down normally, was offline for several weeks, and crapped on reboot.
My first thought is that your onboard cmos battery died and the bios lost the boot settings.
If that is the case, your drives and data are fine. Replace the battery on the motherboard, reboot and press “delete” key repeatedly until the bios menu comes up. Depending on motherboard, might be F8 or another key to enter bios setup mode.

Read the Adaptec pages here to find out the bios settings

http://www.supermicro.com/manuals/other/RAID_SATA_Adaptec_ESB_ICH7R_ICH9R.pdf

Yes, it could be a drive failure. But the circumstances you describe point more towards a bios setting issue.

When time permits, please consider a ZFS file system on a FreeBSD or Linux base system, with a Supermicro motherboard and ECC memory. It is much more fault tolerant and more reliable. Future stuff, for sure.
See: https://www.howtogeek.com/175159/an-introduction-to-the-z-file-system-zfs-for-linux/

CEH
March 22, 2017 10:16 am

Anthony, check this link out, it seems to address the problem that is shown on line four from the bottom, “/dev/console……..”

http://www.linuxquestions.org/questions/general-10/open-dev-console-input-output-error-383629/

gregfreemyer
March 22, 2017 10:16 am

I’m a data recovery expert. (See me on LinkedIn). Tons of UNIX/Linux usage over the last 30+ years.

Step one, take a few deep breaths.

The odds are very high that 100% of your data is recoverable. That is until you destroy it in the process of trying to recover it.

Just go slow and work with copies of the drive(s), not the originals.

If you want more advice, connect up to me on LinkedIn and send me a message.

Reply to  gregfreemyer
March 22, 2017 11:07 am

+1

I’m not anywhere near as technical, but I’ve been selling backup and disaster recovery systems to corporations for many years. Greg’s advice (and that of others) is spot on. Go after the data first. You don’t really care if you EVER fix the boot issue if you have the data and can load it on a new O/S image. Go after the data from the cloned drive because destroying data by trying to save it happens way more often than one would think (see Greg’s remark above and read it over three or four times while breathing deeply). This leaves the original intact if you accidentally blow it away on the clone.

Now for Step 2 which Greg didn’t address. The moment you have the data recovered, do not, I repeat, DO NOT go merrily on your way getting the old server running or even a new one. Repeat, DO NOT. Instead, make a G** D*** copy of the F***ing data on something that can go off site. Cloud, disk drive, tape cartridge, pick one. Then do the same for ALL YOUR OTHER COMPUTERS. Take the copy home in the trunk of your car and store it in the closet if need be. If your business location has a fire or a flood or theft, no amount of technical expertise will help you. Copy the data, get it off site. Backup, backup often, off site. First rule of preparing for disaster recovery.

Reply to  Anthony Watts
March 23, 2017 1:53 am

Well that’s a relief 🙂

whiten
March 22, 2017 10:26 am

For what it could be worth.

I do not know much about servers, but I have my own computers… and I had to give up on an old one a few weeks ago and go for a new one due to a lot of virus attacks.

I could very easily have recovered my old hardware, but still thought at that stage it was not worth it, and as things stand I do not regret it at this point…

My old hardware, regardless of my effort, in the end suffered due to a simple but very effective virus that always managed to cripple the functioning of the CD drive and the USB drives and the external control from mouse and keyboard…
Regardless of the antivirus software installed, even the specific one against such a virus, I still had to do a lot to keep the hardware responding to the mouse and the keyboard and under “exceptional control”… and at the same time be happy with partial usage of the CD drive and the USB ports…
The virus infecting that old hard drive came from an infected USB stick… in a moment when I got lazy and careless.

Strangely enough, a lot of virus attacks are happening with my new hardware, including the one that crippled my old computer,
even though I am not following the same usage pattern with my new hardware and have not yet inserted a USB stick into it…

My new computer is warning me to install and use the same antivirus software that I had previously on my old hardware… strange indeed… but the point I am trying to make is that sometimes viruses or malware can be very effective at crippling the basic functions of access to the computer or the server without actually damaging the data or the hardware itself…
At the same time, a rush to fix it may result in greater damage to the hardware and the data on it.

Not rushing and taking the necessary time needed is essential in these cases I think…

cheers

Jim Hodgen
March 22, 2017 10:33 am

To stay on topic – I hope – for a longer term fix… yes CentOS is a good production system, but what you might want to consider is getting a hardware controller appliance. It will cost between $350 and $700 (there are more expensive ones, but those should work; maybe better prices if you can get to Fry’s or go online) and you can have the appliance do a lot of the error checking and preventive management.

This allows other systems to access the drives through a network, some allow hot swaps of the individual drives, and you get error reporting and an operating system that is entirely separate from the drives themselves. This is a fixed-cost model that has higher upfront but lower ongoing costs than a cloud storage option that will keep sending a bill for as long as you want to keep the data.

Editor
Reply to  Ralph Dave Westfall
March 22, 2017 11:02 am

I never buy those – I’m not a dummy.

Reply to  Ric Werme
March 22, 2017 11:09 am

Since you are not a dummy, you would not run into the problem faced by our host.

Editor
Reply to  Ric Werme
March 22, 2017 11:27 am

Not at all, not at all. I’ve been making money off computers since 1969. They haven’t exhausted all the ways they can bamboozle me yet.

Reply to  Ric Werme
March 23, 2017 8:31 am

I’m willing to bet that you have several O’Reilly books though 🙂

March 22, 2017 10:49 am

The easiest way to clone the drive with dd is to download and burn a copy of Clonezilla.
Work from a copy of your good disk, never from the original.

Ian Macdonald
March 22, 2017 11:12 am

I would suggest downloading a copy of Knoppix on another computer, burning a CD or DVD and booting the affected computer from that. There is a good chance you might be able to read and recover the data that way. Knoppix mounts partitions read-only by default, which is the best approach when dealing with what might be a corrupted and fragile partition.

If it proves necessary to run data recovery tools then always do that on a copy of the original. That way, if it makes matters worse, you still have the original.
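
Before touching anything on a rescue system, it can be worth confirming that the partition really is mounted read-only. A minimal sketch in Python, assuming the standard /proc/mounts layout (device, mount point, type, options); the device names and mount points in the sample are made up for illustration:

```python
def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point, or None if it is not mounted."""
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            found = fields[3].split(",")  # a later entry overrides an earlier one
    return found


def is_read_only(mounts_text, mount_point):
    """True only if mount_point is mounted and carries the 'ro' option."""
    options = mount_options(mounts_text, mount_point)
    return options is not None and "ro" in options


# Illustrative sample only; these devices and paths are hypothetical.
SAMPLE_MOUNTS = """\
/dev/sda1 /mnt/rescue ext2 ro,relatime 0 0
/dev/sdb1 /mnt/backup ext3 rw,relatime 0 0
"""
```

On a live system the text would come from `open("/proc/mounts").read()`; `is_read_only(text, "/mnt/rescue")` then answers the question for one mount point.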

http://knoppix.net

Incidentally at the boot prompt I usually type ‘knoppix no3d’ to stop the compiz effects from loading. Might be a matter of opinion but I just find them a nuisance.

Charles Curley
March 22, 2017 11:12 am

There are several pieces of good advice in earlier comments.

* I’d be talking with gregfreemyer right about now. His advice is good, and if he will work with you, great.

* Michael 2’s advice to copy off disk images using dd is excellent. Do that first. Make damned sure you get the if= and of= parameters correct. 🙂
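
Once an image has been copied off with dd, one way to gain confidence in it is to compare checksums of the source and the copy. The following is a rough Python sketch, not part of anyone's procedure in this thread; the function names are hypothetical, and reading a raw device such as /dev/sda this way would normally need root:

```python
import hashlib

CHUNK = 1024 * 1024  # read 1 MiB at a time so a large disk never has to fit in memory


def sha256_of(path):
    """Return the SHA-256 hex digest of a file or block device, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(CHUNK), b""):
            digest.update(block)
    return digest.hexdigest()


def image_matches(source_path, image_path):
    """True if the copy is byte-for-byte identical to the source."""
    return sha256_of(source_path) == sha256_of(image_path)
```

A mismatch does not say which side is wrong, only that the copy should not be trusted yet.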

A few other thoughts…

Take your time.

You may be able to boot to a modern data recovery CD. I use finnix (https://www.finnix.org/), but gregfreemyer may have a better recommendation.

It looks like your system uses lvm version 1.0.8 (2003). I have 2.02. After 14 years, it is possible that modern lvm no longer supports your disks. Building a boot CD from your slackware distribution might be the way to go.

You might also look at rdd to see if you can recover the failed drive. That will depend on what exactly failed in the drive.

There are user communities for all the software you need; feel free to ask for help.

Once the crisis is over….

A more recent distribution of Linux may be in order, as smalliot suggested. Centos is good. I use debian stable. Update things a bit more often than every 14 years.

Start doing backups. I use amanda, others can recommend other products. amanda is an industrial grade backup system which can be a bear to configure. Once you get it set up it runs forever and degrades gracefully. I back up to virtual tapes (files on disk) and copy to external USB drives for off-site backup.

Good luck!

Ian Macdonald
March 22, 2017 11:18 am

BTW, be very careful with the dd command, if you get the parameters wrong you can nuke the data.

Editor
March 22, 2017 11:25 am

I haven’t had to delve into a problem like this on Linux, but my take on things is that the RAID array is fine, but the kernel is confused about where it can have you login.

Do you normally login from a “login: ” on a mostly bare screen like the boot display or does the system start a graphics/window server?

I think the best comments are from Bear at https://wattsupwiththat.com/2017/03/22/is-there-a-linux-specialist-in-the-house/comment-page-1/#comment-2457854 and CEH at https://wattsupwiththat.com/2017/03/22/is-there-a-linux-specialist-in-the-house/comment-page-1/#comment-2457919

Beyond that, I’m not sure where I’d start. Is there a chance that you really do normally login on to the beast from some terminal line (e.g. for remote access) and that the cable is loose or fell out while you were working with the replacement disk drives?

Editor
Reply to  Anthony Watts
March 22, 2017 12:21 pm

Sigh. Then the console stuff could be a red herring, and you may have been getting those messages ever since you first set things up.

I haven’t maintained RAID anything, but I don’t see anything alarming about the boot time messages about it.

So it could be from some boot time program that hasn’t printed anything yet. I might have used Slackware, briefly, 20 years ago, so I won’t hazard another guess now.

Windows and Linux are both a pain in the butt to administer. It’s just that they are very different pains….

The simplest thing at this point, I guess, is the rescue CD approach. I used Knoppix a decade or so ago and thought highly of it then, especially for looking at virus infected windows systems. I’d expect new distribution (umm, DVD) installation disks would be able to make sense of the file system. If you can boot from a USB stick, that’s probably the way to go, though I haven’t done that myself.

Not Sure
Reply to  Anthony Watts
March 22, 2017 3:55 pm

The agetty entries in question are not for physical terminals connected with physical serial cables. They’re for virtual terminals that you can access with alt-f1, alt-f2, etc. If your agetty binary is missing or damaged you won’t get a login: prompt at the console because that’s just virtual terminal number 1.

https://luv.asn.au/overheads/virtualconsoles.html

R Fujii
March 22, 2017 11:27 am

A quick google says that those messages are coming from the alternate consoles, so the good news is that it doesn’t look like there is anything wrong with the disks. Are you sure everything is /exactly/ the same as when it was powered off (minus the drive)? Since this is a ~2007 machine, is a PS/2 keyboard attached? I see some talk that this might happen if your network cable isn’t attached (just the messenger here). As has been suggested, if you just want to recover the data, it’s probably easiest to boot from a live image DVD and copy the data to another HD. From the looks of things, you’ll probably have to do this anyway and edit /etc/inittab to get it to stop spawning those processes so it will get further in the boot sequence. Be happy to help if you don’t mind remote help…

March 22, 2017 11:31 am

I can write in more detail later. Interesting you say the mirroring is done on HW raid because the boot messages clearly indicate software raid with the “md” (multiple devices) subsystem. The boot messages identify two SCSI disks, which in my experience it doesn’t do if hardware RAID were happening at the adapter level.

You will need to boot a recovery system from CD/DVD/USB depending on your hardware (which looks kind of old, so I suspect we’re talking CD here). Try here: http://www.system-rescue-cd.org/

The good news is the recovery system does not need to be slackware, as long as it supports the software raid and LVM. This will get you a working system and hopefully able to access your disk and its filesystems. You then at least have access to the data and can copy it off somewhere safe while you work on fixing the problem.

I can write more later.

Reply to  Alan Watt, Climate Denialist Level 7
March 22, 2017 11:42 am

The boot messages identify two SCSI disks, which in my experience it doesn’t do if hardware RAID were happening at the adapter level.

I in general agree. But I have run into (twice in my life) systems where h/w raid was enabled, the volume was logically partitioned, and then s/w raid run across the partitions. Long time ago, don’t remember all the details other than tech guys running around with hair on fire screaming “WTF why would anyone do that?” when the system went south.

Reply to  davidmhoffer
March 22, 2017 8:21 pm

I’ve administered Linux since the 1.2.3 kernel. Started an ISP in 1996 on Slackware, then moved to Red Hat. This fake RAID card is probably the problem. The hardware works, but…
To get anywhere, boot in single-user mode to do your checks.
cat /proc/mdstat will tell you what is happening with the array.
Fake RAID is the worst of confounding situations, the worst.
3ware cards are full hardware RAID, and with one of those this issue would never have happened or been noticed by the Linux system.
Software RAID repairs well. Fake RAID will take a while. Connect the drive to an onboard SATA port to see if it will boot from there. If it does, all the better: fake RAID was not the problem. If it does not boot from a bog-standard port, then it is a problem an order of magnitude bigger.
Your problem is happening when the system is trying to go to runlevel 3 (multi-user, without the display manager), which is where init starts the multiple terminals. This is probably just a red herring as to the real problem.
Keep to runlevel 1, read up on RAID recovery, make a new system, use only software RAID, and transfer the data and software.
You can repair this OS. Linux is 100% repairable, if you have buckets of time. $600 gets you a very good basic server. How much is your time worth?
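
For anyone unsure how to read the output of cat /proc/mdstat, here is a small Python sketch that picks out degraded arrays. It assumes the usual layout, where each array's status line ends with something like `[2/2] [UU]` and an underscore marks a missing member; the sample text is invented for illustration and is not taken from the server in question:

```python
import re

# "md0 : active raid1 sdb1[1] sda1[0]" -> array name at the start of the line
ARRAY_RE = re.compile(r"^(md\d+)\s*:\s*(\S+)\s+(\S+)")
# "... blocks [2/1] [U_]" -> expected/active member counts and per-member flags
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return the names of arrays whose status flags contain a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            if "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded


# Hypothetical example output, for illustration only.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid1 sda2[0]
      78043648 blocks [2/1] [U_]

unused devices: <none>
"""
```

With the sample above, `degraded_arrays(SAMPLE_MDSTAT)` reports only `md1`, the array running on one of its two members.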

Stephen Richards
March 22, 2017 11:33 am

I think you banned the one guy who could really help you. Could be wrong, but Tony Heller never comments here??

Editor
Reply to  Stephen Richards
March 22, 2017 12:27 pm

I believe Tony was/is primarily a microprocessor designer. That doesn’t necessarily mean he’d be good at fighting with recalcitrant Linux systems. Heck, my file system and other experience isn’t being much help….

Nor handle a CO2 frost brouhaha well. I still have scars from it.

March 22, 2017 11:40 am

This is a virtual console issue that happens when it can not find font files or any number of other issues. I would be interested to know what the /etc/inittab file looks like. Way to find that would be to boot from a bootable CD or flash drive then mount the system hard drive on /mnt and then have a look at what /mnt/etc/inittab looks like. In particular, I would be interested in entries that look something like:

# TERMINALS
c0:12345:respawn:/sbin/agetty 38400 vc/0 linux
c1:12345:respawn:/sbin/agetty 38400 vc/1 linux
c2:12345:respawn:/sbin/agetty 38400 vc/2 linux
c3:12345:respawn:/sbin/agetty 38400 vc/3 linux
c4:12345:respawn:/sbin/agetty 38400 vc/4 linux
c5:12345:respawn:/sbin/agetty 38400 vc/5 linux
c6:12345:respawn:/sbin/agetty 38400 vc/6 linux

I would see if I can edit that file and comment out the entries for c1 through c6 by inserting a “#” char at the start of the line, save the file, remove your boot media and attempt to reboot from the drive.
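
The edit described above can be sketched in Python for anyone who would rather not do it by hand in an editor. This is only an illustration under assumptions: the function name is hypothetical, the sample mirrors the entries quoted above rather than the real file, and it would be prudent to keep a copy of the original /mnt/etc/inittab before writing anything back:

```python
def comment_out_entries(inittab_text, ids):
    """Prefix '#' to every active inittab line whose id field is in ids."""
    result = []
    for line in inittab_text.splitlines():
        entry_id = line.split(":", 1)[0]
        # Skip blank lines, existing comments and lines without an id field.
        if ":" in line and not line.startswith("#") and entry_id in ids:
            line = "#" + line
        result.append(line)
    return "\n".join(result) + "\n"


# Illustrative sample modelled on the entries quoted above.
SAMPLE_INITTAB = """\
# TERMINALS
c0:12345:respawn:/sbin/agetty 38400 vc/0 linux
c1:12345:respawn:/sbin/agetty 38400 vc/1 linux
c2:12345:respawn:/sbin/agetty 38400 vc/2 linux
"""
```

Calling `comment_out_entries(SAMPLE_INITTAB, {"c1", "c2"})` leaves the c0 line and the existing comment alone, and running it a second time changes nothing further.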

It does look like it loads the c0 device, though, which might be tty0 rather than vc/0

What is happening is that it is attempting to initialize the console devices c1 through c6 and that is failing. c0 does appear to be initializing, though, which might be a serial console if c0 says its device is a tty device in your inittab. So connecting the serial port (if it has one) to another machine’s serial port with a crossover cable might allow you to terminal in (if you know the baud rate, etc) and get a login prompt.

Your /etc/inittab might also have something like this:

# SERIAL CONSOLES
#s0:12345:respawn:/sbin/agetty 9600 ttyS0 vt100
#s1:12345:respawn:/sbin/agetty 9600 ttyS1 vt100

If they are not commented out (as this example IS commented out) you might have a 9600 baud login on a serial port available to you.
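
To see at a glance which login lines are actually active, and at what speed, something like the following Python sketch could be run against a copy of the file. It assumes the argument order used in the examples above (agetty, then baud rate, then device, then terminal type), which other agetty versions may not follow; the function name and sample are hypothetical:

```python
def active_gettys(inittab_text):
    """List (id, device, baud) for every uncommented agetty entry."""
    entries = []
    for raw in inittab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or commented-out entry
        parts = line.split(":", 3)  # id : runlevels : action : process
        if len(parts) != 4 or "agetty" not in parts[3]:
            continue
        args = parts[3].split()
        # Assumed order: /sbin/agetty <baud> <device> <term>
        if len(args) >= 3 and args[1].isdigit():
            entries.append((parts[0], args[2], int(args[1])))
    return entries


# Illustrative sample only.
SAMPLE_GETTYS = """\
id:3:initdefault:
c1:12345:respawn:/sbin/agetty 38400 vc/1 linux
# SERIAL CONSOLES
#s0:12345:respawn:/sbin/agetty 9600 ttyS0 vt100
s1:12345:respawn:/sbin/agetty 9600 ttyS1 vt100
"""
```

For the sample, the result lists c1 on vc/1 at 38400 and s1 on ttyS1 at 9600, and omits the commented-out s0 line.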

curly
Reply to  crosspatch
March 22, 2017 12:14 pm

That was my first thought after looking at the logs.
It looks like it’s happening during the start of going to multi-user mode.

Since you have console access, can you boot single-user?
And carefully look around and see what’s mounted, how it’s mounted?

The other thing that stuck out is that it said it’s mounted your root fs read-only,
and it’s an ext2 filesystem. Hoping that your root filesystem really is not ext2,
and really is ext4 (or at least ext3), and that it may have been mounted as ext2 because
there’s a problem with the fs journal.

But that’s just for diagnosing.

As others have written, protect that original good drive and carefully clone it.
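
On the ext2 versus ext3/ext4 question, the superblock of a cloned partition image can be inspected without mounting anything. The Python sketch below is a rough heuristic, not a replacement for proper tools: it reads the magic number and feature flags at their documented superblock offsets, and treats "has a journal" as ext3 and "uses extents" as ext4. The function names are hypothetical, and on this server the root filesystem may sit inside an LVM volume rather than directly on a partition:

```python
import struct

SUPERBLOCK_OFFSET = 1024     # the superblock starts 1024 bytes into the filesystem
EXT_MAGIC = 0xEF53           # s_magic, stored at offset 0x38 in the superblock
COMPAT_HAS_JOURNAL = 0x0004  # bit in s_feature_compat (offset 0x5C)
INCOMPAT_EXTENTS = 0x0040    # bit in s_feature_incompat (offset 0x60)


def guess_ext_version(superblock):
    """Guess 'ext2', 'ext3' or 'ext4' from raw superblock bytes, else None."""
    if len(superblock) < 0x64:
        return None
    (magic,) = struct.unpack_from("<H", superblock, 0x38)
    if magic != EXT_MAGIC:
        return None
    compat, incompat = struct.unpack_from("<II", superblock, 0x5C)
    if incompat & INCOMPAT_EXTENTS:
        return "ext4"  # heuristic: extents are an ext4 feature
    if compat & COMPAT_HAS_JOURNAL:
        return "ext3"
    return "ext2"


def guess_from_image(path):
    """Read the superblock from a partition image (a copy, never the original)."""
    with open(path, "rb") as handle:
        handle.seek(SUPERBLOCK_OFFSET)
        return guess_ext_version(handle.read(1024))
```

A result of "ext2" here would mean there is no journal at all, rather than a damaged one.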

Editor
Reply to  curly
March 22, 2017 12:35 pm

Anthony says: “This machine was built circa 2007,”

https://en.wikipedia.org/wiki/Ext4 says: “Kernel 2.6.28, containing the ext4 filesystem, was finally released on 25 December 2008.”

I’d be very surprised if it uses ext4. Even if the OS was upgraded a few times.

curly
Reply to  curly
March 22, 2017 1:30 pm

sounds like it is ext2 then. eek.
was hoping for at least ext3.
definitely past time for an upgrade.
not even going to ask the patch level for 2.6.28.

Dena
March 22, 2017 11:57 am

I have never used Linux but I have had many years in computers. The question is always what changed and that would be the new disk drives. What I suspect is something about the new hardware is causing issues in the 2007 software. A possible solution would be to update Linux on the cloned drives to get the latest I/O drivers that may recognize the newer drives.

A secondary possibility is both original drives were corrupted as the result of the drive failure. If so, you have a massive rebuild ahead of you unless you have backups you can recover from.

Dena
Reply to  Anthony Watts
March 22, 2017 4:04 pm

Unfortunately they may not be identical. Much modern hardware uses firmware to control the function of the device. The manufacturer could have had an issue with the product and corrected or changed it in the firmware, causing the device to function differently from the older product. Unless you purchased drives with the same lot number, it’s a possibility.

I have spent years programming software at the hardware level and every so often manufacturing would start having problems with something that was working fine. A little digging around would turn up a part with different functionality than the original part. Fortunately we produced our own drivers and controlled the hardware design so once we figured out what was going on, we could come up with a fix. It’s a good deal more complicated when you don’t know what’s going on in the hardware or software.

Greg
Reply to  Anthony Watts
March 23, 2017 2:19 am

The RAID is administered by an on-board Adaptec RAID controller, and it reports the RAID as healthy.

Linux (the OS) will not need to know about LBA and sector details. If anything is sensitive to that, it will be the RAID firmware, which is below the level of the OS. The drivers in the Linux kernel will have to interact with the Adaptec controller.

It is a wise choice to buy identical drives if replacing half a RAID mirror, though not essential within certain limitations.

If you make hardware changes which lead to bios reassigning disk order the OS will need to know about those changes. That is not “unrealistically sensitive”.

It is quite feasible to clone an entire partition scheme from one HD over to another, larger one, plug it into the same hole (e.g. SCSI connector) and reboot without the OS having to know anything about it. The new disk can be a different make, capacity, LBA, SCSI version.

If that Slackware is really 2007 vintage and not maintained, moving to a newer Linux of any flavour will change any references to IDE drives from /dev/hda to /dev/sda. If there is only the SCSI RAID devices, as it appears, this will not be an issue.
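
Whether the /dev/hda to /dev/sda renaming matters can be checked ahead of time by looking for old-style IDE names in a copy of /etc/fstab. A minimal Python sketch, with a hypothetical function name and an invented sample file:

```python
import re

# Old IDE-style names: /dev/hda, /dev/hdb1, /dev/hdc, ...
IDE_DEVICE_RE = re.compile(r"^/dev/hd[a-z]\d*$")


def ide_references(fstab_text):
    """Return (line number, device) for every fstab entry using an IDE name."""
    hits = []
    for number, line in enumerate(fstab_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        device = stripped.split()[0]  # first fstab field is the device
        if IDE_DEVICE_RE.match(device):
            hits.append((number, device))
    return hits


# Illustrative sample only; not the fstab from the server in question.
SAMPLE_FSTAB = """\
# /etc/fstab
/dev/hda1  /           ext2     defaults   1 1
/dev/sda2  /data       ext3     defaults   1 2
/dev/hdc   /mnt/cdrom  iso9660  noauto,ro  0 0
"""
```

An empty result suggests the renaming will not affect the mounts listed there, though boot loader configuration could still refer to the old names.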

Chimp
Reply to  Dena
March 22, 2017 12:50 pm

What happened to the post on offshore windmills?

Is it a victim of the Linux malfunction?

Chimp
Reply to  Chimp
March 22, 2017 1:11 pm

Thanks.

I noticed it when looking for my reply to Griff on birds and offshore wind farms. Off topic here, but this is another link to the same Scottish court decision:

http://www.telegraph.co.uk/news/2016/05/12/birds-scupper-2bn-offshore-wind-farm/

Diving gannets are the coolest:
[image: diving gannets]

But their swarming behavior around schools of fish give the lie to Griff’s lack of worry. Bird migration routes have nothing to do with it. Gannets go where the fish are.

KevOB
March 22, 2017 12:04 pm

I’m not going to offer a detailed solution, just a suggestion drawn from over 45 years of working with, designing and building computers, and from being a former computer shop owner with a service department. Most of my family also use Linux.

First rule: “Do no harm”. Turn it off and keep it turned off until you find the right technician you can get it to. They are scarce. Your problem may not even be software. With older machines, component failure occurs in so many ways, and ICs not infrequently die slowly in stages. When you do not yet know exactly what the fault or combination of faults is, allowing the machine to be powered on can cause an even worse situation. Expertise in software is not necessarily sufficient when the age and particular hardware are considered. A machine with a component in its death throes can go from recoverable to impossible over a short time.
It may be software, but you need someone skilled overall with this. One blessing: Linux is robust, and provided the read/write subsystem of the hard drives has not physically screwed up critical data, your basic data is most likely to be intact. If so, putting the disk in another machine for cloning and then attempting to copy the appropriate partitions is likely to be successful.

Tom
March 22, 2017 12:05 pm

Not sure which Kernel version you are running, but there have been bugs introduced before that caused this in the older 2.6.31 kernel. When you boot up, if you have a grub menu with a previous kernel version, try booting from that instead and see if it works.

BallBounces
March 22, 2017 12:09 pm

Anthony — please provide us with an update when this is resolved.

Nozza Wales
Reply to  Anthony Watts
March 22, 2017 1:29 pm

Time. Simply time, Anthony. Sleep on it. Listen to the advice here and sleep on it. It’s the data that’s important, nothing else. I’ve spent ages trying to recover data, and went back a couple of weeks later and couldn’t understand what the problem was – and the data flowed out. There’s no rush. Working fast isn’t the answer. Time is the answer. Keep the original drives. Recopy them with dd in a few days with a new RAID 1 setup. Oh. Good luck. It’s all down to luck 😉

Editor
Reply to  Anthony Watts
March 22, 2017 2:56 pm

Whenever I’m facing some new disaster, I’ve found one of the best first steps is to sit down, think black thoughts about computers in general, turn my back on the problem, and go for a walk, maybe to a local pizza place and think more black thoughts there. Then figure out what to do.

Brahms’ German Requiem is good music to debug an OS crash to. I’m not sure what’s good for broken hardware. Mahler’s Resurrection Symphony might be good, but you need to sync the fix to the problem with the climax of the last movement. 🙂

Reply to  Anthony Watts
March 22, 2017 3:06 pm

” Mahler’s Resurrection Symphony ???” no way, try Pink Floyd’s “Welcome to the Machine”

E.M.Smith
Editor
Reply to  Anthony Watts
March 27, 2017 12:23 pm

Daft Punk – Technologic…

please 😉

Non Nomen
March 22, 2017 12:41 pm

I’ll keep my fingers crossed that this mess ends well. I lost data on a HDD a while ago and it was a bloody pain in the a*ss to get it recovered.

PiperPaul
March 22, 2017 12:43 pm

Copy the drive contents and then take the old one and throw it into a dumpster. Before you throw it in, though, write “Tr*mp Ru*ssian Sekrets” (use the letter ‘k’ so it looks authentic, maybe write it backwards, too) on the external case of the drive and then call the New York Times with an anonymous tip. Bingo! The drive will be recovered and copied in no time all over the internet for easy access by you later. But I can’t guarantee that the content won’t be ummm, adjusted if some climate “scientists” are involved somewhere along the way.

KevOB
March 22, 2017 12:58 pm

I have worked with, built, sold, serviced, designed etc. computers for 45 years. Also I am an intermediate Linux user.
Much good advice here, but much of it ignores the first rule of service: “Do no harm”. What matters is not what you think might work but the certainty that the data will not be further compromised. So turn the crook machine off and, for something so important, do not attempt any recovery in the meantime.

I am also concerned with age of software and possibly hardware. Computers at all levels are never static as subtle upgrades are constant and so are internal conflicts arising, timing states etc.. These conflicts are not always significant or evident until catastrophic failure. A generation of hardware for a system builder is only 3 months–It’s been like that for 30 years (Look at the version numbers on parts..)

In my opinion you need an expert tech in older machines as well as a Linux one. A rebuild is probably called for.

Good news: unless the file system has been damaged at the physical read/write level your data is probably intact. Keeping the machine unpowered is your best protection at this time while you find the right tech.

Another hazard is a failing component. They can do the oddest things while dying, and recovery may be possible at an early stage and impossible only minutes later. The more restart attempts you make, the greater the risk.

Best regards for your fine work.

Man Bearpig
March 22, 2017 1:10 pm

Do you still have the original drives? It may be that one went out of sync before the other.

TomB
March 22, 2017 1:11 pm

The second post in the thread was correct. Go to linuxquestions.org. You WILL get expert advice there.

Don K
March 22, 2017 1:15 pm

Anthony, It looks like you have lots of help. I’ll just bow out of this discussion, but let me caution you against a few common ways to screw up the recovery.

1. If at all possible, work on a copy of the disk, not the original

2. If you copy with dd (bit for bit copy) check the “of=” parameter several times. Take a short break then check it again. It **MUST NOT** be the source device. You can probably survive botching the “if” parameter, but not “of=/dev/whatever”

3. If someone tells you that you need to run cfdisk, sfdisk, fdisk, gparted, or parted, think about it long and hard. If you screw up your partitioning, your data is gone.

4. If you copy using file system tools such as cp, cpio, or tar (pretty much anything but dd), you will need to mount your source and destination devices. Remember to unmount them when you are finished. If you don’t, buffers may not be flushed and you may lose data.

… And the unmount command is umount, not unmount

… And if you can’t umount, it is probably because you have moved yourself into the mounted file system with cd for some reason. It’s OK to do that, but you can’t umount until you move out of it.

5. Oh yes, if the amount of data you need to save isn’t enormous, you can back up occasionally to a USB flash memory stick. Devices up to 64 GB are pretty cheap. It’s slow, but you can wander off and do goodly works elsewhere. Use a stick with an LED indicator and don’t pull it out ’til it quits flashing and the umount command exits. BTW, once you have written a backup, you can write subsequent backups of the same filesystem to the same device much more quickly with rsync. In your case, I’d do two or three backup sticks per server, rotate them, and back up every few days or weeks depending on how many days of data you’re willing to risk losing.
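
The warning in point 2 about the dd parameters can be backed up with a small pre-flight check before any copy is started. The Python sketch below is only an illustration with a hypothetical function name. It catches the case where source and destination resolve to the same file or device node, including via a symlink, but it cannot tell that a partition such as /dev/sda1 lives on the disk /dev/sda, so it does not replace reading the command several times:

```python
import os


def check_copy_args(source, destination):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    if not os.path.exists(source):
        problems.append("source does not exist: " + source)
        return problems
    # samefile() follows symlinks and compares the underlying device and inode.
    if os.path.exists(destination) and os.path.samefile(source, destination):
        problems.append("destination is the same file or device as the source")
    return problems
```

A non-empty result is a reason to stop and re-read the if= and of= values before running anything.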