Is there a Linux specialist in the house?

In the attached picture you’ll be able to see what’s happening to one of my most important servers, which contains some irreplaceable climate data. I’m at a loss and don’t understand what’s going on because Linux is not my specialty and my Linux expert has since moved on to other ventures. It’s very important I get this server working again, so I’m asking the WUWT community for help.

The server had been offline for a few weeks, and was properly shut down. Upon powering it back up, I got a “no bootable disk found” message. I determined the RAID (Hardware RAID1 – 2 mirrored drives) had been degraded, and it seemed one disk had failed. So I purchased two new identical HDs, cloned the good one, and rebuilt the RAID1. The RAID is administered by an on-board Adaptec RAID controller, and it reports the RAID as healthy.

What happens now is that it attempts to boot, but gets stuck in a loop on the last messages “Init ID c1, c2…etc” and repeats those error messages. I get the same partial boot and error sequence if I take out the RAID in BIOS, and try booting a single drive in straight SATA mode.

This machine was built circa 2007 and has Slackware Linux of that era installed. I don’t see a version number coming up on boot, so I can’t provide it.

Any and all help appreciated. – Anthony

 

crosspatch

Virtual console issue. I don’t have time to look at it right this second (dentist appt) but there’s a bunch of stuff on google about this error message. Might be having trouble finding the virtual console device or the console fonts. Might have lost one or more files somewhere. Might have to boot off a live CD, mount the raid device and go hunting.

Leo Smith

Yes, I agree. This should not stop you booting, though.

Jim Hodgen

Is the highest priority to recover the data or to get the server running again? If data recovery is most important would suggest offline recovery activities.

Neil

Is the highest priority to recover the data or to get the server running again? If data recovery is most important would suggest offline recovery activities.

+1

First rule of disaster recovery: sit on your hands and don’t make a bad situation worse.

Before you do anything else, image those drives somewhere safe.

After that, I would personally start a disaster recovery using copies of those images and a known good OS in VMWare, but that’s me.

rbabcock

And remember it’s the restore that counts.. not the backup. Make sure you don’t restore from a bad backup.

My reading of the story is that a new RAID was created using the one good original onto two new replacements. If so, the original is now the Golden Master backup of the data.

IMHO, best path to recovery is to build a new Linux on a dedicated single drive, then mount the RAID to it and commence to validate the data. Slackware is a decent build, so I see no reason to change (though generally I lean to Debian, or since the systemd infestation, Devuan).

I’m available to do it if needed. 3 hours drive away.

Drive 3 hours to help? That’s awesome. Voluntary communities are the best.

JP

I think what you have is a RAID puncture. It usually begins with a faulty disk. You replace the disk, the RAID is re-striped, but the array still refuses to boot. I went through this a few years ago. A punctured RAID array is pretty much lights out. I don’t know if you use a PERC card or not. This seemed to plague those cards about a decade or so ago. I don’t work in the Linux world that much, so I don’t have any Linux tools that come to mind. Here’s a link that may be of help. When I went through this very same problem, I did an emergency P2V (physical to virtual machine) conversion. I was lucky.

http://www.theprojectbot.com/what-is-a-punctured-raid-array/

D. J. Hawkins

JP, if you look at the original post, it’s a RAID 1, not a RAID 5. The failure mode you described doesn’t seem possible for RAID 1.

JP

You can have a punctured RAID 1 array. Especially if using Perc cards

https://community.spiceworks.com/topic/279704-punctured-raid-1

TRM

“raid 5 measuring checksum speed” ?? If this is supposed to be a RAID1 is the bios on the controller / motherboard somehow set to RAID5?

Like others suggested get a backup of the good disk. Plug it into another computer and manually mount it, back it up. Remember Captain AHAB (Always Have A Backup) so Moby Disk doesn’t eat your data.

Michael 2

RAID 1, mirrors, can be re-cloned from the last good disk. Anthony reports buying TWO new disks and cloning the old disk. With luck the old disk still exists. Using a live bootable image, you can clone the disk, partition table and everything, with “dd”, but it is terribly easy to get this wrong and erase the source disk. Done carefully, though, it is simple and reliable, and every Linux has it built in.

I’d install a new linux and then copy the valued data from the old disk.
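For anyone nervous about typing raw dd parameters, here is a minimal Python sketch of the two safety checks that matter when imaging a disk. It is not a procedure anyone in this thread has used; the function name `image_disk` and the paths in the usage note are hypothetical. The source is opened read-only, and the destination must not already exist, so swapping source and destination fails instead of erasing data.

```python
import os

CHUNK = 4 * 1024 * 1024  # read 4 MiB at a time


def image_disk(source: str, dest: str) -> int:
    """Copy source to a NEW image file dest and return the bytes copied.

    Hypothetical helper for illustration; not a replacement for dd.
    """
    if os.path.realpath(source) == os.path.realpath(dest):
        raise ValueError("source and destination are the same path")
    copied = 0
    # "rb" opens the source read-only. "xb" raises FileExistsError if dest
    # already exists, so an existing disk or image is never overwritten.
    with open(source, "rb") as src, open(dest, "xb") as dst:
        while True:
            block = src.read(CHUNK)
            if not block:
                break
            dst.write(block)
            copied += len(block)
        dst.flush()
        os.fsync(dst.fileno())  # push the data to the device before returning
    return copied
```

Run from a live system as root, something like `image_disk("/dev/sdX", "/mnt/usb/server.img")` would write an image file onto a separately mounted backup disk (both names are placeholders). The sketch stops at the first read error, so for a drive that is physically failing a tool designed for damaged media is the better choice.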

JP

If the RAID is in fact healthy, another thing to do is to create a bootable Linux flash drive or CD, boot from that drive, see if you can mount the volume in question. If you can mount it, copy the data to another drive. I don’t know how much you can afford, but you may want to begin thinking about taking your site to the cloud. Let the Cloud providers worry about the hardware.

Yes, this would be the best course. Boot from a CD or other media and fix the configuration. You can also try bringing up the system in single user mode. The messages indicate that processes forked by init are crashing, likely related to ttys being misconfigured. Why this would occur after recovering a disk is unknown, but could be because the hard drive failed during an update which left the configuration in a bad state.

Not Sure

Totally agree. I recommend System Rescue CD (also works on a USB stick, despite its name.)

Best guess: your /sbin/agetty binary is corrupt or missing.

Greg

“…begin thinking about taking your site to the cloud. Let the Cloud providers worry about the hardware.”

Yeah, right. They worry about the hardware, then all you have to worry about is who, what and where the “cloud” is, and whether it may evaporate at some unknown point in the future. You do realise clouds can evaporate, right?

Firstly, this is not about the WUWT site; it was some climate data that he had/has backed up offline. I would imagine that the point of that is to be in possession of a physical copy in case the original provider decides to remove it from public view. If this is a “valuable” backup then that has probably already happened.

While making more copies “on the cloud” would not do any harm, it does not replace being in physical possession of your own data.

Mark Gibson

Whenever someone writes “in the cloud” you should read “somebody else’s computer.” And there are cloud failures all the time. Bottom line: as long as your bill is paid, Amazon/Azure doesn’t give a damn if you lose your data. Read any cloud TOS & ask yourself if they have any liability whatsoever. (Hint: they don’t.)

Anthony…the advice to get a backup right now is sound. P2v if you can. You should focus on saving your data at this point; the hardware isn’t going anywhere.

In future…if a RAID1 goes down, just break the mirror, mount the good disk, then boot to it. Back up your data & THEN rebuild the mirror. Trying to mirror a faulty drive can easily result in 2 bad HDs.

talldave2

Cloud failures happen, but local failures happen a lot more often.

Amazon cares a very great deal whether you lose your data, it’s critical to their whole business model.

Eventually it may make no more sense to own your own data storage than your own water storage. Now, in some situations you need your own water, but for the vast majority…

bh2

“You do realise clouds can evaporate, right?”

Which is why all systems, local or cloud, must have remote offsite backup and restore capabilities for ultimate safety. The only substitute for this capability is supreme confidence that nothing can seriously go wrong. Ever.

Someone once said experience is the hardest teacher; it first administers the punishment, then the lesson.

The hard lesson learned by most admins is that, eventually, *something* will go badly wrong.

Lessons from experience are often both simple and obvious, as in this case. Purely static data can be reliably transferred to a hard copy non-volatile medium which is easily stored in your bank safe-deposit box (good). Or with a trusted friend in a location geographically distant (better). Or in a salt mine vault (best).

This simple step to provide ultimate salvage capability for critical data usually becomes obvious only after a major calamity, rather than before. Speaking from bone-headed experience, of course.

Michael P

Whatever is supposed to run on the virtual console (probably /sbin/login or something like it) is crashing at startup, possibly due to bit errors that the RAID controller didn’t detect. The “respawning too fast” message means that the supervisor process for them (called init, because it’s the program that the OS automatically runs at startup) is waiting in hopes that somebody will fix the problem — it isn’t smart enough to figure out that nobody can fix it in the current state.
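To make the “respawning too fast” behaviour concrete, here is a toy Python model of how an init-style supervisor throttles an entry that keeps dying. It is a simplified sketch, not the actual init code: the limits (10 respawns inside 2 minutes, then 5 minutes disabled) are assumptions modelled on commonly cited sysvinit defaults, and the names `Entry` and `may_respawn` are made up for illustration.

```python
from dataclasses import dataclass

# Assumed values, modelled on commonly cited sysvinit defaults.
# Check the init source on the actual system before relying on them.
MAX_SPAWNS = 10      # respawns tolerated inside one window
WINDOW_SECS = 120    # length of the counting window in seconds
DISABLE_SECS = 300   # how long a failing entry is left alone


@dataclass
class Entry:
    ident: str                   # inittab id, e.g. "c1"
    window_start: float = 0.0
    spawns: int = 0
    disabled_until: float = 0.0


def may_respawn(entry: Entry, now: float) -> bool:
    """Return True if the supervisor would start the process again at `now`."""
    if now < entry.disabled_until:
        return False                     # still inside the penalty period
    if now - entry.window_start >= WINDOW_SECS:
        entry.window_start = now         # quiet long enough: restart the count
        entry.spawns = 0
    entry.spawns += 1
    if entry.spawns > MAX_SPAWNS:
        print(f'Id "{entry.ident}" respawning too fast: '
              f"disabled for {DISABLE_SECS // 60} minutes")
        entry.disabled_until = now + DISABLE_SECS
        entry.spawns = 0
        return False
    return True
```

In this model a getty that crashes once a second is restarted ten times, then ignored for five minutes, then tried again, which matches the endless loop of messages described in the post: nothing is fixed in between, so the cycle simply repeats.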

I’d recommend making a “LiveCD” disc or thumb drive, and boot off that. You should be able to check the remaining data, and copy it elsewhere if need be. Installing a clean OS without wiping the existing data can be done, but should be done by someone with a moderate amount of Linux experience.

If you can’t feasibly make a live CD, you should be able to boot into single user mode by editing the kernel’s command line. Details of how to do that will depend on the bootloader that is set up.

Michael 2

I carry two bootable “thumb drives” just for this purpose. Emergency booting of a computer and then copy the disk that won’t boot onto a portable USB disk. Do it twice on different destinations; verify success of the copy. Then you can start rebuilding your system.
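One way to “verify success of the copy” is to compare checksums. The following is a small Python sketch under the assumption that the original and the copies are whole image files (or devices of identical size) that can be read end to end; the function names are hypothetical.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def copies_match(original: str, *copies: str) -> bool:
    """True only if every copy has the same digest as the original."""
    if not copies:
        raise ValueError("give at least one copy to compare")
    expected = sha256_of(original)
    return all(sha256_of(copy) == expected for copy in copies)
```

With two destinations, as suggested above, the call would look like `copies_match("first.img", "/mnt/usb1/first.img", "/mnt/usb2/first.img")` (placeholder paths). If the destination disk is larger than the source and was written to directly rather than to an image file, the digests will differ even when the copy is good, so this check fits image files best.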

Gee… similar to my kit. A few bootable USBs of different releases, an empty TB USB disk, a stack of “live CDs” and “installation CDs”, a portable laptop pre-built dual boot to work from… The more your experience, the bigger the kit grows…

Vieras

If you look at that picture, it looks like the partitions are ok. So the data is most likely safe. I’d boot that computer with a usb-based live-Linux, mount the drives and copy the data to a safe place. Your best bet is to find some knowledgeable Linux-user nearby to do that. When your data is safe, you can then concentrate on fixing the problem.

Leo Smith

Very sound advice

bh2

Yes.

PSU-EMS-Alum

Before you do ANYTHING on this machine:

From a different machine, hunt down instructions to create a bootable Slackware USB flash drive.
Create the boot flash drive on that machine.
Use your new USB boot drive on this server, mount the RAID, copy the data somewhere else.

It actually ends up only being a few commands in total for this process (including possibly changing BIOS config to boot from USB), but commands will be specific to your situation.

Only once these steps are done should you make any attempt at rectifying the issue with your server.
For future reference, RAID is for availability, not backup.

Darren Stevens

That should be in bold and all-caps: RAID is for availability not backup.
Server failure, theft, or destruction should never endanger your “irreplaceable climate data”. Although I am not a fan of “cloud storage” it makes a reasonable off-site backup of last resort.

What you have here is the problem of mixing hardware “fake raid” and software raid.
Those Adaptec RAID controllers are primarily set up to be used on M$ based systems.
The linux software raid is the way to go.
Saying that, your data is not lost, it just will take many many hours to recover. Especially if there are multiple TB of data.
The mdadm tools are there to recover the data, I have to do this for clients once in a while, thank goodness this is not a common problem.
There are many many forums on the interwebs to help recover this data, it is just that they are very cryptic.
I’m sorry that I’m a little far from Chico, or I’d be over there right away.
The most important thing to do is get a brand new Hitachi or WD HDD and dd your best drive to it, that way you are not making unrecoverable mistakes.
For a storage server use software raid 1 with raid certified drives ( not all drives are friendly for raid ).
The best server environment I have used is the Koozali SME server found at contribs.org.
For just data storage use a ultra low wattage system and a name brand power supply.
I use a ZOTAC H87-ITX board with the lowest wattage laptop processor available and an Antec power supply.
Anything I can do from here in BC I will.

If it’s a mirror raid then there should be two identical copies of the data.
I have run many varieties of raid configurations… and lost a lot of data, but never with mirroring.
My worst data loss was with a 3 drive configuration that should have been able to rebuild except I did it with a virtual machine with soft raid and somehow lost 2 drives in quick succession.
I spanned 2 drives once, and lost 500GB of tv shows I recorded. I have yet to find out how Battlestar Galactica turned out (not the original series but the remake).

Timo (not that one)

Don’t know anything about RAID drives or Linux, but I think you should feel happy you never saw how they ended Battlestar Galactica. Just keep your imagination about how they might have ended it.

beng135

Agree w/Timo — an awful/confusing ending to Battlestar second edition.

Bear

Don’t know if this helps but I saw this on line:

>This happens when you are using the serial console. To get rid of those
>messages, comment out all lines under “# TERMINALS” in /etc/inittab and
>execute “telinit q”. There should be six lines from c1 to c6.
This is partially right. Here are a few sequential lines on the
/etc/inittab:


# SERIAL CONSOLE
c0:12345:respawn:/sbin/agetty 9600 ttyS0 vt100

# TERMINALS
c1:12345:respawn:/sbin/agetty 38400 tty1 linux
c2:12345:respawn:/sbin/agetty 38400 tty2 linux
c3:12345:respawn:/sbin/agetty 38400 tty3 linux
c4:12345:respawn:/sbin/agetty 38400 tty4 linux
c5:12345:respawn:/sbin/agetty 38400 tty5 linux
c6:12345:respawn:/sbin/agetty 38400 tty6 linux


So, if I change the above to this,

# SERIAL CONSOLE
#c0:12345:respawn:/sbin/agetty 9600 ttyS0 vt100

# TERMINALS
c1:12345:respawn:/sbin/agetty 38400 tty1 linux
c2:12345:respawn:/sbin/agetty 38400 tty2 linux
c3:12345:respawn:/sbin/agetty 38400 tty3 linux
c4:12345:respawn:/sbin/agetty 38400 tty4 linux
c5:12345:respawn:/sbin/agetty 38400 tty5 linux
c6:12345:respawn:/sbin/agetty 38400 tty6 linux

it works. But if I change it to this,

# SERIAL CONSOLE
c0:12345:respawn:/sbin/agetty 9600 ttyS0 vt100

# TERMINALS
#c1:12345:respawn:/sbin/agetty 38400 tty1 linux
#c2:12345:respawn:/sbin/agetty 38400 tty2 linux
#c3:12345:respawn:/sbin/agetty 38400 tty3 linux
#c4:12345:respawn:/sbin/agetty 38400 tty4 linux
#c5:12345:respawn:/sbin/agetty 38400 tty5 linux
#c6:12345:respawn:/sbin/agetty 38400 tty6 linux

as you suggested, it does not work. I still get the same message and I get
more error messages. So, the answer, at least in my sparc64, is to comment
the serial console that is giving you problems. Ie.

# SERIAL CONSOLE
#c0:12345:respawn:/sbin/agetty 9600 ttyS0 vt100
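If the edit has to be made from a live CD, the same change can be scripted against the mounted copy of the file. This is a minimal Python sketch, assuming the usual `id:runlevels:action:process` layout shown above; the function name `comment_out` is hypothetical, and which ids to disable (c0, or c1 to c6) depends on which entries are actually failing on the machine.

```python
import re
from typing import Iterable


def comment_out(inittab_text: str, ids: Iterable[str]) -> str:
    """Return inittab text with the entries whose id is in `ids` commented out.

    Lines that are already comments, blank lines, and other entries are
    returned unchanged.
    """
    wanted = set(ids)
    out = []
    for line in inittab_text.splitlines():
        # An active entry starts with its id followed by a colon, e.g. "c0:".
        match = re.match(r"^([^#:\s]+):", line)
        if match and match.group(1) in wanted:
            line = "#" + line
        out.append(line)
    return "\n".join(out) + "\n"
```

A cautious way to use it is to read the file from wherever the old root is mounted, save an untouched copy next to it, and only then write back the result of `comment_out(text, {"c0"})`. Running the function twice gives the same output, because an entry that is already commented no longer matches.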

At the end of the day, get CentOS installed. It’s a production-grade OS that, while behind the curve of the latest releases, is bulletproof for production, and they do timely security updates.
FYI: I had a mirrored RAID setup on a production server once (Novell); when the controller got flaky, both drives were hosed. After spending three days rebuilding, I discovered the issue and had to restore from the day-before backup, losing 8 hours of 26 editors’ work. Moved to RAID 5 with 6 hot-plug and play drives in external box.
Now I use USB Thumb drives for backup and long term storage, all my tape units and DVD optical crap is history. Hint: Tape backup is just that, don’t try to restore it two years down the road! DVD’s are close behind due to mechanical and media issues.

Do not use thumb drives for backup or archiving. Or SSDs for that matter. Best for low end are hard disks, refreshed every 2 years and replaced after 10.

Why not? 7 years ago I would agree with you, but the latest USB 3.x drives are bulletproof. Run read/write diagnostics for a few hours to test the new drives before using them for long-term archival storage. The MTBF for mechanical drives is several times lower than for NAND drives.

The electric charge on SSD, SD cards and Thumb Drives very slowly leaks away. You don’t notice in normal use since simply plugging them in refreshes the charge. BUT, put one in a safe deposit box for a few years and you can come back to find the data simply leaked away, or that in the early stages some bits have changed from 1 to 0.

IFF using those NAND storage systems (or similar) for backups, plug them in to a live system once a year to refresh the charge.

Oracle also gives their RedHat variant away, along with their VM server. All commercial grade. You just have to pay for support if you need it. But the software is all free.

Like others have said, use a live CD to boot and then copy the data to a different device. After that, why not just reinstall a modern Linux version (or other OS) from scratch – you don’t want to tell the world that you are running an ancient Linux version that likely doesn’t get any security fixes any more.

Alan Robertson

If you don’t get enough of an answer here, I might suggest contacting frequent contributor E. M. Smith, “Chiefio”. He has a clue or two about Linux.
It looks like you’ve already been steered in the right direction.

Thanks for the props.

Yes, I do Linux for a living and have since day one (I.e. started *nix on BSD pre Linux).

Anthony, I’m available for free if desired and can bring a bag of kit including spare USB disk, thumb drives of several flavors, laptop with Linux and more.

Hehe, I popped my cherry on Sys III, but spent a big chunk of the last 20 years using Windows. I now have a VM server that’s all on Linux, and a bunch of Linux VMs I run as needed.

Did you ever have to drill coax to put an ethernet tap in? I hated that, it seemed so cheesy lol. I hope the guy who put it all in a box got very wealthy; he did a great service for the world.

Michael 2

“Did you ever have to drill coax to put an ethernet tap in?”

I came into thicknet right at the tail end. We had one wall covered in figure-8’s of thicknet in order to attach the server farm. You can only attach at equidistant marked spots on the coax to avoid standing waves. These taps were called “AUI” if I remember right.

They were at quarter wavelength for 100 MHz, so about 3 m. I never had to wire a server room; what a mess that had to be, but I see how you got to figure-8’s lol
Most of mine were just on coils, or around a few cubicles. They were on first-generation electronics design workstations, 68020 running BSD. The story went around that the founders knew the Sun guys (from Berkeley, I presume) and gave them the design, which they turned into the Sun3. Valid then switched to it instead of continuing to build their own workstations. Fun times.

Yes. Don’t remind me. (Raised ceiling over engineering desktops… DEC gear… At Apple in the ’80s…)

Nothing like panel dust in your face and shirt, balanced on a ladder, drilling an unsupported round thing (using a jig) while folks ask why you aren’t done yet to make you wonder why folks think computer work is all desk work and comfortable…

Dougmanxx

You need to comment out entries 1-6 in /etc/inittab as Bear said above. Generally this error is caused by keymappings in rc.conf passing unwanted keystrokes to a virtual terminal. (That’s the “console” error you see in the middle). Editing the init file should remove those possible console connections. Was this machine originally set up without a monitor and only logged in remotely? I’ve seen this when older machines are decommissioned and login is attempted locally instead of remotely. You may need to boot to a live instance to edit the files to get it to boot, or…. just save your data and set up something more modern 😉

If all else fails, try loading Microsoft Word and Word for Windows!

commieBob

Is this a troll?

PiperPaul

This is: “try loading Microsoft Word and Word for Windows” – but use the Mac versions!

Lars P.

I am no linux specialist, just a user 🙁 ….

Can you boot from a linux CD or USB and mount the drive?
Eventually then show the inittab?

Some more troubleshooting help:
https://www.lifewire.com/text-terminals-on-linux-2205461
Are you able to boot in single user mode?

Lars P.

By the way, you should be able to boot with a recent version of linux and mount the drives to copy/save the data. Do you need the old version running for any particular reason?
What is the status now?

Quick question: How much data and is it compressed? Just looked at a 1TB SSD drive for online read storage, long term bullet proof stuff, not so good for daily production but for READ-ONLY great stuff.

Don K

I’m not the Linux expert you’re looking for, but let me pretty much agree with everyone else. Yes, the symptom (not necessarily the underlying problem) is with your virtual consoles. It’s quite possible that most, maybe all, your data is intact, so your top priority probably should be not to inadvertently destroy it. What I’d probably do is buy yet more drives, clone the ones you have, verify that the clones do the same thing as the current disks, then try to recover your data from them using a single “disk” linux system: usb stick, cdrom, linux installed in a spare partition if you have one.

Don’t overlook the possibility that your RAID controller, CPU, or memory might be failing in some weird fashion. That’s probably not the case, but if it is, it’s possible that nothing you try will quite work or make sense.

I also agree with PSU-EMS-Alum that all things being equal Slackware might be a good choice for the single disk system. It tends to be the most “plain vanilla” unix system around, which means various cures/analytic procedures you find on the internet may well work. And my experience in the past is that there is intelligent life at http://www.linuxquestions.org/questions/forumdisplay.php?forumid=14 (something that is not true of all Linux distribution support forums).

I’ve used most versions of Linux. Above someone suggested Centos. It is good, but stodgy, and is now systemd for init so LOTS of conversion work for any customizations on the Slackware (that is BSD like rc.d init).

For ease of update, I’d likely keep it Slackware and update to current as a fresh install, then copy over any customizations and last reinstall the data disks.

On a second machine I’d be doing the disk cleaning and proving up. Any Linux ought to do for that, as the file system is likely ext3. (Says ext2 for root but the RAID might vary)

Last on my list would be a full on conversion to a different OS with a new init system and all… and only if there was some good reason to do it. Arguing over your favorite flavor of Linux is not a good reason…

ozspeaksup

puppy linux on a usb drive to enable it to boot?

Kevin Atkinson

Another thing to do is run the dmesg command as root and post the output here. As it is, we can’t tell too much of what’s going on; with dmesg output, we could possibly trace down the errors.

I’m a Linux sysadmin. If I were you, I wouldn’t trust half of the answers you’re getting.

And for that matter, there’s no reason for you to trust me either: ANY jackass can claim to be a Linux expert.

Call Tony Heller. He’s someone you know, and I’m betting he knows his Linux.

pameladragon

I use Linux but my Linux expert keeps it healthy for me. His suggestion follows:

“Looks like he lost and|or corrupted data when his RAID failed. His trouble is with setting up the virtual terminals. Needs to look at /etc/inittab (not too familiar, his system is running very obsolete version of Linux). He will probably need to use a LiveCD session to fix. May have to comment out the offending init lines to get it to boot and then reinstall the associated virtual terminal package that is broken, some getty package needed for login.”

Good luck, I hope you can save your data!

PMK

Janice Moore

Praying for you, Anthony.

brians356

Try what Bear and Dougmanxx suggest. Of all the suggestions, that is simplest, with no possible adverse effects on the data. However, it assumes you can boot into a level allowing you to edit that /etc/inittab file. Perhaps someone can coach that aspect?

MarcK

Maybe this forum posting will help?
https://forums.gentoo.org/viewtopic-t-1051286-start-0.html
Googled “run level 3 Id “c1″ respawning too fast”
There may be a misconfiguration of your /etc/inittab file

You are going to need a specialist. I was a linux user for about 6 years, got sick of the unfriendly command line and lack of hardware support, dumped linux and went to Windows 10 late last year.

brians356

This incident points up the importance of keeping an independent offline or nearline backup of “irreplaceable” data. LTO tape is one popular and inexpensive approach.

Don K

It’s been a decade, maybe two since I had to worry about it, and the world was simpler back then. But if the issue is just inittab (why would inittab and only inittab be bad?) isn’t there such a thing as single user mode? I vaguely think it might be usable to edit inittab.

I’d still make copies of the disk(s) before I did anything else.

CEH

Anthony, check this link out, it seems to address the problem that is shown on line four from the bottom, “/dev/console……..”

http://www.linuxquestions.org/questions/general-10/open-dev-console-input-output-error-383629/

gregfreemyer

I’m a data recovery expert. (See me on Linked-In). Tons of UNIX/Linux usage over the last 30+ years.

Step one, take a few deep breaths.

The odds are very high that 100% of your data is recoverable. That is until you destroy it in the process of trying to recover it.

Just go slow and work with copies of the drive(s), not the originals.

If you want more advice, connect up to me on LinkedIn and send me a message.

+1

I’m not anywhere near as technical, but I’ve been selling backup and disaster recovery systems to corporations for many years. Greg’s advice (and that of others) is spot on. Go after the data first. You don’t really care if you EVER fix the boot issue if you have the data and can load it on a new O/S image. Go after the data from the cloned drive because destroying data by trying to save it happens way more often than one would think (see Greg’s remark above and read it over three or four times while breathing deeply). This leaves the original intact if you accidentally blow it away on the clone.

Now for Step 2 which Greg didn’t address. The moment you have the data recovered, do not, I repeat, DO NOT go merrily on your way getting the old server running or even a new one. Repeat, DO NOT. Instead, make a G** D*** copy of the F***ing data on something that can go off site. Cloud, disk drive, tape cartridge, pick one. Then do the same for ALL YOUR OTHER COMPUTERS. Take the copy home in the trunk of your car and store in the closet if need be, if your business location has a fire or a flood or theft, no amount of technical expertise will help you. Copy the data, get it off site. Backup, backup often, off site. First rule of preparing for disaster recovery.

whiten

For what it could be worth.

I do not know much about servers, but I have my own computers…..and I had to give up on an old one, few weeks ago, and go for a new one due to a lot of virus attacks.

I could very easy have recovered my old hardware, but still thought at that stage it was not worth it, and as things stands I do not regret it at this point….

My old hardware, regardless of my effort, in the end suffered due to a simple but very effective virus, that always managed to cripple the functioning of the CD-drive and the USB drives and the external control from mouse and keyboard…
Regardless of the antivirus software installed, even the specific one against such a virus, I still had to do a lot to keep the hardware respond to the mouse and the keyboard and be under “exceptional control” …. and at the same time be happy with a partial usage of the CD-drive and the USB’s…..
The virus infecting that old hard-drive came from an infected USB-stick….In a moment that I got lazy and careless.

Strangely enough, a lot of virus attacks happening with my new hardware, including the one that crippled my old comp.
While I am not in the same pattern of the usage with my new hard-ware and not yet have inserted on it a USB-stick…

My new comp is warning me to install and use the same antivirus software that I had previously in my old hard-ware…..strange indeed….but the point I am trying to make is that sometimes viruses or malware can be very effective in crippling the basic functions of access to the computer or the server without actually damaging the data and the hardware itself……
While at the same time a rush to fix it may result in greater damage to the hard-ware and the data in it.

Not rushing and taking the necessary time needed is essential in these cases I think…

cheers

Jim Hodgen

To stay on topic – I hope – for a longer term fix… yes Centos is a good production system, but what you might want to consider is getting a hardware controller appliance. It will cost between $350 and $700 (there are more expensive ones, but those should work; maybe better prices if you can get to Fry’s or go online) and you can have the appliance do a lot of the error checking and preventive management.

This allows other systems to access the drives through a network, some allow hot swaps for the individual drives, you get error reporting and an operating system that is entirely separate from the drives themselves. This is a fixed cost model that has higher upfront but lower ongoing costs than a cloud storage option that will keep sending a bill for as long as you want to keep the data.

I never buy those – I’m not a dummy.

Since you are not a dummy, you would not run into the problem faced by our host.

Not at all, not at all. I’ve been making money off computers since 1969. They haven’t exhausted all the ways they can bamboozle me yet.

I’m willing to bet that you have several O’Reilly books though 🙂

The easiest way to clone the drive with dd is to download and burn a copy of Clonezilla.
Work from a copy of your good disk, never from the original.

Ian Macdonald

I would suggest downloading a copy of Knoppix on another computer, burning a CD or DVD and booting the affected computer from that. There is a good chance you might be able to read and recover the data that way. Knoppix mounts partitions readonly by default, which is the best approach when dealing with what might be a corrupted and fragile partition.

If it proves necessary to run data recovery tools then always do that on a copy of the original. That way, if it makes matters worse, you still have the original.
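Before touching anything on a suspect partition it is worth confirming that it really is mounted read-only. Here is a small Python sketch that checks text in the format of /proc/mounts (device, mount point, filesystem type, options, then two numeric fields); the function name and the mount point in the usage note are hypothetical.

```python
def is_mounted_read_only(mounts_text: str, mount_point: str) -> bool:
    """Check /proc/mounts-style text for a read-only mount at mount_point.

    If the mount point appears more than once, the last entry wins, since
    that is the mount currently visible at that path.
    """
    result = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # Field 4 is a comma-separated option list such as "ro,relatime".
            result = "ro" in fields[3].split(",")
    if result is None:
        raise LookupError(f"{mount_point} is not mounted")
    return result
```

On a live system the text would come from reading `/proc/mounts`, for example `is_mounted_read_only(open("/proc/mounts").read(), "/mnt/data")`, where `/mnt/data` stands for wherever the old partition was mounted.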

http://knoppix.net

Incidentally at the boot prompt I usually type ‘knoppix no3d’ to stop the compiz effects from loading. Might be a matter of opinion but I just find them a nuisance.

Charles Curley

There are several pieces of good advice in earlier comments.

* I’d be talking with gregfreemyer right about now. His advice is good, and if he will work with you, great.

* Michael 2’s advice to copy off disk images using dd is excellent. Do that first. Make damned sure you get the if= and of= parameters correct. 🙂

A few other thoughts…

Take your time.

You may be able to boot to a modern data recovery CD. I use finnix (https://www.finnix.org/), but gregfreemyer may have a better recommendation.

It looks like your system uses lvm version 1.0.8 (2003). I have 2.02. After 14 years, it is possible that modern lvm no longer supports your disks. Building a boot CD from your slackware distribution might be the way to go.

You might also look at rdd to see if you can recover the failed drive. That will depend on what exactly failed in the drive.

There are user communities for all the software you need; feel free to ask for help.

Once the crisis is over….

A more recent distribution of Linux may be in order, as smalliot suggested. Centos is good. I use debian stable. Update things a bit more often than every 14 years.

Start doing backups. I use amanda, others can recommend other products. amanda is an industrial grade backup system which can be a bear to configure. Once you get it set up it runs forever and degrades gracefully. I back up to virtual tapes (files on disk) and copy to external USB drives for off-site backup.

Good luck!

Ian Macdonald

BTW, be very careful with the dd command; if you get the parameters wrong you can nuke the data.

I haven’t had to delve into a problem like this on Linux, but my take on things is that the RAID array is fine, but the kernel is confused about where it can have you login.

Do you normally log in from a “login: ” prompt on a mostly bare screen like the boot display, or does the system start a graphics/window server?

I think the best comments are from Bear at https://wattsupwiththat.com/2017/03/22/is-there-a-linux-specialist-in-the-house/comment-page-1/#comment-2457854 and CEH at https://wattsupwiththat.com/2017/03/22/is-there-a-linux-specialist-in-the-house/comment-page-1/#comment-2457919

Beyond that, I’m not sure where I’d start. Is there a chance that you really do normally log in to the beast from some terminal line (e.g. for remote access) and that the cable is loose or fell out while you were working with the replacement disk drives?

R Fujii

A quick google says that those messages are coming from the alternate consoles, so the good news is that it doesn’t look like there is anything wrong with the disks. Are you sure everything is /exactly/ the same as when it was powered off (minus the drive)? Since this is a ~2007 machine, is a PS/2 keyboard attached? I see some talk that this might happen if your network cable isn’t attached (just the messenger here). As has been suggested, if you just want to recover the data, it’s probably easiest to boot from a live image DVD and copy the data to another HD. From the looks of things, you’ll probably have to do this anyway and edit /etc/inittab to get it to stop spawning those processes so it will get further in the boot sequence. Be happy to help if you don’t mind remote help…

Alan Watt, Climate Denialist Level 7

I can write in more detail later. Interesting you say the mirroring is done on HW raid because the boot messages clearly indicate software raid with the “md” (multiple devices) subsystem. The boot messages identify two SCSI disks, which in my experience it doesn’t do if hardware RAID were happening at the adapter level.

You will need to boot a recovery system from CD/DVD/USB depending on your hardware (which looks kind of old, so I suspect we’re talking CD here). Try here: http://www.system-rescue-cd.org/

The good news is the recovery system does not need to be slackware, as long as it supports the software raid and LVM. This will get you a working system and hopefully able to access your disk and its filesystems. You then at least have access to the data and can copy it off somewhere safe while you work on fixing the problem.

I can write more later.

The boot messages identify two SCSI disks, which in my experience it doesn’t do if hardware RAID were happening at the adapter level.

I in general agree. But I have run into (twice in my life) systems where h/w raid was enabled, the volume was logically partitioned, and then s/w raid run across the partitions. Long time ago, don’t remember all the details other than tech guys running around with hair on fire screaming “WTF why would anyone do that?” when the system went south.

I’ve administered Linux since the 1.2.3 kernel. I started an ISP in 1996 on Slackware and then moved to Red Hat. This fake RAID card is probably the problem. The hardware works, but...
To get anywhere, boot in single-user mode to do your checks;
cat /proc/mdstat will tell you what is happening with the array.
Fake RAID is the worst of confounding situations, the worst.
3ware cards are full hardware RAID, and with one of those this issue would never have happened to, or even been noticed by, the Linux system.
Software RAID repairs well. Fake RAID will take a while. Connect the drive to an onboard SATA port to see if it will boot from there. If it does, all the better: fake RAID was not the problem. If it does not boot from a bog-standard port, then it is an order of magnitude bigger problem.
Your problem is happening when the system is trying to go to runlevel 3, multiuser without the display manager. This is so that init brings up multiple terminals and uses multiple threads efficiently. This is probably just a red herring as to the real problem.
Keep to runlevel 1, read up on RAID recovery, make a new system, use only software RAID, and transfer the data and software.
You can repair this OS; Linux is 100% repairable if you have buckets of time. $600 gets you a very good basic server. How much is your time worth?
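For anyone not used to reading the output of cat /proc/mdstat: each software RAID array gets a header line such as "md0 : active raid1 ..." followed by a status like [UU], and an underscore in that status (for example [U_]) marks a missing or failed member. Here is a minimal Python sketch of that check. The function name `degraded_arrays` and the sample text are invented for illustration and are not output from this server; note also that this only covers arrays managed by the md driver.

```python
import re

def degraded_arrays(mdstat_text):
    """Return names of md arrays whose status shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            # Start of a new array block, e.g. "md0 : active raid1 ..."
            current = header.group(1)
            continue
        status = re.search(r"\[([U_]+)\]", line)
        if current and status and "_" in status.group(1):
            degraded.append(current)
            current = None
    return degraded

# Made-up sample in the usual /proc/mdstat layout (not from the real machine).
SAMPLE = """Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]
md1 : active raid1 sda2[0]
      4192896 blocks [2/1] [U_]
unused devices: <none>
"""

print(degraded_arrays(SAMPLE))
```

On a booted rescue system the same function could be fed the contents of /proc/mdstat instead of the sample string.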

Stephen Richards

I think you banned the one guy who could really help you. Could be wrong, but Tony Heller never comments here??

I believe Tony was/is primarily a microprocessor designer. That doesn’t necessarily mean he’d be good at fighting with recalcitrant Linux systems. Heck, my file system and other experience isn’t being much help….

Nor handle a CO2 frost brouhaha well. I still have scars from it.

crosspatch

This is a virtual console issue that happens when it cannot find font files or any number of other issues. I would be interested to know what the /etc/inittab file looks like. The way to find that would be to boot from a bootable CD or flash drive, then mount the system hard drive on /mnt and have a look at what /mnt/etc/inittab looks like. In particular, I would be interested in entries that look something like:

# TERMINALS
c0:12345:respawn:/sbin/agetty 38400 vc/0 linux
c1:12345:respawn:/sbin/agetty 38400 vc/1 linux
c2:12345:respawn:/sbin/agetty 38400 vc/2 linux
c3:12345:respawn:/sbin/agetty 38400 vc/3 linux
c4:12345:respawn:/sbin/agetty 38400 vc/4 linux
c5:12345:respawn:/sbin/agetty 38400 vc/5 linux
c6:12345:respawn:/sbin/agetty 38400 vc/7 linux

I would see if I can edit that file and comment out the entries for c1 through c6 by inserting a “#” char at the start of the line, save the file, remove your boot media and attempt to reboot from the drive.
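If hand-editing from a rescue shell feels risky, the same change can be scripted so it can be tried on a copy first. This is a minimal Python sketch under stated assumptions: the function name `comment_out_gettys` is made up, the default ids c1 through c6 come from the example entries above, and the function works on text only, so nothing is written to disk until you decide to do so yourself.

```python
def comment_out_gettys(inittab_text, ids=("c1", "c2", "c3", "c4", "c5", "c6")):
    """Prefix '#' to active inittab entries whose id field is in `ids`."""
    result = []
    for line in inittab_text.splitlines():
        is_comment = line.lstrip().startswith("#")
        # The id is the first colon-separated field of an inittab entry.
        entry_id = line.split(":", 1)[0].strip()
        if not is_comment and ":" in line and entry_id in ids:
            line = "#" + line
        result.append(line)
    return "".join(item + "\n" for item in result)

# Example with two of the entries quoted above; c0 is left alone.
sample = (
    "# TERMINALS\n"
    "c0:12345:respawn:/sbin/agetty 38400 vc/0 linux\n"
    "c1:12345:respawn:/sbin/agetty 38400 vc/1 linux\n"
)
print(comment_out_gettys(sample), end="")
```

A cautious way to use it would be to copy /mnt/etc/inittab somewhere safe, run the function on the copy's contents, compare the result by eye, and only then replace the file.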

It does look like it loads the c0 device, though, which might be tty0 rather than vc/0.

What is happening is that it is attempting to initialize the console devices c1 through c6 and that is failing. c0 does appear to be initializing, though, which might be a serial console if c0 says its device is a tty device in your inittab. So connecting the serial port (if it has one) to another machine’s serial port with a crossover cable might allow you to terminal in (if you know the baud rate, etc.) and get a login prompt.

Your /etc/inittab might also have something like this:

# SERIAL CONSOLES
#s0:12345:respawn:/sbin/agetty 9600 ttyS0 vt100
#s1:12345:respawn:/sbin/agetty 9600 ttyS1 vt100

If they are not commented out (as this example IS commented out) you might have a 9600 baud login on a serial port available to you.
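To check quickly whether any serial login is actually enabled, the file can be scanned for uncommented agetty entries on a ttyS port. A rough Python sketch follows; the function name `active_serial_consoles` is hypothetical, and it assumes the argument order shown in the example above (agetty, then baud rate, then port), which may differ on a given system.

```python
def active_serial_consoles(inittab_text):
    """Return (device, baud) for each uncommented agetty entry on a ttyS port."""
    found = []
    for line in inittab_text.splitlines():
        if line.lstrip().startswith("#"):
            continue
        # inittab entries have four fields: id:runlevels:action:process
        fields = line.split(":", 3)
        if len(fields) != 4:
            continue
        words = fields[3].split()
        # Assumed shape: /sbin/agetty 9600 ttyS0 vt100
        if len(words) >= 3 and words[0].endswith("agetty") and words[2].startswith("ttyS"):
            found.append((words[2], words[1]))
    return found

serial_sample = (
    "s0:12345:respawn:/sbin/agetty 9600 ttyS0 vt100\n"
    "#s1:12345:respawn:/sbin/agetty 9600 ttyS1 vt100\n"
)
print(active_serial_consoles(serial_sample))
```

An empty result would suggest that no serial console is configured, at least not in the form this sketch looks for.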

curly

That was my first thought after looking at the logs.
It looks like it’s happening during the start of going to multi-user mode.

Since you have console access, can you boot single-user?
And carefully look around and see what’s mounted, how it’s mounted?

The other thing that stuck out is that it said it’s mounted your root fs read-only,
and it’s an ext2 filesystem. Hoping that your root filesystem really is not ext2,
and really is ext4 (or at least ext3), and that it was maybe mounted as ext2 because
there’s a problem with the fs journal.
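To see what is mounted and how once you have a shell, /proc/mounts lists one mount per line as device, mount point, filesystem type and options. A small Python sketch of reading it, assuming that standard layout; the function name `parse_mounts` and the sample line (including the /dev/md0 device name) are illustrative guesses, not taken from the boot screen.

```python
def parse_mounts(mounts_text):
    """Return a list of (device, mount_point, fs_type, read_only) tuples."""
    entries = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        device, mount_point, fs_type, options = fields[:4]
        # Options are comma-separated; "ro" means mounted read-only.
        read_only = "ro" in options.split(",")
        entries.append((device, mount_point, fs_type, read_only))
    return entries

# Hypothetical contents; on a live system read /proc/mounts instead.
mounts_sample = "/dev/md0 / ext2 ro 0 0\nproc /proc proc rw 0 0\n"
for entry in parse_mounts(mounts_sample):
    print(entry)
```

A root entry that shows ext2 and read-only here would match what the boot messages appear to say, though it would not by itself explain why.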

But that’s just for diagnosing.

As others have written, protect that original good drive and carefully clone it.

Anthony says: “This machine was built circa 2007,”

https://en.wikipedia.org/wiki/Ext4 says: “Kernel 2.6.28, containing the ext4 filesystem, was finally released on 25 December 2008.”

I’d be very surprised if it uses ext4. Even if the OS was upgraded a few times.

curly

sounds like it is ext2 then. eek.
was hoping for at least ext3.
definitely past time for an upgrade.
not even going to ask the patch level for 2.6.28.

Dena

I have never used Linux but I have had many years in computers. The question is always what changed and that would be the new disk drives. What I suspect is something about the new hardware is causing issues in the 2007 software. A possible solution would be to update Linux on the cloned drives to get the latest I/O drivers that may recognize the newer drives.

A secondary possibility is both original drives were corrupted as the result of the drive failure. If so, you have a massive rebuild ahead of you unless you have backups you can recover from.

Chimp

What happened to the post on offshore windmills?

Is it a victim of the Linux malfunction?

Chimp

Thanks.

I noticed it when looking for my reply to Griff on birds and offshore wind farms. Off topic here, but this is another link to the same Scottish court decision:

http://www.telegraph.co.uk/news/2016/05/12/birds-scupper-2bn-offshore-wind-farm/

Diving gannets are the coolest:
comment image

But their swarming behavior around schools of fish gives the lie to Griff’s lack of worry. Bird migration routes have nothing to do with it. Gannets go where the fish are.
comment image

KevOB

I’m not going to offer a detailed solution, but here is my suggestion from over 45 years of working with, designing, and building computers, and as a former computer shop owner with a service department. Most of my family also use Linux.

First rule: “Do no harm”. Turn it off and keep it turned off until you find the right technician you can get it to. They are scarce. Your problem may not even be software. With older machines, component failure occurs in so many ways, and ICs not infrequently die slowly in stages. When you do not yet know exactly what the fault or possible combination of faults is, allowing the machine to be powered on can cause an even worse situation. Expertise in software is not necessarily sufficient when age and particular hardware are considered. A machine with a component in its death throes can go from recoverable to impossible over a short time.
It may be software, but you need someone skilled overall with this. One blessing: Linux is robust, and provided the read/write subsystem of the hard drives has not physically screwed up critical data, your basic data is most likely intact. If so, putting the disk in another machine for cloning and then attempting to copy the appropriate partitions is likely to be successful.

Tom

Not sure which Kernel version you are running, but there have been bugs introduced before that caused this in the older 2.6.31 kernel. When you boot up, if you have a grub menu with a previous kernel version, try booting from that instead and see if it works.

BallBounces

Anthony — please provide us with an update when this is resolved.

Non Nomen

I’ll keep my fingers crossed that this mess ends well. I lost data on a HDD a while ago and it was a bloody pain in the a*ss to get it recovered.

PiperPaul

Copy the drive contents and then take the old one and throw it into a dumpster. Before you throw it in, though, write “Tr*mp Ru*ssian Sekrets” (use the letter ‘k’ so it looks authentic, maybe write it backwards, too) on the external case of the drive and then call the New York Times with an anonymous tip. Bingo! The drive will be recovered and copied in no time all over the internet for easy access by you later. But I can’t guarantee that the content won’t be ummm, adjusted if some climate “scientists” are involved somewhere along the way.

I have worked with, built, sold, serviced, designed, etc. computers for 45 years. I am also an intermediate Linux user.
Much good advice here, but much of it ignores the first rule of service: “Do no harm”. What matters is not what you think might work but the certainty that the data will not be further compromised. So turn the crook machine off, and for something this important do not attempt any recovery in the meantime.

I am also concerned with the age of the software and possibly the hardware. Computers at all levels are never static: subtle upgrades are constant, and so are the internal conflicts that arise, timing states, etc. These conflicts are not always significant or evident until catastrophic failure. A generation of hardware for a system builder is only 3 months. It’s been like that for 30 years. (Look at the version numbers on parts.)

In my opinion you need an expert tech in older machines as well as a Linux one. A rebuild is probably called for.

Good news: unless the file system has been damaged at the physical read/write level your data is probably intact. Keeping the machine unpowered is your best protection at this time while you find the right tech.

Another hazard is a failing component. Such components can do the oddest things while dying, and recovery may be possible at an early stage and impossible only minutes later. The more restart attempts you make, the greater the risk.

Best regards for your fine work.

Man Bearpig

Do you still have the original drives? It may be that one went out of sync before the other.

The second post in the thread was correct. Go to linuxquestions.org. You WILL get expert advice there.

Don K

Anthony, It looks like you have lots of help. I’ll just bow out of this discussion, but let me caution you against a few common ways to screw up the recovery.

1. If at all possible, work on a copy of the disk, not the original

2. If you copy with dd (bit for bit copy) check the “of=” parameter several times. Take a short break then check it again. It **MUST NOT** be the source device. You can probably survive botching the “if” parameter, but not “of=/dev/whatever”

3. If someone tells you that you need to run cfdisk, sfdisk, fdisk, gparted, or parted, think about it long and hard. If you screw up your partitioning, your data is gone.

4. If you copy using file system tools cp, cpio, tar, etc. (pretty much anything but dd) you will need to mount your source and destination devices. Remember to unmount them when you are finished. If you don’t, buffers may not be flushed and you may lose data.

… And the unmount command is umount, not unmount

… And if you can’t umount, it is probably because you have moved yourself into the mounted file system with cd for some reason. It’s OK to do that, but you can’t umount until you move out of it.

5. Oh yes, if the amount of data you need to save isn’t enormous, you can back up occasionally to a USB flash memory stick. Devices up to 64 GB are pretty cheap. It’s slow, but you can wander off and do goodly works elsewhere. Use a stick with an LED indicator and don’t pull it out ’til it quits flashing and the umount command exits. BTW, once you have written a backup, you can write subsequent backups of the same filesystem to the same device much more quickly with rsync. In your case, I’d do two or three backup sticks per server, rotate them, and back up every few days or weeks depending on how many days of data you’re willing to risk losing.
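On the umount point in item 4: a script can check whether its working directory sits inside the mounted file system before trying to unmount. This is only a sketch in Python, with a made-up helper name `is_inside` and a hypothetical mount point; it tests the directory of the process that runs it, so other programs holding files open there can still keep the device busy.

```python
import os

def is_inside(path, mount_point):
    """True if `path` is the mount point itself or lies anywhere below it."""
    path = os.path.realpath(path)
    mount_point = os.path.realpath(mount_point)
    return os.path.commonpath([path, mount_point]) == mount_point

# Hypothetical mount point for the backup stick.
if is_inside(os.getcwd(), "/mnt/backup"):
    print("cd out of /mnt/backup before running umount")
else:
    print("current directory is not under /mnt/backup")
```

Running it from the same shell session that will issue umount is what makes the answer meaningful, since the script inherits that shell's current directory.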