RAID: Keeping Your Computer Running During Failure

The Short Story

If you earn a significant portion of your income working on a computer, you should consider installing a second, redundant hard drive and setting up a RAID array. If one drive fails, the other takes over, and you can keep working through a disk failure, which is one of the most common hardware failures.



The Full Story

There are lots of things that can go wrong with your computer. Hard drives, circuit boards, keyboards, monitors: they can all die without a moment's notice. If you make your living on a computer, having the system out of commission means losing money.

In my case, for example, that loss of money is very easy to compute. I can't make money without a functioning computer. Let's say I make a (very fictional, mind you) $500 per hour. If my keyboard breaks, and I have to take an hour out of my day to go to CompUSA and buy a new one, I've lost $500.

The key to avoiding this loss is redundancy. Just like with my backup sermon before, having replacements lying around can save a whole bunch of time. At $500 per hour, 1/2 hour of reclaimed downtime would pay for an entire spare computer.

Spare Hard Drives

One of the most common things to die on a computer is the hard drive. Hard drives are cheap, but they can be a pain to replace. And unless you're a computer expert, you may not have the knowledge to replace one yourself.

And there's another consideration that sets hard drives apart. When your mouse fails, you plug another one in, and you're running. When your hard drive fails, you have to install the new one and then restore from backup. This can be an all-day process. And unless you have a helper, you'll be stuck minding your computer, unable to do anything else.

There is a simple solution that the engineers building big systems have come up with: it's called RAID. RAID stands for "redundant array of inexpensive disks", and here's the idea: you take your spare disk, and install it in your computer with your main disk. The computer will put your information on both drives, and if one drive fails, the other will continue on working. This way you can work through the failure!
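To make the idea concrete, here is a toy model of mirroring in Python. It is only a sketch of the concept, not how any real RAID software is written: the MirroredArray class and its method names are made up for illustration, and each "disk" is just a dictionary in memory.

```python
class MirroredArray:
    """Toy model of a mirror: every write goes to all working disks."""

    def __init__(self, disk_count=2):
        # Each "disk" is a dict mapping a block number to its data.
        self.disks = [{} for _ in range(disk_count)]
        self.failed = set()

    def write(self, block, data):
        # Put the same information on every disk that is still alive.
        for index, disk in enumerate(self.disks):
            if index not in self.failed:
                disk[block] = data

    def fail(self, index):
        # Simulate a dead drive: its contents are gone.
        self.failed.add(index)
        self.disks[index] = {}

    def read(self, block):
        # Any surviving disk can answer the read.
        for index, disk in enumerate(self.disks):
            if index not in self.failed:
                return disk[block]
        raise IOError("all disks in the mirror have failed")


array = MirroredArray()
array.write(0, "quarterly report")
array.fail(0)          # the main disk dies
print(array.read(0))   # the spare still has the data
```

The read succeeds after the first disk fails because the second disk received the same write. Only when every disk in the mirror has failed does the data become unavailable.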

How To Get RAID Working

If you happen to be using a Mac (with OS X) or a PC with Unix (probably not, but hey), then you automatically have everything you need to set up RAID. Your computer includes software that will let you combine your disks into RAID arrays, and you'll be off to the races. If you're using Windows, it's a different story.

The Windows Server editions include software RAID like the Mac and Unix systems above, but the desktop versions, XP and Vista, do not (which is extremely unfortunate, by the way). If you want to use RAID on one of these computers, you'll have to buy some hardware. A RAID controller is a piece of hardware that will let you do the same thing for $150-$200 (on average). Let me fill in some details here...

RAID Details

I mentioned the word array before. When you set up RAID, the goal is to make several real hard drives (physical units) act as if they were one "virtual" hard drive (logical unit). This group of drives is called an array. Your goal will be to take your hard drives and make an array out of them. And there are several different ways you can structure this array:

  • RAID 1 (also called 'mirroring'): I recommend this for desktop systems. All the hard drives in the array are set up to be 'clones' of each other: they will hold exactly the same information. If you have two 300GB hard drives and put them in a RAID 1 array, you will get one 300GB logical drive. You need a minimum of two drives for this to work.

As you do research, you will also see other arrangements (RAID 2, 3, 4, 5, 6, 0, 1+0, etc.). I'm going to cover two other popular ones:

  • RAID 5 (no nickname, sorry): This configuration (typically) requires at least three drives to work (which is why I don't recommend it for desktops), but if you feel the need for a lot of redundancy, here's the benefit: With three of our 300GB disks in a RAID 1 set, the logical unit will be... 300GB (they're all clones, right?). In RAID 5, the logical unit will be 600GB. In general with RAID 5, when you have X disks, the logical size will be (X-1) times the disk size. Add a fourth disk, you get 900GB, etc.
  • RAID 0 (also called 'striping'): NEVER USE THIS FOR ANYTHING IMPORTANT!!! With RAID 0, the system takes the hard drives (say, again, three 300GB disks) and makes one 900GB disk. The system shuffles the information across the three disks for faster access. There is no redundancy (hence it shouldn't be called RAID), which means if any one of the drives fails, the whole thing is lost! However, if the data is not important, and you need fast access, RAID 0 is the way to go.
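The capacity rules for the three levels above can be written down as a small Python function. This is a sketch for checking the arithmetic, and the function name and the minimum-disk table are my own. The two-disk minimum for RAID 0 in particular is an assumption, since the text above does not state one.

```python
# Minimum number of disks per RAID level.
# The RAID 0 entry is an assumption; the others come from the text.
MIN_DISKS = {0: 2, 1: 2, 5: 3}


def usable_capacity_gb(level, disk_count, disk_size_gb):
    """Return the size of the logical drive for equal-sized disks."""
    if level not in MIN_DISKS:
        raise ValueError(f"RAID {level} is not covered by this sketch")
    if disk_count < MIN_DISKS[level]:
        raise ValueError(f"RAID {level} needs at least {MIN_DISKS[level]} disks")
    if level == 0:
        return disk_count * disk_size_gb        # striping: all space, no redundancy
    if level == 1:
        return disk_size_gb                     # mirroring: every disk is a clone
    return (disk_count - 1) * disk_size_gb      # RAID 5: (X - 1) times the disk size


for level in (1, 5, 0):
    print(f"RAID {level}, three 300GB disks: {usable_capacity_gb(level, 3, 300)}GB")
```

For three 300GB disks this gives 300GB, 600GB, and 900GB, matching the figures in the list.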

Hot Swap

If you're setting up RAID on your desktop computer, it's probably okay that you have to shut down the computer to replace a faulty drive. If we're talking about a server, on the other hand, things can get more complicated. Your server may be used by hundreds of people at once, and any downtime cost on the server can get multiplied accordingly. Take an office of 50, average pay $30/hr. The hard drive in the file server dies, and no one can work.

Without RAID, the server is down for six hours while the drive is replaced and the last backup is restored. $30 * 50 people * 6 hours = $9000 of lost productivity.

With RAID, there's no need to restore from backup, but the server still has to go down for an hour while the drive is replaced: $30 * 50 people * 1 hour = $1500 of lost productivity.

Those are some daunting numbers, and hardware engineers have developed a trick to get that number down. Some kinds of hard drives are hot-swappable: this means you can replace them without turning the computer off. $30 * 50 * 0 hours = $0 of lost productivity!
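If you want to plug in your own office's numbers, the three scenarios boil down to one multiplication. Here is a short Python sketch. The function name and scenario labels are mine, and the figures are the example ones from above.

```python
def lost_productivity(hourly_pay, people, hours_down):
    """Dollars lost while nobody can work."""
    return hourly_pay * people * hours_down


# Hours of downtime in each scenario from the example office.
scenarios = {
    "No RAID (replace drive, restore backup)": 6,
    "RAID, shut down to swap the drive": 1,
    "RAID with hot swap": 0,
}

for name, hours in scenarios.items():
    print(f"{name}: ${lost_productivity(30, 50, hours)}")
```

With $30/hr and 50 people, the three lines come out to $9000, $1500, and $0.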

There are two really common types of hard drive that are hot-swap capable: SCSI (really fast, more expensive), and SATA (not quite as fast, cheaper). If you have a newer computer, it probably includes SATA capabilities. If you buy a RAID controller, make sure it's a SATA RAID controller, buy a couple of SATA hard drives, and get yourself set up.

Wrap Up

Probably nothing in your computer will break. Your hard drives will run fine for a few years, and you'll upgrade and get all new equipment. But think of RAID as an insurance policy: for an extra $300 or so, if your hard drive does kick the bucket, you won't be out of commission, and you won't be stressed out trying to find someone to fix your computer as quickly as possible. And the hard drive is one of the most common components to fail.

Further Reading

Wikipedia has a rather large article on all the different types of RAID. For the Mac-o-philes, Apple has some documentation, and frozennorth.org has a pretty straightforward HOWTO on setting up RAID on a Mac (N.B.: he mistakenly calls mirroring 'RAID 0'). Google has used their vast number of servers to perform a reliability study on hard drives (extra geeky).

WHY WAS I SO HARD ON STRIPING ABOVE? Well, let's look at the following scenario. You have two brand new hard drives. Let's say each hard drive has a 10% chance of failing within the first year (which might not be too far off according to the Google study above). This means that on any given day, there's about a 0.03% chance (1 in 3650) of a given drive failing. And I'll also say that if a drive fails, you'll replace it the next day.

For the mirror set to fail in this scenario, both drives have to fail the same day. Over a year, the chance of that happening is roughly 0.003% (1 in 36,500). For the stripe set to fail, either drive can die on any day. Over a year, the chance of that happening is roughly 19% (about 1 in 5). That's a huge, huge difference. And you're worse off than if you had just one drive!
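Here is the same back-of-the-envelope calculation in Python, in case you want to try a different failure rate. It is a sketch under the assumptions stated above (independent drives, a 10% annual failure rate spread evenly over 365 days, next-day replacement), and the variable names are mine.

```python
DAYS_PER_YEAR = 365
annual_failure = 0.10                             # assumed chance one drive dies in a year
daily_failure = annual_failure / DAYS_PER_YEAR    # about 1 in 3650

# Mirror: the set is lost only if both drives fail on the same day.
mirror_daily = daily_failure ** 2
mirror_annual = 1 - (1 - mirror_daily) ** DAYS_PER_YEAR

# Stripe: the set is lost if either drive fails at any point in the year.
stripe_annual = 1 - (1 - annual_failure) ** 2

print(f"Mirror: about 1 in {round(1 / mirror_annual):,} per year")
print(f"Stripe: about {stripe_annual:.0%} per year")
print(f"Single drive: {annual_failure:.0%} per year")
```

Under these assumptions the mirror comes out at roughly 1 in 36,500 and the stripe at about 19%. Simply adding the two 10% risks gives 20%; the exact figure is slightly lower because the case where both drives fail is only counted once.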

 

[Figure: RAID Formulas]

 

NOTE TO THE UNIX-HEADS: If you set up mirroring, make sure that your swap areas are mirrored, too! The point of RAID is to keep your system running in the event of a hard drive failure. That can't happen if your swap goes away while an executable is paged out.