Is RAID 5 a risk with higher drive capacities?

There’s a very interesting discussion going on over at ZDNet about RAID 5 and hard drive capacities. The premise of the discussion is that unrecoverable read errors are rare per bit read, but statistically, we’re approaching disk sizes where they will start to matter. Here’s a quote from the blog entry:

“SATA drives are commonly specified with an unrecoverable read error rate (URE) of 10^14. Which means that once every 100,000,000,000,000 bits, the disk will very politely tell you that, so sorry, but I really, truly can’t read that sector back to you. One hundred trillion bits is about 12 terabytes. Sound like a lot? Not in 2009.”

That would mean a bad block when trying to read. It wouldn’t be such a problem, except when it happens while you’re rebuilding a RAID array after a drive failure. Annual failure rates for new drives are about 3% in each of the first three years; after that, the rates rise quickly, according to that author and the Google study he referenced for the numbers.
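Those per-drive odds compound across an array. Here’s a minimal sketch of the arithmetic, assuming independent failures and the ~3% annual rate cited above; the 8-drive array is a hypothetical example, not a figure from the article:

```python
def p_any_drive_fails(n_drives, afr=0.03):
    """Chance that at least one of n_drives fails within a year,
    assuming independent failures at annual failure rate `afr`."""
    return 1 - (1 - afr) ** n_drives

# An 8-drive array at 3% AFR: roughly a 1-in-5 chance of seeing
# at least one drive failure per year.
print(f"{p_any_drive_fails(8):.1%}")  # about 21.6%
```

The takeaway is that an array sees failures far more often than any single drive does, which is exactly when the rebuild scenario below kicks in.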

So the problem becomes a RAID 5 array with a drive failure. Pull the disk out, add a new one in, and the array has to rebuild. Statistically, you can expect one unrecoverable read error for roughly every 12 TB read, so the larger the rebuild, the better the odds it fails partway through.
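To put a number on that: assuming independent errors at the quoted 10^14-bits-per-URE spec, the probability of hitting at least one URE grows quickly with the amount of data a rebuild has to read. A rough sketch; the 6-drive, 2 TB-per-disk array is a made-up example:

```python
import math

URE_RATE = 1e-14  # unrecoverable read errors per bit, per the quoted SATA spec

def p_ure_during_read(bytes_read, ure_rate=URE_RATE):
    """Probability of at least one URE while reading bytes_read bytes:
    1 - (1 - rate)^bits, computed via log1p/expm1 for numerical stability."""
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-ure_rate))

# Rebuilding a 6-drive RAID 5 of 2 TB disks reads the 5 surviving
# drives in full -- 10 TB -- giving better-than-even odds of a URE.
print(f"{p_ure_during_read(5 * 2e12):.0%}")  # about 55%
```

Reading the full 12 TB figure from the quote pushes the odds past 60%, which is why the author treats large single-parity arrays as a gamble.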

Commenters have pointed out that the loss of a single block doesn’t necessarily mean the array can’t rebuild, just the loss of that particular bit of data, since there’s no longer any redundancy to recover it from. With backups, you can restore the individual file and have a functioning array. I think it would depend on the controller, but I don’t have any data to back that up.

The author argues in favor of more redundant RAID mechanisms. RAID 6 can tolerate the loss of two drives, and other RAID levels can survive even more, depending on the particular failure.

Just the other day, I had a RAID 0 fail, but that was from the controller dying. Have you ever had an array die during rebuild? How traumatic was it, and did you have a backup available to recover?

Also, if you could use a RAID refresher, I mentioned them a while back.

  • Michael Janke

    I can’t think how multi-drive failure on an ordinary RAID 5 controller would be anything other than ‘recover from tape’.

    I suspect that if that happened you’d be calling Ontrack and crossing your fingers.

  • JeffHengesbach

    I’ve been through many single drive failures in various Raid 1|5’s but thankfully no doubles /knocks on wood/.

The other big concern today is the number of drives in a raid group. The more drives, the higher the probability of failure. The big players have always known this, but with the sheer amount of data today and the low cost of spindles, lots of medium/small shops haven’t thought about how scary the possibility of complete failure is when rebuilding a R5 made of several large drives.

  • Jim

I had a RAID 5+0 overheat and fail on me during a rebuild (my fault, don’t ask). Thankfully, my boss (and I, for that matter) are paranoid about losing data, so we had a good backup from before we started the rebuild and were able to be up and running on Monday. It did make for a very long night, though.

  • Peter

Just to add to this a long time in the future: I just had a two-disk RAID failure!
While a 15-disk RAID 5 array was rebuilding, the second disk failed.
The whole system went down and we lost 12 TB of data.
This is not a pipe dream and needs to be taken seriously. RAID 5 mathematically has a 1 in 5 chance of failing in the first three years.
I know for sure that isn’t a good deal in an enterprise environment.
It will be interesting to see the probabilities for 4 or 6 TB drives and what happens to the MTBF.