I come not to praise RAID-5...

Date: August 3, 2012

...but not really to bury it, either.

Edit: Just as a warning, I have ignored many aspects of RAID-level choices here and concentrated on the likelihood of catastrophic failure. There are lots of reasons not to select RAID-5 in a new array, but a comprehensive review of all aspects of the technology is way beyond the scope of a single blog entry. Take this entry for what it is: an explanation of a particular phenomenon of failure, and what makes it more or less likely. Don't base your decision solely on this article; let it be what it was meant to be, a piece of the puzzle.

RAID-5 is misunderstood by too many people.

It's cyclical, to be sure. We started off with no one knowing what RAID was, then we went to everyone knowing what RAID was (or at least, being familiar enough with it to mistake it for being a backup). RAID-5 was sweet, because it had a parity stripe, and if you had a bunch of drives shoved together, it could tolerate the loss of a drive. How awesome was that?

Then Robin Harris went and wrote Why RAID-5 Stops Working in 2009. It was an excellent article, and it was exactly right for the purpose Robin meant it to serve. It's been a good way to point out to people that maybe they shouldn't have RAID-5 configurations with 40TB worth of the cheapest SATA drives known to man.

But people started equating RAID-5 with being broken or bad. Or always wrong. And it isn't. You can't just take the shortcut and say that a technology like that is absolutely improper because it breaks down under certain circumstances. So I'm going to try to give you my perspective on this by recycling something I put together for an individual who had questions about when to use RAID-5 and when not to. I hope this helps clarify things.

Starting with the basics...hard disk drives that are not flash based operate by having small "heads" move back and forth across spinning platters (usually made of glass or ceramic these days, but there are still aluminum ones floating around). Think of it like a record player arm, where the needle is the head.

The head can read or write, depending on its instructions. Each hard drive has multiple platters, with one head per platter surface (so usually two heads per platter, one above and one below). Good so far?

OK, so the drive writes data and reads data...but it's reading incredibly small magnetic data from the platters. You know how sometimes the magnetic cards in your wallet get erased from being around magnets? Those bits are HUGE compared to the ones and zeros on a hard drive platter.

The bits in the drive are not stored in your wallet, of course; they're inside a metal case, which is inside your computer...but occasionally strange things happen: maybe something bumps the disk while the head is reading or writing and gouges the surface (remember, the platters are spinning at 5,400RPM at least!), maybe dust gets into the drive and blocks a sector, or maybe a cosmic ray flips a bit on the platter (really!).

Whatever it is, something stops the head from being able to read the data on the disk. This is called an Unrecoverable Read Error (or URE for short).

The likelihood of encountering a URE depends on things like the build quality of the head and the rotational speed, but mostly on how densely the bits are packed on the disk (remember how your magnetic stripe has big bits? Big bits are much harder to accidentally flip).

Manufacturers often figure out how likely a URE is for a particular drive, and they usually put it in the documentation for that drive. You can find it if you go to Pricewatch and pick a hard drive at random, then look up the product manual for that part number. Here's an example of the Western Digital Green Line, or maybe you'd prefer a Seagate Barracuda.

Anyway, get the product manual, and look in the specs for something like "Non-recoverable read errors per bits read", or "Non-recoverable read errors". What this gives you is the likelihood of your encountering a URE.

Both of those examples have "1 per 10^14 bits read". If we use the handy-dandy Google calculator, you can see that it really means one URE per 11.3 terabytes.

So, putting all of this together...if you have a 3TB drive and you fill it to capacity, you could probably read from it over 3 times completely before you'd encounter a URE, and lose the data that was held by that particular sector.

That's kind of unsettling, right? 3 full reads on a 3TB drive, then a high likelihood of going kaput on the 4th read-through?
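
If you'd rather check that conversion yourself than trust the Google calculator, here's a quick Python sketch of my own (the 10^14 and 10^16 rates are the spec-sheet figures discussed in this post):

    # Convert a spec-sheet URE rate ("1 per N bits read") into the
    # expected amount of data you can read before hitting a URE.
    def ure_interval_tib(bits_per_ure):
        bytes_per_ure = bits_per_ure / 8
        return bytes_per_ure / 2**40   # express in binary terabytes (TiB)

    print(ure_interval_tib(1e14))         # consumer drive: ~11.4 TiB per URE
    print(ure_interval_tib(1e16) / 1024)  # enterprise drive: ~1.1 PiB per URE

The same little function shows why the enterprise drives discussed further down look so much better: two extra orders of magnitude in the URE rate moves you from terabytes to petabytes between errors.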

Fortunately, we use RAID levels which can protect us.

Let's start with RAID-1, which is mirrored disks. We are reading along happily when suddenly we get a URE on one of the drives! It would be sad, but on the other drive we have an exact copy of the data as it was originally written! The RAID controller reads the data from the other drive, re-writes it on the drive that had the URE, and we continue merrily along.

Suppose we lose a RAID-1 drive. That leaves us with a good copy, but assuming the array was full, when we put a replacement drive into the array, we've got to re-read all of the data on the good drive. Keeping in mind that we're likely to experience a failure after reading 11.3TB, what is the statistical probability that we'll experience a URE reading 3TB? Around 1 in 4.

Now, let's move on to RAID-5. You need at least 3 drives in a RAID-5, because unlike the exact copy in RAID-1, RAID-5 stripes parity across the drives, so any individual piece of data can be lost and, instead of being recovered by copying it, it's recalculated by examining the remaining data and parity bits.

So when we encounter a URE during normal RAID operations, the array calculates what the missing data was, the data is re-written so we'll have it next time, and the array carries on business as usual. But when a drive dies, we have to replace it, and that's when things get hairy.

In order to rebuild the array, the new drive needs to be populated, and in order to do that, the entire contents of the remaining drives need to be read, so the missing data can be recalculated from the surviving data and parity. Assuming we have a RAID-5 array of three 3TB disks, we're now reading 6 terabytes of information. What is the statistical likelihood of encountering a URE?

1 in 2. A coin-flip.

Have a RAID-5 array with four 3TB disks? Now the rebuild reads 9TB, which puts you around 4 in 5: very likely a failure. You can see how quickly this goes downhill.
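
To make the rebuild arithmetic concrete, here's a small sketch (mine, using the simple proportional model from this post, with the slightly more careful independent-trials version alongside for comparison):

    import math

    # Odds of hitting at least one URE while reading `tb_read` terabytes,
    # on drives rated at one URE per `bits_per_ure` bits.
    def rebuild_ure_odds(tb_read, bits_per_ure=1e14):
        bits_read = tb_read * 1e12 * 8        # decimal TB -> bits
        expected = bits_read / bits_per_ure   # expected URE count
        return expected, 1 - math.exp(-expected)

    for label, tb in [("RAID-1 rebuild, one 3TB drive  ", 3),
                      ("RAID-5 rebuild, 3x3TB (6TB read)", 6),
                      ("RAID-5 rebuild, 4x3TB (9TB read)", 9)]:
        expected, prob = rebuild_ure_odds(tb)
        print(f"{label}: ~{expected:.2f} expected UREs, P(>=1) ~ {prob:.0%}")

The exponential model lands a little lower than the rough ratios above (the chance of at least one URE never quite reaches 1 in 1), but the trend is the same: every drive you add makes the rebuild read bigger and the failure more likely.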

Now, a lot of people see this, freak out, and say "oh my god, I'm never using RAID-5 again! RAID-5 is the devil! It's EVIL!", but remember what is driving the numbers...it's the URE rate.

Check out this Hitachi Ultrastar. Its URE rate is 1 per 10^16 bits read, which works out to a kind-of-amazing 1.1 petabytes...so the odds of your Ultrastar-based RAID-5 array dying during a rebuild because of a URE are...very low.

So you can't vilify RAID-5 across the board. It's very much a matter of what you're using it for, what the quality of the drives involved is, and what the capacity of the array is (or really, how much data is stored on the array).

Does that help you understand? Have questions, comments, or suggestions? Please leave them below, thanks!

  • Warner

    Well, sure. The (newer) rule of thumb is that RAID5 should not be used with dense storage (i.e. 1-2TB+). With smaller disks, it is still cool. RAID6 helps offset that, but with the quantity of disks being used, you might as well go 1+0.

    If you care about your data and performance, with the lower cost of storage these days, you might as well use 1+0 in most cases.

    A nice write up though, a quality introduction.

  • Tim

    Great intro to how RAID5 works, but nothing about the performance penalty on writes to the disks? This is the principal reason that people "hate" RAID5. I know I've struggled with PHBs over the years trying to explain the difference between RAID5 and RAID1+0, and why RAID5 is not going to be used on my database server!

  • http://www.smbitjournal.com Scott Alan Miller

    That didn't really solve the problems with RAID 5. Yes, you can spend more money to make it safe (ish) but in doing so you've made it no longer cheap - which was its only saving grace to begin with. So you still aren't as safe as RAID 10 nor are you as performant. But you end up costing as much.

    Just because you can make RAID 5 safe enough to not be crazy, can you make it make sense in any particular deployment? Maybe in a very niche case where you are trying to squeeze a low performance, high capacity system into an already existing chassis and RAID 10 just won't fit? That would be very rare and that would imply some really high capacities that would rule out RAID 5 because of resilver times.

  • http://www.smbitjournal.com Scott Alan Miller

    Tim nails it. RAID 5 is hated because of the combination of reliability, performance and unavailability during recovery.

    We've always had ways to make RAID 5 reliable. SAS drives have had URE rates low enough to make them just fine.

    The article that you were discussing was based on the assumption that you are using RAID 5 to cut corners - to lower your budget. Going with expensive drives ruins that assumption and brings all of the other issues like performance and cost back into play so focusing on URE failures during resilver isn't really the factor at hand.

    So I don't see where you've addressed the issues.

  • http://peelman.us Nick

    It's also worth noting that the context of the content matters when choosing RAID, and for some things, the file system sitting atop the RAID array is going to compensate for those single-bit errors. Are you storing a massive image archive, or content for a home media server? A few errant bits across many terabytes aren't going to hurt you too much.

    Also, 1 bit in 11 terabytes is still pretty phenomenal, though admittedly the Ultrastar's number is better. Unless you are talking scientific data, where absolute bit-precision is paramount*, a single flipped bit isn't going to make much difference, especially if the filesystem's checksums catch it. To get multiple bits flipped, the odds change; to get multiple bits flipped in sequence, the odds change even more.

    So while I agree with Warner that the benefits of RAID 5 fall off when using larger disks, it's still better than JBOD, again, depending on the context of the data being stored.

    * I still argue that an absolute level of precision is impossible and you are still going to require parity data somewhere in the chain to maintain integrity.

  • http://www.smbitjournal.com Scott Alan Miller

    You should check out SpiceWorks. RAID 5 is discussed there more than anywhere else, I would imagine, and everything here has been debunked thoroughly for actual business use. The ability to use low URE drives was always assumed, just not found to apply in the real world.

    Here are some articles addressing this that you might want to check out:

    http://www.smbitjournal.com/2010/02/raid-revisited/
    http://www.smbitjournal.com/2012/05/when-no-redundancy-is-more-reliable/
    http://www.smbitjournal.com/2012/07/hot-spare-or-a-hot-mess/

  • http://www.smbitjournal.com Scott Alan Miller

    "Check out this Hitachi Ultrastar. It's URE rate is 1:1016 which is a kind-of-amazing 1.1 petabytes...so the odds of your UltraStar-based RAID-5 array dying during a rebuild because of a URE is...very low."

    But what would the resilver time be? Months? How long can you be without normal performance, and quite possibly without availability, on an array that large? And that is a long, long time to go while hoping you don't lose a second drive. Fear of losing a second drive isn't a factor until you start coming up with ways to make mammoth RAID 5 arrays; then issues come back that really don't exist elsewhere.

  • http://www.smbitjournal.com Scott Alan Miller

    @Nick: The issue with UREs in RAID 5 (but not in RAID 1 or 10) is that if a URE is encountered during a resilver operation, the entire array is lost, not just one bit. It is the resilvering of the parity that causes this failure.

    So what sounds like "a bit flip" is actually "catastrophic array failure" if hit during a resilver. And a resilver has to read every bit on every disk in the array, sometimes more than once. So even a moderately small array is going to read many TB of data during a resilver. Lots of opportunity for the array to fail completely.

  • http://www.smbitjournal.com Scott Alan Miller

    With RAID 1 and RAID 10 (mirrored systems, not parity) a URE during resilver simply means that one bit is bad. No big deal (hopefully it wasn't in your database.) No different than hitting a URE on a regular drive on your desktop. You probably won't notice.

    It is the parity of RAID 5/6 that puts the array at incredible risk of UREs.

  • http://www.standalone-sysadmin.com Matt Simmons

    Hi everyone,

    to reply to people in general:

    No, I didn't address the performance penalties of RAID-5 (and no, I didn't mention RAID-6). I was concentrating on the chances of catastrophic failure during rebuild.

    There are performance reasons why you would not prefer RAID-5 (or RAID-6 for that matter), but all other things being equal, using RAID-5 is almost always a choice of economics, not technical feasibility.

    If you have a certain amount of money to spend, and a certain capacity you have to hit, then RAID-10 is maybe not a viable choice for you. RAID-5 might be, or it might not be. But if you are going to spend money on RAID-5 at all, then you should be aware of when it will die on you.

    Alternatively, lots of big arrays with many shelves of disks still use RAID-5 as their go-to RAID level. I've read many, many kneejerk responses from people who say, "RAID 5? I hope you don't like your data!" because they have heard RAID-5 will lose your data, but they didn't understand the underlying reasons, and that's what I was trying to communicate in this blog post.

    I did not want to discuss the economics of it, because everyone's budget is completely dissimilar. I work in a place where the guy setting the budget remarked to me during an interview, "It was only ten grand, and I wasn't going to let something small like that get in our way". Contrast that to the previous company I worked for, where $10k might have been two years worth of IT capex. And the previous company had higher uptime requirements than the one I'm working for now!

    Everyone is different, and each person knows more about their situation than anyone who doesn't work in it, so if I can help people understand the universal basics, such as "why RAID-5 loses your data, and when", maybe they'll be able to make better decisions about their infrastructure.

  • http://www.smbitjournal.com Scott Alan Miller

    The problem is that RAID 5 is a "corner cutting" system for budgetary reasons. So ignoring the cost factor makes it not meaningful to real-world system admins. The fact that RAID 5 was dangerous only existed within the context of the cost of the system.

    You could have summed it all up by saying "RAID 5 remains technically viable as long as it no longer needs to be considered within the context of its cost."

    But you could apply that to almost anything.

    No enterprise storage vendor sells RAID 5 today. Even RAID 6 is effectively gone. What storage vendor are you using?

    People's dissimilar budgets aren't the issue. Whether I have $5 or $5m, it doesn't change that in order to make RAID 5 safe enough (but never safer) we have to waste money. Wasting money just to make a technical point.

    The cost of the drives necessary to make RAID 5 safe offsets any advantage you get from using it, and doesn't address the other issues.

    Yes, there are many knee-jerk reactions to RAID 5, just as there used to be knee-jerk reactions for it in the other direction previously. But I feel that you are telling people that RAID 5 is safe because you can, in theory, eliminate one particular problem, and then acting like its other issues don't exist. People looking for an excuse to use RAID 5 will read this, think that you've addressed all the issues, and put their businesses at risk unnecessarily.

    The issues with RAID 5 are far too complex and numerous to address in such a way. It is the combination of factors that eliminates RAID 5 from consideration for anyone who is not a storage expert, as the factors are just too complex for anyone needing to read an article of this type.

  • http://www.smbitjournal.com Scott Alan Miller

    The fundamental problem with RAID 5 from a budget standpoint is that...

    If you are cutting corners on your RAID, you will not be looking at good drives. If you have the budget to spend, you can balance it out more cost-effectively in more reliable ways.

    Continuing to use RAID 5 for existing deployments can be justified. We need not vilify every attempt to utilize what already exists, as rip-and-replace can be very costly. But in continuous discussion on this topic, every day for two years in the SpiceWorks community with ardent supporters of parity RAID, no one has yet come up with a "new" deployment scenario where RAID 5 could be justified. And each day, as drives get larger, the likelihood of the real-world, niche scenario where it might make sense dwindles.

    Is there a niche case where experts could justify it? Yes. Is there a niche case that could afford those specialists vs. overbuying for safety? I'm apt to say no.

  • http://www.standalone-sysadmin.com Matt Simmons

    >I feel that you are telling people that RAID 5 is safe because you can, in theory, eliminate one particular problem and then act like its other issues don't exist.

    I wasn't addressing the other issues, but they certainly exist. As you say in your next paragraph, it's a complex topic. A comprehensive blog post wouldn't be a blog post - it would be a novella.

    Be that as it may, though, the reality is that a *lot* of people out there (most, actually) don't have the time to become storage experts (and I'm not one either), yet they have to deal with RAID-5 for legacy reasons. If you have an old system running RAID-5 and you go to set up a new array, you might automatically select RAID-5 because, hey, that's what you've already got. And that's a bad decision, but people need to know WHY it's a bad decision.

    I'm never a fan of encouraging people to make uneducated choices. I've added a disclaimer at the top of this entry to let people know that this isn't an all-inclusive article, but I stand by it as written.

  • http://www.smbitjournal.com Scott Alan Miller

    That's a tough one because if they are to educate themselves on all of the issues they would become a storage expert :) If they don't do that, then they are forced into a position of a decision that they don't understand. The big risk is in explaining only one facet because without understanding all of the factors those who were avoiding RAID 5 for reasons they did not understand might now justify it again, for reasons they don't understand.

    So the issue isn't solved. My argument is that if someone is going to implement RAID and not understand all of the factors, then they must use RAID 10 or take on incredible risk because they don't understand, necessarily, when RAID 5 will be safe or, at the very least, safe for them.

    Any argument for RAID 10 can safely be stated as "unless you know exactly why to do otherwise, stick with RAID 10." Any argument for RAID 5 requires a complete dissertation because there are so many dangers and so many of them are so complex.

    Basically if you err on the side of RAID 10 even when wrong, it's not a bad choice. If you err on the side of RAID 5, when it is wrong it is often disastrous.

  • http://www.standalone-sysadmin.com Matt Simmons

    Mirroring is always a decision of economics. If you have the extra money, then mirror. But if you don't, then what do you do?

    Lots of people don't have the funds, and they've got to make a decision. You can't force someone into being a storage expert, but you can help give them information.

    Write a blog entry that takes all of the things into account, and I'll link to it.

  • chris

    Okay Matt, you've sufficiently scared me.

    My home fileserver is likely going to be converted to RaidZ2.

    Thanks for that.

  • http://www.standalone-sysadmin.com Matt Simmons

    Scott: See, success! ;-)

  • http://www.smbitjournal.com Scott Alan Miller

    LOL, well that's a start :) For those unaware, RAIDZ2 is a version of RAID 6 with some extra niceties implemented in the ZFS filesystem.

  • http://www.smbitjournal.com Scott Alan Miller

    My point about mirroring, though, is that it is not a question of economics, at least not when compared to RAID 5 (compared to RAID 6, I'll grant.) The reason being that the cost of making RAID 5 safe enough to use also causes it to become more expensive than a mirror-based system, while still being slower, less reliable, and carrying the rebuild performance risks.

    If RAID 5 systems were all free and mirrored systems were not, then yes, RAID 5 can be overbuilt enough to make it very usable. But the cost of that overbuilding defeats the use cases where it would make sense.

    In theory, someday, super low URE drives will exist at a very low price point. That would bring RAID 5 back towards an acceptable place. But this is not expected. Drive size growth has always outpaced drive reliability improvements.

  • http://www.standalone-sysadmin.com Matt Simmons

    URE rates on SSDs are amazingly low too, if the mfg numbers are to be believed. I'm not saying it's an awesome idea, but you could totally make a RAID 5 set out of SSDs. ;-)

  • http://www.smbitjournal.com Scott Alan Miller

    Yes, RAID 5 on SSD can make sense. SSDs tend to be very small as well, so they are years beyond spindles on the curve for URE / size ratios.

    But SSD users tend to have large budgets too :)

  • Trever

    Don't believe everything you hear about SSDs. I've seen ~3 of them fail in the last 6-9 months in my office. About 30 desktops, each with an 80-120GB SSD. Some of the failures were ~3-year-old devices; at least one failure was less than 6 months old.

  • http://www.standalone-sysadmin.com Matt Simmons

    Trever: I've heard that pretty much everyone has had trouble with the controllers (since that's kind of where the magic is; the underlying flash is all made by just a couple of companies).

    Of course, when your drive dies, it doesn't matter whether it's the flash or the controller; you're still rebuilding.

  • Andy

    Assuming that you need to replace one of the disks in a RAID 5 array, if a URE is encountered on one of the remaining disks during the rebuild, does this result in an immediate fail of the rebuild? This seems like a huge consequence from a tiny issue. I may be misunderstanding a few concepts here so apologies for anything ridiculous in the following...

    If a URE occurs on a regular drive that isn't part of a RAID array, I assume the only consequences relate to the inability to read the file which is stored (in part) on the unreadable bit. If that file happened to be a photo, for example, that's just one photo lost - the rest of the drive is still usable. But if a URE occurs on a disk in a RAID array you essentially lose everything when it becomes necessary to rebuild the array? Does the RAID controller not have any mechanism to deal with this? If a bit can only be a 0 or a 1, I'd prefer the controller have a guess and maybe throw up a warning than to give up entirely. I appreciate that's a very simplistic view and a wrong bit can have serious impact, but I was hoping that the RAID controller might have a few tricks up its sleeve?

    Excellent article though.

  • http://www.smbitjournal.com Scott Alan Miller

    @Andy Yes, a URE encountered during a RAID 5 rebuild means total loss of the entire array, even though no additional drive has failed. As this is orders of magnitude more likely than losing a second drive, this is why we point out that counting the number of drive failures that RAID protects you against is misguided. UREs are the big risk, not second drive failures.

    I've never heard of any array that can survive a URE failure if using parity. If you don't want the risk, just avoid parity RAID. RAID 1/10 don't have that problem.

  • http://www.smbitjournal.com Scott Alan Miller

    If you check my "Hot Spares or a Hot Mess" link above, Andy, there is a good breakdown (I think) of how RAID 5 handles URE failures - in the context of why a hot spare actually makes things worse, as it automates your disaster before you have time to manually stop it.

  • Ryan Malayter

    The RAID-5 haters out there conveniently overlook the nasty failure modes of RAID-10.

    First, with 3 TB drives, you still have a 1/4 chance of a URE and blown array during rebuild. It's basically the same risk as a small RAID-5 set.

    Secondly, the odds of double drive failure taking you down are only reduced by 1/N, where N is the array width. You are not protected against double disk failure to the degree that you are with RAID-6.

    Finally, large controller and OS caches ameliorate most of the performance problems of parity RAID.

    NetApp

  • Ryan Malayter

    Got cut off there. NetApp and other array manufacturers do advocate double-parity RAID for many (most?) use cases, as the caches absorb the performance hit and make the extra safety worthwhile. The newer purpose-built all-flash array vendors use double-parity (and even inline dedupe!) almost exclusively.

    Parity RAID is poised for a serious comeback as solid-state storage takes over. We all use "parity RAID" for our server RAM already; ECC and parity RAID are essentially the same thing.

  • http://www.smbitjournal.com Scott Alan Miller

    @Ryan No one has overlooked RAID 10 failure modes. It has been discussed both in the comments here and in the links provided in the comments and in the description. URE has no impact on mirror rebuilds. The risk is the parity failing on rebuild. The URE risk to array rebuild is unique to parity RAID (RAID 5/6.) Mirrored RAID will remirror happily with or without encountering a URE.

    And even if URE was a risk with RAID 10, which it is not, you never have a RAID set as small as a capacity equivalent RAID 5. The closest you would ever get is 67%. And that would be rare as RAID 5 is never recommended with so few spindles because of the performance hit.

  • http://www.smbitjournal.com Scott Alan Miller

    Talking about double disk failure is really a trivial thing. Yes, RAID 6 sacrifices array reliability and performance in order to make the marketing note that it can "always" survive multiple disk failures. Anyone talking about this and not pointing out how pointless this is is just providing misdirection.

    There are three ways that stating that is misleading:

    1) While RAID 6 always has double parity for every bit, it suffers from the URE risk just like RAID 5. So if using high URE SATA drives, for example, the chances that RAID 6 will be unable to survive even a single disk failure starts to become a very real possibility. It is certainly safer than RAID 5, but safer than unsafe isn't necessarily safe. Remember, RAID 1 and RAID 10 don't have the parity URE risk so this is of incredible significance.

    2) RAID 6 takes a long time to resilver as it must resilver the entire array and, if encountering just one URE, it has to do it twice. So, like RAID 5, it carries the "data not lost but availability or performance loss" risk over a potentially long period of time. The longer the resilver the greater the risk of a second drive (or third) failing. A RAID 6 resilver could potentially take days or weeks - a very long time for a system to be working as hard as it can while being completely fragile and at risk. RAID 1/10 don't do this and only make a direct copy of a single drive in the array so no matter how large the array gets the rebuild time is the time to make a single drive copy. So the risk of another drive failing is very small in comparison.

    3) While RAID 6 can always lose two drives (assuming nothing else goes wrong), RAID 10 can survive up to half of its drives being lost. In a four-disk array, RAID 6 sounds better. In an eight or sixteen drive array, it is pretty obvious that RAID 10 is way safer even from this unlikely scenario: because it is less likely to lose drives in the first place (it burdens them less than RAID 6), because it doesn't have extended periods of fragility, and because, in a moderately sized array, it is more likely than RAID 6 to be able to lose more drives.

    Talking about multiple drive loss out of context is what you do when you want to hide why parity RAID is dangerous. You get people to focus on the mechanism of redundancy rather than the actual reliability of the array. Did you read the links provided? I'm just stating what has already been published.

    Think of having two straw houses or one brick house and wanting to survive a windstorm. Which is more reliable? We all know the answer. Parity RAID is the straw house of the storage world. Redundancy without reliability. Just misdirection.
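
    To put a number on point 3, here's a quick sketch (my model, assuming the second failure strikes a uniformly random remaining drive) of how often each layout survives losing a second drive:

        # RAID 6 survives any two drive losses by construction. RAID 10
        # only dies if the second failure happens to hit the surviving
        # partner of the first failed drive: 1 chance in (n - 1).
        def raid10_second_loss_survival(n_drives):
            return 1 - 1 / (n_drives - 1)

        for n in (4, 8, 16):
            print(f"{n}-drive RAID 10: survives a 2nd loss "
                  f"{raid10_second_loss_survival(n):.0%} of the time")

    At four drives, RAID 6's guaranteed two-loss survival looks better; at eight or sixteen drives, RAID 10's odds climb past 85% and 90%, and it can keep absorbing further losses that would kill RAID 6 outright.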

  • http://www.smbitjournal.com Scott Alan Miller

    I don't foresee parity making a comeback with SSDs for one reason - latency. Even when working perfectly, parity RAID (5/6) increases latency, and latency is the place where SSDs really shine. So the parity overhead, in latency terms, would be really significant compared to what it is on spinners. Unless someone overcomes this, which I see as unlikely at an affordable price, I don't see this catching on at any significant level.

    RAID 5 is all about being cheap. It was never about reliability or performance, even back in the day when it made sense with tiny arrays and expensive drives, it was always about cutting corners to save some dough. It will be quite some time before SSDs are likewise being purchased to be cheap. Until they are, people won't likely buy expensive SSDs but then cut corners on the RAID to go with them.

    Not, at least, until SSDs are chosen for capacity rather than for performance.

  • Ryan Malayter

    @Scott Alan Miller The only double-drive failure I've encountered in 17 years was two mirrored members of a RAID-10 array. On a circa-2001 Dell server, one drive failed, followed by another a few minutes later. This was very likely a firmware bug, as the drives came up clean when sent for RMA, but it still blew the whole array and we had to restore from backups.

    I cannot see how RAID-10 could possibly be immune to URE during rebuild: the data has to come from somewhere. If you just report the block as bad but don't fail the whole array, that's fine. But there is no reason the same cannot be done with parity RAID schemes. This is, in fact, what NetApp and ZFS do with RAID-DP and RAID-Z2 respectively.

    I must again point to NetApp and other HDD array manufacturers that use dual-parity RAID as default. NetApp in particular has defaulted to RAID-DP for years. I have not heard NetApp customers screaming about the unreliability of their arrays. And NetApp customers are not typically "penny pinchers".

    As for the latency of dual-parity schemes with SSDs, that is a non-issue on modern hardware. The parity calculation itself takes nanoseconds, and the additional seeks on SSDs take microseconds. As with mechanical disk arrays, the raw write is often stored in NVRAM and the physical writes are done asynchronously. Again, I point to the newer all-flash arrays such as PureStorage, Nimbus, etc., which use parity for reliability rather than mirroring. They don't seem to have any latency issues.

  • http://www.smbitjournal.com Scott Alan Miller

    Mirrored RAID (1/10) is immune to URE (by this the RAID is immune, not the filesystem underneath) because it does not do parity. It is parity that is affected by URE, not RAID in general. Because it is a mirror it is able to copy the data, as is, from the mirrored drive, URE or not. Parity RAID doesn't have this capability - it must reconstruct the data.

    Think of it from a different perspective. Imagine you have a folder with 100 files in it (no RAID, just think about a folder on a file system.) Now if you have a URE hit you, it is one single block so only one of those 100 files will be impacted. 99 files won't know that anything bad happened.

    Now imagine you had zipped all of those files together into a single zip archive. Now if there was a URE, the single zip file would be corrupt, and all 100 files inside it are gone, because it cannot be uncompressed to pull them back out.

    Parity RAID is like this. Your data doesn't exist on its own - it has to be restored computationally after a drive is lost. Mirrored RAID has no process like this - it never "computes" data based on other data.

  • http://www.smbitjournal.com Scott Alan Miller

    RAIDZ2, for example, does not do what you said it does by default. If it encounters dual failures, it fails. That is a RAID 6 system, so it has protection against the first level failure. But it has no magic to work around the limitations of parity RAID.

    Parity is dangerous because the data is in-flight during the encountering of the URE. A parity RAID system is not stable during its resilver process. Mirrored RAID is.

    There are many articles on this. Mirrored RAID's safety comes from its lack of a destructive process. At no time does mirrored RAID put its own data in danger while performing an algorithm. Parity RAID must do this to resilver.

  • http://www.smbitjournal.com Scott Alan Miller

    NetApp uses enterprise drives and RAID 6 (well, a combination of RAID 4 and 6 that they call DP), so they mostly avoid the issue by spending their way out of it. However, as has been discussed ad nauseam in the SpiceWorks community and elsewhere, no matter how many clients complain about a SAN or NAS vendor's systems, how exactly would you expect to hear about that? Big businesses do not publish failure rates, nor do the vendors. Using a lack of information to determine that no problem exists is misguided.

    NetApp makes good stuff but I for one have had them fail under load when a Linux box easily handled the same load. But that isn't published anywhere and NetApp just ignored the issue. It's a different issue, but you probably haven't heard about it. And that's my point. No lab has hundreds of all these things from different vendors just to see which are reliable and which are not.

    Same goes for any RAID array. There is no shop on the planet collecting the stats that you need. Any shop with those resources already knows not to run RAID 5 and isn't going to run gobs of them just to prove to you that they will lose money.

  • http://www.smbitjournal.com Scott Alan Miller

    Here are some numbers on the RAID levels and their write penalties in IOPS.

    http://www.yellow-bricks.com/2009/12/23/iops/

    Yes, you can cache your way out of many problems today. But only so many, it depends on your needs, and, again, shops willing and able to spend so much on other things are hardly going to be the same shops trying to cut corners and run RAID 5.

    The issue with RAID 5 is a cultural one, primarily. Shops cut corners and run risky, consumer SATA then they cut corners again and run risky RAID 5. Put the two together and everything blows up.
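
    For reference, the write penalties from that article (1 for RAID 0, 2 for RAID 1/10, 4 for RAID 5, 6 for RAID 6) plug into a simple effective-IOPS estimate. A sketch, ignoring caching and controller tricks:

        # Each front-end write costs `penalty` back-end IOs; reads cost 1.
        WRITE_PENALTY = {"RAID 0": 1, "RAID 1/10": 2, "RAID 5": 4, "RAID 6": 6}

        def effective_iops(raw_iops, write_fraction, level):
            p = WRITE_PENALTY[level]
            return raw_iops / ((1 - write_fraction) + write_fraction * p)

        # Hypothetical example: 8 spindles at 150 IOPS each, 40% writes.
        for level in WRITE_PENALTY:
            print(level, round(effective_iops(8 * 150, 0.4, level)))

    With a 40% write mix, the same eight spindles deliver roughly 860 IOPS as RAID 10 but only about 545 as RAID 5, which is the penalty Tim was getting at above.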

  • http://www.smbitjournal.com Scott Alan Miller

    Here is a more modern recap of the article above. I think it does a good job of talking about how RAID-DP works, due to it not actually being RAID 6 exactly (it is dual parity like RAID 6, but rather than being dual-parity RAID 5 the way RAID 6 is, it is more like dual-parity RAID 4... which has no number, but probably should. RAID 7 is generally accepted as likely to apply to things with triple-parity RAID 5, such as RAIDZ3.)

    http://theithollow.com/2012/03/21/understanding-raid-penalty/

    It is RAID-DP's software nature, combined with the filesystem all in one, that allows the WAFL + RAID-DP system to overcome unnecessary write penalties.

  • Ryan Malayter

    @Scott Alan Miller: RAID-Z2 absolutely protects against the failure I mentioned by default. Any combination of URE or disk failures on two devices does not result in data loss. Same with NetApp WAFL.

    As for your scenario where a RAID-10 re-mirrors bad data when a URE is encountered during rebuild, I suspect you're wrong. I'll bet that most controllers in the field fail a whole disk when a URE is encountered, no matter the RAID level in use.

    Even if they didn't fail the whole disk, you are in the same situation as with parity RAID, assuming the array returns "bad block" instead of failing a whole disk and taking the array offline in both cases.

    Think about it this way: if I get a URE on a parity array, and don't have enough parity to reconstruct, I can always assume "URE block has all zeros" and continue, but still return an error up the stack for that block. But I do NOT need to fail the whole disk, or a whole array. This results in the same situation as you suggest with "mirroring a URE" in a RAID-10 rebuild: one bad block reported to the OS, which may or may not be critical depending on the filesystem and data itself.

    There is nothing at all "destructive" about parity RAID schemes, and in fact the "real data" is stored in the clear on N stripes, only the parity data is calculated.

  • http://www.smbitjournal.com Scott Alan Miller

    Nothing destructive? Have you never seen a parity array kill itself out of confusion? In the real world parity arrays do destructive things all the time.

  • http://www.smbitjournal.com Scott Alan Miller

    Your belief that RAID 10 will fail is based on the belief that the array needs to return "bad block" but it does not. A mirroring operation doesn't need to do anything at the entire array level like parity RAID does. Someone could sabotage the array but they don't need to do so. That would be crazy.

    It is a common myth that RAIDZ (from ZFS) somehow avoids all parity issues when all it actually does is overcome the write hole problem. I've never seen it seriously suggested that they've discovered a workaround to the URE issue with parity. Please provide documentation if possible.

  • http://www.thats-too-much.info/ Aaron Mason

    @Trever - I've had an SSD fail 11 days after purchase. Thankfully it was an early gen Sandforce drive - I now use an Intel 320 SSD and it's been trouble-free.

    Thanks for an informative article.

  • Ryan Malayter

    @Scott Alan Miller: in RAID 10, if one disk fails, and a URE is encountered from its mirror during a rebuild, the controller has NO CHOICE but to return a bad block up the stack the next time the URE sector is read, or fail the drive with the URE and therefore fail the entire array. The data simply doesn't exist anymore in a usable form, and known-corrupt data cannot be returned to the application.

    As for how RAIDZ(n) handles UREs during recovery, I am claiming that it does not fail a whole disk and therefore the whole array. It does not magically recover data if there is not enough parity to recover it. Documentation link.

  • Pingback: Some rethinks about today’s RAID 5 | Peter Luk's Blog

  • http://blog.jarfil.net jarfil

    So, what I understand is, I can either have a 3TB drive with 1/1E14 URE, or a 1/20 sized 5x priced 1/1E16 URE one?
    Meaning, a 100 times more reliable, but a 100 times more expensive-per-GB drive.

    Sweet.

    So you could go with the cheap ones, and RAID 1 the hell out of them up to whatever reliability, overall capacity reduction and price point you wish, all the way up to a 100-drive RAID 1, where they would meet the price, reliability and capacity point of 20 of the Enterprise-grade ones (that would be 5 RAID5 4-drive sets).
    Of course you might consider a 100-drive RAID 1 somewhat of an overkill, and just stick to something like 10-drive RAID 1. That would be a 1/1E15 URE, with half the cost of a RAID 5 of Enterprise-grade disks. Seems like a good compromise.

    On the other hand... URE refers to 1 bit errors. What is the probability of THE SAME bit presenting errors on more than one drive?

    I dare say much MUCH lower than 1/1E16. So just a 3 drive RAID 1 (with error notification), or a 4 drive RAID 4 (without error notification, taking a vote for the most popular bit), should be more than enough to boost reliability sky high, while keeping costs barely close to a single Enterprise grade drive.
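
    A rough sketch of that "same bit on multiple drives" intuition (assuming per-bit read failures are independent, which real drives only approximate):

        # Expected number of bits that are unreadable on *every* mirror
        # at once, for an n-way mirror of `drive_tb` terabyte drives.
        def expected_unrecoverable_bits(drive_tb, n_mirrors, p=1e-14):
            bits = drive_tb * 1e12 * 8
            return bits * p**n_mirrors

        print(expected_unrecoverable_bits(3, 1))  # bare drive: ~0.24 per full read
        print(expected_unrecoverable_bits(3, 2))  # 2-way mirror: ~2.4e-15
        print(expected_unrecoverable_bits(3, 3))  # 3-way mirror: ~2.4e-29

    Even a plain two-way mirror pushes the expected overlap down by fourteen orders of magnitude, which is why the cheap-drives-plus-mirroring math above can work out.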

  • http://www.smbitjournal.com Scott Alan Miller

    Jarfil, you are absolutely spot on. The problem with RAID 5 is that to make it safe enough to use, you have to make it so expensive that you could have had a faster, more reliable... AND CHEAPER... RAID 10 array instead. People are implementing RAID 5 to prove a point that it "can" be done, forgetting that the reason not to is that it "shouldn't" be done.

    I wrote an article specifically addressing this backwards thinking (feeling that RAID 5 is "good enough" when there is no winning factor)...

    http://www.smbitjournal.com/2012/08/nearly-as-good-is-not-better/

  • itpro4

    Does the media scanning/consistency checking that many raid controllers perform reduce the risk of experiencing a URE during a rebuild?

  • http://www.standalone-sysadmin.com Matt Simmons

    itpro4: From what I understand, controllers that constantly scrub have a lower chance of corrupted data, but I don't have any numbers to back that up.

  • http://www.smbitjournal.com Scott Alan Miller

    @itpro4 Yes, it absolutely reduces it. However, all of the calculations are done ASSUMING that that is already included. So if you don't have that feature, think of the URE risks as being even higher than stated. We assume the best-case scenario when doing these calculations to make sure there is no wiggle room in stating the dangers.

    We also assume variable stripes and no write hole, which are advanced options and not widely available. So if you aren't running ZFS, for example, your chances of hitting problems just get worse and worse.

  • itpro4

    Maybe several other questions to ask are: Is a URE caused by a damaged/worn sector on the disk? Are there other factors that would cause a URE?

  • http://www.smbitjournal.com Scott Alan Miller

    I'm sure that encourages it, but a URE can happen in any sector at any time. I assume some sectors are more likely to be hit than others. But overall, it is just a generic failure rate.