Flash Drives as Replaceable Parts

I’m back from Storage Field Day 2 in San Jose, so you can expect a lot of posts about the cool things I saw while I was there. But first, I want to tell you about an idea that came to mind while listening to some panelists on the Next Generation Storage Symposium.

I don’t think it’s a bad idea to treat flash drives in an enterprise environment like race car tires: things that make you go fast, but that you replace on a regular schedule, because they’re wearable parts.

First, let’s lay some groundwork.

You may be under the impression that flash drives are fast. You’re sort of right, because it depends on how we define “fast”.

Regardless of the disk, before we get any kind of speed profile, we need to know what the workload is. There can be a huge performance difference between the various kinds of IO, for reasons that you’ll see shortly. The options are:

Linear Read
Linear Write
Random Read
Random Write

Those four types of IO are really the biggest determinant in how fast or slow your disks go. Very few workloads are purely one or the other; you typically get a percentage of each. Those percentages make a big difference in the performance of your storage.

Looking at spinning disks, we have a few variables that determine speed (how fast the drive can get us data…) and latency (…how long it takes to find the data on-disk). A traditional hard drive rotates at a certain speed (say, 7,200RPM). That means it takes around 8.3ms for the disk to do one complete rotation (7,200 rotations per minute divided by 60 seconds in a minute is 120 rotations per second, and 1/120 is 0.0083 seconds). Since there’s only one head per surface, you can assume that the data, on average, takes half of that time to reach the read head, so on average, the rotational latency of the disk is about 4.2ms.
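As a sanity check on that arithmetic, here it is spelled out in a few lines of Python (just the math from the paragraph above, not any real drive API):

```python
# Average rotational latency of a spinning disk, from its RPM.
def rotational_latency_ms(rpm: float) -> float:
    """Half of one full rotation, on average, in milliseconds."""
    rotations_per_second = rpm / 60              # 7,200 RPM -> 120 rotations/s
    full_rotation_ms = 1000 / rotations_per_second  # -> ~8.3ms per rotation
    return full_rotation_ms / 2                  # data is, on average, half a turn away

print(rotational_latency_ms(7200))  # about 4.17ms
```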

There’s also the arm that the drive head is attached to. It takes a certain amount of time to go from one position to another. This is seek time, and it varies wildly depending on the quality of hard drive you’re dealing with. A decent server SAS drive can perform with a seek time around 3.5ms, while your run-of-the-mill desktop SATA drive is more like 9.5ms. The difference is much more than the price tag.

So it should be clear, then, that if your workload is mostly linear read or linear write, a spinning disk will give you a MUCH faster IO profile. The arm barely has to move and the disk barely has to turn to get more data to the head; the data it wants next is literally right beside the head, so there’s almost no seek time and almost no latency, and therefore linear access is very fast.

Random IO is pretty much the polar opposite of linear IO when you’re dealing with spinning disks. Since the data is placed randomly around the disk, the head frequently has to move to get to the right track on the platter, and the disk may have to spin a full rotation to get the right sector under the head. This is very slow, and it doesn’t matter if you’re reading or writing randomly – latency is long and you’re subject to seek time.

Now, contrast this with flash. There aren’t any moving parts in a flash disk, so rotational latency and seek time are effectively zero. This means that random IO has the same performance as linear IO, all things being equal.

But sadly, things aren’t equal. Bits on a flash chip are stored in cells. Each cell holds either one bit (a single-level cell, or SLC) or two bits (a multi-level cell, or MLC). Incidentally, triple-level cells (TLC) are coming soon, and I’ve heard rumors of four-level cells a ways away.

Anyway, these cells are organized into clumps, called “blocks”. Each block can be read independently. These “read” blocks are relatively small, say 4k. Unfortunately, though, flash can’t write to an individual read block. No, in order to write to flash, you need to write to a much larger chunk of cells, called an “erasure block”, which is about 64k, depending on the flash chip.

The reason it’s called an erasure block is that, as I said, you can’t write to an individual cell, or even a 4k read block. You need to write the entire erasure block at once, effectively erasing it.

So, you want to change something in a read block, but you can’t just write to it, you need to write everything else, too. So what do you do? You need to read everything in that entire erasure block into cache, change the data in memory to what you want the read block to be, then you write everything out to flash again. And all of this happens every time you want to write anything.
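That read-modify-write cycle can be sketched as a toy model in Python (a deliberate simplification: real controllers remap writes to fresh blocks via the flash translation layer rather than rewriting in place, but the write amplification is the same idea):

```python
# Toy model of a flash erasure block: changing one small read block
# forces the controller to rewrite the entire erasure block.
READ_BLOCK = 4 * 1024     # 4k read block
ERASE_BLOCK = 64 * 1024   # 64k erasure block (16 read blocks)

def write_read_block(erase_block: bytearray, index: int, data: bytes) -> bytearray:
    """Change one 4k read block inside a 64k erasure block."""
    assert len(data) == READ_BLOCK
    # 1. Read the whole erasure block into cache (here: copy into memory).
    cached = bytearray(erase_block)
    # 2. Modify just the read block we care about, in memory.
    cached[index * READ_BLOCK:(index + 1) * READ_BLOCK] = data
    # 3. Erase and rewrite the entire erasure block -- one more erase cycle consumed.
    return cached

block = bytearray(ERASE_BLOCK)                       # a blank erasure block
block = write_read_block(block, 3, b"\xff" * READ_BLOCK)
```

A 4k change still costs a full 64k erase-and-rewrite, which is why flash writes lag flash reads.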

As you can see, flash writes are slower than flash reads. I mean, they’re still faster than disk writes (usually), just because there’s no seek time and no latency. Plus, there is another downside to this whole “writing” thing.

Every time you write to flash, you electronically damage the cells that you write to. Eventually, the controller goes to read the data from the cells, but it can’t determine what is a bit and what is electrical noise…and the cell is effectively gone.

It’s inevitable, and it happens to every flash drive, from the one in your laptop to the one in your USB stick, to the one in your hundred thousand dollar storage array. MLC wears away faster than SLC, but they all go eventually.

It’s that eventuality that I wanted to write about today.

In many (probably most) cases, modern storage arrays use a “tiered” approach to data storage. Because different types of disk have different IO profiles (as we’ve seen), data typically comes in from the host into SDRAM. This is both a working buffer and, in some cases, a battery-backed storage tier. SDRAM is really expensive, though, so the amount used like this is typically very small, in the dozens of gigabytes (at the time of this writing).

One of the trends today is to use flash as a cache for “hot blocks” (data that has been recently accessed or the storage controller thinks is likely to be accessed soon), but whenever someone mentions using flash as a cache, people cringe because of the limited lifetime problem.

It occurred to me that a flash cache (particularly one where writes are common) isn’t necessarily a bad thing because the lifetime is limited. In fact, in a properly engineered storage solution, it makes a lot of sense.

The Bugatti Veyron is an amazing car, and it has a theoretical top speed of 290 miles per hour (“only” 258 with the governor). At that speed, it goes through tires in 15 minutes…which is ok, because it runs out of gas in 13.

2008 Bugatti Veyron.

(The Bugatti Veyron in the picture is one I actually got to see in person. It’s pretty.)

The F1 car in the picture at the top goes through tires a little less quickly, but still, at a prodigious rate compared to my grocery getter at home. Race tires are expected to last a certain number of laps, not the entire race (usually, anyway). And when they get a set amount of wear on them, the tires are replaced. They’re a wearable part. And there’s no reason to treat flash drives any differently.

The thing that scares people about flash drives is that they hold data…precious, valuable data that we need to be available. But flash wear is very predictable. You can determine how worn a flash drive is, and use that data to determine when to replace it. When the time comes, you remove that disk from the cache pool (which does impact performance, but since this is an elective operation, as it were, you can time it so that it isn’t a problem), and swap the disk with a new one.
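That elective-replacement decision could be as simple as comparing observed wear against the drive’s rated endurance. A minimal sketch, assuming a 10K-cycle MLC rating and an arbitrary 80% replacement threshold (both values are illustrative, not from any vendor):

```python
# Decide when a flash drive is "worn tires": replace it electively,
# before failure, based on how much of its rated endurance is consumed.
RATED_ERASE_CYCLES = 10_000   # assumed endurance rating for standard MLC
REPLACE_AT = 0.80             # assumed policy: replace at 80% of rated wear

def should_replace(max_erase_count: int) -> bool:
    """True when the most-worn block has consumed enough of its rated cycles."""
    wear_fraction = max_erase_count / RATED_ERASE_CYCLES
    return wear_fraction >= REPLACE_AT

print(should_replace(8_500))  # True: pull it from the cache pool on your schedule
print(should_replace(2_000))  # False: plenty of laps left
```

The point is that the trigger is a number you pick in advance, not a failure you react to.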

I honestly don’t know of anyone who’s doing this at this very moment, but I’ve heard people mention it, and it wouldn’t be surprising if we see flash drives treated as wearable parts in the near future. It seems like a pretty obvious kind of thing, once you change the mindset from “the drives can’t fail or I’ll lose data” to “I know the drives will fail, so let’s control the failure”.

What’s your take on this whole “flash drive” thing? Comment below!

  • Steve

    irregardless literally means without without regard. Which is likely not what you intended.

  • Ugh. Fixed. I blame whatever device’s autocorrect I was using at the time. Thanks Steve!

  • Ben

    Just one small thing: “irregardless” is not a word.

    Otherwise, great article. We’ve yet to have a single flash drive here as they’re just too damned expensive for our needs. Maybe one day. I’ll probably get one in a home computer/laptop before we get one at work. Although I am pushing for some in our web caches/load balancers.

  • Ben

    Bah, sorry. By the time I got here I was not only late, but double-posting. My apologies.

  • furicle

    The possible fallacy of your argument is tires wear evenly and predictably. Total failure only happens after many warning signs, and even then probably won’t crash the car (at speeds less than 200mph :-). I don’t think that’s true of flash. Failure once in any one cell is lost data. It’s digital failure.

  • rpetre

    I’m nervous about the same thing as furicle: I am not confident I can predict properly how worn out a drive is (which is pretty much the same for both types of drives), but if I understand properly, the flash drives fail catastrophically each time, unlike the traditional drives, which tend to moan a lot and lose only bits of your data.

    Another point is that as far as i understand, there is little randomness in failure, so RAID-ed drives might correlate their failures, which is A Bad Thing.

    Hopefully in time there will be extensive studies of the failure scenarios which will shape our storage strategies, but until then I am a bit wary :)

  • Jazz

    Irregardless unnotwithstanding (why can’t IT people get over trivial minutiae and simply answer the question they know they were asked?), read/write to a flash is much more a logical operation than physical, right? Spinning disk being some factor higher into the physical realm, as a comparison. It being logical, it seems logical that the decay rate for the degradation can be more than casually accurately predicted with maths. I say Simmons’ argument is valid. After all, spinning disks have an MTBF, or a statistical average life-span; why not flash (albeit with a much finer pinpoint as to when that failure will probably occur)?

  • It isn’t necessary to have a single solitary drive – you can mirror the flash devices, which makes installing the replacement that much easier.

  • Nemo

    @Jazz — Because trivial minutiae are exactly what IT does. The tiniest thing matters. If you don’t see how, I would never want you monitoring or working on someone’s servers.

  • OK, let’s keep it friendly.

    I fixed the original error, and this particular trivial matter isn’t likely to cause downtime ;-)

  • Rob

    It seems like many of the problems that people are citing exist with spinning media as well and are not unique to flash.

    Hard drives get bad sectors, what happens with those? They get remapped, if the drive can detect it, or the data is lost if it can’t. How is that different with an SSD?
    If you are worried that the drive can’t deal with it, use a higher level of protection, like file systems with block checksums and/or raid.

    I don’t think the inability to write to one cell fails the whole drive. Most drives are over-provisioned to deal with that situation, and use wear leveling to extend the life of the drive.
    Could some workloads wear out the drives sooner? Sure, same as on physical media.
    I’m sure that there are certain data access patterns that are harder on the motors for the drive head in spinning drives than others. Just like a car. Drive a certain way, and you will wear out your tires sooner.

    Storage on any type of medium is a time game. It will all fail eventually. NetApp and EMC try to get you to buy a new system from them every 3 years or so, or they will charge you an arm and a leg to extend the support. Why? As the system gets older, it’s more likely to fail due to wear and tear and age. So instead of the support money going into their pocket (like it does the first 3 years or so), it goes towards real failures, which they don’t like.

    Also, for furicle: I can only think of one storage system that I have used in the past 8 years that alerted me to a failing drive. Most alert me that it failed, and that the RAID set is rebuilding to a spare drive. Now, one can argue that some of those systems may have predictive failure methods at work (in fact, I know they do), but that mostly just seems to make the rebuilds easier on them: retrieve the blocks you can read from a failing drive, copy those to a spare, rebuild the blocks you can’t read from parity, and then fail the drive. Many were outright failures. Like potholes that pop your tire.

    Some of this also assumes that if it wasn’t a digital failure that you could see the signs of the impending failure. My sister had gall stones, and they were apparently building for quite a while before she needed to be hospitalized for them. For her, she went from feeling fine one day to agony the next. To her, it was a digital failure.

    Were there signs that the stones were building? Maybe. Were they obvious, or subtle?
    Were they mistaken for something else? (stomach ache, flu, etc) Possibly.
    To your car dealer, it may be obvious that you need new tires. To an untrained eye, they may look fine.

  • You really need to do the math on flash drive wear. Odds are, it isn’t a factor at all and you don’t need to consider a flash drive a replaceable part.


  • There are actual SMART attributes defined to track the wear status of drives. Attribute 177 (Wear Leveling Count) tells you the maximum number of erase operations performed on a single block. This is an important one, since flash endurance is measured in erase cycles. Paying attention to this value and comparing it against the drive’s listed endurance will tell you when it’s time to proactively replace the drive.

    Attributes 181 and 182 (Program/Erase Fail Count) tell you how many blocks are already bad. Unlike with spinning disks, this is not a sign of imminent doom, as flash is designed to wear like this.

    Like all disks, each vendor uses these attributes a little differently but they are there. The RAID system makers are beginning to account for these.

    As for endurance, as Tracy pointed out it’s all about the math. Take a theoretical 450GB SSD, and write to it at 300MB/s. How long will those cells last?

    At 300MB/s on a 450GB drive, a full-drive write takes 1,536 seconds, or 25.6 minutes.

    Given the 10K erase endurance of “standard” MLC cells, each cell will get 10K program cycles in 256,000 minutes, or 178 days. That’s… not so good. You don’t want to use these drives in such a ludicrously high-write environment.

    “High Endurance MLC” (MLC-HET) has been out for a while now, and has much better endurance: 90K, by reliable reports. Doing the math, that 450GB drive now takes 1,600 days (4.38 years!) to hit the endurance line.

    Complicating these numbers is the amount of reserve blocks each SSD keeps, so the actual wear experienced by each cell will be lower than what I’ve laid out above. However, cell failure is probabilistic, so individual drives may crap out way before the math says they should, or they could last a decade.
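    The endurance arithmetic above can be double-checked in a few lines of Python (using 1GB = 1,024MB, which is what the 1,536-second figure implies):

```python
# Worst-case endurance math: sustained full-speed writes across the whole drive.
DRIVE_GB = 450      # drive capacity
WRITE_MB_S = 300    # sustained write speed

full_write_seconds = DRIVE_GB * 1024 / WRITE_MB_S   # 1536 seconds per full-drive write
full_write_minutes = full_write_seconds / 60        # 25.6 minutes

def days_to_wear_out(erase_cycles: int) -> float:
    """Days until every cell reaches its rated erase-cycle endurance."""
    return erase_cycles * full_write_minutes / (60 * 24)

print(days_to_wear_out(10_000))  # ~178 days for standard MLC
print(days_to_wear_out(90_000))  # 1600 days, about 4.4 years, for MLC-HET
```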

    I do know that none of my workloads come close to 300MB/s sustained for an individual drive, so MLC-HET is actually a drop-in replacement for rotating media for me. And might even last longer, but I’m not banking on it yet.