November 14, 2012
I'm back from Storage Field Day 2 in San Jose, so you can expect a lot of posts about the cool things I saw while I was there. But first, I want to tell you about an idea that came to mind while listening to some panelists at the Next Generation Storage Symposium.
I don't think it's a bad idea to think of flash drives in an enterprise environment as race car tires. Treat them as things to make you go fast, but to replace on a regular schedule, because they're wearable parts.
First, let's lay some groundwork.
You may be under the impression that flash drives are fast. You're sort of right, because it depends on how we define "fast".
Regardless of the disk, before we get any kind of speed profile, we need to know what the workload is. There can be a huge performance difference between the various kinds of IO, for reasons that you'll see shortly. The options are:

- Linear (sequential) reads
- Linear (sequential) writes
- Random reads
- Random writes
Those four types of IO are really the biggest determinant of how fast or slow your disks go. Very few workloads are purely one or the other; you typically get a percentage of each. Those percentages make a big difference in the performance of your storage.
Looking at spinning disks, we have a few variables that determine speed (how fast the drive can get us data…) and latency (…how long it takes to find the data on-disk). A traditional hard drive rotates at a certain speed (say, 7,200RPM). That means it takes around 8ms for the disk to do one complete rotation (7,200 rotations per minute divided by 60 seconds in a minute is 120 rotations per second, and 1/120 is 0.0083 seconds). Since the head can only read whatever is passing underneath it, the data you want is, on average, half a rotation away, so the average rotational latency of the disk is about 4ms.
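If you want to play with that arithmetic, here's a quick back-of-the-envelope sketch in Python, using the 7,200RPM figure from above (the numbers are illustrative, not from any particular drive):

```python
# Rotational latency from spindle speed (illustrative 7,200RPM figure).
rpm = 7200
rotations_per_second = rpm / 60                   # 120 rotations per second
full_rotation_ms = 1000 / rotations_per_second    # ~8.3 ms for one full rotation
avg_rotational_latency_ms = full_rotation_ms / 2  # data is, on average, half a rotation away

print(f"{full_rotation_ms:.1f} ms per rotation, "
      f"{avg_rotational_latency_ms:.1f} ms average rotational latency")
```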
There's also the arm that the drive head is attached to. It takes a certain amount of time to move from one position to another. This is seek time, and it varies wildly depending on the quality of the hard drive you're dealing with. A decent server SAS drive can manage a seek time of around 3.5ms, while your run-of-the-mill desktop SATA drive is more like 9.5ms. The difference is much more than the price tag.
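Putting seek time and rotational latency together gives you a rough ceiling on random IOPS for a spinning disk. Here's a sketch that assumes the ~4ms rotational latency from above applies to both drives (a simplification, since server SAS drives usually spin faster than 7,200RPM):

```python
def random_iops(seek_ms, rotational_latency_ms=4.2):
    # Each random IO pays one seek plus, on average, half a rotation.
    service_time_ms = seek_ms + rotational_latency_ms
    return 1000 / service_time_ms

print(f"SAS  (~3.5 ms seek): ~{random_iops(3.5):.0f} IOPS")   # roughly 130
print(f"SATA (~9.5 ms seek): ~{random_iops(9.5):.0f} IOPS")   # roughly 73
```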
So it should be clear, then, that if your workload is mostly linear reads or linear writes, a spinning disk will give you a MUCH faster IO profile. The arm barely has to move and the disk barely has to turn to get more data to the head; the data it wants next is literally right beside the head, so there's almost no seek time and almost no rotational latency. That's why linear access is very fast.
Random IO is pretty much the polar opposite of linear IO when you're dealing with spinning disks. Since the data is placed randomly around the disk, the head frequently has to move to get to the right track on the platter, and the disk may have to spin most of a rotation to get the right sector under the head. This is very slow, and it doesn't matter whether you're reading or writing randomly: latency is long and you're subject to seek time on every IO.
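To see just how much those workload percentages matter, here's a toy model. The per-IO service times are assumptions I picked for illustration (sequential IO pays almost nothing, random IO pays the desktop SATA seek plus rotational latency); real drives, queues, and caches complicate this considerably:

```python
# Assumed per-IO service times in milliseconds, for illustration only.
service_time_ms = {
    "linear_read":  0.5,   # head barely moves, disk barely turns
    "linear_write": 0.5,
    "random_read":  13.7,  # ~9.5 ms seek + ~4.2 ms rotational latency
    "random_write": 13.7,
}

def effective_iops(mix):
    """mix maps each IO type to its fraction of the workload (summing to 1.0)."""
    avg_ms = sum(fraction * service_time_ms[kind] for kind, fraction in mix.items())
    return 1000 / avg_ms

mostly_linear = {"linear_read": 0.70, "linear_write": 0.20,
                 "random_read": 0.05, "random_write": 0.05}
mostly_random = {"linear_read": 0.10, "linear_write": 0.10,
                 "random_read": 0.40, "random_write": 0.40}

print(f"mostly linear: ~{effective_iops(mostly_linear):.0f} IOPS")  # ~550
print(f"mostly random: ~{effective_iops(mostly_random):.0f} IOPS")  # ~90
```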
Now, contrast this with flash. There aren't any moving parts in a flash disk, so latency and seek time are effectively zero. This means that random IO has the same performance as linear IO, all other things being equal.
But sadly, things aren't equal. Bits on a flash chip are stored in cells. Each cell holds either one bit (in Single-Level Cell, or SLC, flash) or two bits (in Multi-Level Cell, or MLC, flash). Incidentally, Triple-Level Cells are coming up soon, and I've heard rumors of four-bit cells a ways away.
Anyway, these cells are organized into clumps called "blocks". Each block can be read independently. These "read" blocks are relatively small, say 4k. Unfortunately, though, flash can't write to an individual read block. No, in order to write to flash, you need to write a much larger chunk of cells, called an "erasure block", which is around 64k, depending on the flash chip.
The reason it's called an erasure block is that, as I said, you can't write an individual cell, or even a 4k read block. You have to write the entire erasure block at once, effectively erasing whatever was there.
So, you want to change something in a read block, but you can't just write to it; you have to write everything around it, too. So what do you do? You read everything in that entire erasure block into cache, change the data in memory to what you want the read block to be, then write everything back out to flash again. And all of this happens every time you want to write anything.
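Here's a minimal sketch of that read-modify-write cycle, using the 4k read blocks and 64k erasure blocks from above. Real controllers are much cleverer than this (wear leveling, write coalescing, spare blocks), but the shape of the problem is the same:

```python
READ_BLOCK = 4 * 1024                              # "read" block size (assumed 4k)
ERASURE_BLOCK = 64 * 1024                          # erasure block size (assumed 64k)
BLOCKS_PER_ERASURE = ERASURE_BLOCK // READ_BLOCK   # 16 read blocks per erasure block

def rewrite_read_block(flash, erasure_index, block_index, new_data):
    """Change one 4k read block by rewriting its entire erasure block."""
    # 1. Read the whole erasure block into cache (a plain list stands in here).
    cached = list(flash[erasure_index])
    # 2. Change just the read block we care about, in memory.
    cached[block_index] = new_data
    # 3. Erase and rewrite the whole erasure block: 64k written to change 4k.
    flash[erasure_index] = cached

# A toy "chip": 4 erasure blocks, each holding 16 read blocks of erased cells.
flash = [[b"\xff" * READ_BLOCK for _ in range(BLOCKS_PER_ERASURE)] for _ in range(4)]
rewrite_read_block(flash, erasure_index=2, block_index=5, new_data=b"\x00" * READ_BLOCK)
```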
As you can see, flash writes are slower than flash reads. They're still faster than disk writes (usually), just because there's no seek time and no rotational latency. Plus, there's another downside to this whole "writing" thing.
Every time you write to flash, you electronically damage the cells that you write to. Eventually, the controller goes to read the data from the cells, but it can't determine what is a bit and what is electrical noise…and the cell is effectively gone.
It's inevitable, and it happens to every flash drive, from the one in your laptop, to the one in your USB stick, to the one in your hundred-thousand-dollar storage array. MLC wears out faster than SLC, but they all go eventually.
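You can put rough numbers on that "eventually". Everything below is an assumption I made up for illustration, not the rating of any particular drive, but the arithmetic is the usual endurance math: total writable data is capacity times rated program/erase cycles, and the read-modify-write behavior means the drive writes more than the host asks it to.

```python
# Illustrative endurance estimate; every number here is an assumption.
capacity_gb = 200
rated_pe_cycles = 3000          # assumed MLC rating; SLC is typically much higher
write_amplification = 3.0       # extra writes caused by the read-modify-write cycle
host_writes_gb_per_day = 500

total_writable_gb = capacity_gb * rated_pe_cycles
lifetime_days = total_writable_gb / (host_writes_gb_per_day * write_amplification)

print(f"~{lifetime_days:.0f} days (~{lifetime_days / 365:.1f} years) before wear-out")
```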
It's that eventuality that I wanted to write about today.
In many (probably most) cases, modern storage arrays use a "tiered" approach to data storage. Because different types of disk have different IO profiles (as we've seen), data typically comes in from the host into SDRAM. This is both a working buffer and, in some cases, a battery-backed storage tier. SDRAM is really expensive, though, so the amount used like this is typically very small, in the dozens of gigabytes (at the time of this writing).
One of the trends today is to use flash as a cache for "hot blocks" (data that has been recently accessed or the storage controller thinks is likely to be accessed soon), but whenever someone mentions using flash as a cache, people cringe because of the limited lifetime problem.
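At its core, a "hot block" cache is a recency-and-frequency game. Here's a toy sketch (a plain LRU, which is far cruder than what a real storage controller does) just to show where all the flash writes come from; every promotion into the cache is another write to the flash:

```python
from collections import OrderedDict

class HotBlockCache:
    """Toy LRU cache of "hot" blocks kept on flash, backed by spinning disk."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.cache = OrderedDict()   # block address -> data, ordered by recency

    def read(self, address, read_from_disk):
        if address in self.cache:
            self.cache.move_to_end(address)    # cache hit: still hot
            return self.cache[address]
        data = read_from_disk(address)         # cache miss: slow spinning disk
        self.cache[address] = data             # promote to flash (a flash write!)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict the coldest block
        return data

cache = HotBlockCache(capacity_blocks=2)
cache.read(42, read_from_disk=lambda addr: f"data at block {addr}")
```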
It occurred to me that a flash cache (particularly one where writes are common) isn't necessarily a bad thing because the lifetime is limited. In fact, in a properly engineered storage solution, it makes a lot of sense.
The Bugatti Veyron is an amazing car, and it has a theoretical top speed of 290 miles per hour ("only" 258 with the governor). At that speed, it goes through tires in 15 minutes…which is ok, because it runs out of gas in 13.
(The Bugatti Veyron in the picture is one I actually got to see in person. It's pretty.)
The F1 car in the picture at the top goes through tires a little less quickly, but still, at a prodigious rate compared to my grocery getter at home. Race tires are expected to last a certain number of laps, not the entire race (usually, anyway). And when they get a set amount of wear on them, the tires are replaced. They're a wearable part. And there's no reason to treat flash drives any differently.
The thing that scares people about flash drives is that they hold data…precious, valuable data that we need to be available. But flash wear is very predictable. You can determine how worn a flash drive is, and use that data to determine when to replace it. When the time comes, you remove that disk from the cache pool (which does impact performance, but since this is an elective operation, as it were, you can time it so that it isn't a problem), and swap the disk with a new one.
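In practice, that could be as simple as polling each drive's wear indicator on a schedule and flagging anything past a policy threshold. A sketch, with made-up wear readings and an assumed replace-at-80% policy (real drives expose wear through vendor-specific SMART attributes, so the collection step will vary):

```python
REPLACE_AT_PERCENT_WORN = 80   # assumed policy: swap well before end of life

# Made-up wear readings: percent of rated program/erase cycles consumed.
cache_pool = {
    "flash-01": 23,
    "flash-02": 81,
    "flash-03": 64,
}

due_for_replacement = [drive for drive, worn in cache_pool.items()
                       if worn >= REPLACE_AT_PERCENT_WORN]
print("Schedule replacement for:", due_for_replacement or "nothing yet")
```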
I honestly don't know of anyone who's doing this at this very moment, but I've heard people mention it, and it wouldn't be surprising if we see flash drives treated as wearable parts in the near future. It seems like a pretty obvious kind of thing, once you change the mindset from "the drives can't fail or I'll lose data" to "I know the drives will fail, so let's control the failure".
What's your take on this whole "flash drive" thing? Comment below!