February 3, 2011
I have an interesting problem. In My God, It’s Full of Files…, I discussed some of the things I had to deal with on our production application server stack, and I used the following picture to explain things:
In that article, I briefly outlined my plan to reduce wasted space by eliminating roughly half of the data (eliminating data is always the easiest way to optimize). That plan is still in development, but it’s only addressing half the issue. The other half is…”wow, a terabyte filesystem? Is that a good idea?”
Looking at the diagram above, let’s pretend that my “staging file storage” is already switched over to consisting entirely of symlinks, and that I’m only dealing with the production file storage (sitting right now at ~800GB). If it were only an 800GB LUN, I would be worried, but as it stands, things are much worse than they seem at first glance. (If you’re not familiar with LVM or virtual disks, you can skim over my Introduction to LVM in Linux column before going to the next section).
I originally started with what I thought was a decently-sized chunk of storage: a 500GB LUN. I mean, when I started, the data set was around 200GB, so I thought “I’m going to more than double the size, that should buy me several years”. Fortunately for my paycheck, but unfortunately for my data set, business has been good, so my growth rate was…somewhat higher.
As it stands now, my production file storage looks sort of like this:
This is much worse than a single 900GB LUN. As it stands right now, all it takes is for any one of those 5 LUNs to be unavailable to wreck the filesystem. And even if it WERE a single LUN, how long do you suppose it’ll take to run ‘fsck’ on that? A long damn time. And that’s only right now.
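If you’ve never grown a filesystem across LUNs with LVM, the mechanics are roughly this (the device, volume group, and LV names here are invented, and I’m assuming an ext3/ext4 filesystem for the resize):

```
# Each time space ran low: present a new LUN, tack it onto the same volume
# group, then grow the single logical volume and the filesystem on top of it.
pvcreate /dev/mapper/lun5                    # initialize the newest LUN
vgextend vg_deploy /dev/mapper/lun5          # add it to the existing volume group
lvextend -l +100%FREE /dev/vg_deploy/deploy  # grow the LV over the new space
resize2fs /dev/vg_deploy/deploy              # grow the ext3/ext4 filesystem online
```

Repeat that a few times and you get the tower of LUNs in the picture above.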
In the growth graph above, the blue is the total size and the red is the used size. This one graph really shows a lot of things…most obviously, you can see that I’ve added additional storage frequently – that’s the stair-step pattern. As I’ve grown the filesystem, the amount of available storage has grown as well.
At the end of August, I finally got everyone to agree on a massive (250GB) purge of old useless data. You can see, though, at this point, I’m adding nearly 100GB a month. My current method of adding more storage to the existing filesystem just isn’t going to work. (As an aside, this graph really brings to light the amazing job our sales staff has been doing. Take a look at the growth rate back in February versus December. Business has been good.)
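For what it’s worth, building that kind of graph doesn’t take anything fancy – a daily cron job logging df output is plenty of raw data (the log path here is made up):

```
# Append one "date total used" line per day; graph it with whatever you like.
echo "$(date +%F) $(df -P -B G /mnt/deploy | awk 'NR==2 {print $2, $3}')" >> /var/log/deploy-growth.log
```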
The way I’m planning to attack this is two-fold. First, I’m going to try to reduce the amount of data I have to deal with. Only the previous X months of daily reports will be available online (where X is defined by the client services staff). This will cut down the amount of data we need to keep, but the growth rate we’re experiencing (look at the slope of that curve) is such that even if we only keep 6 months live, that’s still 600GB, and management is planning on doubling our revenue this coming year, which will likely double our report production, too. Exponential growth can’t continue indefinitely, but it can be a pain in my ass for the next year or two.
If we double to gaining 200GB a month and I have to retain 6 months, that’s still 1.2TB on a single filesystem. And you KNOW that there are exceptions to that 6 months (for instance, monthly reports as well as end-of-month reports will be kept indefinitely, apparently).
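The purge itself is the easy part – something along these lines would implement the “previous X months of daily reports” rule. The paths and the 180-day cutoff are placeholders, and the indefinitely-kept monthly and end-of-month reports would have to live somewhere this doesn’t touch:

```
# Dry run first: list daily reports older than ~6 months.
find /mnt/deploy/*/daily/ -type f -mtime +180 -print
# ...and only after the output has been sanity-checked:
# find /mnt/deploy/*/daily/ -type f -mtime +180 -delete
```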
Now that I’ve made my case for something needing to be done, here’s my plan: I’m going to shard my dataset.
If you’re unfamiliar with the term, sharding typically refers to databases, where you have a single mammoth database and you break it up into manageable chunks.
You can look at a filesystem as a database, and there are many similarities, so if you can shard a database, why can’t you shard a filesystem? Let’s look at this logically:
I have a single mountpoint right now: /mnt/deploy (as you can see from the above graph). The directory structure looks a lot like this:
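(The client names below are placeholders, but the shape is the important part: one directory per client, all under a single mountpoint.)

```
/mnt/deploy/
    Client1/
    Client2/
    Client3/
    ...
    ClientN/
```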
That’s a single FS on top of several LUNs. It’s a tower that’s waiting to be toppled over by a single missing-or-misconfigured LUN. Instead of continuing to expand my dataset into that one filesystem, what I want to do is to break it apart:
Such that each directory under /mnt/deployFS/ (1, 2, … N) is its own 500GB filesystem.
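In /etc/fstab terms, it would look something like this (the device names are invented, and whether the shards end up ext3 or ext4 is still an open question):

```
/dev/mapper/shard1   /mnt/deployFS/1   ext4   defaults   0 2
/dev/mapper/shard2   /mnt/deployFS/2   ext4   defaults   0 2
/dev/mapper/shard3   /mnt/deployFS/3   ext4   defaults   0 2
```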
Because the application expects to see everything in /mnt/deploy, my plan is to symlink each /mnt/deployFS/X/ClientN directory into /mnt/deploy. This should be transparent to the application itself, and it also gives me a TON of flexibility. Actually, the more I thought about this, the more appealing it became, mostly because of all of the unintended benefits:
- Filesystems are locked to a single size
- Increased reliability
- Flexible growth
- Storage Tiering
This has several ramifications. Most obviously, I know how big the FS is going to be. Determining how big to make the next LUN is no longer a question.
Neither is figuring out what a given percentage of free disk actually means. Interesting side-effect of organically growing your filesystem: if you generate critical alerts at 5% free disk space, on a 400GB filesystem, that’s 20GB free. If you’ve got a 1TB filesystem, that same 5% alert is 50GB. If your rate of growth is the same in both cases, the critical alert at 1TB is not NEARLY as critical.
My current solution is waiting to fall over. By breaking each LUN into its own filesystem, I’m not changing the likelihood of any one LUN failing (I already have multiple LUNs, so nothing changes there), but I’m vastly decreasing the likelihood that a single failure will bring down the entire dataset.
If you’re going to have a failure, I think that we can all agree that having a subset of the data unavailable is usually preferable to the entire dataset being unavailable. In addition, in the event of an accidental corruption of the filesystem, the amount of data I have to restore will shrink impressively.
Since each filesystem is a set size, you can dictate that when one hits a certain level, say 80% usage, you make a new filesystem. At that point, you have a number of existing filesystems that are “established”, and a new, empty one. You can not only start creating new clients on the new disk, you can also shuffle existing clients onto it, and move clients around however best suits their growth patterns at the time.
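The “time for a new shard” check is trivial to script. A rough sketch, with the threshold and mount pattern as placeholders:

```
#!/bin/bash
# Warn when any shard filesystem crosses the threshold -- the cue to provision
# a new 500GB LUN and filesystem.
THRESHOLD=80
df -P /mnt/deployFS/* | awk -v t="$THRESHOLD" 'NR > 1 {
    gsub("%", "", $5)
    if ($5 + 0 >= t)
        print $6 " is at " $5 "% -- time to carve out a new shard"
}'
```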
It’s going to be important to graph the growth of each client, so that I know what to expect, and so that I don’t let one crowd out a group of others, but with this scheme, if I have one growing exceptionally, I can move it to its own filesystem so it won’t impact others. Think of it as wear-leveling on the client scale ;-)
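Moving a client between shards is nothing exotic, either – copy, swap the symlink, clean up. The names here are placeholders, and in reality the application would have to be quiesced (or the copy re-run) so nothing changes mid-move:

```
# Relocate a fast-growing client from shard 1 to shard 4.
rsync -a /mnt/deployFS/1/ClientBig/ /mnt/deployFS/4/ClientBig/
ln -sfn /mnt/deployFS/4/ClientBig /mnt/deploy/ClientBig   # repoint the symlink at the new location
rm -rf /mnt/deployFS/1/ClientBig                          # reclaim the old copy
```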
This was definitely not my initial reason, and I didn’t even consider it for the first week or so, but by breaking clients into separate filesystems on specifically located LUNs, I can also dictate where those LUNs are located in the storage array…so if I were to get a disk array enclosure of very fast SSDs, I could theoretically put clients which needed increased performance on LUNs located on those disks.
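(As an aside, if the shards end up living under LVM on the host, the same placement idea can be expressed there too – a logical volume can be pinned to specific physical volumes. The names below are invented:)

```
# Create a shard whose extents are allocated only from the SSD-backed LUN.
lvcreate -L 500G -n shard_ssd1 vg_deploy /dev/mapper/ssd_lun1
```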
There are a lot of advantages, and really only a couple of drawbacks: primarily that the application wasn’t developed with this in mind, so it doesn’t natively know about the sharding. This will have to be solved with symlinks until a “real” solution can be engineered.
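The symlink shim itself is about as simple as it gets (client and shard names are placeholders):

```
# The application keeps looking in /mnt/deploy; each client actually lives
# on one of the shard filesystems.
ln -s /mnt/deployFS/1/Client1 /mnt/deploy/Client1
ln -s /mnt/deployFS/1/Client2 /mnt/deploy/Client2
ln -s /mnt/deployFS/2/Client3 /mnt/deploy/Client3
```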
I’m not sure that many people have done this before, honestly. Google searches for “shard a filesystem” have 0 results. A search for “shard file system” consisted entirely of typos of “shared file system”. It might be that I’m doing something new and novel, but it’s more likely that I’m doing something that I should be looking for under a different name (or, alternately, I could be doing something so dumb that no one else would even consider it).
This is where you come into play. Please let me know what you think of my idea. I asked Twitter about the ability to have multiple mountpoints in one directory (to eliminate the need for symlinks), and one third of the people responding said “use UnionFS”. Another third said “use Gluster” (and the remaining third said “Dear God No! Don’t use Gluster!”). But that wasn’t really the question I wanted answered (mostly because it took 1500 words to explain what I want to do).
I should also say that Bash Cures Cancer thinks this is a terrible idea ;-)
So what do you think? Please let me know in the comments!