I'm here to shard data and chew bubblegum...

February 3, 2011

and I'm all out of bubblegum.

I have an interesting problem. In My God, It's Full of Files..., I discussed some of the things I had to deal with on our production application server stack, and I used the following picture to explain things:

In that article, I briefly outlined my plan to reduce wasted space by eliminating roughly half of the data (eliminating data is always the easiest way to optimize). That plan is still in development, but it only addresses half the issue. The other half is..."wow, a terabyte filesystem? Is that a good idea?"

Looking at the diagram above, let's pretend that my "staging file storage" is already switched over to consisting entirely of symlinks, and that I'm only dealing with the production file storage (sitting right now at ~800GB). If it were a single 800GB LUN, that alone would worry me, but as it stands, things are much worse than they seem at first glance. (If you're not familiar with LVM or virtual disks, you can skim over my Introduction to LVM in Linux column before going to the next section).

I originally started with what I thought was a decently sized chunk of storage: a 500GB LUN. I mean, when I started, the data set was around 200GB, so I thought "I'm going to more than double the size; that should buy me several years". Fortunately for my paycheck, but unfortunately for my data set, business has been good, so my growth rate was...somewhat higher.

As it stands now, my production file storage looks sort of like this:

This is much worse than a single 900GB LUN. As it stands right now, all it takes to wreck the filesystem is for any one of those 5 LUNs to become unavailable. And even setting the multiple LUNs aside, how long do you suppose it'll take to run 'fsck' on a filesystem that size? A long damn time. And that's only at today's size.
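
If you've never lived through this particular treadmill, "growing the filesystem" each time a new LUN shows up looks roughly like the generic LVM recipe below. The device, VG, and LV names are placeholders, not my real ones; this is a sketch of the pattern, not my actual commands.

pvcreate /dev/sdf                              # label the newly presented LUN as an LVM physical volume
vgextend deployvg /dev/sdf                     # add it to the existing volume group
lvextend -l +100%FREE /dev/deployvg/deploylv   # grow the logical volume onto the new space
resize2fs /dev/deployvg/deploylv               # grow the mounted ext3 filesystem to match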

Everyone reading this knows that you should graph your data usage, right? It's sleep prevention, though, because it gives you things like this:

The blue line is the total size, and the red is the used size. This one graph really shows a lot of things...most obviously, you can see how frequently I've added storage - that's the stair-step pattern. Every time I grew the filesystem, the total available storage jumped up another step.
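
If you're not already collecting the numbers behind a graph like this, even something as dumb as a daily cron job appending df output to a log is enough to get started. The path and format below are just an example, not what I actually run:

#!/bin/sh
# Hypothetical /etc/cron.daily/track-deploy-usage: log the date, total, and
# used 1K-blocks for /mnt/deploy so a graphing tool can read it later.
df -P /mnt/deploy | awk -v d="$(date +%F)" 'NR==2 {print d, $2, $3}' >> /var/log/deploy-usage.log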

At the end of August, I finally got everyone to agree to a massive (250GB) purge of old, useless data. You can also see, though, that at this point I'm adding nearly 100GB a month. My current method of adding more storage to the existing filesystem just isn't going to work. (As an aside, this graph really brings to light the amazing job our sales staff has been doing. Compare the growth rate back in February to December's. Business has been good.)

The way I'm planning to attack this is two-fold. First, I'm going to try to reduce the amount of data I have to deal with: only the most recent X months of daily reports will be available online (where X is defined by the client services staff). That cuts down the amount of data necessary to keep, but the growth rate we're experiencing (look at the slope of that curve) is such that even if we only keep 6 months live, that's still 600GB, and management is planning on doubling our revenue this coming year, which will likely double our report production, too. Exponential growth can't continue indefinitely, but it can be a pain in my ass for the next year or two.

If we double to gaining 200GB a month and I have to retain 6 months, that's still 1.2TB on a single filesystem. And you KNOW that there are exceptions to that 6 months (for instance, monthly reports as well as end-of-month reports will be kept indefinitely, apparently).

Now that I've made my case for something needing to be done, here's my plan: I'm going to shard my dataset.

If you're unfamiliar with the term, sharding typically refers to databases, where you have a single mammoth database and you break it up into manageable chunks.

You can look at a filesystem as a database, and there are many similarities, so if you can shard a database, why can't you shard a filesystem? Let's look at this logically:

I have a single mountpoint right now: /mnt/deploy (as you can see from the above graph). The directory structure looks a lot like this:

/mnt/deploy/Client1
/mnt/deploy/Client2
/mnt/deploy/Client...
/mnt/deploy/ClientN

That's a single FS on top of several LUNs. It's a tower waiting to be toppled by a single missing or misconfigured LUN. Instead of continuing to expand my dataset into that one filesystem, what I want to do is break it apart:

/mnt/deployFS/1/Client1
/mnt/deployFS/1/Client2
/mnt/deployFS/1/Client...
/mnt/deployFS/1/Client(N/X)
/mnt/deployFS/2/Client(N/X)+1
/mnt/deployFS/2/Client(N/X)+2
/mnt/deployFS/.../...
/mnt/deployFS/X/ClientN

Such that each directory under /mnt/deployFS/ (1, 2, ... X) is its own 500GB filesystem.
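
To make that concrete, standing up a new shard would look something like this. The device name and label are placeholders - exactly how each 500GB chunk gets presented from the array is a separate decision:

mkfs.ext3 -L deploy-shard3 /dev/mapper/lun_shard3   # one dedicated LUN, one filesystem, no LVM glue
mkdir -p /mnt/deployFS/3
mount LABEL=deploy-shard3 /mnt/deployFS/3           # plus a matching /etc/fstab entry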

Because the application expects to see everything in /mnt/deploy, my plan is to fill /mnt/deploy with symlinks pointing at the real directories under /mnt/deployFS/X/ClientN. This should be transparent to the application itself, and it also gives me a TON of flexibility. Actually, the more I thought about this, the more appealing it became, mostly because of all of the unintended benefits:

  1. Filesystems are locked to a single size

    This has several ramifications. Most obviously, I know how big the FS is going to be. Determining how big to make the next LUN is no longer a question.

    Neither is the moving target of percentage-based alerts. Interesting side-effect of organically growing your filesystem: if you generate critical alerts at 5% free disk space, on a 400GB filesystem that's 20GB free. On a 1TB filesystem, that same 5% alert fires at 50GB free. If your rate of growth is the same for both, the critical alert on the 1TB filesystem is not NEARLY as critical.

  2. Increased reliability

    My current solution is waiting to fall over. By breaking each LUN into its own filesystem, I'm not changing the likelihood of a LUN going away (I already have multiple LUNs, so nothing changes there), but I'm vastly decreasing the likelihood that a single failure will bring down the entire dataset.

    If you're going to have a failure, I think we can all agree that having a subset of the data unavailable is usually preferable to losing the entire dataset. In addition, in the event of accidental filesystem corruption, the amount of data I have to restore shrinks impressively.

  3. Flexible growth

    Since each filesystem is a set size, you can dictate that when one hits a certain level, say 80% usage, you make a new filesystem. At that point you have a number of "established" filesystems and a new, empty one. You can not only start creating new clients on the new disk, you can also shuffle existing clients onto it, and move clients around however best suits their growth patterns at the time (there's a rough sketch of what a client move looks like right after this list).

    It's going to be important to graph the growth of each client, so that I know what to expect and don't let one crowd out a group of others, but with this scheme, if one client is growing exceptionally fast, I can move it to its own filesystem so it won't impact the others. Think of it as wear-leveling on the client scale ;-)

  4. Storage Tiering

    This was definitely not my initial reason, and I didn't even consider it for the first week or so, but by breaking clients into separate filesystems on specifically located LUNs, I can also dictate where those LUNs live in the storage array...so if I were to get a disk array enclosure of very fast SSDs, I could theoretically put the clients that need increased performance on LUNs backed by those disks.
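
Here's the rough sketch of a client move I promised under "Flexible growth". The client and shard names are invented, and a real run would need writes to that client paused (or at least a final catch-up pass) before the link gets repointed:

rsync -aH /mnt/deployFS/1/Client42/ /mnt/deployFS/4/Client42/     # bulk copy while still live
# ...pause writes to Client42, then one final catch-up pass...
rsync -aH --delete /mnt/deployFS/1/Client42/ /mnt/deployFS/4/Client42/
ln -sfn /mnt/deployFS/4/Client42 /mnt/deploy/Client42             # repoint the symlink
rm -rf /mnt/deployFS/1/Client42                                   # reclaim space on the old shard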

There are a lot of advantages, and really only a couple of drawbacks: primarily that the application wasn't developed with this in mind, so it doesn't natively know about the sharding. This will have to be solved with symlinks until a "real" solution can be engineered.
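
For the curious, the compatibility layer really is nothing fancy. Something like this would (re)build the links under the layout sketched above - it's illustrative only, it assumes each client lives on exactly one shard, and a real version would want sanity checks before touching anything:

#!/bin/bash
# Point /mnt/deploy/<Client> at whichever shard currently holds that client.
for clientdir in /mnt/deployFS/*/*/ ; do
    client=$(basename "$clientdir")
    ln -sfn "${clientdir%/}" "/mnt/deploy/$client"    # create or repoint the symlink
done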

I'm not sure that many people have done this before, honestly. A Google search for "shard a filesystem" returns 0 results, and a search for "shard file system" turns up nothing but typos of "shared file system". It might be that I'm doing something new and novel, but it's more likely that I'm doing something I should be searching for under a different name (or, alternately, something so dumb that no one else would even consider it).

This is where you come into play. Please let me know what you think of my idea. I asked Twitter about the ability to have multiple mountpoints into one directory (to eliminate the need for symlinks), and one third of the people responding said "use UnionFS". Another third said "Use Gluster" (and the remaining third said "Dear God No! Don't use Gluster!"). But that wasn't really the question I wanted answered (mostly because it took 1500 words to explain what I actually want to do).
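
(For completeness: bind mounts are another way to make a directory show up under /mnt/deploy without a symlink at all, at the cost of one mount per client. The paths here are purely illustrative:)

mkdir -p /mnt/deploy/Client1
mount --bind /mnt/deployFS/1/Client1 /mnt/deploy/Client1
# persistent version, one line per client in /etc/fstab:
# /mnt/deployFS/1/Client1  /mnt/deploy/Client1  none  bind  0 0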

I should also say that Bash Cures Cancer thinks this is a terrible idea ;-)

So what do you think? Please let me know in the comments!

  • http://obfuscurity.com/ Jason Dixon

    This is not a new problem, but has been mostly ignored for a while due to the pervasiveness of logical volume managers, both open source and commercial. ISPs used to deal with this years ago in the form of growing mail storage demands. I think the reason your searches are coming up empty is because "sharding" is a relatively modern term; back in the day this was commonly referred to as "distributed X", as in "distributed mail storage". Routing client requests (e.g. mail proxies) would be a typical way to distribute the client base across a number of backend hosts or storage appliances. I'm certain that your approach is reasonable, but is almost certainly not new or novel. ;)

  • https://w3ttr3y.com William Triest

    When I worked for the Department of Chemistry at The Ohio State University (chemistry.ohio-state.edu), they did something similar for home directories, only without all the symbolic links.

    Basically we had a big storage pool that was the default for home directories, which was mounted as /home on all the unix-like operating systems. Then if a research group needed more space (and some needed a lot more), we would create /home/{professor's name} and we would mount a disk they bought there. Since research groups tended to have more money than the central funding, it was a great way to allow groups that needed it (i.e. primarily the theoretical chemistry groups) more space.

    This was a Solaris server with Solaris, Linux, IRIX, and AIX clients, and except for Linux they were mostly ancient versions.

    I never thought of it as anything special or thought to give it a name. If there isn't a standard name, it's worth coming up with one, as common nomenclature helps with discussions and makes it easier to find best practices.

  • Nathan Huff

    One risk I could see would be if you had a single client's storage needs grow larger than your file system size. This would break your naming scheme and would quickly lead to the symlink hell that Bash Cures Cancer is talking about.

    I have no idea what the chances are of that happening in your environment but based on the growth of data in your graphs and in my own experience I wouldn't be too comfortable betting against it either.

  • http://nathanpowell.org Nathan Powell

    I think this sounds like a perfectly reasonable plan. The only thing that was making me uncomfortable while I read it was the use of symlinks.

    I agree with BCC, but not because symlinks are inherently evil as the title of his post suggests, but because anything cutesy that is not readily apparent to any admin walking up to the box is a red flag. I think managing the symlinks will suck, and keeping track of that is what makes it sub-optimal. As Maxim suggests, I would document the hell out of it :)

    Having said that, sometimes we must get the work done, and cutesy solutions have to be used until it can be made better. And I would for sure seek to get rid of the symlink dependency as soon as possible.

    Not knowing your environment entirely, my gut tells me you need to make your application mount point aware, and make it do the management of remembering where data is.

  • Steve

    We have the same scenario you describe but set up on a Solaris environment and we have the added complexity of Solaris zones and NFS in the mix.

    The symlinks have caused problems multiple times; mostly it's been trying to remember how the links work, because it's not something we change very often. It's also a pain for new staff to get up to speed with.

    Our plan is to migrate this environment to ZFS and start using zpools. I'd rather manage one gigantic zpool than a bunch of small file systems. Missing LUNs could be a problem, but the complicated filesystem setup has caused more outages over the past 5 years than our SAN has.

  • http://andreas-lehr.com/blog/ Andy

    Since you mention Gluster, you should also take a look at HDFS, OpenAFS, Lustre, or MogileFS.

  • Brad

    I would first look upstream at data efficiency (is your data being stored in the most efficient manner?).
    I would also look into a DFS; see what your requirements are (read/write/POSIX compliance).
    If you are intending to keep growing, it's definitely something to look at.

  • http://saintaardvarkthecarpeted.com/blog Saint Aardvark the Carpeted

    Have you thought about using autofs to handle the let's-have-this-filesystem-appear-at-the-right-place-when-needed? It can do lots more than just NFS. And autofs might be a good way of making your setup obvious, thus filling the "document the hell out of it" requirement.

    (Sorry about my unhelpful answer on Twitter. That was not what you needed right then. :-) )

  • Bob Plankers

    I think the name for what Mr. Triest is thinking of might be AFS, an early distributed filesystem.

  • Doug Weimer

    I've had to do some symlink tricks like this in the past. Configuration management tools like Puppet or cfengine are a good way to manage them. Checking for both the new mount points and the symlinks in your monitoring system helps a lot as well.

  • http://phasorburn.com Trever

    Why not go deeper with the sharding? Instead of a filesystem per customer, just have a lot of filesystems of your optimum size and then have a symlink farm point down into many, many different filesystems, subdirs of same, even individual files.

    You can then rsync your data to different storage over time, change the symlinks after the data has moved, and clean out the old data as needed.

    BTW, I did this back in 1999 at a stock photography site that was later bought up by Getty Images.

  • http://www.paulparadise.com/ Paul Paradise

    I was one of the respondents on twitter mentioning UnionFS, but given some more detail on your setup, I don't think that's necessarily the right approach.

    Using symlinks isn't a bad idea at all, but the one concern I'd have is whether you'll always be able to fit all the content on a single system - eventually you run out of room for disks. It sounds like you've got a storage array which means it's not the current pressing issue, but the "classic" solution I've always seen is to have a bunch of NFS servers and use the automounter (autofs) to maintain the NFS mounts at the right place.

    As Saint Aardvark the Carpeted said, it's not strictly required to use NFS with autofs - even without NFS, it might be just as workable as symlinks while giving you some flexibility beyond what symlinks to filesystems can provide.

  • Avery Payne

    I'm still wrapping my head around this (in just 10 minutes) so I apologize in advance if I'm missing some critical point.

    First, I wouldn't start "top-down" with the design of the system. I would start "bottom-up", factoring in device and hardware failure as potential show-stoppers, and from the result, layer on volume management, followed by the filesystem.

    At the storage level, how are you assuring continuity during a drive failure? This is probably a bigger factor over re-arranging your filesystem layout. Having mirrored drives (or at least mirrored data) will cut your storage in half, but avoids outages. If mirroring is not practical, then I think your concept of segregating data makes more sense; a single drive failure that omits a portion of data is better than a single drive failure that omits all your data. Graceful degradation, and all that.

    LVM does have some interesting things in it, notably the ability to mirror volumes. This would be a useful tool to "move" data from one storage pool to another. I would seriously re-think the volume management you have and look at making several volume groups, say 2-3 drives to a group. The idea here is simple: if you notice that a drive's health is faltering, you can use LVM mirroring to start transitioning data to a volume group that's healthy, then break the mirror and re-mount the new (healthy) volume at the current mountpoint. This gives you an additional layer of failover when used with mirrored drives. Not only can you tolerate a single drive failure (at the storage level with a bad drive) but you can assure safety of data (by migrating data to healthy drives). Again, more graceful degradation.

    If you absolutely need to use LVM striping between drives (as in your current arrangement) so that you can get a large filespace, then at least use the mirrored-drives/separate-LVM-volumes concept when carving out space. You can continue to expand space (ext3 hot expansion, anyone?) as needed. The problem I have with striping this way is that recovery becomes an absolute mess - I always think about the worst-case situation, which would be a filesystem recovery. Having chunks of data strewn over multiple drives becomes, well, less-than-stellar. Having data chunked to a single drive keeps all of the relevant sectors within reach, an important point to consider. Still, if you absolutely need to have more space, and striping gets you there, then seriously consider striping over the top of redundant drives.

    I know you mentioned using rsync to move data from staging to production, but LVM mirroring is faster, md-based RAID mirrors faster still, and re-mounting a filesystem at a different mount point fastest of all. Something to consider at the filesystem level. You can have 256 active logical volumes in LVM, and while that's a lot, it's still a show-stopper if you have 500+ clients (for example). If you're thinking about sharding, then look carefully at grouping clients, say by first letter of their company name - which reduces the number of logical volumes down to something like 26, and keeps it a fixed number. Further, it doesn't matter what the customer's dataset size is, because you will hot-expand your filesystem on-demand. So while you have very few customers with the letter "Q" in front of them and lots with "S", it won't matter because the logical volume for "Q" can remain small - say 200GB - and the volume set for "S" can be grown again and again, say to 700GB.

    So let's stitch these suggestions together into an entire picture:

    At the storage level, see if you can make mirrored drives with the md device, using 2 drives to a mirror. This will cost you some money but gain you peace of mind. If you can't afford more drives (which are going for less than $100 per Terabyte for SATA) then consider making several separate volume groups, with no more than 2-3 drives per VG.

    At the volume level, use an allocate-on-demand approach for volumes, allocating only the space you need for each volume. Hot volume expansion and hot expansion of ext3 will accommodate this without taking the volumes offline. LVs are logically arranged by customer naming, using the first letter of the customer's name. If md-based mirrors are not available, then consider using LVM mirrors on different volume groups (which forces the data to be segregated into different physical drives), allowing you to move and/or replicate data on demand. Using LVM mirrors also means being able to migrate data from sick to healthy drives. If you do have md-based mirrors, you can carve off several PVs on different VGs, and then stitch them together into a single stripe that covers multiple VGs. The downside is that recovery of this data becomes difficult in a disaster-recovery mode; but this "criss-cross" allocation also makes your filesystem sizing very dynamic (I call it that because if you graph it out, logical volumes "criss-cross" over the top of your storage). Follow this up with UnionFS (as mentioned earlier) and you can keep your existing mountpoints, because the already-existing directory structure is maintained, but the data is logically laid out behind the scenes by LVM.

    Hopefully I have not made it tl;dr, and hopefully it makes sense (given I've pounded this out in about 5 minutes).

  • http://www.sgenomics.org/~jtang Jimmy Tang

    How about looking at Sector/Sphere?

  • Avery Payne

    D'oh, I just back-researched VG interaction and found that you apparently can't stripe over different VGs (ick!). So some of that posting was for naught. Sorry.

  • http://www.standalone-sysadmin.com Matt Simmons

    Avery: The storage underlying the LUNs is an EMC SAN, so the individual drives are abstracted away (and are accounted for in RAID-10 (or 6) and hot spares).

    I'm engineering this solution to get away from growing a volume size. It's very handy under a lot of circumstances, but in this case, it's causing more problems than it's fixing, in my opinion.

    Thanks for the comments and for sharing.

  • http://www.kickflop.net/ Jeff

    I've always called this storage management or ... system administration.

    I don't know who you work for or how important this data is (seems pretty important), so I can only speculate ("Reasonable" is relative).

    I'd do it with a ZFS zpool of LUNs created from properly fault-tolerant storage (which you don't seem to have). No symlinks. Could gain a shitload of space by turning on filesystem compression as you see fit (ZFS "filesystems" are almost as lightweight as directories). No fsck (who the hell does this still?).

  • http://www.standalone-sysadmin.com Matt Simmons

    @Jeff
    Thanks for the comment. If it's alright, I'm going to ignore the scoffing tone, and discuss the content of your comment.

    If there were a viable ZFS solution for Linux, and this wasn't residing on a SAN (with fault-tolerance, mind you), then maybe. My problem right now is that I'm using LUNs carved from disk pools, which I'm then taping back together with LVM to make them one logical filesystem. I'm working on a way to avoid that.

    The actual space involved isn't going to change much (although by implementing my rolling file archive policy, I'll slow the growth), but the capacity isn't the problem (yet), it's managing it efficiently. If I can separate the one block of data into smaller blocks, I gain flexibility and stop relying on LVM to expand my usable file space.

    I want to grow horizontally instead of vertically. I think this is a good way to go with it, anyway.

  • http://www.kickflop.net/ Jeff

    I understand well what you have in place right now.

    I misunderstood, perhaps, your call to commenting.

    If you're asking, "I'm ready to pull the trigger on this plan. Does anyone see anything majorly wrong with it, without changing or adding any technology?" then my answer is "Nope"

    We'll have to shrug and disagree on the fault tolerance issue if you're worried about LUNs disappearing at all.

  • http://www.standalone-sysadmin.com Matt Simmons

    @Jeff - I might have taken your previous comment a bit too antagonistically. If so, sorry.

    I'm not opposed to ZFS (or other solutions which involve purchase of additional hardware), but I just can't make those changes right now. I don't have the budget (or A budget, but that's a different discussion). I'm basically stuck with the hardware solutions that I have and I'm trying to improve the organization of the data. I really did want to know what solutions out there solve this, but as it is, yes, I've got Linux running on a low-end EMC Clariion that I'm fully capable of screwing up to the point where a single LUN doesn't show. It hasn't happened, but it's possible. When that does happen, I'd rather the whole system not fall down around my ears.

    Plus, yeah, under Linux, fsck is reality. Filesystems can be dirty (or at least, unclean), so a 4 hour fsck isn't out of the realm of possibility...something else I'd like to avoid.

  • ChrisW

    Matt,

    Apologies if you've covered this and I've missed it, but you say you're using an EMC Clariion; can you not use MetaLUNs? I'm not a huge fan of taping lumps of storage together, but I know only too well that rapidly increasing storage demands can result in having to put in place solutions that don't always make us smile.

    Chris

  • http://www.standalone-sysadmin.com Matt Simmons

    @ChrisW

    I didn't cover it. Actually, I've got an AX4-5, which, as you probably know, is scraping the bottom of the Clariion barrel ;-) I don't have that ability with my storage array, sadly.

  • stuart

    Interesting that most people are picking up on the disk related aspects and only one comment appears to have picked up on autofs.

    We used it similarly with several independent student / staff filesystems all automounted via NFS to /home on the appropriate hosts.

    If you need to automount to a different level, just adjust the mount mappings. You can also automount from one filesystem into another, so that you can separate your actual filesystem structure and your application views into separate locations.

  • Rex white

    I might just be a country bumpkin, but all those LUNs and mounts scare me. Copying data from LUN to LUN as/if one grows too large, plus lots of symlinks: more scary. Get a scalable NAS system and just mount one NFS dir. Most of the modern ones let you expand the storage space in real time. I'm personally partial to Isilon.

  • andrew

    My 2 cents...

    ZFS on OpenSolaris is great when it works, hell when it does not (at least for a Linux-centric, overworked SysAd). All I can argue is to not try a new technology in production if the old one works.

    Symlinks that are host resident are a pain. Symlinks in network shared filesystems work great. More specifically:
    /localdisk/symlink -> NFS = bad
    /localdisk/NFS/symlink -> other NFS = okay
    This way all of the symlinks are in one place and only one place.
    Managing a host of symlinks across a dozen systems is a nightmare.

  • flare

    Have a look at Ceph for the future, and the bind option for mount might be better than symlinks.

  • http://lukecyca.com Luke

    autofs, ftw.

    We have an autofs map generated by Bcfg2 by referencing an XML list of projects and the NFS server they live on. It pushes it out to all of our servers and clients so that all of the directories are available in the same namespace on each host. We can move projects between filers with only a couple of minutes of downtime!

    You could have an autofs map of bind mounts too, which would be like your symlink idea, but more self-documenting, and a little more abstracted for users of the consolidated filesystem.

  • Jason

    I've never heard of the term "sharding" when talking about what you are doing. The term is "partitioning". Certainly in databases partitioning is splitting a table across multiple LUNs.

  • http://bashcurescancer.com Brock Noland

    I would look at Gluster or sharding at a higher level. By the latter I mean: shard the data into 32 buckets. Then if you need to add disk, you up that to 64 buckets and your app migrates the data.