Appropriate Redundancy

Date June 25, 2008

I think we can all agree: more uptime is better. The path to that goal is what separates us into different camps: the men from the boys, or perhaps more accurately, the paranoid from those who have never had a loaded 20-amp circuit plugged into another loaded 20-amp circuit. (True story from the colocation facility we're at. The person responsible no longer works there.)

When it comes to planning for redundancy, there's always a point where things get impractical, and that line is probably the budget. Most companies I've worked for are willing to put some money into redundancy as long as it's not "a lot" (whatever that is), and to even get them that far, you're probably going to have to work for it. The logic of buying more than one of the same thing is sometimes lost on people who balk at purchasing such frivolities as ergonomic keyboards.

Assuming you can get your company behind your ideas, here's how you might plan for redundancy:

  1. Power
    If you can at all avoid it, don't buy servers without dual power supplies. Almost every server you can get today has this feature, and there's a good reason for it.

    Dual power supplies are hot-swappable, which means when one dies, you can replace it without stopping the machine. If you're not in a physical location with redundant power, you can also hook each of the power supplies into a different battery backup (UPS).

    Many servers come with a 'Y' cord to save you from using extra outlets. It looks cool, but ignore it. If you plug both power supplies from the same server into the same battery and that battery dies during a power event, so does your server. Each cord goes to its own battery. That allows the batteries to share the load, and it buys you some insurance if one of them bites it.
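
    Here's a rough Python sketch of how you might sanity-check that rule. It assumes you keep an inventory of which UPS feeds each power supply. The server names, UPS names, wattages, and capacities are all invented for illustration. It flags servers whose two supplies share a UPS, and it checks whether each UPS could carry the whole load alone if the other one died.

```python
# Hypothetical inventory: server -> (UPS feeding PSU A, UPS feeding PSU B, watts drawn).
# All names and numbers are invented for illustration.
SERVERS = {
    "web1": ("ups-a", "ups-b", 300),
    "db1": ("ups-a", "ups-a", 450),   # both cords on one UPS: a single point of failure
    "files1": ("ups-b", "ups-a", 350),
}

# Hypothetical usable capacity of each UPS, in watts.
UPS_CAPACITY_W = {"ups-a": 1000, "ups-b": 1000}


def servers_sharing_a_ups(servers):
    """Return the servers whose two power supplies are fed by the same UPS."""
    return sorted(name for name, (a, b, _watts) in servers.items() if a == b)


def load_if_other_ups_dies(servers, surviving_ups):
    """Worst-case watts on surviving_ups when every other UPS is dead.

    A server stays up (and pulls its whole draw from the survivor) only if
    at least one of its supplies is plugged into the surviving UPS.
    """
    return sum(watts for a, b, watts in servers.values() if surviving_ups in (a, b))


if __name__ == "__main__":
    print("Shared UPS:", servers_sharing_a_ups(SERVERS))
    for ups, capacity in UPS_CAPACITY_W.items():
        load = load_if_other_ups_dies(SERVERS, ups)
        verdict = "ok" if load <= capacity else "OVERLOADED"
        print(f"{ups}: {load} W of {capacity} W if it stands alone ({verdict})")
```

    The second check matters because sharing the load only helps while both batteries are alive. If one dies, the survivor has to carry everything still plugged into it.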

  2. Networking
    Again, almost all servers these days come with on-board dual network cards. I used to wonder why, since I've only rarely put the same server on two networks, and setting up DNS for dual-homed computers is a pain in the butt.

    After much research, I learned about bonded NICs. When you bond NICs, your two physical interfaces (eth0 and eth1) become slaves to a virtual interface (typically bond0). Depending on the bonding mode, you can get redundancy AND load balancing, essentially giving you up to 2Gb/s of aggregate throughput instead of 1 (a single connection may still be limited to the speed of one link).
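
    On Linux, the bonding driver reports the state of a bond in /proc/net/bonding/bond0. The Python sketch below parses that report so a monitoring script can notice when one slave has dropped out. The sample text is a trimmed, made-up illustration of the format rather than output captured from a real host, and the function name is my own.

```python
# Trimmed, illustrative sample of what /proc/net/bonding/bond0 looks like.
SAMPLE = """\
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100

Slave Interface: eth0
MII Status: up
Link Failure Count: 0

Slave Interface: eth1
MII Status: down
Link Failure Count: 1
"""


def parse_bond_status(text):
    """Return (bonding mode, {slave name: MII status}) from a bonding report."""
    mode = None
    slaves = {}
    current_slave = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line, or a line without a "key: value" shape
        key, value = key.strip(), value.strip()
        if key == "Bonding Mode":
            mode = value
        elif key == "Slave Interface":
            current_slave = value
        elif key == "MII Status" and current_slave is not None:
            # The first "MII Status" line belongs to the bond itself and is
            # skipped, because no slave section has started yet.
            slaves[current_slave] = value
    return mode, slaves


if __name__ == "__main__":
    mode, slaves = parse_bond_status(SAMPLE)
    down = [name for name, status in slaves.items() if status != "up"]
    print("Mode:", mode)
    print("Slaves:", slaves)
    print("Bond is", "degraded: " + ", ".join(down) if down else "healthy")
```

    On a real host you would feed it the contents of /proc/net/bonding/bond0 instead of SAMPLE. A bond with one dead slave keeps passing traffic, which is exactly why it's worth checking: you won't notice the failure otherwise.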

    Two NICs acting like one is only halfway ideal, though. In the event of a switch outage (or more likely, a switch power loss), your host is still essentially dead in the water. To remedy this, use two switches for twice the reliability. For my blades, I've got all of the eth0 cables going to one switch, and all of the eth1 cables going to another. I also use color-coded cables so I can easily figure out what goes where. Here's a pic:

    As before, make sure that your switches are plugged into different UPSes, if possible.
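
    The same idea can be written down as a quick single-point-of-failure check. This is only a sketch in Python, assuming you record which switch each NIC is cabled to and which UPS powers each switch. The blade, switch, and UPS names are hypothetical.

```python
# Hypothetical cabling: host -> switch that each NIC is plugged into.
NIC_TO_SWITCH = {
    "blade1": {"eth0": "switch-1", "eth1": "switch-2"},
    "blade2": {"eth0": "switch-1", "eth1": "switch-2"},
    "blade3": {"eth0": "switch-1", "eth1": "switch-1"},  # miscabled on purpose
}

# Hypothetical power feed: switch -> UPS.
SWITCH_TO_UPS = {"switch-1": "ups-a", "switch-2": "ups-b"}


def hosts_cut_off_by(failed, nic_to_switch, switch_to_ups):
    """Return hosts that lose every network path when `failed` (a switch or a UPS) dies."""
    cut_off = []
    for host, nics in nic_to_switch.items():
        surviving = [
            switch for switch in nics.values()
            if switch != failed and switch_to_ups[switch] != failed
        ]
        if not surviving:
            cut_off.append(host)
    return sorted(cut_off)


if __name__ == "__main__":
    components = sorted(set(SWITCH_TO_UPS) | set(SWITCH_TO_UPS.values()))
    for component in components:
        victims = hosts_cut_off_by(component, NIC_TO_SWITCH, SWITCH_TO_UPS)
        print(f"If {component} fails: {victims or 'nobody goes dark'}")
```

    If any single switch or UPS shows up with victims, that cable or power cord is in the wrong place.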

  3. Redundant Servers
    Despite your better efforts, a server will inevitably find a reason to go down. Whether it's rebooting for patching, electrical issues, network issues, or some combination of the above, you're going to have to deal with the fact that the server you're configuring may not always be available.

    To combat this affront to uptime, we have the old-school option of throwing more hardware at it. To quote Hadden from the film Contact: "Why build one when you can have two at twice the price?"

    Getting two servers to work together is contingent on two things: what service you're trying to back up, and where you're getting your data. Simple things, such as web sites, can be as easy as an rsync of webroots, or maybe remotely mounting the files from a fileserver over NFS. Other things, such as NFS, require expensive Storage Area Networks (SANs) or high-bandwidth block-level filesystem replication across the network. Either way, it's something that you should research heavily, and is outside of the scope of this blog.
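
    For the simple webroot case, here's a minimal Python sketch that builds the rsync command to mirror a webroot onto a standby server. The path /var/www/html and the host web2.example.com are placeholders, and the helper name is my own. It defaults to rsync's --dry-run so nothing is changed until you've looked at what it would do.

```python
import subprocess


def build_rsync_command(webroot, standby_host, dry_run=True):
    """Build an rsync command that mirrors webroot to the same path on standby_host."""
    # Trailing slash: copy the directory's contents, not the directory itself.
    source = webroot.rstrip("/") + "/"
    # -a preserves permissions, times, and symlinks; --delete removes files
    # on the standby that no longer exist on the primary.
    command = ["rsync", "-a", "--delete"]
    if dry_run:
        command.append("--dry-run")
    command += [source, f"{standby_host}:{source}"]
    return command


if __name__ == "__main__":
    cmd = build_rsync_command("/var/www/html", "web2.example.com")
    print(" ".join(cmd))
    # Uncomment to actually run it (needs rsync and SSH access to the standby):
    # subprocess.run(cmd, check=True)
```

    Run from cron on the primary, something like this keeps the standby's files reasonably fresh. Be careful with --delete: pointed at the wrong path, it will happily remove things you wanted to keep.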

Hopefully you can use this information to give you some ideas on redundancy in your own infrastructure. It's a rare sysadmin who can't afford to spend some time to consider things that would make their network more fault tolerant.

7 Responses to “Appropriate Redundancy”

  1. Clif said:

    Thanks for the reminder. Redundancy is one of the most important aspects of a quality infrastructure, but convincing the Powers-That-Be is quite difficult. Enjoy reading your blog, keep it up!

  2. Ben C said:

    An excellent article that discusses many of the fine points of redundancy, but I feel like you left a few things out.

    1. Redundant admins. Not always possible in a small shop, but you should at least have someone who can handle basic troubleshooting, and maybe has elevated privileges for a limited set of commands/servers. After all, even the best sysadmins need to take a vacation sometimes.

    2. Redundant buildings. If you have two physically redundant servers, you're best served putting them in separate buildings (different cities are even better, provided you can be assured of the physical security, and you can get to it when necessary).

    3. Virtualization. If most of your services aren't too much of a drain on resources, you have two options. You can run them on a bunch of small machines (I inherited 5 desktops each running a separate service), or you can get two beefy machines and run your services on a virtual instance (VMware, Xen, etc.). If each physical machine has a copy of the VM, you can be reasonably assured you'll have reliable uptime (barring the catastrophic sudden failure of both servers).

    4. Redundant disks. Any critical service should live on a RAID array.

    5. No unnecessary redundancy. When budgeting is an issue, pay close attention to what needs to be kept up, and what can afford to take an hour or a day of downtime.

    Inflation seems to have grown my $0.02 into a nickel, so I'll stop now, before I end up writing my own post.

  3. Matt said:

    Clif:

    Thanks for the encouragement! I'll keep it up. Keep reading! :-)

    Ben:

    You're right, those are all very important aspects to redundancy, most especially the RAID suggestion.

    Unfortunately, too many of us are without the means to have redundant admins. That's mostly why I started this blog. Along with that come other small-infrastructure issues like not having multiple buildings, and maybe not being able to utilize virtualization to its full potential.

    That's not to say virtualization isn't useful, and can't be a boon to uptime and lower costs, but most standalone sysadmins wouldn't be able to afford a single ESX server, let alone the array that would make them really worthwhile.

  4. Ben C said:

    Matt, I understand that a lot of what I suggested isn't applicable in small circumstances, and that is the primary audience. I do think that often you can find one of your users who is savvy enough to at least be able to do basic troubleshooting, even if it's just power-cycling the right box. Not always, but on occasion. As for virtualization, you're right that ESX server is way more than most people can afford. However, the VMware Server product is free, and with some creative deployments, can function well enough to provide a degree of redundancy that isn't otherwise available.

    Like Clif, I really enjoy reading your blog, so please don't take my comments as criticism. I work at a large university, but I'm the only Linux sysadmin in my department, so I get the best (and worst!) of both worlds. Keep up the good work!

  5. Matt said:

    Ben,

    Sorry if it sounded like I was being defensive. I didn't mean it to be that way at all. I just meant to explain why I left out things like virtual machines and the like.

    Please, continue to read and comment, and definitely continue to add to the discussion. Your first post was right on mark. Those are all very important tools to use in maximizing usability.

    I admit, my target audience is the small systems administrator, but I don't want to alienate any admins of larger infrastructures, either. The people who are out there doing the things the rest of us are trying to learn are great sources of information, and everyone is welcome here. :-)

    Again, thanks a lot for commenting!

  6. Matt said:

    Just a comment, I found this story was linked to from ComputerWeekly.com here:
    http://www.computerweekly.com/Articles/2008/07/23/231602/how-to-cope-with-broadband-failure.htm

  7. PowerConnect switches, Juniper firewalls and esx redundancy - Admins Goodies said:

    [...] Appropriate redundancy: http://www.standalone-sysadmin.com/blog/2008/06/appropriate-redundancy/ [...]
