September 3, 2009
Ah, the good old days. Remember when your computer's uptime hit 365 days? Weren't you proud? If you never got there, you probably saw 100, and were impressed with yourself and happy to get to triple digits. For a long time, a system's uptime was considered a badge of honor. Almost perversely, the higher the uptime, the more hesitant we were to bring down the machine for updates or repairs. I remember this stage very well. I ran Slack. Heck, that was Slack's big claim to fame. The boxes were so stable, they could literally run for years, and inevitably the hardware conked out before the software did.
At some point during the last decade or so, system administration as a whole wised up, or at least a lot of us did. Software patching, kernel updates, security fixes: they all combine to rob us of the big numbers when we run 'uptime'. Uptime is really just an ever-increasing number, though. Why did we ever get attached to it in the first place?
Sure, humans in general "like" increasing numbers. We relate to them, and the bigger they are, the more awe we feel. Just look at the brouhaha about Y2K. I don't mean the completely legitimate "computers are going to break" concern; I mean the "end of the world Jesus is coming look busy" concern that people felt. It's just a big round number. We like that sort of thing. If you doubt me, how many pictures of odometers would it take to convince you?
But on a more obvious level, we were proud of our systems. Uptime was a signal of what good shepherds we were. A system with a multi-year uptime had been providing uninterrupted service. It could be relied upon, and by extension so could we. Definitely a badge of honor for the sysadmin who had the longest-running machines.
Now the pendulum has swung the other way. We no longer have ever-bigger uptimes to brag about, and if someone tries, the first question they get asked is "what's the matter, don't you apply security patches?" There's a valid argument there, too. Look at an exploit database and tell me you don't want to update your kernel as soon as you read through it.
Ironically, though individual machines' uptime has decreased, there has been a concerted effort by sysadmins to increase the reliability of the service that their servers provide. As paradoxical as it sounds, lower uptimes really can contribute to meeting (or exceeding) your service level agreement. The trick? Redundant servers.
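To put rough numbers on that paradox, here is a minimal sketch of the arithmetic. It assumes the servers fail independently of one another and that any single surviving server can carry the whole load; both are simplifications, and the function name is my own invention for illustration.

```python
def service_availability(per_server: float, servers: int) -> float:
    """Chance that at least one of `servers` machines is up.

    Assumes failures are independent and that one surviving
    machine is enough to keep the service running.
    """
    if not 0.0 <= per_server <= 1.0:
        raise ValueError("per_server must be between 0 and 1")
    if servers < 1:
        raise ValueError("need at least one server")
    # The service is down only when every server is down at once.
    return 1.0 - (1.0 - per_server) ** servers


if __name__ == "__main__":
    # Hypothetical figure: each box is up 99% of the time.
    for n in (1, 2, 3):
        print(f"{n} server(s): {service_availability(0.99, n):.4%}")
```

Under those assumptions, a second machine that is individually no more reliable than the first takes the service from two nines to roughly four. Real failures are rarely fully independent (shared power, shared switch, shared bad config), so treat the result as an upper bound.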
Five years ago, in most small shops, you probably had a web server. It wasn't a particularly beefy machine, unless you had a very busy website. It was probably a previous-generation fileserver, or something similar. If it needed the kernel updated or rebooted for something, you grumbled and came in at night to take it down, then came in late the next day (if you were lucky!).
Now, for a couple thousand dollars, you can get a hardware load balancer and put a few old servers up as web servers. When one needs an update, you just do it. The load balancer detects that the machine is down and seamlessly forwards traffic to the other servers. When the machine comes back up, it goes straight back into rotation. For less than the price of a new server, a load balancer is an incredible deal, and it can front all sorts of services, not just web.
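If you've never looked inside one of these boxes, the core idea is simple. Below is a toy sketch of round-robin selection that skips servers marked as down. It is not how any particular product works: the `Backend` and `RoundRobinBalancer` names are made up for illustration, and the `healthy` flag is flipped by hand here, where a real balancer would set it from periodic health checks.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    healthy: bool = True  # a real balancer would update this from health probes


class RoundRobinBalancer:
    def __init__(self, backends):
        self.backends = list(backends)
        self._next = 0  # index of the next backend to try

    def pick(self) -> Backend:
        """Return the next healthy backend, skipping any marked down."""
        for _ in range(len(self.backends)):
            backend = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if backend.healthy:
                return backend
        raise RuntimeError("no healthy backends available")


if __name__ == "__main__":
    pool = [Backend("web1"), Backend("web2"), Backend("web3")]
    lb = RoundRobinBalancer(pool)

    pool[1].healthy = False  # web2 goes down for a kernel update
    print([lb.pick().name for _ in range(4)])

    pool[1].healthy = True   # web2 is back and rejoins the rotation
    print([lb.pick().name for _ in range(3)])
```

While web2 is marked down, requests alternate between web1 and web3; once the flag is restored, web2 is picked up again on the next pass without any other change.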
Likewise with file servers. Where it used to be one big tower with a ton of disks, now it's a lightweight SAN device that you paid a few thousand for, but the benefit is that you can have several servers all reading from the same data (using clustering or what have you), allowing you to take down any one of them without a loss of service.
Where this has gotten us is to a place and time where uptime as we knew it is dead. In its place is a new sort of uptime, one measured not by running the "uptime" command, but by the months and years between service outages. Design your infrastructure so that losing any single machine won't take down your service, and your customers will be very happy.
Incidentally, many IT organizations create Service Level Agreements (SLAs), which give them targets to achieve for system availability. If you don't have one, consider drafting one. There are articles available to give you more information on the subject, so look around.
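An availability target is easier to reason about once you turn it into a downtime budget. Here is a small sketch of that conversion; the targets shown are common examples rather than figures from any particular SLA, and the function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # ignores leap years for simplicity


def downtime_budget_hours(target_percent: float,
                          period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of outage allowed in a period while still meeting the target."""
    if not 0.0 <= target_percent <= 100.0:
        raise ValueError("target_percent must be between 0 and 100")
    return period_hours * (1.0 - target_percent / 100.0)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        hours = downtime_budget_hours(target)
        print(f"{target}% per year allows about {hours:.2f} hours of downtime")
```

A 99.9% yearly target leaves a little under nine hours of outage, which a single box can burn through in a few late-night maintenance windows. That is the practical case for counting service outages rather than per-machine uptime.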
I'm interested in hearing about what you do to engineer your systems to be more reliable. Also, if you've got an SLA (or are thinking about one), I'd be really interested to learn about it. I don't have one yet, and although my company isn't that structured yet, I can see it heading in that direction. Thanks!