Modern uptime – measured from the outside in

Ah, the good old days. Remember when your computer’s uptime hit 365 days? Weren’t you proud? If you never got there, you probably saw 100, and were impressed with yourself and happy to get to triple digits. For a long time, a system’s uptime was considered a badge of honor. Almost perversely, the higher the uptime, the more hesitant we were to bring down the machine for updates or repairs. I remember this stage very well. I ran Slack. Heck, that was Slack’s big claim to fame. The boxes were so stable, they could literally run for years, and inevitably the hardware conked out before the software did.

At some point during the last decade or so, system administration as a whole wised up, or at least a lot of us did. Software patching, kernel updates, and security fixes all combine to rob us of the big numbers when we run ‘uptime’. Uptime is really just an ever-increasing number, though. Why did we ever get attached to it in the first place?

Sure, humans in general “like” increasing numbers. We relate to them, and the bigger they are, the more awe we feel. Just look at the brouhaha about Y2k. I don’t mean the completely legitimate “computers are going to break” concern, I mean the “end of the world Jesus is coming look busy” concern that people felt. It’s just a big round number. We like that sort of thing. If you doubt me, how many pictures of odometers would it take to convince you?

But on a more obvious level, we were proud of our systems. Uptime was a signal of what good shepherds we were. A system with a multi-year uptime had been providing uninterrupted service. It could be relied upon, and by extension so could we. Definitely a badge of honor for the sysadmin who had the longest-running machines.

Now the pendulum has swung the other way. We don’t have increasingly bigger uptime to brag about, and if someone tries, the first question they get asked is “what’s the matter, don’t apply security patches?” There’s a valid argument there, too. Look at an exploit database and tell me you don’t want to update your kernel as soon as you read through it.

Ironically, though individual machines’ uptime has decreased, there has been a concerted effort by sysadmins to increase the reliability of the service that their servers provide. As paradoxical as it sounds, lower uptimes really can contribute to meeting (or exceeding) your service level agreement. The trick? Redundant servers.
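To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes each server fails independently of the others, which machines sharing a rack, a power feed, or a storage backend often don't, so treat the result as an optimistic upper bound. The function name and the 99% figure are only illustrations, not measurements from any real environment.

```python
def combined_availability(single: float, servers: int) -> float:
    """Chance that at least one of `servers` machines is up.

    `single` is the availability of one machine as a fraction (0.99 = 99%).
    Assumes failures are independent, which is a simplification.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if servers < 1:
        raise ValueError("need at least one server")
    # The service is down only when every machine is down at the same time.
    all_down = (1.0 - single) ** servers
    return 1.0 - all_down
```

Under those assumptions, two machines that are each up 99% of the time (plenty of room for reboots and patching) give a service that is up roughly 99.99% of the time, because an outage requires both to be down at once.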

Five years ago, in most small shops, you probably had a web server. It wasn’t a particularly beefy machine, unless you had a very busy website. It was probably a previous-generation fileserver, or something similar. If it needed the kernel updated or rebooted for something, you grumbled and came in at night to take it down, then came in late the next day (if you were lucky!).

Now, for a couple thousand dollars, you can get a hardware load balancer and put a few old servers up as web servers. When one needs an update, you just do it. The load balancer detects that the machine is down and seamlessly forwards traffic to the other servers. When the machine comes back up, it goes right back into rotation. For less than the price of a new server, a load balancer is an incredible deal, and it can balance all sorts of services, not just web.
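The behavior described above (skip a machine that is down, pick it up again when it returns) can be sketched in a few lines of Python. This is a toy illustration, not how any particular load balancer product works: the class name, the `is_healthy` callback, and the server names are all made up for the example, and a real balancer would probe health on a timer rather than on every request.

```python
class RoundRobinBalancer:
    """Toy round-robin balancer that skips backends failing a health check."""

    def __init__(self, backends, is_healthy):
        self.backends = list(backends)
        self.is_healthy = is_healthy  # hypothetical callback: backend -> bool
        self._next = 0                # index where the next search starts

    def pick(self):
        """Return the next healthy backend, or raise if none are up."""
        n = len(self.backends)
        for offset in range(n):
            idx = (self._next + offset) % n
            candidate = self.backends[idx]
            if self.is_healthy(candidate):
                # Resume after this backend on the following request.
                self._next = (idx + 1) % n
                return candidate
        raise RuntimeError("no healthy backends available")
```

If `web2` is rebooting for a kernel update, requests simply alternate between `web1` and `web3`. Once the health check passes again, `web2` rejoins the rotation without anyone touching the balancer.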

Likewise with file servers. Where it used to be one big tower with a ton of disks, now it’s a lightweight SAN device that you paid a few thousand for, but the benefit is that you can have several servers all reading from the same data (using clustering or what have you), allowing you to take down any one of them without a loss of service.

This has brought us to a point where uptime as we knew it is dead. In its place is a new sort of uptime, measured not by running the “uptime” command, but by the months and years between service outages. Design your infrastructure so that the loss of any single machine won’t destroy your service uptime, and your customers will be very happy.

Incidentally, many IT organizations create Service Level Agreements (SLAs) which give them targets to achieve for system availability. If you don’t have one, consider drafting one. There are plenty of articles available on the subject, so look around.
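If you are wondering what an availability target means in practice, the arithmetic is simple enough to sketch in Python. The function names, the 30-day period, and the sample outage lengths below are assumptions for illustration only. A real SLA would also spell out what counts as an outage and whether planned maintenance is excluded.

```python
def downtime_budget_minutes(target_percent: float, period_days: float = 30.0) -> float:
    """Minutes of outage allowed in the period while still meeting the target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - target_percent / 100.0)


def availability_percent(outage_minutes, period_days: float = 30.0) -> float:
    """Service availability over the period, given a list of outage lengths in minutes."""
    total_minutes = period_days * 24 * 60
    return 100.0 * (1.0 - sum(outage_minutes) / total_minutes)


def met_target(outage_minutes, target_percent: float, period_days: float = 30.0) -> bool:
    """True if the measured service availability reached the target."""
    return availability_percent(outage_minutes, period_days) >= target_percent
```

A 99.9% target over 30 days leaves about 43 minutes of service outage to spend. Note that what gets counted here is time the service was unavailable, not time any individual machine spent rebooting.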

I’m interested in hearing about what you do to engineer your systems to be more reliable. Also, if you’ve got an SLA (or are thinking about one), I’d be really interested to learn about it. I don’t have one yet, and although my company isn’t that structured yet, I can see it heading in that direction. Thanks!

  • I’m FINALLY getting to catch up on some of my reading and yes, this is one of them.

    One of the things that needs to be remembered in this professional and polished environment we live in is that uptime, be it measured as pure uptime or available time, while important, can be completely undone by bad time to repair.

    Sure it’s impressive that someone’s machine has 456 days of uptime. We already know that this means there’s a lack of patches and “maintenance” done to the system. So what happens when the sprinklers go off? What? You’re gonna quit? No you’re not, you’re like the rest of us and you’re going to spend the next sleepless nights trying to reconstruct what was running instead of rebuilding the machine using current patch levels and then copying over the customer’s applications/webpages/other from the backups, looking like a genius and going home at a reasonable hour.

    I used to be a projectionist and got taught that “The difference between amateurs and professionals is that professionals make their mistakes quietly.” I also learned that speed of recovery is the measure of a professional.

  • Bob

    ‘A system with a multi-year uptime had been providing uninterrupted service.’

    I have always been sceptical about people who brag about system uptime. That a box itself is online for years never meant the services that the box provided were available to users for the same amount of time.

    System uptime is just what the word implies, system uptime. I am much more interested in service availability, though it is nice to have triple digits when I type my ‘uptime’ command.

  • Many moons ago, we hired an electrician to help us live move a Solaris server since it had an uptime of over 450 days. When we finally did shut that server down, it hadn’t been rebooted in almost 700 days. Oh the days of wanton IT spending.

  • chewy_fruit_loop

    thankfully i don’t have to worry about things like that, I just have to make sure that there are no major interruptions to people’s workflows.

    the only real thing i need to do is keep everything up for about 12 hours a day, the rest of the time it’s unlikely people will notice an outage.

    but we’re now starting to serve data to sites that are following the sunlight :( but it’s not a big deal for the most part

  • Bart

    System uptime is nothing to brag about; all it says about you is that you do a lousy job at keeping your servers secure. A few months ago I had a discussion with a sysadmin who claimed patches weren’t required because his servers were on a secure VLAN (whatever that may be). A few weeks later, all his servers were infected by Conficker and caused an outage that cost millions. I haven’t heard from him since; I’m quite sure he got fired.

    It’s all about service availability. I constantly reboot servers during production hours, but thanks to good redundancy, nobody ever notices. In fact, the services have been more reliable. The actual method employed here depends on the service, but most use load-balancers.

    Other than security there are many more reasons to reboot. If you never reboot a server, how will you know that it will boot correctly? I’ve seen servers develop faults on parts of the disk that were only used during boot. I’ve often encountered sysadmins who install some service, start it manually and forget to add it to the startup scripts. These are not the kind of things you want to be dealing with at 3AM when a power outage causes that server to reboot.

    Another advantage of redundant servers is that you need some way of keeping your configuration in sync, which often involves some kind of configuration management. This will eventually result in faster repair times when something goes wrong.

    Proper load-balancers are required to make many services truly redundant. But I will never let anything with a Barracuda badge inside the datacenter. If you are going to load-balance many critical services, do you really want to pull all traffic through some cheap box? I’d go for one with a big red light on the front. It may be more expensive, but it’s worth it.

  • Devin

    Now you ruined my pride at seeing my odometer hit 66666.6

  • Good article. I’d also add software load balancing into the mix, like HAProxy. Can be wrapped into what’s essentially an appliance for deployment, too.

    I do think you introduce one fail as part of the solution, and that’s that lightweight SAN you talk about. The SAN becomes a single point of failure (and they do/will fail!), and even SANs need upgrades. Because more things rely on them, both these things cause hassle.

    I prefer local storage, aka shared nothing. And I make servers not just redundant but expendable. That is, if one fails completely, there’ll be at least one (if not more) copies of both data and processing power elsewhere. Then you can just take out anything without a worry.

  • In the last job I worked at, we had a sales director who insisted we use the term “Towards 100% Uptime!” … to me it just smacked too much of “To infinity and beyond!” However, the real problem I had with it was that it created too unrealistic an expectation.

    The more customer sites I saw, the more I realised that the mainframe teams always had it right – uptime is not about continuous availability, but continuous negotiated availability. Midrange system administrators in particular have too often become hung up on perfect uptime records, when in actual fact a bit of downtime now and again can sometimes be good – for performance, for maintenance and stability, etc.

    Here’s an example – a customer (a major financial institution in Australia) once had some business improvement consultants in, and they surveyed the IT people and asked them what their system availability percentages were like. The general consensus was around the 90% mark. On the other hand, the end users, when asked, presented a strikingly different average – around 60% or perhaps even less. The main reason determined for the difference was that the IT people were measuring uptime, whereas the users were measuring responsiveness.

    Uptime by itself isn’t a good enough system metric. Negotiated uptime, on the other hand, is much better.

  • I’ve been working a lot with F5 BIG-IP boxes. They’re pretty frikkin sweet.
