"Time Since Install" is the new "Uptime"

July 16, 2014

Welcome to the converged future where everything is peachy. We use configuration management to make our servers work exactly as we want, our infrastructures have change management procedures, test-driven design, and so on. Awesome!


We're well past the date where uptime (as in, the number of hours or days a specific server instance has been running) is a bragging right. We all know implicitly that the longer a server runs, and the more changes we make to it, the less likely it is to start up and return to a known-good state. Right? Because that's a real thing. Uptime of several years isn't just insecure, it's a dumb idea for almost all server OSes - you just don't know what state the machine will be in WHEN it restarts (because eventually, it will restart). I think we can take that as read.

So that leads me to an observation that I had recently... I think that the concept of repeatability and reliability of services following a reboot can be extended to the time since installation. Clearly, configuration management is meant to allow for utterly replicable machines. You're defining exactly what you want the machine to do in code, and then you're applying that configuration from the same code. A leads to B, so you have control.

The other, uglier side of that coin is that modern configuration management solutions are application engines, not enforcement engines. You can write your puppet code, say, so that a change is applied. But what if you make a change outside of configuration management? An overly simple example might be using puppet to install Apache, then manually installing PHP through the command line.
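A minimal sketch of that scenario, assuming a RHEL-family box and the stock package provider:

```puppet
# Puppet is told about Apache, and only Apache.
package { 'httpd':
  ensure => installed,
}

# Meanwhile, someone logs in and runs `yum install -y php` by hand.
# Nothing in the catalog mentions php, so every subsequent puppet run
# applies the httpd resource and walks right past the hand-installed
# package. Puppet applied what it was told; it enforced nothing.
```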

OK, OK, OK. I know. No one is going to do that. That's stupid. I know. I said it was overly simple. But the truth is, unless you tell puppet specifically to enforce a resource, it leaves that resource alone - if you don't want PHP installed, you can't just not define it... you must say "ensure => absent" or puppet doesn't care. That's what I mean by enforcement. The dictated configuration is not "this and only this".
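To make that concrete, the only way to get puppet to act on the hand-installed package is to declare it explicitly - a sketch, using the same package name as above:

```puppet
# Declaring the unwanted package is the only thing that makes puppet
# remove it; silence in the catalog is treated as indifference.
package { 'php':
  ensure => absent,
}
```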

So while you're not going to do something monumentally stupid like installing PHP manually, what about when you change your configuration management so that a resource you WERE managing is no longer under management? Our environment here has over 300 resources JUST to install packages. As that number keeps growing, catalog compilation after catalog compilation, it's not going to scale, and we're going to have to take some shortcuts.

When a resource that was managed becomes unmanaged, there is no enforcement mechanism in place to ensure that the previously managed <whatever> is removed, cleaned up, or otherwise dealt with. The question then becomes: how does that now-orphaned resource affect your remaining services?

If I have a bit of code that installs a package and, as a result, makes some other change (installing a dependency, say, or creating or removing a file), and I build my infrastructure with that dependent-but-unmanaged resource in place, it's possible that I become dependent on it. If the resource that caused the dependency is later unmanaged, chances are good that the dependent resource remains, and I never know that anything is different... until I attempt to run the code on a fresh install which never had the original resource. Well, that sucks, huh?
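A hypothetical puppet-flavored sketch of that drift (the package names are made up purely for illustration):

```puppet
# 'myapp' is managed, and installing it happens to pull in 'libfoo'
# as a package dependency.
package { 'myapp':
  ensure => installed,
}

# Months later the myapp resource is deleted from the manifest. Puppet
# stops managing it, but neither myapp nor libfoo is removed from the
# hosts that already have them. Anything that quietly started depending
# on libfoo keeps working - right up until this manifest is applied to
# a fresh install that never saw myapp in the first place.
```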

The fix for this is, of course, a testing environment that includes some sort of functional testing using Docker or Vagrant (or $whatever) to create a fresh environment each and every time. When you've got that going, the only sticking point becomes, "How good are your tests?"

In any event, I've recently been thinking about a sort of regular reinstallation cycle for servers, much like many places have a quarterly reboot cycle, where each quarter, they reboot a third of the machines to ensure that all of the machines will reboot if they need to in an emergency.

What do you think of my observations? Is there a reason to reinstall servers relatively regularly in production? Why or why not?

  • Chris St. Pierre

    I think you're about 95% spot-on, but your discussion of the limitations of CM tools is a little lacking. Several of the more "boutique" CM tools have features to find, detect, and revert or remove local changes. Bcfg2 has robust features for this sort of thing, for instance -- merely by virtue of *not* specifying a package, you can instruct Bcfg2 to remove it (and all other such "extra" packages). It also generally does a good job of finding local changes to files and reverting them when possible (or reporting about them when not), finding and removing extra users, and so on.

    That said, even Bcfg2 leaves portions of the system unmanaged -- e.g., the content of files installed with the RPM %ghost directive -- or cannot revert certain changes -- e.g., local changes to files flagged as %config in RPM. And, of course, if someone dastardly pulls 'chattr +i' out, all bets are off. (Including that person's bodily safety.) So in the end, even an extremely strict bondage-and-discipline CM approach might be able to prolong the time between reinstalls, but it can't obviate that need entirely.

    I think it'd be really interesting to figure out what takes more time and effort -- successfully and elegantly running very strict CM, or automating monthly (e.g.) reinstall. Certainly periodic reinstalls have other gains as well, like highlighting HA pain points. (For instance, if you can't do monthly reinstalls of your NFS servers because it sucks to recover from an NFS failover, maybe it's time to fix that. You'd never be forced to deal with that under even the most brutal CM regime.)

  • pc²

    "regular reinstallation cycle" etc. reminds me of the Immutible Server Pattern. See http://martinfowler.com/bliki/ImmutableServer.html

  • Tyler Jonco

    I make a point to reboot every VM/server/desktop at least once a year to deal with the whole coming-up-in-an-unknown-state issue. Eventually I hope to have things set up so that, from a user perspective, we never have downtime, but in our datacenter we have at least one VM/server/desktop rebooting every two weeks.
