July 16, 2014
Welcome to the converged future where everything is peachy. We use configuration management to make our servers work exactly as we want, our infrastructures have change management procedures, test-driven design, and so on. Awesome!
We're well past the date where uptime (as in, the number of hours or days a specific server instance has been running) is a bragging right. We all know implicitly that the longer a server is running, and the more changes that we make to it, the less likely it is to start up and return to a known-good state. Right? Because that's a real thing. Uptime of several years isn't just insecure, it's a dumb idea for almost all server OSes - you just don't know what state the machine will be in WHEN it restarts (because eventually, it will restart). I think we can take that as read.
So that leads me to an observation that I had recently... I think that the concept of repeatability and reliability of services following a reboot can be extended to the time since installation. Clearly, configuration management is meant to allow for utterly replicable machines. You're defining exactly what you want the machine to do in code, and then you're applying that configuration from the same code. A leads to B, so you have control.
The other, uglier side of that coin is that modern configuration management solutions are application engines, not enforcement engines. So, you can write your, say, Puppet code, so that a change is applied. But what if you make a change outside of configuration management? An overly simple example might be using Puppet to install Apache, then manually installing PHP through the command line.
OK, OK, OK. I know. No one is going to do that. That's stupid. I know. I said it was overly simple. But the truth is, unless you tell Puppet specifically to enforce a resource one way or another, it doesn't care. If you don't want PHP installed, you can't just not define it...you must say "ensure => absent". That's what I mean by enforcement. The dictated configuration is not "this and only this".
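To make the "apply, don't enforce" point concrete, here's a toy model in Python. This is not how Puppet is implemented; the function name, the package names, and the "present"/"absent" strings are all my own stand-ins for illustration. It only shows the behaviour I'm describing: anything you don't declare is left alone.

```python
def apply_catalog(installed, catalog):
    """Toy model of an 'apply, don't enforce' run (hypothetical helper).

    installed: set of package names currently on the machine.
    catalog:   dict mapping package name -> "present" or "absent".
    Packages that are not in the catalog are left exactly as they are.
    """
    result = set(installed)
    for name, ensure in catalog.items():
        if ensure == "present":
            result.add(name)
        elif ensure == "absent":
            result.discard(name)
        else:
            raise ValueError(f"unknown ensure value: {ensure!r}")
    return result


# Apache came from the catalog; PHP was installed by hand.
machine = {"httpd", "php"}

# Leaving PHP out of the catalog does not remove it...
print(sorted(apply_catalog(machine, {"httpd": "present"})))
# ...only an explicit "absent" does.
print(sorted(apply_catalog(machine, {"httpd": "present", "php": "absent"})))
```

The first call still reports PHP as installed; only the second, with the explicit "absent", gets rid of it.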
So while you're not going to do something monumentally stupid like installing PHP manually, what about when you change your configuration management so that a resource that you WERE managing is no longer under management? Our environment here has over 300 resources JUST to install packages. As that number increases over time, catalog compilation isn't going to scale, and we're going to have to take some shortcuts.
When a resource that was managed becomes unmanaged, there is no enforcement mechanism in place to ensure that the previously managed <whatever> is taken care of, removed, or dealt with whatsoever. The question then becomes, how does that previous resource affect your remaining services?
If I have a bit of code that installs a package, and as a result, makes some other change (either installing a dependency, or creating or removing a file), and I build my infrastructure with that dependent, but unmanaged, resource existing, it's possible that I become dependent on it. If the resource that caused the dependency to happen later becomes unmanaged, chances are good that the dependent resource remains, and I never know that anything is different...until I attempt to run the code on a fresh install which never had the original resource. Well, that sucks, huh?
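One cheap way to spot this kind of leftover is to compare the package list of a long-lived machine against a freshly built one running the same code. Here's a minimal sketch, assuming you've already collected both lists as plain names (however you like); the function name and the sample packages are made up for the example.

```python
def hidden_leftovers(long_lived, fresh):
    """Packages found on the long-lived host but not on a fresh build.

    Both arguments are iterables of package names. Anything returned here
    is something the current code does not produce on its own, so a
    service may be quietly depending on it.
    """
    return sorted(set(long_lived) - set(fresh))


# Hypothetical data: 'php' and 'libfoo' were pulled in by a resource
# that is no longer managed.
long_lived = ["httpd", "libfoo", "openssl", "php"]
fresh = ["httpd", "openssl"]

print(hidden_leftovers(long_lived, fresh))
```

An empty result doesn't prove the two machines are equivalent (files and config can drift too), but a non-empty one is a good list of things to go ask questions about.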
The fix for this is, of course, a testing environment that includes some sort of functional testing using Docker or Vagrant (or $whatever) to create a fresh environment each and every time. When you've got that going, the only sticking point becomes, "How good are your tests?"
In any event, I've recently been thinking about a sort of regular reinstallation cycle for servers, much like many places have a quarterly reboot cycle, where each quarter, they reboot a third of the machines to ensure that all of the machines will reboot if they need to in an emergency.
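The same "a third per quarter" rotation could be applied to reinstalls. As a rough sketch (the host names and the function are hypothetical, and a real schedule would want to avoid taking out every member of a cluster at once), you could pick the cohort like this:

```python
def cohort_for_quarter(hosts, quarter, cohorts=3):
    """Pick the hosts due for a reinstall in a given quarter (1, 2, 3, ...).

    Hosts are sorted so the split is stable between runs, then dealt out
    round-robin into `cohorts` groups; the quarter number selects a group
    and wraps around once every group has had its turn.
    """
    ordered = sorted(hosts)
    due = (quarter - 1) % cohorts
    return [h for i, h in enumerate(ordered) if i % cohorts == due]


hosts = ["web1", "web2", "web3", "db1", "db2", "db3"]
for quarter in range(1, 4):
    print(quarter, cohort_for_quarter(hosts, quarter))
```

After three quarters every machine has been rebuilt once, and the fourth quarter starts over with the first group.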
What do you think of my observations? Is there a reason to reinstall servers relatively regularly in production? Why or why not?