Kerbal Space System Administration

Date July 28, 2014

I came across an interesting cross-pollination of ideas yesterday while talking to my wife about what I'd been doing lately, and I thought you might enjoy it too.

I've been spending some time lately playing video games. In particular, I'm especially fond of Kerbal Space Program, a space simulation game where you play the role of spaceflight director of Kerbin, a planet populated by small, green, mostly dumb (but courageous) people known as Kerbals.

Initially the game was a pure sandbox, as in, "You're in a planetary system. Here are some parts. Go knock yourself out", but recent additions to the game include a career mode in which you explore the star system and collect "science" points for doing sensor scans, taking surface samples, and so on. It adds a nice "reason" to go do things, and I've been working on building out more efficient ways to collect science and get it back to Kerbin.

Part of the problem is that when you use your sensors, whether they measure gravity, temperature, or the behavior of materials, you often lose a large percentage of the data when you transmit it back, rather than delivering it physically by ship - and delivering things by ship is expensive.

There is an advanced science lab called the MPL-LG-2 which allows greater fidelity in transmitted data, so my recent work in the game has been to build science ships which consist of a "mothership" with a lab, and a smaller lightweight lander craft which can go around whatever body I'm orbiting and collect data to bring to the mothership. It's working pretty well.

At the same time, I'm working on building out a collectd infrastructure that can talk to my graphite installation. It's not as easy as I'd like because we're standardized on Ubuntu Precise, which only has collectd 4.x, and the write_graphite plugin began with collectd 5.1.

To give you some background: collectd is a daemon that collects information, usually from the local machine, though there is an array of plugins for gathering data from any number of local or remote sources. You configure collectd to collect data, and you use a write_* plugin to get that data to somewhere that can do something with it.
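Incidentally, if you want to confirm the version problem on your own Precise box, apt will tell you what it has on offer; the Candidate line should show a 4.x version, which is too old for write_graphite:

$ apt-cache policy collectd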

It was in the middle of explaining these two things - KSP's science missions and collectd - that I saw the amusing parallel between them. In essence, I'm deploying science ships around my infrastructure to make it easier to get science back to my central repository so that I can advance my own technology. I really like how analogous they are.

I talked about doing the collectd work on Twitter, and Martijn Heemels expressed interest in what I was doing, since he would also like write_graphite on Precise, so I figured other people might want to get in on the action, so to speak. I could give you the package I made, or I could show you how I made it. The latter sounds like more fun.

Like all good things, this project involves software from Jordan Sissel - namely fpm, effing package management. Ever had to make packages and deal with spec files, control files, or esoteric rulesets that made you go into therapy? Not anymore!

So first we need to install it, which is easy, because it's a gem:


$ sudo gem install fpm
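Assuming the gem installed cleanly, the fpm command should now be on your path; a quick sanity check doesn't hurt:

$ fpm --version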

Now, let's make a place to stage files before they get packaged:

$ mkdir ~/collectd-package

And grab the source tarball and untar it:

$ wget https://collectd.org/files/collectd-5.4.1.tar.gz
$ tar zxvf collectd-5.4.1.tar.gz
$ cd collectd-5.4.1/

(If you're following along, make sure to go to collectd.org and get the current release, not necessarily the version I have listed here.)

Configure the Makefile, just like you did when you were a kid:

$ ./configure --enable-debug --enable-static --with-perl-bindings="PREFIX=/usr"

Hat tip to Mike Julian, who let me know that you can't enable debugging in the collectd tool unless you compile with this flag, so save yourself some heartbreak by turning it on now. Also, since I'm going to be shipping this around, I want to make sure that it's compiled statically, and for whatever reason, I found that the Perl bindings were sad unless I added that last flag.
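A note on prerequisites: configure won't get far without a compiler and some development headers. Exactly which -dev packages you need depends on the plugins you want built, so treat this as a minimal starting point rather than a definitive list:

$ sudo apt-get install build-essential libltdl-dev libperl-dev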

Now we compile:

$ make

Now we "install":

$ make DESTDIR="/home/YOURUSERNAME/collectd-package" install

I've found that the install script is very grumpy about relative directory names, so I appeased it by giving it the full path to where the files should be dropped (the directory we created earlier).
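It's worth a quick sanity check that the staging area now looks like a collectd tree. The directory names below are what I'd expect from the default prefix; yours may vary a bit:

$ ls ~/collectd-package/opt/collectd
# expect something like: bin  etc  include  lib  man  sbin  share  var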

We're going to be using a slightly customized init script. I took the one that ships with the Precise collectd 4.x package and added a prefix variable that can be changed. Since we didn't change the installation directories above, everything is eventually going to wind up in /opt/collectd/ by default, and the init script needs to know about that:

$ cd ~
$ mkdir -p collectd-package/etc/init.d/
$ wget --no-check-certificate -O collectd-package/etc/init.d/collectd http://bit.ly/1mUaB7G
$ chmod +x collectd-package/etc/init.d/collectd

This pulls the file in from a gist I posted; the bit.ly URL above is just a shortlink to it.
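Note that dropping a script into etc/init.d doesn't register it with the system, so on each machine you install this package on, you'll still want to enable the service (or, if you'd rather bake that into the package itself, fpm has an --after-install option for exactly this kind of thing):

$ sudo update-rc.d collectd defaults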

Now, we're finally ready to create the package:

$ fakeroot fpm -t deb -C collectd-package/ --name collectd \
  --version 5.4.1 --iteration 1 --depends libltdl7 -s dir opt/ usr/ etc/

Since you may not be familiar with fpm, some of the options are obvious, but for the ones that aren't: -C changes directory to the given argument before looking for files, --version is the version of the software, while --iteration is the version of the package itself. If you package this, deploy it, then find a bug in the packaging, you increment the iteration when you rebuild after fixing the problem, and your package management can treat it as an upgrade. The --depends flag names a library that collectd needs on the end systems. -s sets the source type to "directory", and then we give it a list of directories to include (remembering that we've changed directories with the -C flag).

Also, this was my first foray into the world of fakeroot, which you should probably read about if you run Debian-based systems.

At this point, there should be a "collectd_5.4.1-1.deb" in the current directory - a package file suitable for installing with 'dpkg -i', or for putting in a PPA or a repo, if you have one of those.
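If you want to double-check fpm's work before shipping the package around, dpkg-deb will show you the package metadata (including the libltdl7 dependency we declared) and the full file list, and then installing it is the usual affair:

$ dpkg-deb --info collectd_5.4.1-1.deb
$ dpkg-deb --contents collectd_5.4.1-1.deb
$ sudo dpkg -i collectd_5.4.1-1.deb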

Once collectd is installed, you'll probably want to configure it to talk to your graphite host. Just edit the config in /opt/collectd/etc/collectd.conf. Make sure to uncomment the LoadPlugin line for write_graphite, and change the write_graphite section. Here's mine:

<Plugin write_graphite>
  <Node "graphite">
    Host "YOURGRAPHITESERVER"
    Port "2003"
    Protocol "tcp"
    LogSendErrors true
    # remember the trailing period in prefix
    #    otherwise you get CCIS.systems.linuxTHISHOSTNAME
    #    You'll probably want to change it anyway, because
    #    this one is mine. ;-)
    Prefix "CCIS.systems.linux."
    StoreRates true
    AlwaysAppendDS false
    EscapeCharacter "_"
  </Node>
</Plugin>
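Once you've edited the config, collectd can syntax-check it before you restart the daemon, which beats finding out about a typo from a silently dead service; the -t flag just parses the config and exits:

$ sudo /opt/collectd/sbin/collectd -t
$ sudo /etc/init.d/collectd restart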

Anyway, hopefully this helped you in some way. Building a puppet module is left as an exercise for the reader. I think I could do a simplistic one in about 5 minutes, but as soon as you want to intelligently decide which plugins to enable and configure, it gets significantly harder. Hey, knock yourself out! (And let me know if you come up with anything cool!)

Happy SysAdmin Appreciation Day 2014!

Date July 25, 2014

Well, well, well, look what Friday it is!

In 1999, SysAdmin Ted Kekatos decided that, like administrative professionals, teachers, and cabbage, system administrators needed a day to recognize them - to appreciate what they do, their culture, and their impact on the business - and so he created SysAdmin Appreciation Day. And we all reap the benefit of that! Or at least, we can treat ourselves to something nice and have a really great excuse!

Speaking of appreciation, there are several things I know of going on around the world today.

As always, there are a lot of videos released on the topic. Some of the best I've seen are here:

  • A karaoke power-ballad from Spiceworks
  • A very funny piece from Sophos about users copping to some bad behavior
  • A heartfelt thanks from ManageEngine
  • A look at what life would be like without SysAdmins, from SysAid

I should also mention that one of my party sponsors, Goverlan, is giving away their software today to the first thousand people who sign up for it. It's a $900 value, so if you do Active Directory management, you should probably check that out.

Sophos was giving away socks, but they were so popular that they ran out. Not before sending me the whole set, though!

I don't even remember signing up for that, but apparently I did, because they came to my work address. Awesome!

I'm sure there are other things going on, too. Why not comment below if you know of one?

All in all, have a great day and try to find some people to get together with and hang out. Relax, and take it easy. You deserve it.

Goverlan press release on the Boston SysAdmin Day party!

Date July 22, 2014

Thanks again to the folks over at Goverlan for supporting my SysAdmin Appreciation Day party here in Boston! They're so excited, they issued a Press Release on it!

Goverlan Press Release


I'm really looking forward to seeing everyone on Friday. We've got around 60 people registered, and I imagine we'll have more soon. Come have free food and drinks. Register now!


Slackware is 21? Wow, I'm old!

Date July 18, 2014

I saw that Slackware Linux is now officially old enough to drink (in the US, anyway). That's pretty amazing!

Patrick Volkerding sent out the initial announcement that Slackware was being worked on way back in 1993.

I didn't start using it then. Heck, I didn't even have a computer then! My first machine was an IBM Aptiva 486 DX2/66.

A friend gave me a copy of the 1996 version of the InfoMagic Linux Developer's Resource, and Slackware (version 3) was the only distribution I could get working on my machine. At around the same time, another friend's dad found me a copy of Sams Teach Yourself UNIX in 24 Hours. And that was the genesis of my learning Linux.

I ran Slackware continually, on my desktops and servers, until around 2006, when I needed to do more enterprise-y things at work and switched to RedHat-based systems because they could auth against Active Directory. I still have a lot of fondness in my heart for Slack, though.

A while back, I wrote about how instrumental Slackware was to my early knowledge, and from the Hacker News thread, I can see I'm not alone.

If you've never run it before and you'd like a taste, it's easy to get Slackware and install it in a VM. You should do this, especially if you've never installed a Linux distribution outside of a graphical environment. Today, if you install a desktop Linux, the entire process is graphical, but Slack was never like that. The installation itself used curses, but once you rebooted into your fully functioning machine, you were at a text prompt, and you were expected to configure everything you needed from scratch. Ah, the good old days ;-)

"Time Since Install" is the new "Uptime"

Date July 16, 2014

Welcome to the converged future where everything is peachy. We use configuration management to make our servers work exactly as we want, and our infrastructures have change management procedures, test-driven development, and so on. Awesome!


We're well past the date where uptime (as in, the number of hours or days a specific server instance has been running) is a bragging right. We all know implicitly that the longer a server runs, and the more changes we make to it, the less likely it is to start up and return to a known-good state. Right? Because that's a real thing. Uptime of several years isn't just insecure, it's a dumb idea for almost all server OSes - you just don't know what state the machine will be in WHEN it restarts (because eventually, it will restart). I think we can take that as read.

So that leads me to an observation that I had recently... I think that the concept of repeatability and reliability of services following a reboot can be extended to the time since installation. Clearly, configuration management is meant to allow for utterly replicable machines. You're defining exactly what you want the machine to do in code, and then you're applying that configuration from the same code. A leads to B, so you have control.

The other, uglier side of that coin is that modern configuration management solutions are application engines, not enforcement engines. So you can write your, say, puppet code so that a change is applied. But what if you make a change outside of configuration management? An overly simple example might be using puppet to install Apache, then manually installing PHP through the command line.

OK, OK, OK. I know. No one is going to do that. That's stupid. I know. I said it was overly simple. But the truth is, puppet only enforces the resources you specifically tell it about, one way or another - if you don't want PHP installed, you can't just not define it...you must say "ensure => absent" or puppet doesn't care. That's what I mean by enforcement: the dictated configuration is not "this and only this".
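To make that concrete, here's a quick sketch (hypothetical package name, assuming you have puppet available) of the difference between declaring a resource absent and simply not declaring it:

# enforced: puppet actively removes php5 on every run
$ sudo puppet apply -e 'package { "php5": ensure => absent }'

# "unmanaged" is just the absence of a resource: if you delete the
# declaration instead, whatever is already installed quietly stays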

So while you're not going to do something monumentally stupid like installing PHP manually, what about when you change your configuration management so that a resource that you WERE managing is no longer under management? Our environment here has over 300 resources JUST to install packages. Over time, as that number increases, compiling catalogs for all of it isn't going to scale, and we're going to have to take some shortcuts.

When a resource that was managed becomes unmanaged, there is no enforcement mechanism in place to ensure that the previously managed <whatever> is taken care of, removed, or otherwise dealt with. The question then becomes: how does that leftover resource affect your remaining services?

If I have a bit of code that installs a package and, as a result, makes some other change (installing a dependency, say, or creating or removing a file), and I build my infrastructure with that dependent, but unmanaged, resource in place, it's possible that I become dependent on it. If the resource that introduced the dependency is later unmanaged, chances are good that the dependent resource remains, and I never know that anything is different...until I attempt to run the code on a fresh install which never had the original resource. Well, that sucks, huh?

The fix for this is, of course, a testing environment that includes some sort of functional testing using Docker or Vagrant (or $whatever) to create a fresh environment each and every time. When you've got that going, the only sticking point becomes, "How good are your tests?"

In any event, I've recently been thinking about a sort of regular reinstallation cycle for servers, much like many places have a quarterly reboot cycle, where each quarter, they reboot a third of the machines to ensure that all of the machines will reboot if they need to in an emergency.

What do you think of my observations? Is there a reason to reinstall servers relatively regularly in production? Why or why not?

New SysAdmin Party Sponsor: GOVERLAN

Date July 15, 2014

I'm really happy to announce that my SysAdmin Appreciation Day party has another sponsor! This one is GOVERLAN.


GOVERLAN isn't something that I was familiar with until just the other day. It seems like a pretty cool piece of technology that ties into Active Directory to help manage your Windows infrastructure. That's pretty vague, I know, but if you like things like "Active Directory Integration" and managing Windows resources, you should head over to their YouTube channel and watch some videos, which can explain it much better than I can.

Thanks, GOVERLAN, for supporting the SysAdmin Appreciation Day party here in Boston. We appreciate it!

I originally opened 70 tickets, but pulled back 20, since I was having trouble getting sponsors. With this recent news, I'm adding 25 more, so we're at 75 tickets total (of which a lot have been claimed), so grab yours today!
