Pre-emptive Troubleshooting

Date March 10, 2009

Troubleshooting is a very reactive process. By its nature, you're fixing an already existing problem. As good as it is to be able to troubleshoot, it's better to prevent weird problems from cropping up in the first place.

As sysadmins, we have numerous ways to do this. The first, as Michael Jenke very sensibly suggests, is structured systems management: administering via script instead of editing files by hand or, worse yet, clicking through the interface.

Another very potent tool, which Chris Siebenmann brings up today, is using checklists to perform complex tasks. (Incidentally, Chris mentions a term I'd never heard before: "rubber duck debugging". I think I'm going to try to expense one of these for troubleshooting purposes.)

I'm a firm believer in checklists for anything that isn't (or can't be) automated. On our internal wiki, I have checklists for things like adding and removing users from the infrastructure, adding new machines, and so on. They're great: I don't have to "remember" everything that needs to be done. I can just work through the list and know nothing was skipped. And if I find I've left something off the list, I add it, and the list gets more accurate.
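A wiki page works fine for this, but the same idea is easy to sketch in code. Here's a minimal Python example, assuming a checklist is nothing more than an ordered list of step names plus a record of which ones are done. The `Checklist` class and the steps in it are made up for illustration; they aren't what's actually on our wiki.

```python
from dataclasses import dataclass, field


@dataclass
class Checklist:
    """An ordered list of manual steps, each either done or not done."""
    name: str
    steps: list[str] = field(default_factory=list)
    done: set[str] = field(default_factory=set)

    def add_step(self, step: str) -> None:
        # Add a step that turned out to be missing; skip exact duplicates.
        if step not in self.steps:
            self.steps.append(step)

    def complete(self, step: str) -> None:
        # Refuse to tick off anything that is not on the list.
        if step not in self.steps:
            raise ValueError(f"not on the checklist: {step}")
        self.done.add(step)

    def remaining(self) -> list[str]:
        # Steps still to do, in their original order.
        return [s for s in self.steps if s not in self.done]


# Hypothetical steps, for illustration only.
new_user = Checklist("add user", ["create account", "add to groups", "set up mail"])
new_user.complete("create account")
new_user.add_step("add to wiki")  # something we forgot the first time around
print(new_user.remaining())
```

The part that matters is `add_step`: when a step turns out to be missing, it goes onto the list itself, so the next run doesn't depend on anyone's memory.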

I wasn't always such a checklist person. It took a while for their usefulness to sink in. Tom Limoncelli does a great job of explaining why in his blog post, "Transforming an art into a science": back in the bad old days, planes kept crashing until pilots started using pre-takeoff checklists. Similarly, Boston.com has an article about doctors using checklists, which resulted in 36% fewer complications and deaths in the operating room.

Checklists take complex, fun tasks and make them boring.

  • Matt

    wow, I slaughtered that. Sorry Chris! It's fixed.

  • Ben C

    So if I buy a pirate duck and use it to troubleshoot, does that make me an idea pirate? Arrrrr!

  • Matt

    *groan*

    Walk the plank, ye scurvy cur!

  • AJ

    Matt, this is a great post. I do use checklists, but the way I use them is still inefficient. I usually throw a whole bunch of things I need to do for a particular job on a sheet of loose leaf paper, cross them off as I go, and usually work on them in order of importance, but I am still missing a step prior to all of this. Mainly, I end up with a whole bunch of different loose leaf sheets all over the place from being overwhelmed with jobs. Then I start working on one, then work on another, and somewhere along the way I end up losing a sheet or not getting to something important on it.

    What I need to do, and will hopefully start today if I am smart, is create a "root" checklist, then make a "sub-root" checklist for all of the junk thereafter.