Devotion to Duty (xkcd)

Date February 22, 2010

I get the feeling that this will be making the sysadmin-blog rounds :-)

Today's XKCD is excellent, and already has a huge following among the sysadmins on Twitter.



The alt-text is:

The weird sense of duty really good sysadmins have can border on the sociopathic, but it's nice to know that it stands between the forces of darkness and your cat blog's servers.

He's right, of course. Sysadmins in general can develop a hero complex.

It's a complex topic, but the smartest people in systems administration today (read: not me) have been vocal that sysadmin heroism should be discouraged. I can agree with that, to a point. We should never rely on heroism to save the day, because that means our designs have failed. When we stop believing in miracles and start relying on them, we have made bad design decisions and the reliability of our network will suffer.

On the other hand, sometimes events happen that are beyond our control, and it's up to us to make things right. In those cases, there's no rule or mandate that says "you - sysadmin: go above and beyond the call of duty and be a hero!" I think it's more our mental alignment that says "It's my job to make sure that things work. In order to make things work, I've got to climb on top of the roof in the middle of a blizzard and restart the generator" (something my boss has done multiple times, and I'm sure some of you have as well). It's just the way it works. We think logically: if the job needs doing, and it's our job to do it, then we need to do the job. The peripheral variables are unimportant.

I think the comic is hilarious, but like most Mission Impossible / Jack Bauer / Die Hard scenarios, it's a rare event. Don't go take ju-jitsu just in case someone cuts your network cables. Have a redundant infrastructure so that it doesn't matter if they get cut.

  • http://itsunixnoteunuchs.com Daniel J. Doughty

    Yeah, the hero complex can be a problem and it comes out of people you might never expect it from. Normally quiet and sociable fellows are suddenly stomping on others to explain how they've been doing something for 48 hours straight. Yet when things calm down, they never quite manage to fix the design that led them to the horrible outage. Why? Because they loved the attention so much that they'd like to receive the attention again.

    But it can also get a bit goofy because sometimes the executives will praise the "heroic" efforts and then never mention sys admins for another 7 years. So the message sent can be a confusing one.

  • rpetre

    I totally agree. I have always been the guy to whom problems come to be solved, with very few avenues to pass these problems on to somebody else. Ergo, I'm the ultimate problem solver. Combined with sufficient laziness, this makes for quite a few self-esteem-building situations (I went through quite a few http://xkcd.com/208/ situations), but also for some horrific burn-outs, when things refuse to stay under your control.

    But what can I say, that's why I love this job.

  • AJ L.

    Well said sir! In total agreement here.

  • John M

    While the praise and recognition are all well and fine, emphasis should be (and for the most part is) put on the issue of reliable systems.

    We cannot expect to be prepared for everything that may happen (asteroids from space crashing into my switches is my favorite..), but we should have the ability to resolve what we can in the most efficient way possible.

    Something that is not mentioned, and that 'Heroes' never seem to do, is an incident resolution or after action report. This is an essential exercise to find the root cause, and create a plan to remediate the issue (Laser Cannons and armored switch cabinets to start... :-) ).

  • http://itsunixnoteunuchs.com Daniel J. Doughty

    @John M

    Actually, I've seen AARs (After Action Reports) and IRs (Incident Reports) for the last 11 years in IT, but they never seem to do any good. The terms come from a military perspective, where you tend to have massive personnel redundancy and a large work force; I first heard them from Light Infantry mentors in 1987. Sadly, IT never has either of these things.

    I think design is really where we can do the most good given that there is no considerable influx of new talent to the IT field and constant pressure to drive down cost.

    But I may just be acting like a stick in the mud because I hate meetings.

  • John M

    @Daniel

    I work in a pharmaceutical network environment, and we do formal and informal IRs here, because our work is critical for the business.

    While design is a critical component, we have to deal with corporate overlords who sometimes don't (or will not) take into consideration what our environment requires.

    When we have an incident here (which happens too often) we are required to find the root cause, and try to keep it from happening again, if at all possible.

  • http://dannyman.toldme.com/ Daniel Howard

    I know a guy who used to pose an interview question in which a SysAdmin has to get into the locked network closet RIGHT NOW without the key: are they motivated enough to take the nearby ax and break the door down to get the systems back online? Because that is the kind of person you want guarding your uptime.

    Once, while discussing this novel interview technique, someone pointed out that you should never be in a position where you cannot gain emergency access to a network closet, and that a situation where this can come to pass is likely indicative of substantially larger problems that you cannot control, and the correct resolution would be to resign.

  • http://www.standalone-sysadmin.com Matt Simmons

    I think I could see both sides, but personally, the door is going down. You can resign after leaving the door in toothpicks. Plus, great story.

  • http://drwho.virtadpt.net/ The Doctor

    @Daniel J. Doughty: Not always.

    Fixing a bad design that caused a major outage requires even more downtime. After getting burned by a production network going ker-flooey once, management isn't necessarily all that excited about scheduled downtime to make sure it never happens again. They tend to be more interested in cleaning the egg off of the company's face and getting us in the data center to keep what currently exists running than they are in letting us fix what blew up in the first place.

    Been there, done that.

  • John M

    The real issue here is the perspective from which the bad design (be it network or process) is being viewed.

    The Business is not going to want any more downtime.

    The IT department wants to resolve the issue, and put a process in place to remedy the incident, even if it means downtime now for resolution.

    We in IT see this as a logical process for solving problems, and business does not. That is why they hire us, to (hopefully, asteroids notwithstanding :-) ) anticipate these problems, and work out resolutions to the Business' benefit.

    I agree with Matt... the door is going down. The issue of why access was denied in the first place can be resolved later.

  • Dr. Kenneth Noisewater

    More often than not, I've found that proper design ranks quite low on the list of judging criteria among the non-technical managers who oversee and fund tech projects. So, AARs end up being 'venting' documents for frustrated admins who know how things _should've_ been done but who were ignored for any number of reasons, of variable stupidity. So, over time, those AARs go away.

    That's organizational rot, and in this economy, there are plenty of admins who will put up with it to pay the rent. Myself included, alas.

  • John_M

    While I agree with your post, I have the blessing (or curse... :-) ) of working in an environment that supports and expects those types of reports, and for the most part, integrates them into processes.

    I have also worked in the type of environment that you described, and frankly, it's scary how much organizational rot is out there.

  • Daniel J. Doughty

    @Dr. Kenneth Noisewater

    You are dead on right. Just announcing that you will have an AAR does not mean that your organization has the desire to enact the changes needed to prevent the incident from happening again. Much of the time, the outages did not produce public egg on the face. I would actually commiserate with the organization if that were why they didn't want to fix the problem. More often it was just that the management above the geeks didn't understand why things should be fixed.