Flashback: Infrastructure Upgrades through Forest Fires

Date March 3, 2010

It’s the end of a long day. You lean back in your chair, sigh, and you’re glad it’s time to go home. Someone asks you what you did all day. You just sort of shake your head and say “fought fires”.

Fire fighting, as a sysadmin, means you don’t make any progress. You only work very hard to stay where you are. Working against entropy is difficult, and it can take a lot out of you. Some days are harder than others.

One day in early June, not long after I started this blog, I experienced a major setback. Also, a major power outage. Our entire backup facility lost power, and what’s worse, the generator refused to kick on. Our secondary site was down hard for days, until the power was restored to the downtown area of the village we were located in.

During the problem, though, we were able to turn a major issue into a net gain. Read on for the rest of the story…


It’s funny, sometimes, how we tolerate suboptimal or downright counterproductive arrangements in our infrastructures, just because it’s inconvenient or inopportune to do it the “right way”. It seems like “the right way” either never comes, because projects get phased out, or it arrives during a cataclysmic upheaval, when the problem has become an immediate concern.

The case in point is my mail server. We have an A and a B MX record. Originally the B MX just stored mail until A came back up, and then the mail would get delivered. Everyone checks mail on A, so it can’t really be down during the day. About 6 months ago, the office that B was in relocated, and B was never set up again. This left us with just A. To make matters worse, A was old enough that it was physically located in our backup site, which used to be our primary site. This was suboptimal. Of course there was talk about moving it to the primary site, but when could a maintenance window be created? And we’d be without mail for the entire time it was being moved. No, management said, let’s just leave it where it is.
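For anyone who hasn’t run a backup MX, here is a rough sketch of the fallback behavior described above, written in Python. It is an illustration under stated assumptions, not our actual setup: the hostnames, the preference values, and the `pick_mail_host` helper are all made up for the example, and it ignores details such as how real mail servers pick among hosts that share a preference.

```python
from typing import Callable, List, Optional, Tuple

# (preference, hostname); a lower preference value is tried first.
MxRecord = Tuple[int, str]


def pick_mail_host(records: List[MxRecord],
                   is_up: Callable[[str], bool]) -> Optional[str]:
    """Return the first reachable host in MX preference order, or None."""
    for _preference, host in sorted(records):
        if is_up(host):
            return host
    # Nothing reachable: a sending server keeps the message queued and retries.
    return None


# Hypothetical records: "A" is preferred, "B" only holds mail while A is down.
records = [(20, "mx-b.example.com"), (10, "mx-a.example.com")]

print(pick_mail_host(records, lambda host: True))
print(pick_mail_host(records, lambda host: host != "mx-a.example.com"))
print(pick_mail_host(records, lambda host: False))
```

With both hosts up the sketch picks A, with A down it falls back to B, and with B gone as well there is nowhere to deliver at all, which is the position a single remaining MX leaves you in.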

Great strategy. It actually worked fine though, until this weekend.

I came in on Saturday, ready to do some major work on the blade systems I’m building for our new site. I sat down at my desk, ready to dive into work. Since I was alone, Raiders of the Lost Ark was playing on the laptop. I had just logged into the first server when the lights went off, and the telltale screech and whine from the server room told me that we’d lost main power.

In Granville, OH, that’s not a strange thing. We’ve got backup AC and a backup generator, so I wasn’t worried. It does have to be manually started, so I jogged into the server room and turned on the CFL floor lamp. At least I tried to. I looked at the generator control panel and it confirmed my fears. No generator power.

I tried for several minutes to start it, but nothing gave me the impression that anything would change, so I called my boss to let him know the situation, and that I was going to start shutting down machines. Since the only critical thing was mail, I suggested that he change DNS to point to an as-yet unassigned IP at the colocation, and that I could set up a Postfix process there to queue the mail. He said that it would work, but he suggested an alternative approach.

Why not relocate the physical mail server to the colocation? A lightbulb went on. Of course. Not only could I take care of that long-standing problem, but because there was no power at all in the datacenter, the normal policy of no-downtime-for-repairs-and-upgrades was out the window.

The next morning, I left work to go home at 5am. The previous 15 hours had been spent completely overhauling the backup datacenter. With the mail server relocated to the primary facility, once the power came back on in the backup, I had free rein to cull all the unnecessary equipment that had been accumulating.

There is now a pile of power, Ethernet, and copper/fiber cables covering a square yard or so, around 6 inches deep. I took out something like 96 ports’ worth of switches, multiple servers, KVMs, fiber switches, and general cruft. The servers are also arranged so that no half-depth servers are hiding between full-depth ones. That was always a pet peeve of mine.

I thought about it while I was doing this, and if fighting normal issues is considered firefighting, then what I went through should be considered forest fire fighting. And just like a forest fire, good can come from it. It takes the massive heat of a forest fire to crack open some pine cones. It also takes massive infrastructure downtime to make significant changes.



3 Responses to “Flashback: Infrastructure Upgrades through Forest Fires”

  1. Anthony said:

    Sometimes a disaster can be a good thing.

    On a not-directly-related note, I’ve often found that even when NOT fighting fires, you can spend an entire day doing ‘stuff’ and at the end of the day not be able to say what it was that you accomplished.

    I used to have a whiteboard where I kept track of all the “Projects” I was working on, and at the end of each day I’d look at it and say “I haven’t done anything towards any of these projects today – so what the hell was I working all day?”

    I’ve found in IT, it’s important to track your time – not for the Finance/HR department benefit because you have to prove you were working all day (though they probably want that anyway) but for your own sense of accomplishment. The nature of the IT job is to be fixing problems and hopefully preventing problems and it’s not uncommon for there to be periods when you don’t have a ‘big project’ that you are working on to get a sense of accomplishment from.

    In these situations it is very important that you record all those stupid little glitches, user questions, monitoring activities and small little improvements you are making so that at the end of the day you can look at your list and say “wow.. I got a lot done today.”

  2. Chris Muncy said:

    Matt,

    I recently did a very similar thing, but it was when Hurricane Ike passed through Houston. I had 4 days to completely strip the server room, reconfigure the racks, replace cables with those of proper length, and even color coordinate them.

    Time well spent, even if it was by battery operated lights.

  3. Matt Simmons said:

    Anthony,
    You’ve got an excellent idea there. A task list of things that need to be taken care of.

    One of the things that I’ve started doing is keeping a wiki page for each of our data sites, and the only contents of that page are things that need done. That way when we’ve got some time, instead of saying “oh, there’s nothing to do”, we know what needs to be done.

    Good call!

    Chris,

    Doesn’t it feel good to be able to turn something negative into a positive? That’s great that you managed to basically get a new server room out of the deal :-) Congrats!
