June 23, 2008
I briefly touched on having a monitoring system for your network, but that's only half the battle. The other half is getting the alert and taking action on it.
The way my system is configured, I have Nagios monitoring a couple of hundred resources around the network, from pings to disk space to server room temperature. In the event that something goes wrong, an alert is generated, and an email is sent out. On the mail server, I have a rule specified to forward that email to my BlackBerry's email address (since we don't have a BlackBerry Enterprise Server). My phone then rings to let me know I've got mail. Depending on the severity, members of the operations or management team are notified as well, and my Gmail is also set to get alerts. The overall idea is that I get notified regardless of where I am. I've even toyed with the idea of an AIM bot that connects and sends me messages if I'm online.
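To make the severity-based routing a little more concrete, here is a minimal Python sketch of the idea. It is not my actual Nagios or mail server configuration. The addresses, the severity names, and which severity pulls in which team are all assumptions made up for illustration, as are the helper names `recipients_for` and `build_alert`.

```python
from email.message import EmailMessage

# Hypothetical addresses, for illustration only.
ADMIN_ADDRESSES = ["admin-phone@example.com", "admin-webmail@example.com"]
OPERATIONS_TEAM = ["ops@example.com"]
MANAGEMENT_TEAM = ["management@example.com"]

# Assumed mapping: the administrator always gets the alert, and the
# other teams are added only for the more serious level.
SEVERITY_RECIPIENTS = {
    "WARNING": ADMIN_ADDRESSES,
    "CRITICAL": ADMIN_ADDRESSES + OPERATIONS_TEAM + MANAGEMENT_TEAM,
}


def recipients_for(severity):
    """Return the addresses to notify for a severity level.

    Unknown severities fall back to the administrator only.
    """
    return list(SEVERITY_RECIPIENTS.get(severity.upper(), ADMIN_ADDRESSES))


def build_alert(severity, resource, detail, sender="nagios@example.com"):
    """Build (but do not send) an alert email for the given resource."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients_for(severity))
    msg["Subject"] = f"[{severity.upper()}] {resource}"
    msg.set_content(detail)
    return msg
```

Calling `build_alert("critical", "server room temperature", "Temperature is above the threshold")` would produce a message addressed to the administrator plus both teams. Actually delivering it (for example with `smtplib`) is left out, since that depends entirely on the local mail setup.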
It's definitely a trade-off. Having this alert system makes me a lot more confident that the network is up and running, but it comes at the expense of my personal life. It's unfortunate, but since I'm really the only administrator, I bear the responsibility of making sure it's working. That's what they pay me for, and I don't get paid by the hour. It's much nicer if you have another person available with whom you share on-call duties. Until you get to that point, you have an even greater vested interest in making the network reliable and fault-tolerant.