Did you remember to set your clocks back?

Date November 1, 2009

I did. Which didn’t do me a lot of good at 1am this morning when all 80ish of the servers I’ve got in Nagios decided to cry out at once.

A single nagios alert is enough to make me wake up and read the message. 80 nagios alerts in the span of 30 seconds is enough to make me jump out of bed, run into the living room, and be awake enough to think “how can anything be this badly broken yet still able to send messages???”

As it turns out, I am my own worst enemy. All of the check_time plugins I saw relied on the “time” service being enabled on the servers. This seems needless to me, when I’ve got a perfectly good net-snmpd server running and reporting the date as well as a ton of other stuff. So I wrote my own plugin. It has functioned very well; however, it appears likely that I wrote it sometime this summer, because it had a bug that only a clock change could expose.

It’s always something small or insignificant. Like a decimal place. Or accidentally lopping off the remote time zone specification. Imagine that. Normally, it doesn’t matter. 363 days a year, the assumed timezone on the remote server is identical to the actual timezone on the remote server.
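For anyone wondering what “keeping the remote time zone specification” might look like, here is a rough sketch. It assumes the date comes from something like HOST-RESOURCES-MIB’s hrSystemDate, which net-snmp prints in the DateAndTime text form (for example `2009-11-1,1:30:0.0,-5:0`). The function name is made up for illustration, and this is not the actual plugin code.

```python
import re
from datetime import datetime, timedelta, timezone

# Textual DateAndTime form, e.g. "2009-11-1,1:30:0.0,-5:0".
# The trailing ",<sign><hours>:<minutes>" UTC offset is optional in the format.
_PATTERN = re.compile(
    r"^(\d+)-(\d+)-(\d+),(\d+):(\d+):(\d+)\.(\d+)"
    r"(?:,([+-])(\d+):(\d+))?$"
)


def parse_date_and_time(text):
    """Parse a DateAndTime string into a timezone-aware datetime.

    Hypothetical helper: it refuses to guess when the offset is missing,
    because guessing is exactly what goes wrong on a clock-change night.
    """
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError("unrecognised DateAndTime string: %r" % text)

    groups = match.groups()
    year, month, day, hour, minute, second, deci = (int(g) for g in groups[:7])
    sign, off_hours, off_minutes = groups[7:]
    if sign is None:
        raise ValueError("no UTC offset in %r; refusing to assume one" % text)

    offset = timedelta(hours=int(off_hours), minutes=int(off_minutes))
    if sign == "-":
        offset = -offset

    # Deci-seconds (0-9) become microseconds.
    return datetime(year, month, day, hour, minute, second,
                    deci * 100000, tzinfo=timezone(offset))


# The same wall-clock reading, reported with two different offsets.
est = parse_date_and_time("2009-11-1,1:30:0.0,-5:0")
edt = parse_date_and_time("2009-11-1,1:30:0.0,-4:0")
```

With the offset kept, the two readings are distinct instants and can be compared against a reference clock without assuming anything about the remote zone.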

Apparently GNU date decides that unless you give it a specific timezone, it defaults to the first timezone of the day. Today at 1am, the actual timezone of the day was different from the first timezone of the day. And suddenly every machine looked like it was 3600 seconds off from the internal time servers. And so, Nagios sent an alert. For every machine. Ouch.
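To put a number on it, here is a small sketch of the ambiguity. It assumes US Eastern offsets (UTC-4 before the change, UTC-5 after) purely as an example, uses fixed offsets rather than a real timezone database, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: US Eastern offsets, used only for illustration.
EDT = timezone(timedelta(hours=-4), "EDT")  # offset at the start of the day
EST = timezone(timedelta(hours=-5), "EST")  # offset after clocks go back


def ambiguity_seconds(wall_clock):
    """Seconds between the two possible readings of one wall-clock time."""
    naive = datetime.strptime(wall_clock, "%Y-%m-%d %H:%M:%S")
    as_first_zone = naive.replace(tzinfo=EDT)   # what a naive guess assumes
    as_actual_zone = naive.replace(tzinfo=EST)  # what the server really meant
    return (as_actual_zone - as_first_zone).total_seconds()


skew = ambiguity_seconds("2009-11-01 01:30:00")
print(skew)  # 3600.0
```

A timestamp with no offset attached is off by a full hour if it is read in the wrong zone, which is far more than any sane warning threshold.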

So this morning at 1am, I was debugging and patching my check_date plugins in order to stop the bleeding. I thought I had gotten my pre-conference pain over with, but I guess not. As it turns out, Saint Aardvark the Carpeted also had some pre-conference fun.



3 Responses to “Did you remember to set your clocks back?”

  1. Preston de Guise said:

    You can be thankful you weren’t driving – search for “Ukraine” on this Risks Digest article to see what I mean: Risks 20.16.

  2. jimb said:

    check_ntp_time


    [root@sql1 ~]# cat /etc/nrpe.d/mycorp.cfg
    allowed_hosts=10.10.24.66
    command[check_disk_root]=/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /dev/mapper/VolGroup00-LogVol00
    command[check_disk_san]=/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /dev/mapper/mpath1
    command[check_ntp_time]=/usr/lib64/nagios/plugins/check_ntp_time -H -w 0.1 -c 0.2

    Did you really think this basic a wheel needed to be re-invented?

  3. Ryan said:

    “A single nagios alert is enough to make me wake up and read the message. 80 nagios alerts in the span of 30 seconds is enough to make me jump out of bed, run into the living room, and be awake enough to think “how can anything be this badly broken yet still able to send messages???””

    I had one of these moments a couple days ago at about 4:00 am, not a good thing on a computational cluster :-P
