
Nagios Config Howto Followup

One of the most widely read stories I’ve ever posted on this blog is my Nagios Configuration HowTo, where I explained how I set up my Nagios config at a former employer. I still think that it’s a good layout to use if you’re manually building Nagios configs. In my current position, we have a small manual setup and a humongous automated monitoring setup. We’re moving toward a completely automated monitoring config using Puppet and Nagios, but until everything is puppetized, some of it needs to be hand-crafted, bespoke monitoring.

For people who don’t have a single ‘source of truth’ in their infrastructure that they can draw monitoring config from, hand-crafting is still the way to go, and if you’re going to do it, you might as well not drive yourself insane. For that, you need to take advantage of the layers of abstraction in Nagios and the built-in object inheritance that it offers.

Every once in a while, new content gets posted that refers back to my Config HowTo, and I get a bump in visits, which is cool. Occasionally, I’ll get someone who is interested and asks questions, which is what happened in this thread on Reddit. /u/sfrazer pointed to my config as something that he references when making Nagios configs (Thanks!), and the original submitter replied:

I’ve read that write up a couple of times. My configuration of Nagios doesn’t have an objects, this is what it looks like

[screenshot of his Nagios config]

And to understand what you are saying, just by putting them in the file structure you have in your HowTo that will create an inheritance?

I wanted to help him understand how Nagios inheritance works, so I wrote a relatively long response, and I thought that it might also help other people who still need to do this kind of thing:

No, the directories are just to help remember what is what, and so you don’t have a single directory with hundreds of files.
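(For completeness: Nagios only reads those directories because nagios.cfg points at them. Mine has something along these lines, and cfg_dir picks up every .cfg file in that directory and its subdirectories; adjust the path for your install.)

# in /usr/local/nagios/etc/nagios.cfg
cfg_dir=/usr/local/nagios/etc/objects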

What creates the inheritance is this:

You start out with a host template:

msimmons@nagios:/usr/local/nagios/etc/objects$ cat generic-host.cfg
define host {
    name generic-host
    notifications_enabled   1
    event_handler_enabled   1
    flap_detection_enabled  1
    failure_prediction_enabled  1
    process_perf_data   1
    retain_status_information   1
    retain_nonstatus_information    1
    max_check_attempts 3
    notification_period 24x7
    contact_groups systems
    check_command check-host-alive
    register 0
}
# EOF

So, what you can see there is that I have a host named “generic-host” with a bunch of settings, and “register 0”. The reason I have this is that I don’t want to have to set all of those settings for every other host I make. That’s WAY too much redundancy. Those settings will almost never change (and if we do have a specific host that needs to have the setting changed, we can do it on that host).

Once we have generic-host, let’s make a ‘generic-linux’ host that we can have the Linux machines use:

msimmons@monitoring:/usr/local/nagios/etc/objects/linux$ cat generic-linux.cfg 
define host { 
    name     linux-server
    use generic-host
    check_period    24x7
    check_interval  5
    retry_interval  1
    max_check_attempts  5
    check_command   check-host-alive
    notification_interval 1440
    contact_groups  systems
    hostgroups linux-servers
    register 0
}

define hostgroup {
    hostgroup_name linux-servers
    alias Linux Servers
}
# EOF

Alright, so you see we have two things there. A host template named ‘linux-server’, and you can see that it inherits from ‘generic-host’. I then set some of the settings specific to the environment I’m monitoring (for instance, you probably don’t want notification_interval 1440, because that’s WAY too long for most people – a whole day would go by between Nagios notifications!). The point is that I set a bunch of default host settings in ‘generic-host’, then did more specific things in ‘linux-server’, which inherited the settings from ‘generic-host’. And we made it ‘register 0’, which means it’s not a “real” host, it’s a template. Also, and this is important, you’ll see that we set ‘hostgroups linux-servers’. This means that any host we make that inherits from ‘linux-server’ will automatically be added to the ‘linux-servers’ hostgroup.

Right below that, we create the linux-servers hostgroup. We aren’t listing any machines. We’re creating an empty hostgroup, because remember, everything that inherits from linux-servers will automatically become a member of this group.

Alright, you’ll notice that we don’t have any “real” hosts yet. We’re not going to add any yet, either. Let’s do some services first.

msimmons@monitoring:/usr/local/nagios/etc/objects$ cat check-ssh.cfg
define command{
   command_name   check_ssh
   command_line   $USER1$/check_ssh $ARG1$ $HOSTADDRESS$
   }
# EOF

This is a short file which creates a command called “check_ssh”. This isn’t specific to Linux or anything else; it could be used by anything that needs to verify that SSH is running. Now, let’s build a service that uses it:

msimmons@monitoring:/usr/local/nagios/etc/objects/services$ cat generic-service.cfg 
define service{
        name                            generic-service
        active_checks_enabled           1
        passive_checks_enabled          1
        parallelize_check               1
        obsess_over_service             1
        check_freshness                 0
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          1
        failure_prediction_enabled      1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           10
        retry_check_interval            2
        contact_groups                  systems
        notification_options            w,u,c,r
        notification_interval           1440
        notification_period             24x7
        register                        0
}
# EOF

This is just a generic service template with sane settings for my environment. Again, you’ll want to use something good for yours. Now, something that will inherit from generic-service:

msimmons@monitoring:/usr/local/nagios/etc/objects/linux$ cat linux-ssh.cfg
define service { 
    use generic-service
    service_description Linux SSH Enabled
    hostgroup_name linux-servers
    check_command check_ssh 
}
# EOF

Now we have a service “Linux SSH Enabled”. This uses check_ssh, and (importantly), ‘hostgroup_name linux-servers’ means “Every machine that is a member of the hostgroup ‘linux-servers’ automatically gets this service check”.

Let’s do the same thing with ping:

define command{
        command_name    check_ping
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5
}

define service {
    use generic-service
    service_description Linux Ping
    hostgroup_name linux-servers
    check_command check_ping!3000.0,80%!5000.0,100%
}

Sweet. (If you’re wondering about the exclamation marks on the check_ping line in the Linux Ping service: they separate the arguments we’re passing to the command, which become $ARG1$ and $ARG2$ and set the warning and critical thresholds.)
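(To make that concrete: assuming $USER1$ points at the stock /usr/local/nagios/libexec plugin directory, and using a made-up host address, that service check expands to roughly this command line:)

/usr/local/nagios/libexec/check_ping -H 192.168.1.50 -w 3000.0,80% -c 5000.0,100% -p 5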

Now, let’s add our first host:

msimmons@monitoring:/usr/local/nagios/etc/objects/linux$ cat mylinuxserver.mycompany.com.cfg 
define host{
       use linux-server
       host_name myLinuxServer.mycompany.com
       address my.ip.address.here
}

That’s it! I set the host name, I set the IP address, and I say “use linux-server” so that it automatically gets all of the “linux-server” settings, including belonging to the linux host group, which makes sure that it automatically gets assigned all of the Linux service checks. Ta-Da!
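One last habit worth mentioning: before reloading Nagios to pick up new objects, run the built-in config check so a typo doesn’t take your monitoring down. The path here assumes the same /usr/local/nagios layout as the prompts above:

$ /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

If that comes back clean, reload Nagios and the new host (and all of its inherited service checks) will show up.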

Hopefully this can help people see the value in arranging configs like this. If you have any questions, please let me know via comments. I’ll be happy to explain! Thanks!


Monitoring (old) Zimbra

It’s September, and in universities, that means tons of new people. New staff, new faculty, new students. Lots and lots of new people.

Here at The College of Computer and Information Science at Northeastern University, we’ve got a banner crop of incoming CS students. So many, in fact, that we bumped up against one of those things that we don’t think about a lot. Email licenses.

Every year, we pay for a lot of licenses. We’ve never monitored the number of seats used versus the number bought, but we buy many thousands of them. Well, we ran out last week. Oops.

After calling our reseller, who hooked us up with a temporary emergency bump, we made it through the day until we could buy more. I decided that it was time to start monitoring that sort of thing, so I started working on learning the Zimbra back-end.

Before you follow along with anything in this article, you should know – my version of Zimbra is old. Like, antique.

Zimbra was very cool about this and issued us some emergency licenses so that we could do what we needed until our new license block purchase went through. Thanks Zimbra!

In light of the whole “running out of licenses” surprise, I decided that the first thing I should start monitoring is license usage. In fact, I instrumented it so well that I can pinpoint the exact moment that we went over the number of emergency licenses we got:

[Graph: CCIS mail accounts]

Cool, right?

Well, except for the whole “now we’re out of licenses” again thing. Sigh.

I mentioned a while back that I was going to be concentrating on instrumenting my infrastructure this year, and although I got a late start, it’s going reasonably well. In that blog entry, I linked to a GitHub repo where I built a Vagrant-based Graphite installation. I used that work as the basis for the work I did when creating a production Graphite installation, using the echocat graphite module.

After getting Graphite up and running, I started gathering metrics in an automated fashion from the rest of the puppetized infrastructure using the pdxcat CollectD puppet module, and I wrote a little bit about how similar that was with my Kerbal Space Administration blog entry.

But my Zimbra install is old. Really old, and the server it’s on isn’t puppetized, and I don’t even want to think about compiling collectd on the version of Ubuntu this machine runs. So I was going to need something else.

As it turns out, I’ve been working in Python for a little while, and I’d written a relatively short program that works either as a standalone command for sending a single metric to Carbon, or as a library if you need to send a lot of metrics at a time. I’m sure there are probably a dozen tools that do this, but it was relatively easy, so I just figured I’d make my own. You can check it out on GitHub if you’re interested.
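If you’d rather not use my script, the guts of it are just Carbon’s plaintext protocol: one “metric.path value timestamp” line per metric, over TCP, port 2003 by default. A bare-bones sketch (the Graphite hostname is obviously a placeholder) looks something like this:

#!/usr/bin/python
# Minimal Carbon plaintext-protocol sender: "metric.path value timestamp\n" over TCP.
# The server name below is a placeholder; 2003 is Carbon's default plaintext port.
import socket
import time

def send_metric(path, value, server='graphite.example.com', port=2003):
    sock = socket.create_connection((server, port))
    sock.sendall("%s %s %d\n" % (path, value, int(time.time())))
    sock.close()

send_metric('MY.GRAPHITE.BASE.kbytes', 4215)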

So that’s the script I’m using, but a script needs data. If you log in to the Zimbra admin interface (which I try not to do, because it requires Firefox in the old version we’re using), you can actually see most of the stats you’re interested in. It’s possible to scrape that page and get the information, but it’s much nicer to get to the source data itself. Fortunately, Zimbra makes that (relatively) easy:

In the Zimbra home directory (/opt/zimbra in my case), there is a “zmstat/” subdirectory, and in there you’ll find a BUNCH of directories with dates as names, and some CSV files:


... snip ...
drwxr-x--- 2 zimbra zimbra 4096 2014-09-04 00:00 2014-09-03/
drwxr-x--- 2 zimbra zimbra 4096 2014-09-05 00:00 2014-09-04/
drwxr-x--- 2 zimbra zimbra 4096 2014-09-06 00:00 2014-09-05/
-rw-r----- 1 zimbra zimbra 499471 2014-09-06 20:11 cpu.csv
-rw-r----- 1 zimbra zimbra 63018 2014-09-06 20:11 fd.csv
-rw-r----- 1 zimbra zimbra 726108 2014-09-06 20:12 imap.csv
-rw-r----- 1 zimbra zimbra 142226 2014-09-06 20:11 io.csv
-rw-r----- 1 zimbra zimbra 278966 2014-09-06 20:11 io-x.csv
-rw-r----- 1 zimbra zimbra 406240 2014-09-06 20:12 mailboxd.csv
-rw-r----- 1 zimbra zimbra 72780 2014-09-06 20:12 mtaqueue.csv
-rw-r----- 1 zimbra zimbra 2559697 2014-09-06 20:12 mysql.csv
drwxr-x--- 2 zimbra zimbra 4096 2014-06-15 22:13 pid/
-rw-r----- 1 zimbra zimbra 259389 2014-09-06 20:12 pop3.csv
-rw-r----- 1 zimbra zimbra 893333 2014-09-06 20:12 proc.csv
-rw-r----- 1 zimbra zimbra 291123 2014-09-06 20:12 soap.csv
-rw-r----- 1 zimbra zimbra 64545 2014-09-06 20:12 threads.csv
-rw-r----- 1 zimbra zimbra 691469 2014-09-06 20:11 vm.csv
-rw-r----- 1 zimbra zimbra 105 2014-09-06 19:08 zmstat.out
-rw-r----- 1 zimbra zimbra 151 2014-09-06 06:28 zmstat.out.1.gz
-rw-r----- 1 zimbra zimbra 89 2014-09-04 21:15 zmstat.out.2.gz
-rw-r----- 1 zimbra zimbra 98 2014-09-04 01:41 zmstat.out.3.gz

Each of those CSV files contains the information you want, in one of a couple of formats. Most are really easy.


sudo head mtaqueue.csv
Password:
timestamp, KBytes, requests
09/06/2014 00:00:00, 4215, 17
09/06/2014 00:00:30, 4257, 17
09/06/2014 00:01:00, 4254, 17
09/06/2014 00:01:30, 4210, 16
... snip ...

In this case, there are three columns: the timestamp, the number of kilobytes in the queue, and the number of requests. Most of the CSV files have (many) more columns, but this one works pretty simply. The file is updated every minute, so if you have a cron job that grabs the last line of the file, parses it, and sends it into Graphite, your work is basically done:


zimbra$ crontab -l
... snip ...
* * * * * /opt/zimbra/zimbra-stats/zimbraMTAqueue.py

And looking at that file, it’s super-easy:


#!/usr/bin/python

import pyGraphite as graphite
import sys
import resource

# Read the whole stats file into memory
CSV = open('/opt/zimbra/zmstat/mtaqueue.csv', "r")
lineList = CSV.readlines()
CSV.close()
GraphiteString = "MY.GRAPHITE.BASE"

# The last line of the file is the newest sample: timestamp, KBytes, requests
rawLine = lineList[-1]
listVals = rawLine.split(',')

values = {
	'kbytes': listVals[1],
	'items':  listVals[2],
	}

graphite.connect()

for value in values:
	graphite.sendData(GraphiteString + "." + value + " ", values[value])

graphite.disconnect()

So there you go. My Python isn’t awesome, but it gets the job done. Any imports that aren’t used here are left over because some of my other scripts needed them, and by the time I got to this one, I was mostly just copying and pasting my own code. #LazySysAdmin

The only CSV file that took me a while to figure out was imap.csv. The format of that one is more interesting:

msimmons@zimbra:/opt/zimbra/zmstat$ sudo head imap.csv
timestamp,command,exec_count,exec_ms_avg
09/06/2014 00:00:13,ID,11,0
09/06/2014 00:00:13,FETCH,2,0
09/06/2014 00:00:13,CAPABILITY,19,0
...snip...

So you get the timestamp, the IMAP command, the number of times that command is being executed, and how long, on average, it took, so you can watch latency. But the trick is that you only get one command per line, so the previous tactic of only grabbing the final line won’t work. Instead, you have to grab the last line, figure out the timestamp, and then grab all of the lines that match the timestamp. Also, I’ve found that not all IMAP commands will show up every time, so make sure that your XFilesFactor is set right for the metrics you’ll be dealing with.
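As an aside, XFilesFactor lives in storage-aggregation.conf on the Graphite/Carbon side. An entry along these lines (the section name and pattern are just examples; match them to wherever you store these metrics) tells Whisper to keep rolling sparse metrics up into the lower-precision archives even when most of the slots in a period are empty:

# storage-aggregation.conf on the Graphite server
[zimbra_imap]
pattern = \.imap\.
xFilesFactor = 0
aggregationMethod = average

Note that this only affects Whisper files created after the change; existing .wsp files have to be adjusted separately.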

The code is only a little more complicated, but still isn’t too bad:

#!/usr/bin/python

import pyGraphite as graphite
import sys
import resource

imapCSV = open('/opt/zimbra/zmstat/imap.csv', "r")
lineList = imapCSV.readlines()
imapCSV.close()
GraphiteString = "MY.GRAPHITE.PATH"

class imapCommand:
	name = ""
	count = ""
	avgres = ""

	def __init__(self, name, count, avgres):
		self.name = name
		self.count = count
		self.avgres = avgres
	

IMAPcmds = list()

# The newest timestamp in the file
datestamp = lineList[-1].split(',')[0]

record = len(lineList)

# Walk backwards from the end of the file, collecting every row that
# shares the newest timestamp (one row per IMAP command)
while True:
	if ( lineList[record-1].split(',')[0] == datestamp ):
		CMD = lineList[record-1].split(',')[1]
		COUNT = lineList[record-1].split(',')[2]
		AVGRES = lineList[record-1].split(',')[3].strip()
		IMAPcmds.append(imapCommand(CMD, COUNT, AVGRES))
	else:
		break
	record = record - 1

graphite.connect()

for command in IMAPcmds:
	graphite.sendData(GraphiteString + "." + command.name + ".count ", command.count)
	graphite.sendData(GraphiteString + "." + command.name + ".avgres ", command.avgres)

graphite.disconnect()

You can read much more about all of these metrics in the online documentation, Monitoring Zimbra.

Now, so far this has all been runtime metrics, which are helpful but don’t actually give me account information. To get that, we’re going to use some of the built-in Zimbra tools. zmaccts lists all accounts and then prints a summary at the end. We can just grab the summary and learn the number of accounts. We can also use the zmlicense -p command to get the number of licensed accounts we have.

The shell script is pretty easy:

$ cat zimbra-stats/zimbraAccountStatuses.sh
#!/bin/bash

# Creates $GRAPHITESERVER and $GRAPHITEPORT
. /opt/zimbra/zimbra-stats/graphite.sh

OUTPUT="`/opt/zimbra/bin/zmaccts | tail -n 1`"

ACTIVE=`echo $OUTPUT | awk '{print $2}'`
CLOSED=`echo $OUTPUT | awk '{print $3}'`
LOCKED=`echo $OUTPUT | awk '{print $4}'`
MAINT=`echo $OUTPUT | awk '{print $5}'`
TOTAL=`echo $OUTPUT | awk '{print $6}'`
NEVERLOGGEDIN=`/opt/zimbra/bin/zmaccts | grep "never$" | wc -l`

MAX="`/opt/zimbra/bin/zmlicense -p | grep ^AccountsLimit= | cut -d \= -f 2`"

STATPATH="MY.GRAPHITE.PATH"

/opt/zimbra/zimbra-stats/pyGraphite.py ${STATPATH}.active ${ACTIVE} 
/opt/zimbra/zimbra-stats/pyGraphite.py ${STATPATH}.closed ${CLOSED}
/opt/zimbra/zimbra-stats/pyGraphite.py ${STATPATH}.locked ${LOCKED} 
/opt/zimbra/zimbra-stats/pyGraphite.py ${STATPATH}.maintenance ${MAINT} 
/opt/zimbra/zimbra-stats/pyGraphite.py ${STATPATH}.total ${TOTAL} 
/opt/zimbra/zimbra-stats/pyGraphite.py ${STATPATH}.neverloggedin ${NEVERLOGGEDIN} 
/opt/zimbra/zimbra-stats/pyGraphite.py ${STATPATH}.max ${MAX}  


Forgive all of the shortcuts taken in the above. Things aren’t quoted when they should be and so on. Use at your own risk. Warranty void in Canada. Etc etc.

Overall, the point is to get that additional transparency into the mail server. Even after we get the server upgraded and onto a modern OS, this kind of information will be a welcome addition.

Oh, and for the record?

$ find ./ -name "*wsp" | wc -l
8783

Over 8,500 metrics coming in. Sweet. Most of that is coming from collectd, but that’s another blog entry…

Nagios-Plugins Brouhaha

I’m really not a fan of political infighting. It’s bad for an organization, and it’s bad for the people who rely on the organization. But it happens, because we’re people, and people have ideas and egos and goals that are mutually exclusive of each other.

Such as it is with Nagios at the moment. Although there’s been some strife over the IP and trademarks surrounding Nagios for a while, the most recent thing is that the plugins site was…reassigned, I suppose you would say.

For a brief timeline, Nagios began as NetSaint, back in 1999. It was renamed Nagios in 2001 according to the WayBack Machine, apparently because of potential trademark issues. The plugins were apparently spun off from the main NetSaint project around this time as well, although the domain creation date is 2008-05-23 for nagios-plugins.org and 2007-01-15 for nagiosplugins.org.

So what’s going on now? Well, according to a news entry on the Nagios.org site, the Nagios Plugin Team had some changes:

The Nagios Plugin team is undergoing some changes, including the introduction of a new maintainer. The www.nagios-plugins.org website will remain the official location of the Nagios Plugins, and development of the plugins will continue on github at https://github.com/nagios-plugins.

Changes are being made to the team as the result of unethical behavior of the previous maintainer Holger Weiss. Weiss had repeatedly ignored our requests to make minor changes to the plugins website to reflect their relation to Nagios, rather than unrelated projects and companies. After failing to acknowledge our reasonable requests, we updated the website to reflect the changes we had requested. Rather than contacting us regarding the change, Weiss decided to embark on a vitriolic path of attacking Nagios and spreading mistruths about what had happened.

We believe that this type of unethical behavior is not beneficial for the Nagios community nor is it in keeping with the high standards people have come to rely on from Nagios. Thus, we have decided to find a new maintainer for the plugins. A new maintainer has already stepped forward and will be announced shortly.

We would like to thank all current and past plugin developers for their contributions and welcome anyone new who is interested in contributing to the project moving forward.

So that’s what Nagios has to say.

The reason they specify the official location for the plugins is that the original team is continuing development on the (their?) project at https://www.monitoring-plugins.org. According to a news post there:

In the past, the domain nagios-plugins.org pointed to a server independently maintained by us, the Nagios Plugins Development Team. Today, the DNS records were modified to point to web space controlled by Nagios Enterprises instead. This change was done without prior notice.

This means the project can no longer use the name “Nagios Plugins”. We, the Nagios Plugins Development Team, therefore renamed the Nagios Plugins to Monitoring Plugins.

We’re not too happy having to make this move. Renaming the project will lead to some confusion, and to quite a bit of work for others and for ourselves. We would’ve preferred to save everyone this trouble.

However, we do like how the new name indicates that our plugins are also used with various other monitoring applications these days. While the Nagios folks created the original implementation of the core plugins bundle, an independent team has taken over development more than a decade ago, and the product is intended to be useful for all users, including, but not limited to, the customers of Nagios Enterprises.

It’ll probably take us a few days to sort out various issues caused by the new project name, but we’re confident that we can resume our development work towards the next stable releases very soon.

We’d like to take the chance to thank you, our community, for your countless contributions, which made the plugins what they are today. You guys are awesome. We’re looking forward to the next chapter of Monitoring Plugins development, and we hope you are, too!

Throwing gasoline onto the fire is Michael Friedrich, lead developer of Icinga (a Nagios fork), who submitted a Red Hat bug claiming:

The nagios-plugins.org website has been compromised, and the project team therefore moved to https://www.monitoring-plugins.org including the tarball releases. They also renamed their project from ‘nagios-plugins’ to ‘monitoring-plugins’ and it’s most likely that the tarball release names will be changed to the new name in future releases.

https://www.monitoring-plugins.org/archive/help/2014-January/006503.html

Additional info:

While I wouldn’t suggest to rename the package unless there’s an immediate requirement, the source and URL location should be updated in order to use official releases provided by the Monitoring Plugins Development Team. Further, users should use official online references and stay safe.

then later in the thread

Actually the old Nagios Plugins Development Team was required to rename their project due to the fact that Nagios Enterprises hijacked the DNS and website and kicked them out.

Whilst the original memebers are now known as ‘Monitoring Plugins Development Team’, the newly formed ‘Nagios Core Plugin Development Team’ is actually providing a fork of their work under the old name.

For any clarifications required, you should follow the discussion here: https://www.monitoring-plugins.org/archive/devel/2014-January/009417.html

Imho the official former nagios plugins are now provided by the same developers under a new URL and that should be reflected for any future updates.

Though, I’m leaving that to you who to trust here. I’m just a community member appreciating the work done by the original Nagios Plugins Development team, accepting their origin and the fact that there’s no censorship of Icinga, Shinken or Naemon (as Nagios forks) needed.

Clearly, the nagios-plugins.org site was not compromised – neither Nagios nor the Monitoring Plugins team is claiming that. I’ll kindly assume that Mr Friedrich was mistaken when he posted the original bug. Hanlon’s Razor and all.

So what’s my opinion? Glad you asked.

The Nagios news post states that there had been repeated requests for small changes to the website’s content. When those didn’t happen, they pulled the rug out from under half a dozen community contributors who have collectively done a great deal of good for the project. That’s not the way you show appreciation where I’m from, but hey, I don’t know the particulars. I only know what I see on the web, just like you.

What does this all mean for us? Well, if you run anything that uses Nagios plugins, it means you’ve got a choice – go with the official package, or go with the community-maintained version. Which will be better? Which will the distros use? Probably the official plugins, though I expect the more rapidly-moving distros to offer a package from the Monitoring Plugins team as soon as there’s any noticeable difference.

But on the bigger-picture scale, Nagios’s previously solid position as the principal Open Source monitoring solution isn’t as unassailable as they seem to think it is. Cutting off a volunteer team that produces a big part of your product isn’t really a good way to advertise stability and unification. There are a lot of options for monitoring today, and a lot more of them are viable than was the case 4 or 5 years ago. Instead of this political crap that does nothing to advance the project, I think Nagios should focus on improving the core product. But what do I know? I’m just some blogger on the internet.

Strangely, as I write this, the Nagios Exchange is down. I don’t know what that means.