Nagios Configuration HOWTO

Date July 20, 2009

Although this is a HOWTO, ironically, it doesn't cover the actual configuration of Nagios. It covers the part of the configuration that might actually be harder than creating hosts, groups, and services. It covers how to organize your configuration directory in a way that makes it easiest to add new hosts, groups, services and commands in a logical manner. Organization is the key to a successful Nagios configuration...

Ah, Nagios configuration. It's a topic that's near to my heart. I was actually whining about it the other day, if you read my review on Learning Nagios 3.0.

Anyway, as a matter of serendipity, a day or so after I posted that review, Ray Holtz posted a question on the sysadmin network about Nagios configurations. Ray currently has all of his services and hosts configured in one file, and has his dependencies in another file. That's not really that far off from the default install, but it's painful to update and administer. He wanted to know if there was a better way.

Scrolling through a giant file like that takes a lot of time, and trying to keep track of changes is the pits. After working out the configuration in my head for a while, I think I've got a pretty decent system, and I'm working on making it better. Here goes.

To give you an idea of scale, I'm monitoring somewhere around 100 hosts between two Nagios servers. I've got two physical sites (well, technically four, but two critical sites), and each site has its own Nagios install. Each Nagios install monitors all of the local servers, plus the remote Nagios server, plus all of the network connections. This way I'll be alerted if either of the Nagios machines goes down, any of the network connections go down, or (obviously) any of the "normal" servers have issues.

As for individual server configuration management, I'd recommend you put your entire /usr/local/nagios/etc directory into a Subversion repository (along with libexec). I have to admit that I haven't done this yet, but it's in the works. I want to be able to track changes to my config over time, and Subversion is a great way to do that.

I create a hierarchy of subdirectories under etc/objects. Here's how mine looks:


[root@web etc]# pwd
/usr/local/nagios/etc
[root@web etc]# tree -d
.
`-- objects
    |-- commands
    |-- computers
    |   |-- linux
    |   `-- windows
    |-- misc
    `-- network
        |-- firewalls
        |-- links
        |-- routers
        `-- switches

11 directories
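
For Nagios to pick up a tree like this, the main nagios.cfg has to point at it. A minimal sketch, assuming the install prefix shown in the listing above: the cfg_dir directive reads every file ending in .cfg in the named directory and recurses into its subdirectories, so a single line covers the whole hierarchy.

```cfg
# /usr/local/nagios/etc/nagios.cfg (excerpt)
# Read every *.cfg file under objects/, including all subdirectories.
cfg_dir=/usr/local/nagios/etc/objects
```

If the stock cfg_file= lines for files under objects/ are left in place, those files may end up being read twice and produce duplicate-definition errors, so they would typically be removed or commented out.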

Generally speaking, I try to abuse Nagios 3's inheritance model. The layers of abstraction make it a beautiful tool to quickly prototype new configurations and get new services checked with a minimum of effort.

I have a prototype for both of the "major" types: computers and network. There's a "computers.cfg" and a "network.cfg", each living in the directory of the same name, and the only content of each file is a host declaration:

define host {
        use                     generic-host
        name                    computers
        check_command           check_ping!100.0,20%!500.0,60%
        notification_options    d,u,r,f,s
        register                0
        max_check_attempts      10
        notification_interval   60
        contact_groups          it-admins
}

That's a good general baseline for my install, and since all of these values can be overridden by more specific declarations later, hey, no worries.
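
As a sketch of what a more specific declaration might look like, here is a hypothetical second-level template that builds on "computers". The linux-host name matches the template used further down; the linux-hosts hostgroup and the overridden value are assumptions for illustration, not my actual file.

```cfg
# Hypothetical template layered on top of the "computers" prototype
define host {
        use                     computers       ; inherit everything from the prototype
        name                    linux-host      ; template name that host files reference with "use"
        hostgroups              linux-hosts     ; assumed group carrying the checks every Linux box gets
        max_check_attempts      5               ; example override of an inherited value
        register                0               ; template only; register is not inherited, so repeat it
}
```

Anything not mentioned here (check_command, contact_groups, and so on) still comes from "computers", and a host file can override any of it again.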

Right now, there is a set of "group" config files and "service" config files in those directories as well. I'm going to move them into their own subdirectories eventually, so they're easier to manage and less cluttered, but the general organizational theme is that every group check should be self-contained.

In other words, if I've got a web group, then the web-group.cfg file contains the hostgroup declaration as well as any service declarations needed to make sure that those checks happen. Since many groups undoubtedly share the same checks, and you don't want to be changing 6 million pieces of code every time one of the low-level commands changes, the services can inherit their settings from a very "general" related service check further down the config line.
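
A "general" service check of that kind might look something like the sketch below. The generic-http name is the one used in the next example; the body is an assumption built on the generic-service template and check_http command that ship in the sample configuration.

```cfg
# Hypothetical shared HTTP service template
define service {
        use                     generic-service ; stock service template from the sample config
        name                    generic-http    ; group files inherit from this with "use"
        service_description     HTTP
        check_command           check_http      ; stock command from the sample commands.cfg
        register                0               ; template only, never checked on its own
}
```

If check_http ever needs different arguments, this is the one place to change them, and every group that inherits from the template follows along.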

So for instance, my web servers need web checks, obviously. So do my application servers. So do my firewalls. And my load balancers. I don't want to maintain a full fledged service declaration set for each one of those (and I bet you don't either), but at the same time, all of those devices are administered by very different people in many cases. So in your group config, create a service declaration similar to this:


define service{
        use                     generic-http
        hostgroup_name          firewall-http
        contact_groups          firewall-admins
}

This allows you to have ultra-local service declarations in with your hostgroup declarations, and you have very fine-grained control over any variables that need to be changed. Notice that I didn't specify the check_command, because the one specified in generic-http is perfectly acceptable. If the firewall web servers had some specialized requirement, I would specify the check_command in the service declaration, and then below the service declaration, I would define the command to be used. Ultra-local, discrete units that can be administered much more efficiently than by searching through commands.cfg to find the right check command.
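
To make the "specialized requirement" case concrete, here is a rough sketch of a complete group file. The command name, the port, and the alias are made up for illustration; the check_http flags used (-I, -S, -p) are standard plugin options.

```cfg
# firewall-http.cfg: hostgroup, service and command kept together (sketch)
define hostgroup {
        hostgroup_name          firewall-http
        alias                   Firewalls with a web interface
}

define service {
        use                     generic-http
        hostgroup_name          firewall-http
        contact_groups          firewall-admins
        check_command           check_firewall_https   ; overrides the inherited check_command
}

# Hypothetical command, defined right below the service that uses it
define command {
        command_name            check_firewall_https
        command_line            $USER1$/check_http -I $HOSTADDRESS$ -S -p 8443
}
```

Hosts join the group through the hostgroups directive in their own files, so nothing in this file needs to change when a firewall is added.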

Because we've got the services compartmentalized like this, we can use Unix permissions to manage users' abilities to edit these files. Only want your firewall team to be able to change firewall Nagios rules? Very difficult if you've only got 5 files. Simple if it's set up like this.

Individual hosts do, in fact, get their own configuration file, and the filename is FQDN.cfg. This eliminates any ambiguity that might come into play with less specific names, and it allows me to set up exactly the checks I need, because host declarations can belong to multiple hostgroups. Got a web server that also serves files?


define host{
        use             linux-host
        host_name       fs-web.internal.domain
        alias           fs-web.internal.domain
        address         10.95.1.22
        hostgroups      http-servers, file-servers
}

*BAM* Every check I need is in place, because the host uses the "linux-host" prototype, which specifies a ping check, automatically adds a check for snmpd, and covers any other baseline requirements. http-servers does all the web-related stuff, and file-servers makes sure all the NAS stuff is available. All in 7 lines.

The initial configuration still takes a lot of time, both to get the hierarchy set up and to "wrap your head around" the way it works, but the amount of time I've saved by being able to create short files which do lots of stuff is just amazing. I can't even tell you how much time, because I've stopped thinking about it.

I honestly can't believe that some books and howtos tell you to keep things in the original files.

I hope this has helped open your eyes to other possibilities for Nagios configuration and manageability. I'm sure some of you have interesting configuration setups, too. Share yours in the comments (or tell me what I'm doing wrong, so I can fix it!)

  • http://mymcp.blogspot.com/ steve.lippert

    This looks like an excellent guide. I currently have each server / host as an FQDN.cfg file, but I copy one to make a new one. I don't have this level of abstraction yet.

  • http://bok.xs4all.nl/weblog/ BOK

    The Nagios-instance / cfg. I inherited is a complete mess... This post might be a good start for rearranging stuff. Thanks! Any more hints for reorganizing Nagios are welcome.

  • http://www.standalone-sysadmin.com Matt Simmons

    Thanks, Steve. I'm glad you like it!

    Copying individual configs is easy, too. If you use very similar configs for all your hosts, you could write a simple script that creates the shell.

    Also, @joerussbowman on my twitter feed has promised to comment here later this week on a method of easily creating host configs.

  • http://www.standalone-sysadmin.com Matt Simmons

    @BOK

    Thanks!

    My biggest piece of advice would be to simplify. It's WAY too easy to get yourself tangled in a gordian knot of Nagios dependencies. Step back and look at the big picture. Draw diagrams, spreadsheets, or whatever it takes to help you keep track of what is happening where.

    Anything in particular you're having problems with?

  • Ray Holtz

    I saw you replied on Sysadmin Network, but didn't have a chance to fully look through your reply. Hopefully in the next day or so. I'm glad I was able to help you create a blog post about it, haha!
    Thanks Matt!!

  • http://www.standalone-sysadmin.com Matt Simmons

    Thanks Ray, I'm very grateful that you asked the question. Please let me know if you have any questions about the write up. I always look forward to feedback :-)

  • Scott

    Matt, are you setting up dependencies at all? I'm struggling in my current config because I'm monitoring about 50 different coldfusion websites via check_http, and if the coldfusion process locks up, every one of the 50 begins to alert me at the same time. I was wondering how something like that would look in your arrangement.

    I've got them set to the host as the parent, and if that server shuts off completely, I don't get 50 alerts, but the more typical failure is a runaway coldfusion process that begins queuing requests.

    As for my current organization method, I use a series of plain text files in directories that list a server per line. The directory name that the text file is in and the name of the file itself identify the type of server (i.e. "linux/apache.txt" "windows/sqlserver.txt") and then a perl script rifles through them and spits out the appropriate service entries for each host. Your way looks cleaner :-)

  • http://www.standalone-sysadmin.com Matt Simmons

    Scott,

    Thanks for the comment. I've got to admit that right now I don't have any dependencies configured. Give me a day or so to go through the documentation again and I'll see about the best way to integrate them into this.

    Thanks for the compliment. I really like this method of doing it, but if I can't get dependencies resolved, it might only be useful for me.

  • Nils

    Matt, thanks for the great howto. Could you just please explain how you add snmpd checks by using the "linux-host" prototype. Perhaps you can also post the prototype.

    best regards...
    Nils

  • http://www.standalone-sysadmin.com Matt Simmons

    Nils:

    I'll post a better prototype when I get to work tomorrow, but essentially, I've got a service setup like this:

    define service {
    use generic-service
    service_description snmpd-check
    check_command check_snmpd
    hostgroup_name linux-hosts
    }

    Whenever you create a service, you need to specify either a host or a hostgroup. I've never really liked this requirement, since I'd like to define which services the hosts or hostgroups have on the host side of the equation, rather than the service side, but it's non-optional. Since it's non-optional, I just assign a service to a hostgroup that needs that service by default. In this case, all of my linux hosts are required to have snmpd running (so I can do other checks), so I create an snmpd service and set the hostgroup as the hostgroup that all of the linux hosts belong to.

    Restart nagios and voila, all of my linux hosts automatically have snmpd checks.

  • http://www.standalone-sysadmin.com Matt Simmons

    I should also mention that this page:
    http://nagios.sourceforge.net/docs/3_0/objectdefinitions.html

    is the most important page of Nagios documentation in existence :-)

  • Nils

    Matt, thanks for your answer.

    In your example, the host fs-web.internal.domain only belongs to the hostgroups "http-servers" and "file-servers". Wouldn't it be necessary to use the additive inheritance feature to make sure, that the host "fs-web" also belongs to the "linux-host" hostgroup?

    define host {
    use generic-host
    name linux-host
    ...
    hostgroups linux-hosts
    }

    define host{
    use linux-host
    host_name fs-web.internal.domain
    ....
    hostgroups +http-servers, file-servers
    }

    It would be nice, if you could post tomorrow some more pieces of your configuration. I really like the way you manage your configs and I hope that I can adopt it to our new nagios instance.

    best regards
    Nils

  • Anthony

    Object inheritance can be a useful tool, however it is not something to be abused.

    I inherited a Nagios implementation where there were so many layers of inheritance, you had to scan through 8 different config files to parse out the entire definition of any single object.

    In order to be able to quickly figure out any object's entire definition, I added comments to the definitions to identify which files contained the templates used in that object's definition.

    Eventually I ended up completely reconfiguring the system with a much flatter configuration.

  • http://www.lamertz.net Michael Lamertz

    Hey,

    thanks for sparing me a good day of work with your great article.

    As a freelancing consultant, I've just finished my fifth larger Nagios installation project, and always wanted to write up something like this.

    Your directory and inheritance setup looks really similar to what I came up with. Only difference is, that I try to reflect the network structure with my directory tree (e.g. one subdirectory for every network zone, or for every logical group of servers), rather than splitting it up by the OS, but ymmv.

    Funny, that none of the Nagios chapters in books and the dedicated books care about this stuff, as it's so important to ease the maintenance in larger setups.

    *sigh* perhaps I'll end up, writing the article in german anyways ;-)

  • http://cern.ch Enrico Bonaccorsi

    Very interesting article!

    We actually manage a Nagios test server with ~1000 servers and ~9000 services monitored.
    Honestly, I had the same problem as you, and after looking around the web for a while and reading the source, in my opinion the best way to configure Nagios is to create one file per host and to use the hostgroup and servicegroup directives as much as possible.

    This way, with a properly configured hostgroups.cfg, you can manage the config easily and avoid repeating the same service everywhere.

    So I would advise creating a file called "hostgroupname1_services.cfg" containing something like the following example.

    define service{
    use generic-service
    service_description Disk /
    check_command nrpe_pvss_check
    hostgroup_name farm_nodes
    # servicegroups oraclegroup #Eventually associate with a service group
    }

    define service{
    use generic-service
    service_description Task Manager Node
    check_command check_taskmanager_node
    hostgroup_name farm_servers
    }

  • http://cern.ch Enrico Bonaccorsi

    Oops, I did not read Matt Simmons' comment.
    Essentially we are speaking about the same stuff.

  • http://blog.friocorte.com/ goozbach

    Matt,

    I took your idea and did one better. I've taken the stock Fedora 11 Nagios config and split it up similar to how you've done it.


    [root@grimm nagios]# pwd
    /etc/nagios
    [root@grimm nagios]# tree -d
    .
    |-- conf.d
    |-- objects
    |   |-- commands
    |   |   `-- local
    |   |-- contacts
    |   |   |-- groups
    |   |   `-- users
    |   |-- hostgroups
    |   |-- hosts
    |   |   |-- linux
    |   |   |-- macos
    |   |   `-- windows
    |   |-- misc
    |   |   |-- envrionmental
    |   |   |-- printers
    |   |   `-- security
    |   |-- network
    |   |   |-- firewalls
    |   |   |-- links
    |   |   |-- load_balancers
    |   |   |-- routers
    |   |   `-- switches
    |   `-- services
    |       |-- localhost
    |       |-- printer
    |       |-- switch
    |       `-- windows
    `-- private

    I've copied a tarball (complete with all the stock configs) to one of my web-servers for download.

  • Josh

    I am following this organizational idea to setup my own, and I am curious how anyone got the service definition to work. I downloaded the tarball uploaded by goozbach, and in the service definitions, it still refers to a single host:


    define service{
    use generic-service
    host_name winserver.mydomain.com
    service_description C:\Drive Space
    check_command check_nt!USEDDISKSPACE!-l c -w 80 -c 90

    }

    I want to write mine to tie this service to a hostgroup, rather than to a single host. Some earlier posts here also mentioned that they were able to use similar syntax such as:


    define service{
    use generic-service
    hostgroup_name windows-servers
    service_description C:\Drive Space
    check_command check_nt!USEDDISKSPACE!-l c -w 80 -c 90

    }

    When I attempt to start Nagios, I get an error like this:


    Error: Could not expand hostgroup and/or hosts specified in service

    Googling this error turns up some pages that mention patching the Nagios code so the service definition does not require host_name. That seems a bit strange to me, and I wanted to see if anyone has gotten this approach to work WITHOUT patching Nagios.

  • http://www.standalone-sysadmin.com Matt Simmons

    Hi Josh,

    Thanks for sharing. Just so I understand, you want to tie a service to a hostgroup, and you're doing that by setting "hostgroup_name" in the service as "windows-servers". Could you paste the section of your config where you define the windows-servers hostgroup?

  • Josh

    Answering my own question, now I understand why people keep calling it "empty hostgroup"... I had not created any host in the hostgroup I was using. As soon as I created a host that belongs directly to that group, this error disappeared. For example, I had a "windows-servers" hostgroup, then I created a host that has the directive "hostgroups" set to "windows-servers", and this eliminated the error. I had set up hostgroup_member within the hostgroup definition but that became unnecessarily complex.

  • James

    Continuing the thread...

    I'm attempting to use the same general configuration:
    - create hostgroups
    - create services which connect to these hostgroups
    - now leave the above definitions alone (they are defined and completed)
    - then create hosts which connect to the hostgroups (thereby getting services)

    Where I'm having trouble is in my need to exclude a service for a particular host. Let's say I have a hostgroup for a certain type of custom server. This custom server ends up getting 20 services. Now, let's say on one of these custom servers I don't want one of the services. Is there a way to exclude just that service for just that host? Seems like there should be a way to remove the service link from the host side but I can't find it. Thank you, James

  • http://www.lamertz.net/ Michael Lamertz

    @James: Inside the service definition that's assigned to the hostgroup, use

    host_name !name-of-server-to-exclude

    I'm using this to exclude a single windows host in a hostgroup that otherwise consists of only unix servers.

    This is documented in the chapter 'object tricks' in the documentation (nope, that's not listed in the index. It can only be found linked from the chapter 'object definitions').

    http://nagios.sourceforge.net/docs/3_0/objecttricks.html
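
    Put together, a service with one host excluded might look like the sketch below. The hostgroup and host names are hypothetical; generic-service and check_ssh come from the sample configuration.

    ```cfg
    # One service for the whole hostgroup, minus a single host (sketch)
    define service {
            use                     generic-service
            hostgroup_name          unix-servers        ; hypothetical group
            host_name               !winbox01           ; hypothetical host to leave out
            service_description     SSH
            check_command           check_ssh
    }
    ```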

  • James

    Thank you, Michael. Unfortunately, that's what I had to do and what I'm hoping to get away from.

    What I would like to do is define the hostgroups and their attached services and then consider them done.

    Then, in my host configuration, I'd like to be able to individually remove services. That would be the logical place. That way host-specific stuff would be in one place. So, within the host definition, something like:

    service_name !name-of-service-to-exclude

    I guess my error was assuming this should exist. It's probably a feature request.

  • James

    Matt,
    Have you found a solution for the service dependencies (July 22 comment)?

    Here is one way that does not work. I set up a service dependency as follows:
    define servicedependency {
    service_description check_service1
    dependent_service_description check_service2
    hostgroup_name hostgroup1
    dependent_hostgroup_name hostgroup1
    }

    Basically if any host that is attached to this hostgroup (hostgroup1) has a failure on check_service1 then all hosts attached to that hostgroup will no longer have their dependent service check_service2 checked. That was an interesting behavior, but not what I intended.
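
    The object tricks page linked above describes "same host" dependencies: leaving out the dependent_host_name and dependent_hostgroup_name directives makes each dependency apply within a single host instead of across all of them. A minimal sketch of that documented form, with hypothetical host names and example failure criteria; whether hostgroup_name can stand in for host_name here may depend on the Nagios version and has not been verified.

    ```cfg
    # check_service2 on each host depends only on check_service1 on the SAME host (sketch)
    define servicedependency {
            host_name                       host1,host2        ; hypothetical hosts
            service_description             check_service1     ; the service being depended on
            dependent_service_description   check_service2     ; the service that gets suppressed
            execution_failure_criteria      n                  ; keep running the dependent check
            notification_failure_criteria   w,u,c              ; but stay quiet while the master is not OK
    }
    ```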

  • James

    Now that a few months have gone by, I wanted to echo Michael's recommendation (excluding hosts in my service definition files). It isn't the cleanest solution, but I'm finding that it works without too much hassle.

    Here's how I have the Nagios configuration organized, starting from /etc/nagios:
    - Create host definitions (one per file) in the sites/[environment]/hosts directory
    - In the host definition, attach all the desired hostgroups (hostgroups ...)
    - Create site-specific service definitions (one per file) in the sites/[environment]/services directory
    - Create hostgroup definitions (one per file) in the global/hostgroups directory
    - Create service definitions (one per file) in the global/services directory
    - In the service definition, attach the desired hostgroup (hostgroup_name ...)
    - In the service definition, remove any undesired hosts (host_name ! ...)

    Now, I have sets of services that I can attach/detach easily to/from hosts. Another important piece is to use renaming of the files. Each of these little configuration files ends in .cfg. To quickly remove any file, rename it to .cfx and reload Nagios. You can't do this if you have big configuration files. You'll find it becomes a pain to comment out hosts. If you use rc [source-code control highly recommended], you can just delete the file. Then, if you need it, you just check it out.

    As examples, you can copy a host definition file from a working environment, change the filename (I use [fqdn].cfg to make life easy), change the IP address in the file, check it in to source-code control and reload. As another example, you can create a new service, attach it to a hostgroup, check it in, reload and have that service now running on many environments. These examples only take a couple of minutes to complete.

    One other thing that is helpful, is that I have a little script that determines which environments are to be included. It creates the configuration file that Nagios uses. Then, I just keep a text file with environment names. This lets me easily turn on and turn off an environment. It's cleaner than renaming all the host and custom-services files and also allows me to quickly put in and take out testing environments.

    I'm monitoring about 50 different environments with about 500 hosts and 3500 services that are changing often (application versions, new customers, removing environments, etc.) With the above, I was able to take about 70K lines of brick-and-mortar Nagios configuration and convert it into about 8K lines. Hopefully, you can avoid all that effort by going global right up front.

    The setup is not painless, but it is supportable. Next step is to automate it all.

    Lastly, I still don't have a good solution for service dependencies. I have most of the dependency definitions in the global area (register 0) and put a little file in my sites/[environment]/services directories. But, it's a little task that is needed for every environment.

  • James

    Oops -- angle brackets are special, so some text disappeared.

    My configuration filenames are in the format: [fqdn].cfg

  • James

    And the directory structure is:
    /etc/nagios
    -- sites/[environment]/hosts
    -- sites/[environment]/services
    -- global/hostgroups
    -- global/services
    -- global/templates
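
    With a layout like this, the generated main configuration mentioned above could be little more than a list of cfg_dir lines: one for the global tree and one per enabled environment. A sketch with hypothetical environment names; the actual script output is not shown in this thread.

    ```cfg
    # nagios.cfg fragment written from the list of enabled environments (sketch)
    cfg_dir=/etc/nagios/global
    cfg_dir=/etc/nagios/sites/prod-east
    cfg_dir=/etc/nagios/sites/staging
    ```

    Dropping an environment from the text file then simply means its cfg_dir line is not written on the next run.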

  • B

    Very Nice, do you also create subdirectories for sites/[environment]/hosts/ for clarity between device types?

  • http://www.standalone-sysadmin.com Matt Simmons

    Well, I've got 2 sites that I'm monitoring, and I have one nagios install at each site.

    I've got kind of a rule when I monitor things, and that's not to run checks across the WAN, with the exception of the network link checks and a check of the remote Nagios host. That way I get alerted if there's a failure anywhere in the network infrastructure OR if either of the nagios hosts go down.

    That sort of relieves me of the need to do the site-based hierarchy, but if your infrastructure's Nagios install would benefit from that form, then by all means use it :-) Sysadmins can't afford to be dogmatic. We've got to be pragmatic and do what's best.

  • http://blog.friocorte.com Goozbach

    I had someone contact me for the tarball I mentioned above. The site I posted it on has since been retired.

    Here's the latest info:

    http://blog.friocorte.com/2011/09/nagios-config-template----goozbach-rewind.html

  • http://deadc.org deadc

    BUMP!

    With this layout, how could I set host-specific commands? e.g.

    host1 - check_ping 100.30%
    host2 - check_ping 200.30%

    Should I use variables or customize commands?

    cheers!

  • http://none Nic

    Matt, thanks for the wonderful advice. I've been thinking about the following for sometime, maybe you can shed some light on it. Is it possible to assign a group of services to a group of hosts? For example I would like this host group:


    define hostgroup {
    hostgroup_name web-servers
    alias Web Servers
    members webserver1,webserver2,webserver3
    }

    to be "linked" to this service group


    define servicegroup {
    servicegroup_name web-services
    alias Web services
    members service1,service2,service3
    }

    The main problem I see is that there is no way to define a service without specifying host_name, which is required. And there is no way to somehow link a service group with a host group.

    Any thoughts on this?