Appearance on @GeekWhisperers Podcast!

Date October 29, 2014

I was very happy to visit Austin, TX not long ago to speak at the Spiceworld Austin conference, held by Spiceworks. Conferences like that are a great place to meet awesome people you only talk to on the internet and to see old friends.

Part of the reason I was so excited to go was because John Mark Troyer had asked me if I wanted to take part in the Geek Whisperers Podcast. Who could say no?

With over 60 episodes to their name, Geek Whisperers fills an amazing niche of enterprise solutions, technical expertise, and vendor luminaries that spans every market that technology touches. Hosted by John, Amy "CommsNinja" Lewis, and fellow Bostonian Matt Brender (well, he lives in Cambridge, but that’s close, right?), they have been telling great tales and having a good time doing it for years. I’ve always respected their work, and I was absolutely touched that they wanted to have me on the show.

We met on the Tuesday of the conference and sat around for an hour talking about technology, tribes, and the progression of people and infrastructure. I had a really good time, and I hope they did, too.

You can listen to the full podcast on Geek-Whisperers.com or through iTunes or Stitcher.

Please comment below if you have any questions about things we discussed. Thanks for listening!

Accidental DoS during an intentional DoS

Date October 24, 2014

Funny, I remember always liking DOS as a kid...

Anyway, on Tuesday, I took a day off, but ended up getting a call at home from my boss at 4:30pm or so. We were apparently causing a DoS attack, he said, and the upstream university had disabled our net connection. He was trying to conference in the central network (ITS) admins so we could figure out what was going on.

I sat down at my computer and was able to connect to my desktop at work, so the entire network wasn't shut down. It looked like what they had done was actually turn off outbound DNS, which made me suspect that one of the machines on our network was performing a DNS reflection attack, but this was just a sign of my not thinking straight. If that had been the case, they would have shut down inbound DNS rather than outbound.

After talking with them, they saw that something on our network had been initiating a denial of service attack on DNS servers using hundreds of spoofed source IPs. Looking at graphite for that time, I suspect you'll agree when I say, "yep":

Initially, the malware was spoofing IPs from all kinds of IP ranges, not just things in our block. As it turns out, I didn't have the sanity check on my egress ACLs on my gateway that said, "nothing leaves that isn't in our IP block", which is my bad. As soon as I added that, a lot of the traffic died. Unfortunately, because the university uses private IP space in the 10.x.x.x range, I couldn't block that outbound. And, of course, the malware quickly caught on and started exclusively using 10.x addresses to spoof from. So we got shut down again.
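For the curious, the missing sanity check boils down to a couple of lines. Here's an IOS-style sketch - the 192.0.2.0/24 prefix and the interface name are placeholders, and your gateway's syntax will differ:

ip access-list extended EGRESS-SANITY
 remark Nothing leaves that isn't sourced from our own block
 permit ip 192.0.2.0 0.0.0.255 any
 deny ip any any log
!
interface TenGigabitEthernet1/1
 ip access-group EGRESS-SANITY out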

Over the course of a day, here's what the graph looked like:

Now, on the other side of the coin, I'm sure you're screaming "SHUT DOWN THE STUPID MACHINE DOING THIS", because I was too. The problem was that I couldn't find it. Mostly because of my own ineptitude, as we'll see.

Alright, it's clear from the graph above that there were some significant bits being thrown around. That should be easy to track. So, let's fire up graphite and figure out what's up.

Most of my really useful graphs are thanks to the ironically named Unhelpful Graphite Tip #6, where Jason Dixon describes the "mostDeviant" function, which is pure awesome. The idea is that, if you have a BUNCH of metrics, you probably can't see much useful information because there are so many lines. So instead, you probably want the few weirdest metrics out of that collection, and that's what you get. Here's how it works.

In the graphite box, set the time frame that you're looking for:

Then add the graph data that you're looking for. Wildcards are super-useful here. Since the uplink graph above is a lot of traffic going out of the switch (tx), I'm going to be looking for a lot of data coming into the switch (rx). The metric that I'll use is:


CCIS.systems.linux.Core*.snmp.if_octets-Ethernet*.rx

That metric, by itself, looks like this:

There's CLEARLY a lot going on there. So we'll apply the mostDeviant filter:

and we'll select the top 4 metrics. At this point, the metric line looks like this:


mostDeviant(4,CCIS.systems.linux.Core*.snmp.if_octets-Ethernet*.rx)

and the graph is much more manageable:

Plus, most usefully, now I have port numbers to investigate. Back to the hunt!

As it turns out, those two ports are running to...another switch. An old switch that isn't being used by more than a couple dozen hosts. It's destined for the scrap heap, and because of that, when I was setting up collectd to monitor the switches using the snmp plugin, I neglected to add this switch. You know, because I'm an idiot.

So, I quickly modified the collectd config and pushed the change up to the puppet server, then refreshed the puppet agent on the host that does snmp monitoring and started collecting metrics. Except that, at the moment, the attack had stopped...so it was a waiting game for an attack that might never come again. As luck would have it, the attack started again, and I was able to trace it to a port:

Gotcha!

(Notice how we actually WERE under attack when I started collecting metrics? It was just so tiny compared to the full-on attack that we thought it might have been normal baseline behavior. Oops.)

So, checking that port led to...a VM host. And again, I encountered a road block.

I've been having an issue with some of my VMware ESXi boxes where they will encounter occasional extreme disk latency and fall out of the cluster. There are a couple of knowledgebase articles ([1] [2]) that sort-of kind-of match the issue, but not entirely. In any event, I haven't ironed it out. The VMs are fine during the disconnected phase, and the fix is to restart the management agents through the console, which I was able to do and then I could manage the host again.

Once I could get a look, I could see that there wasn't a lot on that machine - around half a dozen VMs. Unfortunately, because the host had been disconnected from the vCenter Server, stats weren't being collected on the VMs, so we had to wait a little bit to figure out which one it was. But we finally did.

In the end, the culprit was a NetApp Balance appliance. There's even a knowledge base article on it being vulnerable to ShellShock. Oops. And why was that machine even available to the internet at large? Double oops.

I've snapshotted that machine and paused it. We'll probably have some of the infosec researchers do forensics on it, if they're interested, but that particular host wasn't even being used. VM cruft is real, folks.

Now, back to the actual problem...

The network uplink to the central network happens over a pair of 10Gb/s fiber links. According to the graph, you can see that the VM was pushing 100MB/s (800Mb/s). This is clearly Bad(tm), but it's not world-ending bad for the network, right? Right. Except...

Upstream of us, we go through an in-line firewall (which, like OUR equipment, was not set to filter egress traffic based on spoofed source IPs - oops, but not mine this time, finally!). We are assigned to one of five virtual firewalls on that one physical piece of hardware, but despite that separation, the limit of around a couple hundred thousand concurrent sessions applies to the physical device as a whole.

For a network this size, that is probably(?) reasonable, but a session counts as a stream of packets between a source IP and a destination IP. Every time you change the source IP, you get a new session, and when you spoof thousands of source IPs, every one of them spawns new sessions - a single host cycling through spoofed 10.x addresses can burn through a couple hundred thousand sessions almost immediately. And since it's a per-physical-device limit, our one rogue VM managed to exhaust the resources of the big giant firewall.

In essence, this one intentional DoS attack on a couple of hosts in China successfully DoS'd our university as sheer collateral damage. Oops.

So, we're working on ways to fix things. A relatively simple step is to prevent egress traffic from IPs that aren't our own, and that is done now. We've also been told that we need to block egress DNS traffic, except from known hosts or to public DNS servers. This is in place, but I really question its efficacy: DNS is hardly the only UDP protocol that can be abused this way - NTP reflection attacks are a thing, too. Anyway, we're now blocking egress DNS, and I've had to special-case a half-dozen research projects, but that's fine by me.
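If you're wondering what that DNS rule looks like, it's conceptually something like this - again an IOS-style sketch with placeholder addresses, not our actual config:

ip access-list extended EGRESS-DNS
 remark our known resolvers can talk to anyone
 permit udp host 192.0.2.53 any eq domain
 remark anyone may reach well-known public resolvers
 permit udp any host 8.8.8.8 eq domain
 permit udp any host 8.8.4.4 eq domain
 remark everything else gets no outbound DNS
 deny udp any any eq domain
 permit ip any any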

In terms of things that will make an actual difference, we're going to be re-evaluating the policies for putting VMs on publicly-accessible networks. I think it's likely that external access to new resources will need to be justified, whereas in the past, the default has been to leave things open because we're a college, and that's what we do, I guess. I've never been a fan of that from a security perspective, so I'm glad it's likely to change now.

So anyway, that's how my week has been. Fortunately, it's Friday, and my sign is up to "It has been [3] Days without a Network Apocalypse".

VM Creation Day - PowerShell and VMware Automation

Date October 17, 2014

I should have ordered balloons and streamers, because Monday was VM creation day on my VMware cluster.

In addition to a 3-node production-licensed vSphere cluster, I run a 10-node cluster specifically for academic purposes. One of those purposes is building and maintaining classroom environments. A lot of professors maintain a server or two for their courses, but our Information Assurance program here goes above and beyond in terms of VM utilization. Every semester, I've got to deal with the added load, so I figured if I'm going to document it, I might as well get a blog entry while I'm at it.

vmware_ia_spinup

Conceptually, the purpose of this process is to allow an instructor to create a set of virtual machines (typically between 1 and 4 of them), collectively referred to as a 'pod', which will serve as a lab for students. Once this set of VMs is configured exactly as the professor wants, and they have signed off on them, those VMs become the 'Gold Images', and then each student gets their own instance of these VMs. A class can have between 10 and 70 students, so this quickly becomes a real headache to deal with, hence the automation.

Additionally, because these classes are Information Assurance courses, it's not uncommon for the VMs to be configured in an insecure manner (on purpose) and to be attacked by other VMs, and to generally behave in a manner unbecoming a good network denizen, so each class is cordoned off onto its own VLAN, with its own PFsense box guarding the entryway and doing NAT for the several hundred VMs behind the wall. The script needs to automate the creation of the relevant PFsense configs, too, so that comes at the end.

I've written a relatively involved PowerShell script to do my dirty work for me, but it's still a long series of things to go from zero to working classroom environment. I figured I would spend a little time to talk about what I do to make this happen. I'm not saying it's the best solution, but it's the one I use, and it works for me. I'm interested in hearing if you've got a similar solution going on. Make sure to comment and let everyone know what you're using for these kinds of things.

The process is mostly automated hard parts separated by manual staging, because I want to verify sanity at each step. This kind of thing happens infrequently enough that I'm not completely trusting of the process yet, mostly due to my own ignorance of all of the edge cases that can cause failures. To the right, you'll see a diagram of the process.

In the script, the first thing I do is include functions that I stole from an awesome post on Subnet Math with PowerShell from Indented!, a software blog by Chris Dent. Because I'm going to be dealing with the DHCP config, it'll be very helpful to be able to have functions that understand what subnet boundaries are, and how to properly increment IP addresses.
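I won't reproduce his functions here (go read the post), but to give you an idea of the kind of helper involved, a stripped-down Get-NextIP might look like this - my simplified sketch, not Chris's code:

# Simplified sketch of an IP-increment helper; the real functions from the
# Indented! post handle subnet boundaries and edge cases properly.
function Get-NextIP {
    param(
        [string]$IPAddress,   # starting address, e.g. "10.0.1.0"
        [int]$Increment = 1   # how many addresses to move forward
    )
    # Dotted quad -> 32-bit integer, add the increment, convert back.
    $bytes = ([System.Net.IPAddress]::Parse($IPAddress)).GetAddressBytes()
    [Array]::Reverse($bytes)   # network byte order -> little-endian for BitConverter
    $value = [System.BitConverter]::ToUInt32($bytes, 0) + $Increment
    $bytes = [System.BitConverter]::GetBytes([uint32]$value)
    [Array]::Reverse($bytes)
    (New-Object System.Net.IPAddress (,$bytes)).ToString()
}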

I also need to make sure that the VMware PowerCLI cmdlets are actually loaded before the script does anything else. We can do that like this:


if ( ( Get-PSSnapin -Name VMware.VimAutomation.Core -ErrorAction SilentlyContinue ) -eq $null ) {
    Add-PSSnapin VMware.VimAutomation.Core
}

For the class itself, this whole process consists of functions to do what needs to be done (or "do the needful", if you use that particular phrase), and it's fairly linear: each step requires the prior one to be completed. What I've done is create an object that represents the course as a whole, and then add the appropriate properties and methods. I don't actually need a lot of the power of OOP, but it provides a convenient way to keep everything together. Here's an example of the initial object setup:


$IA = New-Object psobject

# Lets add some initial values
Add-Member -InputObject $IA -MemberType NoteProperty -Name ClassCode -Value ""
Add-Member -InputObject $IA -MemberType NoteProperty -Name Semester -Value ""
Add-Member -InputObject $IA -MemberType NoteProperty -Name Datastore -Value "FASTDATASTORENAME"
Add-Member -InputObject $IA -MemberType NoteProperty -Name Cluster -Value "IA Program"
Add-Member -InputObject $IA -MemberType NoteProperty -Name VIServer -Value "VSPHERE-SERVER"
Add-Member -InputObject $IA -MemberType NoteProperty -Name IPBlock -Value "10.0.1.0"
Add-Member -InputObject $IA -MemberType NoteProperty -Name SubnetMask -Value "255.255.0.0"
Add-Member -InputObject $IA -MemberType NoteProperty -Name Connected -Value $false
Add-Member -InputObject $IA -MemberType NoteProperty -Name ResourcePool -Value ""
Add-Member -InputObject $IA -MemberType NoteProperty -Name PodCount -Value ""
Add-Member -InputObject $IA -MemberType NoteProperty -Name GoldMasters -Value ""
Add-Member -InputObject $IA -MemberType NoteProperty -Name Folder -Value ""
Add-Member -InputObject $IA -MemberType NoteProperty -Name MACPrefix -Value ""
Add-Member -InputObject $IA -MemberType NoteProperty -Name ConfigDir -Value ""
Add-Member -InputObject $IA -MemberType NoteProperty -Name VMarray -Value @()

These are just the values that almost never change. Since we're using NAT, we're not routing to that network, and every class has its own dedicated VLAN, we can use the same IP block every time without running into a problem. The blank values are there just as placeholders, and they will be filled in as the object's methods are invoked.

At the bottom of the script, which is where I spend most of my time, I set per-class settings:


$IA.ClassCode = "ia1234"
$IA.Semester = "Fall-2014"
$IA.PodCount = 35
$IA.GoldMasters = @(
    @{
        vmname = "ia1234-win7-gold-20141014"
        osname = "win7"
        tcp = 3389
        udp = ""
    },
    @{
        vmname = "ia1234-centos-gold-20141014"
        osname = "centos"
        tcp = ""
        udp = ""
    },
    @{
        vmname = "ia1234-kali-gold-20141014"
        osname = "kali"
        tcp = "22"
        udp = ""
    }
)

We set the class code, semester, and pod count simply. These will be used to create the VM names, the folders, and the resource groups that the VMs live in. The GoldMasters array is a data structure that has an entry for each of the gold images that the professor has created. It contains the name of the gold image, plus a short code that will be used to name the VM instances coming from it, and placeholders for the tcp and udp ports which need to be forwarded from the outside to allow internal access. I don't currently have the code in place that allows me to specify multiple port forwards, but that's going to be added (see the sketch below), because I had a professor request 7(!) forwarded ports per VM in one of their classes this semester.
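When that lands, the data structure probably just grows arrays in place of the scalar ports - a hypothetical tweak on my part, not code from the current script:

# Hypothetical: tcp/udp as arrays, one NAT rule generated per port.
@{
    vmname = "ia1234-win7-gold-20141014"
    osname = "win7"
    tcp    = @(3389, 80, 443)
    udp    = @()
}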

As you can see in the diagram, I'm using Linked Clones to spin up the students' pods. This has the advantage of saving disk space and of completing quickly. Linked clones operate on a snapshot of the original disk image. Rather than have the VMs operate on the gold images themselves, I do a full clone of each gold image over to a faster datastore than the Ol' Reliable NetApp.

We add a method to the $IA object like this:


Add-Member -InputObject $IA -MemberType ScriptMethod -Name createLCMASTERs -Value {
    # This is the code that converts the gold images into LCMASTERs.
    # Because you need to put a template somewhere, it makes sense to put it
    # into the folder that the VMs will eventually live in themselves (thus saving
    # yourself the effort of locating the right folder twice).
    Param()
    Process {
        ... stuff goes here
    }
}

The core of this method is the following block, which actually performs the clone:


if ( ! (Get-VM -Name $LCMASTERName) ) {
    try {
        # Snapshot the gold image so the clone has something to link against
        $presnap = New-Snapshot -Name ("Autosnap: " + $(Get-Date).toString("yyyyMMdd")) -VM $GoldVM -Confirm:$false

        $cloneSpec = New-Object VMware.Vim.VirtualMachineCloneSpec
        $cloneSpec.Location = New-Object VMware.Vim.VirtualMachineRelocateSpec
        $cloneSpec.Location.Pool = ($IA.ResourcePool | Get-View).MoRef
        $cloneSpec.Location.Host = ($GoldVM | Get-VMHost | Get-View).MoRef
        $cloneSpec.Location.Datastore = ($IA.Datastore | Get-View).MoRef
        $cloneSpec.Location.DiskMoveType = [VMware.Vim.VirtualMachineRelocateDiskMoveOptions]::createNewChildDiskBacking
        $cloneSpec.Snapshot = ($GoldVM | Get-View).Snapshot.CurrentSnapshot
        $cloneSpec.PowerOn = $false

        ($GoldVM | Get-View).CloneVM( $LCMasterFolder.MoRef, $LCMASTERName, $cloneSpec )

        Remove-Snapshot -Snapshot $presnap -Confirm:$false
    }
    catch [Exception] {
        Write-Host "Error: " $_.Exception.Message
        exit
    }
} else {
    Write-Host "Template found with name $LCMASTERName - not recreating"
}


If you're interested in doing this kind of thing, make sure you check out the docs for the createNewChildDiskBacking setting.

After the Linked Clone Masters have been created, then it's a simple matter of creating the VMs from each of them (using the $IA.PodCount value to figure out how many we need). They end up getting named something like $IA.ClassCode-$IA.Semester-$IA.GoldMasters[#].osname-pod$podcount which makes it easy to figure out what goes where when I have several classes running at once.
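That per-pod creation loop isn't shown above, but a minimal sketch of it might look like the following. The LC master naming scheme here is an assumption on my part; PowerCLI's New-VM with -LinkedClone does the heavy lifting:

foreach ($gm in $IA.GoldMasters) {
    # Hypothetical naming scheme for the linked-clone master created earlier
    $master = Get-VM -Name ("LCMASTER-" + $IA.ClassCode + "-" + $gm.osname)
    $snap = Get-Snapshot -VM $master | Select-Object -First 1
    for ($pod = 1; $pod -le $IA.PodCount; $pod++) {
        $name = "{0}-{1}-{2}-pod{3}" -f $IA.ClassCode, $IA.Semester, $gm.osname, $pod
        New-VM -Name $name -VM $master -LinkedClone -ReferenceSnapshot $snap `
            -ResourcePool $IA.ResourcePool -Confirm:$false
    }
}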

After the VMs have been created, we can start dealing with the network portion. I used to spin up all of the VMs, then loop through them and pull the MAC addresses to use with the DHCP config, but there were problems with that method. I found that I often need to rerun this script a few times per class, either because I've screwed something up or because the instructor needs to make changes to the pod. When that happened, EACH TIME, I had to re-generate the DHCP config (which is easy) and then manually insert it into PFsense (which is super-annoying).

Rather than do that every time, I eventually realized that it's much easier just to dictate what the MAC address for each machine is; then it doesn't matter how often I rerun the script, because the DHCP config doesn't change. (And yes, I'm using DHCP, but with static leases, which is necessary because of the port forwarding.)

Here's what I do:

Add-Member -InputObject $IA -MemberType ScriptMethod -Name assignMACs -Value {
    Param()
    Process {
        $StaticPrefix = "00:50:56"
        if ( $IA.MACPrefix -eq "" ) {
            # Since there isn't already a prefix set, it's cool to make one randomly.
            # -Maximum is exclusive, so this picks the fourth octet from 00-3F,
            # keeping us inside the range vSphere allows for static MACs.
            $IA.MACPrefix = $StaticPrefix + ":" + ("{0:X2}" -f (Get-Random -Minimum 0 -Maximum 64) )
        }
        $machineCount = 0
        $IA.VMarray | ForEach-Object {
            # The last two octets just increment: 00:00, 00:01, 00:02, ...
            $machineAddr = $IA.MACPrefix + ":" + ("{0:X4}" -f $machineCount).Insert(2,":")

            $vm = Get-VM -Name $_.name
            $networkAdapter = Get-NetworkAdapter -VM $vm
            Write-Host "Setting $vm to $machineAddr"
            Set-NetworkAdapter -NetworkAdapter $networkAdapter -MacAddress $machineAddr -Confirm:$false
            $IA.VMarray[$machineCount].MAC = $machineAddr
            $IA.VMarray[$machineCount].index = $machineCount
            $machineCount++
        }
    }
}

As you can see, this randomly assigns a MAC address in the vSphere static range. Sort of. The fourth octet is randomly selected between 00 and 3F, and then the last two octets are incremented starting from 00:00, so you end up with addresses like 00:50:56:2A:00:00, 00:50:56:2A:00:01, and so on. Optionally, the fourth octet can be specified, which is useful on a re-run of the script so that the DHCP config doesn't need to be re-generated.

After the MAC addresses are assigned, the IPs can be determined using the network math:


Add-Member -InputObject $IA -MemberType ScriptMethod -Name assignIPs -Value {
    # This method really only assigns the IP to the object.
    Param()
    Process {
        # It was tempting to assign a sane IP block to this network, but given the
        # tendency to shove God-only-knows how many people into a class at a time,
        # let's not be bounded by reasonable or sane. /16 it is.
        # First 50 IPs are reserved for gateway plus potential gold images.
        $currentIP = Get-NextIP $IA.IPBlock 2
        $IA.VMarray | ForEach-Object {
            $_.IPAddr = $currentIP
            $currentIP = Get-NextIP $currentIP 2
        }
    }
}

This is done by naively giving every other IP to a machine, leaving the odd IP addresses between them open. I've had to massage this before, when a large pod of 5-6 VMs all needed incremental IPs with a skip between pods, but I've handled those mostly as one-offs. I don't think I need to build in a lot of flexibility, because those are relatively rare cases, but it wouldn't be that hard to develop a scheme for it if you needed to.

After the IPs are assigned, you can create the DHCP config. Right now, I'm using an ugly hack: I basically just print out the top of the DHCP config, then loop through the VMs, outputting XML the whole way. It's ugly, and I'm not going to paste the whole thing here, but if you download a DHCPD XML file from PFsense, you can basically see what I'm doing. I then do the same thing with the NAT config.
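To give you the flavor, the heart of that loop is basically this - element names cribbed from a PFsense config export, so treat it as a sketch rather than the actual script:

$IA.VMarray | ForEach-Object {
@"
<staticmap>
    <mac>$($_.MAC)</mac>
    <ipaddr>$($_.IPAddr)</ipaddr>
    <hostname>$($_.name)</hostname>
</staticmap>
"@
}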

Because I'm still running these functions manually, I have these XML-creation methods printing output, but it's easy to see how you could have them redirect output to a text file (and if you were super-cool, you could use something like this example from MSDN, where you spin up an instance of IE):


$ie = new-object -com "InternetExplorer.Application"
$ie.navigate("http://localhost/MiniCalc/Default.aspx")
... and so on

Anyway, I've spun up probably thousands of VMs using this script (or previous incarnations of it). It's saved me a lot of time, and if you have to manage VMs in bulk using vSphere and you're not automating it (using PowerCLI, or vCloud Director, or something else), you really should be. And if you DO automate, what do you do? Comment below and let me know!

Thanks for reading all the way through!

Concerning PICC

Date October 8, 2014

Today, Wednesday, October 8, 2014, we, Matt Simmons and Thomas Limoncelli, resigned from the board of Professional IT Community Conferences, Inc., also known as “PICC”. PICC is the New Jersey non-profit business entity that has backed LOPSA-East and Cascadia since 2011. Those two conferences should be unaffected, as it was already agreed that they would find new organization(s) to work with for their 2015 conferences.


As of June 10, 2014, PICC, Inc. had voted to dissolve and was in the process of doing so. However, we feel this process has become impossible due to the remaining board member’s foot-dragging and at times outright deceptive actions. We cannot be on the board of an organization that conducts business in that way. We feel that the community deserves better and should request transparency from PICC, Inc. during its dissolution process.


We look forward to the future success of the organizations and events with which PICC has been affiliated.


Nagios Config Howto Followup

Date September 16, 2014

One of the most widely read stories I've ever posted on this blog is my Nagios Configuration HowTo, where I explained how I set up my Nagios config at a former employer. I still think that it's a good layout to use if you're manually building Nagios configs. In my current position, we have a small manual setup and a humongous automated monitoring setup. We're moving toward a completely automated monitoring config using Puppet and Nagios, but until everything is puppetized, some of it needs to be hand-crafted, bespoke monitoring.

For people who don't have a single 'source of truth' in their infrastructure that they can draw monitoring config from, hand-crafting is still the way to go, and if you're going to do it, you might as well not drive yourself insane. For that, you need to take advantage of the layers of abstraction in Nagios and the built-in object inheritance that it offers.

Every once in a while, new content gets posted that refers back to my Config HowTo, and I get a bump in visits, which is cool. Occasionally, I'll get someone who is interested and asks questions, which is what happened in this thread on Reddit. /u/sfrazer pointed to my config as something that he references when making Nagios configs (Thanks!), and the original submitter replied:

I've read that write up a couple of times. My configuration of Nagios doesn't have an objects, this is what it looks like


And to understand what you are saying, just by putting them in the file structure you have in your HowTo that will create an inheritance?

I wanted to help him understand how Nagios inheritance works, so I wrote a relatively long response, and I thought that it might also help other people who still need to do this kind of thing:



No, the directories are just to help remember what is what, and so you don't have a single directory with hundreds of files.

What creates the inheritance is this:

You start out with a host template:

msimmons@nagios:/usr/local/nagios/etc/objects$ cat generic-host.cfg
define host {
    name generic-host
    notifications_enabled   1
    event_handler_enabled   1
    flap_detection_enabled  1
    failure_prediction_enabled  1
    process_perf_data   1
    retain_status_information   1
    retain_nonstatus_information    1
    max_check_attempts 3
    notification_period 24x7
    contact_groups systems
    check_command check-host-alive
    register 0
}
# EOF

So, what you can see there is that I have a host named "generic-host" with a bunch of settings, and "register 0". The reason I have this is that I don't want to have to set all of those settings for every other host I make. That's WAY too much redundancy. Those settings will almost never change (and if we do have a specific host that needs to have the setting changed, we can do it on that host).

Once we have generic-host, let's make a 'generic-linux' host that the Linux machines can use:

msimmons@monitoring:/usr/local/nagios/etc/objects/linux$ cat generic-linux.cfg 
define host { 
    name     linux-server
    use generic-host
    check_period    24x7
    check_interval  5
    retry_interval  1
    max_check_attempts  5
    check_command   check-host-alive
    notification_interval 1440
    contact_groups  systems
    hostgroups linux-servers
    register 0
}

define hostgroup {
    hostgroup_name linux-servers
    alias Linux Servers
}
# EOF

Alright, so you see we have two things there. A host template, named 'linux-server', and you can see that it inherits from 'generic-host'. I then set some settings specific to my environment (for instance, you probably don't want notification_interval 1440, because that's WAY too long for most people - a whole day would go by between Nagios notifications!). The point is that I set a bunch of default host settings in 'generic-host', then did more specific things in 'linux-server', which inherited the settings from 'generic-host'. And we made it 'register 0', which means it's not a "real" host, it's a template. Also, and this is important, you'll see that we set 'hostgroups linux-servers'. This means that any host we make that inherits from 'linux-server' will automatically be added to the 'linux-servers' hostgroup.

Right below that, we create the linux-servers hostgroup. We aren't listing any machines. We're creating an empty hostgroup, because remember, everything that inherits from linux-servers will automatically become a member of this group.

Alright, you'll notice that we don't have any "real" hosts yet. We're not going to add any yet, either. Let's do some services first.

msimmons@monitoring:/usr/local/nagios/etc/objects$ cat check-ssh.cfg
define command{
   command_name   check_ssh
   command_line   $USER1$/check_ssh $ARG1$ $HOSTADDRESS$
   }
# EOF

This is a short file which creates a command called "check_ssh". This isn't specific to Linux or anything else. It could be used by anything that needed to verify that SSH was running. Now, lets build a service that uses it:

msimmons@monitoring:/usr/local/nagios/etc/objects/services$ cat generic-service.cfg
define service{
        name                            generic-service
        active_checks_enabled           1
        passive_checks_enabled          1
        parallelize_check               1
        obsess_over_service             1
        check_freshness                 0
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          1
        failure_prediction_enabled      1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           10
        retry_check_interval            2
        contact_groups                  systems
        notification_options            w,u,c,r
        notification_interval           1440
        notification_period             24x7
        register                        0
}
# EOF

This is just a generic service template with sane settings for my environment. Again, you'll want to use something good for yours. Now, something that will inherit from generic-service:

msimmons@monitoring:/usr/local/nagios/etc/objects/linux$ cat linux-ssh.cfg
define service { 
    use generic-service
    service_description Linux SSH Enabled
    hostgroup_name linux-servers
    check_command check_ssh 
}
# EOF

Now we have a service "Linux SSH Enabled". This uses check_ssh, and (importantly), 'hostgroup_name linux-servers' means "Every machine that is a member of the hostgroup 'linux-servers' automatically gets this service check".

Let's do the same thing with ping:

define command{
        command_name    check_ping
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5
}

define service {
    use generic-service
    service_description Linux Ping
    hostgroup_name linux-servers
    check_command check_ping!3000.0,80%!5000.0,100%
}

Sweet. (If you're wondering about the exclamation marks on the check_ping line in the Linux Ping service, we're sending those arguments to the command, which you can see set the warning and critical thresholds).

Now, let's add our first host:

msimmons@monitoring:/usr/local/nagios/etc/objects/linux$ cat mylinuxserver.mycompany.com.cfg 
define host{
       use linux-server
       host_name myLinuxServer.mycompany.com
       address my.ip.address.here
}

That's it! I set the host name, I set the IP address, and I say "use linux-server" so that it automatically gets all of the "linux-server" settings, including belonging to the linux host group, which makes sure that it automatically gets assigned all of the Linux service checks. Ta-Da!
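One habit worth picking up: before you restart Nagios after adding a host, run the built-in config check so a typo doesn't take your monitoring down. The path here assumes the same source-install layout as the examples above:

msimmons@monitoring:~$ /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

If it reports errors, fix them before restarting; Nagios won't start with a broken config.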

Hopefully this can help people see the value in arranging configs like this. If you have any questions, please let me know via comments. I'll be happy to explain! Thanks!



Mount NFS share to multiple hosts in vSphere 5.5

Date September 14, 2014

One of the annoying parts of making sure that you can successfully migrate virtual resources across a vSphere datacenter is ensuring that networks and datastores are not only available everywhere, but also named identically.

I inherited a system that was pretty much manually administered, without scripts. I've built a small PowerShell script to make sure that vSwitches can be provisioned identically when spinning up a new vHost, and there's really no excuse for not doing the same for storage, except that not all of my hosts should have the same NFS datastores as every other host. I could do some kind of complicated menu system or long command-line options, but that's hardly better than doing it individually.
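For what it's worth, this is also scriptable: PowerCLI's New-Datastore can mount an NFS export across a set of hosts. A minimal sketch, with hypothetical cluster, datastore, and filer names:

foreach ($vmhost in (Get-Cluster "IA Program" | Get-VMHost)) {
    # Skip hosts that already have the datastore mounted
    if (-not (Get-Datastore -VMHost $vmhost -Name "nfs-datastore01" -ErrorAction SilentlyContinue)) {
        New-Datastore -Nfs -VMHost $vmhost -Name "nfs-datastore01" `
            -NfsHost "filer.example.com" -Path "/vol/vmware/nfs_datastore01"
    }
}

For a one-off, though, the web interface trick below works fine.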

Tonight, I learned about a really nice feature of the vSphere 5.5 web interface (which I am generally not fond of) - the ability to take a specific datastore and mount it on multiple hosts. (Thanks to Genroo on #vmware on Freenode for letting me know that it was a thing).


vmware-nfs-add-host-1
Log into the vSphere web interface and select "vCenter"

vmware-nfs-add-host-2


Select Datastores

vmware-nfs-add-host-3

Right click on the datastore you want to mount on other hosts. Select "All vCenter Actions", then "Mount Datastore to Additional Host".
vmware-nfs-add-host-4



Pick the hosts you want to mount the datastore to.

vmware-nfs-add-host-5

The mount attempt will show up in the task list. Make sure that the hosts you select have access to NFS mount the datastore, otherwise it will fail. (You can see the X here - my failed attempt to rename a datastore after I created it using the wrong network for the filer. I'll clean that up shortly.)

Anyway, hopefully this helps someone else in the future.