VM Creation Day - PowerShell and VMware Automation

Date October 17, 2014

I should have ordered balloons and streamers, because Monday was VM creation day on my VMware cluster.

In addition to a 3-node production-licensed vSphere cluster, I run a 10-node cluster specifically for academic purposes. One of those purposes is building and maintaining classroom environments. A lot of professors maintain a server or two for their courses, but our Information Assurance program here goes above and beyond in terms of VM utilization. Every semester, I've got to deal with the added load, so I figured if I'm going to document it, I might as well get a blog entry while I'm at it.

Conceptually, the purpose of this process is to allow an instructor to create a set of virtual machines (typically between 1 and 4 of them), collectively referred to as a 'pod', which will serve as a lab for students. Once this set of VMs is configured exactly as the professor wants, and they have signed off on them, those VMs become the 'Gold Images', and then each student gets their own instance of these VMs. A class can have between 10 and 70 students, so this quickly becomes a real headache to deal with, hence the automation.

Additionally, because these classes are Information Assurance courses, it's not uncommon for the VMs to be configured in an insecure manner (on purpose), to be attacked by other VMs, and to generally behave in a manner unbecoming a good network denizen. Each class is therefore cordoned off onto its own VLAN, with its own pfSense box guarding the entryway and doing NAT for the several hundred VMs behind the wall. The script needs to automate the creation of the relevant pfSense configs, too, so that comes at the end.

I've written a relatively involved PowerShell script to do my dirty work for me, but it's still a long series of things to go from zero to working classroom environment. I figured I would spend a little time to talk about what I do to make this happen. I'm not saying it's the best solution, but it's the one I use, and it works for me. I'm interested in hearing if you've got a similar solution going on. Make sure to comment and let everyone know what you're using for these kinds of things.

The process is mostly automated hard parts separated by manual staging, because I want to verify sanity at each step. This kind of thing happens infrequently enough that I'm not completely trusting of the process yet, mostly due to my own ignorance of all of the edge cases that can cause failures. To the right, you'll see a diagram of the process.

In the script, the first thing I do is include functions that I stole from an awesome post on Subnet Math with PowerShell from Indented!, a software blog by Chris Dent. Because I'm going to be dealing with the DHCP config, it'll be very helpful to be able to have functions that understand what subnet boundaries are, and how to properly increment IP addresses.
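
To make the subnet math concrete, here's a minimal sketch of the kind of increment helper I'm talking about, matching the Get-NextIP calls you'll see further down. This is my own illustration, not Chris Dent's code (go read his post for the real thing):

function Get-NextIP {
    # Sketch of an IP-increment helper. The trick is doing the arithmetic on
    # the 32-bit integer form of the address.
    param(
        [string]$IPAddress,     # starting address, e.g. "10.0.1.0"
        [uint32]$Increment = 1  # how many addresses to advance
    )
    $bytes = ([System.Net.IPAddress]::Parse($IPAddress)).GetAddressBytes()
    [Array]::Reverse($bytes)    # network byte order -> little-endian for BitConverter
    $asInt = [BitConverter]::ToUInt32($bytes, 0)
    $newBytes = [BitConverter]::GetBytes([uint32]($asInt + $Increment))
    [Array]::Reverse($newBytes) # back to network byte order
    $newBytes -join '.'         # e.g. Get-NextIP "10.0.1.0" 2 returns "10.0.1.2"
}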

I need to make sure that, if this PowerShell script is running, the VMware PowerCLI cmdlets are actually loaded. We can do that like this:


if ( (Get-PSSnapin -Name VMware.VimAutomation.Core -ErrorAction SilentlyContinue) -eq $null ) {
    Add-PSSnapin VMware.VimAutomation.Core
}

For the class itself, this whole process consists of functions to do what needs to be done (or "do the needful", if you use that particular phrase). It's fairly linear, and each step requires the prior one to be completed. What I've done is create an object that represents the course as a whole, and then add the appropriate properties and methods. I don't actually need much of the power of OOP, but it provides a convenient way to keep everything together. Here's an example of the initial class setup:


$IA = New-Object psobject

# Lets add some initial values
Add-Member -InputObject $IA -MemberType NoteProperty -Name ClassCode -Value ""
Add-Member -InputObject $IA -MemberType NoteProperty -Name Semester -Value ""
Add-Member -InputObject $IA -MemberType NoteProperty -Name Datastore -Value "FASTDATASTORENAME"
Add-Member -InputObject $IA -MemberType NoteProperty -Name Cluster -Value "IA Program"
Add-Member -InputObject $IA -MemberType NoteProperty -Name VIServer -Value "VSPHERE-SERVER"
Add-Member -InputObject $IA -MemberType NoteProperty -Name IPBlock -Value "10.0.1.0"
Add-Member -InputObject $IA -MemberType NoteProperty -Name SubnetMask -Value "255.255.0.0"
Add-Member -InputObject $IA -MemberType NoteProperty -Name Connected -Value $false
Add-Member -InputObject $IA -MemberType NoteProperty -Name ResourcePool -Value ""
Add-Member -InputObject $IA -MemberType NoteProperty -Name PodCount -Value ""
Add-Member -InputObject $IA -MemberType NoteProperty -Name GoldMasters -Value ""
Add-Member -InputObject $IA -MemberType NoteProperty -Name Folder -Value ""
Add-Member -InputObject $IA -MemberType NoteProperty -Name MACPrefix -Value ""
Add-Member -InputObject $IA -MemberType NoteProperty -Name ConfigDir -Value ""
Add-Member -InputObject $IA -MemberType NoteProperty -Name VMarray -Value @()

These are just the values that almost never change. Since we're using NAT, we're not routing to that network, and every class has its own dedicated VLAN, we can use the same IP block every time without running into a problem. The blank values are there just as placeholders, and they will be filled in as the class methods are invoked.
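
Most of the moving parts are then attached as ScriptMethods in the same way. I won't reproduce them all, but to show the pattern, here's roughly what a connect method looks like (a simplified sketch, not the script verbatim):

# Simplified sketch of a connect method; the real one has more error handling.
Add-Member -InputObject $IA -MemberType ScriptMethod -Name connect -Value {
    Param()
    Process {
        if ( -not $IA.Connected ) {
            Connect-VIServer -Server $IA.VIServer | Out-Null
            $IA.Connected = $true
        }
    }
}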

At the bottom of the script, which is where I spend most of my time, I set per-class settings:


$IA.ClassCode = "ia1234"
$IA.Semester = "Fall-2014"
$IA.PodCount = 35
$IA.GoldMasters = @(
    @{
        vmname = "ia1234-win7-gold-20141014"
        osname = "win7"
        tcp    = 3389
        udp    = ""
    },
    @{
        vmname = "ia1234-centos-gold-20141014"
        osname = "centos"
        tcp    = ""
        udp    = ""
    },
    @{
        vmname = "ia1234-kali-gold-20141014"
        osname = "kali"
        tcp    = "22"
        udp    = ""
    }
)

We set the class code, semester, and pod count simply. These will be used to create the VM names, the folders, and the resource groups that the VMs live in. The GoldMasters array is a data structure with an entry for each of the gold images that the professor has created. Each entry contains the name of the gold image, a short code that will be used to name the VM instances coming from it, and placeholders for the TCP and UDP ports which need to be forwarded from the outside to allow internal access. I don't currently have code in place to specify multiple port forwards, but that's going to be added, because I had a professor request 7(!) forwarded ports per VM in one of their classes this semester.

As you can see in the diagram, I'm using Linked Clones to spin up the students' pods. This has the advantage of saving disk space and of completing quickly. A linked clone operates on a snapshot of a parent disk image. Rather than have the student VMs operate directly on the gold images, I first do a full clone of each gold VM over to a datastore faster than the Ol' Reliable NetApp, and that copy becomes the linked clone master.

We add a method to the $IA object like this:


Add-Member -InputObject $IA -MemberType ScriptMethod -Name createLCMASTERs -Value {
    # This is the code that converts the gold images into LCMASTERs.
    # Because you need to put a template somewhere, it makes sense to put it
    # into the folder that the VMs will eventually live in themselves (thus saving
    # yourself the effort of locating the right folder twice).
    Param()
    Process {
        ... stuff goes here
    }
}

The core of this method is the following block, which actually performs the clone:


if ( ! (Get-VM -Name $LCMASTERName -ErrorAction SilentlyContinue) ) {
    try {
        $presnap = New-Snapshot -Name ("Autosnap: " + $(Get-Date).toString("yyyyMMdd")) -VM $GoldVM -Confirm:$false

        $cloneSpec = New-Object VMware.Vim.VirtualMachineCloneSpec
        $cloneSpec.Location = New-Object VMware.Vim.VirtualMachineRelocateSpec
        $cloneSpec.Location.Pool = ($IA.ResourcePool | Get-View).MoRef
        $cloneSpec.Location.Host = ($GoldVM | Get-VMHost | Get-View).MoRef
        $cloneSpec.Location.Datastore = ($IA.Datastore | Get-View).MoRef
        $cloneSpec.Location.DiskMoveType = [VMware.Vim.VirtualMachineRelocateDiskMoveOptions]::createNewChildDiskBacking
        $cloneSpec.Snapshot = ($GoldVM | Get-View).Snapshot.CurrentSnapshot
        $cloneSpec.PowerOn = $false

        ($GoldVM | Get-View).CloneVM( $LCMasterFolder.MoRef, $LCMASTERName, $cloneSpec )

        Remove-Snapshot -Snapshot $presnap -Confirm:$false
    }
    catch [Exception] {
        Write-Host "Error: " $_.Exception.Message
        exit
    }
} else {
    Write-Host "Template found with name $LCMASTERName - not recreating"
}

If you're interested in doing this kind of thing, make sure you check out the docs for the createNewChildDiskBacking setting.

After the Linked Clone Masters have been created, it's a simple matter of creating the VMs from each of them (using the $IA.PodCount value to figure out how many we need). They end up named something like $IA.ClassCode-$IA.Semester-$IA.GoldMasters[#].osname-pod$podcount, which makes it easy to figure out what goes where when I have several classes running at once.
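
I won't paste the whole method, but the loop is conceptually straightforward. Here's a rough sketch (using PowerCLI's New-VM linked-clone parameters for brevity, with a hypothetical LC master naming scheme, so treat it as illustrative rather than verbatim):

# Illustrative sketch of the pod spin-up loop, not the script verbatim.
# New-VM's -LinkedClone parameter needs a snapshot on the LC master to reference.
foreach ($gold in $IA.GoldMasters) {
    $master = Get-VM -Name ($IA.ClassCode + "-LCMASTER-" + $gold.osname)  # hypothetical name
    $snap = Get-Snapshot -VM $master | Select-Object -First 1
    for ($pod = 1; $pod -le $IA.PodCount; $pod++) {
        $vmName = "{0}-{1}-{2}-pod{3}" -f $IA.ClassCode, $IA.Semester, $gold.osname, $pod
        New-VM -Name $vmName -VM $master -LinkedClone -ReferenceSnapshot $snap `
               -ResourcePool $IA.ResourcePool -Location $IA.Folder | Out-Null
        $IA.VMarray += @{ name = $vmName; MAC = ""; IPAddr = "" }
    }
}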

After the VMs have been created, we can start dealing with the network portion. I used to spin up all of the VMs, then loop through them and pull the MAC addresses to use with the DHCP config, but there were problems with that method. I found that, a lot of the time, I need to rerun this script a few times per class, either because I've screwed something up or because the instructor needs to make changes to the pod. When that happened, EACH TIME I had to re-generate the DHCP config (which is easy) and then manually insert it into pfSense (which is super-annoying).

Rather than do that every time, I eventually realized that it's much easier to just dictate what the MAC address of each machine is; then, no matter how often I rerun the script, the DHCP config doesn't change. (And yes, I'm using DHCP, but with static leases, which is necessary because of the port forwarding.)

Here's what I do:

Add-Member -InputObject $IA -MemberType ScriptMethod -Name assignMACs -Value {
    Param()
    Process {
        $StaticPrefix = "00:50:56"
        if ( $IA.MACPrefix -eq "" ) {
            # Since there isn't already a prefix set, it's cool to make one randomly.
            # The fourth octet can be 00-3F (Get-Random's -Maximum is exclusive).
            $IA.MACPrefix = $StaticPrefix + ":" + ("{0:X2}" -f (Get-Random -Minimum 0 -Maximum 64) )
        }
        $machineCount = 0
        $IA.VMarray | ForEach-Object {
            $machineAddr = $IA.MACPrefix + ":" + ("{0:X4}" -f $machineCount).Insert(2,":")

            $vm = Get-VM -Name $_.name
            $networkAdapter = Get-NetworkAdapter -VM $vm
            Write-Host "Setting $vm to $machineAddr"
            Set-NetworkAdapter -NetworkAdapter $networkAdapter -MacAddress $machineAddr -Confirm:$false
            $IA.VMarray[$machineCount].MAC = $machineAddr
            $IA.VMarray[$machineCount].index = $machineCount
            $machineCount++
        }
    }
}

As you can see, this randomly assigns a MAC address in the vSphere static range (00:50:56:00:00:00 through 00:50:56:3F:FF:FF). Sort of. The fourth octet is randomly selected between 00 and 3F, and then the last two octets are incremented starting from 00:00. Optionally, the fourth octet can be specified ahead of time, which is useful on a re-run of the script so that the DHCP config doesn't need to be re-generated.
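
If the string formatting in there looks opaque, here's exactly what it does to the machine counter:

# The counter is rendered as four hex digits, then split into two octets:
("{0:X4}" -f 300).Insert(2,":")   # yields "01:2C"
# so machine #300 with a prefix of 00:50:56:2A gets MAC 00:50:56:2A:01:2C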

After the MAC addresses are assigned, the IPs can be determined using the network math:


Add-Member -InputObject $IA -MemberType ScriptMethod -Name assignIPs -Value {
    # This method really only assigns the IP to the object.
    Param()
    Process {
        # It was tempting to assign a sane IP block to this network, but given the
        # tendency to shove God-only-knows how many people into a class at a time,
        # let's not be bounded by reasonable or sane. /16 it is.
        # First 50 IPs are reserved for gateway plus potential gold images.
        $currentIP = Get-NextIP $IA.IPBlock 2
        $IA.VMarray | ForEach-Object {
            $_.IPAddr = $currentIP
            $currentIP = Get-NextIP $currentIP 2
        }
    }
}

This naively gives every other IP to a machine, leaving the odd addresses between them open. I've had to massage this before, where a large pod of 5-6 VMs all needed sequential IPs with a skip before the next pod, but I've handled those mostly as one-offs. I don't think I need to build in a lot of flexibility, because those are relatively rare cases, but it wouldn't be that hard to develop a scheme for it if you needed one.

After the IPs are assigned, you can create the DHCP config. Right now, I'm using an ugly hack, where I basically just print out the top of the DHCP config, then loop through the VMs, outputting XML the whole way. It's ugly, and I'm not going to paste it here, but if you download a DHCPD XML file from pfSense, you can basically see what I'm doing. I then do the same thing with the NAT config.
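
For a sense of the shape of it, the static-lease portion looks roughly like this. The element names are what I see in my pfSense config export, so verify them against your own dump before trusting mine, and the method name here is made up for illustration:

# Rough sketch of the DHCP static-lease output. Element names come from a
# pfSense config export; check yours before reusing. Method name is illustrative.
Add-Member -InputObject $IA -MemberType ScriptMethod -Name printDHCPStaticMaps -Value {
    Param()
    Process {
        $IA.VMarray | ForEach-Object {
            "<staticmap>"
            "  <mac>$($_.MAC)</mac>"
            "  <ipaddr>$($_.IPAddr)</ipaddr>"
            "  <hostname>$($_.name)</hostname>"
            "</staticmap>"
        }
    }
}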

Because I'm still running these functions manually, I have these XML-creation methods printing output, but it's easy to see how you could have them redirect output to a text file instead. And if you were super-cool, you could use something like this example from MSDN, where you spin up an instance of IE:


$ie = new-object -com "InternetExplorer.Application"
$ie.navigate("http://localhost/MiniCalc/Default.aspx")
... and so on

Anyway, I've spun up probably thousands of VMs using this script (or previous incarnations of it). It's saved me a lot of time, and if you have to manage VMs in bulk using vSphere and you're not automating it (using PowerCLI, or vCloud Director, or something else), you really should be. And if you DO, what do you do? Comment below and let me know!

Thanks for reading all the way through!

Concerning PICC

Date October 8, 2014

Today, Wednesday, October 8, 2014, we, Matt Simmons and Thomas Limoncelli, resigned from the board of Professional IT Community Conferences, Inc., also known as “PICC”. PICC is the New Jersey non-profit business entity that has backed LOPSA-East and Cascadia since 2011. Those two conferences should be unaffected, as it was already agreed that they would find new organization(s) to work with for their 2015 conferences.


As of June 10, 2014, PICC, Inc. had voted to dissolve and was in the process of doing so. However, we feel this process has become impossible due to the remaining board member's foot-dragging and at times outright deceptive actions. We cannot be on the board of an organization that conducts business in that way. We feel that the community deserves better and should request transparency from PICC, Inc. during its dissolution process.


We look forward to the future success of the organizations and events with which PICC has been affiliated.


Nagios Config Howto Followup

Date September 16, 2014

One of the most widely read stories I've ever posted on this blog is my Nagios Configuration HowTo, where I explained how I set up my Nagios config at a former employer. I still think that it's a good layout to use if you're manually building Nagios configs. In my current position, we have a small manual setup and a humongous automated monitoring setup. We're moving toward a completely automated monitoring config using Puppet and Nagios, but until everything is puppetized, some of it needs to be hand-crafted, bespoke monitoring.

For people who don't have a single 'source of truth' in their infrastructure that they can draw monitoring config from, hand-crafting is still the way to go, and if you're going to do it, you might as well not drive yourself insane. For that, you need to take advantage of the layers of abstraction in Nagios and the built-in object inheritance that it offers.

Every once in a while, new content gets posted that refers back to my Config HowTo, and I get a bump in visits, which is cool. Occasionally, I'll get someone who is interested and asks questions, which is what happened in this thread on Reddit. /u/sfrazer pointed to my config as something that he references when making Nagios configs (Thanks!), and the original submitter replied:

I've read that write up a couple of times. My configuration of Nagios doesn't have an objects, this is what it looks like


And to understand what you are saying, just by putting them in the file structure you have in your HowTo that will create an inheritance?

I wanted to help him understand how Nagios inheritance works, so I wrote a relatively long response, and I thought that it might also help other people who still need to do this kind of thing:



No, the directories are just to help remember what is what, and so you don't have a single directory with hundreds of files.

What creates the inheritance is this:

You start out with a host template:

msimmons@nagios:/usr/local/nagios/etc/objects$ cat generic-host.cfg
define host {
    name generic-host
    notifications_enabled   1
    event_handler_enabled   1
    flap_detection_enabled  1
    failure_prediction_enabled  1
    process_perf_data   1
    retain_status_information   1
    retain_nonstatus_information    1
    max_check_attempts 3
    notification_period 24x7
    contact_groups systems
    check_command check-host-alive
    register 0
}
# EOF

So, what you can see there is that I have a host named "generic-host" with a bunch of settings, and "register 0". The reason I have this is that I don't want to have to set all of those settings for every other host I make. That's WAY too much redundancy. Those settings will almost never change (and if we do have a specific host that needs to have the setting changed, we can do it on that host).

Once we have generic-host, let's make a 'generic-linux' host for the linux machines to use:

msimmons@monitoring:/usr/local/nagios/etc/objects/linux$ cat generic-linux.cfg 
define host { 
    name     linux-server
    use generic-host
    check_period    24x7
    check_interval  5
    retry_interval  1
    max_check_attempts  5
    check_command   check-host-alive
    notification_interval 1440
    contact_groups  systems
    hostgroups linux-servers
    register 0
}

define hostgroup {
    hostgroup_name linux-servers
    alias Linux Servers
}
# EOF

Alright, so you see we have two things there. A host, named 'linux-server', and you can see that it inherits from 'generic-host'. I then set some of the settings specific to the monitoring host that I'm using (for instance, you probably don't want notification_interval 1440, because that's WAY too long for most people - a whole day would go between Nagios notifications!). The point is that I set a bunch of default host settings in 'generic-host', then did more specific things in 'linux-server' which inherited the settings from 'generic-host'. And we made it 'register 0', which means it's not a "real" host, it's a template. Also, and this is important, you'll see that we set 'hostgroups linux-servers'. This means that any host we make that inherits from 'linux-server' will automatically be added to the 'linux-servers' hostgroup.

Right below that, we create the linux-servers hostgroup. We aren't listing any machines. We're creating an empty hostgroup, because remember, everything that inherits from linux-servers will automatically become a member of this group.

Alright, you'll notice that we don't have any "real" hosts yet. We're not going to make any yet, either. Let's do some services first.

msimmons@monitoring:/usr/local/nagios/etc/objects$ cat check-ssh.cfg
define command{
   command_name   check_ssh
   command_line   $USER1$/check_ssh $ARG1$ $HOSTADDRESS$
   }
# EOF

This is a short file which creates a command called "check_ssh". This isn't specific to Linux or anything else; it could be used by anything that needed to verify that SSH was running. Now, let's build a service that uses it:

msimmons@monitoring:/usr/local/nagios/etc/objects/services$ cat generic-service.cfg 
define service{
        name                            generic-service
        active_checks_enabled           1
        passive_checks_enabled          1
        parallelize_check               1
        obsess_over_service             1
        check_freshness                 0
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          1
        failure_prediction_enabled      1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           10
        retry_check_interval            2
        contact_groups                  systems
        notification_options            w,u,c,r
        notification_interval           1440
        notification_period             24x7
        register                        0
}
# EOF

This is just a generic service template with sane settings for my environment. Again, you'll want to use something good for yours. Now, something that will inherit from generic-service:

msimmons@monitoring:/usr/local/nagios/etc/objects/linux$ cat linux-ssh.cfg
define service { 
    use generic-service
    service_description Linux SSH Enabled
    hostgroup_name linux-servers
    check_command check_ssh 
}
# EOF

Now we have a service "Linux SSH Enabled". This uses check_ssh, and (importantly), 'hostgroup_name linux-servers' means "Every machine that is a member of the hostgroup 'linux-servers' automatically gets this service check".

Let's do the same thing with ping:

define command{
        command_name    check_ping
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5
}

define service {
    use generic-service
    service_description Linux Ping
    hostgroup_name linux-servers
    check_command check_ping!3000.0,80%!5000.0,100%
}

Sweet. (If you're wondering about the exclamation marks on the check_ping line in the Linux Ping service, we're sending those arguments to the command, where you can see they set the warning and critical thresholds.)

Now, let's add our first host:

msimmons@monitoring:/usr/local/nagios/etc/objects/linux$ cat mylinuxserver.mycompany.com.cfg 
define host{
       use linux-server
       host_name myLinuxServer.mycompany.com
       address my.ip.address.here
}

That's it! I set the host name, I set the IP address, and I say "use linux-server" so that it automatically gets all of the "linux-server" settings, including belonging to the linux host group, which makes sure that it automatically gets assigned all of the Linux service checks. Ta-Da!
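
And if one particular box needs special treatment, you just override the one setting in its host definition and keep inheriting the rest (illustrative names, obviously):

define host{
       use linux-server
       host_name myFlakyServer.mycompany.com
       address my.ip.address.here
       max_check_attempts 10    ; this one box needs extra retries before we get paged
}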

Hopefully this can help people see the value in arranging configs like this. If you have any questions, please let me know via comments. I'll be happy to explain! Thanks!


Mount NFS share to multiple hosts in vSphere 5.5

Date September 14, 2014

One of the annoying parts of making sure that you can successfully migrate virtual resources across a vSphere datacenter is ensuring that networks and datastores are not only available everywhere, but also named identically.

I inherited a system that was pretty much manually administered, without scripts. I've built a small PowerShell script to make sure that vSwitches can be provisioned identically when spinning up a new vHost, and there's really no excuse for not doing the same for storage, except that not all of my hosts should have all of the same NFS datastores that another host has. I could do some kind of complicated menu system or long command-line options, but that's hardly better than doing it individually.
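
For the simple case, where a datastore really does belong on every host, the PowerCLI version is only a few lines. Something like this sketch, with placeholder names, mounts one NFS export identically across a cluster:

# Hedged sketch with placeholder names: mount one NFS export, identically
# named, on every host in a cluster.
Get-Cluster -Name "MyCluster" | Get-VMHost | ForEach-Object {
    New-Datastore -VMHost $_ -Nfs -Name "nfs-datastore01" `
        -NfsHost "filer.example.com" -Path "/vol/nfs_datastore01"
}

But when only some hosts should get the datastore, you're back to picking hosts by hand, which is where tonight's discovery comes in.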

Tonight, I learned about a really nice feature of the vSphere 5.5 web interface (which I am generally not fond of) - the ability to take a specific datastore and mount it on multiple hosts. (Thanks to Genroo on #vmware on Freenode for letting me know that it was a thing).


Log into the vSphere web interface and select "vCenter".

Select Datastores.

Right-click on the datastore you want to mount on other hosts, select "All vCenter Actions", then "Mount Datastore to Additional Host".

Pick the hosts you want to mount the datastore to.

The mount attempt will show up in the task list. Make sure that the hosts you select have access to NFS mount the datastore, otherwise it will fail. (You can see the X here: my failed attempt to rename a datastore after I created it using the wrong network for the filer. I'll clean that up shortly.)

Anyway, hopefully this helps someone else in the future.

Impossible problems are the best

Date September 11, 2014

"I can't believe that!" said Alice.
"Can't you?" the Queen said in a pitying tone. "Try again: draw a long breath, and shut your eyes."
Alice laughed. "There's no use trying," she said: "one can't believe impossible things."
"I daresay you haven't had much practice," said the Queen. "When I was your age, I always did it for half-an-hour a day. Why, sometimes I've believed as many as six impossible things before breakfast."

Impossible problems are fun. It's nice to sometimes encounter the types of things that require a suspension of disbelief in order to deal with. I had a user give me one of these this morning, and I really enjoyed the mental cartwheels.

Imagine for a second, that we have the following situation:

jdoe@login:/course/cs101$ ls -al
total 52
drwxrwsr-x   4 msimmons cs101-staff  4096 Sep 11 09:51 ./
drwxr-sr-x 309 root     course      28672 Sep 11 09:59 ../
drwxrws---   7 msimmons cs101-staff  4096 Sep 19  2013 svnrepos/
drwxrwsr-x   4 msimmons cs101-staff  8192 Sep  9 22:21 .www/
jdoe@login:/course/cs101$

The issue reported was that the user, here played by 'Jane Doe' (username jdoe), can't check out from the svn repository in svnrepos there. She is a member of the group cs101-staff, as indicated by getent group cs101-staff, by running groups as her user on the machine, and by the ypgroup jdoe command on the NetApp. However, when trying to check out the repository, she gets a permission denied error on a file in svnrepos/, and initial investigation shows this:

jdoe@login:/course/cs101$ cd svnrepos
-bash: cd: svnrepos: Permission denied

You'll notice that the x in the group permissions is set to 's', which indicates that the setgid bit is set. This is a red herring; the problem happens regardless of the fiddling of this bit.

I'm not going to walk you through the hour of debugging that my coworkers and I performed, but I'm willing to bet you would probably have done similar, if not the same. Clearly, something was wrong. A user that should have been able to change directory was not able to. This is not new software; there's no bug in 'cd'. We are dealing with the most ancient part of the code, and as it turns out, that had something to do with it.

The key was discovered (and initially overlooked) while verifying that the user was a member of the group:


jdoe@login:/course/cs101$ groups
faculty 101prof cs101f14-prof cs201sp14-prof cs301su14 2101ta cs301f14-staff cs101sp14-prof cs101sp14-ta 101staff cs121f14 cs101f14-ta cs101sp14-staff cs121sp14 cs301f14-ta cs201sp14-staff cs221f14-ta cs101-staff

If you're going through looking at all of those groups thinking there are too many similarities, that's also a red herring (and although the course numbers have been changed, they really were this similar; the proper group IS in there, though).

If you're thinking, "wow, that's a lot of groups", then you're right. That IS a lot of groups: 18 of them. But Linux has no problem with up to 32 groups.

So what is the problem? Well, it's the group count. Even though it's OK for Linux, you might have caught that I said that the NetApp 'ypgroup' command showed her as a group member. That's because she absolutely is! When you query NIS (which is what the YP, formerly 'yellow pages', in ypgroup means), NIS says "yes, she is a member of that group". But the AUTH_SYS flavor of RPC that NFS uses only carries 16 group IDs with each request, so if the relevant group doesn't make the cut, the file server never sees it. In this well-written blog entry from 2005(!), Mike Eisler explains why NFS still so often has a 16-group limit for users.
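
If you want to check whether a user in your own shop is flirting with that limit, counting their groups is a one-liner:

$ id -Gn jdoe | wc -w
18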

Removing Jane Doe from a few groups made her instantly able to change directory into the subversion repo, and she was immediately able to complete her svn checkout.

That was a really fun little excursion into the realm of impossibility. Keep that one in mind if you're the sort of place where group memberships tend to accumulate and you use NFS.

Monitoring (old) Zimbra

Date September 6, 2014

It's September, and in universities, that means tons of new people. New staff, new faculty, new students. Lots and lots of new people.

Here at The College of Computer and Information Science at Northeastern University, we've got a banner crop of incoming CS students. So many, in fact, that we bumped up against one of those things that we don't think about a lot. Email licenses.

Every year, we pay for a lot of licenses. We've never monitored the number used versus the number bought, but we buy many thousands of seats. Well, we ran out last week. Oops.

After calling our reseller, who hooked us up with a temporary emergency bump, we made it through the day until we could buy more. I decided that it was time to start monitoring that sort of thing, so I started working on learning the Zimbra back-end.

Before you follow along with anything in this article, you should know: my version of Zimbra is old. Like, antique.

Zimbra was very cool about this and issued us some emergency licenses so that we could do what we needed until our new license block purchase went through. Thanks Zimbra!

In light of the whole "running out of licenses" surprise, I decided that the first thing I should start monitoring is license usage. In fact, I instrumented it so well that I can pinpoint the exact moment that we went over the number of emergency licenses we got:

[Graph: CCIS mail account count over time]

Cool, right?

Well, except for the whole "now we're out of licenses" again thing. Sigh.

I mentioned a while back that I was going to be concentrating on instrumenting my infrastructure this year, and although I got a late start, it's going reasonably well. In that blog entry, I linked to a GitHub repo where I built a Vagrant-based Graphite installation. I used that work as the basis for the work I did when creating a production Graphite installation, using the echocat graphite module.

After getting Graphite up and running, I started gathering metrics in an automated fashion from the rest of the puppetized infrastructure using the pdxcat CollectD puppet module, and I wrote a little bit about how similar that was with my Kerbal Space Administration blog entry.

But my Zimbra install is old. Really old, and the server it's on isn't puppetized, and I don't even want to think about compiling collectd on the version of Ubuntu this machine runs. So I was going to need something else.

As it turns out, I've been working in Python for a little while, and I'd written a relatively short program that serves both as a standalone command that can send a single metric to Carbon and as a library, if you need to send a lot of metrics at a time. I'm sure there are probably a dozen tools to do this, but it was relatively easy, so I just figured I'd make my own. You can check it out on GitHub if you're interested.

So that's the script I'm using, but a script needs data. If you log in to the Zimbra admin interface (which I try not to do, because it requires Firefox in the old version we're using), you can actually see most of the stats you're interested in. It's possible to scrape that page and get the information, but it's much nicer to get to the source data itself. Fortunately, Zimbra makes that (relatively) easy:

In the Zimbra home directory (/opt/zimbra in my case), there is a "zmstat/" subdirectory, and in there you'll find a BUNCH of directories with dates as names, plus some CSV files:


... snip ...
drwxr-x--- 2 zimbra zimbra 4096 2014-09-04 00:00 2014-09-03/
drwxr-x--- 2 zimbra zimbra 4096 2014-09-05 00:00 2014-09-04/
drwxr-x--- 2 zimbra zimbra 4096 2014-09-06 00:00 2014-09-05/
-rw-r----- 1 zimbra zimbra 499471 2014-09-06 20:11 cpu.csv
-rw-r----- 1 zimbra zimbra 63018 2014-09-06 20:11 fd.csv
-rw-r----- 1 zimbra zimbra 726108 2014-09-06 20:12 imap.csv
-rw-r----- 1 zimbra zimbra 142226 2014-09-06 20:11 io.csv
-rw-r----- 1 zimbra zimbra 278966 2014-09-06 20:11 io-x.csv
-rw-r----- 1 zimbra zimbra 406240 2014-09-06 20:12 mailboxd.csv
-rw-r----- 1 zimbra zimbra 72780 2014-09-06 20:12 mtaqueue.csv
-rw-r----- 1 zimbra zimbra 2559697 2014-09-06 20:12 mysql.csv
drwxr-x--- 2 zimbra zimbra 4096 2014-06-15 22:13 pid/
-rw-r----- 1 zimbra zimbra 259389 2014-09-06 20:12 pop3.csv
-rw-r----- 1 zimbra zimbra 893333 2014-09-06 20:12 proc.csv
-rw-r----- 1 zimbra zimbra 291123 2014-09-06 20:12 soap.csv
-rw-r----- 1 zimbra zimbra 64545 2014-09-06 20:12 threads.csv
-rw-r----- 1 zimbra zimbra 691469 2014-09-06 20:11 vm.csv
-rw-r----- 1 zimbra zimbra 105 2014-09-06 19:08 zmstat.out
-rw-r----- 1 zimbra zimbra 151 2014-09-06 06:28 zmstat.out.1.gz
-rw-r----- 1 zimbra zimbra 89 2014-09-04 21:15 zmstat.out.2.gz
-rw-r----- 1 zimbra zimbra 98 2014-09-04 01:41 zmstat.out.3.gz

Each of those CSV files contains the information you want, in one of a couple of formats. Most are really easy.


sudo head mtaqueue.csv
Password:
timestamp, KBytes, requests
09/06/2014 00:00:00, 4215, 17
09/06/2014 00:00:30, 4257, 17
09/06/2014 00:01:00, 4254, 17
09/06/2014 00:01:30, 4210, 16
... snip ...

In this case, there are three columns: the timestamp, the number of kilobytes in queue, and the number of requests. Most of the CSV files have (many) more columns, but this works pretty simply. That file is updated every minute, so if you have a cronjob grab the last line of the file, parse it, and send it into Graphite, your work is basically done:


zimbra$ crontab -l
... snip ...
* * * * * /opt/zimbra/zimbra-stats/zimbraMTAqueue.py

And looking at that file, it's super-easy:


#!/usr/bin/python

import pyGraphite as graphite
import sys
import resource

CSV = open('/opt/zimbra/zmstat/mtaqueue.csv', "r")
lineList = CSV.readlines()
CSV.close()
GraphiteString = "MY.GRAPHITE.BASE"

rawLine = lineList[-1]
listVals = rawLine.split(',')

# strip() the values so a stray newline doesn't get sent into Carbon
values = {
	'kbytes': listVals[1].strip(),
	'items':  listVals[2].strip(),
	}

graphite.connect()

for value in values:

	graphite.sendData(GraphiteString + "." + value + " ", values[value])

graphite.disconnect()

So there you go. My Python isn't awesome, but it gets the job done. Any imports not used here are there because some of the other scripts needed them, and by the time I got to this one, I was just copying and pasting my code for the most part. #LazySysAdmin

The only CSV file that took me a while to figure out was imap.csv. The format of that one is more interesting:

msimmons@zimbra:/opt/zimbra/zmstat$ sudo head imap.csv
timestamp,command,exec_count,exec_ms_avg
09/06/2014 00:00:13,ID,11,0
09/06/2014 00:00:13,FETCH,2,0
09/06/2014 00:00:13,CAPABILITY,19,0
...snip...

So you get the timestamp, the IMAP command, the number of times that command is being executed, and how long, on average, it took, so you can watch latency. But the trick is that you only get one command per line, so the previous tactic of only grabbing the final line won't work. Instead, you have to grab the last line, figure out the timestamp, and then grab all of the lines that match the timestamp. Also, I've found that not all IMAP commands will show up every time, so make sure that your XFilesFactor is set right for the metrics you'll be dealing with.

The code is only a little more complicated, but still isn't too bad:

#!/usr/bin/python

import pyGraphite as graphite
import sys
import resource

imapCSV = open('/opt/zimbra/zmstat/imap.csv', "r")
lineList = imapCSV.readlines()
imapCSV.close()
GraphiteString = "MY.GRAPHITE.PATH"

class imapCommand:
	name = ""
	count = ""
	avgres = ""

	def __init__(self, name, count, avgres):
		self.name = name
		self.count = count
		self.avgres = avgres
	

IMAPcmds = list()

datestamp = lineList[-1].split(',')[0]

record = len(lineList)

while True:
	if ( lineList[record-1].split(',')[0] == datestamp ):
		CMD = lineList[record-1].split(',')[1]
		COUNT = lineList[record-1].split(',')[2]
		AVGRES = lineList[record-1].split(',')[3].strip()
		IMAPcmds.append(imapCommand(CMD, COUNT, AVGRES))
	else:
		break
	record = record - 1

graphite.connect()

for command in IMAPcmds:
	graphite.sendData(GraphiteString + "." + command.name + ".count ", command.count)
	graphite.sendData(GraphiteString + "." + command.name + ".avgres ", command.avgres)

graphite.disconnect()

You can read much more about all of the metrics in the online documents, Monitoring Zimbra.

Now, so far, this has been the runtime metrics, which is helpful, but doesn't actually give me account information. To get that, we're going to use some of the built-in Zimbra tools. zmaccts lists all accounts, and then prints a summary at the end. We can just grab the summary and learn the number of accounts. We can also use the zmlicense -p command to get the number of licensed accounts we have.

The shell script is pretty easy:

$ cat zimbra-stats/zimbraAccountStatuses.sh
#!/bin/bash

# Creates $GRAPHITESERVER and $GRAPHITEPORT
. /opt/zimbra/zimbra-stats/graphite.sh

OUTPUT="`/opt/zimbra/bin/zmaccts | tail -n 1`"

ACTIVE=`echo $OUTPUT | awk '{print $2}'`
CLOSED=`echo $OUTPUT | awk '{print $3}'`
LOCKED=`echo $OUTPUT | awk '{print $4}'`
MAINT=`echo $OUTPUT | awk '{print $5}'`
TOTAL=`echo $OUTPUT | awk '{print $6}'`
NEVERLOGGEDIN=`/opt/zimbra/bin/zmaccts | grep "never$" | wc -l`

MAX="`/opt/zimbra/bin/zmlicense -p | grep ^AccountsLimit= | cut -d \= -f 2`"

STATPATH="MY.GRAPHITE.PATH"

/opt/zimbra/zimbra-stats/pyGraphite.py ${STATPATH}.active ${ACTIVE} 
/opt/zimbra/zimbra-stats/pyGraphite.py ${STATPATH}.closed ${CLOSED}
/opt/zimbra/zimbra-stats/pyGraphite.py ${STATPATH}.locked ${LOCKED} 
/opt/zimbra/zimbra-stats/pyGraphite.py ${STATPATH}.maintenance ${MAINT} 
/opt/zimbra/zimbra-stats/pyGraphite.py ${STATPATH}.total ${TOTAL} 
/opt/zimbra/zimbra-stats/pyGraphite.py ${STATPATH}.neverloggedin ${NEVERLOGGEDIN} 
/opt/zimbra/zimbra-stats/pyGraphite.py ${STATPATH}.max ${MAX}  


Forgive all of the shortcuts taken in the above. Things aren't quoted when they should be and so on. Use at your own risk. Warranty void in Canada. Etc etc.

Overall, the goal is to get that additional transparency into the mail server. Even after we get the server upgraded and onto a modern OS, this kind of information will be a welcome addition.

Oh, and for the record?

$ find ./ -name "*wsp" | wc -l
8783

Over 8,500 metrics coming in. Sweet. Most of that is coming from collectd, but that's another blog entry...