Graphite might be the best thing I’ve rolled out here in my position at CCIS.
One of our graduate students has been working on a really interesting paper for a while. I can’t go into details, because he’s going to publish before too long, but he has been making good use of my network diagrams. Since he has a lot riding on the accuracy of the data, he’s been asking me very specific questions about how the data was obtained, and how the graphs are produced, and so on.
One of the questions he asked me had to do with a bandwidth graph, much like this one:
His question revolved around the actual amount of traffic each datapoint represented. I explained briefly that we were looking at Megabytes per second, and he asked for clarification – specifically, whether each point was the sum total of data sent per second between updates, or whether it was the average bandwidth used over the interval.
We did some calculations, and decided that if it were, in fact, the total number of bytes received since the previous data point, it would mean my network had basically no traffic, and I know that not to be the case. But still, these things need verified, so I dug in and re-determined the entire path that the metrics take.
These metrics are coming from a pair of Cisco Nexus Switches via SNMP. The data being pulled is a per-interface ifInOctets and ifOutOctets. As you can see from the linked pages, each of those are 32 bit counters, with “The total number of octets transmitted [in|out] of the interface, including framing characters”.
Practically speaking, what this gives you is an ever-increasing number. The idea behind this counter is that you query it, and receive a number of bytes (say, 100). This indicates that at the time you queried it, the interface has sent (in the case of ifOutOctets) 100 bytes. If you query it again ten seconds later, and you get 150, then you know that in the intervening ten seconds, the interface has sent 50 bytes, and since you queried it ten seconds apart, you determine that the interface has transmitted 5 bytes per second.
Having the counter work like this means that, in theory, you don’t have to worry about how frequently you query it. You could query it tomorrow, and if it went from 100 to 100000000, you could be able to figure out how many seconds it was since you asked before, divide the byte difference, and figure out the average bytes per second that way. Granted, the resolution on those stats isn’t stellar at that frequency, but it would still be a number.
Incidentally, you might wonder, “wait, didn’t you say it was 32 bits? That’s not huge. How big can it get?”. The answer is found in RFC 1155:
This application-wide type represents a non-negative integer which monotonically increases until it reaches a maximum value, when it wraps around and starts increasing again from zero. This memo specifies a maximum value of 2^32-1 (4294967295 decimal) for counters.
In other words, 4.29 gigabytes (or just over 34 gigabits). It turns out that this is actually kind of an important facet to the whole “monitoring bandwith” thing, because in our modern networks, switch interfaces are routinely 1Gb/s, often 10Gb/s, and sometimes even more. If our standard network interfaces can transfer one gigabits per second, then a fully utilized network interface can roll over an entire counter in 35 seconds. If we’re only querying that interface once a minute, then we’re potentially losing a lot of data. Consider, then, a 10Gb/s interface. Are you pulling metrics more often than once every 4 seconds? If not, you may be losing data.
Fortunately, there’s an easy fix. Instead of ifInOctets and ifOutOctets, query ifHCInOctets and ifHCOutOctets. They are 64 bit counters, and only roll over once every 18 exabytes. Even with a 100% utilized 100Gb/s interface, you’ll still only roll over a counter every 5.8 years or so.
I made this change to my collectd configuration as soon as I figured out what I was doing wrong, and fortunately, none of my metrics jumped, so I’m going to say I got lucky. Don’t be me – start out doing it the right way, and save yourself confusion and embarrassment later. Use 64-bit counters from the start!
(Also, there are the equivalent HC versions for all of the other interface counters you’re interested in, like the UCast, Multicast, and broadcast packet stats – make sure to use the 64-bit version of all of them).
Thanks, I hope I managed to help someone!