
My trouble with bonded interfaces

In an effort to improve the redundancy of our network, I have all of our blade servers configured with bonded network interfaces. Bonding the interfaces in Linux means that eth0 and eth1 form together like Voltron into bond0, an interface that can be “high availability”, meaning that if one physical port (or the device it is plugged into) dies, the other can take over.

Because I wanted to eliminate a single point of failure, I used two switches. The switches are tied together to make sure traffic on one switch hits the other if necessary.

Here is my problem, though: I have had an array of interesting traffic patterns from my hosts. Sometimes they’ll have occasional, intermittent loss of connectivity; sometimes they’ll have regular periods of no connectivity at all (both of which I’ve solved by changing the bonding method); and most recently, I’ve had the very irritating problem of a host connecting perfectly well to anything on the local subnet while remote traffic suffers heavy packet loss. To fix the problem, all I have to do is unplug one of the network cables.

I’ve got the machine set up in bonding mode 0. According to the kernel’s bonding documentation, mode 0 is:

Round-robin policy: Transmit packets in sequential
order from the first available slave through the
last. This mode provides load balancing and fault
tolerance.

It would at least be logical if I lost 50% of the packets: two interfaces, one malfunctioning, half the packets. But no, it’s more like 70% of the packets getting lost, and I haven’t managed to figure out why yet.
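For reference, this is roughly what the setup looks like on a RHEL-style box (the addresses are illustrative, and miimon=100 just tells the driver to check link state every 100ms):

```
# /etc/modprobe.conf
alias bond0 bonding
options bond0 mode=0 miimon=100

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=192.0.2.10
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth0 (and likewise for eth1)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
```

The interesting part for this problem is mode=0: with round-robin, successive packets go out alternating physical ports, which means alternating switches in my setup.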

If you check my Twitter feed for yesterday, you’ll see me whining about forgetting a jacket. This is because I was hanging out in the colocation facility running tests. ‘tcpdump’ shows that the packets are actually being sent, but only occasional responses are received, unless the other host is local, in which case everything is fine.

There are several hosts configured identically to this one; however, this is the only one displaying the issue. Normally I’d suspect the firewall, but there isn’t anything in its configuration that would single out this machine, and the ARP tables check out everywhere. I’m confused, but I haven’t given up yet. I’ll let you know if I figure it out, and in the meantime, if you’ve got suggestions, I’m open to them.

Pre-emptive Troubleshooting

Troubleshooting is, by its nature, a reactive process: you’re fixing an already-existing problem. As good as it is to be able to troubleshoot, it’s better to prevent weird problems from cropping up in the first place.

As sysadmins, we have numerous ways to do this. The first, as Michael Jenke very sensibly suggests, is to use structured systems management: administering via scripts instead of editing files by hand or, worse yet, clicking through the interface.

Another very potent tool, which Chris Siebenmann brings up today, is using checklists to perform complex tasks. (Incidentally, Chris mentions a term I’d never heard before: “rubber duck debugging”. I think I’m going to try to expense one of these for troubleshooting purposes.)

I’m a firm believer in checklists for anything that isn’t (or can’t be) automated. On our internal wiki, I have checklists for things like adding and removing users from the infrastructure, adding new machines, and so on. They’re great: I don’t have to “remember” everything that needs to be done, I can just work through the list, and it’s always accurate. And if I’ve left something off, I add it, and the list gets more accurate.

I wasn’t always such a checklist person. It took a while for their usefulness to sink in. Tom Limoncelli does a great job of explaining why in his blog post, Transforming an art into a science, where he describes how, back in the bad old days, planes kept crashing until pilots started using pre-takeoff checklists. Similarly, Boston.com has an article about doctors using checklists, a practice which resulted in 36% fewer complications and deaths in the operating room.

Checklists take complex, fun tasks and make them boring.

Bizarre issues almost always point to DNS problems

Duct tape. The Force. DNS.

These are the things that bind our world together. Sure, you can’t see the Force when you’re juggling rocks while standing on your head, just like you don’t pay attention to DNS 99% of the time you’re browsing the web, but that doesn’t mean it doesn’t affect everything you do.

Misconfigured DNS has caused more, and weirder, problems than any other single aspect of networking I’ve yet encountered. Sure, it causes plain vanilla connectivity issues when you can’t resolve something, but it gets much weirder than that.

Misconfigured DNS causes mail to break, Active Directory to stop authenticating (or even recognizing that domains exist), SSH sessions to time out instead of connecting, and a whole host of other problems.

I have even had it cause password issues: the DNS server I was using pointed the hostname at a different machine, one configured identically and with all the same identifiers, so when someone added my account on the machine she was talking to, I couldn’t get access. We fought with this for a few hours before I got desperate enough to check the IP addresses we were each connecting to.

This is just a friendly reminder that DNS is everywhere, and if you’re having a bizarre network issue, make sure DNS is somewhere early in your troubleshooting checklist.
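A quick first check along those lines is to compare what the resolver stack actually hands back for the host you think you’re talking to. A minimal sketch (the hostname here is illustrative; substitute the one you’re fighting with):

```shell
# What does the full NSS stack (/etc/hosts first, then DNS) say?
# This is the path most applications, including ssh, actually use.
host="localhost"

# getent ahostsv4 walks nsswitch.conf just like the apps do;
# take the first IPv4 address it returns.
nss_ip=$(getent ahostsv4 "$host" | awk '{print $1; exit}')
echo "NSS resolves $host to $nss_ip"

# Then compare against DNS directly, bypassing /etc/hosts, e.g.:
#   dig +short "$host"
# If the two answers differ, you may be logging into a different
# machine than the one your colleague is editing.
```

If the address that comes back isn’t the one you expected, you’ve found your “bizarre” problem before wasting hours on the application layer.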