Help me debug a switching issue?

Date January 24, 2014

I've mentioned here before that I'm moving from our legacy Catalyst 6500-based switching infrastructure to a new Nexus 5548-based infrastructure, and I'm in the early stages of the actual migration. Before I can physically migrate things from the old switches to the new ones, I need to make sure that things work the way I think they should, and that has been a voyage of discovery for me, let me tell you.

The most recent thing I didn't understand is this: when I ran a pair of Cat 6 cables from the Catalyst 6500 to a shared FEX (using a vPC), I saw massive packet loss, in the ballpark of 50%. Depending on whether all of the links were up or one of them was down, the loss might affect traffic from all sources, or possibly just traffic crossing a subnet boundary. Here's the diagram for how it was set up:


Po3, in this case, was a trunk carrying almost all of the VLANs. With both links up, if the server and the laptop were on the same subnet, there was no packet loss, but if they were on separate subnets, the packet loss might be 50-60%. And the dropped packets didn't follow any recognizable pattern; the loss was "clumpy". Four or five misses, then one or two hits, then a miss, then four or five hits, and so on.
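If you want to put numbers on that "clumpy" pattern rather than eyeballing ping output, something like the following could help. It's a minimal sketch, not anything from my actual troubleshooting: the function name `summarize_pings` is made up for illustration, and the sample sequence is invented to mirror the pattern described above, not captured data.

```python
from itertools import groupby


def summarize_pings(results):
    """Summarize a sequence of ping outcomes.

    results: iterable of booleans, True = reply received, False = lost.
    Returns the overall loss percentage and the run lengths of
    consecutive hits and misses, in order.
    """
    results = list(results)
    if not results:
        return {"loss_pct": 0.0, "runs": []}

    lost = sum(1 for ok in results if not ok)
    # groupby collapses consecutive equal values into one group,
    # so each group is one "clump" of hits or misses.
    runs = [
        ("hit" if ok else "miss", sum(1 for _ in group))
        for ok, group in groupby(results)
    ]
    return {"loss_pct": 100.0 * lost / len(results), "runs": runs}


# Hypothetical sample: four misses, two hits, one miss, five hits.
sample = [False] * 4 + [True] * 2 + [False] + [True] * 5
summary = summarize_pings(sample)
print(summary["runs"])
print(round(summary["loss_pct"], 1))
```

Comparing the run lengths between the same-subnet and cross-subnet cases would at least show whether the loss is random or arrives in bursts.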

The layer 3 switching between VLANs in those cases was being done by the Catalyst (in fact, all of the L3 switching right now is being done by the Catalyst).

I talked with Cisco, and they suggested that I move the Cat off the FEX and attach it directly to the Nexuses (Nexii?). The reason they gave me was STP-related, but I'd already found that out the hard way and had enabled BPDU filtering on the Catalyst's port-channel members.

The ports on a Nexus 5548 are SFP+, and I was worried that I didn't have any transceivers that would work, but luckily I found some. I then ran two fiber connections from the Cat, one to each Nexus. As soon as I did that, my packet loss stopped across the board. Here's how it's wired now:

[Diagram: Switching Working]

So, the question that I don't know the answer to is, "Why did I see such a strange pattern of packet loss before?" What was I doing wrong? I asked the ticket holder at Cisco for clarification, and here's what I got back:

The main issue is that connecting a 6500 to a FEX is not a supported topology so any unexpected behavior may have no explanation because it may or may not work correctly. Checking the show tech I could not find anything indicating any issues, everything looks correct but that is the issue when having a not supported topology.

I refuse to believe that there is magic here. Yes, it's an unsupported topology, but why is it unsupported? What produces the packet loss here?

The closest thing I've come to an answer is in the vPC Gotchas You Need to Know blog entry by Peter Triple, aka Routing over a Nexus 7000 VPC Peer Link, although I'm not doing any routing adjacencies over the peer link. It's all static, and mostly resembles Diagram 4, but without the OSPF.

So basically, can anyone shed light on why the response I got was what it was? Thanks, I appreciate it!

  • Garry

    From the Cisco documentation:
    "The Fabric Extender provides end-host connectivity into the network fabric. As a result, BPDU Guard is enabled on all its host interfaces. If you connect a bridge or switch to a host interface, that interface is placed in an error-disabled state when a BPDU is received. You cannot disable BPDU Guard on the host interfaces of the Fabric Extender."

    I would then assume that there is an errdisable recovery running that would reset the interface, and then the BPDU Guard process would start over again...

    But you said you used BPDU Filter, so at that point there were no BPDUs coming in from your 6500 to the Nexi, and that would cause another set of issues, as spanning tree cannot work without BPDUs to know where the loops would be...