Category Archives: Networking

Posts relating to IP networking

Annoying pfSense Issue with 2.1.5 -> 2.2 Upgrade

I run several pfSense boxes throughout my network. Although the platform doesn’t have an API, and it can be a pain to configure manually in certain cases, it’s generally very reliable once running, and because it’s essentially a skinned BSD, it’s very easy on resources. There’s also a really nice self-update feature that I use to get things to the newest release when they’re available.

It’s that last feature that bit me in my butt Sunday night. I did the upgrade around midnight, and everything seemed to work alright, so I went to bed, but this morning I started getting reports that people couldn’t log into the captive portal that we use for our “guest” wired connections.

I thought, “That’s strange…everything seemed to work after the upgrade, but I’ll check it out”, and sure enough, as far as I could tell, all of the networks were working fine on that machine, but there was no one logged into the captive portal.

Taking a look at the logs, I found this error:

logportalauth[42471]: Zone: cpzone – Error during table cpzone creation.
Error message: file is encrypted or is not a database

Well, hrm. “Error during table cpzone creation” is strange, but “file is encrypted or is not a database” is even weirder. A quick Google search turned up this thread on the pfSense forums, where someone else (maybe the only other person?) had run into the same problem.

As it turns out, prior to version 2.2, pfSense was still using sqlite2, but now, it’s on sqlite3, and the database formats are incompatible. A mention of that in the upgrade notes would have been, you know, swell.

The thread on the forums suggests shutting off the captive portal service, removing the .db files, and then restarting the service. I tried that, and it didn’t work for me, so what I did instead was shut down the captive portal (to release any file locks), remove the db files, and then, from the text-mode administrative menu, force a re-installation of pfSense itself.
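
For reference, the file cleanup itself boils down to something like this from a shell on the box (the glob matches the database names on my system; treat it as a rough sketch, and disable the captive portal zone in the webGUI before doing it):

# Remove the old, incompatible sqlite2 captive portal databases
rm /var/db/captiveportal*.db
# Then use the text-mode console menu to force the re-installation, after which
# the captive portal comes back and recreates its databases with sqlite3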

Although I haven’t actually tested the captive portal yet (I’m at home doing this remotely, because #YOLO), a new database file has been created (/var/db/captiveportalcpzone.db) and inspecting it seems to show sqlite3 working:

[2.2-RELEASE][root@host]/var/db: sqlite3 captiveportalcpzone.db
SQLite version 3.8.7.2 2014-11-18 20:57:56
Enter ".help" for usage hints.
sqlite> .databases
seq  name             file
---  ---------------  ----------------------------------------------------------
0    main             /var/db/captiveportalcpzone.db
sqlite> .quit

Compare that to some of the older database files created prior to the upgrade:

[2.2-RELEASE][root@host]/var/db/backup: sqlite3 captiveportalinterface.db
SQLite version 3.8.7.2 2014-11-18 20:57:56
Enter ".help" for usage hints.
sqlite> .databases
Error: file is encrypted or is not a database
sqlite>

What I don’t understand is why the upgrade didn’t handle this itself. The normal way to convert from sqlite2 to sqlite3 is to dump and restore, but it doesn’t look like the upgrade process did that at all. It would be incredibly easy to do during an upgrade, ESPECIALLY when revving major database versions like this.
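
For the curious, that conversion is basically a one-liner, assuming you have both the legacy sqlite (v2) shell and sqlite3 available somewhere (which I haven’t verified the firmware image actually does):

# Dump with the old sqlite2 shell, restore into a fresh sqlite3 file
sqlite /var/db/backup/captiveportalinterface.db .dump | sqlite3 /tmp/captiveportalinterface-new.db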

Anyway, this kind of experience is very unusual for me with pfSense. Normally it’s “set it and forget it”. Hopefully this will work and I can get back to complaining about a lack of API.

VLAN Translation on a Nexus 5548 – :Sad Trombone:

I’ve got a problem. Our school is expanding, and we’re constantly hiring people. We’re hiring so many people that they won’t actually fit in the building we’re in. Because of that, we’re having to expand outside of the building we’ve been in for years. Part of that expansion is extending my networks across campus (and in some cases, farther).

The network that I run is really old. Like, it actually predates the network at the central university. I’ve got around 50 VLANs, and now that we’re growing outside of this physical environment, I’ve got to extend those layer 2 broadcast domains to the other buildings. I have a good relationship with the central network folks, and although most of my VLAN IDs collide with theirs, they assigned us some IDs that we can use on their infrastructure. Now, I just have to translate my VLAN IDs to their VLAN IDs.

My network core is a pair of Cisco Nexus 5548s. When I was planning this migration, I didn’t worry at all, because the documentation clearly declared that the switchport vlan mapping command was supported. The only weird thing was, when I went to set up the VLAN translation, the command wasn’t found. It was in the docs, but not in the CLI. Weird, right?
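
For reference, on the NX-OS platforms that actually do support it, the configuration I was hoping to use looks roughly like this (the interface and VLAN IDs are made up, and the argument order and exact semantics should be double-checked against the docs for your platform):

interface Ethernet1/1
  switchport mode trunk
  ! VLAN 510 on the wire (the provider's ID) maps to VLAN 10 internally (my ID)
  switchport vlan mapping 510 10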

So I did what you do when you pay ungodly amounts of money for Cisco support: I opened a ticket with the TAC.

I had been operating under the assumption that my device would be able to perform VLAN ID mapping on an interface, but I can’t figure out how to do it.

Is it possible to map VLAN IDs across a link? I have a trunk to my provider across which I need to send several vlans, but my IDs collide with those in use there. I was hoping to use the equivalent of “switchport vlan mapping”, but it doesn’t appear to be in my release.

Can you please advise me?

Thanks,

Matt

I got back what may be the best response from tech support ever. Emphasis my own:

Hi Matt,

My name is XXXXXXX and I will be assisting you with the Service Request 633401489. I am sending this e-mail as an initial point of contact and so that you can contact me if you need to.

Problem Description
As I have understood it, “switchport vlan mapping” command does not exist in 5548

>>
If you look at the release notes of Nexus5500
http://www.cisco.com/c/en/us/td/docs/switches/datacenter/nexus5500/sw/release/notes/7x/Nexus5500_Release_Notes_7x.html#pgfId-530160

States:
VLAN Translation
Allows for the merging of separate Layer 2 domains that might reside in a two data centers that are connected through some form of Data Center Interconnect (DCI).

So I can understand why you were under the impression that this platform supports this feature however I must state that the document is incorrect here.

I have verified with the Technical Marketing Engineers and it has been confirmed that there are no plans to support vlan mapping / translation on Nexus5500 platforms however as of today; Nexus5672, Nexus 6000 and Nexus7000 do support this feature in 7.x release.

Please let me know if there is anything else I may assist you with .

So that was, you know, less than helpful. And I still need to get those VLANs over there. How are we going to do this?

For now, I’m doing it the old-fashioned way: crossover cables.

Normally, when you move VLAN traffic around, you use a dot1q trunk. Each layer 2 frame gets an 802.1Q tag when it leaves the switch, which tells the remote device (usually another switch) which VLAN the frame belongs to. So a frame on VLAN 10 gets a tag that says “this frame belongs to VLAN 10”, which allows traffic from VLAN 10 and VLAN 20 to be sent over the same physical link and still be kept separate.
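
For comparison, a bare-bones trunk on the Nexus side looks something like this (interface and VLAN IDs are hypothetical):

interface Ethernet1/10
  description dot1q trunk carrying VLANs 10 and 20 to another switch
  switchport mode trunk
  switchport trunk allowed vlan 10,20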

Since the VLAN ID is encoded in the frame, it’ll cause problems if the VLAN ID I’m using means something else to the other network. But, since the only thing the other end cares about is the VLAN ID, if I can send my traffic over to the other network on the proper VLAN ID, then they’re happy. To do that, I need to bridge the networks. The easiest way I know how to do that is to take an access port on VLAN A, and an access port on VLAN B, and plug a single cable into both of them (after disabling spanning tree, of course). Yes, this sounds insane. Yes, it might actually be insane. But this is how I did it, and it worked the first time.
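
In config terms, the two access ports that the crossover cable joins look roughly like this (ports and VLAN IDs are made up; BPDU filtering is one way of doing the “disable spanning tree” part, and yes, it is exactly as scary as it sounds):

interface Ethernet1/20
  description my internal VLAN
  switchport mode access
  switchport access vlan 10
  spanning-tree bpdufilter enable

interface Ethernet1/21
  description provider-assigned VLAN, trunked across campus
  switchport mode access
  switchport access vlan 510
  spanning-tree bpdufilter enable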

The bad part is that I’m currently burning two physical ports for every VLAN I need to translate, and that isn’t tenable over the long run. Fortunately, the Juniper switches on the remote side of the link do support translation, so I believe we should be able to do it the “right way” before long. The sooner the better, because I feel dirty.
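
My understanding (untested on my part, so treat it as an assumption) is that on the Juniper side this is the vlan-rewrite feature on a trunk port, along these lines, with made-up port and VLAN IDs:

set interfaces ge-0/0/0 unit 0 family ethernet-switching port-mode trunk
set interfaces ge-0/0/0 unit 0 family ethernet-switching vlan-rewrite translate 510 10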

Accidental DoS during an intentional DoS

Funny, my first thought was that we were on the receiving end of a DNS reflection attack, but that was just a sign of my not thinking straight. If that had been the case, they would have shut down inbound DNS rather than outbound.

After talking with them, I learned that they had seen something on our network initiating a denial of service attack against DNS servers, using hundreds of spoofed source IPs. Looking at graphite for that time frame, I suspect you’ll agree when I say, “yep”:

Initially, the malware was spoofing IPs from all kinds of ranges, not just things in our block. As it turns out, I didn’t have the sanity check on the egress ACLs on my gateway that says “nothing leaves that isn’t sourced from our IP block”, which is my bad. As soon as I added that, a lot of the traffic died. Unfortunately, because the university uses private IP space in the 10.x.x.x range, I couldn’t block that outbound, and of course the malware quickly wised up and started exclusively spoofing 10.x addresses. So we got shut down again.
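
The sanity check itself is nothing fancy; in NX-OS-ish syntax it is just an egress ACL along these lines (a documentation prefix is standing in for our real block, and where you can actually apply it depends on what your edge device is):

ip access-list EGRESS-SANITY
  10 permit ip 192.0.2.0/24 any
  20 deny ip any any
!
interface Ethernet1/1
  description uplink to campus
  ip access-group EGRESS-SANITY out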

Over the course of a day, here’s what the graph looked like:

Now, on the other side of the coin, I’m sure you’re screaming “SHUT DOWN THE STUPID MACHINE DOING THIS”, because I was too. The problem was that I couldn’t find it. Mostly because of my own ineptitude, as we’ll see.

Alright, it’s clear from the graph above that there were some significant bits being thrown around. That should be easy to track down. So, let’s fire up graphite and figure out what’s up.

Most of my really useful graphs are thanks to the ironically named Unhelpful Graphite Tip #6, where Jason Dixon describes the “mostDeviant” function, which is pure awesome. The idea is that, if you have a BUNCH of metrics, you probably can’t see much useful information because there are so many lines. So instead, you probably want the few weirdest metrics out of that collection, and that’s what you get. Here’s how it works.

In the graphite box, set the time frame that you’re looking for:

Then add the graph data that you’re looking for. Wildcards are super-useful here. Since the uplink graph above is a lot of traffic going out of the switch (tx), I’m going to be looking for a lot of data coming into the switch (rx). The metric that I’ll use is:


CCIS.systems.linux.Core*.snmp.if_octets-Ethernet*.rx

That metric, by itself, looks like this:

There’s CLEARLY a lot going on there. So we’ll apply the mostDeviant filter:

and we’ll select the top 4 metrics. At this point, the metric line looks like this:


mostDeviant(4,CCIS.systems.linux.Core*.snmp.if_octets-Ethernet*.rx)

and the graph is much more manageable:

Plus, most usefully, now I have port numbers to investigate. Back to the hunt!

As it turns out, those two ports are running to…another switch. An old switch that isn’t being used by more than a couple dozen hosts. It’s destined for the scrap heap, and because of that, when I was setting up collectd to monitor the switches using the snmp plugin, I neglected to add this switch. You know, because I’m an idiot.

So, I quickly modified the collectd config and pushed the change up to the puppet server, then refreshed the puppet agent on the host that does the snmp monitoring and started collecting metrics. Except that, by then, the attack had stopped…so it became a waiting game for an attack that might never come again. As luck would have it, it did, and I was able to trace it to a port:

Gotcha!

(Notice how we actually WERE under attack when I started collecting metrics? It was just so tiny compared to the full-on attack that we thought it might have been normal baseline behavior. Oops.)
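
For what it’s worth, a collectd snmp plugin stanza for a switch like that looks more or less like the stock example in the collectd documentation; something like this, with the hostname, address, and community string as placeholders:

LoadPlugin snmp
<Plugin snmp>
  <Data "std_traffic">
    Type "if_octets"
    Table true
    Instance "IF-MIB::ifDescr"
    Values "IF-MIB::ifInOctets" "IF-MIB::ifOutOctets"
  </Data>
  <Host "oldswitch.example.edu">
    Address "10.10.10.10"
    Version 2
    Community "definitely-not-public"
    Collect "std_traffic"
    Interval 60
  </Host>
</Plugin>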

So, checking that port led to…a VM host. And again, I encountered a road block.

I’ve been having an issue with some of my VMware ESXi boxes where they will encounter occasional extreme disk latency and fall out of the cluster. There are a couple of knowledgebase articles ([1] [2]) that sort-of kind-of match the issue, but not entirely. In any event, I haven’t ironed it out. The VMs are fine during the disconnected phase, and the fix is to restart the management agents through the console, which I did, and then I could manage the host again.

Once I could get a look, I could see that there wasn’t a lot on that machine – around half a dozen VMs. Unfortunately, because the host had been disconnected from the vCenter Server, stats weren’t being collected on the VMs, so we had to wait a little bit to figure out which one it was. But we finally did.

In the end, the culprit was a NetApp Balance appliance. There’s even a knowledge base article on it being vulnerable to ShellShock. Oops. And why was that machine even available to the internet at large? Double oops.

I’ve snapshotted that machine and paused it. We’ll probably have some of the infosec researchers do forensics on it, if they’re interested, but that particular host wasn’t even being used. VM cruft is real, folks.

Now, back to the actual problem…

The network uplink to the central network happens over a pair of 10Gb/s fiber links. From the graph, you can see that the VM was pushing around 100MB/s (800Mb/s). This is clearly Bad(tm), but it’s not world-ending bad for the network, right? Right. Except…

Upstream of us, we go through an in-line firewall (which, like OUR equipment, was not set to filter egress traffic with spoofed source IPs – oops, but not mine this time, finally!). We are assigned to one of five virtual firewalls on that one physical piece of hardware, but the limit of around a couple hundred thousand concurrent sessions applies to the physical device as a whole.

For a network this size, that is probably(?) reasonable, but a session is counted as a stream of packets between a source IP and a destination IP. Every time you change the source IP, you get a new session, and when you spoof thousands of source IPs…guess what? And since it’s a per-physical-device limit, our one rogue VM managed to exhaust the resources of the big giant firewall.

In essence, this one intentional DoS attack on a couple of hosts in China successfully DoS’d our university as sheer collateral damage. Oops.

So, we’re working on ways to fix things. A relatively simple step is to prevent egress traffic from source IPs that aren’t our own; that’s done now. We’ve also been told that we need to block egress DNS traffic, except from known hosts or to public DNS servers. That’s in place too, but I really question its efficacy: there are a lot of other protocols that use UDP, and NTP reflection attacks are a thing. Anyway, we’re now blocking egress DNS, and I’ve had to special-case a half-dozen research projects, but that’s fine by me.
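
Mechanically, the DNS restriction is just a few more entries at the top of the same sort of egress ACL; a sketch, with placeholder addresses for our resolver and prefix (UDP shown, TCP/53 gets the same treatment):

ip access-list EGRESS-SANITY
  5 permit udp host 192.0.2.53 any eq 53
  6 permit udp 192.0.2.0/24 host 8.8.8.8 eq 53
  7 deny udp 192.0.2.0/24 any eq 53
  10 permit ip 192.0.2.0/24 any
  20 deny ip any any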

In terms of things that will make an actual difference, we’re going to re-evaluate the policies for putting VMs on publicly-accessible networks. I think it’s likely that providing external access to new resources will require justification, whereas in the past the default has been to leave things open, because we’re a college and that’s what we do, I guess. I’ve never been a fan of that from a security perspective, so I’m glad it’s likely to change now.

So anyway, that’s how my week has been. Fortunately, it’s Friday, and my sign is up to “It has been [3] Days without a Network Apocalypse”.