December 4, 2013
Man, I've been pulling my hair out for the past couple of days trying to get my pair of Cisco Nexus 5548s to synchronize their switch profile configurations, but I think I've finally got it, so I wanted to write a little bit and maybe help other people who got stuck, too.
Here's some background:
A while back, Cisco developed the idea of port profiles for the UCS environment, so that you could quickly and easily apply a templated configuration to a switch port.
With NX-OS 5.0(2)N1(1), Cisco included switch profiles, with a similar goal in mind. When you have a number of switches, all of which need to have the interfaces configured identically, then switch-profiles are what you're looking for.
You may be wondering why you would want a bunch of switches configured all the same...especially if you aren't familiar with Cisco Nexus switches. Just to clarify, here's a typical chassis switch:
It's basically a bunch of ports. That's what I have right now, and I've got a whole bunch of wiring going back to it from all of my racks in the server room. It's great for management, since there's only one device, but I hate running 30ft cords all the time.
Here's what I'm replacing it with:
Actually, a pair of them. As you can see, the port count doesn't quite add up. That's ok, because I've got a bunch of Fabric Extenders (FEX), too:
Here's how the FEX connect to the switches:
As you can see, each FEX is connected to each switch (a couple of times, actually - each FEX has four 10Gb/s SFP+ uplink ports, so each FEX gets a pair of 10Gb/s connections to each switch in a port channel configuration).
What I end up with is a physical layout that looks like "Top of Rack" or "End of Row", but doesn't have the headache of trying to configure six different switches. And with switch-profile synchronization, I don't even really have to deal with configuring two switches that often.
When a switch is configured to use a fabric extender, the FEX is assigned a number, from 100-199 (I don't know why that particular range of numbers). That configuration is pretty simple:
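On the switch side, it amounts to enabling the FEX feature and defining the FEX number (a minimal sketch; the description is just a label, matching what shows up in 'sh fex' below):

```
feature fex
fex 100
  description FEX100
```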
Then you just configure each of the ports that the FEX is attached to. On each uplink port (Ethernet1/1 through 1/3 here, as an example), it's the same two lines:
interface Ethernet1/1
  switchport mode fex-fabric
  fex associate 100
interface Ethernet1/2
  switchport mode fex-fabric
  fex associate 100
interface Ethernet1/3
  switchport mode fex-fabric
  fex associate 100
You can check that things are working the way you think:
core01# sh fex
  FEX         FEX          FEX                     FEX
Number    Description     State          Model           Serial
------------------------------------------------------------------
100        FEX100         Online   N2K-C2248TP-E-1GE  FOX1724GZKL
Assuming the FEX is actually plugged in, that creates a series of interfaces, Ethernet100/1/1 through Ethernet100/1/48 (in NX-OS, everything is Ethernet, regardless of the speed of the port).
Now, that configuration was done in isolation, with one switch. What about another switch that's also attached to the FEX? If you want to be able to use the second switch, something similar needs to be done on that switch, too.
The "right" way to do this is to set up Virtual Port Channels (vPC), as outlined in this document from Cisco. That's the document I used when I first started to configure the switches and FEX, and it worked. The problem is, that document doesn't actually explain to you that you should be using switch profiles. I mean, it mentions them twice in what are essentially footnotes, but by the time I got there, I assumed it was just another of the many Cisco technologies on the periphery that I don't know, don't use, and don't need to worry about.
But then, if that were the case, I wouldn't be writing this article, would I?
If you're doing this, you should use switch profiles. Seriously. And I can tell you from experience, it's harder to retrofit profiles into an existing configuration than it is to start from a clean slate using profiles the first time.
So let's do this. I'll assume that you have a couple of Nexus switches with their interfaces, FEX, and so on unconfigured, and that their management ports are on the network. They need to be able to talk to each other, so this should work from both sides:
ping -other-switch-mgmt0-ip- vrf management
At this point, the first thing we need to set up is Cisco Fabric Services. CFS is designed to distribute configuration information throughout the network. Fortunately, this is relatively straightforward for an infrastructure the size I'm dealing with.
Basically, you need to create a CFS region, tell it to distribute the configurations over IPv4 (or IPv6 if you're awesome) and, for me anyway, it worked. Here's my config and status:
core01# sh run | include cfs
cfs ipv4 distribute
cfs region 20
cfs eth distribute
core01# sh cfs status
Distribution : Enabled
Distribution over IP : Enabled - mode IPv4
IPv4 multicast address : 239.255.70.83
IPv6 multicast address : ff15::efff:4653
Distribution over Ethernet : Enabled
core01# sh cfs peers
Switch WWN IP Address
20:00:00:2a:6a:47:3b:00 -switch1 IP- [Local]
20:00:00:2a:6a:1a:3c:00 -switch2 IP-
Total number of entries = 2
Once this works, then you can start setting up the switch profile. To do that, you use a different configuration environment, one I'd never used before, called 'configure sync':
core01# configure sync
Enter configuration commands, one per line. End with CNTL/Z.
core01(config-sync)# switch-profile ?
WORD Enter the name of the switch-profile (Max Size 64)
core01(config-sync)# switch-profile core-shared
Switch-Profile started, Profile ID is 1
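Inside the profile, the piece of configuration that actually ties the two switches together is telling the profile who its sync peer is, pointing at the other switch's mgmt0 address (placeholder shown, same as the ping example earlier):

```
core01(config-sync-sp)# sync-peers destination -other-switch-mgmt0-ip-
```

Do the same thing on the second switch, pointing back at the first, and the two profiles will find each other and stay in sync.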
In the switch profile, you can make configuration changes to most of the switch. For the specifics of what you can and can't do, you probably want to read the docs.
The workflow for adding a FEX for me was to start off by pre-provisioning a "slot" for it. When I wanted to preconfigure FEX111, for instance, I did this in config-sync-sp mode:
provision model N2K-C2248TP-E-1GE
That pre-creates interfaces Eth111/1/1-48, and more importantly, it does the same thing on both switches.
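For the record, the whole exchange at the prompt looked roughly like this (I'm reconstructing the slot-selection line from memory, so double-check the pre-provisioning section of the config guide for your NX-OS release):

```
core01# configure sync
core01(config-sync)# switch-profile core-shared
core01(config-sync-sp)# slot 111
core01(config-sync-sp-slot)# provision model N2K-C2248TP-E-1GE
core01(config-sync-sp-slot)# exit
core01(config-sync-sp)# verify
core01(config-sync-sp)# commit
```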
I'm not going to walk through the entirety of my switch config, but if you have questions about a specific part, just ask.
While I was going through this work, the tools I used extensively were:
- verify
This command is a kind of dry run that makes sure the configuration changes aren't going to do anything too wrong. If it returns successfully, then there's a good chance that you can commit your change. I've seen the occasional time when a configuration change will pass on 'verify' but will fail on commit because of something on the remote node that it didn't take into account, so I'm not sure exactly what it is and isn't checking.
That exception that I was talking about was when I changed the model number of a pre-configured FEX.
- show switch-profile buffer
This shows a numbered list of the proposed changes that you are trying to commit. It's very useful as a sanity check, to make sure that you haven't mistyped something or thought you were in another configuration mode when you typed it.
- buffer-delete
This gives you the ability to delete some or all of your buffered changes, as numbered by the 'show switch-profile buffer' command above.
- commit
When you're ready to apply the changes, you commit them, which locks the configuration on both switches. When you commit, the first thing that runs is 'verify'; if that passes, the changes are applied to the local switch, and if that succeeds, they're applied to the remote switch. If every step succeeds, the change is a success. If anything in this process fails, the entire change is rolled back and nothing is applied anywhere. This atomicity gives you a known-good, identical configuration everywhere.
- show switch-profile status
When something inevitably goes wrong, this command stands a good chance of helping you figure out what happened. Here's the output from one of my switches:
core02# show switch-profile status

switch-profile : core-shared
----------------------------------------------------------
Start-time: 185468 usecs after Wed Dec 4 11:23:12 2013
End-time: 414949 usecs after Wed Dec 4 11:23:14 2013
Profile-status: Sync Success

Local information:
----------------
Status: Commit Success

Peer information:
----------------
IP-address: -other switch IP-
Sync-status: In sync
Status: Commit Success
Running it from both sides is helpful (and reassuring).
- show run switch-profile
This shows the running configuration for only the bits that are applied through the switch-profile. This is a lifesaver, because 'show running-config' doesn't actually differentiate between what was configured locally and what was synced.
Big thanks go to Markku Leiniö for his blog entries on the topic. They really helped open my eyes. If you deal with this kind of stuff, I'd recommend reading what he writes.
In general, my advice is to keep two terminals open, one to each switch, and to make small, atomic changes while you're learning how configuration is applied. When I was configuring my FEX, I did one FEX first: run verify, commit, and confirm that it happened on both sides. Then I did all of the other FEX in exactly the same way, just in bulk, pasted from a text buffer: verify, commit, then check the status on the other switch. Make sure that you can apply configurations from both switches and that things are working right.
Being cautious is a good thing. Like any other distributed configuration system, this allows you to quickly and easily make changes, but also exacerbates mistakes.
Thanks for reading, and I hope it was helpful!
December 2, 2013
Working for the College of Computer and Information Science is definitely a challenge in certain ways. The people that I deal with are almost universally computer-literate and very skilled and knowledgeable about what they're doing, and they're all doing different and interesting things.
While I do provide a virtualization environment for general purpose classes (under VMware), I'm seeing more and more interest in spinning up instances of student VMs in Amazon (not to mention the classes that are doing things like running Hadoop in the cloud so that they don't need to micromanage their MapReduce infrastructures). Right now is probably the most flexible time for computing I've ever seen. You can just about run whatever you want wherever you want, if you have the funding.
But that's kind of an interesting problem in and of itself. Suppose that we, the college, did have the funding, centrally, and not per-project, or per-class, or per-instructor. Or, even, suppose we have it per program. How do you equitably divide that up among the researchers, instructors, classes, and students so that it gets to the people who need it most?
This is the meta problem that I'm considering right now, and I know I'm not alone in being concerned that the amount of money that we're using and the amount of money that we've got budgeted might not line up in any meaningful way. If you have an environment where several people can spin up instances at a whim, how do you ensure costs don't go insane? Or, alternatively, how do you control the resources that someone can instantiate?
That's what I'm asking. Are you using an external cloud provider? How do you control your scaling? How do you make sure that you don't suffer a Pyrrhic victory in terms of traffic and usage? What are you doing to manage this kind of thing?
Please comment below!
November 19, 2013
At LOPSA-East 2013, I was lucky enough to present a half-day tutorial on solid-state drives that got a lot of positive feedback. The talk was recorded, but wasn't available until now!
I've uploaded it to YouTube, and I'm embedding it below:
Note that it is three hours long. If you already know about spinning disks and you're comfortable with that technology, skip to 56 minutes in for the beginning of the SSD discussion.
Also, I've uploaded the slideshow to slideshare (along with several other presentation slide decks I've done).
I hope you enjoy it and get something out of it. If you have any questions, feel free to ask here or on twitter.
November 12, 2013
Yesterday, I briefly discussed some of the events I took part in at LISA13 in Washington DC, but I wouldn't say it was an exhaustive list of everything I did. You see...I also played hooky on Monday, rented a car, and drove some people to the Udvar-Hazy Center, otherwise known as the Air and Space Museum Hangar.
Every time I've ever been to the Smithsonian Air and Space Museum, it's always the one on the National Mall. They have a lot of really cool stuff there, like Apollo landers, the Spirit of St. Louis, Spaceship One, and hundreds of other air-and-spacecraft. But the big birds are reserved for the Udvar-Hazy center, because there's no parking spot on the National Mall for a Concorde. Or one of these.
Yep, they have a Blackbird. Oh, and right behind it? This girl:
That's the Space Shuttle Discovery. No big deal, right? Seriously, if you're visiting DC, you must make it out to Udvar-Hazy. It's so worth the trip.
One of the other things they have is a selection of space capsules. I've been to enough museums now that I've seen plenty of them. Before, I'd concentrate on the tininess of the cockpit. How would you like to be shoved into that sardine can for days or weeks, flying through space with a limitless vacuum just outside the metal walls?
Maybe it's the fact that I've been playing a lot of Kerbal Space Program lately, or maybe it's the fact that I was experiencing things by visiting with a group of sysadmins, but for whatever reason, instead of concentrating on the cramped quarters, I opened my eyes and I realized that even the tiniest space capsule has a TON of switches, knobs, screens, and dials. The instrumentation is amazing!
Now, we see things like the Dragon from SpaceX, and yeah, it's complicated because it does a lot. It's modern and docks with the ISS, and all of that, but what about the first ones? Suppose you looked at the very first space capsule that John Glenn rode in. How many controls and dials would you see? How many things do you think a space capsule does?
You know it has to control the angle of re-entry, so you could expect some kind of control stick arrangement for that. It probably should monitor fuel, so you'll need a dial for that. Temperature, both internal and external would be useful. Also an altimeter, and an emergency "pop the chute" button in case the chute didn't do that on its own. Plus the wheel you'd unscrew to open the hatch. If I were imagining it in my head, that's probably what I'd say. I imagine I'm simplifying a little bit, but I don't think I'd be that far from reality.
Except I'm dramatically wrong. Here's John Glenn in the cockpit of the Friendship 7 capsule:
John Glenn in control of the Friendship 7
When I saw the capsules at the Smithsonian had instrumentation like that (and more!), I was kind of blown away. What do all of those things do? (As it turns out, there's an app for that) But it's just simply amazing. Every single function that could possibly be executed is instrumented there, with a physical switch or dial. That's kind of inspirational to me.
To some extent, modern IT has gotten away from these kind of command and control interfaces. Where, in previous times, it might have seemed advantageous to us to have a "command and control center", now, we largely try to automate away all of the interactive features and to get our systems to fly by wire, as it were. These switches and knobs have been replaced by API calls, and I'm much more comfortable with things like that.
But what of the dials? Feedback is still just as important as it was in John Glenn's day. Yes, we have alerting to those conditions that we know to watch for, and chances are good that you're trending certain things like bandwidth, load, and so on (partially because those are easy, and partially because they've been standard to trend since even before Tobi Oetiker wrote MRTG). But is that all you're gathering? Is that enough?
I keep reading through Etsy's amazing blog entry from a few years ago, Measure Anything, Measure Everything, and it kind of speaks to me. Diskspace is cheap. Information isn't. Why aren't I monitoring more? Why aren't I monitoring business metrics? Is it worth the effort of implementing? What's the potential payoff? What's the downside?
What I've eventually come to is that, for the first time in my professional life, I am creating an annual goal for myself. This year, I'm going to concentrate on instrumentation. I'm going to implement better, more thorough monitoring, I'm going to figure out what matters to me and my employer, and figure out what the best way to present it is. I don't know what's important and what isn't until I see it, so I'm going to be liberal in my data gathering.
One of the first things I'm going to be doing is standing up a Graphite instance that I don't hate. I spent a lot of the train trip back from Washington playing with it using Vagrant, and I wanted to help other people. Graphite has a reputation for being difficult to install. The amazing Jason Dixon wrote Synthesize to do the hard work for you, but it's limited to Ubuntu 13.04, and we're settled on 12.04 until the next LTS is released, so I cribbed some of his ideas (and those of other HowTos) and created a Vagrantfile that installs Graphite. It's not a Vagrant box - it's a generic Precise64 box, but if you look at the Vagrantfile, it walks you through the steps of installation; it just performs them automatically and without any interaction.
You can take the automated steps and change them to meet your environment however you want. Hopefully it will help you set up Graphite in your own infrastructure. I'm still working on Collectd, and I may have some places to use something like statsd, too. But I'm making progress. If your monitoring environment needs help, maybe you can start now, too. It's just one step at a time, right?
November 11, 2013
I'm used to being worn out after LISA, but I feel especially hard hit after this one. It's kind of expected, what with the whole week-long-conference thing, but it seems like this year, it's even more so than normal.
I attended several sessions and classes, only one of which I've written about so far (incidentally, it's Introduction to PowerShell with Steven Murawski, and it's awesome). The rest exist as pages in Evernote that I'll be typing up this week in my free time.
I think the reason I'm so wiped out is that this was the first year I attended LISA as a LOPSA Director, and I underestimated the amount of time that LOPSA, plus blogging for USENIX, would take. I also had to put the finishing touches on the LOPSA Recognized Professionals program, which was pretty heavy, but important enough that it made me get up early and stay up late working on it.
All told, I was in DC for 8 days and it's nice to be back home. Today is also Veterans Day in the United States, a holiday recognized by my work, so I'm at home working on collating my various writings.
As for the LISA13 content itself, I was really impressed. The opening keynote was on the topic of Modern Infrastructure: The Convergence of Network, Compute, and Data, given by Jason Hoffman, founder of Joyent.
Sadly, I missed the talk by Bruce Schneier. Unfortunately, he couldn't be there in person because the IETF meeting was happening in Vancouver, but there was some remote teleconference happening. I heard from a few people that went though, that the talk was great and the format worked out alright.
One of the more interesting talks I did go to was Brendan Gregg's Blazing Performance with Flame Graphs. In this talk, Brendan explained what flame graphs are, and showed some very cool use cases for debugging and performance tuning. Maybe the neatest part was that he posted examples for DTrace, which is included out of the box on OS X, so I could play along on my laptop. You can check out FlameGraph on GitHub. Really cool talk, though. I'm very glad I went. Just check out the slideshow:
Another talk that pretty much everyone told me was excellent was Hacking your Mind and Emotions by Branson Matheson. Or, you know, "Social Engineering". Branson is a funny guy in general, and apparently this talk was hilarious as well as insightful. I wish I could have gone!
Matt Provost from Weta Digital presented a talk with excellent content, Drifting Into Fragility. This was the story of a failure in Weta's infrastructure, told through the lens of Sidney Dekker's Drift Into Failure and Nassim Taleb's Antifragile. I really enjoyed the talk, and I really hope that Matt was able to introduce new people to those books. The concepts they cover are really important.
The closing plenary was delivered by Todd Underwood of Google, speaking on what I believe is a term he coined in this usage, "Post Ops". His talk was amusing, discussing the concept that system administration is finished (direct quote from slide 2). And there's some truth to that, I believe, but it's a little myopic to use as a flat statement. I really enjoyed the talk, and there was some great discussion during and after on Twitter using the #LISA13 hashtag.
While all of this "official" content was going on, things were happening all over the place. The inaugural hackspace was actually really awesome. It was the size of a normal tutorial room, except there were whiteboards everywhere, and the fastest internet connections at the whole conference. There were always snacks and drinks, and every time I went, there were people there working on things, and even more amazingly, throughout the entire conference, I never heard a single word of complaint about it. I believe USENIX is going to do it again next year, so I expect that it'll get even better. If you go to LISA14 in Seattle, make sure to check this out. I wish I had more time to go in there and hack on things. It looked awesome.
So what of next year? Well, as I said, LISA14 is in Seattle, November 9-14. The Call for Participation is out, so start thinking about things that you're interested in speaking on. I was a conference volunteer last year, helping with the training. This year, I'll be doing similar, and involved with Invited Talks, too, so if you come up with something you'd really like to have people know about, talk to me, either through email or on Twitter.
And now, if it's all the same to you, I'm going to unwind and spend a day relaxing.
November 7, 2013
Last night, I got to present a really great new program at the town hall meeting at LISA13. It's called the LOPSA Professional Recognition Program, or LPR, and there's nothing like it in the world.
Here's the deal: You and I are both system administrators, and I'm willing to bet you're a relatively decent one, at that. One of the reasons we're good at being administrators is because we try to improve ourselves, so we take efforts to learn and we're interested in the community around us.
But we both probably know some people who have a job similar to ours, but who don't take the effort to get better at it, and who just basically show up to punch a time clock. Because we're IT admins, we know the difference - we can tell the ones who care from the ones who don't. But if you aren't in the industry, it's much harder to tell.
(photo by LOPSA member Will Dennis)
That's why LOPSA is launching the LPR program. We're drawing the line in the sand by establishing the first set of professional standards of practice in the industry. Here is the set of requirements for 2014:
Must agree to abide by the LOPSA Code of Ethics
Must have at least 640 hours of professional practice in 2013
Must be a member in good standing of LOPSA
Have 20 hours of structured training (such as a tutorial, course, class, or otherwise)
or, as an autodidact, write a short essay on what the individual has done over the previous year to learn and improve themselves, and list the online communities that they're involved in
We are aware that there are many ways that individuals work to improve themselves, and we're covering all of the bases. Not everyone can come to conferences and take part in training, so we created the "autodidact" track, which allows for more people to take part.
Because our industry changes rapidly, this program will need to change along with it. For that reason, each year, the LPR committee will re-evaluate the standards and examine how the industry has changed, and will take into account changes in the requirements to practice in other professional fields, as well.
The cost to apply is $20, although because this is the first launch, we're cutting that in half, to $10. This will get you a nice certificate to hang at your desk or in your office.
This program is going to be amazing. I'm excited to see it grow, and in a couple of years, it's going to really gain traction and cause a change in our industry. I'm happy to be a part of it, and I hope you will be too.
Read more about it at the LPR page, and sign up!