No updates lately - super busy!

Date April 23, 2014

I've been writing even less than normal lately, and given my usual habit of not posting, that's saying something!

Lately, I've been feeling less like a sysadmin and more like a community manager, honestly. On top of my normal LOPSA Board Member duties, I'm serving as co-chair of both the LISA'14 Tutorials committee AND the Invited Talks committee. PLUS I've been doing a lot of work with PICC, the company that manages the Cascadia IT Conference and LOPSA-East (which is next week, so if you haven't registered yet, do it now. Prices go up starting on Monday!).

All of this leaves very little time for doing actual sysadmin work, and even less for writing about it.

As an overview of the stuff I've been dealing with at work, let me just implore you: if you're using Cisco Nexus switches, Do Not Use Switch Profiles. I've written about them before, but I can't tell you emphatically enough not to use them. They're terrible. I'll talk about just how terrible some time later, but trust me on this.

Also, I've been doing a whole lot of network migration that I'll write about at some point in the future. For now I'll just say that it's really demoralizing to perform the same migration three times, and I'm awfully glad I had a rollback plan. At the moment, I'm working on some Python scripts to make per-port changes simpler so that I can offload them to students. I'm glad that Cisco ships a native Python API on the Nexus, but its configuration support is severely lacking - basically equivalent to cli(). Also, students migrating hosts to the new core...what could possibly go wrong? ;-)
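
To give you an idea of what I mean, here's a rough sketch of the kind of per-port helper I'm writing. It assumes the onboard NX-OS Python interpreter and its cli() helper mentioned above (the exact import path can vary by platform), and the interface and VLAN are made up for illustration:

from cli import cli

def set_access_vlan(interface, vlan):
    """Move a single edge port into the given access VLAN."""
    cli('configure terminal ; interface %s ; switchport access vlan %d'
        % (interface, vlan))
    # Echo the result back so a student can eyeball it before moving on.
    print(cli('show running-config interface %s' % interface))

set_access_vlan('Ethernet1/14', 100)

That's about the extent of what the API gives you for configuration: you end up building command strings and feeding them to cli().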

Alright, no time to write more. I will work on writing more frequently, anyway!

Fun lesson on VRRP on Nexus

Date April 8, 2014

I'm in the middle of migrating our upstream links from the old 6500 core to the new Nexus switches, and I discovered something fun today that I thought I'd share.

Before, since I only had a single core switch, each of my subnets had a VLAN interface with the gateway's IP address applied, like this:

interface Vlan100
 description VLAN 100 -- Foo and Bar Stuff
 ip address 192.168.100.1 255.255.255.0
 no ip redirects
 ip dhcp relay information trusted
 ip route-cache flow
end

Pretty simple. But in the new regime, there are two switches, and theoretically, each switch should be fully capable of acting as the gateway. This is clearly a case for VRRP.

On Core01, the configuration looks like this:

interface Vlan100
  description Foo and Bar Stuff
  no shutdown
  ip address 192.168.100.2/24
  management
  vrrp 100
    authentication text MyVrrpPassword
    track 1 decrement 50
    address 192.168.100.1
    no shutdown

and on Core02, it looks like this:

interface Vlan100
  description Foo and Bar Stuff
  no shutdown
  ip address 192.168.100.3/24
  management
  vrrp 100
    authentication text MyVrrpPassword
    track 1 decrement 50
    address 192.168.100.1
    no shutdown

The only difference is the IP address assigned to the interface.

(Incidentally, the track statement follows the upstream interface. If that link goes dead, the VRRP priority is decremented by 50 points, which makes sure the virtual IP fails over to the other switch.)
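
The track object itself isn't shown in the interface configs above; on NX-OS it's defined globally, with something roughly like this (the interface named here is hypothetical - use whatever your actual upstream link is):

track 1 interface Ethernet1/1 line-protocol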

When both switches were configured this way, "show vrrp" reported that both switches were set to Master:

core01(config-if)# sho vrrp
      Interface  VR IpVersion Pri   Time Pre State   VR IP addr
---------------------------------------------------------------
        Vlan100 100   IPV4     100    1 s  Y  Master 192.168.100.1

core02(config-if)# sho vrrp
      Interface  VR IpVersion Pri   Time Pre State   VR IP addr
---------------------------------------------------------------
        Vlan100 100   IPV4     100    1 s  Y  Master 192.168.100.1

That's clearly not good. I verified that both switches were actually sending traffic:

2014 Apr 8 10:58:20.968603 vrrp-eng: Vlan100[Grp 100]: Sent packet for VR 100, intf 0x9010073

Digging in, what I found was that each switch was seeing traffic sourced from a duplicate of its own virtual IP:

Apr 8 11:06:33 core01 %ARP-3-DUP_VADDR_SRC_IP: arp [3573] Source address of packet received from 0000.5e00.0173 on Vlan100(port-channel1) is duplicate of local virtual ip, 192.168.100.1

On a whim, the other admin I was working with at the upstream provider had me disable the "management" flag. And of course, that made things start working immediately.

Apparently, setting the management flag (which ostensibly just allows you to manage the switch in-band using the address assigned to the interface) ALSO makes the switch aggressively use the VIP as its source address. I don't know why. It seems like a bug to me, but I'm going to get in touch with the TAC today and see if it's a known thing or not.

I thought you might be interested in knowing about this, anyway. Thanks for reading! (and if you have more information as to why this happens, please comment!)

LOPSA East #SysAdmin conference in NJ - Early Bird ending soon

Date March 19, 2014

I don't know about you, but I love a good local SysAdmin conference, and LOPSA-East is definitely shaping up to be one.

Now in its fifth year, it's being chaired by LOPSA Board Member Evan Pettrey, who also started the Crabby Admin chapter in Baltimore.

The training classes look great as always, and include a Kanban course (which should be interesting!) as well as a Hands-On Security course by Branson Matheson, who works at NASA and gave one of the most entertaining invited talks at last year's LISA conference. You really don't want to miss this. Personally, though, I think he really missed the boat when naming it...

Anyway, not only is registration open, but the early bird pricing is almost over. The discounted rate ends soon, so get approval from your boss and get registered, because you don't want to miss this!

Just what we need...another package manager

Date March 18, 2014

The Rust programming language announced the release of Cargo, their new package manager, which will "support the common lifecycle for packages", because we don't have enough of that already.

Before I start, let me just say - I'm not picking on Rust or Cargo. This isn't about them. This is about the ecosystem.

We have too many package managers. Far, far, far too many package managers.

Each Unix-like OS has its own package management solution. And each programming language seems to have ITS own, too. Whether it's CPAN, or pip, or PEAR, or easy_install, or gems, or npm, or now Cargo, there's a way to install add-ons for language-specific software, in that language itself, but more importantly, outside of the OS package management.

And then, there are other layers to contend with. DSLs have their own internal ways of installing modules. Puppet has the Forge, Chef has knife and the community cookbooks, and CFEngine has the...well, whatever it has, I'm sure. Then there is meta-software to manage those, too. Things like librarian-puppet, which is, itself, a customization of a piece of software called librarian, which is a framework for writing bundlers - Bundler being the thing that manages Ruby dependencies (pulling down the software you need to run the things that you have). There is also librarian-chef. This is separate from Berkshelf, of course, which does pretty much the same thing.
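
To be concrete about what just one of these layers looks like: librarian-puppet is driven by a Puppetfile, which looks roughly like this (module names, versions, and the git ref below are made up for illustration):

forge "http://forge.puppetlabs.com"

mod "puppetlabs/stdlib"
mod "puppetlabs/rabbitmq", "3.1.0"
mod "apt",
  :git => "git://github.com/puppetlabs/puppetlabs-apt.git",
  :ref => "1.4.2"

That's a Ruby DSL describing Puppet modules, fetched by a Ruby gem, which was itself installed by yet another package manager. You see where this is going.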

How in the ever-loving sight of his noodley appendages are any of us supposed to manage this heaping layer of crap?

Yes, I know, some of the aforementioned software solutions are meant for managing a single user's module directory maybe, or a specific dev environment, or for testing. But which, and why, and is it explained anywhere that the software should or shouldn't be used in a certain environment or way, or are you just given the sticks, the twine, and the plastic and told to go fly a kite?

The irony of all of this is that we're using this jumbled up pile of random software which largely overlaps in purpose and function, to try to build more solid, easily reproducible infrastructures. But rebuilding the environment that created it? Good luck!

Maybe it's not all that bad. Maybe. I mean, after all, I have the artifacts that were created, right? I have the Puppet code that I used to produce my infrastructure. Well, sort of. I mean, to be really honest, I don't write all that much Puppet code. I use off-the-shelf Puppet code, and I end up writing YAML for hiera to interpret and feed into parameterized classes. So my artifacts aren't Puppet code so much as really ugly stuff that isn't markup, according to the definition of the language construct.
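
If you've never seen it, the kind of artifact I'm describing looks something like this - a sketch with hypothetical hiera data, assuming the common pattern of automatic parameter lookup plus a hiera_include('classes') in site.pp:

---
classes:
  - ntp
  - rabbitmq

ntp::servers:
  - 0.pool.ntp.org
  - 1.pool.ntp.org

rabbitmq::port: '5672'
rabbitmq::service_manage: true

That's it. That's the "infrastructure as code".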

So how do I rebuild the compost heap infrastructure that I used to build my environment? Suppose I used entirely off-the-shelf puppet code. Nothing custom, just modules I found. And I erased my repo which contains my puppet modules. How would I rebuild it and get the same thing that I had before?

Do I stand a chance in hades of actually being able to reproduce what I had before? Suppose I even had a list of all of the puppet modules. Could I source the same modules? The same versions of the same modules? No. And I'm willing to bet, neither could you.

If you're like me, you've probably cobbled together some kind of unholy array of modules from the Forge, modules from GitHub, and modules of ill repute that you found in a back alley somewhere, and that's all stuff that you probably think of as "off the shelf". It's completely irreplaceable, if only because there's no good way to figure out what came from where, because I (and most likely you) used any one of a dozen methods of getting each module into the system.

I don't think there's any hope for fixing the 'bespoke build infrastructure', at least in this generation. We have to deal with the fact that the systems we use to build our highly available, fully reproducible, solid infrastructures are, themselves, anything but.

However, if I could plead with the tool makers, with the developers who are building, have built, and will build the things we use to build things, let me ask you this. Please, swallow your hubris. Please, accept that a tool someone else wrote that you see as imperfect may actually solve the problem, and know that the world may not need another solution that does the same thing. Instead of building something new that overlaps, work with the other person to extend what they've done; don't build anew.

Unless we learn this lesson, we're going to continue to have infrastructures that look like the Walled City of Kowloon, rather than something better. Please.

Interview with Atom Powers, co-chair of #CasIT14 in Seattle

Date February 17, 2014

I've been involved in LOPSA's regional conferences since 2010, when NJ-LOPSA first held the Professional IT Community Conference. The next year, Seattle got in on the act, and the Seattle-Area System Administrators Guild put on Cascadia IT Conference in 2011.

This year, helping Lee Damon along is Atom Powers, a gentleman I've not yet had the pleasure to meet. I wanted to get to know him, and more importantly, I wanted you to get to know him, too, because he's helping out in a big way.

Matt Simmons: Is this going to be your first year helping to run Cascadia IT Conference? How did you get involved?

Atom Powers: This is my first year helping to organize Cascadia IT Conference and my first year helping to organize any conference. We are very fortunate to have Lee Damon lead the organizational effort and impart some of his wisdom unto me, that I may shoulder more of the burden next year.

MS: What is the process like to put together something like Cascadia? How long does it take?

AP: It is a constant effort; barely a day goes by that we don't do some kind of planning for the conference. Even before this year's conference starts, we will begin planning for next year's conference.

MS: According to your LinkedIn page, you live in the Seattle area. Has that always been the case? If not, what brought you there?

AP: From my parents' hippie roots, I've been in the Pacific Northwest for most of my life, although rarely in the same place for more than three or four years at a time. In 2004 I found a part of Renton that I like well enough, and I've been there since.

MS: What in the Cascadia schedule are you most looking forward to?

AP: We are very fortunate this year to have so many well-known people presenting: Tom Limoncelli, Garrett Honeycutt, Steve Murawski, to name a few. We have a good lineup of MS Windows administration tutorials as well, something you don't usually see at an IT conference. You can find a full list of presenters and presentations here: http://casitconf.org/casitconf14/conference/schedule/

MS: Was there anything that didn't get put in this year that you think should get put in next time?

AP: We select talks and tutorials based on what we think people will want to learn about, because we want this conference to be useful and to elevate the IT industry generally. It is always difficult to predict what other people will want, and we always have to turn down many very promising proposals. We won't know until after the conference if we made wise choices, and we may never know if a different choice would have been a better choice.

Every year is new. If you have an idea or a request then it is never too early to start thinking about preparing your proposal for next year.

MS: If someone is on the fence about coming to Cascadia, what would you say to convince them?

AP: If you want to grow and learn and be valuable to your team, your company, and our industry, then you need to get out and learn. The Cascadia IT Conference is a unique opportunity to do that among friends and colleagues. This conference provides many different kinds of registrations for those with difficult time and financial commitments. If you can't afford two days, then you have the option of one day, or half a day, or even just an hour; and it will be one of the best things you can do with your hours.

http://casitconf.org/casitconf14/registration-is-now-open/

MS: One last question. On your Google+ profile, you have a picture of a Kerbal, from Kerbal Space Program. What's your proudest in-game accomplishment?

AP: I think Kerbals are cute.

I wasn't playing Kerbal Space Program (KSP) with much dedication until recently. My favorite games are sandbox games but "stock" KSP is a bit too open-ended and not very challenging. With the 0.23 science update and a few mods (BTSM, MechJeb) and a spreadsheet of science "missions", I find the game much more enjoyable. Enjoyable enough that my Minecraft world hasn't had any attention in many weeks.

I'd like to thank Atom for taking the time to answer my questions, and I've got to say, I'm looking forward to meeting him when I attend Cascadia this year.

In fact, I feel like I should tell you: I think it's important enough that I'm paying for my trip out of my own pocket. I'm lucky in that my work picks up one conference a year for me, and that's going to be LISA this year, but going to Cascadia matters enough to me that I'm covering my flight, my hotel, and my conference registration myself (although after my work saw how important it was to me, they DID offer to pick up the cost of the Advanced Puppet training course I'm taking on Friday, so thanks NEU!).

If you're on the west coast and you've never been to a conference, why not make this your first? If you can register today (Feb 17th, 2014), you can still get the Early Bird savings. It's going to be full price tomorrow, so do it today.

Thanks for reading!

RabbitMQ on Ubuntu via Puppet?

Date February 7, 2014

Ubuntu has a certain really annoying property. Alright, it has several, but the one I'm talking about right now is its insistence on starting services upon installation. While I'm never a fan of that, there are certain times when it chafes more than normal.

Here's the deal. I'm using the PuppetLabs RabbitMQ module, and I'm trying to spin up an instance in Vagrant. The initialization is simple enough:

class { 'rabbitmq':
  port                  => '5672',
  service_manage        => true,
  environment_variables => {
    'RABBITMQ_NODENAME'    => 'server',
    'RABBITMQ_SERVICENAME' => 'rabbitMQ',
  },
}

This works - mostly. The package gets installed, but when Puppet tries to manage the service, it fails:

[Screenshot: the Puppet run fails while trying to start the rabbitmq-server service]

The reason for the failure is that the port is already in use. If you connect to the machine and try to start the service manually, you get this:

vagrant@precise64:~$ sudo /etc/init.d/rabbitmq-server start
 * Starting message broker rabbitmq-server
 * FAILED - check /var/log/rabbitmq/startup_{log, _err}
   ...fail!

When you check the startup_log, you find this:

Error description:
{could_not_start_tcp_listener,{"::",5672}}

The reason it can't start the listener is because that port is already in use!

vagrant@precise64:~$ sudo lsof -P | grep LISTEN | grep 5672
beam.smp 2211 rabbitmq 16u IPv6 11556 0t0 TCP *:5672 (LISTEN)

Sure enough, 'ps auxf' shows this:

rabbitmq 2211 0.4 8.3 1073128 31160 ? Sl 10:07 0:02 \_ /usr/lib/erlang/erts-5.8.5/bin/beam.smp -W w -K true -A30 -P 1048576 -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.2.3/sbin/../ebin -noshell -noinput -s rabbit boot -sname rabbit@precise64 -boot start_sasl -kernel inet_default_connect_options [{nodelay,true}] -sasl errlog_type error -sasl sasl_error_logger false -rabbit error_logger {file,"/var/log/rabbitmq/[email protected]"} -rabbit sasl_error_logger {file,"/var/log/rabbitmq/[email protected]"} -rabbit enabled_plugins_file "/etc/rabbitmq/enabled_plugins" -rabbit plugins_dir "/usr/lib/rabbitmq/lib/rabbitmq_server-3.2.3/sbin/../plugins" -rabbit plugins_expand_dir "/var/lib/rabbitmq/mnesia/rabbit@precise64-plugins-expand" -os_mon start_cpu_sup false -os_mon start_disksup false -os_mon start_memsup false -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit@precise64"

As it turns out, of course, starting the service upon installation is by design.

So here's my question. Presumably, other people have done this. The module is documented as tested under Ubuntu 12.04, which is what I'm running. How do you make this work? The only thing I've seen that makes the Puppet run not fail is to set "service_manage => false", and I don't want that. I want the service to be managed, and I want the default instance that starts right after installation to die.

What's the right way to do that? For what it's worth, I searched on ServerFault and found someone else with the same problem, but it hadn't gotten an answer after a month, which is why I brought it here.

Can you give me a hand? Thanks in advance!


EDIT

Got it! I popped onto the #rabbitmq IRC channel on freenode and started asking questions. I was going down a completely wrong line of thinking when one of the local denizens, bob235, asked me a crucial question:

so when you install other services (e.g. webservers) they don't start up automatically? any idea how they manage to do that?

Right. Good call. Why, when I install Apache via Puppet, doesn't it fail? Well, it's because Puppet checks to see whether Apache is running before trying to start it, by running '/etc/init.d/apache2 status' and learning that the service is already running. So why wasn't that happening here?
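
(For the curious, that check is just the normal behavior of Puppet's service type when the init script has a working status command - roughly the equivalent of declaring something like this. This is a sketch, not the actual code from any module:

service { 'apache2':
  ensure    => running,
  hasstatus => true,  # trust '/etc/init.d/apache2 status' to report the state
}

If the status command exits zero, Puppet considers the service running and doesn't try to start it again.)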

I booted a generic vagrant instance, and manually installed rabbitmq using apt-get. It ran after installation, as I expected it would. Running '/etc/init.d/rabbitmq-server status' showed that it was running:

root@precise64:~# /etc/init.d/rabbitmq-server status
Status of node rabbit@precise64 ...
[{pid,2284},
 {running_applications,[{rabbit,"RabbitMQ","2.7.1"},
                        {mnesia,"MNESIA  CXC 138 12","4.5"},
                        {os_mon,"CPO  CXC 138 46","2.2.7"},
                        {sasl,"SASL  CXC 138 11","2.1.10"},
                        {stdlib,"ERTS  CXC 138 10","1.17.5"},
                        {kernel,"ERTS  CXC 138 10","2.14.5"}]},
 {os,{unix,linux}},
 {erlang_version,"Erlang R14B04 (erts-5.8.5) [source] [64-bit] 
[smp:2:2] [rq:2] [async-threads:30] [kernel-poll:true]\n"},
 {memory,[{total,24525920},
          {processes,9676480},
          {processes_used,9672808},
          {system,14849440},
          {atom,1124441},
          {atom_used,1120225},
          {binary,92152},
          {code,11134417},
          {ets,733392}]},
 {vm_memory_high_watermark,0.3999999984343938},
 {vm_memory_limit,153295257}]
...done.

Then I killed the vagrant instance, turned on the puppet rabbitmq manifest, and started it up again. Puppet failed during the run because it couldn't start the service, just as it had before. So I connected in and ran '/etc/init.d/rabbitmq-server status', and I got this:

root@precise64:~# /etc/init.d/rabbitmq-server status 
Status of node precise64@precise64 ...
Error: unable to connect to node precise64@precise64: nodedown

DIAGNOSTICS
===========

nodes in question: [precise64@precise64]

hosts, their running nodes and ports:
- precise64: [{rabbit,45074},{rabbitmqctl2835,51120}]

current node details:
- node name: rabbitmqctl2835@precise64
- home dir: /var/lib/rabbitmq
- cookie hash: ovNnSahEs2CWYKS80bcf5w==

Yeah, it definitely doesn't see it running. But wait, did you catch it? There's a subtle difference.

In the output from installing it manually, the first line was "Status of node rabbit@precise64", but after the puppet run, it was "precise64@precise64". Puppet is changing the node name on disk, but the running instance was started under the original name!

I edited /etc/rabbitmq/rabbitmq-env.conf to change the node name back to 'rabbit', then ran /etc/init.d/rabbitmq-server status, and sure enough, it showed identical output to when I installed it manually.
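
On my box, that edit amounted to a line roughly like this (I'm paraphrasing the file; note that rabbitmq-env.conf conventionally drops the RABBITMQ_ prefix from variable names):

# /etc/rabbitmq/rabbitmq-env.conf
NODENAME=rabbit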

To finish it off, I edited the puppet config to change the node name back to 'rabbit', then restarted the vagrant box. It started perfectly:

[Screenshot: the Puppet run completes and the rabbitmq-server service starts cleanly]

End result: If you want to use the RabbitMQ module from puppetlabs, my advice is to not change the node name on the server from 'rabbit'.
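
For reference, my class declaration from the top of the post ended up looking roughly like this, with the node name put back:

class { 'rabbitmq':
  port                  => '5672',
  service_manage        => true,
  environment_variables => {
    'RABBITMQ_NODENAME'    => 'rabbit',
    'RABBITMQ_SERVICENAME' => 'rabbitMQ',
  },
}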