Tag Archives: redundancy

Eventual regulation of system administration?

I was asked recently whether I thought that, eventually, System Administration would require regulation, similar to how engineering and medicine require regulation.

This isn’t an easy question to answer, even though I think about it quite a lot. I think the right answer (as much as any answer can be “right”) is that yes, some of us may eventually hold positions that need to be regulated, and in my opinion, that’s for the best. Here’s my answer:

Yes, some regulation is absolutely necessary in certain segments of the industry.

There is a very good (but very hard to read) book called Risk Society written by Ulrich Beck that caused something of a paradigm shift in the engineering mindset in the 90s.

To oversimplify: society (and the world it exists in) has become so complex that you cannot engineer risk out of the equation.

This idea is supported by the findings of people like Sidney Dekker in The Field Guide to Understanding Human Error, who performs what could be considered root cause analysis of surgical and aeronautical accidents. The systems he deals with are now so complex that there is no single root cause, because failure is an inherent operational condition of the environment. In other words, asking why something failed is exactly like asking why something didn’t fail: each is the end result of an impossibly complex web of interrelationships, all of which culminated in the eventual success (or failure) of the system.

There are a lot of scenarios where the tasks undertaken by system administrators do have life or death consequences, and in order to architect those infrastructures with adequate resiliency, a lot of education is necessary.

The path of many system administrators from amateur to professional resembles that of a child who is exceptionally gifted at building Erector sets being hired to construct a pedestrian bridge. Then, if the bridge doesn’t fall down, the kid gets to build bridges designed to carry interstate traffic.

I don’t write this to disparage the upwardly mobile system administrator who has learned on the job, acquired a high skill level, and is successful in the systems that they engineer. Someone who does that should be justly proud.

When you start considering the potential loss of human life in such a system, however, you start to realize that “best effort” learning isn’t enough, particularly when there is no test to establish a safe knowledge level.

Why should you require a degree in civil engineering to design and implement a traffic control system, then not require the slightest test of the people who administer the IT infrastructure that it runs on?

You shouldn’t. I anticipate that in the future, “critical infrastructure” administrators will have certain requirements placed on them for the benefit of everyone who uses the system. The difficult decision will be where to draw the line.

What are your thoughts? Can you see the need to pass a test (or series of tests) to become a “Critical Infrastructure Administrator”?

The silver lining of Amazon’s cloud issues

Sorry for the cloud pun in the headline, I just couldn’t help myself.

If you’re a somewhat active user of the internet, you’ve probably noticed that Amazon’s cloud service started having some problems yesterday. I thought about writing this blog entry then, but I held off. There are blips in every service, and I figured that this was just one of them. Today is the second day of the “blip” though, and some people are losing real money because of this.

If you’re curious, there is a status page for the Amazon Web Services, and you can see how things are going there. As I write this, 6 of their 29 AWS offerings are down or degraded, all in the East Coast (Virginia) location. Here’s what the 8am update yesterday said:

A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it’s difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We’re starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.

An update today basically said that the recovery wasn’t happening as quickly as they anticipated, and they’re still running all-hands-on-deck (which at this point, 36 hours after the incident, I really hope is not true).

A lot of people are going to use this as an indictment of cloud computing in general. Unfortunately, a lot of people are going to listen to them.

Here’s the truth: cloud computing, in and of itself, is not a flawed solution. You outsource your infrastructure to people you trust because you (hopefully) did your due diligence and determined that the cost of building your own infrastructure, and of keeping fine-grained control over its performance, was not a worthwhile tradeoff for the ability to expand as fluidly as your chosen cloud provider allows.

Assuming your workload and legal situation allow you to use someone else’s resources, there’s nothing wrong with doing so. Yes, you relinquish control and responsibility, but you almost always do that anyway. What are you doing when you buy gold support on your enterprise hardware? You’re paying someone else to take responsibility for your hardware, and you’re treating the SLA they offer as a guarantee. It isn’t one. Of course, there are financial ramifications if they break their SLA, but the verbiage in the contract has no bearing on reality.

And now, that’s the state some AWS customers are in. Depending on how they calculate it, this 36-hour downtime brings their availability for the year down to around 99.5%. “Two Nines” isn’t something I’ve heard a lot of people bragging about.
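As a back-of-the-envelope check on that figure (assuming a single 36-hour outage measured against a 365-day year):

```python
# Rough availability math: one 36-hour outage over a one-year window.
HOURS_PER_YEAR = 365 * 24  # 8760

downtime_hours = 36
availability = (HOURS_PER_YEAR - downtime_hours) / HOURS_PER_YEAR
print(f"{availability:.2%}")  # 99.59%
```

Every additional hour of downtime eats into that number, and it assumes no other outages for the rest of the year.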

So, what is the actual problem? Yes, for Amazon it was a re-mirroring issue, but what is the problem for the people who are down? The problem is that they saw a solution (cloud computing) and treated it like a panacea instead of what it actually is: an infrastructure platform.

If I told you that I was going to take your website that makes you $10 million a year and run it from a cage at a Tier 4 colocation facility, what would you think?

Well, if you were paranoid enough, you’d probably think, “Hey, great, but what happens when that colocation has an unplanned outage?”, because you know that the SLA doesn’t actually guarantee reality. You know about things like car crashes (1 and 2), drunk employees, and good old-fashioned giant fires. You know things can happen. So how do you get around it?

Well, with a datacenter, the answer is pretty simple…if it’s worth it to you, you get another one. (At some point in the future, I want to discuss the Economies of Scaling Out, but that’s another blog post).

There’s no reason that the same mentality can’t work with cloud computing. In fact, it’s even easier and cheaper.

If you decide to move into the cloud, don’t pick the best cloud provider and go with them; pick the best TWO cloud providers and go with both of them. You don’t have to continually mirror your entire infrastructure at both sites. Instead, run a minimal footprint at the second provider that receives your data and keeps it current, and whenever you change your infrastructure, do it in a managed fashion (through a configuration management solution!) and mirror the configs (or the config management repository itself) to the other cloud provider.
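As a minimal sketch of that mirroring step (the path and hostname below are hypothetical, and I’m assuming rsync over SSH as the transport; substitute whatever your config management tool prefers):

```python
# Sketch: after every managed change, push the config-management
# repository to the standby provider so the second site can always
# be rebuilt from current definitions. Names here are hypothetical.
import subprocess

def rsync_command(repo_dir: str, standby_host: str) -> list:
    """Build the rsync invocation that mirrors the repo to the standby."""
    return [
        "rsync", "-az", "--delete",
        repo_dir,
        f"root@{standby_host}:{repo_dir}",
    ]

def mirror_configs(repo_dir: str, standby_host: str) -> None:
    # check=True makes a failed sync raise instead of failing silently.
    subprocess.run(rsync_command(repo_dir, standby_host), check=True)
```

Hook something like this into whatever applies your changes, so updating the standby copy happens in the same step as the change itself.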

This way, when your primary cloud provider fails, recovery takes only as long as switching DNS and turning up the second infrastructure (which happens on the fly, again, because you’re in the cloud). If you’re hosted in the Amazon cloud and you keep your secondary site in the Rackspace cloud (for instance; that’s not an endorsement, I just know they have one), your Rackspace bill stays low except when you need it most, when it functions as an emergency buoy to keep your business afloat.
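The failover itself boils down to a health check plus a DNS decision. Here’s a rough sketch of that logic (the URL and addresses are hypothetical, and the actual DNS update depends entirely on your DNS provider’s API, so it isn’t shown):

```python
# Sketch of a failover check: if the primary stops answering, the DNS
# record should point at the standby site instead. Names hypothetical.
import urllib.request

def primary_is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if the primary's health URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def choose_dns_target(primary_up: bool, primary_ip: str, standby_ip: str) -> str:
    """Pick which address the public DNS record should carry."""
    return primary_ip if primary_up else standby_ip
```

Keep your DNS TTLs low ahead of time; a failover decision doesn’t help if clients cache the old record for hours.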

The takeaway is that you shouldn’t let this Amazon issue sour you on hosting in the cloud, if that paradigm is right for you, but you absolutely have to treat the cloud like what it is, not what they tell you it is. It’s a computing platform, and it needs a controlled failure mode for your business, just like everything else.

Progressing towards a true backup site

A while back, I moved our production site into a Tier 4 colocation facility in NJ. Our former primary site became the backup, and the transition went very smoothly.

Now we’re continuing with our plans to centralize the company in the northeastern US. To that end, I’m less than a week away from building out a backup site in another Tier 4 colo, operated by the same company as the primary but located in Philadelphia. This gives us the ability to lease a fast (100Mb/s) line between the two sites on pre-existing fiber. I cannot tell you how excited I am to have that sort of bandwidth and not rely on T1s.
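To put the difference in perspective, here’s a rough comparison of how long a sync would take over the new link versus a T1 at 1.544 Mb/s (the 100 GB figure is just an illustration, and protocol overhead is ignored):

```python
# Rough transfer-time comparison: a 100 Mb/s metro link vs. a T1.
def transfer_hours(gigabytes: float, link_mbps: float) -> float:
    """Hours to move `gigabytes` at `link_mbps` (decimal units, no overhead)."""
    megabits = gigabytes * 8 * 1000  # 1 GB = 8000 Mb
    return megabits / link_mbps / 3600

print(round(transfer_hours(100, 100), 1))    # 2.2 hours on the 100 Mb/s link
print(round(transfer_hours(100, 1.544), 1))  # 143.9 hours (about 6 days) on a T1
```

That’s the difference between a nightly sync being feasible and it being a fantasy.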

The most exciting part of this backup site is that it will use almost exactly the same equipment as the primary site, top to bottom. Back when we were ordering equipment for the primary site, we ordered two Dell PowerEdge 1855 enclosures and twenty 1955 blades to fill them. Our SAN storage at the primary is a Dell-branded EMC AX4-5, and we just bought a second one for the backup site (though the backup site’s array has only a single controller while the primary’s has redundant controllers; we can always purchase another if we need to). We’re using the same load balancer as the primary, and we’ll have the same Juniper NetScreen firewall configuration. Heck, we’re even going to have the same Netgear VPN concentrator. It’s going to be a very good thing.

I don’t know that I’ll have time to create the same sort of rack diagrams as I did before, but I should be able to make an adequate spreadsheet of the various pieces of equipment. When all of the pieces are done and in place, I am going to install RackTables to keep track of what is installed where. I mentioned RackTables before on my Twitter feed and got some very positive feedback, so if you’re looking for software to keep track of your installed hardware, definitely check it out.

The rest of this week will be spent configuring various network devices. I knocked out the storage array on Monday, and two Ethernet switches and the fiber switch yesterday. Today I’ll be doing the NetScreens, one of the routers (the other will be delivered Friday), and the VPN box. Don’t look for extensive updates until next week, when I’ll review the install process.