
The silver lining of Amazon’s cloud issues

Sorry for the cloud pun in the headline, I just couldn’t help myself.

If you’re a somewhat active user of the internet, you’ve probably noticed that Amazon’s cloud service started having some problems yesterday. I thought about writing this blog entry then, but I held off. There are blips in every service, and I figured that this was just one of them. Today is the second day of the “blip” though, and some people are losing real money because of this.

If you’re curious, there is a status page for the Amazon Web Services, and you can see how things are going there. As I write this, 6 of their 29 AWS offerings are down or degraded, all in the East Coast (Virginia) location. Here’s what the 8am update yesterday said:

A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it’s difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We’re starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.

An update today basically said that the recovery wasn’t happening as quickly as they anticipated, and that they’re still running all-hands-on-deck (which, at this point, 36 hours after the incident, I really hope is not true).

A lot of people are going to use this as an indictment of cloud computing in general. Unfortunately, a lot of people are going to listen to them.

Here’s the truth: cloud computing, in and of itself, is not a flawed solution. You outsource your infrastructure to people you trust because you (hopefully) did your due diligence and decided that the cost of building your own infrastructure, and of keeping fine-grained control over its performance, wasn’t worth trading away the ability to expand as fluidly as your chosen cloud provider allows.

Assuming your workload and legal situation allow you to use someone else’s resources, there’s nothing wrong with doing so. Yes, you relinquish control and responsibility, but you almost always do that anyway. What are you doing when you buy gold support on your enterprise hardware? You’re paying someone else to take responsibility for your hardware, and you’re treating the SLA they offer as a guarantee. It isn’t. Of course, there are financial ramifications if they break their SLA, but the verbiage in the contract has no bearing on reality.

And now, that’s the state that some AWS customers are in. Depending on how they calculate it, this 36-hour downtime brings their availability down to around 99.5%. “Two Nines” isn’t something I’ve heard a lot of people bragging about.
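For the curious, the back-of-the-envelope math looks something like this (a sketch, assuming the measurement window is one calendar year):

```python
# Back-of-the-envelope availability: 36 hours of downtime measured
# against one calendar year of service.
hours_per_year = 24 * 365      # 8760
downtime_hours = 36

availability = 1 - downtime_hours / hours_per_year
print(f"{availability:.4%}")   # -> 99.5890%
```

Measure over a shorter window, like a single month, and the same 36 hours looks far worse.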

So, what is the actual problem? Yes, for Amazon it was a re-mirroring issue, but I mean: what is the problem for the people who are down? The problem is that they saw a solution (cloud computing) and treated it like a panacea. They didn’t treat it like what it actually is: an infrastructure platform.

If I told you that I was going to take your website that makes you $10 million a year and put it in a cage at a Tier 4 colocation facility and run it from there, what would you think?

Well, if you were paranoid enough, you’d probably think, “Hey, great, but what happens when that colocation facility has an unplanned outage?”, because you know that the SLA doesn’t actually guarantee reality. You know about things like car crashes, drunk employees, and good old-fashioned giant fires. Things can happen. So how do you get around it?

Well, with a datacenter, the answer is pretty simple…if it’s worth it to you, you get another one. (At some point in the future, I want to discuss the Economies of Scaling Out, but that’s another blog post).

There’s no reason that the same mentality can’t work with cloud computing. In fact, it’s even easier and cheaper.

If you decide to move into the cloud, don’t pick the best cloud provider and go with them; pick the best TWO cloud providers and go with both of them. There’s no reason whatsoever that you have to continually mirror your entire running infrastructure at both sites. Just run a job that moves your data over to the second provider and keeps it current, make every infrastructure change in a managed fashion (through a configuration management solution!), and mirror the configs (or the config management repository itself) to the other cloud provider.
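As a rough illustration (not a prescription), the mirroring piece can be as small as a scheduled job that pushes your config management tree and data dumps to the standby provider. The hostnames and paths below are hypothetical:

```python
#!/usr/bin/env python3
"""Keep a warm copy of configs and data at a second cloud provider.

A minimal sketch: the hostnames and paths are hypothetical, and in real
life you'd probably drive this from your config management tool rather
than a standalone cron script.
"""
import subprocess
import sys

# What gets mirrored to the standby provider (hypothetical paths).
SYNC_JOBS = [
    ("/etc/puppet/",     "standby.example.com:/etc/puppet/"),
    ("/var/backups/db/", "standby.example.com:/var/backups/db/"),
]


def mirror(source: str, destination: str) -> None:
    """Push one directory tree to the standby site with rsync over ssh."""
    subprocess.run(
        ["rsync", "-az", "--delete", source, destination],
        check=True,
    )


if __name__ == "__main__":
    for src, dst in SYNC_JOBS:
        try:
            mirror(src, dst)
        except subprocess.CalledProcessError as err:
            # A failed sync means the standby copy is going stale -- page someone.
            sys.exit(f"mirror of {src} failed: {err}")
```

Run it from cron every few minutes and the standby copy is never more than one sync interval behind.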

This way, when your primary cloud provider fails, it only takes you as long as it takes to switch DNS and turn up the second infrastructure (which happens on the fly, again, because you’re in the cloud). If you’re hosted in the Amazon cloud and you have your secondary site in the Rackspace cloud (for instance; that’s not an endorsement, I just know that they have one), your Rackspace bill stays low except when you need it the most, when it functions as an emergency buoy to keep your business afloat.
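And the switchover itself can be sketched in a few lines, assuming a simple health check and a DNS record you control. The URLs and hostnames here are hypothetical, and update_dns() is a stand-in for whatever API your DNS provider actually exposes:

```python
#!/usr/bin/env python3
"""Fail over to the standby provider when the primary stops answering.

A sketch of the idea, not a drop-in tool: the URLs and hostnames are
hypothetical, and update_dns() is a placeholder for whatever API your
DNS provider actually exposes.
"""
import time
import urllib.error
import urllib.request

PRIMARY_CHECK_URL = "https://www.example.com/healthz"  # hypothetical
FAILURES_BEFORE_FLIP = 3


def primary_is_healthy(timeout: float = 5.0) -> bool:
    """Return True if the primary site answers its health check."""
    try:
        with urllib.request.urlopen(PRIMARY_CHECK_URL, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def update_dns(record: str, new_target: str) -> None:
    """Placeholder: call your DNS provider's API to repoint the record."""
    raise NotImplementedError("wire this up to your DNS provider")


if __name__ == "__main__":
    failures = 0
    for _ in range(FAILURES_BEFORE_FLIP):
        if primary_is_healthy():
            break
        failures += 1
        time.sleep(30)  # don't flip DNS over a single blip
    if failures == FAILURES_BEFORE_FLIP:
        # Primary looks dead: repoint traffic at the standby site, where the
        # mirrored configs and data are waiting to be turned up.
        update_dns("www.example.com", "standby-lb.example.net")
```

One obvious design choice: run the watcher from somewhere outside both providers, so it doesn’t go down with the thing it’s watching.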

The takeaway from this is that you shouldn’t let this Amazon issue sour you on the idea of hosting in the cloud, if that paradigm is right for you, but you absolutely have to treat it like what it is, and not what they tell you it is. It’s a computing platform, and it needs to have a controlled failure mode for your business, just like everything else.

Admin Heroics

You know, 99% of the time, we have a pretty boring job. Sometimes we get to work on interesting problems, or maybe a system goes down, but for the most part, it’s pretty mundane.

Sometimes, though, we get called on to do relative heroics. Before I was even an admin, I did tech support for an ISP in West Virginia. Once, the mail server went down hard. 20,000 people around the state suddenly had no email, and neither of the two administrators could be reached. I was the only guy in the office who knew Linux, and it just so happened that I had the root password to that server because I’d helped the younger admin with something a few weeks earlier.

I reluctantly agreed to take a look at the thing. Having never touched QMail (ugh), I delved into the problem. Numerous searches led me to conclude that a patch would (probably?) fix the problem. I explained that to my bosses, and told them I had never done anything like this before, but that I thought I might be able to do it without wrecking the server.

They gave me the go-ahead, since we still couldn’t contact either admin and the call queue was flooded with people complaining. I printed out the instructions for the patch, downloaded it to the mail server, and applied it, following the instructions as closely as I could manage. Then I started the software. It appeared to run, and testing showed that mail was indeed back up.

I was a hero. At least until the next day when the main admin got back. Then my root access was taken away. Jerk.

Sometimes we’re called upon to extend beyond our zone of comfort, to do things that are beyond our skill levels, and to perform heroics under dire circumstances. These are the things that make us better admins: learning to deal with the kind of pressure that comes when 20,000 people’s programs aren’t working and it’s up to you, or when the power is out and $18 billion worth of financial reports aren’t getting published and only you can fix it. Maybe it’s that your biggest (or only) client had a catastrophe and you’re the one handed the shovel. Whatever it is, it’s alright to think of yourself as a hero.

Because that’s what you are.