The silver lining of Amazon’s cloud issues

Sorry for the cloud pun in the headline, I just couldn’t help myself.

If you’re a somewhat active user of the internet, you’ve probably noticed that Amazon’s cloud service started having some problems yesterday. I thought about writing this blog entry then, but I held off. There are blips in every service, and I figured that this was just one of them. Today is the second day of the “blip” though, and some people are losing real money because of this.

If you’re curious, there is a status page for Amazon Web Services where you can see how things are going. As I write this, 6 of their 29 AWS offerings are down or degraded, all in the US East (Northern Virginia) location. Here’s what yesterday’s 8am update said:

A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it’s difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We’re starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.

An update today basically said that the recovery wasn’t happening as quickly as they anticipated, and they’re still running all-hands-on-deck (which at this point, 36 hours after the incident, I really hope is not true).

A lot of people are going to use this as an indictment of cloud computing in general. Unfortunately, a lot of people are going to listen to them.

Here’s the truth. Cloud computing, in and of itself, is not a flawed solution. You outsource your infrastructure to people that you trust because you (hopefully) did due diligence and determined that the costs of building your own infrastructure and keeping fine grained performance control would not be an appropriate tradeoff for the ability to expand as fluidly as your chosen cloud provider allowed.

Assuming your workload and legal situation allows you to use someone else’s resources, then there’s nothing wrong with doing that. Yes, you relinquish control and responsibility, but you almost always do that, anyway. What are you doing when you buy gold support on your enterprise hardware? You’re paying someone else to have responsibility for your hardware, and you’re considering the SLA that they offer a guarantee. It isn’t. Of course, there are financial ramifications if they break their SLA, but the verbiage in the contract has no bearing on reality.

And now, that’s the state that some AWS customers are in. Depending on how they calculate it, this 36-hour downtime brings their availability for the year down to around 99.5%. “Two nines” isn’t something I’ve heard a lot of people bragging about.
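
The arithmetic behind that figure is simple, assuming availability is measured over a full calendar year:

```python
# Availability after 36 hours of downtime, measured against a 365-day year
hours_per_year = 24 * 365            # 8760 hours
downtime_hours = 36
availability = 100 * (1 - downtime_hours / hours_per_year)
print(round(availability, 2))        # roughly 99.59
```

For contrast, a 99.99% (“four nines”) target allows for under an hour of downtime in an entire year.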

So, what is the actual problem? Yes, for Amazon, it was a mirror issue, but I mean what is the problem for the people who are down? The problem is that they saw a solution (cloud computing), and they treated it like it was a panacea. They didn’t treat it like what it was, which is an infrastructure platform.

If I told you that I was going to take your website that makes you $10m a year and put it in a cage at a Tier 4 colocation facility and run it from there, what would you think?

Well, if you were paranoid enough, you’d probably think, “Hey, great, but what happens when that colocation facility has an unplanned outage?”, because you know that the SLA doesn’t actually guarantee reality. You know about things like car crashes, drunk employees, and good old-fashioned giant fires. Things can happen. So how do you get around it?

Well, with a datacenter, the answer is pretty simple…if it’s worth it to you, you get another one. (At some point in the future, I want to discuss the Economies of Scaling Out, but that’s another blog post).

There’s no reason that the same mentality can’t work with cloud computing. In fact, it’s even easier and cheaper.

If you decide to move into the cloud, don’t pick the best cloud provider and go with them; pick the best TWO cloud providers, and go with both of them. There’s no reason whatsoever that you have to continually mirror your entire infrastructure at both sites. Instead, run one infrastructure that moves your data over to the second provider and keeps it there. Whenever you make changes to your infrastructure, do it in a managed fashion (with a configuration management solution!), and mirror the configs (or the config management setup itself) to the other cloud provider.
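
As a sketch, “mirror your configs, not your whole infrastructure” is just a copy step that runs after every managed change. The function below is a minimal illustration using local directory trees as stand-ins for the two providers’ storage; in practice the copy would be an rsync, a push to a repository, or a call to the secondary provider’s API.

```python
import shutil
from pathlib import Path

def mirror_configs(primary, secondaries):
    """Copy every file under the primary config tree into each secondary tree.

    A stand-in for "whenever you change your infrastructure, push the same
    configs (or config-management data) to your second cloud provider".
    """
    primary = Path(primary)
    for secondary in secondaries:
        secondary = Path(secondary)
        for src in primary.rglob("*"):
            if src.is_file():
                dst = secondary / src.relative_to(primary)
                dst.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(src, dst)  # overwrite with the latest version
```

Hooked into a post-deploy step, something like this keeps the standby provider ready to boot from current configs without paying to run a full duplicate environment.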

This way, when your primary cloud provider fails, it only takes you long enough to switch DNS and turn up the second infrastructure (which happens on the fly, again, because you’re in the cloud). If you’re hosted in the Amazon cloud, and you have your secondary site in the Rackspace cloud (for instance, that’s not an endorsement, I just know that they have one), your Rackspace bill is constantly low except when you need it the most, when it functions as an emergency buoy to keep your business afloat.
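
The decision logic of that failover is small enough to sketch. Everything here is invented for illustration (the addresses, the health check); a real version would probe an actual endpoint and update the record through your DNS provider’s API, with a low TTL so clients pick up the switch quickly.

```python
def pick_active_ip(primary_ip, standby_ip, is_healthy):
    """Return the address the public DNS record should point at.

    `is_healthy` is a callable that probes an endpoint and returns a bool;
    in production it would be an HTTP check against the primary site.
    """
    # Normal operation: DNS stays on the primary provider. When the
    # health check fails, the record flips to the standby provider.
    return primary_ip if is_healthy(primary_ip) else standby_ip
```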

The takeaway from this is that you shouldn’t let this Amazon issue sour you on the idea of hosting in the cloud, if that paradigm is right for you, but you absolutely have to treat it like what it is, and not what they tell you it is. It’s a computing platform, and it needs to have a controlled failure mode for your business, just like everything else.

  • I haven’t done much in the cloud, but doesn’t every cloud have its own way of doing things / tools? As I understand it, automating all that stuff is the big time sink. If you have redundant colos, a SysAdmin can SSH to one just as easily as another, and manage machines. With clouds you have to be competent with two different infrastructure management systems. In an emergency.

    Yes, you’ll conduct fire drills and all, but as far as I know, juggling two cloud providers takes more tech savvy than juggling two DC providers. Hopefully tools will standardize in the future, if they aren’t doing so already. Hopefully my uninformed assessment of the difficulty is overblown.

    And uh, you could make sure your Amazon cloud infrastructure is at least geographically distributed, neh?


  • Hi Daniel,

    Yes, each cloud provider has its own API for creating new VMs, changing resources, and so on, but those are all infrastructure-level things; they aren’t at the OS level or higher. A solid configuration management solution will take a vanilla machine and bend it precisely to your goal.

    That being said, there are even solutions for abstracting away from one cloud provider.

    And yeah, Amazon provides multiple service levels at each site, and has three sites, so the people who were in the worst shape were those who didn’t take advantage of that fact.

    Disaster Recovery is the art of asking “What if…” and having answers.

  • Thanks for the sanity, Matt. I’m a guy in an IT shop that makes under $1M / year, and we use the cloud, Tier 4 data centers, and managed hosting. When revenue is tight and you are starting out you need to have Good Enough solutions and mitigated risk, not Top of the Line solutions and some imaginary zero risk.

  • One day I hope to be able to work in an IT production environment where the SLAs are considered “Good Enough”. In 15 years, it’s been 3-9s-5 (99.95% availability), and just about 2-3 years ago it became 4-9s (99.99% availability). It’s no wonder I am prematurely gray.

    15 minutes after the beginning of an outage, the issue gets escalated, and every 15 minutes after that it goes one step higher until it reaches the President, or the CEO…

    So you can understand my suspicion of “good enough” – it has never been in my experience. Maybe someday it will be – just not today. :/

  • Rick, I hear that. I don’t mean to suggest anything lackadaisical about the way we go about things; far from it. I do mean to suggest that when I quote out the prices of the various ways of getting more reliability, I have had good luck in finding a solution that provides some degree of risk mitigation at a cost the client (internal or external) thought was fair. Clearly, though, every company’s culture is different, and I’m lucky to be working in a small place where I talk to the CEO every day. Your company is probably much more sensitive about downtime (and probably much, much larger!)

    What we’ve come down to is this: the web application for the big quasi-government client? Real protection. I estimate that in a complete server failure, I can have them up and running again in 15 minutes with no data loss (We could have made it tighter, but at a much higher cost.) For my company’s website? We keep an offsite backup. If we’re down for a day, nobody is going to die.

    It all comes down to what’s necessary. If you can convince the bosses that they can live without the company blog for a day, your price of operations goes down. Some bosses prefer that price.

  • Great post. One aspect to mention is that one could easily have mirrored between regions within the Amazon platform as well.

  • Flemming Jacobsen

    Personally I think cloud is just another fancy buzzword for something that has coincidentally existed for many years before it was “brought about”.
