Facebook and their response to an outage

Yeah yeah, I know, Facebook is a waste of time, not worth your consideration, a blight upon the landscape…right. I’m not going to argue your feelings about that. What I will say is that it is an absolutely gigantic, huge, immense, gargantuan infrastructure that has been engineered to death to make sure that it never dies or becomes unavailable. Yesterday, it was down for over two hours.

Depending on your perspective, that’s not a lot of time. Or it’s huge. According to this postmortem report, that’s the biggest downtime issue they’ve had in over four years. That puts them at around 99.98% uptime for the year.
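For anyone who wants to check that number, here is a minimal sketch of the arithmetic. It assumes the outage was the only downtime in a 365-day year, and since "over two hours" isn't an exact figure, the 2.0 and 2.5 hour inputs are illustrative guesses rather than numbers from the postmortem. The function name `uptime_percent` is made up for this example.

```python
# Rough uptime arithmetic, assuming a single outage in a 365-day year.
HOURS_PER_YEAR = 365 * 24  # 8760 hours


def uptime_percent(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Return uptime as a percentage of the period."""
    return 100.0 * (1 - downtime_hours / period_hours)


# Illustrative downtime values (assumptions, not figures from the report).
for hours in (2.0, 2.5):
    print(f"{hours} hours down -> {uptime_percent(hours):.3f}% uptime")
```

Either way you land a little under 99.98%, so "around 99.98%" holds up as long as nothing else went down during the year.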

The actual cause of the outage (a complicated master/slave configuration verification system issue) doesn’t concern me much, because I don’t run anything like that. What interests me more is the magnitude of the response (shutting down the entire system), the prevention of future occurrences (disabling the faulty system and engineering a new solution), and the aftermath (publicly releasing information about the cause and response).

I seriously doubt whether the public cares what causes my outages, but I know that my managers and the people they’re responsible to do. Watching the “big boys” and seeing that they respond openly and with candor should be a reminder to us to document our issues, be forthcoming with the people who are impacted, and be open to improving the design of our systems when things don’t work.

  • Well said. It is surprising that some of the management I have worked for over the many years agree with the need to respond openly, while other management would rather have a response written by a marketing guy highly skilled in “spin”. I have long been in favor of the former, more honest and professional approach.

  • Google goes through a similar open process with their outages (at least related to their App Engine) which is refreshing to see.

    Here is an example from their February 2010 outage which made a lot of news:

  • Anthony

    Sometimes the ‘honest’ answer is actually the best marketing SPIN anyway.

    I’ll never forget when I worked for a large company known by its initials and we had a design flaw in one of the hard disks. Rather than continue to replace these defective units with the same flawed model, they started replacing them with larger (though newer and still cheaper) models.

    However, because the customer had paid for a drive of a particular size, they engineered the new drives with a jumper that would cobble it down to the smaller size, and they pretended they were a new model of the smaller drive.

    This actually made sense to someone.

    So instead of saying “Here’s a new larger capacity drive that won’t have the problem, enjoy the extra space with our apologies for the inconvenience” they spent a fortune cobbling the drives, and all they got was a bunch of customers saying “How do I know this one won’t have a problem too?”

  • @Ken – Thanks! And I feel the same way. Be honest about your failures and people will value your honesty.

    @Doug – That’s a great link! Thanks!

    @Anthony – Ouch. Some people intentionally shoot themselves in the foot!

  • @Anthony – Having shared the same employer with you for a long time, I have to admit that is their “style”, as old-school as that particular thinking is. To the client it’s always “You can’t handle the truth”, and since they still charge the client based on the size of the resources consumed (specifically allotted by SLA and contract), they won’t soon be giving the store away for goodwill reasons. So cobble the drives they will, to meet the contract.

    Welcome to the cowardly old world! ;)

  • @Anthony: Were those disks going to be used in some mission-critical system? I’m asking because, as innocent as adding a disk with a different capacity might seem, software is software, and it has bugs. Some systems are designed to work under specific assumptions, and a specific disk size could be one of those. The vendor in this case can’t tell its customers “hey, I’m shipping your disks, they’re built with different materials and also have a new firmware which we really are not sure your system will like… but they are bigger and that’s good, right?” I’m just trying to make the point that these changes aren’t a big deal in smaller systems but can be a pain in something like a system that processes millions of transactions an hour.

    Lying to customers is never acceptable. They should have shipped the same model with the flaw fixed, no matter what it takes. But we often see that disk vendors don’t care about that (as your story shows)… we buy only enterprise-grade disks and they still don’t care… I guess that’s why big storage vendors overcharge for their disks.