September 24, 2010
Yeah yeah, I know, Facebook is a waste of time, not worth your consideration, a blight upon the landscape...right. I'm not going to argue your feelings about that. What I will say is that it is an absolutely gigantic, huge, immense, gargantuan infrastructure that has been engineered to death to make sure that it never dies or becomes unavailable. Yesterday, it was down for over two hours.
Depending on your perspective, that's not a lot of time. Or it's huge. According to this postmortem report, that's the worst outage they've had in over four years. If it turns out to be their only downtime this year, that puts them at around 99.98% uptime.
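For anyone who wants to check that kind of figure against their own outages, here is a rough sketch of the arithmetic in Python. It assumes a 365-day year and that the outage was the only downtime in that year. The function names are made up for illustration, and the 2.5-hour input is only a stand-in for "over two hours", not a number from the postmortem.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def uptime_percent(downtime_hours: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return availability over the period as a percentage."""
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime must be between 0 and the period length")
    return 100.0 * (1 - downtime_hours / period_hours)


def downtime_budget_hours(target_percent: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return how many hours of downtime a given uptime target allows."""
    return period_hours * (1 - target_percent / 100.0)


# Exactly two hours down, and nothing else all year.
print(f"{uptime_percent(2.0):.3f}%")                # 99.977%
# Assumed stand-in for "over two hours".
print(f"{uptime_percent(2.5):.3f}%")                # 99.971%
# Downtime a 99.98% target leaves you per year.
print(f"{downtime_budget_hours(99.98):.2f} hours")  # 1.75 hours
```

So a single outage of a bit over two hours lands somewhere between 99.97% and 99.98% for the year, which is why "around" is doing some work in that number.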
The actual cause of the outage (a complicated master / slave configuration verification system issue) isn't of concern to me, because I don't run anything like that. Of more interest are the magnitude of the response (shutting down the entire system), the prevention of future occurrences (disabling the faulty system and engineering a new solution), and the aftermath (publicly releasing information about the cause and the response).
I doubt seriously whether the public cares what causes my outages, but I know that my managers, and the people they're responsible to, do. Watching the "big boys" and seeing that they respond openly and with candor should be a reminder to us to document our issues, be forthcoming with the people who are impacted, and be open to improving the design of our systems when things don't work.