December 5, 2011
If you follow me on Twitter, you might have seen an interesting occurrence on Friday…
I was whining.
My server is hosted at prgmr.com, and it’s honestly a pretty decent VPS. I’ve got 2 vCPUs, 1GB of memory, and it keeps up with my normal amount of traffic: somewhere in the ballpark of 250-1000 hits, which usually means something like 2,500 page views on the upper end.
All put together, my server handles most traffic spikes well, although I did have some consternation in September, when my BoothBabes post went viral. I got 50,000 pageviews in a single weekend, which was enough to crush the server.
On Friday of last week, the “50,000 page views” record didn’t just get broken; it got crushed. Mangled. Stomped on repeatedly. Friday was a “perfect storm” of “kill the server”. Here’s how it went down:
Thursday afternoon, Reddit user algrym submits my CPU Act post to /r/sysadmin. It eventually grows to a score of over 90 (which is good for that subreddit, and ensures that everyone who subscribes to /r/sysadmin will see it, which is around 11,500 people).
Nearly simultaneously, users grauenwolf and rsantoro submit the story to /r/programming and /r/technology, which have a combined subscriber count of almost 850,000. Together, the posts receive over 1,500 upvotes, ensuring that the posts go not just to the top of their respective pages, but also to the front page of Reddit itself, which sees God-only-knows how much traffic.
Apparently, someone liked it enough to submit it to Slashdot, where it got over a thousand comments and made the front page as well.
If it goes on Slashdot and it’s interesting, you can bet that Fark will cover it as well, and sure enough, it hit the front page of Fark.
Somewhere in the mix, it also got covered by Ars Technica, along with being spread around Facebook by thousands of people.
So where does that leave us?
The sum total is pretty staggering. In terms of raw “hits”, on Friday alone, I got 51,025. Now, those are basically visits, and don’t count page views or the number of actual requests made of the web server. In terms of page views, according to CloudFlare, I got 297,649, which I have trouble fathoming. And if you ever wondered why you might want to use a CDN, here’s the statistic that sells it. Total requests, that is, for dynamic AND static content, were 4,638,389…and the best part was that CloudFlare handled 4,235,891 of those (about 91%). Which is a damned good thing.
As it was, I got an average of more than one request per second on Friday, and at peak, it took over 20 seconds to render the page…so you can do the math there. There were a lot of requests that just didn’t get served, even with TotalCache.
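For anyone who does want to do the math, here is a rough sketch in Python using only the figures quoted above. It rests on two assumptions that are not in CloudFlare's numbers: that traffic was spread evenly across the 24 hours (it certainly wasn't, so the real peak was worse), and that every page view needs one dynamic render. The function name and dictionary keys are made up for illustration.

```python
# Figures quoted in the post for Friday.
TOTAL_REQUESTS = 4_638_389   # dynamic + static requests
SERVED_BY_CDN = 4_235_891    # requests CloudFlare handled
PAGE_VIEWS = 297_649         # page views reported by CloudFlare
PEAK_RENDER_SECONDS = 20     # "over 20 seconds" to render at peak
SECONDS_PER_DAY = 24 * 60 * 60


def traffic_summary():
    """Back-of-the-envelope averages; assumes traffic was uniform all day."""
    origin_requests = TOTAL_REQUESTS - SERVED_BY_CDN
    page_views_per_second = PAGE_VIEWS / SECONDS_PER_DAY
    return {
        # Requests that still reached the VPS.
        "origin_requests": origin_requests,
        # Fraction of all requests absorbed by the CDN.
        "cdn_offload": SERVED_BY_CDN / TOTAL_REQUESTS,
        "origin_requests_per_second": origin_requests / SECONDS_PER_DAY,
        "page_views_per_second": page_views_per_second,
        # Little's law: requests in flight = arrival rate * time per request.
        # Assumes one dynamic render per page view.
        "renders_in_flight": page_views_per_second * PEAK_RENDER_SECONDS,
    }


if __name__ == "__main__":
    for name, value in traffic_summary().items():
        print(f"{name}: {value:,.2f}")
```

Even with those generous averages, that works out to roughly 3.4 page views arriving per second while each one takes 20 seconds to render, so something like 69 renders would have to be in flight at once on a 2 vCPU, 1GB box. That is why a lot of requests never got served.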
On the advice of several twitterers (particularly @jpluscplusm), I actually took the blog down for a little while in order to try to set up Varnish, but I pulled the plug and rolled back changes after a half hour or so – serving a certain percentage of requests was better than serving no requests. Varnish is something that I will be implementing one of these weekends, though.
I also want to thank some other people who helped me out very much. The founder and current CEO of CloudFlare, Matthew Prince, went the extra mile in getting the engineers wrangled and more actively caching the page that was getting slammed.
I want to thank all of you who talked with me and offered your resources during the time. If I didn’t take you up on the offers, it was only because I didn’t want to make the service more unavailable during the transition than it already was – but know that I really do appreciate the offers and the concern. Thank you.
So, what could I have done better? Tons.
Firstly, Varnish will be put in place. I don’t know of a way to load-test against 500,000 requests (if you do, please comment!), but I can at least watch performance metrics and compare them to how the site scales now.
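A single machine won't reproduce 500,000 requests from all over the internet, but a small script can at least give before-and-after numbers for a change like Varnish. The following is a minimal sketch using only the Python standard library; `timed_get` and `load_test` are hypothetical names, and it should only ever be pointed at a server you own.

```python
import http.client
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def timed_get(url, timeout=30):
    """Fetch `url` once; return elapsed seconds, or None if the request failed."""
    started = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return None
    return time.perf_counter() - started


def load_test(url, total_requests, concurrency, get=timed_get):
    """Send `total_requests` GETs, at most `concurrency` at a time, and summarize."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(get, [url] * total_requests))
    timings = sorted(t for t in results if t is not None)
    return {
        "ok": len(timings),
        "failed": total_requests - len(timings),
        "median_seconds": timings[len(timings) // 2] if timings else None,
        "slowest_seconds": timings[-1] if timings else None,
    }
```

Something like `load_test("http://example.com/", total_requests=500, concurrency=50)` (placeholder URL and numbers) run before and after a caching change gives a crude comparison of failures and response times. The `get` parameter exists so the timing function can be swapped out, for example with a stub when trying the script without touching the network.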
Speaking of, I’m going to start collecting performance metrics. Right now, the page loads entirely in 4.95 seconds. I know that because I have Firebug installed in Firefox, and I just checked it. That’s the kind of stat I should be monitoring and recording so I can watch it over time. So I’m going to start doing things like that, too.
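As a starting point for that kind of record keeping, here is a minimal sketch that times one fetch of a page and appends the result to a CSV file. It is not equivalent to the Firebug number: it only measures downloading the HTML, not the images, scripts and stylesheets a browser also loads. The function names, the CSV layout and the file name in the usage note are assumptions for illustration.

```python
import csv
import time
import urllib.request
from datetime import datetime, timezone


def fetch_body(url, timeout=30):
    """Download `url` and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def record_sample(url, csv_path, fetch=fetch_body):
    """Time one fetch of `url` and append timestamp, url, seconds, bytes to a CSV."""
    started = time.perf_counter()
    body = fetch(url)
    elapsed = time.perf_counter() - started
    row = [
        datetime.now(timezone.utc).isoformat(),  # when the sample was taken (UTC)
        url,
        f"{elapsed:.3f}",                        # fetch time in seconds
        len(body),                               # size of the HTML in bytes
    ]
    with open(csv_path, "a", newline="") as handle:
        csv.writer(handle).writerow(row)
    return elapsed
```

Calling `record_sample("http://example.com/", "page_timings.csv")` from a scheduled job every few minutes builds up a history that can be graphed later, which is enough to see whether a change made things faster or slower.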
Anyway, that’s enough post mortem for a blog server, I suppose. Thanks to all of my readers (old and new). Hopefully we’ll have that problem again in the future!