When it rains, it pours…then it landslides

If you follow me on Twitter, you might have seen an interesting occurrence on Friday…
I was whining.

My server is hosted at prgmr.com, and it’s honestly a pretty decent VPS. I’ve got 2 vCPUs, 1GB of memory, and it keeps up with my normal amount of traffic, which is somewhere in the ballpark of 250-1000 hits. That usually tasks the server with something like 2500 page views on the upper end.

To help out on those high-load days, I’ve enabled WP-SuperCache (thanks to everyone who helped me get that working a while back, by the way!) and I use CloudFlare, a free (or upgrade-for-pay) Content Delivery Network (CDN) that caches all of my static content, such as images and JavaScript files, and serves it from their network. They also do a good job of caching my pages, so if my server dies, the pages can still be available.

All put together, my server handles most traffic spikes well, although I did have some consternation in September, when my BoothBabes post went viral. I got 50,000 pageviews in a single weekend, which was enough to crush the server.

On Friday of last week, the “50,000 page views” record didn’t just get broken; it got crushed. Mangled. Stomped on repeatedly. Friday was a “perfect storm” of “kill the server”. Here’s how it went down:

Thursday afternoon, Reddit user algrym submits my CPU Act post to /r/sysadmin. It eventually grows to a score of over 90 (which is good for that subreddit, and ensures that everyone who subscribes to /r/sysadmin will see it, which is around 11,500 people).

Nearly simultaneously, users grauenwolf and rsantoro submit the story to /r/programming and /r/technology, which have a combined subscriber count of almost 850,000. Together, the posts receive over 1,500 upvotes, ensuring that the posts go not just to the top of their respective pages, but also to the front page of Reddit itself, which sees God-only-knows how much traffic.

Apparently, someone liked it enough to submit it to Slashdot, where it got over a thousand comments and landed on the front page as well.

If it goes on Slashdot and it’s interesting, you can bet that Fark will cover it as well, and sure enough, it hit the front page of Fark.

Somewhere in the mix, it also got covered by Ars Technica, along with being spread around Facebook by thousands of people.

So where does that leave us?

The sum total is pretty staggering. In terms of raw “hits”, on Friday alone, I got 51,025. Now, those are basically visits, and don’t count page views, or the number of actual requests made of a web server. In terms of page views, according to CloudFlare, I got 297,649, which I have trouble fathoming. And if you ever wondered why you might want to use a CDN, here’s the statistic that sells it. Total requests, that is, for dynamic AND static content, came to 4,638,389…and the best part was that CloudFlare handled 4,235,891 of those. Which is a damned good thing.
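To put a number on that, here’s a quick back-of-the-envelope calculation using the two CloudFlare figures above. It assumes that every request CloudFlare didn’t handle was passed through to my server, which is my reading of the numbers rather than something the dashboard spells out.

```python
# Figures reported by CloudFlare for Friday, as quoted above.
total_requests = 4_638_389
served_by_cloudflare = 4_235_891

# Assumption: whatever CloudFlare did not serve reached the origin server.
reached_origin = total_requests - served_by_cloudflare
offload_ratio = served_by_cloudflare / total_requests

print(f"Requests reaching the origin: {reached_origin:,}")
print(f"Share absorbed by the CDN: {offload_ratio:.1%}")
```

Under that assumption, the CDN absorbed a little over 91% of the requests and roughly 400,000 still landed on the VPS.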

As it was, I got an average of more than one request per second on Friday, and at peak, it took over 20 seconds to render the page…so you can do the math there. There were a lot of requests that just didn’t get served, even with WP-SuperCache.
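For the curious, here’s roughly what “the math” looks like. This is only a sketch. It assumes the requests CloudFlare didn’t absorb (the total minus what it handled, from the figures above) all reached the server, were spread evenly over 24 hours, and each took the full 20 seconds. None of that is really true, since traffic came in bursts and many of those requests were for small files, so treat the result as a feel for scale, not a measurement.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Total requests minus the ones CloudFlare handled (figures quoted above).
origin_requests = 4_638_389 - 4_235_891

# Assumption: load spread evenly across the day (real traffic was burstier).
arrival_rate = origin_requests / SECONDS_PER_DAY  # requests per second

# Peak render time mentioned above.
render_seconds = 20

# Little's law: average requests in flight = arrival rate * time in system.
in_flight = arrival_rate * render_seconds

print(f"Average arrival rate: {arrival_rate:.2f} requests/second")
print(f"Requests in flight at 20 s each: {in_flight:.0f}")
```

Even with those generous assumptions, that’s several requests arriving every second at a box that needs 20 seconds to answer each one, so dozens of them pile up at once on a 1GB VPS.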

On the advice of several twitterers (particularly @jpluscplusm), I actually took the blog down for a little while in order to try to set up Varnish, but I pulled the plug and rolled back changes after a half hour or so – serving a certain percentage of requests was better than serving no requests. Varnish is something that I will be implementing one of these weekends, though.

I also want to thank some other people who helped me out very much. The founder and current CEO of CloudFlare, Matthew Prince, went the extra mile in getting the engineers wrangled and more actively caching the page that was getting slammed.

I want to thank all of you who talked with me and offered your resources during that time. If I didn’t take you up on the offers, it was only because I didn’t want to make the service more unavailable during the transition than it already was – but know that I really do appreciate the offers and the concern. Thank you.

So, what could I have done better? Tons.

Firstly, Varnish will be put in place. I don’t know of a way to load-test against 500,000 requests (if you do, please comment!), but I can at least watch performance metrics and compare them to how the site scales now.
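A small-scale version is easy enough to sketch, though. What follows is a minimal sketch, not a real load-testing tool: it fires a fixed number of GET requests at one URL from a pool of threads and reports failures and timings. The names `timed_get` and `load_test`, and the localhost URL in the comment, are made up for illustration, and it should only ever be pointed at a server you own.

```python
import http.client
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def timed_get(url, timeout=30):
    """Fetch url once and return (succeeded, elapsed_seconds)."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
        ok = True
    except (OSError, http.client.HTTPException):
        # Covers connection errors, timeouts, and HTTP error statuses.
        ok = False
    return ok, time.perf_counter() - start


def load_test(url, total_requests=100, concurrency=10, fetch=timed_get):
    """Issue total_requests fetches, at most `concurrency` at a time."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(fetch, [url] * total_requests))
    # Only successful requests contribute to the timing figures.
    times = sorted(elapsed for ok, elapsed in results if ok)
    return {
        "requests": total_requests,
        "failures": sum(1 for ok, _ in results if not ok),
        "median_seconds": times[len(times) // 2] if times else None,
        "slowest_seconds": times[-1] if times else None,
    }


# Example (hypothetical local URL):
# print(load_test("http://localhost:8080/", total_requests=200, concurrency=20))
```

A single machine running this won’t get anywhere near 500,000 requests’ worth of realistic traffic, but comparing the median and slowest times before and after a change gives at least a rough before-and-after picture.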

Speaking of which, I’m going to start collecting performance metrics. Right now, the page loads entirely in 4.95 seconds. I know that because I have Firebug installed in Firefox, and I just checked it. That’s the kind of stat I should be monitoring and recording so I can watch it over time. So I’m going to start doing things like that, too.
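As a starting point, something like the following could be run on a schedule to build up a history. It’s a rough sketch with made-up names (`measure_fetch_seconds`, `record_sample`, and whatever CSV path you pass in), and it only times the download of the HTML itself. That is not the same thing as the full page load Firebug reports, which also includes images, scripts, and rendering.

```python
import csv
import time
import urllib.request
from datetime import datetime, timezone


def measure_fetch_seconds(url, timeout=30):
    """Return how long it takes to download the response body for url."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return time.perf_counter() - start


def record_sample(url, csv_path, measure=measure_fetch_seconds):
    """Time one fetch of url and append (timestamp, url, seconds) to csv_path."""
    elapsed = measure(url)
    timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with open(csv_path, "a", newline="") as handle:
        csv.writer(handle).writerow([timestamp, url, f"{elapsed:.3f}"])
    return elapsed
```

Calling `record_sample` every few minutes from cron (or any scheduler) would produce a CSV that can be graphed later to see how response time drifts.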

Anyway, that’s enough post mortem for a blog server, I suppose. Thanks to all of my readers (old and new). Hopefully we’ll have that problem again in the future!

  • For load testing, you might want to check out browsermob. Now, I wouldn’t pay for 500k simulated hits, but smaller slices are affordable.

  • Wyatt: Cool, thanks for the tip. I’ll check them out!

  • Alex

    Matt, I actually strongly recommend against Varnish. It’s not a bad caching engine, don’t get me wrong, but there are better utilities and methods out there. As I’m sure you can see, my e-mail address is attached to this comment; send me an e-mail and I’ll help you out with this. I have a lot of experience handling extremely high traffic, and it’s actually much trickier than slapping Varnish in front of your application if you want it to correctly scale and handle traffic surges.

    I will admit, though, CloudFlare is a very good first step. Kudos for jumping on that before a problem arose. A proactive–as opposed to a reactive–sysadmin is a good sysadmin.

  • James

    For stress testing, siege allows you to pick a URL and have it hit repeatedly for a designated number of seconds.

    Homepage: http://www.joedog.org/index/siege-home

    For me, I use Ubuntu server and just installed it out of the standard repos.

  • I know it’s become something I harp on about, but you really can’t beat static generated sites for coping with load, particularly if you shove a service like CloudFlare in front of it.

    I switched my site across from WordPress to Octopress (Ruby/Jekyll based), using Disqus for comments (so it’s just client-side JavaScript execution). I don’t have that many blogs, so whilst a script I found did the bulk of the heavy lifting converting WordPress output to Markdown, I didn’t waste time trying to script the manual tweaks each blog required. I’ve since seen others shift large WordPress archives across in an entirely automated fashion after they wrote / fixed / tweaked scripts. Site loading times have plummeted, even under load (tested with Siege, mentioned by James, using between 20 and 50 concurrent users).

  • Chris Snell

    Again, if you’re looking for testing large amounts of traffic, you can look at traffic generation tools from Ixia and Spirent, but I’m suspecting you don’t want to shell out the $$ those companies will ask.

  • Nope, I’ve just about resigned myself to having to write compelling content that my readers want to share. Damnit.

  • algrym

    I’m very, very sorry … but it was a great article. :)

    For what it’s worth, I’ve been there.

  • @algrym: Hey, you keep on submitting anything you want!