Please schedule your unplanned emergencies at a more opportune time

Date: October 29, 2009

It's been one of those weeks.

We recently ordered a couple of pretty hefty (for us, anyway) machines from Penguin Computing. They're 12-core machines with 32GB of RAM. They're for heavy maths, and the plan was to put one in the primary site and one in the backup site.

Well, one arrived DOA. I've RMA'd it, but since we needed to get these machines into production, I installed and configured the working machine and delivered it to the rack at the primary site. And there was much rejoicing.

Until the next day, when we started getting odd, flapping reports of packet loss on the production database machine. It resolved itself, and we didn't think any more of it, until it happened again the following day. Of course, in retrospect, we should have seen the relationship immediately, but it took some investigation. It seems that certain processes using 10 of the 12 cores cause issues with the database machine when run concurrently with another database-heavy task on another machine. Maybe.
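For anyone chasing a similar intermittent problem, one way to line up packet loss with job start times is to sample it on a schedule and keep the timestamps. What follows is only a minimal sketch, not what we actually ran: it assumes a Linux host whose `ping` prints the usual iputils summary line, and the function names `parse_loss` and `sample_loss` are made up for illustration.

```python
import re
import subprocess
from datetime import datetime

# Matches the summary line printed by Linux iputils ping, e.g.
# "5 packets transmitted, 4 received, 20% packet loss, time 4005ms"
# (assumption: other ping implementations may word this differently).
LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def parse_loss(ping_output):
    """Return the packet-loss percentage found in ping output, or None."""
    match = LOSS_RE.search(ping_output)
    return float(match.group(1)) if match else None


def sample_loss(host, count=5):
    """Ping `host` `count` times; return (ISO timestamp, loss percent or None)."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-q", host],
        capture_output=True,
        text=True,
    )
    timestamp = datetime.now().isoformat(timespec="seconds")
    return timestamp, parse_loss(result.stdout)
```

Calling `sample_loss("db01")` from cron or a loop (where `db01` is a placeholder hostname) and appending the results to a file gives a timeline that can be compared against when the heavy jobs ran.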

See, I'd love to say conclusively, but signs are pointing to "yes". Before I can assure anyone that it is absolutely the issue, I've got to recreate the occurrence at the backup stack. But we don't have a machine there capable of that kind of performance. At least, we didn't before yesterday.

It was decided at 3:30pm that the machine needed to be physically migrated from the primary site to the secondary site. Like, right now.

So by 6 last night, Ryan and I were at the backup site installing the machine in the rack and getting it configured. Hopefully by the end of the week, we'll be able to conclusively prove that I need to find a way to make the database server magically better. Or not. I'm not sure which to hope for.

Thank goodness that the LISA conference is next week. It sure won't be a vacation, but it'll be nice to get away for a while.

  • http://hype-free.blogspot.com/ Cd-MaN

    You could try the recently released stressapptest project from Google (http://code.google.com/p/stressapptest/) to verify your hardware. It stresses disk, RAM, and CPU (it also has a networking part, which I haven't used yet) with the intent of discovering problems under high load. Although I haven't managed to uncover any problems on the system I ran it on so far (which is a good thing, I suppose), it is easy to use and it really seems to stress the given components.

    Regards.

  • Anthony

    Always hope that your suspicions are correct no matter the consequence. The alternative is that your intuition/experience/knowledge failed to predict the cause of the problem - which is ultimately worse.

  • http://www.standalone-sysadmin.com Matt Simmons

    @Cd-MaN - Thanks for the tip. I don't think I'm going to run the stress test on our production database, but it could definitely be useful in the future for benchmarking and the like. Good link!

    @Anthony - Very true. Excellent advice, and it sounds like something House MD would say ;-)

  • http://jeffhengesbach.blogspot.com/ Jeff Hengesbach

    Good luck, Matt - you are fortunate to have an environment where you can hash it out! I guess it is a classic case where adding something to production wouldn't have been considered disruptive. It usually turns out to be the opposite, though, at some level. A 'big' box like that can really put a strain on the resources it relies upon.
