October 29, 2009
It's been one of those weeks.
We recently ordered a couple of pretty hefty (for us, anyway) machines from Penguin Computing. They're 12-core machines with 32GB of RAM. They're for heavy maths, and we've planned to put one in the primary site and one in the backup site.
Well, one arrived DOA. I've RMA'd it, but since we needed to get these machines into production, I installed and configured the working machine and delivered it to the rack at the primary site. And there was much rejoicing.
Until the next day, when we started getting odd flapping reports of packet loss on the production database machine. It solved itself, and we didn't think any more of it, until it happened again the day after. Of course, in retrospect, we should have seen the relationship immediately, but it took some investigation. It seems that certain processes using 10 of the 12 cores on the new machine cause issues with the database machine when run concurrently with another database-heavy task on another machine. Maybe.
See, I'd love to say so conclusively, but for now the signs are only pointing to "yes". Before I can assure anyone that it is absolutely the issue, I've got to recreate the occurrence at the backup stack. But we don't have a machine there capable of that kind of performance. At least, we didn't before yesterday.
It was decided at 3:30pm that the machine needed to be physically migrated from the primary site to the secondary site. Like, right now.
So by 6 last night, Ryan and I were at the backup site installing the machine in the rack and getting it configured. Hopefully by the end of the week, we'll be able to conclusively prove that I need to find a way to make the database server magically better. Or not. I'm not sure which to hope for.
Thank goodness that the LISA conference is next week. It sure won't be a vacation, but it'll be nice to get away for a while.