Timing is Key

Date January 28, 2011

I don't know if you've heard, but my area of New Jersey has gotten a lot of snow lately. Right now, I'm looking at the back parking lot, and the snow is around two feet deep everywhere that isn't plowed or buried under 12-foot plow droppings. Earlier this week, we were at work when the most recent snowfall started, and I was trying to encourage my junior admin to leave the office before it hit. "I don't know, it looks all right," he said, and I replied,

"If you wait until it's obvious, it's too late."

It sort of hit me. That last sentence has become my de facto motto for a lot of things. I think the first time it occurred to me was when I started researching IPv6. The depletion of IPv4 is no surprise, and hasn't been for quite a while, but most people seem to be holding off on even researching IPv6 until it becomes obvious that they need it. Again, by that time, if you're at any kind of competitive company vying for market position, it'll be too late. It won't be obvious that it's necessary until you see people financially punished for not taking those steps.

Another thing I've noticed within my own company is that so much of the focus is on day-to-day operations that it's rare for someone to step back and think, "Is this really a sustainable practice?" It's getting better since we created a QA role, but still, the first thing I do after putting out a fire is figure out why something wasn't made flame-retardant a long time ago.

I'm going through practices that we have on paper but aren't actually performing (like enforced data lifetimes), and I'm implementing things that will help the environment scale further, such as sharding my data sets. Only by continuing to steer these things in the direction I want them to go can I prevent future disasters. If I wait until it's obvious that there's a problem, then it's too late.

I know that I'm not the only one. My friend Robin Harris published a column back in 2007 titled "Why RAID 5 stops working in 2009", which basically explained that the rate of unrecoverable read errors (UREs) wasn't decreasing as fast as disk capacity was increasing, and that arrays would soon be large enough that you'd be nearly guaranteed to hit a URE while rebuilding after a drive failure. He was right; right now, no one would recommend RAID 5 for an array of any size.
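To put rough numbers on that argument, here's a back-of-the-envelope sketch. It is my own illustration, not Robin's calculation. It assumes a URE rate of one error per 10^14 bits read (a figure commonly quoted for consumer SATA drives), it assumes errors are independent, and the function name and the example array of six surviving 2 TB drives are hypothetical.

```python
import math


def rebuild_ure_probability(surviving_drives, drive_tb, ure_rate_per_bit=1e-14):
    """Estimate the chance of hitting at least one URE during a RAID 5 rebuild.

    A rebuild has to read every bit on every surviving drive. Assuming each
    bit fails independently with probability ure_rate_per_bit (an assumption,
    not a measured value), the chance of at least one failure is
    1 - (1 - rate) ** bits_read.
    """
    bits_read = surviving_drives * drive_tb * 1e12 * 8  # decimal TB -> bits
    # log1p/expm1 keep precision when the per-bit rate is this tiny
    return -math.expm1(bits_read * math.log1p(-ure_rate_per_bit))


if __name__ == "__main__":
    # Hypothetical array: 7 x 2 TB drives in RAID 5, one drive has failed
    p = rebuild_ure_probability(surviving_drives=6, drive_tb=2.0)
    print(f"Chance of a URE during rebuild: {p:.0%}")
```

Under those assumptions, the example array comes out at roughly a 62% chance of hitting a URE before the rebuild finishes, and that number only climbs as the drives get bigger.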

Last year, he published a follow-up article, "Why RAID 6 stops working in 2019". He's right again, and it may well happen sooner than that.

Our jobs are hard. They take up a lot of time, but the day-to-day requirements of user and machine maintenance are only part of the story. We need to keep an eye out for oncoming trucks, especially when no one else is doing that job for our company.

What kinds of things do you see looming? What have you noticed that others have missed? Share your thoughts in the comments, because you might open someone's eyes to a problem they never knew was sneaking up on them.