The trouble with SAN storage is that most of it is too well designed. Really.
Some serious engineering goes into making highly fault-tolerant, reliable devices. They’re designed to be available all the time. It reminds me of a saying…
The difference between something that can break and something that can’t break is that when something that can’t break does, it’s hell to get at.
I’ve got an AX4-5 at my secondary site that hasn’t worked since this past weekend. The currently suspected reason is almost unfathomably simple…a sense cable.
See, the way that heavy-duty RAID controllers work is that they have a significant amount of cache. Because the array wants to be absolutely sure of what has and hasn’t been written, the cache on the individual hard drives is disabled, and the controller’s own cache serves the entire array.
By using this cache, the response time of the entire array is greatly improved. Instead of waiting for the data to actually be written to the disks, you only wait until it’s spooled into the cache (much faster). The drawback is that if you lose power, any data that hasn’t been written to disk yet is lost. To get around that, the array has a Standby Power Supply (SPS), which is essentially just a UPS.
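To make the trade-off concrete, here is a minimal Python sketch of the idea. It is a toy model, not how the AX4-5 firmware works: the `Controller` class and the two latency figures are assumptions invented for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical latencies, picked only for illustration (not measured on any array).
CACHE_WRITE_MS = 0.1
DISK_WRITE_MS = 8.0


@dataclass
class Controller:
    """Toy model of a RAID controller with an optional write cache."""
    write_cache_enabled: bool = True
    dirty: list = field(default_factory=list)  # acknowledged, but not yet on disk
    disk: list = field(default_factory=list)   # safely written

    def write(self, block):
        """Return the latency in ms that the host sees for one write."""
        if self.write_cache_enabled:
            self.dirty.append(block)   # acknowledge as soon as it is in cache
            return CACHE_WRITE_MS
        self.disk.append(block)        # no cache: wait for the disk itself
        return DISK_WRITE_MS

    def flush(self):
        """Move everything still in cache onto disk."""
        self.disk.extend(self.dirty)
        self.dirty.clear()

    def power_loss(self):
        """Drop whatever is still in cache; return how many blocks were lost."""
        lost = len(self.dirty)
        self.dirty.clear()
        return lost
```

With the cache on, `write()` comes back quickly, but a `power_loss()` before a `flush()` throws away every acknowledged block. Buying enough time for that flush is the SPS’s whole job.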
The SPS is plugged into the power source, and the controller is plugged into the SPS. To alert the controller to a power outage, the SPS and controller share a sense cable. Without this cable, the controller wouldn’t know it needs to flush the cache to disk, so it would merrily continue to store data in RAM until it lost power.
Because the storage array is designed NOT. TO. LOSE. YOUR. DATA., if you don’t have a battery, or even a sense cable, the array turns off all of that expensive, performance-producing cache. With no cache whatsoever, not even on the disks, reading continues to be relatively good, performance-wise, but writing data, particularly non-sequential data, is a drag, literally and figuratively.
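The decision described above boils down to a simple “all or nothing” check. The following Python sketch is my own illustration of that logic; the `SpsStatus` fields and function names are hypothetical, not anything exposed by Navisphere or the array firmware.

```python
from dataclasses import dataclass


@dataclass
class SpsStatus:
    """Hypothetical view of what the controller knows about its SPS."""
    present: bool
    battery_charged: bool
    sense_cable_connected: bool


def write_cache_allowed(sps):
    """Allow write caching only if a power loss could be detected AND survived."""
    return sps.present and sps.battery_charged and sps.sense_cable_connected


def why_disabled(sps):
    """Return a human-readable reason, or None if the cache may stay on."""
    if not sps.present:
        return "no SPS"
    if not sps.sense_cable_connected:
        return "sense cable missing or bad"
    if not sps.battery_charged:
        return "battery testing or charging"
    return None
```

Note that a perfectly healthy battery doesn’t help if the sense cable is bad: the controller would never hear about the outage, so the safe choice is to run without write cache.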
This is the state that my standby array has been in since Sunday morning. The Navisphere Express interface was reporting:
The storage system’s Standby Power Supplies (SPSs) are not working properly. (0x7404)
So EMC sent out a Customer Engineer (CE) to replace the SPS. Of course, by the time he got out here, they had looked at the diagnostic information and seen that the array was ACTUALLY reporting a bad sense cable. No worries, just replace it. I’m sure one came with the SPS, right?
No, I’m afraid not. In fact, it wasn’t going to be easy to get one at all.
Let’s look at this with some perspective. I am less than an hour from New York City, one of the largest metropolitan and commercial hubs of the world. The equipment was in the heart of Philadelphia, the 6th most populous city in the country. Where are we going to get a sense cable? Apparently, Mexico. What am I supposed to think about that?
Anyway, as I was writing this entry, the replacement cable was installed. I got a brief 10-minute window where I really, honestly thought the system was going to recover, because right after the cable was replaced I got an email:
Event 7210 has occurred on storage system FCNMM084200078 for device N/A at 10/06/10 19:52:11
The event description is: The write cache is temporarily disabled because the SPS battery is testing or charging. When the battery is fully charged, the write cache will automatically be enabled. You can change the battery test time for this system if required.
Well, testing and charging are certainly better than “failed horribly and lying smoking on the ground, husks of their former selves”. Alas, it was not to be. The next message:
Event 720f has occurred on storage system FCNMM084200078 for device N/A at 10/06/10 20:01:57
The event description is: The storage system has disabled its write cache, most likely because of a hardware issue. See the storage system Attention Required page for details.
Yay, back to that again. When will my storage come up? I honestly don’t know, and what’s worse, I’m so tired that I can’t even care that much right now. Yesterday’s 10 hours was the shortest day I’ve had since last Friday. I’m just wore out.
Writing this blog entry was pretty cathartic, though, so if my whining does nothing else, I feel a little better. Thanks for reading if you made it this far.
Sometimes I think the slogan should be:
SysAdmin: Not all rainbows and puppies