Not all rainbows and puppies

Date October 6, 2010

The trouble with SAN storage is that most of it is too well designed. Really.

Some serious engineering goes into making highly fault-tolerant, reliable devices. They're designed to be available all the time. It reminds me of a saying...

The difference between something that can break and something that can't break is that when something that can't break does, it's hell to get at.

I've got an AX4-5 at my secondary site that hasn't worked since this past weekend. The currently suspected reason is almost unfathomably simple...a sense cable.

See, the way that heavy-duty RAID controllers work is that they have a significant amount of cache. Because the array wants to be absolutely sure of what has and hasn't been written, the cache on the actual hard drives is disabled, and the controller's cache serves the entire array.

Using this cache greatly improves the response time of the entire array. Instead of waiting for the data to actually be written to the disks, you only wait until it's spooled into the cache (much faster). The drawback is that if you lose power, any data that hasn't been written to disk yet is lost. To circumvent that, the array has a Standby Power Supply (SPS), which is essentially just a UPS.

The SPS is plugged into the power source, and the controller is plugged into the SPS. To alert the controller to a power outage, the SPS and controller share a sense cable. Without this cable, the controller doesn't know it needs to flush the cache to disk, so it would merrily continue to store data in RAM until it loses power.
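A toy model makes the role of the sense cable concrete. This is only a sketch in Python, not how the AX4-5 firmware actually works; the `Controller` class, its methods, and `simulate` are all made up for illustration.

```python
class Controller:
    """Toy write-back controller; every name here is hypothetical."""

    def __init__(self, sense_cable_connected):
        self.sense_cable_connected = sense_cable_connected
        self.cache = {}  # acknowledged writes still held in RAM
        self.disk = {}   # writes that have actually reached the disks

    def write(self, block, data):
        # Write-back: the write is acknowledged once it is in cache.
        self.cache[block] = data

    def flush(self):
        # Destage everything in cache to disk.
        self.disk.update(self.cache)
        self.cache.clear()

    def mains_failure(self):
        # The SPS can only warn the controller through the sense cable.
        if self.sense_cable_connected:
            self.flush()

    def sps_exhausted(self):
        # Battery is empty: whatever is still in RAM is gone.
        lost = dict(self.cache)
        self.cache.clear()
        return lost


def simulate(sense_cable_connected):
    """Write one block, cut the mains, then drain the battery."""
    ctrl = Controller(sense_cable_connected)
    ctrl.write(7, "payroll")
    ctrl.mains_failure()
    lost = ctrl.sps_exhausted()
    return lost, ctrl.disk
```

With the cable connected, `simulate(True)` loses nothing because the flush happens while the SPS is still holding the controller up. With it missing, `simulate(False)` returns the acknowledged write as lost, which is exactly the situation the array refuses to risk.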

Because the storage array is designed NOT. TO. LOSE. YOUR. DATA., if you don't have a battery, or even a sense cable, the array turns off all of that expensive, performance-producing cache. With no cache whatsoever, not even on the disks, reading continues to be relatively good, performance-wise, but writing data, particularly non-sequential data, is a drag, literally and figuratively.
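The decision the array makes here boils down to something like the following. Again, this is a hedged sketch: `write_cache_mode` and its parameters are hypothetical names, and the real checks are surely more involved than two booleans.

```python
def write_cache_mode(sps_healthy, sense_cable_ok):
    """Pick the write policy a cautious array would choose (hypothetical)."""
    if sps_healthy and sense_cable_ok:
        # Safe to acknowledge writes from cache and destage them later.
        return "write-back"
    # Otherwise acknowledge a write only after the disks have it.
    return "write-through"
```

A missing sense cable is treated the same as a dead battery, since in both cases the controller has no guarantee it will get the chance to flush.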

This is the state that my standby array has been in since Sunday morning. The Navisphere Express interface was reporting:

The storage system's Standby Power Supplies (SPSs) are not working properly. (0x7404)

So EMC sent out a Customer Engineer (CE) to replace the SPS. Of course, by the time he got out here, they had looked at the diagnostic information and seen that the array was ACTUALLY reporting a bad sense cable. No worries, just replace it. I'm sure one came with the SPS, right?

No, I'm afraid not. In fact, it wasn't going to be easy to get one at all.

Let's look at this with some perspective. I am less than an hour from New York City, one of the largest metropolitan and commercial hubs of the world. The equipment was in the heart of Philadelphia, the 6th most populous city in the country. Where are we going to get a sense cable? Apparently, Mexico. What am I supposed to think about that?

Anyway, as I was writing this entry, the replacement cable was installed. I got a brief 10-minute window where I really, honestly thought the system was going to recover, because once the cable was replaced, I got an email:

Event 7210 has occurred on storage system FCNMM084200078 for device N/A at 10/06/10 19:52:11
The event description is: The write cache is temporarily disabled because the SPS battery is testing or charging. When the battery is fully charged, the write cache will automatically be enabled. You can change the battery test time for this system if required.

Well, testing and charging are certainly better than "failed horribly and lying smoking on the ground, husks of their former selves". Alas, it was not to be. The next message:

Event 720f has occurred on storage system FCNMM084200078 for device N/A at 10/06/10 20:01:57
The event description is: The storage system has disabled its write cache, most likely because of a hardware issue. See the storage system Attention Required page for details.

Yay, back to that again. When will my storage come up? I honestly don't know, and what's worse, I'm so tired that I can't even care that much right now. Yesterday's 10 hours was the shortest day I've had since last Friday. I'm just wore out.

Writing this blog entry was pretty cathartic, though, so even if my whining does nothing else, at least I feel a little better. Thanks for reading if you made it this far.

Sometimes I think the slogan should be

SysAdmin: Not all rainbows and puppies

11 Responses to “Not all rainbows and puppies”

  1. oscar said:

    I love that new slogan. And yes, I wholeheartedly agree with you that sometimes the high level of engineering in these devices seems to complicate things a bit when they do break. And the parts issue is fascinating. You'd think that living in NY you'd have that part in < 3 hours from somewhere local.

  2. Matt Simmons said:

    Oscar: I'm amazed too, trust me. I voiced my incredulity to pretty much every EMC employee I've talked to in the last 4 days, which I think was 6 of them.

  3. Janåke Rönnblom said:

    If one was comfortable with the idea of losing some data due to power problems, you could always forcibly enable the cache in the AX4-5 again. This can be done if you have the full Navisphere, but I'm not sure about the Express edition...

    -J

  4. Matt Simmons said:

    J: I'd be very interested in learning how to do that with Express right about now. How do you do it with the full edition?

  5. Janåke Rönnblom said:

    Hi Matt,

    Check out the "HA Vault Cache" setting. I had a failure on our old CX300 a long time ago, and at that time that option helped us. In our case, however, it was a failed disk that caused the write cache to be disabled. I have to admit it was a few years ago, so my memory may be failing me ;)

    You might be able to change them using the navicli, but you should probably check with EMC first.

    http://storagenerve.com/2009/01/17/clariion-cache-navicli-commands/

    -J

  6. Tweets that mention Not all rainbows and puppies | Standalone Sysadmin -- Topsy.com said:

    [...] This post was mentioned on Twitter by Matt Simmons, Planet SysAd. Planet SysAd said: Standalone Sysadmin: Not all rainbows and puppies http://bit.ly/9xPiNh [...]

  7. Twirrim said:

    Sadly, that doesn't strike me as an unusual level of service from EMC. They've grown so large they appear to have reached the state where they are successful in spite of themselves.

  8. Jeremy L. Gaddis said:

    Matt -- as someone who is currently looking at the AX4 I am definitely interested in hearing how this gets resolved. Please do keep us updated!

  9. DoDigaro said:

    Our MSA1000 (baby SAN, I know - we're looking at a proper one for early next year!) has a similar problem. It's got active/active controllers and the battery backup in one of them is having issues, so it's disabled them in just that controller. The cache appears to be happily working away, however! Still waiting on parts from HP..

  10. Daniel Howard said:

    I spent a day at a data center on the other side of the country trying to get an answer from a Sun StorageTek as to how one installs the Linux SCSI drivers provided with the install CD. The vendor's support was stupefyingly horrible and I needed a great deal of catharsis to get over that event.

    It sounds like EMC is doing you a bit better on your current adventure, but I know you'll be pleased once the present trouble is behind you. Good luck!

    -danny

  11. Are you suggesting coconuts migrate? | Standalone Sysadmin said:

    [...] storage didn’t want to enable the SPS (which caused a near-panic on my part, because this was not the first time), but eventually everything came up, and right now, my servers are chugging along [...]
