Followup to EMC AX4 SAN Trouble

Date October 8, 2010

I wrote about the difficulties I had after migrating my AX4 from one rack to another in my last entry. Now that the hardship is over, I suppose I should talk about what apparently happened.

The root cause was apparently the failure of either the battery backup (SPS) or its sense cable. Both ended up getting replaced; in the process, however, the on-site technician and the EMC engineer evidently disagreed about how the cables should be wired.

Over the course of several hours, I think they must have hit every possible configuration for those devices and the cables they had on hand. Most infuriatingly, mid-process, the EMC engineer actually made the battery work on the "A" controller by changing the configuration. Of course, that presented a possible problem...if it works on the A, was the B controller bad, or had we just not tried this particular permutation?

We switched the battery back to B (where it is actually supposed to be), tried the same configuration, and yet nothing happened. At this point, it was late, and we'd been doing this for hours. We decided to switch it back to A for the purposes of making the cache work, then debug the B problem the next day...except when we switched it back to A, it didn't work either.

I'm not entirely certain what cable misconfiguration was put in place by the Customer Engineer (CE) whom EMC hired, but he did something wrong. How can I be so sure?

The crew that I was talking with on the phone (the EMC tech and EMC engineer) were based in India. At this point, it was 7:30am there, and they were going off-shift. They asked me if they could wait until the next day, but this was Wednesday evening here, and my users had been without this SAN for 5 days. I wasn't about to let it get to 6 if I could help it, so I asked them instead to hand me off to one of their colleagues who had just come on shift.

Within an hour, I had someone who was not only new to the case, but also fresh because he had just come to work. This was very helpful, because his lack of experience with my setup made us re-evaluate the current situation. The on-site CE had abandoned us to go to the Phillies game, so we ended up having a local remote-hands guy from the colocation do the plugging and unplugging.

Working backward through the wiring and debug logs, we figured out that the power cable going from the battery backup wasn't even plugged in. Both sides of the AX4 were plugged straight into PDUs (the same PDU, actually). After this was fixed, a reboot of the storage system brought up the cache, and things returned to normal.
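For what it's worth, the check we ended up doing by hand can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `PowerFeed` record, the `audit_power` function, and the PDU names are hypothetical and not part of any EMC tool, and the "fixed" layout is one plausible arrangement rather than EMC's documented wiring. It flags the two things we found: nothing was powered through the SPS, and both sides landed on the same PDU.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PowerFeed:
    sp: str        # storage processor label, "A" or "B"
    pdu: str       # hypothetical name of the PDU the feed draws from
    via_sps: bool  # True if the feed passes through the SPS first


def audit_power(feeds: List[PowerFeed]) -> List[str]:
    """Return a list of human-readable problems with the power wiring."""
    problems = []
    # If no SP sits behind the battery backup, there is no
    # battery-protected power path at all.
    if not any(f.via_sps for f in feeds):
        problems.append("no storage processor is powered through the SPS")
    # Every feed landing on one PDU makes that PDU a single point of failure.
    pdus = {f.pdu for f in feeds}
    if len(feeds) > 1 and len(pdus) == 1:
        problems.append("all feeds share " + next(iter(pdus)))
    return problems


# Wiring as we found it: both sides straight into the same PDU.
as_found = [PowerFeed("A", "pdu-1", False), PowerFeed("B", "pdu-1", False)]
# One plausible corrected layout (an assumption, not EMC's documented wiring).
as_fixed = [PowerFeed("A", "pdu-1", False), PowerFeed("B", "pdu-2", True)]

for label, feeds in (("as found", as_found), ("as fixed", as_fixed)):
    print(label, audit_power(feeds) or "OK")
```

Writing the layout down like this, even on paper, would probably have saved us a few hours of swapping cables at random.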

All in all, I'm glad that it's over, and I wish I had been on-site to do the wiring myself, but what still irks me the most is that EMC needed 2 days to get me a sense cable. I still can't fathom why keeping a small supply domestically would kill them. Anyway, it's over now, and the array is back at 100%.

Since I first wrote about this a couple of days ago, a number of people have contacted me expressing concern because they are considering getting AX4s. I can tell you that this is the first real issue I've had with my arrays. The first one I ordered came through Dell, who gave us a single kit that was already put together; we pretty much just had to plug it in, and we were done.

We bought our second array through CDW, and they had the audacity to tell us that a single-processor configuration would be acceptable for a warm standby. Do not be convinced by any charlatan feeding you information of this sort. The single-processor configuration is utterly worthless. Get the dual-processor (DP) configuration, and I would recommend getting a pair of SPSs for it, too. The single SPS that we have works, but this entire exercise would not have been necessary had we gotten a second SPS to go with the first.

As for the performance, I find it acceptable (at least, so long as the cache is enabled). I was stupid a few years ago when I set it up originally, and I made one disk pool that encompassed 10 of the disks (the other 2 being spares, since you can't do much with 1 disk, and it yells at you a lot if you have no spare). Fortunately, I'm smarter now, and with the addition of a disk array enclosure, I'm in the middle of migrating my LUNs onto more-properly designed pools.

Anyway, so that is the end of the current saga of the broken SAN array. Feel free to comment with questions or commiseration :-)

  • Lee W

    Dude... your CE left to *go to a Phillies game*? Your phone-droid "went off-shift"? Back when I rocked US phone support for Celerra in the early 00's, your ass stayed on a case until it was fixed or it was apparent you couldn't do any more troubleshooting. Ugh. I never was a fan of EMC's products or support then (Go Netapp!), but after listening to this you best believe I'm an even *harder* sell when the EMC rep comes calling.

  • http://www.anthonyldechiaro.com Anthony D

    Sorry to hear you had so many problems with your SAN. I can't believe the tech left your case, that's just absurd. In my former life working for one of the large investment firms we used to have EMC techs come on-site all the time and never had any major issues with them, then again I was working ops and not on the storage team. If anything we had more problems with NetApp. Still, I'd expect more from EMC than this. :-/

  • Brian

    Matt,

    Sorry to hear about your SAN issues... An important item I highly stress when purchasing a product... find out where the phone support technicians are located.... I try to support American jobs and people I clearly understand what they're saying!

  • http://www.standalone-sysadmin.com Matt Simmons

    Thanks everyone...I really appreciate the sympathy.

  • http://blog.rootmytoaster.org Kenny

    Would you be able to, yourself, buy spare sense cables to keep on hand just in case one goes bad again?

  • http://virtualgeek.typepad.com/ Chad Sakac

    Disclosure - EMCer here.

    I'm really sorry about the negative experience. In the hopes of making some good of it - can you fwd the case # to me? I'll investigate internally.

    Again, for what it's worth - an apology - and a commit to try to see what I can do to help.

    Chad

  • John

    I will never buy AX class gear from EMC again (possibly never anything from them again). When you call their support they have a special option for AX users - AKA THE BASEMENT. Bah!

  • http://sysadminpunk.com Rick Russell

    Every time the EMC guys meet us at our colo: we have 1 guy that comes in and preps everything and provides the doc, 1 guy that comes in and takes us to lunch and another guy that puts in drives or plugs cables in. Then they send in a Tech who looks at everything, then yells a couple of expletives and re-does the work again. Usually this is the 3rd or 4th trip after we've got all the necessary hardware on site, and after we've paid almost $30,000 for an additional set of drives. I HATE EMC.

  • Digitalspic

    This kind of stuff is what made us migrate from EMC to NetApp. Local support is great, remote support is great and best of all...I don't have to deal with EMC's whack internal politics.