October 8, 2010
I wrote about the difficulties I had after migrating my AX4 from one rack to another in my last entry. Now that the hardship is over, I suppose I should talk about what apparently happened.
The root cause was apparently the failure of either the battery backup (SPS) or its sense cable. Both ended up getting replaced, but in the process of replacing them, the on-site technician and the EMC engineer apparently disagreed about how the cables should be wired.
Over the course of several hours, I think they must have hit every possible configuration for those devices and the cables they had on hand. Most infuriatingly, mid-process, the EMC engineer actually made the battery work on the "A" controller by changing the configuration. Of course, that presented a possible problem...if it works on the A, was the B controller bad, or had we just not tried this particular permutation?
We switched the battery back to B (where it is actually supposed to be), tried the same configuration, and yet nothing happened. At this point, it was late, and we'd been doing this for hours. We decided to switch it back to A for the purposes of making the cache work, then debug the B problem the next day...except when we switched it back to A, it didn't work either.
I'm not entirely certain what cable misconfiguration was put in place by the Customer Engineer (CE) who EMC hired, but he did something wrong. How can I be so sure?
The crew that I was talking with on the phone (the EMC tech and EMC engineer) were based in India. At this point, it was 7:30am there, and they were going off-shift. They asked me if they could wait until the next day, but this was Wednesday evening here, and my users had been without this SAN for 5 days. I wasn't about to let it get to 6 if I could help it, so I asked them to instead hand me off to one of their colleagues who had just come on shift.
Within an hour, I had someone who was not only new to the case, but also fresh because he had just come to work. This was very helpful, because his lack of experience with my setup made us re-evaluate the current situation. The on-site CE had abandoned us to go to the Phillies game, so we ended up having a local remote-hands guy from the colocation do the plugging and unplugging.
Working backward through the wiring and debug logs, we figured out that the power cable going from the battery backup wasn't even plugged in. Both sides of the AX4 were plugged straight into PDUs (the same PDU, actually). After this was fixed, a reboot of the storage system brought up the cache, and things returned to normal.
All in all, I'm glad that it's over, and I wish I had been on-site to do the wiring myself, but what still irks me the most is that EMC needed 2 days to get me a sense cable. I still can't fathom why keeping a small supply domestically would kill them. Anyway, it's over now, and the array is back at 100%.
Since I first wrote about this a couple of days ago, a number of people have contacted me expressing concern because they are considering getting AX4s. I can tell you that this is the first real issue I've had with my arrays. The first one I ordered was through Dell, and they gave us a single kit that was already put together. We pretty much just had to plug it in, and we were done.
We bought our second array through CDW, and they had the audacity to tell us that a single-processor configuration would be acceptable for a warm standby. Do not be convinced by any charlatan feeding you information of this sort. The single-processor configuration is utterly worthless. Get the dual-processor (DP) configuration, and I would recommend getting a pair of SPSs for it, too. The single SPS that we have works, but this entire exercise that I went through would not have been necessary had we gotten a second SPS to go with the first.
As for the performance, I find it acceptable (at least, so long as the cache is enabled). I was stupid a few years ago when I set it up originally, and I made one disk pool that encompassed 10 of the disks (the other 2 being spares, since you can't do much with 1 disk, and it yells at you a lot if you have no spare). Fortunately, I'm smarter now, and with the addition of a disk array enclosure, I'm in the middle of migrating my LUNs onto more properly designed pools.
Anyway, that is the end of the current saga of the broken SAN array. Feel free to comment with questions or commiseration :-)