Weekend Fun: New NetApp Installed

One of the larger differences between a position in Academia and a commercial business is that after-hours or weekend work is far less frequent. Every once in a while, though, there’s something that you need to do that can’t be done during the day. This weekend was one of those times.

Our central fileserver duties were, up until this weekend, handled by a NetApp FAS3140 filer. It provided all of the NFS shares and iSCSI LUNs to all the machines around the college. A few issues made us want to replace it, namely that it was getting long in the tooth and had only a single head / controller (I’m still not all that familiar with the NetApp nomenclature).

We replaced it with another filer yesterday, a 3210 with two controllers.

I’m really glad that I presented the Infrastructure Migration talk recently, because I used a lot of the things I talked about in those slides.

Part of the support agreement with our NetApp reseller was the actual installation, testing, and turn-up of the new filer, so our checklist basically had two major parts: prepare for the outage and recover from it. As you know or can imagine, when the central file server goes down, there’s a non-trivial amount of work done to prepare the infrastructure.

You could approach this problem one of two ways. You could actually do it both ways in order to self-check for correctness.

The first way, and the way that I started, was to say, “alright, what are the least-important machines that rely on the filer?”. Those need to turn off first. Then the next most important, then the next, and so on. Importance is kind of an arbitrary judgement though; what I was really asking was, “what relies on this, but has nothing that relies on it?”. These were things like desktop machines. Because desktop machines have nothing that relies on them (except users, and the users had been warned several times beforehand), they were the first to get shut off.

The other way, which is probably more correct, is to start at the center and say, “what relies on the NetApp directly?”. Create a list, then iterate through that list, asking the same question, “what relies on this?”, and repeat until you’re out of dependent systems. I didn’t take this route because it generally takes more time and I started late. Next time, I imagine we’ll make the checklist further in advance, something I’ll bring up at the post-mortem.
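For anyone who wants to try both approaches and use one to check the other, here is a minimal Python sketch. The inventory in `RELIES_ON` is entirely made up for illustration (it is not our real host list), and it assumes every system appears as a key in the mapping. `dependents_of` walks outward from the center, and `shutdown_waves` repeatedly peels off whatever nothing else relies on.

```python
from collections import deque

# Hypothetical inventory: each system maps to the systems it relies on.
RELIES_ON = {
    "filer": set(),
    "vmware-cluster": {"filer"},
    "solaris-box": {"filer"},
    "web-vm": {"vmware-cluster"},
    "desktop": {"filer", "web-vm"},
}


def dependents_of(root, relies_on):
    """Start at the center: find everything that relies on root, directly or not."""
    # Invert the mapping so we can ask "what relies on this?".
    reverse = {name: set() for name in relies_on}
    for name, deps in relies_on.items():
        for dep in deps:
            reverse[dep].add(name)
    seen, queue = set(), deque([root])
    while queue:
        current = queue.popleft()
        for child in sorted(reverse[current]):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def shutdown_waves(relies_on):
    """Start at the edges: group systems into waves that are safe to shut off."""
    remaining = {name: set(deps) for name, deps in relies_on.items()}
    waves = []
    while remaining:
        # Anything still listed as a dependency has something relying on it.
        relied_upon = set().union(*remaining.values())
        wave = sorted(name for name in remaining if name not in relied_upon)
        if not wave:
            raise ValueError("circular dependency among: %s" % sorted(remaining))
        waves.append(wave)
        for name in wave:
            del remaining[name]
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(shutdown_waves(RELIES_ON), start=1):
        print(number, ", ".join(wave))
```

The self-check is that every system in the waves, other than the filer itself, should also show up in `dependents_of("filer", RELIES_ON)`. Reversing the list of waves gives a reasonable power-up order for the recovery half of the checklist.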

Overall, things went relatively smoothly. Of course, shutting things down almost always goes smoothly. It’s the whole “bringing it back up” part that creates wrinkles, but it honestly didn’t go badly. There were a couple of undocumented places on really old Solaris boxes which referenced the previous filer by name, as opposed to by CNAME (each of the major shares now has a CNAME that the clients point at…something like homedirs.domain.tld), but since this wasn’t exhaustively documented, we had to fix those by hand.
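In hindsight, a check like the following could have caught those stragglers before the outage. This is only a sketch: it assumes you have already collected the text of each client's mount table (`/etc/vfstab` on Solaris, `/etc/fstab` on Linux), and `oldfiler` is a placeholder hostname, not the real one. It relies on the fact that in both formats an NFS entry starts with a `host:/path` field.

```python
def stale_mounts(table_text, old_host):
    """Return mount-table lines whose NFS source names old_host directly."""
    hits = []
    for line in table_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        source = stripped.split()[0]
        host, sep, _path = source.partition(":")
        # Compare on the short hostname so FQDNs are caught too.
        if sep and host.split(".")[0].lower() == old_host.lower():
            hits.append(stripped)
    return hits


if __name__ == "__main__":
    sample = """\
# hypothetical Solaris vfstab entries
oldfiler:/vol/home - /home nfs - yes rw
homedirs.domain.tld:/vol/home - /export/home nfs - yes rw
/dev/dsk/c0t0d0s0 /dev/rdsk/c0t0d0s0 / ufs 1 no -
"""
    for entry in stale_mounts(sample, "oldfiler"):
        print(entry)
```

Entries that already point at a CNAME are left alone, so whatever the function returns is the list of things to fix.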

Overall, I’m pretty happy, and now we’ve got a shiny new filer, and still have a disk shelf on the old one, so I can get a little more familiar with NetApp without breaking production ;-)

If you have any questions or suggestions of things that we could fix, please let me know by commenting below. Beware that this purchase was planned before I got here (in fact, they showed me the boxed-up filer during my interview months ago), so I won’t be able to answer any “why did you pick this” type questions.

  • Nice! Not very familiar with NetApp, but we’re getting ready for the end of lease on our much-hated EMC equipment and have to start evaluating now. So pardon my very basic and ignorant question: did you migrate any data in this case from one filer to another, or was it just disk migrations?

  • Neil: Actually, what we did was have the old NetApp disown the disks (without erasing them), then added them to the new head unit. Because so much of the NetApp config is just plain text files, we could essentially copy the exports to the new unit and it picked up where the old one left off. Easy-peasy.

  • I should also say that we did add some disk storage to the new one first, so we could do things like storage vMotion on VMware shares.

  • Matt,

    That iterative process you mentioned, where you ask “what relies on this?”, is called dependency mapping. We’ve got a specialized product for doing exactly this. (see Blueprints video)

    If you maintain dependency models of your environment, you can simply query the model to determine all the upstream dependencies (impact analysis), and it’ll save you a lot of work during crunch time when you’re trying to knock out a project like replacing a storage system. Another benefit of having a current model is that you’ll have a place to record obscure dependencies like those Solaris boxes, which might otherwise get overlooked again down the road.

  • Andy

    We’ve got a new 3240 coming soon, although from what I’m hearing, it seems the people who designed it and sold it to us didn’t put much thought into the design :(

    Will be nice when we finally get off of local storage though!