August 27, 2012
One of the larger differences between a position in academia and one in a commercial business is that after-hours or weekend work is far less frequent in academia. Every once in a while, though, there's something that you need to do that can't be done during the day. This weekend was one of those times.
Our central fileserver duties were, up until this weekend, handled by a NetApp FAS3140 filer. It provided all of the NFS shares and iSCSI LUNs to all the machines around the college. It had a few issues that made us want to replace it, namely that it was long in the tooth and had only a single head / controller (I'm still not all that familiar with the NetApp nomenclature).
We replaced it with another filer yesterday, a 3210 with two controllers.
I'm really glad that I presented the Infrastructure Migration talk recently, because I used a lot of the things I talked about in those slides.
Part of the support agreement with our NetApp reseller was the actual installation, testing, and turn-up of the new filer, so our checklist basically had two major parts: prepare for the outage and recover from it. As you know or can imagine, when the central file server goes down, there's a non-trivial amount of work done to prepare the infrastructure.
You could approach this problem one of two ways. You could actually do it both ways in order to self-check for correctness.
The first way, and the way that I started, was to say, "alright, what are the least-important machines that rely on the filer?". Those need to turn off first. Then the next most important, then the next, and so on. Importance is kind of an arbitrary judgement though; what I was really asking was, "what relies on this, but has nothing that relies on it?". These were things like desktop machines. Because desktop machines have nothing that relies on them (except users, and the users had been warned several times beforehand), they were to be the first to get shut off.
The other way, which is probably more correct, is to start at the center and say, "what relies on the NetApp directly?". Create a list, then iterate through that list, asking the same question, "what relies on this?", and repeat until you're out of dependent systems. I didn't take this route because it generally takes more time and I started late. Next time, I imagine we'll make the checklist further in advance, something I'll bring up at the post-mortem.
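Both approaches amount to walking a dependency graph, so here is a rough sketch of how the checklist could be generated rather than written by hand. This is not what we actually ran. The machine names and the `depends_on` map are made up for illustration, and it assumes Python 3.9 or later for `graphlib`.

```python
from collections import defaultdict, deque
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def shutdown_order(depends_on, root):
    """Return everything that relies on `root`, dependents first, root last.

    `depends_on` maps a machine to the things it relies on directly.
    """
    # Invert the map so we can ask "what relies on this?"
    relied_on_by = defaultdict(set)
    for machine, deps in depends_on.items():
        for dep in deps:
            relied_on_by[dep].add(machine)

    # Start at the center and keep asking the same question
    # until we run out of dependent systems.
    affected = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for machine in relied_on_by[current]:
            if machine not in affected:
                affected.add(machine)
                queue.append(machine)

    # A machine can only go down after everything relying on it is down,
    # so its dependents are its "predecessors" in the shutdown order.
    sorter = TopologicalSorter({m: relied_on_by[m] for m in affected})
    return list(sorter.static_order())


# Hypothetical inventory, purely for illustration.
depends_on = {
    "desktop": ["filer"],
    "db-server": ["filer"],
    "web-server": ["db-server", "filer"],
    "printer": [],  # does not touch the filer, so it is left alone
}

order = shutdown_order(depends_on, "filer")
```

Reversing `order` gives a reasonable bring-up sequence: the filer first, then whatever relies on it directly, and so on out to the desktops.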
Overall, things went relatively smoothly. Of course, shutting things down almost always goes smoothly. It's the whole "bringing it back up" part that creates wrinkles, but it honestly didn't go badly. There were a couple of undocumented places on really old Solaris boxes which referenced the previous filer by name, as opposed to by CNAME (each of the major shares now has a CNAME that the clients point at…something like homedirs.domain.tld), but since this wasn't explicitly documented, we had to fix it manually.
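In hindsight, a quick scan of each client's mount table before the outage might have caught those stragglers. Here's a minimal sketch of the idea, assuming an fstab/vfstab-style file where the first field of an NFS entry is `host:/path`. The hostname `oldfiler` and the paths are invented for the example; only homedirs.domain.tld comes from our setup.

```python
def find_hardcoded_mounts(config_text, old_name):
    """Return (line_number, line) pairs whose NFS source host is old_name."""
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        source = stripped.split()[0]       # first field, e.g. host:/path
        host = source.split(":", 1)[0]     # the part before the colon
        # Match the bare name or a fully qualified form of it.
        if host == old_name or host.startswith(old_name + "."):
            hits.append((number, stripped))
    return hits


# Hypothetical mount table contents, for illustration only.
sample = """\
# device to mount              mount point  type
oldfiler:/vol/home             /home        nfs
homedirs.domain.tld:/vol/home  /home2       nfs
"""

hits = find_hardcoded_mounts(sample, "oldfiler")
```

Anything that shows up in `hits` is a client still pointing at the filer by name instead of at the CNAME.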
Overall, I'm pretty happy, and now we've got a shiny new filer, and still have a disk shelf on the old one, so I can get a little more familiar with NetApp without breaking production ;-)
If you have any questions or suggestions of things that we could fix, please let me know by commenting below. Beware that this purchase was planned before I got here (in fact, they showed me the boxed-up filer during my interview months ago), so I won't be able to answer any "why did you pick this" type questions.