Tag Archives: iscsi

Hooray for new toys!

Last week was probably my craziest here so far. On Monday we were off because of some kind of blizzard, so it was a shorter week to begin with. On top of that, there were things piled up from my week off for gall bladder surgery that I was trying to clear, and the icing on the cake was that we were evaluating a new storage solution for our VMware environment, so I had to get that going too. Of everything, that last one turned out the best.

So, the situation we’re in right now is that our primary storage is a dual-controller NetApp FAS3210. We’re doing the whole Active/Active thing, and our disk aggregates are split across the two controllers.

Prior to 4 or 5 months ago, we were running roughly the same number of disks but in a single aggregate, attached to a single-headed FAS3140, so while it was probably a less sophisticated setup, IO was spread over a much larger number of spindles.

If you’re curious about the actual disk count, on the head that deals with virtualization, we’ve got 14 spindles, and on the other head, we’ve got 42. I’m not the storage admin, so I can’t tell you exactly why it’s set up like this, but there are reasons.

The 14 spindles that deal with virtualization just can’t keep up with the load that our classroom virtual environment is putting on it, so we’ve been looking at other options, from putting everything back on the same aggregate to getting a specific solution in place to offload the IO. We talked to Cambridge Computing, and they suggested that we take a look at Nimble Storage, who had been getting rave reviews from Cambridge’s customers.

I was familiar with Nimble because they’ve presented at Tech Field Day a few times. Architecturally speaking, their arrays are hybrid SSD/spinning disk, and they have a few interesting techniques that sound pretty promising. We talked to the local Nimble team, and they agreed to let us borrow their CS260 demo unit while they went to the company’s annual meeting in California, which was very cool of them and much appreciated.

So this past week, I needed to do some stress testing of the array, which of course involved running cables and configuring the network and the ESXi boxes. I’ve never played with iSCSI before (my former arrays were all Fibre Channel, and our NetApps use NFS to present the datastores), so that was a fun new experience.
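Mostly for my own future reference, the rough shape of the software-iSCSI setup on an ESXi host from the command line looks something like the sketch below. The adapter name (vmhba33), vmkernel port (vmk1), and portal address are placeholders, not our actual values:

    # Enable the software iSCSI initiator on the host
    esxcli iscsi software set --enabled=true

    # Find the name of the software iSCSI adapter (vmhba33 below is a placeholder)
    esxcli iscsi adapter list

    # Bind a vmkernel port to the adapter for iSCSI traffic
    esxcli iscsi networkportal add --adapter=vmhba33 --nic=vmk1

    # Point dynamic discovery at the array's iSCSI portal (placeholder address)
    esxcli iscsi adapter discovery sendtarget add --adapter=vmhba33 --address=192.168.100.10:3260

    # Rescan so the new LUNs show up
    esxcli storage core adapter rescan --all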

All in all, we were happy with the performance of the array (basically, we were network bound the entire time), but even being network bound, we were seeing far better performance than we were getting from the 14 spindles on the NetApp. I wish I’d had more time (or a smaller workload) to give it a more thorough beating, but even so, we decided to purchase an array. In a few weeks (as soon as University paperwork goes through), we’ll get our brand new CS220. Awesome!

I can’t wait, and I’ll definitely let you know what it’s like to get going with it. In the meantime, I’ve got to get the vSphere cluster ready to roll, so that means plenty of work. It’s going to be fun!

Weekend Fun: New NetApp Installed

One of the larger differences between a position in academia and one at a commercial business is that after-hours or weekend work is far less frequent. Every once in a while, though, there’s something that needs doing that can’t be done during the day. This weekend was one of those times.

Our central fileserver duties were, up until this weekend, handled by a NetApp FAS3140 filer. It provided all of the NFS shares and iSCSI LUNs to the machines around the college. It had a few issues that made us want to replace it, namely that it was getting long in the tooth and that it had only a single head / controller (I’m still not all that familiar with the NetApp nomenclature).

We replaced it yesterday with another filer, a FAS3210 with two controllers.

I’m really glad that I presented the Infrastructure Migration talk recently, because I used a lot of the things I talked about in those slides.

Part of the support agreement with our NetApp reseller was the actual installation, testing, and turn-up of the new filer, so our checklist basically had two major parts: prepare for the outage and recover from it. As you know (or can imagine), when the central file server goes down, there’s a non-trivial amount of work to be done to prepare the infrastructure.

You could approach this problem one of two ways. You could actually do it both ways in order to self-check for correctness.

The first way, and the way that I started, was to ask, “alright, what are the least-important machines that rely on the filer?”. Those need to turn off first. Then the next most important, then the next, and so on. Importance is kind of an arbitrary judgement, though; what I was really asking was, “what relies on this, but has nothing that relies on it?”. These were things like desktop machines. Because desktop machines have nothing that relies on them (except users, and the users had been warned several times beforehand), they were the first to get shut off.

The other way, which is probably more correct, is to start at the center and ask, “what relies on the NetApp directly?”. Create a list, then iterate through that list asking the same question, “what relies on this?”, and repeat until you’re out of dependent systems. I didn’t take this route because it generally takes more time and I started late. Next time, I imagine we’ll build the checklist further in advance, something I’ll bring up at the post-mortem.
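If (when) we do this again, I’d like to script that second approach instead of working it out by hand. Something like the sketch below would do it; the hostnames and dependency pairs are invented for illustration, and tsort does the actual ordering:

    # Each line is "thing-it-depends-on  dependent"; hostnames are made up.
    printf '%s\n' \
      'netapp webserver' \
      'netapp mailstore' \
      'netapp desktop-lab' \
      'webserver webapp' > deps.txt

    # tsort prints each item before anything that depends on it (a boot order);
    # reversing it with tac gives a safe shutdown order.
    tsort deps.txt | tac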

Overall, things went relatively smoothly. Of course, shutting things down almost always goes smoothly; it’s the whole “bringing it back up” part that creates wrinkles, but honestly it didn’t go badly. There were a couple of undocumented places on really old Solaris boxes which referenced the previous filer by name, as opposed to by CNAME (each of the major shares now has a CNAME that the clients point at, something like homedirs.domain.tld), but since this wasn’t explicitly documented anywhere, we had to fix it manually.
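In case you’re wondering how we tracked the stragglers down, it mostly came down to grepping the usual suspects on each box for the old filer’s hostname, roughly like this (“oldfiler” is a stand-in for the real name; on a Linux box you’d check /etc/fstab and the automounter maps instead):

    # Look for the previous filer referenced by name instead of by CNAME.
    # "oldfiler" is a placeholder hostname, not our real one.
    grep -il oldfiler /etc/vfstab /etc/auto_* 2>/dev/null

    # And double-check what's actually mounted right now
    mount | grep -i oldfiler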

Overall, I’m pretty happy, and now we’ve got a shiny new filer, and still have a disk shelf on the old one, so I can get a little more familiar with NetApp without breaking production ;-)

If you have any questions or suggestions of things that we could fix, please let me know by commenting below. Bear in mind that this purchase was planned before I got here (in fact, they showed me the boxed-up filer during my interview months ago), so I won’t be able to answer any “why did you pick this” type questions.

Another storage option

I have been researching storage for the past few days. I’ve been concentrating on iSCSI, since I’m trying to keep costs down, and a Fibre Channel switch is pretty expensive (especially if I actually want to use it).

While researching, I chanced upon a technology I hadn’t heard of before: ATA over Ethernet (AoE). Unlike iSCSI, which transmits data over TCP/IP, AoE sends it in raw layer 2 Ethernet frames. The implication is that, like Fibre Channel, it can’t be routed between networks. For most people, this is not a problem. For some, it’s a deal breaker.

In the same way that iSCSI has software targets (which turn an ordinary computer into an iSCSI “server”), there is software available to create AoE targets. This would be useful if you’ve got a large machine with many available disk slots.
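The best-known of those, as far as I can tell, is vblade from the aoetools project on Linux. A rough sketch of exporting a spare disk and picking it up on a client looks something like this (the interface, disk, and MAC address below are made up):

    # On the "server": export /dev/sdb as AoE shelf 0, slot 1 over eth1.
    # The -m flag limits access to the listed initiator MAC addresses, which
    # seems to be about the only access control the protocol offers.
    vbladed -m 00:11:22:33:44:55 0 1 eth1 /dev/sdb

    # On the client: load the AoE driver and the device shows up as /dev/etherd/e0.1
    modprobe aoe
    aoe-discover
    ls /dev/etherd/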

There are also AoE arrays on the market. Coraid sells some very large arrays. They even offer a ready-made High Availability NAS gateway.

There are drawbacks, of course. There doesn’t appear to be much (any?) inherent security in the protocol. If anyone reading this has experience, I’d be very thankful for some comments on how access control is done.

I’ve read comments from a few years ago claiming that it wasn’t as stable as iSCSI, but they offered no evidence for that conclusion, so I have no way of checking whether those complaints have since been addressed.

In the end, I still don’t know what I will do, but the more I read, the bigger a blip it is becoming on my radar. What do you think?