VM Live Migration is the wrong tactic

Date: September 29, 2009

I'm sorry. I know you probably paid a lot for that license, but if your infrastructure is relying on a machine's ability to transition between VM hosts without rebooting as the crux of your high availability plan, you might want to reconsider.

Yesterday, Rational Survivability (a great all-over-the-place IT blog) had a post titled "The Emotion of VMotion." It didn't occur to me before reading it that my own previous search for a hypervisor that would do live migration was working directly against my belief that uptime should only matter for services. Essentially, the infrastructure should be designed so that a single server going down doesn't cause a loss of availability.

That being said, live migration is a neat idea, and eventually it's going to get to the point that it's nearly instantaneous. When that happens, failovers will be next to invisible. Maybe we'll have to reevaluate our approach in that case.

Until then, I read posts from people trying to rely on it to keep their infrastructures up and I worry that their approach is flawed.

Please, build your services for reliability, not just the underlying systems.
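
As a rough illustration of what "a single server down doesn't cost availability" can look like from the caller's side, here is a minimal sketch in Python. The names (call_with_failover, dead_server, healthy_server) are hypothetical and stand in for real redundant instances of a service; a real client would also want timeouts and narrower error handling.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


class AllBackendsFailed(Exception):
    """Raised when no redundant instance could serve the request."""


def call_with_failover(backends: Iterable[Callable[[], T]]) -> T:
    """Try each redundant instance in turn and return the first success."""
    errors = []
    for backend in backends:
        try:
            return backend()
        except Exception as exc:  # assumption: any exception means "try the next one"
            errors.append(exc)
    raise AllBackendsFailed(errors)


# Hypothetical stand-ins for two instances of the same service.
def dead_server() -> str:
    raise ConnectionError("host is down")


def healthy_server() -> str:
    return "200 OK"


result = call_with_failover([dead_server, healthy_server])
print(result)  # prints "200 OK": the dead host did not cost the service its availability
```

The point of the sketch is only that redundancy lives at the service level: the caller never needs the dead machine to come back, migrated or otherwise.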

  • http://www.semicomplete.com/ Jordan Sissel

    Agreed.

    VMotion is a hugely expensive venture that seems to bank on you buying it in the belief that the only piece of your infrastructure that can fail or need maintenance is your server hardware.

    Services fail. Networks go funky. Data gets corrupted or lost. Software crashes and misbehaves.

    Further, many folks I've worked with who seem to love VMs always seem to adhere to "the vm is the service" model. They hand-build a vm for each service they need to run, etc, which is perhaps why VMotion appears popular - because the machine is the service to many folks, but they ignore that VMotion only helps solve a fraction of the kinds of failures that occur. This vm-is-the-service model also seems to prevent folks from designing services for failure.

  • http://www.fuzzy-logic.org Lee W.

    So how does one do clustering for high-write-transaction (9k to 14k writes per hour during peak times) Postgres databases? It ain't active/passive failover (at least from what we've been trying to do so far), we can't stick a bunch of Postgres instances behind a hardware load balancer like we can web/mail servers, and the lots o' writes (tack on another couple of 0's for reads/hr) add another layer of difficulty onto that. The big win for VMotion-like functionality in my environment is that we're able to offload our database instances to ESX clusters and maintain high-availability in the event of power loss, hardware failure, etc. If you (or your readers) know of any Postgres addons that let you do clustering that stands up to lots of concurrent writes, let me know!

    --Lee

  • http://rickvanover.com Rick Vanover

    That is the fundamental belief of Hyper-V and the end state of System Center's broad integration. Well said point, Matt.

  • http://neckbeard.stfudonny.com jimb

    vmotion isn't really for failover. it's for resource/workload balancing and for enabling you to do stuff like firmware/bios/hypervisor updates as rolling daytime work instead of maintenance-windows or when they really happen, "someday" (pronounced "neh-var").

    you start everything with the least resource you think it can get away with and you spread it out across your nodes kinda round-robin, then as individual services grow you allocate more resources to the vm and vmotion others around to rebalance things. pony up for DRS and it's automated.

  • http://nsrd.wordpress.com/ Preston de Guise

    I don't think VMotion should be seen as a tool for guaranteeing uptime. A point I repeatedly make about uptime is that the midrange world has it wrong, they got it wrong and they keep getting it wrong in comparison to the mainframe world. The mainframe world was about negotiated uptime. Keeping systems up for the sake of big uptime numbers isn't sufficient justification to avoid rebooting. (In fact, I'd suggest that system administrators who do that have an arrogant disrespect for their end users.)

    VMotion is for load balancing, as another commenter has suggested, but it's also about maintenance of the infrastructure. For instance, one of the advantages that mirrored disks give us is the ability to swap out failing hardware without interrupting our system.

    VMware and other hypervisors bring a layer of infrastructure that our systems depend on but shouldn't be affected by. Thus if there needs to be planned maintenance on a virtual server, it shouldn't have to affect the guests if there is sufficient infrastructure to support guest migration.

  • http://jeffhengesbach.blogspot.com/ Jeff Hengesbach

    jimb has it. VMotion is not in and of itself a high availability feature. Moving a VM between hosts facilitates non-disruptive physical host updates and effective utilization of hardware by leveling the load. The other recent feature in vSphere that requires VMotion as a foundation is DPM. It can be a huge power saver by consolidating VMs onto fewer hosts and powering the other hosts on and off as needed.

    I think the crux of the issue is people thinking VMotion is a high availability feature when it is not (akin to how RAID is not a backup solution). VMware HA is there for entire host failures (with delayed recovery), FT for running a 1 to 1 copy of VMs for instant failovers, SRM, etc. There is still a shortfall in VMware's offerings of application-aware (service-aware) availability, where the guest VM is running but its services are not accessible. None of these solve service upgrade downtime either - you really need a clustered service where rolling upgrades can take place without disrupting access to the service.

    HA is like backups - the constant evolution of solutions, definitions, and marketing speak really clouds people's perceptions and understanding.

  • http://blog.gurski.org emag

    Having just today managed to somehow crash a VM host, I can most emphatically agree that any live migration is _not_ HA. Especially when the most-impacted VM happens to be a DB server, with a 2TB fs, which managed to have the superblock & backup not be in sync when it came back... One 3-hour fsck later, and said DB server was back online. Even VM HA as I understand it (basically, crash-level recovery) wouldn't have helped. If we'd known the host would spontaneously reboot, we would have put it into maintenance mode before investigating why the Nagios server could no longer see the machine. Had we done that, no VMs would have rebooted, and yes, HA would have been preserved. But, well, that obviously didn't happen, and any migration strategy wouldn't have helped.

    We also discovered, slightly earlier (by several weeks) that plugging anything else into the rack PSUs would, um, overload them. Unfortunately, _all_ our VM hosts, both VMware and XenServer, are in said racks. The vendor never bothered to mention we were _that_ close to the total available power in a half-populated full-height rack. Ouch. The best I can say is the host died close enough to COB that only a few people noticed, and those were all in-house and sympathetic...

  • Dan Carley

    I agree with the principle of your post.

    Yet do bear in mind that virtualisation is, and probably always will be, a great enabler for legacy systems. Those black box machines which need to keep on serving but are beyond the scope of re-engineering.

    If you can P2V those services from a standalone physical box in the corner and neatly shoehorn them onto a more reliable platform then you are doing something right. Needs must, and all that.

  • http://tech.philipsellers.com Philip Sellers

    I think others have said the same, but I do think you have figured that out from what I read in your post. Live migration isn't a high availability feature. It's for planned downtime and for balancing workloads. When I think of high availability, I immediately think clustering, whether it is VMware HA, Microsoft Cluster Services or otherwise, and not live migration. The clustering buys you fail-over during failure. I think the big three hypervisors all provide a way to accomplish this - ESX, XenServer, and Hyper-V.

    Where I have found VMotion in ESX to be extremely useful is when workloads change on an hourly basis and you need to move things around to balance your workloads. For instance, during payroll week, my database server and the web server that runs the Payroll application might be really busy, but they may be idle much of the other time. So, during payroll, I may migrate some VMs off those hosts to allow them to have more resources and then bring back VMs after payroll is complete.

    Live migration buys you the ability to move VMs onto and off of a host to make better utilization of the resources without having to reboot the VM or pause it when making your migration (which is a headache to many applications - and users). VMware DRS does a good job of allowing dynamic, fully-automated movement of VMs as your workloads change. This is a lifesaver when a poorly coded app causes undue stress on a host. Without the fully-automated method of live migration, the other VMs on that host would suffer performance hits.

  • Chris Kamler

    I, like many others, I assume, was surprised to learn that "High Availability" isn't "5 9 availability" - HA just allows you to reboot on a different chassis. I am implementing it next month in my environment, but I am curious: if a server blue screens on chassis #1, and then it boots on chassis #2 - won't it be just as likely to blue screen over there?

    I agree that it's a marketing term for (literally) a check-box in the VM software. But it should not be the base of your existence...

  • http://jeffhengesbach.blogspot.com/ Jeff Hengesbach

    Chris - You are right - HA really just makes sure a VM is "on" and the OS heartbeat is responding. Whatever issue caused the initial blue screen may re-manifest - HA doesn't solve root causes such as application memory leaks, app lockups, etc.
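
The distinction drawn in the last few comments, between a VM that is "on" with a responding heartbeat and a service that actually answers, can be sketched roughly as follows. This is a toy model in Python, not how VMware HA is implemented: the Guest class, its fields, and service_probe are all hypothetical names, and a real probe might be a TCP connect or a test query against the application.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Guest:
    """Hypothetical view of one guest VM as a monitoring system might see it."""
    name: str
    powered_on: bool
    os_heartbeat: bool
    service_probe: Callable[[], bool]  # assumption: returns True if the app answers


def vm_level_ok(guest: Guest) -> bool:
    """Roughly what a host-level HA check sees: powered on, OS heartbeat alive."""
    return guest.powered_on and guest.os_heartbeat


def service_level_ok(guest: Guest) -> bool:
    """What users care about: the VM is up AND the application responds."""
    if not vm_level_ok(guest):
        return False
    try:
        return bool(guest.service_probe())
    except Exception:  # a probe that errors out counts as a failed service
        return False


# A guest whose OS is fine but whose database process has locked up.
db = Guest(name="db01", powered_on=True, os_heartbeat=True,
           service_probe=lambda: False)

print(vm_level_ok(db), service_level_ok(db))  # prints "True False"
```

A guest like db01 above passes the VM-level check while the service is down, which is exactly the gap that neither live migration nor restart-on-another-host closes.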