I am researching Citrix XenServer and, together with two colleagues, comparing it with VMware ESX and Microsoft HyperV.
In our tests, Xen's live migration seems to use fewer resources than VMware ESX's, and I would like to know why that is. I found an article from last year that references a paper from 2005, explaining what actually happens with the pages/memory during live migration.
This is an extract of that article about the memory transfer:
Push phase: The source VM continues running while certain pages are pushed across the network to the new destination. To ensure consistency, pages modified during this process must be re-sent.
Stop-and-copy phase: The source VM is stopped, pages are copied across to the destination VM, then the new VM is started.
Pull phase: The new VM executes and, if it accesses a page that has not yet been copied, this page is faulted in ("pulled") across the network from the source VM.
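The push and stop-and-copy phases above can be sketched as a small simulation. This is only an illustration of the pre-copy idea, not Xen's actual implementation; the page count, dirty-page probability, and round limit are made-up parameters chosen for the example:

```python
import random

PAGE_COUNT = 64  # illustrative guest memory size, in pages


def live_migrate(dirty_probability=0.2, max_push_rounds=5, rng=None):
    """Simulate pre-copy live migration.

    Push phase: repeatedly send the pages still marked dirty while the
    source VM keeps running and re-dirties some of them.
    Stop-and-copy phase: pause the VM and send whatever remains; the
    size of that remainder determines the downtime.
    """
    rng = rng or random.Random(42)
    to_send = set(range(PAGE_COUNT))  # every page starts out unsent
    pages_pushed = 0

    for _ in range(max_push_rounds):
        if not to_send:
            break
        pages_pushed += len(to_send)  # push this round's dirty set
        # While the push was in flight, the running VM dirtied some
        # pages again; they must be re-sent in a later round.
        to_send = {p for p in range(PAGE_COUNT)
                   if rng.random() < dirty_probability}

    # Stop-and-copy: VM is paused, the remaining dirty pages are copied.
    downtime_pages = len(to_send)
    return pages_pushed, downtime_pages


pushed, downtime = live_migrate()
print(f"pages pushed while running: {pushed}, pages copied during pause: {downtime}")
```

The point of the iteration is visible in the output: the total pushed exceeds the VM's memory size (pages are sent more than once), but the stop-and-copy set, and therefore the downtime, stays small compared to copying all of memory while paused.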
I was wondering if the memory transfer still happens in the same fashion as it did 4 years ago.
I'm not an expert on Xen migration, and I'm using the open-source Xen server. In my experience, Xen migrates very efficiently as long as your storage layer is fast: disk images stored as files on an ocfs2 volume or (god forbid) an NFS mount were much slower for us than block devices on a SAN with a shared locking volume on an NFS mount. We haven't had problems with disk corruption, but on a very active system we do tend to snapshot things (both LVM2 volumes and VM state) before we start the migration, just to be sure.
According to "Running Xen: A Hands-On Guide to the Art of Virtualization" by Matthews, Dow, et al. (Prentice Hall, 2008, page 484), the process is similar to the list of steps you described above, with the addition of multiple iterations. Note that, in the current state of live migration, the machine may be doing I/O in two places at once.
Unlike with VMware and HyperV, the nice thing about XenServer is that a ton of people have been running it and trying their hardest to break it ten ways from Sunday in very serious production environments. Live migration is new to us, and we don't use it in production yet because of redundancy concerns (it's non-trivial to scale to n machines at this point, since our shared data partitions live on ocfs2 volumes), but in our test environments we've been having fun bouncing machines all over the place.