Situation:
On an integrated all-in-one ESXi/ZFS storage server, the storage VM is given bare-metal disks and exports its filesystems via NFS (or iSCSI) back to ESXi, which in turn uses them as pool storage for the other VMs. This becomes a problem when the storage VM itself needs to be updated: every running VM depends on it and will time out with NFS.AllPathsDown or a similar error, which for the guests is the equivalent of pulling the disks out of a running physical server.
Of course it is possible to shut down all VMs, but this is time-consuming and tedious (or has to be scripted). Moving the VMs to another host may be possible, but takes even longer and may not be an option in smaller setups, where a single machine is plenty. Suspending the VMs could work, but is also quite slow (sometimes slower than a shutdown).
Possible solutions...
- A simple yet efficient solution seems to be to stop the VM processes from the ESXi CLI with `kill -STOP [pid]` after finding each PID with `ps -c | grep -v grep | grep [vmname]`, do the upgrade/reboot of the storage VM, and then resume execution of the VM processes with `kill -CONT [pid]` (a sketch of this workflow follows below the list).
- A similar solution might be to combine a fast reboot (available on Solaris/illumos via `reboot -f`, or on Linux via kexec-reboot), which takes seconds instead of minutes, with the NFS timeout behaviour of ESXi (on loss of the NFS connection, all I/O is suspended for, I think, 120 seconds before the storage is assumed to be down permanently). If the reboot finishes inside that ESXi NFS window, it should in theory look much like a disk that stops responding for a minute because of error correction and then resumes normal operation (a kexec example also follows below).
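To make the first option concrete, here is a minimal sketch of the pause/resume workflow as it could be run from the ESXi shell. The PIDs are placeholders that have to be looked up manually first, and the column layout of `ps -c` varies between ESXi versions, so verify which field holds the PID before scripting around it.

```sh
# Minimal sketch (ESXi shell). Look up the vmx PIDs of the guests first,
# e.g. with: ps -c | grep -v grep | grep [vmname]
VM_PIDS="12345 12346 12347"   # placeholder PIDs of every guest except the storage VM

# Freeze the VM processes so they stop issuing I/O against the NFS datastore.
for pid in $VM_PIDS; do
    kill -STOP "$pid"
done

# ... update and reboot the storage VM here, wait until the NFS export is back ...

# Resume the frozen VM processes exactly where they left off.
for pid in $VM_PIDS; do
    kill -CONT "$pid"
done
```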
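For the second option on a Linux-based storage VM, a kexec-style fast reboot could look like the sketch below; it assumes kexec-tools is installed and uses Debian/Ubuntu-style initrd paths, so adjust for your distribution (on illumos, `reboot -f` is all that is needed).

```sh
# Stage the currently running kernel for kexec, reusing the current kernel cmdline.
KERNEL=/boot/vmlinuz-$(uname -r)
INITRD=/boot/initrd.img-$(uname -r)   # e.g. /boot/initramfs-$(uname -r).img on RHEL-like systems
kexec -l "$KERNEL" --initrd="$INITRD" --reuse-cmdline

# Shut down services cleanly and jump straight into the staged kernel,
# skipping firmware initialization and POST.
systemctl kexec        # or "kexec -e" after stopping services manually
```

Whether such a reboot fits inside the window ESXi tolerates can be checked against the host's NFS heartbeat settings, e.g. with `esxcli system settings advanced list -o /NFS/HeartbeatTimeout`.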
... and problems?
Now, my questions are:
- Which method is preferable, or are they equally good/bad?
- What are unintended side effects in special cases like databases, Active Directory controllers, machines with users running jobs etc.?
- Where should one be careful? A comment on the linked blog mentions timekeeping problems may arise when the CPU is frozen, for example.
Edit: To clarify on the scope of this question
After receiving the first two answers, I think I did not word my question clearly enough, or left out too much information for the sake of brevity. I am aware of the following:
- It is not supported by VMware or anyone else, don't do this!: I did not mention this because the first link already says so, and because I would not have asked if this machine were covered by VMware support. It is a purely technical question; support matters are out of scope here.
- If designing a new system today, some things could be done in other ways: Correct, but as the system has been running stably for some years, I prefer not to throw the baby out with the bathwater and start over from scratch, introducing new problems along the way.
- Buy hardware X and you won't have this problem! True, I could buy 2 or 3 additional servers with similar cost and have a full HA setup. I know how this is done, it is not that hard. But this is not the situation here. If this was a viable solution in my case, I would not have asked the question in the first place.
- Just accept the delay of shut down and reboot: I know that this is a possibility, as it is what I'm doing currently. I have asked the question to either find better alternatives within the current setup, or to learn of substantiated technical reasons some of the methods outlined will have problems - "it is unpredictable" without any explanation why is not a substantiated answer in my book.
Therefore, to rephrase the questions:
- Which of those two methods is technically preferable and why, assuming the setup is fixed and the goal is to reduce downtime without introducing any negative side effects to data integrity?
- What are unintended side effects in special cases like
- active/idling/quiescent databases with users and/or applications accessing them
- Active Directory controllers on this machine and/or on other machines (on the same domain)
- general purpose machines idling or with users running jobs or running automated maintenance jobs like backups
- appliances like network monitoring or routers
- network time with or without using NTP on this server or on another or on multiple servers
- In which special cases is it advisable not to do this because the downsides outweigh the advantage? Where should one be careful? A comment on the linked blog mentions that timekeeping problems may arise when the CPU is frozen, for example, but does not provide any reasoning, proof or test results.
- What are the factual, technical differences between those two solutions and
- Stalled execution of VM processes because of CPU overload on the host
- Increased wait times on disk I/O because of faulty disks or controllers, assuming the delay stays below the NFS timeout threshold?
Good question...
But why do you need to reboot the NFS server, anyway?
All-in-one designs aren't reasonable anymore. As a science experiment or small home-lab situation, sure. But like any solution, expect to build in downtime and maintenance windows when necessary.
So...
If you can't have this type of downtime, you should not be running an all-in-one storage and VM setup, or you should consider traditional SAN storage (or a low-cost version) and multiple VM hosts.
My suggestion would be to avoid this problem altogether. You mentioned that increased costs and a complete re-architecting are show stoppers, but what you could consider in this situation is to have two storage VMs on the host in a two-node failover cluster. This would allow you to patch either one of them (but not both at the same time) without affecting the availability of NFS or iSCSI served by the cluster. It still isn't a supported solution, but it does at least allow some flexibility in maintenance at the cost of increased resource overhead (mainly however much memory you give to the second storage VM) for storage.
If changing the architecture is completely unacceptable, then the safest option would be to shut down the VMs.
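If you do go the shutdown route, the tedium can at least be scripted from the ESXi shell. A minimal sketch, assuming VMware Tools is running in the guests and using placeholder VM IDs (everything except the storage VM):

```sh
# List all registered VMs with their numeric IDs.
vim-cmd vmsvc/getallvms

# Placeholder IDs for every guest except the storage VM.
GUEST_IDS="1 2 3"

# Ask each guest for a clean, tools-initiated shutdown.
for id in $GUEST_IDS; do
    vim-cmd vmsvc/power.shutdown "$id"
done

# Verify everything is powered off before touching the storage VM.
for id in $GUEST_IDS; do
    vim-cmd vmsvc/power.getstate "$id"
done
```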
The next-best solution would be to enable hibernation in your VMs. Hibernation would ensure that all filesystems are quiesced, helping avoid possible corruption.
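Triggering hibernation from inside a guest is a one-liner, assuming the guest actually supports it (swap space on Linux, a hibernation file on Windows):

```sh
# Linux guest with systemd:
systemctl hibernate
# On a Windows guest the equivalent is "shutdown /h"
# (hibernation must be enabled first with "powercfg /hibernate on").
```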
Next, you could take a snapshot of the VM with memory state, forcibly terminate the VM's process, then revert to the snapshot when done. This incurs a small window of possibly lost data, but since you would presumably only attempt this during a maintenance window, when losing a few seconds of data is tolerable, it should be fairly inconsequential. This approach is as quick as taking a snapshot and ensures the VMs don't complain about lost disks, but it does carry the potential for data loss.
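A rough sketch of that sequence from the ESXi shell might look like the following; the VM ID 42, the world ID and the snapshot ID are placeholders that have to be looked up first, and the argument order of the snapshot commands should be double-checked against your ESXi version.

```sh
# 1. Take a snapshot including memory state (args: name, description,
#    includeMemory=1, quiesce=0).
vim-cmd vmsvc/snapshot.create 42 "pre-maintenance" "before storage VM update" 1 0

# 2. Forcibly terminate the running VM process
#    (world ID taken from "esxcli vm process list").
esxcli vm process kill --type=force --world-id=<worldID>

# ... update and reboot the storage VM, wait for the NFS datastore to reappear ...

# 3. Look up the snapshot ID and revert to it, resuming from the saved memory state.
vim-cmd vmsvc/snapshot.get 42
vim-cmd vmsvc/snapshot.revert 42 <snapshotId> 0
```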
Lastly, if you want to pause the processes (and have tested that this actually works), then I would strongly suggest that you sync all disks in the guests first (on Linux with /bin/sync; on Windows with the Sysinternals Sync utility: http://technet.microsoft.com/en-us/sysinternals/bb897438.aspx), and that you perform your maintenance quickly so the guests' clocks don't fall too far behind.
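A quick way to do that across several Linux guests, assuming SSH access and using placeholder hostnames:

```sh
# Flush dirty buffers in each Linux guest right before pausing the VM processes.
for host in guest1 guest2 guest3; do    # placeholder hostnames
    ssh root@"$host" sync
done
# Windows guests: run the Sysinternals sync.exe utility instead.
```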
As for potential side effects: any AD-connected machine must (by default) be within 5 minutes of the DC's time. Therefore, after any of these options other than a normal shutdown, I would suggest forcing the resumed guests to update their clocks. On database servers, don't do any of this while the server is busy, as it increases the chances of filesystem corruption.
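Forcing that clock update after resume is straightforward; which command applies depends on the guest OS and its time daemon:

```sh
# Linux guest running chrony: step the clock immediately.
chronyc makestep
# Linux guest without a time daemon: one-shot sync against a public pool.
ntpdate -u pool.ntp.org
# On a Windows guest the equivalent is "w32tm /resync" from an elevated prompt.
```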
The main risk in all of the options beyond a normal shutdown or highly available storage is corruption. There may be I/O sitting in a buffer that gets dropped even though the application mistakenly believes it completed successfully. Worse yet, I/Os may have been re-ordered by a lower layer into a more optimal write pattern, so data may have been partially written out of order: perhaps a row count was incremented before the DB row's data was written, or a checksum updated before the checksummed data was physically changed. This can be mitigated by allowing only synchronous writes to your storage, but at the cost of performance.
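On a ZFS-backed storage VM, that mitigation maps to the sync property of the dataset behind the NFS export; the dataset and device names below are placeholders:

```sh
# Force every write to the dataset backing the NFS export to be synchronous.
zfs set sync=always tank/vmstore

# A fast, power-loss-protected SLOG device keeps the resulting performance
# penalty manageable.
zpool add tank log <fast-ssd-device>
```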
Neither.
This is the cost of a terrible design. I wouldn't make the situation worse by doing anything other than shutting down your VMs, working on the storage VM, and then restarting the other VMs. I'd also get someone to redesign your setup using a supported/supportable architecture.
It's inherently unpredictable: what happens this time may not happen if you did it again. It's unsupportable.
It's difficult to answer this constructively.