I'm trying to make the revert process as reliable as possible in our Lab Manager environments. We frequently (daily) revert a 10-12 server workspace to a previous version, upgrade it, and test on it. Every few weeks I hit a new problem after a revert, and a team or two has to wait while I monkey around.
The servers are Win2K3 servers running various applications. They are members of a domain that is external to the workspace.
Question: What do you do to achieve 100% reliable revert? Any surprises in store for me, beyond the following? Better resolutions to these problems?
Note: Unfortunately, fenced workspaces are not practical in this scenario. These are unfenced environments. Saving the workspace to a configuration and cloning tends to be far too slow for daily use, even though we've made the disks as small as practical (10GB per machine).
Snags hit:
The machine changes its password (?) or some other credentials with the DC every x weeks, so after a revert the snapshot can't connect to the domain. Prevent by setting:
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters] "DisablePasswordChange"=dword:00000001
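To apply that setting to all 10-12 machines before taking the snapshot, one option is to generate a .reg file once and import it on each server with `regedit /s`. A minimal sketch, assuming Python 3 is available on some machine to produce the file; the file name `disable-pwd-change.reg` and the helper names are made up for illustration:

```python
# Sketch: build a .reg file containing the Netlogon setting quoted above.
KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters"


def render_reg(key, name, dword_value):
    """Return .reg file text that sets one DWORD value under one key."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        "[%s]" % key,
        '"%s"=dword:%08x' % (name, dword_value),
        "",
    ]
    return "\r\n".join(lines)


def write_reg(path, text):
    """Write the text as UTF-16 with a BOM, the usual encoding for 5.00 files."""
    # newline="" keeps the explicit \r\n line endings untouched.
    with open(path, "w", encoding="utf-16", newline="") as handle:
        handle.write(text)


reg_text = render_reg(KEY, "DisablePasswordChange", 1)
# write_reg("disable-pwd-change.reg", reg_text)  # hypothetical file name
```

On each server the file would then be imported with `regedit /s disable-pwd-change.reg` while the machine is still in a good state, i.e. before the snapshot you intend to revert to.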
Computers revert from the snapshot with incorrect times, and times are out of sync between machines. Chaos ensues. Final resolution: make sure the host is running an NTP client (one of ours was not) and ensure the clients sync to the host. Per VMware, that was the root of many of our problems.
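A quick check after each revert can catch the clock problem before the teams do. The sketch below only does the comparison; how the clock readings are collected from each server is left out, and the machine names and timestamps are invented. The 5 minute threshold is the default Kerberos clock-skew tolerance in Active Directory, which is why a badly skewed machine stops authenticating.

```python
from datetime import datetime, timedelta

# Default Kerberos clock-skew tolerance in an Active Directory domain.
MAX_SKEW = timedelta(minutes=5)


def skewed_machines(reference, clocks, max_skew=MAX_SKEW):
    """Return {name: offset} for machines whose clock differs from the
    reference time by more than max_skew (offset = machine - reference)."""
    bad = {}
    for name, reading in clocks.items():
        offset = reading - reference
        if abs(offset) > max_skew:
            bad[name] = offset
    return bad


# Invented readings: sql01 came back from the snapshot days in the past.
reference = datetime(2009, 6, 1, 9, 0, 0)
clocks = {
    "app01": datetime(2009, 6, 1, 9, 0, 20),
    "sql01": datetime(2009, 5, 29, 17, 42, 0),
    "web01": datetime(2009, 6, 1, 8, 58, 30),
}
bad = skewed_machines(reference, clocks)
for name, offset in bad.items():
    print("%s is off by %s" % (name, offset))
```

Using the host or DC time as the reference matches the fix described above: everything should agree with the host, and the host with its NTP source.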
Summary of good answers:
- Take snapshot with power off. Prevents a variety of problems, including NTP.
I would recommend taking your snapshots when the VMs are powered off.
Even though VMware allows you to snapshot a live server, if you don't snapshot the memory it will be as if a power failure has occurred, and if you DO snapshot the memory you will have problems with things like NTP not running immediately.
The nice thing about VMware snapshots is that reverting is very fast. They don't so much make a "snapshot of the server" in the traditional sense. Rather, they start a new file containing any changes made to the disk AFTER the snapshot was taken. Hence, reverting just means discarding the "delta" file. (The drawback is that if you want to keep a snapshot around for a long time, the delta file continues to grow.)
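The delta-file idea can be shown with a toy model. This is only an illustration of the concept described above, not how VMware actually lays out its disk files; the class and the block contents are invented.

```python
class SnapshotDisk:
    """Toy copy-on-write disk: a frozen base plus a delta of later writes."""

    def __init__(self, blocks):
        self.base = dict(blocks)  # disk contents at snapshot time, never modified
        self.delta = {}           # every write made after the snapshot lands here

    def write(self, block_no, data):
        self.delta[block_no] = data

    def read(self, block_no):
        # Prefer the newer copy in the delta, fall back to the base.
        if block_no in self.delta:
            return self.delta[block_no]
        return self.base[block_no]

    def revert(self):
        # Reverting is cheap: throw the delta away, the base is untouched.
        self.delta = {}


disk = SnapshotDisk({0: "boot", 1: "app v1"})
disk.write(1, "app v2")   # the daily upgrade
disk.write(2, "logs")     # the delta grows with every changed block
after_upgrade = (disk.read(1), len(disk.delta))
disk.revert()
after_revert = (disk.read(1), len(disk.delta))
```

The same model shows the drawback: the longer the snapshot is kept, the more blocks end up in `delta`.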
If you sync VM time to host and host to DC, then it may not be a problem.
You can set VMware Tools to execute scripts after resume and similar power events. Create a script to clean up:
ipconfig /flushdns
ipconfig /renew
net stop w32time && net start w32time
You can also use klist from the Resource Kit to purge the Kerberos tickets, so that the machine gets new ones from your DC.
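If you would rather have the cleanup report what it ran and what failed, the same three steps can be wrapped in a small script. A sketch, assuming Python is installed in the guest; the step list mirrors the commands above, and the klist step is left out because its exact invocation depends on the klist version you have.

```python
import subprocess

# Each step is a chain of commands; a chain stops at the first failure,
# which mirrors "net stop w32time && net start w32time".
CLEANUP_STEPS = [
    [["ipconfig", "/flushdns"]],
    [["ipconfig", "/renew"]],
    [["net", "stop", "w32time"], ["net", "start", "w32time"]],
]


def run_cleanup(steps=CLEANUP_STEPS, runner=subprocess.call):
    """Run every step and return a list of (command line, exit code).

    runner takes an argument list and returns an exit code; passing a
    fake runner gives a dry run without touching the machine.
    """
    results = []
    for chain in steps:
        for cmd in chain:
            code = runner(cmd)
            results.append((" ".join(cmd), code))
            if code != 0:
                break  # skip the rest of this chain only
    return results


def dry_run(cmd):
    """Print the command instead of executing it and report success."""
    print("would run: " + " ".join(cmd))
    return 0


planned = run_cleanup(runner=dry_run)
```

Calling `run_cleanup()` with no arguments executes the commands for real, so that call belongs only in the script that VMware Tools runs inside the guest.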
I'd recommend running the DC inside your configuration if at all possible. Get everything configured the way you like, power off the whole configuration, undeploy it and capture it to the library. If you capture the machines while they're running, you could run into errors when you try to redeploy them (or be unable to redeploy at all) if your environment is non-homogeneous (e.g. trying to deploy a running machine that was started on an Intel processor to an AMD processor).
Rather than revert, you'll just deploy a fresh copy of the entire configuration, and you could even deploy multiple fenced copies of this configuration.
Remember, of course, that all machines in the configuration ultimately have to be deployed to the same physical host.