I'm testing the resilience of one of our test systems. We have 2 x DB (SQL 2008 R2 on Server 2008 R2, running in ESXi VMs) arranged in a Failover Cluster.
Shutting down the Active SQL Server service doesn't do much - the service is not restarted and no failover occurs; I under stand this is by design - the system assumes that the admin had good reason to turn off the service and so will sit quietly.
However, we can simulate a failure in a number of ways - I found the simplest was to just kill the SQL service in Task Manager. Our cluster is set to allow one failure in 6 hrs, so after this first failure, it tries to restart the service - which succeeds. Kill the service a second time (within 6 hrs) and the cluster manager will decide to fail the DB over to the passive server. So far so good...
If will kill the service on the second server, it restarts again. But when we kill the service for a second time, it doesn't fail back over to the first server.
I'm assuming that this is also by design; it makes sense, in that why fail back over to a server that itself wasn't stable enough only minutes earlier? This sounds logical, but is it true? And if so, does to obey the same timeout period (i.e. 6 hrs), and can this be reset?
Basically, before I tell my colleagues the failover features are working, I just want to confirm/clarify my understanding and assumptions.
Some other things you can test:
Try shutting down the boxes (even turn off the power to get a better simulation). Also unplug the network cables and disable the connection between the servers.
(though admittedly, it is usually the software that seems to cause a failover)
to set restart policies:
Open Cluster Administrator.
In the console tree, click the Resources folder.
In the details pane, click the resource you want.
On the File menu, click Properties.
On the Advanced tab, make the changes you want.
Seems like you want to look at the following settings: time-out, failover threshold, and failover period for resources. The time-out controls how long the Cluster service waits for the resource to shut down. The failover threshold and period control how many times the Cluster service attempts to fail over a resource in a particular period of time.