If I'm running VMware 5.x on FC LUNs presented from a NetApp running Data ONTAP 8.x in 7-Mode, what steps should I take to ensure that the VMs can withstand a worst-case cluster node failover? HBA settings? ESX settings? Modifications to the VMs' settings? I know this is all handled by the host kit from NetApp, but we want to know what it's doing and/or do it ourselves.
A cf takeover on NetApp is usually quite fast, but can in some cases run up to 3 minutes.
Mainly you want to make sure your VMs will be OK with no disk I/O for that long. Installing VMware Tools takes care of this with 4.x and newer. You also want to apply the right timeout settings on your ESX hosts; these are detailed on the NetApp site in TR-3749.
Really, it does not matter which protocol you are using for cluster failovers, since the guests will see the same thing: their disks will be offline/read-only for a brief period. This script can help you set the NetApp-recommended settings on the VMs, but I am pretty sure the VMware Tools install does it for you.
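If you want to see what the guest is actually set to rather than trust the tools install, a Linux VM exposes the per-disk SCSI timeout under /sys. The following is only a rough sketch: the 190 second target is my assumption (something comfortably above the 3 minute worst case), so check TR-3749 for the value that applies to your versions, and timeout_ok is just a helper name made up for this example.

```shell
#!/bin/sh
# Assumed target in seconds; NOT taken from this thread. Confirm against TR-3749.
TARGET=${TARGET:-190}

# Hypothetical helper: succeeds when a disk timeout ($1) covers the target ($2).
timeout_ok() {
    [ "$1" -ge "$2" ]
}

# Walk the SCSI disks of a Linux guest and report each one's timeout.
for f in /sys/block/sd*/device/timeout; do
    [ -r "$f" ] || continue   # no sd* disks: the glob stays literal, so skip it
    cur=$(cat "$f")
    if timeout_ok "$cur" "$TARGET"; then
        echo "$f: ${cur}s OK"
    else
        echo "$f: ${cur}s is below ${TARGET}s"
    fi
done
```

This only reads the values. Writing a new number into the same file as root changes it for the running system, but that does not survive a reboot, which is presumably why the tools and the NetApp script handle it for you.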
On the ESX/ESXi side it depends on what versions of ESX/ONTAP you are running, but it's not something intended for you to edit. You can force MRU/RR etc., but that only controls how the host selects among the paths to its LUNs (unless you are doing something odd in your switching, this should be pretty much set and forget). You can't actually speed up the failover itself; it's a function of how long it takes for the controller to boot and sync and then replay the data in NVRAM. VMware's and NetApp's PSP/SATP work together to manage this stuff. For more info you can use esxcli to look at (and edit a limited number of) these settings. Try esxcli storage nmp and esxcli storage san fc to dig in, but personally I wouldn't mess with this stuff unless I had a specific need; even if you manage to squeeze some performance improvements out, you may end up breaking compatibility with future releases.
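For anyone who wants to look before deciding whether to touch anything, here is a read-only sketch of the esxcli namespaces mentioned above. It assumes you are in the ESXi shell on a 5.x host; the san namespace may not be present on every 5.x build, and show_multipath is just a wrapper name invented for this example. Nothing here changes a setting.

```shell
#!/bin/sh
# Hypothetical wrapper: prints the multipathing state, changes nothing.
show_multipath() {
    if ! command -v esxcli >/dev/null 2>&1; then
        echo "esxcli not found; run this in the ESXi shell"
        return 0
    fi
    esxcli storage nmp satp list     # available SATPs and their default PSP
    esxcli storage nmp device list   # SATP and PSP in use for each LUN
    esxcli storage nmp path list     # state of every path to every LUN
    esxcli storage san fc list       # FC HBA attributes (may be absent on older builds)
}

show_multipath
```

The device list output is the one to check first: it shows which SATP claimed each NetApp LUN and which path selection policy is in effect, which tells you what the host kit has already set.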
For the HBAs it depends; the main thing to look at there is queue depth, but that depends on how you have your workload configured. On the filer side you shouldn't need to change the queue settings; it's supposed to adjust itself based on workload.
If you're using FC, I highly recommend at least trying NFS as well; it can make your life so much easier.