We're lucky: every server we have has multiple NICs/HBAs/CNAs connected to multiple switches, and this approach has kept our platform up on numerous occasions. That said, we ran into a problem last week that I'm not sure how to fix.
A switch that was carrying a good chunk of our traffic crashed (the details aren't important, but it was a Cisco 6509 that had a hard CPU crash and didn't come back up automatically). Unfortunately, its line cards kept working (i.e. L1 and L2 stayed up), but it lost all of its uplinks. The following servers were connected to it:
- Windows Server 2003 32-bit EE SP2 with Veritas Storage Foundation
- Oracle Enterprise Linux 5.3 64-bit
- VMware ESXi 4.0
- NetApp 3040 running Data ONTAP 7.3.2
None of these machines detected the crashed switch; they all kept sending traffic its way instead of moving it to another switch.
I need help looking at my options for better multipathing. This can't be the first time this has happened, so there must be other ways of doing this (polling the HSRP interfaces, for instance). Can you help?
Thanks in advance.
If the switches between your Cisco 6509 and your servers are also Cisco, you have the option of having them shut down the server-facing ports automatically when their uplinks fail. You define a group of "upstream" ports and a group of "downstream" ports. If all the upstream ports go down, the switch takes the downstream ports down as well, so the servers see a link failure and fail over to their other NICs.
It is called link state tracking and it is designed for situations like yours.
You will find a little info on this page.
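As a rough sketch of what this could look like on a Catalyst access switch that supports link state tracking: the group number, interface names and port range below are assumptions for illustration only, and support varies by platform and IOS release, so check the configuration guide for your switch before relying on it.

```
! Hypothetical example: group number 1 and all interface names are assumptions.
! Enable link state tracking group 1 globally.
link state track 1
!
! Uplinks towards the core switch: these are the "upstream" ports.
interface GigabitEthernet0/1
 description Uplink to core (upstream)
 link state group 1 upstream
!
interface GigabitEthernet0/2
 description Second uplink to core (upstream)
 link state group 1 upstream
!
! Server-facing ports: these are the "downstream" ports. They are
! error-disabled when every upstream port in group 1 is down.
interface range GigabitEthernet0/3 - 24
 description Server-facing ports (downstream)
 link state group 1 downstream
```

Once configured, `show link state group detail` should list the upstream and downstream ports and the current state of the group. The downstream ports come back automatically when at least one upstream port recovers.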