Connections to unreachable hosts are taking an extremely long time to time out. Recent testing in our lab suggests the cause may be a delay in reporting negative ARP lookups. We captured traffic while attempting to open a telnet connection to a local zone that was down for patching, and saw the following.
If the source was Linux, three ARP requests were sent at 1-second intervals, and the connection failed in just over three seconds.
If the source was a Solaris server, an initial five ARP requests were sent to the broadcast address at 1-second intervals. Five seconds later, more ARP requests were sent. ARP requests continued with increasing pauses until the connection failed after 3 minutes and 44 seconds. Tests were run from a global zone to a local zone hosted on a different global zone. Both global zones are running on SPARC hardware. The devices are connected via layer 2 switching equipment.
Are there any tunables which will result in a fast (3 to 5 seconds) ARP failure? Are there any other tunables which will cause connections to unreachable (downed) hosts to fail faster?
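For anyone who wants to reproduce the timings above without telnet, here is a minimal Python sketch that times a single TCP connection attempt. The function name timed_connect and the address 192.0.2.10 are placeholders of mine, not anything from our environment. Note that the timeout argument only caps how long the application waits; it does not change how the kernel retries ARP, so it is a workaround rather than the tunable I am asking about.

```python
import socket
import time


def timed_connect(host, port, timeout=5.0):
    """Attempt one TCP connection; return (connected, seconds_elapsed).

    timeout is an application-level cap in seconds. Pass None to block
    until the operating system itself gives up, which is the case that
    took 3 minutes 44 seconds on Solaris in the tests described above.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            connected = True
    except OSError:
        # Covers timeouts, refused connections and unreachable hosts.
        connected = False
    return connected, time.monotonic() - start


if __name__ == "__main__":
    # 192.0.2.10 is a documentation-range placeholder; substitute the
    # address of the downed zone. Port 23 matches the telnet test.
    ok, elapsed = timed_connect("192.0.2.10", 23, timeout=5.0)
    print("connected=%s elapsed=%.2fs" % (ok, elapsed))
```

Running it once with timeout=None and once with a short timeout on both the Linux and Solaris sources should show the difference between the OS-level failure time and the application-level cap.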
We see the same or similar behavior across a variety of servers running on SPARC. As far as I can tell, Solaris tries very hard to resolve the address via ARP, and does not time out quickly when no host replies to the ARP requests.
Did you consider running
ndd /dev/arp \?
to see a list of ARP-related kernel configurables?