We have been experiencing extremely slow timeouts for unreachable hosts. Recent testing in our lab suggests the cause may be a delay in reporting negative ARP lookups. Dumping traffic during attempts to open a telnet connection to a local zone that was down for patching showed the following.
If the source was Linux, three ARP requests were sent at 1-second intervals, and the connection failed in just over three seconds.
If the source was a Solaris server, an initial five ARP requests were sent to the broadcast address at 1-second intervals. Five seconds later, more ARP requests were sent. ARP requests continued with increasing pauses until the connection failed after 3 minutes and 44 seconds. Tests were run from a global zone to a local zone hosted on a different global zone. Both global zones run on SPARC hardware. The devices are connected via layer 2 switching equipment.
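For anyone who wants to reproduce the timing comparison above without watching a telnet session by hand, here is a minimal sketch that times a single TCP connection attempt. It is not what we ran in the lab (we used telnet and a packet dump); the function name time_connect and its parameters are made up for illustration, and port 23 is assumed only because the test used telnet.

```python
import socket
import time


def time_connect(host, port=23, timeout=None):
    """Time one TCP connect attempt.

    Returns (elapsed_seconds, error), where error is None on success
    or the OSError raised by the failed attempt.
    timeout=None leaves the wait entirely to the operating system,
    which is the behavior being measured here.
    """
    start = time.monotonic()
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)  # None means block until the OS gives up
    try:
        sock.connect((host, port))
        error = None
    except OSError as exc:  # covers refused, unreachable, and timed out
        error = exc
    finally:
        sock.close()
    return time.monotonic() - start, error
```

Calling time_connect with the address of a downed host on the same subnet and printing the elapsed time should show the difference between the Linux and Solaris sources. Passing a timeout (for example timeout=5) caps the wait inside the application, but it does not change how long the system itself keeps sending ARP requests.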
Are there any tunables which will result in a fast (3 to 5 seconds) ARP failure? Are there any other tunables which will cause connections to unreachable (downed) hosts to fail faster?
We see the same or similar behavior across a variety of servers running on SPARC. As far as I can tell, Solaris tries very hard to resolve the address via ARP and does not time out quickly when no host replies to the ARP requests.