Hello denizens of Server Fault
I have an irritating problem with a LAN of about 100 computers, 2 Windows domain servers, and 12 VoIP phones. Since their installation around a year ago, every week or so, we notice a VoIP phone resetting itself - occasionally in the middle of a call. Simultaneously there are often signs of temporary loss of connection on computers: freezes in explorer while accessing network shares, errors in our administration software due to loss of connection to the database server.
I have been doing some Wireshark monitoring on the connection between the VoIP PBX and the rest of the network. Wireshark picks up a clump of retransmitted TCP packets at the times when we record phone restarts. The Wireshark log shows about 2 clusters of retransmissions a day ranging from 5 packets to hundreds. Those in each cluster are mainly between the PBX and some set of the VoIP phones, but not always the same set. Often retransmissions at the same time are to phones connected to the same switch, but sometimes retransmissions occur together to phones at opposite ends of the network. There are usually some coincident retransmissions in passing TCP traffic, for example between client machines and the file servers.
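For anyone wanting to make this kind of correlation less manual, here is a minimal sketch of how retransmission timestamps could be grouped into clusters and matched against logged phone restarts. It assumes the timestamps have already been exported from the capture as seconds; the function names, the 5-second gap, and the 30-second window are illustrative choices, not values from my setup.

```python
from typing import List, Tuple

Cluster = Tuple[float, float, int]  # (start, end, packet count)


def cluster_timestamps(times: List[float], max_gap: float = 5.0) -> List[Cluster]:
    """Group timestamps into clusters; a gap larger than max_gap starts a new one."""
    clusters: List[Cluster] = []
    for t in sorted(times):
        if clusters and t - clusters[-1][1] <= max_gap:
            start, _, count = clusters[-1]
            clusters[-1] = (start, t, count + 1)  # extend the current cluster
        else:
            clusters.append((t, t, 1))  # start a new cluster
    return clusters


def resets_near_clusters(clusters: List[Cluster], resets: List[float],
                         window: float = 30.0) -> List[float]:
    """Return the reset times that fall within `window` seconds of any cluster."""
    return [r for r in resets
            if any(start - window <= r <= end + window
                   for start, end, _ in clusters)]


# Hypothetical data: retransmission times and phone reset times, in seconds.
retransmissions = [10.0, 11.0, 12.5, 100.0, 101.0]
clusters = cluster_timestamps(retransmissions)
matched = resets_near_clusters(clusters, [20.0, 500.0])
```

Resets that never land near a cluster would suggest a second, unrelated cause.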
The spikes in retransmissions and phone resets do not correlate well with when the network is heavily loaded. They seem to occur slightly more during the day, but most in the evening, when traffic should be decreasing. They occur reasonably often late at night when most computers are turned off and traffic should be lowest.
Do you have any ideas that might help diagnose the cause of problems like this? One thing I have not yet tried, but should have, is updating the firmware of all the switches.
TCP retransmissions are usually due to network congestion. Look for a large number of broadcast packets at the time the issue occurs. If the percentage of broadcast traffic in your capture is above about 3% of the total traffic captured, then you definitely have congestion. Look for both data link layer (ARP) and network layer (name resolution) broadcasts on the network. If you find a high volume of broadcast traffic, you can trace it to the source from the capture data.
Gathering traffic statistics for your switches may show you have periods where you are running at or near capacity. This can lead to retries when responses don't come back within the initial timeout (often 3 seconds). This increases congestion momentarily until congestion mitigation mechanisms kick in.
Look for people using streaming media, as that can soak up bandwidth quickly.
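As a rough illustration of the broadcast check, the sketch below computes the broadcast share from a list of destination MAC addresses taken from a capture. It assumes the share is measured by frame count rather than by bytes, and that the addresses have already been exported as colon-separated strings; the function names are hypothetical, and the 3% threshold is the rule of thumb given above.

```python
from typing import Iterable

BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"  # Ethernet broadcast address


def broadcast_share(dst_macs: Iterable[str]) -> float:
    """Fraction of frames (by count) addressed to the Ethernet broadcast address."""
    macs = [m.lower() for m in dst_macs]
    if not macs:
        return 0.0
    return sum(m == BROADCAST_MAC for m in macs) / len(macs)


def exceeds_threshold(share: float, threshold: float = 0.03) -> bool:
    """True when the broadcast share is above the rule-of-thumb threshold."""
    return share > threshold


# Hypothetical sample: 1 broadcast frame out of 10.
sample = ["FF:FF:FF:FF:FF:FF"] + ["00:11:22:33:44:55"] * 9
share = broadcast_share(sample)
```

Running this over a window around each incident, rather than over the whole capture, should make a short storm stand out instead of being averaged away.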
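To turn switch counters into something comparable with link capacity, a sketch along these lines may help. It assumes you can sample a per-port octet counter twice (for example an SNMP interface octet counter), that the counter wraps at most once between samples, and that you know the link speed; the function name and parameters are illustrative.

```python
def link_utilization(octets_start: int, octets_end: int, interval_s: float,
                     link_bps: float, counter_bits: int = 32) -> float:
    """Average utilization (0.0 to 1.0) between two octet counter samples.

    Assumes at most one counter wrap between the two samples.
    """
    if interval_s <= 0 or link_bps <= 0:
        raise ValueError("interval_s and link_bps must be positive")
    # Modulo arithmetic handles a single wrap of a fixed-width counter.
    delta_octets = (octets_end - octets_start) % (2 ** counter_bits)
    return delta_octets * 8 / (interval_s * link_bps)


# Hypothetical sample: 37.5 MB transferred in 60 s on a 100 Mbit/s port.
utilization = link_utilization(0, 37_500_000, 60, 100_000_000)
```

Long sampling intervals hide short bursts, so an average that looks low does not rule out brief periods at or near capacity.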
You may be able to mitigate the problem for the phones by traffic shaping. This will just move the problem to other users.
Sounds like a spanning tree loop or a broadcast storm to me, especially if the retransmissions and the issues are localized to the same switch each time (even if that switch differs from one incident to the next). When it happens, what are the port states on your L2 devices? Possibly a bad switch or bad root bridge priorities. Interesting problem.
You have probably solved this since it has been so long, but essentially you need to enable "port fast" on the ports that have endpoints (VoIP phones, workstations, servers). A phone can send PDUs, so if it reboots it will cause an STP convergence to occur, which flushes the FDB table and makes all devices go through the 4/5 step STP fun. Ports with endpoints that are put in "port fast" skip the waiting and go right to forwarding mode.
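To put a number on "the waiting", here is a small sketch of the arithmetic. It assumes classic 802.1D spanning tree with the default forward delay of 15 seconds, where a port spends one forward-delay interval in listening and another in learning before forwarding; rapid spanning tree variants behave differently, and the function name is illustrative.

```python
def seconds_until_forwarding(forward_delay_s: float = 15.0,
                             portfast: bool = False) -> float:
    """Time a port waits after link-up before forwarding traffic.

    Assumes classic 802.1D STP: listening and learning each last
    one forward-delay interval. An edge ("port fast") port skips both.
    """
    if portfast:
        return 0.0
    return 2 * forward_delay_s


default_wait = seconds_until_forwarding()            # 30.0 with default timers
edge_wait = seconds_until_forwarding(portfast=True)  # 0.0
```

Under these assumptions, a port that is not in "port fast" mode passes no traffic for about half a minute after its link comes up.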
Hopefully your phones are on a different subnet and VLAN from the other computers?
It could also be a faulty piece of equipment, such as a switch. Do the retransmissions correlate to phones/computers on one particular switch or part of the network?
Just to extend my answer a little. Not all switches are created equal, even if they have the same specs. Some are able to cope with much higher load than others because they have faster processors inside. It could be that your switches are not quite up to grade.
I'd start by putting some of your most troublesome VoIP phones onto their own physical switch and see whether the resets on those continue. If they go away, then you're on the road to solving it very soon.