I'm trying to diagnose a network-related problem. Please read these points before suggesting an answer (apologies if more information is required; I will add anything people ask for).
- We have a server-only network (5 app servers, 4 db servers, a few other servers) that appears to be suffering packet loss between servers
- I can see this happening in Wireshark: there are a lot of TCP Retransmissions, TCP Out-of-Order, TCP DupACK and, I think, some TCP ZeroWindow packets too
- There appear to be a lot of bad checksums at the IP layer
- I think the network adapters have a very constant and high (90-100%) load due to the extra retries caused by this packet loss
- As the external requests on this network increase (to the app servers) the network performance decreases
- The app servers generate their own traffic when handling external requests
- The external requests come through a core router and the network is on its own segment
- This high load "magically" disappeared after 1-2 days. I say magically because we were only monitoring the adapters at the time the load dropped. There is still packet loss showing in Wireshark, albeit a lesser amount
- Nothing points to a compromised server.
- Unfortunately we don't have physical access to any of the hardware
- We can't disrupt the current service
Given the above, what is the best way to determine what is causing the packet loss? (We suspect a managed switch.)
Is there any software that can provide us with empirical evidence of what is causing the issues?
Thanks in advance
In my experience Wireshark can return unreliable results on interfaces that are using hardware TCP-Offload. Duplicate packets are one of the symptoms of that.
That said, if you're grabbing your captures from a span/mirror port, the duplicate ACKs you see really are on the wire, and that is a significant problem.
Duplicate ACKs, out-of-order packets, and retransmits are signals that the TCP stack on something is not behaving right. Correlating which network nodes are prone to throwing the errors will help isolate which hosts need further investigating. Any differences between a span/mirror port capture and a Wireshark session run on that specific node should help highlight problems that may be happening on that node. If you see differences, investigate updating the network drivers, as those are frequently the easiest fix for that kind of issue (Broadcom is sadly notorious for this). Second to that, updating the firmware for the NICs can help as well.
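To make the per-host correlation concrete, here is a rough sketch in Python. It assumes you have already exported the flagged packets from a capture as tab-separated source and destination addresses, for example with something like `tshark -r capture.pcap -Y "tcp.analysis.retransmission || tcp.analysis.duplicate_ack || tcp.analysis.out_of_order" -T fields -e ip.src -e ip.dst`. The file name, the function name `tally_by_host`, and the sample addresses are all made up for illustration.

```python
from collections import Counter


def tally_by_host(lines):
    """Count flagged packets per source host and per (src, dst) pair.

    Each line is expected to be "<src>\t<dst>", which is an assumption
    about how the capture was exported. Malformed lines are skipped.
    """
    by_src = Counter()
    by_pair = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2 or not parts[0] or not parts[1]:
            continue  # non-IP packet or incomplete row
        src, dst = parts[0], parts[1]
        by_src[src] += 1
        by_pair[(src, dst)] += 1
    return by_src, by_pair


# Hypothetical sample data standing in for the exported capture.
sample = [
    "10.0.0.11\t10.0.0.21\n",
    "10.0.0.11\t10.0.0.21\n",
    "10.0.0.11\t10.0.0.22\n",
    "10.0.0.12\t10.0.0.21\n",
    "\t\n",
]

by_src, by_pair = tally_by_host(sample)
for host, count in by_src.most_common():
    print(host, count)
```

A host or host pair that dominates the counts is where I'd start looking. If the errors are spread evenly over everything, that points more toward shared infrastructure (or plain overload) than toward one bad NIC or driver.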
If everything there looks healthy, you could just be seeing the normal flailing about wildly that TCP does when there is just plain too much traffic to handle.
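One way to check the "too much traffic" theory is to sample an interface's byte counter twice and work out the utilization yourself. The arithmetic is sketched below. Where the counter values come from (SNMP, the OS interface statistics, the switch) is up to you, and the function name and numbers here are purely illustrative. The sketch also does not handle counter wrap-around.

```python
def utilization_percent(bytes_start, bytes_end, interval_s, link_bps):
    """Average link utilization over an interval, as a percentage.

    bytes_start / bytes_end: byte counter readings at the start and end
    interval_s: seconds between the two readings
    link_bps: nominal link speed in bits per second
    """
    if interval_s <= 0 or link_bps <= 0:
        raise ValueError("interval and link speed must be positive")
    bits_transferred = (bytes_end - bytes_start) * 8
    return 100.0 * bits_transferred / (interval_s * link_bps)


# Example: 125,000,000 bytes in 10 seconds on a 1 Gbit/s link.
print(utilization_percent(0, 125_000_000, 10, 1_000_000_000))
```

Figures that sit near 100% during the bad periods would support the congestion explanation. Low utilization alongside heavy retransmits would point back at drivers, firmware, or the switch.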
TCP Zero-Window is also a sign of an unhealthy TCP/IP stack, though in my experience it sometimes occurs when two different TCP/IP stacks aren't getting along, as can happen with Windows 2008 and certain older TCP/IP stacks in the Linux space.