What are the top N tools / methodologies used to diagnose and repair network issues?
Given a LAN, for example, where users can consistently ping an outside server but any data-intensive connections are flaky, how would you begin solving the network issues?
I imagine issues like congestion, bandwidth constraints, throughput constraints, etc. are all factors, but I don't know how to diagnose those issues.
I'm especially interested in LAN environments (rather than WAN).
This is a very broad subject, but as a start, a network/packet sniffer is an invaluable tool. There are many available, and I tend to use several depending on the type of network problem. I use Wireshark primarily to troubleshoot problems between two specific devices (client to server, server to server, etc.), and I use Colasoft Capsa for troubleshooting congestion and utilization issues. Wireshark is very good at displaying the nuts and bolts of network conversations but can be a little overwhelming when you are trying to "visualize" a general congestion or utilization problem. Colasoft Capsa is much better for getting a "visualized" look at the problem.
If you're experiencing slow network communication or performance (due to congestion and/or high utilization), you'll want to look for the following:
A large number of network broadcasts (either at the network layer or the data link layer) and/or a large number of TCP retransmissions and slow acknowledgements.
A large number of broadcasts can be an indication of a misconfigured or faulty NIC, host, switch, software, driver, etc. and can also be an indication of a malware infection somewhere in the network. Broadcasts can cause congestion and in turn high utilization of network links (switch ports) as every host must listen to the broadcast traffic to determine if the traffic is intended for itself.
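To put a number on "a large number of broadcasts", one option is to export source and destination MAC addresses from a capture and count how much of the traffic goes to the broadcast address, and who sends it. The following is only a minimal sketch: the function name broadcast_report and the (src_mac, dst_mac) input format are assumptions made for illustration, not something Wireshark or Capsa produces in exactly this shape.

```python
from collections import Counter

# Ethernet broadcast destination address.
BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"


def broadcast_report(frames):
    """Summarise broadcast traffic in a capture.

    frames: iterable of (src_mac, dst_mac) string pairs (hypothetical export
    format, e.g. two columns taken from a sniffer's CSV export).
    Returns (fraction_of_frames_that_are_broadcast, Counter per source MAC).
    """
    total = 0
    per_source = Counter()
    for src, dst in frames:
        total += 1
        if dst.lower() == BROADCAST_MAC:
            per_source[src.lower()] += 1
    broadcasts = sum(per_source.values())
    fraction = broadcasts / total if total else 0.0
    return fraction, per_source


# Made-up sample data: three of the four frames are broadcasts.
sample = [
    ("00:11:22:33:44:55", "ff:ff:ff:ff:ff:ff"),
    ("00:11:22:33:44:55", "FF:FF:FF:FF:FF:FF"),
    ("66:77:88:99:aa:bb", "00:11:22:33:44:55"),
    ("66:77:88:99:aa:bb", "ff:ff:ff:ff:ff:ff"),
]
fraction, talkers = broadcast_report(sample)
print(f"{fraction:.0%} of frames are broadcasts")
print("Top broadcaster:", talkers.most_common(1))
```

A single source MAC that dominates the counter is a reasonable first place to look for a faulty NIC or an infected host.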
TCP retransmissions and slow acknowledgements are also an indication of network congestion. The congestion delays packets in both directions, so ACKs are received "late", forcing the transmitting host to retransmit the packet it is still waiting to have acknowledged. If congestion is bad enough to cause retransmissions and slow ACKs, you can be assured it's causing performance problems.
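As a rough illustration of what a sniffer flags as a retransmission, the sketch below walks the data segments of one direction of one TCP flow and counts those that carry no new bytes. It is a simplification under stated assumptions: the (seq, payload_len) input format and the name count_retransmissions are hypothetical, sequence-number wraparound is ignored, and real analyzers such as Wireshark also use timing and other heuristics.

```python
def count_retransmissions(segments):
    """Count data segments that only repeat bytes already sent.

    segments: list of (seq, payload_len) tuples for ONE direction of ONE
    TCP flow, in capture order (hypothetical export format).
    Returns (retransmitted_segments, total_data_segments).
    """
    highest_end = None      # highest sequence number seen so far (exclusive)
    retransmitted = 0
    data_segments = 0
    for seq, length in segments:
        if length == 0:
            continue        # pure ACKs carry no data, so skip them
        data_segments += 1
        end = seq + length
        if highest_end is not None and end <= highest_end:
            retransmitted += 1      # nothing new: bytes were sent before
        else:
            highest_end = end       # segment advances the stream
    return retransmitted, data_segments


# Made-up sample: the segments starting at 1000 and 1500 are each sent twice.
sample_flow = [(1000, 500), (1500, 500), (2000, 0),
               (1000, 500), (2000, 500), (1500, 500)]
retx, total = count_retransmissions(sample_flow)
print(f"{retx} of {total} data segments were retransmissions")
```

A retransmission share that stays above a percent or two on a LAN is usually worth investigating, though the exact threshold is a judgment call.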
I have used mtr (a Linux screen-based traceroute) with a slow ping rate (1 to 2 pings per minute) to monitor the response along the path to the end point. If one or more devices on the path are getting congested, they show dropped pings or slow responses. This may be a result of duplex issues.
Under some conditions I have seen Cisco routers appear to run half-duplex on links hard-coded to full-duplex. Duplex issues can be tested by transferring a large file in both directions: with a duplex problem, the transfer will be much slower in one direction than the other. In one case this made package installations over the network impossible.
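The two-way file transfer test can be reduced to a quick comparison of throughput in each direction. This is a minimal sketch: the function names and the ratio threshold of 3.0 are arbitrary assumptions chosen for illustration, and the transfer sizes and times below are made up.

```python
def throughput_mbps(num_bytes, seconds):
    """Average throughput of a transfer in megabits per second."""
    return num_bytes * 8 / seconds / 1_000_000


def duplex_suspect(up_mbps, down_mbps, ratio_threshold=3.0):
    """Return True when one direction is much slower than the other.

    ratio_threshold is an assumed cut-off, not a standard value.
    """
    slower, faster = sorted((up_mbps, down_mbps))
    if slower <= 0:
        return True     # one direction moved no data at all
    return faster / slower >= ratio_threshold


# Made-up measurements: the same 100 MB file copied in each direction.
up = throughput_mbps(100_000_000, 10)      # 10 seconds one way
down = throughput_mbps(100_000_000, 95)    # 95 seconds the other way
print(f"up {up:.1f} Mbit/s, down {down:.1f} Mbit/s")
print("possible duplex problem:", duplex_suspect(up, down))
```

An asymmetric result only points at the link; the duplex and speed settings on both ends still have to be checked by hand.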
Dumping the error counters on all the interfaces (host and router) along the way can be helpful. Error counters should be 0 or low, and should not be increasing at a noticeable rate.
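On a Linux host, one way to check whether error counters are increasing is to take two snapshots of /proc/net/dev some time apart and compare them. The sketch below works on the text of such snapshots; the function names are hypothetical and the sample snapshots are made up. Routers and switches need their own commands, which this does not cover.

```python
def parse_proc_net_dev(text):
    """Extract error and drop counters per interface from /proc/net/dev text."""
    counters = {}
    for line in text.splitlines()[2:]:      # the first two lines are headers
        if ":" not in line:
            continue
        name, values = line.split(":", 1)
        fields = values.split()
        if len(fields) < 16:                # 8 receive + 8 transmit columns
            continue
        counters[name.strip()] = {
            "rx_errs": int(fields[2]),
            "rx_drop": int(fields[3]),
            "tx_errs": int(fields[10]),
            "tx_drop": int(fields[11]),
        }
    return counters


def increasing_counters(before, after):
    """Return only the counters that grew between two snapshots."""
    growth = {}
    for iface, now in after.items():
        then = before.get(iface, {})
        delta = {k: v - then.get(k, 0) for k, v in now.items()
                 if v - then.get(k, 0) > 0}
        if delta:
            growth[iface] = delta
    return growth


HEADER = (
    "Inter-|   Receive                            |  Transmit\n"
    " face |bytes packets errs drop fifo frame compressed multicast"
    "|bytes packets errs drop fifo colls carrier compressed\n"
)
# Made-up snapshots taken some time apart; eth0 rx errors rise from 3 to 41.
SNAP_1 = HEADER + "    lo: 1000 10 0 0 0 0 0 0 1000 10 0 0 0 0 0 0\n" \
                  "  eth0: 500000 4000 3 0 0 0 0 0 250000 2000 0 0 0 0 0 0\n"
SNAP_2 = HEADER + "    lo: 2000 20 0 0 0 0 0 0 2000 20 0 0 0 0 0 0\n" \
                  "  eth0: 900000 7000 41 0 0 0 0 0 450000 3500 0 0 0 0 0 0\n"

growth = increasing_counters(parse_proc_net_dev(SNAP_1),
                             parse_proc_net_dev(SNAP_2))
print(growth)
```

On a live host the two snapshots would come from reading /proc/net/dev twice, for example a minute apart.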
I'm not sure whether you're a Windows shop, but if so, the following blog will be helpful. Also get the TCP Analyzer from Microsoft Research; it's pretty cool.
http://blogs.technet.com/b/netmon/
http://research.microsoft.com/en-us/projects/tcpanalyzer/
If you're experiencing slow-down of data-intensive TCP transmissions, you MAY be suffering from "global synchronization", where all TCP talkers on the line end up with roughly synchronous increases in TCP window size (the amount of outstanding bytes allowed before an ACK is received).
The larger the TCP transmit window is, the faster data is pumped between A and B. When multiple hosts send TCP traffic in the same direction through a single link, the bandwidth used on that link fluctuates with the window sizes of all hosts. While the link is not congested, each TCP window grows until packets start dropping (generally for all TCP sessions at the same time). Each sender then cuts its window back sharply, in the worst case to the lowest size it can be, which uncongests the link. Then, with no congestion in place, all senders increase their window sizes in sync.
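The synchronized ramp-and-collapse behaviour can be reproduced with a toy model. This is a deliberately crude sketch, not a TCP implementation: every sender adds one unit to its window per round, and when the combined windows exceed the link capacity, every sender halves its window in the same round (resetting to the minimum instead gives the same shape). The function name, capacity, and sender count are all assumptions chosen for illustration.

```python
def simulate_tail_drop(num_senders=4, capacity=100, rounds=60):
    """Toy model of synchronized TCP senders sharing one link.

    Returns the link utilisation (in window units) for each round.
    """
    windows = [1 + i for i in range(num_senders)]   # slightly different starts
    utilisation = []
    for _ in range(rounds):
        offered = sum(windows)
        utilisation.append(min(offered, capacity))  # link cannot exceed capacity
        if offered > capacity:
            # Queue overflows: assume every flow sees loss in the same round
            # and halves its window (multiplicative decrease).
            windows = [max(1, w // 2) for w in windows]
        else:
            # No loss: every flow grows its window by one (additive increase).
            windows = [w + 1 for w in windows]
    return utilisation


usage = simulate_tail_drop()
for round_no, used in enumerate(usage):
    print(f"{round_no:2d} {'#' * (used // 2)}")
```

Printed as a bar per round, the output climbs steadily to full capacity and then falls to about half, over and over, because all senders back off together.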
If you look at a short-interval utilisation graph, this should be obvious from a "saw-tooth" pattern: usage ramps up and then drops off sharply once full link capacity is reached. You'll probably want something that polls your WAN link usage more frequently than every five minutes. Off-hand I'd say that per-minute stats might be enough, but you may have to drop down to every 5-10 seconds.
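Short-interval utilisation figures are just the difference between two byte-counter readings divided by what the link could have carried in that time. A minimal sketch, assuming the readings come from an interface byte counter (for example SNMP ifInOctets or a host's interface statistics); the function name and the sample numbers are made up, and counter wraparound is ignored.

```python
def link_utilisation(bytes_before, bytes_after, interval_s, link_bps):
    """Fraction of link capacity used between two byte-counter readings."""
    bits_sent = (bytes_after - bytes_before) * 8
    return bits_sent / (interval_s * link_bps)


# Made-up readings: 62.5 MB moved in 10 seconds over a 100 Mbit/s link.
used = link_utilisation(0, 62_500_000, 10, 100_000_000)
print(f"{used:.0%} utilised over the last 10 seconds")
```

Polling every few seconds keeps the peaks and troughs visible; a five-minute average over the same data would blend them into one moderate-looking figure.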
The best way I know of to eliminate this is to enable weighted random early detection (WRED, also called weighted random early discard), which essentially throws packets away before the WAN link is congested, so that the TCP windows of the hosts on the LAN become uncoupled from each other.
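The core idea of random early detection can be shown with its basic drop-probability curve: nothing is dropped while the average queue is short, everything is dropped once it is too long, and in between the drop chance rises linearly. The sketch below shows only that curve under assumed thresholds; the traffic class names and numbers are hypothetical, and real router implementations add queue averaging and vendor-specific details.

```python
def red_drop_probability(avg_queue, min_th, max_th, max_p):
    """Basic RED curve: drop probability for a given average queue depth."""
    if avg_queue < min_th:
        return 0.0                      # short queue: never drop early
    if avg_queue >= max_th:
        return 1.0                      # queue too long: drop everything
    # Linear ramp from 0 at min_th up to max_p just below max_th.
    return max_p * (avg_queue - min_th) / (max_th - min_th)


# The "weighted" part: each traffic class gets its own thresholds.
# Class names and values are hypothetical: (min_th, max_th, max_p).
PROFILES = {
    "bulk": (20, 40, 0.10),
    "interactive": (30, 40, 0.10),
}

for avg in (10, 30, 35, 40):
    chances = {name: red_drop_probability(avg, *profile)
               for name, profile in PROFILES.items()}
    print(avg, chances)
```

Because each packet is dropped at random, different flows back off at different times, which is what breaks the synchronization.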