I am an analyst programmer at my organisation and am seeing an intermittent time-out issue when using CVS and HTTP requests within our network.
After the time-out the request does complete, though it takes just over 60 seconds, which is why I'm guessing that some sort of time-out and fail-over is happening.
I'd like to find out what the issue is, if possible. I'm assuming there's a bad route being made somewhere or there's something wrong with one of the DNS servers. The infrastructure team has told me there isn't any issue with the network, which personally I think is a cop-out.
I have root access to two Linux (RHEL 5.4) machines.
Please excuse me if this is obvious; I'm a software developer, not a network engineer.
UPDATE
I thought I might mention that this problem occurs between clients and the CVS server, and between VPN clients and the HTTP server. Our VPN clients do not reverse resolve, and I've asked the network engineers to amend that, but they don't see it as a problem.
Places often screw up their reverse DNS records. You can tell yours are broken if you run something like
netstat -a
and it takes a really long time to run and you get back a bunch of IP addresses in the RFC 1918 address space. Not having reverse records in this space isn't really a problem by itself, but it is a problem if your DNS people forward their DNS requests to the providers or to a broken DNS server.
A quick way to verify whether it is a DNS issue is to log onto the system, look up an IP of someone connected to the system (look at netstat -a and look for established connections) and then run
if you've got an older system, you may need to type
In either case, the result may be something like "can't find that address" but the answer needs to come back quickly. DNS timeouts can be on the order of seconds, and if you have 3 DNS servers in your resolv.conf, your server is going to try each one before it gives up. This can easily add up to a really annoying amount of time.
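To put a rough number on "a really annoying amount of time", here is a small Python 3 sketch that reads resolv.conf-style text and multiplies servers by timeout by attempts. The defaults of 5 seconds and 2 attempts are the ones documented in resolv.conf(5). The function names and the sample addresses are made up for illustration, and the model is a simplification: the real resolver adjusts its per-server timeouts, so treat the result as a ballpark rather than an exact figure.

```python
def parse_resolv_conf(text):
    """Return (nameservers, timeout_seconds, attempts) from resolv.conf text."""
    nameservers = []
    timeout, attempts = 5, 2  # defaults documented in resolv.conf(5)
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", ";")):
            continue  # skip blank lines and comments
        fields = stripped.split()
        if fields[0] == "nameserver" and len(fields) > 1:
            nameservers.append(fields[1])
        elif fields[0] == "options":
            for opt in fields[1:]:
                name, _, value = opt.partition(":")
                if name == "timeout" and value.isdigit():
                    timeout = int(value)
                elif name == "attempts" and value.isdigit():
                    attempts = int(value)
    # The resolver only uses the first three nameserver entries.
    return nameservers[:3], timeout, attempts


def rough_worst_case_seconds(text):
    """Simplified estimate: every server times out on every attempt."""
    servers, timeout, attempts = parse_resolv_conf(text)
    return len(servers) * timeout * attempts


# Hypothetical resolv.conf contents with three servers and default options.
SAMPLE = """\
nameserver 10.0.0.1
nameserver 10.0.0.2
nameserver 10.0.0.3
"""

if __name__ == "__main__":
    print(rough_worst_case_seconds(SAMPLE))  # 3 servers * 5 s * 2 attempts = 30
```

To check a real machine, pass the contents of /etc/resolv.conf instead of SAMPLE.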
A quick way to illustrate the problem to your boss is to run
netstat -an
and then run
netstat -a
and then say "if our DNS was working properly, these would both run in almost exactly the same amount of time."
If it is a reverse-record issue, you can probably "fix" the problem by turning off reverse lookups in your applications. In this situation, it may be easier than getting another group involved.
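If you want a number rather than a feeling, you can also time a reverse lookup directly. This is only a sketch, assuming Python 3 is available on the box (the stock Python on RHEL 5 is older); the function name timed_reverse_lookup is mine. It uses the standard library's socket.gethostbyaddr, which goes through the same system resolver your applications use.

```python
import socket
import sys
import time


def timed_reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return (hostname or None, seconds taken) for a reverse lookup of ip."""
    start = time.monotonic()
    try:
        name = resolver(ip)[0]
    except OSError:
        # socket.herror and socket.gaierror are OSError subclasses;
        # a missing reverse record ends up here.
        name = None
    elapsed = time.monotonic() - start
    return name, elapsed


if __name__ == "__main__":
    # Pass the IPs of a few established connections on the command line.
    for ip in sys.argv[1:]:
        name, elapsed = timed_reverse_lookup(ip)
        print("%s: %s in %.2fs" % (ip, name or "no reverse record", elapsed))
```

A "no reverse record" answer that comes back in a few milliseconds is fine. One that takes several seconds is the delay your users are feeling.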
There is also the remote possibility that there is a duplex mismatch between your servers and their switches. That can be tested by looking at the output of (windows) netstat -e or (unix) netstat -i. You're looking for "errors" or "collisions". If you see "collisions" then your end is mis-configured; it is half duplex and should be full duplex. If you see "errors" the switch end is half duplex and you're full duplex. Both counters should be zero, or at least small and not increasing. These problems can be really hard to track down because the link will work pretty well if it is unloaded and totally fall apart when there is lots of traffic.
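On Linux the same counters are also exposed in /proc/net/dev, which is handy because not every version of netstat -i prints a collisions column. Below is a rough Python 3 sketch that pulls out receive errors, transmit errors and collisions per interface. The function name and the sample text are made up for illustration; the column positions follow the header line that /proc/net/dev prints.

```python
import os


def parse_proc_net_dev(text):
    """Map interface name -> rx_errs, tx_errs and collisions counters."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        iface, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or len(fields) < 16:
            continue
        counters[iface.strip()] = {
            "rx_errs": int(fields[2]),      # receive: bytes packets errs ...
            "tx_errs": int(fields[10]),     # transmit: bytes packets errs ...
            "collisions": int(fields[13]),  # transmit "colls" column
        }
    return counters


# Made-up sample in the /proc/net/dev layout.
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    1000      10    0    0    0     0          0         0     1000      10    0    0    0     0       0          0
  eth0:500000    4000   12    0    0    12          0         0   300000    3500    0    0    0    57       0          0
"""

if __name__ == "__main__":
    path = "/proc/net/dev"
    if os.path.exists(path):
        with open(path) as handle:
            text = handle.read()
    else:
        text = SAMPLE
    for iface, stats in sorted(parse_proc_net_dev(text).items()):
        print(iface, stats)
```

Run it twice a few minutes apart while the link is busy. What matters is whether the numbers are climbing, not their absolute values.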
If the request completes, then it's not a timeout issue. If it were a timeout issue the request would never complete, hence the name "timeout". Do you mean that some requests time out and some complete after a long period of time? That would make more sense than what you've stated in your post.
As far as tracking down the issue, there are a lot of areas to look at. Here are a few suggestions to get you started:
Run a tracert (traceroute on Linux) from a client machine to the server in question. Count how many hops it goes through. Each hop is a router of some sort. If the tracert goes directly from your client machine to the server, then there are no routers in the path.
Run a pathping from a client machine to the server in question and look for latency and packet loss between the two.
Install a packet sniffer on the server and start a capture. Submit a request from the client and look at the output of the packet sniffer on the server. If you see a significant delay between the request and reply in the sniffer output then it's a server issue. If there's no significant delay then it's a network issue.
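Before setting up a capture, it can be worth checking from the client side which phase of a connection eats the time. This is not a replacement for the sniffer, just a quick first cut: a Python 3 sketch that times name resolution and the TCP connect separately. The function name and the host cvs.example.internal are placeholders I made up; 2401 is the usual CVS pserver port, so adjust it if your server is reached some other way.

```python
import socket
import sys
import time


def time_connection_phases(host, port, timeout=90.0):
    """Return seconds spent resolving host and completing the TCP connect."""
    timings = {}

    start = time.monotonic()
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    timings["resolve"] = time.monotonic() - start

    # Use the first address the resolver returned.
    family, socktype, proto, _canonname, sockaddr = infos[0]
    start = time.monotonic()
    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        sock.connect(sockaddr)
        timings["connect"] = time.monotonic() - start
    return timings


if __name__ == "__main__":
    # Example: python3 phases.py cvs.example.internal 2401
    if len(sys.argv) == 3:
        for phase, seconds in time_connection_phases(sys.argv[1], int(sys.argv[2])).items():
            print("%-8s %.3fs" % (phase, seconds))
```

If "resolve" is slow, look at DNS on the client. If both phases are fast and the request still stalls, the delay happens after the connection is up, and the server-side capture is the next step.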