Disclaimer: I am a developer, not a system admin, please be gentle.
Where I work we are having a lot of intermittent network problems. Sometimes the DNS will fail, but access to servers can be done via IP, sometimes access via IP fails. So far as we can tell nothing has been changed on the servers, firewalls, managed switches etc. Also, frustratingly the faults don't cause issues with all users all of the time, but so far as we can tell, all the users have had problems at some point.
- The servers are not reporting any faults.
- The physical network seems fine (it's a small site).
- The firewalls aren't reporting anything out of the ordinary.
- The managed switches have passwords that are only stored in the sys admin's head (an issue we know!)
Our in house sysadmin is unavailable at the moment, so it's left to the developers to try and figure something out.
So, given that I have almost no clue, where do I start?
Update
I've tried the tracrt/ping combo and it looks like it's an internal issue. The external stuff seems to be fairly consistent, but the internal bits are proving to be flaky.
Traceroute to an internet site you know will be up. eg google.com. Then run a constant ping against 3 targets, your router, your routers default gateway and google.com.
This should at least tell you if your losing any packets along the way or if it's your internet or internal network having the problem.
After that post back if/when you've got the next answer.
It sounds like something is dropping connections somewhere.
Best advice though would be track down your sysadmin, thats why he/she is there...
It kinda sounds like you've got either a bad interface on a switch/server, or a rogue traffic source on the network. Without the ability to capture some spanned traffic or see interface stats, actually tracking either of those down would be neigh impossible. Have you added any new devices lately? Especially, in my personal order of suspicious devices: network devices, servers attached to more than one network, printers.
However, a lone sysadmin that has gone on vacation and left the shop with no visibility into the network is a very bad situation. Some things to discuss once he/she returns:
I was the sole network administrator for a multi-million dollar company for over 7 years (I have minions now =) and on-call 24/7/365 for pretty much that entire time and can say, pretty definitively, that if you've made yourself the only person that can do a certain thing, you can rest assured that you will be called whenever that thing needs doing.
The one thing you can 100% rely on is the probability that whatever can break when you're the only that can fix it is the the thing that is absolutely guaranteed to break when you leave for vacation.
Without access to your switches, your options are a bit limited in chasing down network issues. I'd start by checking interfaces on the servers; look for dropped packets or collisions. You could also use Wireshark or tcpdump to look at the actual traffic and see what's happening when your DNS servers aren't talking, but all of this is more efficiently accomplished when you can monitor things from the network end rather than the server end. If you really needed to, you could reset the passwords on the switches, but be prepared to deal with the wrath of your admin when he gets back...
Isolate the Problem:
The best you can is to try to isolate the problem I think. If you have multiple switches, are the problems happening to machines hooked up to only one of the switches? If it is happening to all the switches, and is not purely a DNS problem, then I would look at the router, or the connection between the switches and the router. It is possible it might be some sort of broadcast storm like problem, but that is less likely I think, and you probably are not going to fix it if it is. Has has been mentioned, tcpdump/wireshark and interface errors can help in this process as well.
Power Cycle Everything (Risky):
A second risky option is to just power cycle everything, or things one at a time to see if fixes the problem. I say this is risky because with lots of network equipment there is a running config, and a saved config. If the admin forgot to commit the running config to the startup config last time they did something, you will likely be in trouble after the reboot.