Today a number of our machines lost internet access. After a lot of troubleshooting, the common thread turned out to be that they all had their DHCP lease renewed today (we're on 8-day leases here).
Everything you would expect looks good after the lease renewal: they have a valid IP address, DNS server, and gateway. They have access to internal resources (file shares, intranet, printers, etc.). A little more troubleshooting reveals they are unable to ping or tracert to our gateway, but they can reach our core layer 3 switch just in front of the gateway. Assigning a static IP to the machine works as a temporary solution.
One final wrinkle is that so far reports have only come in for clients on the same VLAN as the gateway. Our administrative staff and faculty are on the same VLAN as the servers and printers, but phones, key fob/cameras, students/wifi, and labs each have their own VLANs, and as far as I've seen nothing on any of the other VLANs has had a problem yet.
I have a separate ticket in with the gateway vendor, but I suspect they'll take the easy out and tell me the problem is elsewhere on the network, so I'm asking here as well. I've cleared the ARP caches on the gateway and core switch. Any ideas welcome.
Update:
I tried pinging from the gateway back to some affected hosts, and the odd thing is that I did get a response: from a completely different IP address. I tried a few more at random and eventually got this:
Fri Sep 02 2011 13:08:51 GMT-0500 (Central Daylight Time)
PING 10.1.1.97 (10.1.1.97) 56(84) bytes of data.
64 bytes from 10.1.1.105: icmp_seq=1 ttl=255 time=1.35 ms
64 bytes from 10.1.1.97: icmp_seq=1 ttl=255 time=39.9 ms (DUP!)
10.1.1.97 is the actual intended target of the ping. 10.1.1.105 is supposed to be a printer in another building. I have never seen a DUP in a ping response before.
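For anyone who wants to spot this pattern across many hosts, here is a minimal Python sketch that scans Linux-style ping output for replies from an address other than the target and for replies marked (DUP!). The function name audit_ping_replies is my own invention, and the sample lines are the ones from the output above.

```python
import re

# Matches reply lines such as:
#   64 bytes from 10.1.1.105: icmp_seq=1 ttl=255 time=1.35 ms
REPLY_RE = re.compile(r"bytes from (\d+\.\d+\.\d+\.\d+): icmp_seq=(\d+)(.*)")


def audit_ping_replies(target, lines):
    """Return (wrong_source, duplicates), each a list of (source_ip, icmp_seq)."""
    wrong_source, duplicates = [], []
    for line in lines:
        m = REPLY_RE.search(line)
        if not m:
            continue  # header, statistics, or blank line
        source, seq, rest = m.group(1), int(m.group(2)), m.group(3)
        if source != target:
            wrong_source.append((source, seq))  # someone else answered
        if "(DUP!)" in rest:
            duplicates.append((source, seq))    # second reply to one request
    return wrong_source, duplicates


sample = [
    "PING 10.1.1.97 (10.1.1.97) 56(84) bytes of data.",
    "64 bytes from 10.1.1.105: icmp_seq=1 ttl=255 time=1.35 ms",
    "64 bytes from 10.1.1.97: icmp_seq=1 ttl=255 time=39.9 ms (DUP!)",
]
wrong, dups = audit_ping_replies("10.1.1.97", sample)
print("replies from unexpected sources:", wrong)
print("duplicate replies:", dups)
```

Either list being non-empty suggests that more than one device is answering for the same request, which is worth chasing with a packet capture.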
My best guess at the moment is a rogue wifi router in one of our dorm rooms on the 10.1.1.0/24 subnet with a bad gateway.
...continued. I've now powered down the offending printer, and pings to an affected host from the gateway just fail completely.
Update 2:
I checked the ARP tables on an affected machine, the gateway, and every switch between them. At each point, the entries for those devices were all correct. I didn't verify every entry in each table, but every entry that could possibly affect traffic between the host and the gateway was okay. ARP is not the problem.
Update 3:
Things are working at the moment, but I can't see anything I did to fix them and so I have no idea whether this might be just a temporary lull. Anyway, there's not much I can do to diagnose or troubleshoot now, but I'll update more if it breaks again.
"My best guess at the moment is a rogue wifi router in one of our dorm rooms on the 10.1.1.0/24 subnet with a bad gateway."
This happened in my office. The offending device turned out to be a rogue Android device:
http://code.google.com/p/android/issues/detail?id=11236
If the Android device gets the gateway's IP from another network via DHCP, it may join your network and start responding to ARP requests for the gateway IP with its own MAC. Your use of the common 10.1.1.0/24 network increases the probability of this rogue scenario.
I was able to check the ARP cache on an affected workstation on the network. There, I observed an ARP flux problem where the workstation would flip-flop between the correct MAC and a MAC address from some rogue device. When I looked up the suspicious MAC the workstation had for the gateway, it came back with a Samsung prefix. The astute user with the troubled workstation replied that he knew who had a Samsung device on our network. Turned out to be the CEO.
As already discussed in the comment section, getting a packet capture is really critical. However, there is also a really great tool called arpwatch:
http://ee.lbl.gov/
(or http://sid.rstack.org/arp-sk/ for Windows)
This tool will email you, or just keep a log, whenever it sees a new MAC address on the network or a change in the MAC address for an IP on a given subnet (flip-flops). For your issue it would have caught either of the current theories: it would have reported flip-flops for the IPs whose MACs kept changing, or it would have shown a new MAC for the rogue DHCP router when it first started communicating with hosts. The one downside of the tool is that the host running it needs to be connected to every network you monitor, but that is a small price for the information it provides when diagnosing these sorts of issues.
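To illustrate the idea, here is a minimal Python sketch of the kind of bookkeeping described above. It is not arpwatch's actual code: the ArpTracker class, its event strings, and the addresses in the usage example (10.1.1.1 as the gateway, plus both MACs) are made up for illustration. It only classifies (IP, MAC) observations that you feed it from some other source, such as a packet capture.

```python
class ArpTracker:
    """Classify (ip, mac) sightings as new, changed, or flip-flopping."""

    def __init__(self):
        self.current = {}   # ip -> MAC most recently seen for that IP
        self.previous = {}  # ip -> MAC that was replaced by the current one

    def observe(self, ip, mac):
        """Record one sighting and return an event name, or None if unchanged."""
        mac = mac.lower()
        known = self.current.get(ip)
        if known is None:
            self.current[ip] = mac
            return "new station"
        if known == mac:
            return None
        # The IP moved to a different MAC. If it moved back to the MAC it
        # had just before, two devices are probably fighting over the IP.
        event = "flip flop" if self.previous.get(ip) == mac else "changed address"
        self.previous[ip] = known
        self.current[ip] = mac
        return event


# Hypothetical addresses: a gateway IP claimed by two different MACs.
GATEWAY = "10.1.1.1"
GOOD_MAC = "00:aa:bb:cc:dd:01"
ROGUE_MAC = "00:aa:bb:cc:dd:99"

tracker = ArpTracker()
events = [tracker.observe(GATEWAY, mac)
          for mac in (GOOD_MAC, GOOD_MAC, ROGUE_MAC, GOOD_MAC, ROGUE_MAC)]
print(events)
```

Repeated "flip flop" events for the gateway IP are the signature of the ARP flux problem described in the answer above.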
A quick way of detecting a typical rogue DHCP server is to ping the gateway address that it hands out and then examine that address's MAC in the local ARP table. If the switching infrastructure is managed, the MAC can also be tracked down to the port hosting it, and the port can either be shut down or traced back to the location of the offending device for further redress.
Enabling DHCP snooping on switches that support it can also be an effective way of protecting a network from rogue DHCP servers.
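As a rough sketch of that check in Python: after pinging the gateway, capture the text of the ARP table (for example the output of arp -a) and pull out the MAC recorded for the gateway IP. The helper name mac_for_ip and all addresses below are hypothetical, and the parser assumes each entry has the IP and MAC on one line, which holds for the common Windows and Linux arp -a layouts but may not hold everywhere.

```python
import re

# Six hex pairs separated by ':' (Linux) or '-' (Windows).
MAC_RE = re.compile(r"\b([0-9A-Fa-f]{2}(?:[:-][0-9A-Fa-f]{2}){5})\b")


def mac_for_ip(arp_output, ip):
    """Return the MAC listed for ip in ARP table text, or None if absent."""
    for line in arp_output.splitlines():
        # Split into fields so that 10.1.1.1 does not match 10.1.1.10;
        # parentheses are stripped because Linux prints "(10.1.1.1)".
        fields = line.replace("(", " ").replace(")", " ").split()
        if ip not in fields:
            continue
        m = MAC_RE.search(line)
        if m:
            return m.group(1).lower().replace("-", ":")
    return None


# Made-up table in the Windows "arp -a" layout.
sample_table = """\
  10.1.1.10             00-aa-bb-cc-dd-10     dynamic
  10.1.1.1              00-AA-BB-CC-DD-99     dynamic
"""
EXPECTED_GATEWAY_MAC = "00:aa:bb:cc:dd:01"  # hypothetical known-good value

found = mac_for_ip(sample_table, "10.1.1.1")
suspicious = found is not None and found != EXPECTED_GATEWAY_MAC
print(found, "suspicious" if suspicious else "ok")
```

A mismatch gives you the MAC to look up in the switches' MAC address tables in order to find the port it is coming from.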