We are having an issue with our Catalyst 6500 switch where we suspect that the ARP cache is being corrupted. This presents itself with the following symptoms:
When you attempt to ping a system which has not been resolved before, the first ping times out, and each subsequent one succeeds:

    Pinging foo.network.com [xxx.xx.xx.xx] with 32 bytes of data:
    Request timed out.
    Reply from xxx.xx.xx.xx: bytes=32 time=5ms TTL=55
    Reply from xxx.xx.xx.xx: bytes=32 time=3ms TTL=55
    Reply from xxx.xx.xx.xx: bytes=32 time=3ms TTL=55
When the corruption issue occurs, every other ping times out:

    Pinging foo.network.com [xxx.xx.xx.xx] with 32 bytes of data:
    Reply from xxx.xx.xx.xx: bytes=32 time=5ms TTL=55
    Request timed out.
    Reply from xxx.xx.xx.xx: bytes=32 time=5ms TTL=55
    Request timed out.
Clearing the ARP cache temporarily resolves the issue. To clear the ARP cache we use the commands:

    clear arp cache
    clear ip cache

This fixes it, but it is sure to happen again.
Details on the switch:
IOS (tm) s72033_rp Software (s72033_rp-PK9SV-M), Version 12.2(17d)SXB8, RELEASE SOFTWARE (fc2)
cisco WS-C6509-E (R7000) processor (revision 1.1)
Any help appreciated, Thanks
CLARIFICATION: We have the network that we manage, and then we are plugged into the corporate network. All requests to machines inside of the network that we manage work fine. We are only having problems with machines on the other network.
I would suggest you open a case with Cisco.
They will be able to check for known bugs in your IOS version and will ask for configuration details that you may not want to publish here. (If you want, though, you can post the output of a sh tech somewhere; it could help us.)
Also, does it happen right after a reboot, or did the cache only start to get corrupted after a long uptime?
You're seeing this problem with PINGs from the switch's CLI, or from a PC connected to the switch?
Is this switch providing layer 3 (routing) functions?
Are the PINGs you're showing having problems between two devices on the same subnet, or across subnets?
Does the log on the switch ("show log hist", I believe) show anything amiss?
Is the issue affecting packet delivery to only a couple of devices, or are you seeing it affect a number of devices?
I had a similar issue to this at a Customer site a few years back. I captured the output of a "show mac-" before the issue occurred and again while it was occurring, then compared the two, looking for devices that appeared to be on different ports before the outage started and after.
I found that there was an embedded device on the LAN (a clock, in this case) that would periodically transmit a batch of frames with a "spoofed" source address, confusing the switch's bridging table and causing the switch to send frames out the wrong port for a while. I was able to see it in the "show mac-" output by noticing that devices that should not have been changing ports appeared to be doing so.
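For anyone wanting to automate that before/during comparison, here is a rough Python sketch. It assumes the two MAC table captures were saved as plain text, that each relevant line contains a MAC address in Cisco's dotted form (xxxx.xxxx.xxxx), and that the port name is the last field on the line. The function names and the sample data are made up for illustration; adjust the parsing to whatever your switch actually prints.

```python
import re

# Cisco-style dotted MAC address, e.g. 0011.2233.4455
MAC_RE = re.compile(r"\b[0-9a-fA-F]{4}\.[0-9a-fA-F]{4}\.[0-9a-fA-F]{4}\b")


def parse_mac_table(text):
    """Map each MAC address to the set of ports it was seen on.

    Assumption: every relevant line contains one dotted MAC address and
    ends with the port name. Header and separator lines are skipped
    because they contain no MAC address.
    """
    table = {}
    for line in text.splitlines():
        match = MAC_RE.search(line)
        if not match:
            continue
        port = line.split()[-1]
        table.setdefault(match.group(0).lower(), set()).add(port)
    return table


def moved_macs(before_text, during_text):
    """Return {mac: (ports_before, ports_during)} for MACs whose ports differ."""
    before = parse_mac_table(before_text)
    during = parse_mac_table(during_text)
    moved = {}
    for mac in before.keys() & during.keys():
        if before[mac] != during[mac]:
            moved[mac] = (sorted(before[mac]), sorted(during[mac]))
    return moved


if __name__ == "__main__":
    # Made-up sample data, not output from the switch in question.
    before = "*  10  0011.2233.4455  dynamic  Gi1/1\n*  10  00aa.bbcc.ddee  dynamic  Gi1/2\n"
    during = "*  10  0011.2233.4455  dynamic  Gi2/7\n*  10  00aa.bbcc.ddee  dynamic  Gi1/2\n"
    for mac, (old, new) in moved_macs(before, during).items():
        print(f"{mac}: {old} -> {new}")
```

Only MACs present in both captures are compared, so a device that simply aged out of the table is not reported as having moved.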
Sounds like fun to troubleshoot! Wish I were there... >smile<
Edit:
Thanks for the comments.
"show log hist" shows a persistent log. As long as you're not clearing the log, any messages reported there will still be there after you clear the arp cache on the switch.
Is there any other router between your 6509 and the corporate datacenter where the problem-devices live?
Are you using any dynamic routing protocols?
Here's what my gut says:
I'm going to strongly recommend that you save a copy of "show mac-" and "show arp" before a failure occurs and again when a failure is occurring (it should only take a moment to capture them with something like PuTTY, so you can get on with clearing the arp cache quickly).
I realize you can't easily post these captures here, but I'd recommend that you throw them into a spreadsheet or database and match up MAC address against ports in one report, and MAC addresses against IP address in another. If you compare "before" and "during", I predict you're going to see some differences.
Assuming there's a router between your 6509 and the corporate data center, I predict that you're going to find that router's MAC address to be "moving" between ports, or its IP address moving between MAC addresses.
If there's no router and the corporate data-center machines are talking to this 6509 at layer 2 I'll predict that the devices themselves might show some "moving" between ports, or moving IP addresses between MAC addresses.
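If a spreadsheet feels like too much work, the IP-to-MAC half of that comparison can be sketched in a few lines of Python. This assumes the "show arp" captures were saved as text and that entries look like the usual IOS layout (a line starting with "Internet", then the IP address, the age, and a dotted MAC address). The function names and sample addresses below are hypothetical.

```python
import re

# One "show arp" entry, assuming the common IOS column layout:
# Internet  <ip>  <age or ->  <xxxx.xxxx.xxxx>  ARPA  <interface>
ARP_LINE = re.compile(
    r"^Internet\s+(?P<ip>\d+\.\d+\.\d+\.\d+)\s+\S+\s+"
    r"(?P<mac>[0-9a-fA-F]{4}\.[0-9a-fA-F]{4}\.[0-9a-fA-F]{4})\b"
)


def parse_arp(text):
    """Return {ip: mac} for every complete entry in a saved capture."""
    entries = {}
    for line in text.splitlines():
        match = ARP_LINE.match(line.strip())
        if match:  # headers and "Incomplete" entries do not match
            entries[match.group("ip")] = match.group("mac").lower()
    return entries


def changed_bindings(before_text, during_text):
    """Return {ip: (mac_before, mac_during)} for IPs whose MAC changed."""
    before = parse_arp(before_text)
    during = parse_arp(during_text)
    return {
        ip: (before[ip], during[ip])
        for ip in sorted(before.keys() & during.keys())
        if before[ip] != during[ip]
    }


if __name__ == "__main__":
    # Made-up sample captures for illustration only.
    before = "Internet  10.1.1.1   5   0000.0c07.ac01  ARPA   Vlan10\n"
    during = "Internet  10.1.1.1   0   0011.2233.4455  ARPA   Vlan10\n"
    for ip, (old, new) in changed_bindings(before, during).items():
        print(f"{ip}: {old} -> {new}")
```

If the next-hop router's IP address shows up in the result, that would fit the "moving" prediction above.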
If you run a sniffer on the client being pinged, do you see all of the pings or only half of them?
What happens if you source the pings from different interfaces on the 6500? Does it happen for hosts that the 6500 is the default gateway for?
What does the mac address table look like? How about a traceroute? And a 'ping -r9 '?
Don't rule out an IOS bug, but it could also be a lot of other things...
It could be a case of ARP spoofing. If someone is trying to spoof all the addresses on the network, including the gateway, the spoofing machine will receive too much traffic and may not be able to forward all of it to the correct addresses after reading it. Or the spoofing machine may be intentionally dropping some of the packets.
Run Wireshark. Then use "arp -d" to delete the ARP entries for all IP addresses on your subnet, and try to ping a few IPs on your subnet. Then stop the capture in Wireshark and analyze just the ARP traffic. If you see multiple ARP responses for each IP you pinged, such as "172.16.1.1 is at xx:xx:xx:xx:xx:xx" followed by "172.16.1.1 is at yy:yy:yy:yy:yy:yy", then it is definitely a case of ARP spoofing and there is nothing wrong with the switch.
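If the capture is large, spotting the duplicates by eye gets tedious. Here is a small Python sketch of the check, assuming you have exported the ARP replies from the capture as CSV text with two columns, sender IP and sender MAC, and no header row. That export layout and the function name are my own assumptions, not something Wireshark produces by default.

```python
import csv
import io
from collections import defaultdict


def conflicting_arp_replies(csv_text):
    """Return {ip: sorted MACs} for IPs that were answered by more than one MAC.

    Assumption: each CSV row is "sender_ip,sender_mac" taken from an ARP reply.
    """
    seen = defaultdict(set)
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        ip = row[0].strip()
        mac = row[1].strip().lower()
        seen[ip].add(mac)
    return {ip: sorted(macs) for ip, macs in seen.items() if len(macs) > 1}


if __name__ == "__main__":
    # Made-up replies for illustration only.
    sample = (
        "172.16.1.1,00:11:22:33:44:55\n"
        "172.16.1.1,66:77:88:99:aa:bb\n"
        "172.16.1.20,00:de:ad:be:ef:01\n"
    )
    for ip, macs in conflicting_arp_replies(sample).items():
        print(f"{ip} answered by {', '.join(macs)}")
```

MAC addresses are lowercased before comparison so that the same address written in different cases is not counted twice.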
If this does not work, try upgrading the switch's IOS to the latest version.
I have to agree with Peter and Evan. This sounds more like a bouncing route/port than a cache attack. Especially on a 65xx. To amplify Evan's comment, be sure to get the (working) arp table, but the only entry you'll really need is the next-hop router. Have you ruled out multi-path problems? I saw someone ask if you were running a dynamic routing protocol (or multiple gateways w/floating static routes) but I haven't seen your answer. Good luck!