I work as a Network Admin in a big company. Lately we noticed an issue with our network infrastructure. Our network core is a Catalyst acting as the main L3 switch, with a few Cisco Nexus switches as edge L2 switches connected to that Catalyst.
The issue appears when we sniff traffic on one of our hosts: we always see unicast traffic between other hosts.
To elaborate: assuming I'm on the host 10.0.0.1, with MAC address MAC, I run the command -
tcpdump -i eth0 ether host not MAC and host not 10.0.0.1 and not broadcast and not multicast
I will always see traffic between other hosts.
I read a Cisco article about Unicast Flooding. However, the "phenomenon" occurs not only when passing between VLANs in our network, but also within the very same VLAN. Is it possible that it happens when passing between switches in the same VLAN (our VLANs span many switches)? All switches are connected by a trunk to the Catalyst...
Any ideas?
Thanks.
Edit:
It seems that we found the source of our problems.
Basically, each time one of the switches gets a frame with a destination MAC address it doesn't recognize, it floods the frame to all ports. This is normal, and the way things should go. However, in our current settings, a MAC entry in the switch should "live" for 30 minutes. If a MAC was not seen for 30 minutes, it will be deleted from the switch until seen again. If a packet is sent to that MAC and it's not in the table, the packet is flooded to all ports in order to reach the destination (we expect to get an answer from one of the ports, which lets the switch learn where the MAC lives).
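To make the learn / age out / flood cycle described above concrete, here is a minimal toy model in Python. It is only a sketch, not how any Cisco platform is implemented: the class name `LearningSwitch`, the port numbers and the short MAC strings are all made up for illustration, and the 1800 second default simply mirrors the 30 minute setting mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class LearningSwitch:
    """Toy model of an L2 switch MAC table with aging (hypothetical, not vendor behaviour)."""
    ports: list
    aging_time: float = 1800.0  # 30 minutes, as in the setup described above
    table: dict = field(default_factory=dict)  # mac -> (port, last_seen)

    def receive(self, now, in_port, src_mac, dst_mac):
        """Process one frame and return the list of ports it is sent out of."""
        # Learn (or refresh) the source MAC on the ingress port.
        self.table[src_mac] = (in_port, now)
        # Drop every entry that was not refreshed within aging_time.
        self.table = {
            mac: (port, seen)
            for mac, (port, seen) in self.table.items()
            if now - seen < self.aging_time
        }
        entry = self.table.get(dst_mac)
        if entry is None:
            # Unknown unicast: flood to every port except the ingress port.
            return [p for p in self.ports if p != in_port]
        # Known destination: forward out of the learned port only.
        return [entry[0]]


sw = LearningSwitch(ports=[1, 2, 3, 4])
first = sw.receive(0, 1, "aa", "bb")      # "bb" unknown: flooded
reply = sw.receive(1, 2, "bb", "aa")      # "aa" known: forwarded to port 1
known = sw.receive(2, 1, "aa", "bb")      # "bb" now known: forwarded to port 2
stale = sw.receive(2000, 1, "aa", "bb")   # "bb" silent for > 1800 s: flooded again
```

In this model the only thing that stops the flooding is a frame sourced from the forgotten MAC, which matches what we saw when ARPing the address.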
We found one of the destination MACs and looked for it in the switch MAC table. The table didn't contain the MAC while the network was flooded. We tried ARPing the address related to that MAC - and the flood stopped (as the MAC re-appeared in the MAC table).
However, after a few seconds, the MAC disappeared from the MAC table again and the flood started again.
It seems that the flood issue derives from an issue with the MAC tables on our switches. It seems as if they "forget" MAC addresses more quickly than they should (MACs should stay for 30 minutes) and flood all packets destined to that MAC.
A quick primer-
ARP table - An L3 device (router, host, etc) maintains a mapping between a given IP address and a corresponding MAC address.
CAM table - This may be known by other names in particular switch platforms, but the upshot is that a given L2 switching device maintains a mapping between a given hardware address and one or more physical switch ports.
What's happening in the case above is called unicast flooding. This is a condition where the router still has a live ARP entry even though the switch's CAM table has flushed the corresponding entry. As a result, when the router receives a packet for a given host it is simply forwarded to the switch without first sending an ARP request (the IP : MAC mapping is still cached). The switch, however, no longer knows the port to which this MAC address is mapped (this entry having been aged out earlier). If the switch doesn't have a CAM entry for a given unicast MAC then it will flood packets for that MAC to all ports until it sees a frame sourced from that MAC (e.g. the reply to an ARP request) and can relearn the port.
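The timing mismatch described above can be sketched as a small simulation. This is a simplified model under stated assumptions: the function name `simulate` is made up, the router sends one packet per `interval` seconds to a host that never transmits on its own, and the host's ARP reply is the only thing that refreshes both the router's ARP cache and the switch's CAM entry.

```python
def simulate(arp_timeout, cam_aging, duration, interval=1):
    """Return the times (in seconds) at which the switch floods a packet for a silent host."""
    arp_learned = 0  # router last resolved the host's MAC at t=0
    cam_learned = 0  # switch last saw a frame from the host at t=0
    flooded = []
    for t in range(0, duration, interval):
        if t - arp_learned >= arp_timeout:
            # ARP entry expired: the router re-ARPs, the host replies,
            # and that reply refreshes both the ARP cache and the CAM entry.
            arp_learned = t
            cam_learned = t
        if t - cam_learned >= cam_aging:
            # Router still has the MAC cached, but the switch forgot the port.
            flooded.append(t)
    return flooded


# CAM ages out long before ARP: most packets are flooded.
mismatched = simulate(arp_timeout=20, cam_aging=5, duration=40)
# CAM outlives ARP: the re-ARP always arrives first, so nothing is flooded.
aligned = simulate(arp_timeout=20, cam_aging=25, duration=40)
```

With the CAM timer shorter than the ARP timer, every packet sent between the CAM expiry and the next re-ARP is flooded. With the CAM timer slightly longer, the list of flooded packets stays empty.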
For obscure reasons ARP and CAM timers are generally quite different on Cisco switches. The values vary somewhat but the mismatch continues through the most modern Nexus devices. Best practice is to set the ARP and CAM timers to similar values - ideally with the CAM table set to 5 seconds or so longer than ARP. It's better for the router to re-ARP than for the switch to have to flood. Setting both values to ~600 seconds (10 minutes) generally isn't too bad, but some environments might want to go a bit longer if excessive ARP traffic is seen on the router.
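As a rough worked example of the arithmetic behind that advice, the helpers below compute how long a silent host's traffic can be flooded per ARP cycle, and a CAM value a few seconds above the ARP timeout. The function names and the 5 second margin are just illustrative. The 14400 / 300 pair is a value pair often quoted as older IOS defaults; treat it as an assumption and check your own platform.

```python
def flood_window(arp_timeout, cam_aging):
    """Seconds per ARP cycle during which the router has the MAC but the switch does not."""
    return max(0, arp_timeout - cam_aging)


def recommended_cam(arp_timeout, margin=5):
    """CAM aging time slightly longer than the ARP timeout, so the re-ARP happens first."""
    return arp_timeout + margin


# Illustrative timers only (assumed, not read from any device).
worst_case = flood_window(arp_timeout=14400, cam_aging=300)
tuned_cam = recommended_cam(600)
tuned_window = flood_window(arp_timeout=600, cam_aging=tuned_cam)
```

The window only matters for hosts that receive traffic without sending any, since every frame a host sends refreshes its CAM entry.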
Yes, ARP requests are broadcast (fire up Wireshark and you will see "Who has blah blah blah? Tell blah.").
The client should be responding back with "It's me!" So, I would investigate the client that is failing to respond.