I have the below, simplified configuration:
Essentially, I have an ESXi host with two physical network adapters. Each adapter plugs into a different switch. The two switches are connected to each other via a trunk. A PC is connected to one of the switches. A vSwitch with a VMKernel port and VM ports is configured to use both physical NICs in an Active/Active configuration:
I have run esxtop and can see that the ESXi host has chosen the physical NIC connected to Switch 2 for the VMKernel port. From the PC, if I ping the management IP address of the ESXi host, the pings are intermittent: they succeed for a while, then fail, then succeed again.
If I display the MAC address table on each switch, I see that Switch 2 always has the VMKernel's MAC address assigned to the switch port connected to the ESXi host. Switch 1, however, continually adds and removes the VMKernel's MAC address on its own ESXi-facing port. Whenever Switch 1 has the VMKernel's MAC assigned to that port, the pings fail.
The reason for the failure is obvious. The reason why Switch 1 is routinely picking up the MAC address of the ESXi VMKernel port is the question. The ESXi host has chosen the interface connected to Switch 2 to be the active port. The interface connected to Switch 1 should be inactive. But, it would appear that it is possibly responding to ARP requests?
It's worth noting that none of the VMs on this host have this problem. They are all reachable and are present in only one MAC table at a time. This problem specifically affects the VMKernel port.
What about this configuration is wrong? I am looking for some type of documentation or explanation on top of a solution to this issue. I know that setting the VMKernel port to be Active/Standby mode will probably solve the issue. But, I can't find anything documented why this current configuration is a problem.
UPDATES:
- I disabled CDP on the vSwitch thinking that it might be causing communication over the inactive NIC.
- I overrode the vSwitch settings for the VMKernel port and set it to use explicit failover and Active/Standby. I also placed the standby NIC in the unused pool. None of it helped. What did solve the issue was changing the port order around. So, when the port connected to Switch 1 becomes active, I do not see the issue. The MAC address does not become active on Switch 2 at all. These are two significantly different NICs, and I'm wondering whether this is some kind of driver issue.
Something has to be causing the VMKernel MAC address to be seen on Switch 1's port, but it comes and goes every several seconds.
Switch configs for STP and ports:
Switch 1
!
spanning-tree mode rapid-pvst
spanning-tree portfast edge default
spanning-tree extend system-id
!
interface Port-channel1
switchport access vlan 11
switchport trunk encapsulation dot1q
switchport mode trunk
!
interface GigabitEthernet1/0/7
switchport access vlan 11
switchport mode access
!
interface GigabitEthernet1/0/23
switchport access vlan 11
switchport trunk encapsulation dot1q
switchport mode trunk
channel-group 1 mode desirable
!
interface GigabitEthernet1/0/24
switchport access vlan 11
switchport trunk encapsulation dot1q
switchport mode trunk
channel-group 1 mode desirable
Switch 2
!
spanning-tree mode rapid-pvst
spanning-tree portfast edge default
spanning-tree extend system-id
!
interface Port-channel1
switchport access vlan 11
switchport trunk encapsulation dot1q
switchport mode trunk
!
interface GigabitEthernet1/0/3
switchport access vlan 11
switchport mode access
!
interface GigabitEthernet1/0/23
switchport access vlan 11
switchport trunk encapsulation dot1q
switchport mode trunk
channel-group 1 mode desirable
!
interface GigabitEthernet1/0/24
switchport access vlan 11
switchport trunk encapsulation dot1q
switchport mode trunk
channel-group 1 mode desirable
The management vmk in ESXi takes on the MAC address of the NIC in the first PCI slot during initial setup. It has always worked this way. This only breaks things when the physical device itself also starts sending frames, which normally does not happen: physical NICs do not originate traffic, they only pass it along. The same behavior needs attention if you move physical NICs from one host to another, because the duplicated MAC can bring down the connections of both hosts when the physical switch sees it in two places. My guess is that this NIC started sending CDP/LLDP traffic, and that is when your switch sees the duplicated MAC. The easiest solution is to rebuild the vmk from the command line. This has to be done from direct console access (DCUI, KVM, iLO, iDRAC, etc.), since removing vmk0 cuts the management connection.
Here are the commands (adjust the IP addresses, subnet mask, port group name, etc. to fit your needs):
esxcli network ip interface remove --interface-name=vmk0
esxcli network vswitch standard portgroup add -p Management_Network -v vSwitch0
esxcli network ip interface add --interface-name=vmk0 --portgroup-name=Management_Network
esxcli network vswitch standard portgroup set -p Management_Network --vlan-id 50
esxcli network ip interface ipv4 set --interface-name=vmk0 --ipv4=192.168.50.116 --netmask=255.255.255.0 --gateway=192.168.50.1 --type=static
esxcli network ip interface tag add -i vmk0 -t Management
This rebuilds the management vmk with a VMware-generated MAC address, which eliminates the duplication. However, I would recommend contacting the hardware vendor for the procedure to disable CDP/LLDP on the physical card. The rebuild fixes this one ESXi host, but the same thing will happen on others if you allow the card(s) to keep doing this. This is not very common; if it were as big a problem as you originally thought, VMware would not be the giant company it is.
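Before (or after) rebuilding, it may help to confirm whether vmk0 really shares its MAC with a physical NIC. The sketch below is only an illustration: `normalize_mac` and `mac_in_list` are hypothetical helper names, not esxcli features, and the MAC values in the usage comment are placeholders. It assumes you read the vmk0 MAC from `esxcli network ip interface list` and the physical NIC MACs from `esxcli network nic list` and pass them in by hand.

```shell
#!/bin/sh
# Hypothetical helpers (not part of esxcli): check whether the vmk0 MAC
# is identical to the MAC of any physical NIC on the host.

# Lower-case a MAC address and turn '-' separators into ':'.
normalize_mac() {
    printf '%s\n' "$1" | tr 'ABCDEF-' 'abcdef:'
}

# Exit 0 if the first argument matches any of the remaining MACs, else 1.
mac_in_list() {
    needle=$(normalize_mac "$1")
    shift
    for candidate in "$@"; do
        [ "$needle" = "$(normalize_mac "$candidate")" ] && return 0
    done
    return 1
}

# Usage (placeholder values; substitute the MACs reported on your host):
#   if mac_in_list "AA:BB:CC:00:11:22" "aa:bb:cc:00:11:22" "aa:bb:cc:00:11:23"; then
#       echo "vmk0 shares a MAC with a physical NIC"
#   fi
```

If the check reports a match before the rebuild and no match afterwards, that would be consistent with the explanation above, though it does not by itself prove that the physical card is the source of the stray frames.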
I have run an extremely similar setup for many years without any issues.
How have you configured the switch ports? You shouldn't do anything special (no (M)LAG/LACP), since ESXi takes care of everything. It's fine to stack the switches; just don't aggregate the ESXi-facing ports or configure link-state mirroring or anything similar.
Switch2 should have the VMkernel port's MAC on the ESXi-facing port and switch1 on the switch2-facing port, permanently.
The MAC flapping back and forth might be caused by another issue, such as frequent STP topology changes (which usually aren't visible to the ESXi host, though they can be). Check the switches' logs for any anomalies.
Without any LAG that could only happen if the host actually sent frames with the VMK port's MAC to switch1. It won't normally do that unless the link to switch2 failed.
For the VMK port, yes. There may be VM traffic attached to the same port group.
ARP or not, frames with the VMK port MAC do not originate from the other port without reason.
The switch port config that you posted shows that you are using a port channel on the Catalyst switches.
Just don't do that! With standalone ESXi hosts this is not supported. ESXi takes care of load balancing and failover on its own, entirely in software. If you absolutely want to use external switch-based port channels, you need vCenter and a distributed switch.
See https://kb.vmware.com/s/article/82609 and https://kb.vmware.com/s/article/1001938 for further details.