We use LACP (mode 4) bonds extensively in our environment, and I occasionally run into problems with new deployments where cables get crossed or switch ports are misconfigured, causing bad LACP port states.
One thing I've been using to troubleshoot is the value of the partner oper key. These generally match, and when they don't, I suspect a crossed cable. I've been trying to research it, but have had a hard time finding a definitive answer. So, is it reasonable to expect every port in a LACP channel group to report the same partner oper key, or are there cases where the keys might differ in a correctly configured group?
For example:
# grep -A6 "partner lacp pdu" /proc/net/bonding/bond0
details partner lacp pdu:
system priority: 32768
system mac address: 70:e4:23:92:42:b7
oper key: 205
port priority: 32768
port number: 92
port state: 61
--
details partner lacp pdu:
system priority: 32768
system mac address: 70:e4:23:92:42:b7
oper key: 206
port priority: 32768
port number: 94
port state: 13
In this example, I know the state of the 2nd partner is bad. I'm just trying to come up with a good way of determining why it's bad.
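The two port state values above are bit fields, so decoding them shows what differs between the good and the bad partner. This is a minimal sketch, assuming the standard LACP state bit layout from IEEE 802.3ad (activity, timeout, aggregation, synchronization, collecting, distributing, defaulted, expired, from the low bit up); the function name is only for illustration.

```python
# Bit layout of the LACP actor/partner state byte (IEEE 802.3ad), low bit first.
LACP_STATE_BITS = [
    (0x01, "activity"),
    (0x02, "timeout"),
    (0x04, "aggregation"),
    (0x08, "synchronization"),
    (0x10, "collecting"),
    (0x20, "distributing"),
    (0x40, "defaulted"),
    (0x80, "expired"),
]


def decode_port_state(state):
    """Return the names of the flags set in a LACP port state value."""
    return [name for bit, name in LACP_STATE_BITS if state & bit]


if __name__ == "__main__":
    # The two partner port states from the example output.
    for value in (61, 13):
        print(value, decode_port_state(value))
```

Decoded this way, 61 has the collecting and distributing flags set, while 13 stops at synchronization. In other words, the partner on the second link is not collecting or distributing on that port.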
I just logged into 400 servers, all using LACP mode 4. Two interfaces, 25G up/down each, for 50G total. Two Cisco 9600s, with LACP mode 4 set in a port channel to combine both ports. One cable goes to a different switch so that we have power, switch, cable, rack and interface redundancy.
The oper key is the same across the board.
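If you want to run the same check in bulk, something like the following could flag bonds whose members report different partner oper keys. It is only a sketch: it assumes the text layout shown in the question's /proc/net/bonding/bond0 output, and the function names are made up for the example. As far as I understand LACP, links are only aggregated together when the partner reports the same system and key, so a mismatch usually means the switch ports are not in the same port channel.

```python
import re


def partner_oper_keys(bonding_text):
    """Return the oper key from each 'details partner lacp pdu' section, in order."""
    keys = []
    in_partner = False
    for line in bonding_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("details partner lacp pdu"):
            in_partner = True
        elif stripped.startswith(("details actor lacp pdu", "Slave Interface")):
            in_partner = False
        elif in_partner:
            match = re.match(r"oper key:\s*(\d+)", stripped)
            if match:
                keys.append(int(match.group(1)))
                in_partner = False  # one key per partner section
    return keys


def keys_agree(keys):
    """True when every member reports the same partner oper key."""
    return len(set(keys)) <= 1


if __name__ == "__main__":
    # Trimmed copy of the output from the question.
    sample = """\
details partner lacp pdu:
oper key: 205
port number: 92
--
details partner lacp pdu:
oper key: 206
port number: 94
"""
    keys = partner_oper_keys(sample)
    print(keys, "OK" if keys_agree(keys) else "MISMATCH")
```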
I am including a working bond below.
Several things come to mind for your question.
This could be addressed by using a standard cabling practice. All of our cables that run down the left side of the rack plug into the left side of the switch (or in this case, one rack over), and all the cables that run down the right side go to the right side of the switch. So server 1 has a cable to port 1 and one to port 48. This gives you a standard model to count from. Server 5 would be port 43 and port 5. Easy to track, easy to communicate.
Another thought: we use MAC addresses to track down LACP members. I can use radssh + racadm (out-of-band access) or radssh (over ssh) to bulk log into all my servers and pull the list of MAC addresses for the actual member interfaces (not bond0). Hand that completed list to the network team so they can compare their list of port channel members against it.
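For pulling the member MAC addresses, a rough sketch like this could be run on each server (for example through whatever bulk ssh tool you already use). It assumes the "Slave Interface" and "Permanent HW addr" fields that the Linux bonding driver prints in /proc/net/bonding/bond0; the interface names and MAC addresses in the sample are made up.

```python
def member_macs(bonding_text):
    """Map each slave interface name to its permanent hardware address."""
    macs = {}
    current = None
    for line in bonding_text.splitlines():
        # Split on the first colon only, so the MAC address stays intact.
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue
        value = value.strip()
        if key == "Slave Interface":
            current = value
        elif key == "Permanent HW addr" and current:
            macs[current] = value.lower()
    return macs


if __name__ == "__main__":
    # Hypothetical excerpt; on a real host read /proc/net/bonding/bond0 instead.
    sample = """\
Slave Interface: eth0
MII Status: up
Permanent HW addr: 00:11:22:33:44:55

Slave Interface: eth1
MII Status: up
Permanent HW addr: 00:11:22:33:44:56
"""
    for iface, mac in member_macs(sample).items():
        print(iface, mac)
```

The resulting interface-to-MAC list is what you would hand to the network team to match against the MAC addresses they see on each switch port.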