I have 10 servers with two CPUs each and one Mellanox 100G InfiniBand NIC per CPU. All NICs are connected to one Mellanox 36-port 100G IB switch.
My RDMA application runs as one process per NUMA node and binds to the local NIC to avoid cross-CPU traffic. Each node/process needs to connect to every other node using RC mode.
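For reference, here is a minimal sketch of how a process could find its NUMA-local NIC. It assumes the Linux sysfs layout where each RDMA device exposes `device/numa_node` under `/sys/class/infiniband`. The function name is made up for illustration, and device names such as `mlx5_0` are only examples.

```python
import os


def ib_devices_by_numa_node(sysfs_root="/sys/class/infiniband"):
    """Return {numa_node: [device names]} for RDMA devices under sysfs_root.

    Assumption: each device directory contains device/numa_node, as on
    typical Linux systems. Devices without a readable value are skipped.
    """
    by_node = {}
    if not os.path.isdir(sysfs_root):
        return by_node  # no RDMA devices visible on this machine
    for dev in sorted(os.listdir(sysfs_root)):
        numa_file = os.path.join(sysfs_root, dev, "device", "numa_node")
        try:
            with open(numa_file) as f:
                node = int(f.read().strip())
        except (OSError, ValueError):
            continue  # file missing or not an integer
        by_node.setdefault(node, []).append(dev)
    return by_node
```

Each process would then look up its own NUMA node in the returned mapping and open only the device listed there. Note that the kernel can report -1 when the NUMA node is unknown, so that key may appear on some systems.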
The problem I ran into is that the default OpenSM routing appears to force me to use a particular NIC to reach a particular target node. So I would have to use both NICs from both NUMA nodes to reach all other nodes. That would also mean two PDs, and registering all the memory twice.
Is there any way to allow a single NIC to be able to connect to any other NIC/port on the network?
Essentially, I would like to make OpenSM think that each NIC is in its own server, i.e. pretend that no QPI traffic is possible.
See: https://docs.mellanox.com/display/MLNXOFEDv461000/OpenSM
Once MinHop matrices exist, each switch is visited and for each target LID a decision is made as to what port should be used to get to that LID.
Relevant code: https://github.com/linux-rdma/opensm/blob/844ab3b7edaad983449b5d3a4a773088b8daa299/opensm/osm_ucast_mgr.c#L201
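To make the quoted step concrete, here is a simplified sketch of the per-switch decision. It is not the OpenSM implementation. It assumes a hypothetical input `min_hops[lid][port]` giving the hop count from this switch to `lid` through `port`, and it breaks ties by picking the port with the fewest LIDs already assigned to it.

```python
def build_forwarding_table(min_hops):
    """Pick one egress port per target LID for a single switch.

    min_hops: {lid: {port: hop_count}} (hypothetical input format).
    Returns {lid: port}, choosing a port with the minimum hop count and,
    among equals, the one with the fewest LIDs already routed through it.
    """
    load = {}  # port -> number of LIDs assigned so far
    table = {}
    for lid in sorted(min_hops):
        hops = min_hops[lid]
        best = min(hops.values())
        candidates = [p for p, h in hops.items() if h == best]
        # Least-loaded candidate first; lowest port number breaks remaining ties.
        port = min(candidates, key=lambda p: (load.get(p, 0), p))
        table[lid] = port
        load[port] = load.get(port, 0) + 1
    return table
```

With a single switch and every NIC attached directly to it, each target LID should have only one candidate port (the one its NIC is plugged into), so this step leaves nothing to choose between.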
https://community.mellanox.com/s/question/0D51T00006RVtlU/rdmacm-connection-setup-issues
Running
sudo ibacm
on all servers solved the issue. Don't ask me why...