I'm trying to compare latencies of different node interconnects for a cluster. The goal is to minimize the memory access latency.
I have obtained some benchmarks for one hardware implementation of a NUMA architecture with many CPUs. They indicate that:
- The latency of accessing memory attached directly to the CPU's own socket is about 90ns.
- The latency of accessing memory attached to another CPU's socket, connected to the first socket by UPI, is about 140ns (so one "hop" of UPI adds about 50ns).
- The latency of accessing memory via the NUMA interconnect in question is 370ns (so one "hop" of this interconnect adds about 280ns).
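The per-hop figures in the list above are simple differences against the local latency. A minimal sketch of that arithmetic, where the constant and function names are my own placeholders and the values are the benchmark numbers quoted above:

```python
# Benchmark latencies quoted above, in nanoseconds.
LOCAL_NS = 90                # memory attached to the CPU's own socket
UPI_REMOTE_NS = 140          # memory on another socket, reached over UPI
NUMA_INTERCONNECT_NS = 370   # memory reached over the NUMA interconnect


def hop_overhead_ns(remote_ns: int, local_ns: int = LOCAL_NS) -> int:
    """Extra latency attributed to one hop, relative to a local access."""
    return remote_ns - local_ns


def slowdown(remote_ns: int, local_ns: int = LOCAL_NS) -> float:
    """How many times slower a remote access is than a local one."""
    return remote_ns / local_ns


if __name__ == "__main__":
    for name, ns in [("UPI", UPI_REMOTE_NS),
                     ("NUMA interconnect", NUMA_INTERCONNECT_NS)]:
        print(f"{name}: +{hop_overhead_ns(ns)} ns per hop, "
              f"{slowdown(ns):.2f}x local latency")
```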
NUMA interconnects are quite specialized solutions that cannot be used with hardware from the majority of vendors. The "standard" interconnects are InfiniBand, Ethernet and Fibre Channel.
I'm looking for the latencies these interconnects provide for memory accesses.
For example, the specification of one EDR InfiniBand switch states that it offers "90ns port-to-port latency". If I understand correctly, port-to-port latency refers to the latency introduced by the switch itself. To this we should add the NIC latency, which is about 600ns per NIC (according to this), so this gives about 90 + 2 x 600 = 1290 ns of interconnect-related latency. (BTW the value 600ns seems suspiciously high compared to 90ns. Why is it so high?)
We should also expect some latency to be introduced by the cables (passive copper or optical fiber). I guess it depends on their length, but I'm not sure what order of magnitude it is. Light travels 1 meter in around 3ns; is that a good estimate?
The missing part is the time the NIC needs to access memory. I guess we should consider separate cases for RDMA and for access via the CPU. Am I missing something else? Is my reasoning above correct?
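To make the estimate so far concrete, here is a rough sketch of the budget as I currently understand it. The function names and the `velocity_factor` parameter are placeholders of my own, the 600 ns and 90 ns defaults are the figures quoted above, and the model covers only a one-way trip through the interconnect, without the time the remote side needs to access memory.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # in vacuum


def cable_delay_ns(length_m: float, velocity_factor: float = 1.0) -> float:
    """Propagation delay over a cable of the given length.

    velocity_factor is the signal speed as a fraction of the speed of
    light in vacuum. 1.0 is the vacuum lower bound; real copper and
    fiber are slower, and the exact value is an assumption to fill in.
    """
    if not 0 < velocity_factor <= 1:
        raise ValueError("velocity_factor must be in (0, 1]")
    return length_m / (SPEED_OF_LIGHT_M_PER_S * velocity_factor) * 1e9


def one_way_interconnect_ns(nic_ns: float = 600.0,
                            switch_ns: float = 90.0,
                            switches: int = 1,
                            total_cable_m: float = 0.0,
                            velocity_factor: float = 1.0) -> float:
    """Two NICs, plus each switch on the path, plus cable propagation."""
    return (2 * nic_ns
            + switches * switch_ns
            + cable_delay_ns(total_cable_m, velocity_factor))


if __name__ == "__main__":
    print(one_way_interconnect_ns())                  # no cable term
    print(one_way_interconnect_ns(total_cable_m=10))  # 10 m, vacuum speed
```

Leaving the cable length at zero reproduces the 90 + 2 x 600 figure, so the cable term can be judged separately once a realistic velocity factor is known.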
My main question is: what is the expected latency of accessing memory on a different node of a cluster using "standard" interconnects like InfiniBand, Ethernet or Fibre Channel?
The reason I'm asking is that I'm trying to decompose my problem described in Current single system image solutions into smaller sub-problems.
Your single-node numbers of 90 ns local vs 370 ns remote seem reasonable. However, I think the 600 ns for InfiniBand is supposed to be end to end, through a switch to a different frame.
600 ns for a remote datagram is very fast. Local memory access is usually on the order of 100 ns, and an access to a different socket on the same node might add 200 ns.
Single-image, multi-node computers access remote memory either by RDMA in software or through hardware interconnects in a NUMA system.
InfiniBand is one transport for RDMA. Circa 2014, Mellanox claimed 500 ns end to end for InfiniBand EDR. I'm guessing here, but their marketing could be mixing numbers: 600 ns typical end to end quoted on the NICs, plus 150 ns per extra switch on the path.
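If that reading is right, the end-to-end figure scales linearly with the number of extra switches. A small sketch under that assumption, where the function name is made up and the 600 ns and 150 ns defaults are the guessed figures above rather than vendor specifications:

```python
def end_to_end_ns(extra_switches: int = 0,
                  base_ns: int = 600,
                  per_extra_switch_ns: int = 150) -> int:
    """End-to-end latency: a base figure plus a fixed cost per extra switch."""
    if extra_switches < 0:
        raise ValueError("extra_switches must be non-negative")
    return base_ns + extra_switches * per_extra_switch_ns


if __name__ == "__main__":
    for extra in range(3):
        print(f"{extra} extra switch(es): {end_to_end_ns(extra)} ns")
```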
Alternatively, yes, NUMA interconnects for multi-node systems are a specialized thing, but they do exist. For x86, there was the SGI UV family, whose NUMAlink 7 interconnect claimed 500 ns remote node access. On the POWER platform, IBM can wire up nodes with NVLink, although I don't know the latency of that.
Regarding your selection of Ethernet or InfiniBand as a commodity transport, that likely limits you to RDMA-aware applications. NUMA hardware that supports transparent single-image systems tends to be custom.