HP's whitepaper on their QLogic (formerly Broadcom) NetXtreme II adapters, which covers the specific NIC I am testing, states (page 7) that their small-packet performance, for packets of up to 256 bytes, is above 5,000,000 packets/sec.
In my tests with an app in which I disabled all processing except the bare UDP receive, I can only reach up to 120,000 packets/sec. The packets are evenly distributed over 12 multicast groups.
I noticed that there is one core (out of the 12 cores on each of the 2 sockets) whose load gradually increases as I crank up the UDP send rate and maxes out at around 120,000 packets/sec. But I don't know what that core is doing and why. It is not a single-thread bottleneck in my app, because it makes no difference whether I run a single instance of the app for all multicast groups or 12 instances that handle 1 multicast group each. So the bottleneck is not my receiver app.
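To see what that core is actually doing, one quick check is whether its time goes into interrupt/DPC processing (kernel-side network receive work) rather than into any of my threads. A minimal PowerShell sketch using the standard per-processor counters (the 20% threshold and the sample count below are arbitrary):

    # Sample per-core DPC and interrupt time; a core saturated by NIC receive
    # processing shows up here rather than as CPU time in the receiver process.
    $counters = '\Processor(*)\% DPC Time', '\Processor(*)\% Interrupt Time'
    $samples  = Get-Counter -Counter $counters -SampleInterval 1 -MaxSamples 5
    $samples.CounterSamples |
        Where-Object { $_.CookedValue -gt 20 } |      # only show busy cores
        Sort-Object CookedValue -Descending |
        Format-Table InstanceName, Path, CookedValue -AutoSize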
MSI is enabled (verified via the "Resources by type" view in Device Manager), and RSS is also enabled in the NIC settings, with 8 queues. So what is clinging to that one core? All NIC offloading features are currently on, but turning them off did not help.
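For reference, the adapter-level RSS configuration that the device-manager GUI exposes can also be read back from PowerShell; a small sketch with the standard NetAdapter cmdlets ("Ethernet*" below is a placeholder for the actual adapter name):

    # Read back the adapter-level RSS configuration (enabled flag, queue count,
    # RSS profile, processor limit) without going through the device-manager GUI.
    Get-NetAdapterRss -Name "Ethernet*" |
        Format-List Name, Enabled, NumberOfReceiveQueues, Profile, MaxProcessors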
So where could the bottleneck be?
System details:
- ProLiant BL460c Gen9
- Intel Xeon E5-2670 v3 (2 x 12 cores)
- HP FlexFabric 10Gb 2-port 536FLB NIC
- Windows 2012 R2
RSS being enabled in the NIC's advanced settings unfortunately did not mean that RSS was actually being employed, as a closer check of the RSS state showed.
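On Windows Server 2012 R2, the OS-wide RSS switch can be inspected like this (illustrative; not necessarily the exact check that was run here):

    # The "Receive-Side Scaling State" line reports whether RSS is in effect
    # OS-wide, independently of the per-adapter advanced setting.
    netsh int tcp show global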
After flipping that setting to enabled (btw without rebooting), RSS started working, and the load that formerly heaped on that one poor core now gets evenly distributed over many cores on one of the 2 NUMA nodes.
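Both the OS-wide switch and the per-adapter setting can be turned on without a system reboot; a sketch with standard tools ("Ethernet*" is again a placeholder adapter name):

    # Turn on RSS at the OS level (no reboot required):
    netsh int tcp set global rss=enabled

    # (Re)enable RSS on the adapter itself, if the adapter keyword is off:
    Enable-NetAdapterRss -Name "Ethernet*"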
I haven't verified if that would allow me to handle the Mpps loads advertised, but the ceiling was lifted sufficiently to benchmark what I needed to.