I'm wondering about the fact that the PCIe architecture is hierarchical. Even if I have two PCIe x4 slots, I may not be able to fully utilize them, because both slots may connect through a single upstream node whose bandwidth is insufficient to handle 2× PCIe x4.
The background for my question is: I'm trying to drive eight PCIe 1 Gbit NIC interfaces. I have two cards with two ports each, and one card with four ports. I can reach maximum throughput on 4 NIC interfaces. After activating the 5th port, the performance drops slightly on all five interfaces, and it drops further with the 6th, 7th and 8th.
My main question is: how can I obtain the PCIe structure of a machine, "draw" it, see its nodes and connections, and deduce the weakest node in that tree?
Each PCIe (v1) lane should comfortably handle 2 × 1 GbE links.
At this point DDR2 can comfortably handle the data rate of 10 GbE.
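The arithmetic behind that claim is worth spelling out; a minimal sketch, using the nominal figure of ~250 MB/s per direction for a PCIe 1.0 lane (2.5 GT/s with 8b/10b encoding, before protocol overhead):

```python
# Back-of-the-envelope check: PCIe 1.0 payload bandwidth per lane vs. GbE.
PCIE_V1_LANE_MBPS = 250.0   # ~250 MB/s per lane per direction
GBE_MBPS = 125.0            # 1 Gbit/s = 125 MB/s

links_per_lane = PCIE_V1_LANE_MBPS / GBE_MBPS
print(links_per_lane)   # → 2.0: one v1 lane carries two saturated GbE links

# Eight GbE ports therefore need only ~4 lanes of raw bandwidth,
# but all eight also need a path through the chipset wide enough to carry it.
lanes_needed = 8 * GBE_MBPS / PCIE_V1_LANE_MBPS
print(lanes_needed)     # → 4.0
```

So raw lane bandwidth is unlikely to be the limit here; the shared uplink between the slots and the chipset is the more plausible choke point.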
In general the PCIe layout of a machine is dictated by the chipset, and if you can find a block diagram from Intel (or whoever makes your chipset) you should be able to work out where any bottlenecks are.
If you're running a Unix-like operating system, the probe messages at boot time will often enumerate everything on the PCI bus and give you an indication of how it is arranged. Linux systems have an `lspci` command that will do the same.

If you need more bandwidth on your bus, you may want to look into getting a Supermicro server with UIO slots. Using an AOC-UG-i4 would let you keep at least one 4-port gigabit Ethernet card off of the PCI bus. 2U hosts can have two UIO slots, plus I believe three additional PCI slots, which might well allow you to build a machine with 12 GigE ports working at full performance.
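Beyond `lspci -tv` (which draws the bus as a tree), the negotiated link width and speed of each device are also exposed in sysfs, so you can spot ports that trained below their slot's capability. A minimal sketch, assuming a Linux machine with sysfs mounted:

```python
# Enumerate PCI devices via Linux sysfs and print each PCIe device's
# negotiated link width/speed. Legacy PCI devices don't expose these files.
import os

PCI_DEVICES = "/sys/bus/pci/devices"

def pcie_links():
    links = []
    if not os.path.isdir(PCI_DEVICES):
        return links                      # not Linux, or sysfs not mounted
    for dev in sorted(os.listdir(PCI_DEVICES)):
        base = os.path.join(PCI_DEVICES, dev)
        try:
            with open(os.path.join(base, "current_link_width")) as w, \
                 open(os.path.join(base, "current_link_speed")) as s:
                links.append((dev, w.read().strip(), s.read().strip()))
        except OSError:
            continue                      # non-PCIe device or unreadable entry
    return links

if __name__ == "__main__":
    for dev, width, speed in pcie_links():
        print(f"{dev}: x{width} @ {speed}")
```

Comparing `current_link_width` against `max_link_width` (same directory) for each NIC tells you whether a card negotiated fewer lanes than its slot provides.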
It's not just the PCI architecture that is involved here, but also your FSB, memory bandwidth, and internal bandwidth across all chipsets. Take note of Wazoox's comment: even fairly recent Xeon platforms performed badly at high line rates.
From reading your other comments I understand that you're doing packet generation in software and pushing this out via your GigE NICs. If you're not careful about how you generate the data, you could well be saturating your memory bandwidth. DDR2 will handle 10 Gb, but if you're doing multiple copies in memory while generating the packets, you're actually producing a lot more internal traffic.
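To see how quickly copies inflate the memory traffic, here's an illustrative sketch (the copy-cost model is an assumption: each intermediate copy is one read plus one write of the payload):

```python
# Rough model of memory bandwidth consumed when each generated packet
# is copied `copies` times before the NIC DMAs it out. Illustrative only.
def memory_traffic_gbps(line_rate_gbps, copies):
    # 1x for the final DMA read, plus a read and a write per copy.
    return line_rate_gbps * (1 + 2 * copies)

# 8 x 1 GbE at line rate, with two intermediate copies per packet:
print(memory_traffic_gbps(8, 2))  # → 40 (Gbit/s of internal memory traffic)
```

Two copies per packet turn an 8 Gbit/s workload into ~40 Gbit/s of memory traffic, which is exactly the kind of hidden multiplier that makes throughput sag as you enable more ports.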
Also, if all 8 cores are pegged, then you aren't keeping up on any of them. Whether it's interrupt loading or poor code paths in your packet generation process, something is getting in your way. I'd suggest resolving this issue first. Profile your code and find out if there's anything obvious taking most of your time.
If that doesn't help, and depending on your usage requirements, you could consider some real network processing/capture/transmit cards, such as Endace's DAG cards (I'd suggest the DAG 7.5 G2/G4 for PCIe). These aren't interrupt driven, so there is no added processing load due to interrupts. They aren't network cards as such, so you'll have to construct the entire packet and payload and handle layer 2 yourself, but that isn't very expensive.
Disclaimer: I work for Endace.