My goal is to configure our CentOS ("free" RHEL) 5.x servers for custom low-latency network programs. I would like to experiment with binding ethernet NIC interrupt handling to the same CPU on which the program runs (to hopefully improve cache utilization). The first step in this process is to determine the NIC's IRQ.
Here are the contents of /proc/interrupts on one server (I removed the columns for CPUs 2 through 14 for brevity):
           CPU0       CPU1      CPU15
  0:  600299726          0          0  IO-APIC-edge   timer
  1:          3          0          0  IO-APIC-edge   i8042
  8:          1          0          0  IO-APIC-edge   rtc
  9:          0          0          0  IO-APIC-level  acpi
 12:          4          0          0  IO-APIC-edge   i8042
 50:          0          0          0  IO-APIC-level  uhci_hcd:usb6, uhci_hcd:usb8
 58:       6644      25103          0  IO-APIC-level  ioc0
 66:          0          0          0  IO-APIC-level  ata_piix
 74:        221     533830          0  IO-APIC-level  ata_piix
 98:         35          0    2902361  PCI-MSI-X      eth1-0
106:         61         11       3841  PCI-MSI-X      eth1-1
114:         28          0      61452  PCI-MSI-X      eth1-2
122:         24       1586         22  PCI-MSI-X      eth1-3
130:       2912          0        337  PCI-MSI-X      eth1-4
138:         21          0         28  PCI-MSI-X      eth1-5
146:         21          0         56  PCI-MSI-X      eth1-6
154:         34          1          1  PCI-MSI-X      eth1-7
209:         23          0          0  IO-APIC-level  ehci_hcd:usb1
217:          0          0          0  IO-APIC-level  ehci_hcd:usb2, uhci_hcd:usb5, uhci_hcd:usb7
225:          0          0          0  IO-APIC-level  uhci_hcd:usb3
233:          0          0          0  IO-APIC-level  uhci_hcd:usb4
NMI:       7615       2989       2931
LOC:  600328144  600328099  600327086
ERR:          0
MIS:          0
Why are there multiple entries for "eth1" in the form of "eth1-X"?
Furthermore, the contents of "/sys/class/net/eth1/device/irq" is "90". But there's no 90 in the interrupt list above.
So let's say I look at just "eth1-0", which is IRQ 98. The contents of /proc/irq/98/smp_affinity is:
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00008000
That's a list of numbers, rather than just one number.
So how do I set eth1's smp_affinity?
None of the online examples and documentation I could find mentioned any cases like this; they always have exactly one "ethX" entry in /proc/interrupts; the indicated interrupt matches /sys/class/net/ethX/device/irq; and there is only one number in /proc/irq/N/smp_affinity.
FWIW, this application is extremely latency-sensitive, to the point where we disable C-states and processor frequency scaling because those features add too much latency. Microseconds make a difference here.
Edit: I stumbled across the following web page: http://www.kernel.org/doc/man-pages/online/pages/man7/cpuset.7.html. Although it is about cpuset, it has a section titled "Mask Format", which I assume describes the same format I am seeing in the /proc/irq//smp_affinity file. Quoting:
This format displays each 32-bit word in hexadecimal (using ASCII characters "0" - "9" and "a" - "f"); words are filled with leading zeros, if required. For masks longer than one word, a comma separator is used between words. Words are displayed in big-endian order, which has the most significant bit first. The hex digits within a word are also in big-endian order.
The number of 32-bit words displayed is the minimum number needed to display all bits of the bitmask, based on the size of the bitmask.
Examples of the Mask Format:
00000001 # just bit 0 set
40000000,00000000,00000000 # just bit 94 set
00000001,00000000,00000000 # just bit 64 set
000000ff,00000000 # bits 32-39 set
00000000,000E3862 # 1,5,6,11-13,17-19 set
A mask with bits 0, 1, 2, 4, 8, 16, 32, and 64 set displays as:
00000001,00000001,00010117
The first "1" is for bit 64, the second for bit 32, the third for bit 16, the fourth for bit 8, the fifth for bit 4, and the "7" is for bits 2, 1, and 0.
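To check my reading of that format, here is a rough Python sketch that converts between a mask string and a list of CPU numbers. It assumes that /proc/irq/N/smp_affinity really uses the cpuset mask format quoted above; the function names mask_to_cpus and cpus_to_mask are made up for illustration.

```python
def mask_to_cpus(mask):
    """Return the sorted CPU numbers whose bits are set in a mask string."""
    # Commas only separate 32-bit words, most significant word first,
    # so dropping them leaves one big hexadecimal number.
    value = int(mask.strip().replace(",", ""), 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def cpus_to_mask(cpus):
    """Build a mask string (comma-separated 32-bit hex words) from CPU numbers."""
    value = 0
    for cpu in cpus:
        value |= 1 << cpu
    words = []
    while True:
        words.append("%08x" % (value & 0xFFFFFFFF))
        value >>= 32
        if value == 0:
            break
    # Words were collected least significant first; the format wants the reverse.
    return ",".join(reversed(words))


if __name__ == "__main__":
    print(mask_to_cpus("00000000,00000000,00000000,00000000,"
                       "00000000,00000000,00000000,00008000"))
    print(cpus_to_mask([0, 1, 2, 4, 8, 16, 32, 64]))
```

If the assumption holds, the mask shown for IRQ 98 decodes to CPU 15 alone, and the man page's example with bits 0, 1, 2, 4, 8, 16, 32, and 64 comes back out as 00000001,00000001,00010117.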
There are multiple "eth1-X" entries because the NIC has multiple tx/rx queues, each with its own interrupt. Packets are often assigned to a queue by a hash of (local addr, local port, remote addr, remote port) and some other fields. Suppressing the multiple queues might make it easier to make your application more deterministic, assuming you have few traffic sources. Or you could look up the hash algorithm and avoid ephemeral ports, if that's easier.
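To see which IRQ belongs to which queue, something like the following Python sketch could pull the per-queue entries out of /proc/interrupts. The function name queue_irqs is hypothetical, and the sketch assumes the driver names its queue interrupts "<interface>-<n>", as in the listing from the question; other drivers may use a different naming scheme.

```python
def queue_irqs(interrupts_text, ifname):
    """Map per-queue names such as 'eth1-0' to their IRQ numbers."""
    result = {}
    for line in interrupts_text.splitlines():
        fields = line.split()
        # Skip the CPU header and non-numeric rows such as NMI: or LOC:.
        if not fields or not fields[0].rstrip(":").isdigit():
            continue
        irq = int(fields[0].rstrip(":"))
        for field in fields[1:]:
            name = field.rstrip(",")
            if name == ifname or name.startswith(ifname + "-"):
                result[name] = irq
    return result


# A few rows copied from the question's listing.
SAMPLE = """\
           CPU0       CPU1      CPU15
 74:        221     533830          0  IO-APIC-level  ata_piix
 98:         35          0    2902361  PCI-MSI-X      eth1-0
106:         61         11       3841  PCI-MSI-X      eth1-1
LOC:  600328144  600328099  600327086
"""

if __name__ == "__main__":
    print(queue_irqs(SAMPLE, "eth1"))
```

On a live machine you would pass in the text of /proc/interrupts, for example open("/proc/interrupts").read(), instead of SAMPLE.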
Are you using a realtime kernel? Are you leveraging cgroups or cpusets to isolate your application? If you're on a stock distribution kernel, you're leaving a good amount of latency gains on the table. Also, I see 16 CPU cores. That would indicate that HyperThreading is enabled. How do you know if you're binding to a real versus logical core?
Check if you have a directory /sys/class/net/eth1/device/msi_irqs/. If so, ignore the content of /sys/class/net/eth1/device/irq. This network device has multiple rx/tx queues and therefore multiple IRQs. These IRQs correspond to the names of the files in the /sys/class/net/eth1/device/msi_irqs/ directory.
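Putting that together, a minimal Python sketch for pinning every queue IRQ of an interface to one CPU might look like this. The helper names (msi_irqs, cpu_mask, pin_irq) and the sys_root/proc_root parameters are made up for illustration, and the sketch assumes that smp_affinity accepts a hex mask written as comma-separated 32-bit words. Treat it as a starting point under those assumptions, not a tested tuning script.

```python
import os


def msi_irqs(ifname, sys_root="/sys"):
    """Return the IRQ numbers listed under the device's msi_irqs directory."""
    path = os.path.join(sys_root, "class", "net", ifname, "device", "msi_irqs")
    return sorted(int(name) for name in os.listdir(path) if name.isdigit())


def cpu_mask(cpu):
    """Hex mask with only the given CPU's bit set, as comma-separated 32-bit words."""
    digits = "%x" % (1 << cpu)
    # Pad to a multiple of 8 hex digits, then split into 32-bit words.
    digits = digits.zfill((len(digits) + 7) // 8 * 8)
    return ",".join(digits[i:i + 8] for i in range(0, len(digits), 8))


def pin_irq(irq, cpu, proc_root="/proc"):
    """Write a single-CPU mask to the IRQ's smp_affinity file (needs root)."""
    path = os.path.join(proc_root, "irq", str(irq), "smp_affinity")
    with open(path, "w") as f:
        f.write(cpu_mask(cpu) + "\n")


# Example (run as root on the target machine; CPU 15 is only an illustration):
#     for irq in msi_irqs("eth1"):
#         pin_irq(irq, 15)
```

Keep in mind that a running irqbalance daemon may rewrite these masks, so you would probably want to stop it or exclude these IRQs before relying on the setting.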