I am trying to move all interrupts over to cores 0-3 to keep the rest of my cores free for high-speed, low-latency virtualization.
I wrote a quick script to set IRQ affinity to 0-3:
#!/bin/bash
# Set the affinity of every IRQ on the system to CPUs 0-3.
while IFS= read -r LINE; do
    echo "0-3 -> \"$LINE\""
    sudo bash -c "echo 0-3 > \"$LINE\""
done <<< "$(find /proc/irq/ -name smp_affinity_list)"
This appears to work for USB devices and network devices, but not NVMe devices. They all produce this error:
bash: line 1: echo: write error: Input/output error
And they stubbornly continue to produce interrupts evenly across almost all my cores.
If I check the current affinities of those devices:
$ cat /proc/irq/81/smp_affinity_list
0-1,16-17
$ cat /proc/irq/82/smp_affinity_list
2-3,18-19
$ cat /proc/irq/83/smp_affinity_list
4-5,20-21
$ cat /proc/irq/84/smp_affinity_list
6-7,22-23
...
It appears "something" is taking full control of spreading IRQs across cores and not letting me change it.
It is absolutely critical that I move these to other cores, as I'm doing heavy I/O in virtual machines on these cores and the NVMe drives are producing an enormous number of interrupts. This isn't Windows; I'm supposed to be able to decide what my machine does.
What is controlling IRQ affinity for these devices and how do I override it?
I am using a Ryzen 3950X CPU on a Gigabyte Aorus X570 Master motherboard with 3 NVMe drives connected to the M.2 ports on the motherboard.
(Update: I am now using a 5950X, still having the exact same issue)
Kernel: 5.12.2-arch1-1
Output of lspci -v related to NVMe:
01:00.0 Non-Volatile memory controller: Phison Electronics Corporation E12 NVMe Controller (rev 01) (prog-if 02 [NVM Express])
Subsystem: Phison Electronics Corporation E12 NVMe Controller
Flags: bus master, fast devsel, latency 0, IRQ 45, NUMA node 0, IOMMU group 14
Memory at fc100000 (64-bit, non-prefetchable) [size=16K]
Capabilities: [80] Express Endpoint, MSI 00
Capabilities: [d0] MSI-X: Enable+ Count=9 Masked-
Capabilities: [e0] MSI: Enable- Count=1/8 Maskable- 64bit+
Capabilities: [f8] Power Management version 3
Capabilities: [100] Latency Tolerance Reporting
Capabilities: [110] L1 PM Substates
Capabilities: [128] Alternative Routing-ID Interpretation (ARI)
Capabilities: [200] Advanced Error Reporting
Capabilities: [300] Secondary PCI Express
Kernel driver in use: nvme
04:00.0 Non-Volatile memory controller: Phison Electronics Corporation E12 NVMe Controller (rev 01) (prog-if 02 [NVM Express])
Subsystem: Phison Electronics Corporation E12 NVMe Controller
Flags: bus master, fast devsel, latency 0, IRQ 24, NUMA node 0, IOMMU group 25
Memory at fbd00000 (64-bit, non-prefetchable) [size=16K]
Capabilities: [80] Express Endpoint, MSI 00
Capabilities: [d0] MSI-X: Enable+ Count=9 Masked-
Capabilities: [e0] MSI: Enable- Count=1/8 Maskable- 64bit+
Capabilities: [f8] Power Management version 3
Capabilities: [100] Latency Tolerance Reporting
Capabilities: [110] L1 PM Substates
Capabilities: [128] Alternative Routing-ID Interpretation (ARI)
Capabilities: [200] Advanced Error Reporting
Capabilities: [300] Secondary PCI Express
Kernel driver in use: nvme
05:00.0 Non-Volatile memory controller: Phison Electronics Corporation E12 NVMe Controller (rev 01) (prog-if 02 [NVM Express])
Subsystem: Phison Electronics Corporation E12 NVMe Controller
Flags: bus master, fast devsel, latency 0, IRQ 40, NUMA node 0, IOMMU group 26
Memory at fbc00000 (64-bit, non-prefetchable) [size=16K]
Capabilities: [80] Express Endpoint, MSI 00
Capabilities: [d0] MSI-X: Enable+ Count=9 Masked-
Capabilities: [e0] MSI: Enable- Count=1/8 Maskable- 64bit+
Capabilities: [f8] Power Management version 3
Capabilities: [100] Latency Tolerance Reporting
Capabilities: [110] L1 PM Substates
Capabilities: [128] Alternative Routing-ID Interpretation (ARI)
Capabilities: [200] Advanced Error Reporting
Capabilities: [300] Secondary PCI Express
Kernel driver in use: nvme
$ dmesg | grep -i nvme
[ 2.042888] nvme nvme0: pci function 0000:01:00.0
[ 2.042912] nvme nvme1: pci function 0000:04:00.0
[ 2.042941] nvme nvme2: pci function 0000:05:00.0
[ 2.048103] nvme nvme0: missing or invalid SUBNQN field.
[ 2.048109] nvme nvme2: missing or invalid SUBNQN field.
[ 2.048109] nvme nvme1: missing or invalid SUBNQN field.
[ 2.048112] nvme nvme0: Shutdown timeout set to 10 seconds
[ 2.048120] nvme nvme1: Shutdown timeout set to 10 seconds
[ 2.048127] nvme nvme2: Shutdown timeout set to 10 seconds
[ 2.049578] nvme nvme0: 8/0/0 default/read/poll queues
[ 2.049668] nvme nvme1: 8/0/0 default/read/poll queues
[ 2.049716] nvme nvme2: 8/0/0 default/read/poll queues
[ 2.051211] nvme1n1: p1
[ 2.051260] nvme2n1: p1
[ 2.051577] nvme0n1: p1 p2
The simplest solution to this problem is probably just to switch the NVMe driver from interrupt (IRQ) mode to polling mode.
Add the polling option to /etc/modprobe.d/nvme.conf; a minimal sketch follows.
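For example, a sketch assuming the nvme module's poll_queues parameter (available on 5.x kernels; the queue count here is illustrative, tune it to your workload):

# /etc/modprobe.d/nvme.conf
# allocate 4 polled I/O queues in addition to the default interrupt-driven queues
options nvme poll_queues=4

Note that polled queues only serve I/O submitted in polling mode (for example io_uring with IORING_SETUP_IOPOLL), so the workload has to actually issue polled I/O to benefit.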
Then run update-initramfs -u (or your distro's equivalent, e.g. mkinitcpio -P on Arch), reboot, and you should see a vast reduction in IRQs for the NVMe devices. You can also play around with the poll queue count in sysfs and other NVMe driver tweakables (modinfo nvme should give you a list of parameters you can adjust). That said, this is all highly dependent on what kernel version you're running.
Since v4.8 the Linux kernel automatically uses MSI/MSI-X interrupts in the NVMe driver, and with IRQD_AFFINITY_MANAGED it manages the affinity of those interrupts inside the kernel.
See these commits:
90c9712fbb388077b5e53069cae43f1acbb0102a - NVMe: Always use MSI/MSI-X interrupts
9c2555835bb3d34dfac52a0be943dcc4bedd650f - genirq: Introduce IRQD_AFFINITY_MANAGED flag
Judging from your kernel version and your devices' capabilities in the lspci -v output, that is apparently what is happening here. Short of disabling the flag and recompiling the kernel, you could instead disable MSI/MSI-X on the PCI bridge above the devices (rather than on the devices themselves):
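A sketch of that (the bridge address below is hypothetical; use lspci -t to find the bridge your NVMe devices actually sit behind):

# find the bridge the NVMe device is attached to
$ lspci -t
# writing 0 to the bridge's msi_bus attribute disallows MSI/MSI-X for devices behind it
$ echo 0 | sudo tee /sys/bus/pci/devices/0000:00:01.1/msi_bus

The setting only applies to drivers bound after the write, so rebind the nvme driver (or reboot) for it to take effect.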
Note that there will be a performance impact from disabling MSI/MSI-X. See this kernel documentation for more details.
Instead of disabling MSI/MSI-X, a better approach is to keep MSI-X but also enable polling mode in the NVMe driver; see Andrew H's answer.
That is intentional.
NVMe devices are supposed to have multiple command queues with associated interrupts, so interrupts can be delivered to the CPU that requested the operation.
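You can see this queue-to-CPU mapping through blk-mq's sysfs entries (nvme0n1 here stands for one of your namespaces):

# each hardware queue lists the CPUs it serves
$ grep . /sys/block/nvme0n1/mq/*/cpu_list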
For an emulated virtual disk, this is the CPU running the I/O thread, which then decides if the VM CPU needs to be interrupted to deliver the emulated interrupt.
For a PCIe passthrough disk, this is the VM's CPU: it exits the VM and enters the host interrupt handler, which notices that the interrupt is destined for the virtual CPU and queues it so that it is delivered on the next VM entry after the handler returns, so we still get only one interruption of the VM context.
This is pretty much as optimal as it gets. You can pessimize this by delivering the IRQ to another CPU that will then notice that the VM needs to be interrupted, and queue an inter-processor interrupt to direct it where it needs to go.
For I/O that does not belong to a VM, the interrupt should go to a CPU that is not associated with a VM.
For this to work properly, the CPU mapping for the VMs needs to be somewhat static.
There is also the CPU isolation framework you could take a look at, but that is probably too heavy-handed.
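As a sketch of what "somewhat static" could look like (core numbers are illustrative and the libvirt usage is an assumption, not taken from the question): pin the vCPUs to fixed host cores, e.g. with libvirt's <vcpupin>, and optionally tell the kernel to keep managed IRQs off those cores where it can:

# kernel command line (kernel 5.6+): isolate the VM cores from general scheduling and,
# on a best-effort basis, from managed interrupts; adjust the CPU list to your layout
isolcpus=managed_irq,domain,4-15,20-31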