MSI and MSI-X: PCIe Message-Signaled Interrupts
Replacing legacy INTx with per-queue interrupt vectors for NVMe, NIC, and GPU
Why MSI-X?
Legacy PCI interrupts (INTx, INTA-INTD) have fundamental limitations:
- Shared: multiple devices share one IRQ line → spurious interrupts, no isolation
- Level-triggered: the interrupt line is held asserted until acknowledged → cannot be missed (the line stays asserted until cleared); edge-triggered interrupts can be lost if the transition occurs while masked
- Limited: only 4 lines per PCI bus
MSI (Message-Signaled Interrupts) support was added to Linux in 2.6.8 (2004) by Tom Long Nguyen (Intel). MSI solves these problems by using memory writes instead of dedicated wires:
Legacy INTx:
Device → asserts pin → CPU interrupt controller → CPU
MSI/MSI-X:
Device → writes message to memory address (MSI address reg) → CPU
This write IS the interrupt — the memory write triggers the CPU
MSI-X extends MSI to 2048 vectors per device (vs 32 for MSI), and each vector can have independent affinity.
MSI capability structure
PCI config space:
[PCI header]
[Capabilities list]
└── MSI capability (cap ID = 0x05):
offset size field
00 1 Capability ID (0x05)
01 1 Next Pointer
02-03 2 Message Control
[0]: MSI Enable
[3:1]: Multiple Message Capable (log2 of requested vectors)
[6:4]: Multiple Message Enable
[7]: 64-bit capable
04-07 4 Message Address (low 32 bits, 4-byte aligned)
08-0B 4 Message Address High (if 64-bit)
0C-0D 2 Message Data
OS programs: interrupt vector (the destination APIC ID lives in the address)
/* Reading MSI capability from kernel: */
u16 ctrl;
pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &ctrl);
int nvec = 1 << ((ctrl & PCI_MSI_FLAGS_QMASK) >> 1); /* max vectors */
MSI-X table
MSI-X uses a table in one of the device's BARs:
MSI-X capability (cap ID = 0x11):
Message Control:
[10:0]: Table Size - 1 (up to 2048 entries)
[13]: Function Mask (mask all vectors)
[15]: MSI-X Enable
Table Offset/BIR:
[2:0]: BAR Index (which BAR contains the table)
[31:3]: Table Offset (byte offset into that BAR)
Pending Bit Array (PBA) Offset/BIR:
[2:0]: BAR Index
[31:3]: PBA Offset
MSI-X Table entry (16 bytes each):
[31:0]: Message Address Low
[63:32]: Message Address High
[95:64]: Message Data (interrupt vector on x86)
[127:96]: Vector Control
[0]: Masked (1 = masked)
Linux kernel: requesting MSI-X vectors
/* drivers/net/ethernet/intel/igb/igb_main.c (simplified) */
static int igb_set_interrupt_capability(struct igb_adapter *adapter)
{
        int numvecs, err;

        /* Try to get one vector per TX queue + one per RX queue + misc */
        numvecs = adapter->num_tx_queues + adapter->num_rx_queues + 1;

        /* pci_alloc_irq_vectors: tries MSI-X first, falls back to MSI, then INTx */
        err = pci_alloc_irq_vectors(adapter->pdev,
                                    1,        /* min vectors */
                                    numvecs,  /* max vectors */
                                    PCI_IRQ_MSIX | PCI_IRQ_MSI);
        if (err < 0) {
                /* Fall back to a single interrupt of any type */
                pci_alloc_irq_vectors(adapter->pdev, 1, 1, PCI_IRQ_ALL_TYPES);
                return 0;
        }

        adapter->num_q_vectors = err;  /* actual vectors granted */
        return 0;
}
/* Register a handler for each vector: */
for (i = 0; i < adapter->num_q_vectors; i++) {
        /* pci_irq_vector: get the Linux IRQ number for vector i */
        int irq = pci_irq_vector(adapter->pdev, i);

        snprintf(name, sizeof(name), "%s-q%d", netdev->name, i);
        err = request_irq(irq, igb_msix_ring, 0, name,
                          adapter->q_vector[i]);
}
/* Free at cleanup: */
pci_free_irq_vectors(adapter->pdev);
Per-vector IRQ affinity
With MSI-X, each vector can be pinned to a different CPU:
# Show each IRQ's CPU affinity (the IRQ number appears in the path):
grep . /proc/irq/*/smp_affinity_list 2>/dev/null
# Most IRQs may run on many CPUs; MSI-X queue vectors are typically pinned 1:1
# For MSI-X NVMe with 8 queues:
# irq 32 → cpu 0 (queue 0)
# irq 33 → cpu 1 (queue 1)
# ...
# irq 39 → cpu 7 (queue 7)
# Set affinity for vector 0 to CPU 0:
echo 0 > /proc/irq/32/smp_affinity_list # CPU 0 only (smp_affinity_list uses CPU numbers, not bitmasks)
# Or using irqbalance for automatic balancing:
systemctl start irqbalance
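The two affinity files use different encodings: smp_affinity takes a hex CPU bitmask, while smp_affinity_list takes plain CPU numbers. A quick sketch of the mask arithmetic:

```shell
# Mask for CPU 3 alone:
printf '%x\n' $((1 << 3))           # 8
# Mask for CPUs 0-3:
printf '%x\n' $(( (1 << 4) - 1 ))   # f
# So these two writes are equivalent:
#   echo 8 > /proc/irq/33/smp_affinity
#   echo 3 > /proc/irq/33/smp_affinity_list
```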
/* Kernel: set affinity hint (driver calls this): */
irq_set_affinity_hint(irq, cpumask);
/* irqbalance respects hints when distributing IRQs */
/* Or force affinity: */
irq_set_affinity(irq, cpumask);
Interrupt coalescing
To reduce interrupt rate (and CPU overhead), drivers aggregate multiple completions:
# Check current coalescing settings:
ethtool -c eth0
# Coalesce parameters for eth0:
# Adaptive RX: off TX: off
# rx-usecs: 50 ← delay the interrupt up to 50µs after a packet arrives
# rx-frames: 0 ← or fire after this many packets (0 = disabled)
# tx-usecs: 50
# tx-frames: 0
# Reduce interrupt rate (more coalescing = higher latency, less CPU):
ethtool -C eth0 rx-usecs=100 tx-usecs=100
# Adaptive coalescing (auto-tune based on load):
ethtool -C eth0 adaptive-rx on adaptive-tx on
Kernel coalescing support
/* NIC driver implements: */
static int igb_set_coalesce(struct net_device *netdev,
                            struct ethtool_coalesce *ec,
                            struct kernel_ethtool_coalesce *kernel_coal,
                            struct netlink_ext_ack *extack)
{
        struct igb_adapter *adapter = netdev_priv(netdev);

        /* rx-usecs → ITR (Interrupt Throttle Register) */
        adapter->rx_itr_setting = ec->rx_coalesce_usecs;

        /* Program hardware register (ITR counts 256 ns units): */
        wr32(E1000_ITR, adapter->rx_itr_setting * 1000 / 256);

        /* Hardware: at most one interrupt per ITR interval */
        return 0;
}
Checking MSI-X status
# Verify MSI-X is active:
lspci -vvv -s 03:00.0 | grep -A 10 "MSI-X"
# Capabilities: [b0] MSI-X: Enable+ Count=33 Masked-
# Vector table: BAR=4 offset=00002000
# PBA: BAR=4 offset=00003000
# Number of MSI-X vectors in use:
cat /proc/interrupts | grep nvme
# 32: 0 0 PCI-MSI 2621440-edge nvme0q0
# 33: 12345 67 PCI-MSI 2621441-edge nvme0q1
# ...
# Affinity details for one vector:
cat /proc/irq/32/affinity_hint # CPU affinity hint
cat /proc/irq/32/effective_affinity_list # actual CPU
# IRQ numbers allocated to the device:
ls /sys/bus/pci/devices/0000:03:00.0/msi_irqs/
# 32 33 34 ... (one per MSI-X vector)
MSI vs MSI-X comparison
|  | INTx | MSI | MSI-X |
|---|---|---|---|
| Max vectors | 4 | 32 | 2048 |
| Sharing | Yes (shared wire) | No | No |
| Per-vector affinity | No | No | Yes |
| Masking | INTx disable | All vectors | Per-vector |
| Table location | PCI pins | Config space | BAR |
| Usage | Legacy | Simple devices | NVMe, 10GbE, GPU |
Further reading
- PCI Drivers — PCIe driver basics including MSI setup
- IRQ Affinity — configuring IRQ CPU affinity
- NVMe Driver — uses one MSI-X vector per queue
- IRQ Descriptor — struct irq_desc and irq_chip
- drivers/pci/msi/ — MSI/MSI-X implementation
- Documentation/PCI/msi-howto.rst — MSI usage guide for drivers